
Repeated channel/timeout errors from Jenkins slave

    • Type: Bug
    • Priority: Blocker
    • Resolution: Cannot Reproduce
    • Environment: jenkins-1.509.4 with remoting-2.36, ssh-slaves-1.2

      The issue appears on my custom build of the Jenkins core, but it seems it can be reproduced on the newest versions as well.

      We experienced a network overload, which led to an exception in the PingThread on the Jenkins master, which in turn closed the communication channel. However, the slave stays online and keeps taking jobs, while any remote action fails (see logs above) => all scheduled builds fail with an error.

      The issue affects ssh-slaves only:

      • Linux SSH slaves are "online", but all jobs on them fail with the error above
      • Windows services have reconnected automatically...
      • Windows JNLP slaves have reconnected as well
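      A possible stop-gap while a node is stuck in this half-connected state is to force a clean reconnect from the master. The sketch below is only an illustration: it assumes the Jenkins CLI jar and valid credentials are available, and the node name "linux-slave-01" is hypothetical.

      # Force a reconnect of a node whose channel died but which still shows "online".
      java -jar jenkins-cli.jar -s http://jenkins.example.com/ disconnect-node linux-slave-01 -m "dead remoting channel"
      java -jar jenkins-cli.jar -s http://jenkins.example.com/ connect-node linux-slave-01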

          [JENKINS-14332] Repeated channel/timeout errors from Jenkins slave

          Poul Henriksen added a comment -

          I had the same issue.

          Downgrading the Jenkins master kernel from 3.14.19-17.43.amzn1.x86_64 to 3.4.62-53.42.amzn1.x86_64 solved it.

          Bert Jan Schrijver added a comment - edited

          Same thing for us: Amazon Linux master with EC2 slaves plugin and Amazon Linux slaves. Builds were randomly hanging and slaves were timing out.
          We downgraded the kernel on the master this morning from 3.14.23-22.44.amzn1.x86_64 to 3.4.73-64.112.amzn1.x86_64 and haven't seen any issues since.
          Slaves are still running 3.14 kernel (3.14.27-25.47.amzn1.x86_64).
          I'll report back later.


          Jonathan Langevin added a comment -

          I'm on Amazon EC2 w/ Ubuntu 14.04.1 LTS.

          I've downgraded the kernel (manually) to 3.4.

          wget http://kernel.ubuntu.com/~kernel-ppa/mainline/v3.4.105-quantal/linux-headers-3.4.105-0304105-generic_3.4.105-0304105.201412012335_amd64.deb
          wget http://kernel.ubuntu.com/~kernel-ppa/mainline/v3.4.105-quantal/linux-headers-3.4.105-0304105_3.4.105-0304105.201412012335_all.deb
          wget http://kernel.ubuntu.com/~kernel-ppa/mainline/v3.4.105-quantal/linux-image-3.4.105-0304105-generic_3.4.105-0304105.201412012335_amd64.deb
          dpkg -i linux-*
          

          Once I had the 3.4 kernel installed, I had to make it the default kernel on boot, so I followed the instructions here: http://statusq.org/archives/2012/10/24/4584/
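          For reference, making the older kernel the default on Ubuntu 14.04 boils down to pointing GRUB_DEFAULT at the right menu entry and regenerating the grub config. This is a minimal sketch; the entry name below is an example and should be checked against /boot/grub/grub.cfg on your own machine.

          # List the available menu entries (names vary per system).
          grep menuentry /boot/grub/grub.cfg
          # In /etc/default/grub, point GRUB_DEFAULT at the 3.4 entry, e.g.:
          #   GRUB_DEFAULT="Advanced options for Ubuntu>Ubuntu, with Linux 3.4.105-0304105-generic"
          sudo update-grub
          sudo reboot
          uname -r   # should now report 3.4.105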

          I'm trying out some builds now to see how Jenkins behaves...


          Bert Jan Schrijver added a comment -

          Downgrading the kernel on the master has definitely fixed it for us.
          Running for a week now without any trouble.


          Sean Abbott added a comment -

          I have the same issue. The kernel is 3.14. Part of the problem is that when the slave.jar process fails, the agent is NOT shown as offline on the master, so the master keeps trying to send it jobs:

          Expanded the channel window size to 4MB
          [05/11/15 18:02:23] [SSH] Starting slave process: cd "/var/lib/jenkins" && java -Dfile.encoding=UTF8 -jar slave.jar

          <===[JENKINS REMOTING CAPACITY]===>ERROR: Unexpected error in launching a slave. This is probably a bug in Jenkins.
          java.lang.IllegalStateException: Already connected
          at hudson.slaves.SlaveComputer.setChannel(SlaveComputer.java:448)
          at hudson.slaves.SlaveComputer.setChannel(SlaveComputer.java:366)
          at hudson.plugins.sshslaves.SSHLauncher.startSlave(SSHLauncher.java:945)
          at hudson.plugins.sshslaves.SSHLauncher.access$400(SSHLauncher.java:133)
          at hudson.plugins.sshslaves.SSHLauncher$2.call(SSHLauncher.java:711)
          at hudson.plugins.sshslaves.SSHLauncher$2.call(SSHLauncher.java:696)
          at java.util.concurrent.FutureTask.run(FutureTask.java:262)
          at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
          at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
          at java.lang.Thread.run(Thread.java:745)
          [05/11/15 18:02:24] Launch failed - cleaning up connection
          [05/11/15 18:02:24] [SSH] Connection closed.

          Even though the log reports the connection closed, Jenkins still reports the node as up.

          My Jenkins master is on 1.596.2.
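          Since the master keeps treating the node as up, one way to stop it from scheduling further builds there is to mark the node temporarily offline. This is a sketch only, assuming the Jenkins CLI is available; the node name "linux-slave-01" is hypothetical.

          # Keep the half-dead node from receiving builds while investigating.
          java -jar jenkins-cli.jar -s http://jenkins.example.com/ offline-node linux-slave-01 -m "dead remoting channel"
          # ...once the channel is fixed or the node is relaunched:
          java -jar jenkins-cli.jar -s http://jenkins.example.com/ online-node linux-slave-01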


          Sean Abbott added a comment -

          I was able to connect to the same slave from another Jenkins master using the same kernel and Jenkins version with no issues...


          Guillaume Boucherie added a comment -

          Hi,

          I just ran a test on the latest AWS Linux machine (kernel 3.14.35), using the same machine as both master and slave, and the problem is gone...
          The Jenkins version used is the latest stable: 1.609.1.

          Regards


          Jesse Glick added a comment -

          Related to JENKINS-1948 perhaps?


          Srikanth Vadlamani added a comment -

          In AWS, we are using Ubuntu 14.04.4 LTS with EC2 plugin version 1.36. We are also seeing similar errors where the agent disconnects from Jenkins randomly with the error below.

          ERROR: SEVERE ERROR occurs
          org.jenkinsci.lib.envinject.EnvInjectException: hudson.remoting.ChannelClosedException: channel is already closed
          at org.jenkinsci.plugins.envinject.service.EnvironmentVariablesNodeLoader.gatherEnvironmentVariablesNode(EnvironmentVariablesNodeLoader.java:79)
          at org.jenkinsci.plugins.envinject.EnvInjectListener.loadEnvironmentVariablesNode(EnvInjectListener.java:80)
          at org.jenkinsci.plugins.envinject.EnvInjectListener.setUpEnvironment(EnvInjectListener.java:42)
          at hudson.model.AbstractBuild$AbstractBuildExecution.createLauncher(AbstractBuild.java:572)
          at hudson.model.AbstractBuild$AbstractBuildExecution.run(AbstractBuild.java:492)
          at hudson.model.Run.execute(Run.java:1741)
          at hudson.model.FreeStyleBuild.run(FreeStyleBuild.java:43)
          at hudson.model.ResourceController.execute(ResourceController.java:98)
          at hudson.model.Executor.run(Executor.java:410)
          Caused by: hudson.remoting.ChannelClosedException: channel is already closed
          at hudson.remoting.Channel.send(Channel.java:578)
          at hudson.remoting.Request.call(Request.java:130)
          at hudson.remoting.Channel.call(Channel.java:780)
          at hudson.FilePath.act(FilePath.java:1102)
          at org.jenkinsci.plugins.envinject.service.EnvironmentVariablesNodeLoader.gatherEnvironmentVariablesNode(EnvironmentVariablesNodeLoader.java:48)
          ... 8 more
          Caused by: java.io.IOException
          at hudson.remoting.Channel.close(Channel.java:1163)
          at hudson.slaves.ChannelPinger$1.onDead(ChannelPinger.java:121)
          at hudson.remoting.PingThread.ping(PingThread.java:130)
          at hudson.remoting.PingThread.run(PingThread.java:86)
          Caused by: java.util.concurrent.TimeoutException: Ping started at 1493347954228 hasn't completed by 1493348194229
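          The two timestamps in that TimeoutException are 240 seconds apart, which matches the default ping timeout used by the master's ChannelPinger. On installs where short network stalls are expected, the ping interval and timeout can be raised via system properties on the master. A minimal sketch, assuming a Debian-style install where JVM options live in /etc/default/jenkins (property names may differ on older Jenkins versions):

          # /etc/default/jenkins -- defaults are 300s interval / 240s timeout
          JAVA_ARGS="$JAVA_ARGS -Dhudson.slaves.ChannelPinger.pingIntervalSeconds=300"
          JAVA_ARGS="$JAVA_ARGS -Dhudson.slaves.ChannelPinger.pingTimeoutSeconds=300"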


          Ivan Fernandez Calvo added a comment -

          Because there is no recent info here and it seems similar to JENKINS-53810, I will close it.

            Assignee: Ivan Fernandez Calvo (ifernandezcalvo)
            Reporter: Olivier Lamy (olamy)
            Votes: 33
            Watchers: 51
