JENKINS-26947

Maven job stuck when slave channel gets disconnected

      I found a way to trigger a remoting problem using TCP fault injection with netem. I'm able to trigger this wait call at hudson.remoting.Request.call(Request.java:146):

      while(response==null && !channel.isInClosed())
        // I don't know exactly when this can happen, as pendingCalls are cleaned up by Channel,
        // but in production I've observed that in rare occasion it can block forever, even after a channel
        // is gone. So be defensive against that.
        wait(30*1000);
      

      When this wait is triggered, the running build is stuck and consumes an executor, looping over and over on the wait.
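
      For illustration, here is a minimal, self-contained sketch (hypothetical class and field names, not the actual remoting sources) of why this wait pattern can spin forever: the loop only exits when a response arrives or the channel is flagged as closed, so if the response was lost and the closed flag is never set on this particular channel, the thread re-waits every 30 seconds indefinitely.

      // Sketch only: StuckWaitDemo, response and inClosed are hypothetical names.
      public class StuckWaitDemo {
          private Object response;           // never set once the reply is lost
          private volatile boolean inClosed; // never set if the channel isn't closed

          public synchronized Object call() throws InterruptedException {
              // Mirrors the loop at Request.call(Request.java:146): wake up every
              // 30s, re-check both conditions, and wait again. With no response
              // and no close notification, this never terminates.
              while (response == null && !inClosed) {
                  wait(30 * 1000);
              }
              return response;
          }
      }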

      To reproduce, set up an SSH slave using the attached Dockerfile, and set up netem on the docker0 bridge like this:

      tc qdisc add dev docker0 root netem
      tc qdisc change dev docker0 root netem corrupt 1
      

      Run the job once before configuring netem: netem settings apply to all network streams, so the build could otherwise fail while downloading Maven dependencies. I just launched a Maven build of an example project to trigger the problem. It might be a Maven-specific problem...

      To remove netem settings, just run tc qdisc del dev docker0 root.

      I've attached the Dockerfile, the command I used to launch it, and a thread dump of a stuck Jenkins master.

        1. Dockerfile
          0.4 kB
        2. launch.sh
          0.0 kB
        3. stacktrace.txt
          44 kB

          [JENKINS-26947] Maven job stuck when slave channel gets disconnected

          Yoann Dubreuil created issue -

          Daniel Beck added a comment -

          Is this a security issue? E.g. is this exploitable by third parties to disrupt a network-reachable Jenkins service?


          Yoann Dubreuil added a comment -

          You must be on the path of the network stream to be able to change the packet content. Even if you are able to get there, the SSH protocol protects the content of the stream. You would only be able to trigger a disconnection, but at that stage, I bet a lot of other network services are in danger as well.

          So for me, it's not a security issue.

          Yoann Dubreuil made changes - reformatted the description (wrapped the code snippets in code blocks)

          James Nord added a comment -

          Have you disabled the PingThread at all?

          AFAICT netem does not kill the connection - the remote end will keep retransmitting the packets - and as such the channel is not closed.
          The PingThread should eventually notice this (10-minute interval + 4-minute timeout), so after at most 14 minutes the connection should be killed and this thread should unblock.
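
          A hedged sketch (not the actual hudson.remoting.PingThread source) of the watchdog behavior described above: ping the remote side on a fixed interval and, if no reply arrives within the timeout, declare the channel dead so blocked callers can unblock.

          // Sketch only: interval/timeout values taken from the comment above.
          abstract class PingWatchdog extends Thread {
              private static final long INTERVAL = 10 * 60 * 1000; // 10 minutes between pings
              private static final long TIMEOUT  =  4 * 60 * 1000; // 4 minutes to answer

              @Override
              public void run() {
                  try {
                      while (true) {
                          Thread.sleep(INTERVAL);
                          long deadline = System.currentTimeMillis() + TIMEOUT;
                          if (!pingSucceededBy(deadline)) {
                              onDead(); // expected to close the channel, waking up waiters
                          }
                      }
                  } catch (InterruptedException e) {
                      // watchdog shut down
                  }
              }

              /** Send a no-op over the channel; true iff it returned by the deadline. */
              protected abstract boolean pingSucceededBy(long deadline);

              /** Remote side missed the deadline; kill the connection. */
              protected abstract void onDead();
          }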


          Yoann Dubreuil added a comment -

          No, I did not disable the ping thread. In fact, I did nothing special, just started a fresh Jenkins instance and connected it to this docker slave. I took the thread dump 30 minutes after the disconnection. I will relaunch the test this afternoon to see whether it ever times out or not.


          Yoann Dubreuil added a comment -

          The executor thread is still blocked 2 hours after the slave was disconnected.

          Stack of the blocked thread is:

          "Executor #0 for dumb-docker-ssh : executing gameoflife #41 / waiting for hudson.remoting.Channel@7b736159:Channel to Maven [java, -cp, /home/jenkins/bis/maven31-agent.jar:/home/jenkins/bis/tools/hudson.tasks.Maven_MavenInstallation/3.2.2/boot/plexus-classworlds-2.5.1.jar:/home/jenkins/bis/tools/hudson.tasks.Maven_MavenInstallation/3.2.2/conf/logging, jenkins.maven3.agent.Maven31Main, /home/jenkins/bis/tools/hudson.tasks.Maven_MavenInstallation/3.2.2, /home/jenkins/bis/slave.jar, /home/jenkins/bis/maven31-interceptor.jar, /home/jenkins/bis/maven3-interceptor-commons.jar, 42104]" daemon prio=10 tid=0x00007f11f0187800 nid=0x6d4f in Object.wait() [0x00007f1245eb2000]
             java.lang.Thread.State: TIMED_WAITING (on object monitor)
          	at java.lang.Object.wait(Native Method)
          	- waiting on <0x00000000879416d8> (a hudson.remoting.UserRequest)
          	at hudson.remoting.Request.call(Request.java:146)
          	- locked <0x00000000879416d8> (a hudson.remoting.UserRequest)
          	at hudson.remoting.Channel.call(Channel.java:751)
          	at hudson.maven.ProcessCache$MavenProcess.call(ProcessCache.java:161)
          	at hudson.maven.MavenModuleSetBuild$MavenModuleSetBuildExecution.doRun(MavenModuleSetBuild.java:840)
          	at hudson.model.AbstractBuild$AbstractBuildExecution.run(AbstractBuild.java:533)
          	at hudson.model.Run.execute(Run.java:1745)
          	at hudson.maven.MavenModuleSetBuild.run(MavenModuleSetBuild.java:529)
          	at hudson.model.ResourceController.execute(ResourceController.java:89)
          	at hudson.model.Executor.run(Executor.java:240)
          


          James Nord added a comment - edited

          Possibly a duplicate of JENKINS-10840.

          Something strange is going on with Docker and tc.

          With 2 freestyle builds I see a failure and the slave is disconnected with:

          java.io.IOException: remote file operation failed: /home/jenkins/data/jenkins-slave.exe at hudson.remoting.Channel@7407d0f5:docker_ssh: hudson.remoting.ChannelClosedException: channel is already closed
          	at hudson.FilePath.act(FilePath.java:985)
          	at hudson.FilePath.act(FilePath.java:967)
          	at hudson.FilePath.exists(FilePath.java:1435)
          	at org.jenkinsci.modules.windows_slave_installer.SlaveExeUpdater$1.call(SlaveExeUpdater.java:46)
          	at org.jenkinsci.modules.windows_slave_installer.SlaveExeUpdater$1.call(SlaveExeUpdater.java:37)
          	at jenkins.util.ContextResettingExecutorService$2.call(ContextResettingExecutorService.java:46)
          	at java.util.concurrent.FutureTask.run(FutureTask.java:262)
          	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
          	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
          	at java.lang.Thread.run(Thread.java:745)
          Caused by: hudson.remoting.ChannelClosedException: channel is already closed
          	at hudson.remoting.Channel.send(Channel.java:549)
          	at hudson.remoting.Request.call(Request.java:129)
          	at hudson.remoting.Channel.call(Channel.java:751)
          	at hudson.FilePath.act(FilePath.java:978)
          	... 9 more
          Caused by: java.io.IOException: Unexpected termination of the channel
          	at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:50)
          Caused by: java.io.EOFException
          	at java.io.ObjectInputStream$PeekInputStream.readFully(ObjectInputStream.java:2325)
          	at java.io.ObjectInputStream$BlockDataInputStream.readShort(ObjectInputStream.java:2794)
          	at java.io.ObjectInputStream.readStreamHeader(ObjectInputStream.java:801)
          	at java.io.ObjectInputStream.<init>(ObjectInputStream.java:299)
          	at hudson.remoting.ObjectInputStreamEx.<init>(ObjectInputStreamEx.java:40)
          	at hudson.remoting.AbstractSynchronousByteArrayCommandTransport.read(AbstractSynchronousByteArrayCommandTransport.java:34)
          	at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:48)
          ERROR: Socket connection to SSH server was lost
          java.io.IOException: Peer sent DISCONNECT message (reason code 2): Packet corrupt
          	at com.trilead.ssh2.transport.TransportManager.receiveLoop(TransportManager.java:766)
          	at com.trilead.ssh2.transport.TransportManager$1.run(TransportManager.java:489)
          	at java.lang.Thread.run(Thread.java:745)
          

          But a single-bit packet corruption should cause the packet to be thrown away by the OS layer due to a TCP checksum mismatch, and not be seen by the application. (Possibly the corruption slips through here because checksum offloading on local virtual interfaces like docker0 can mean the TCP checksum is never actually verified, so the SSH layer's own integrity check is the first to notice.)

          The other interesting thing is that a build can be running fine and only dies when a new build is kicked off - I would not expect an issue in setting up a new channel in the multiplex (from what KK said) to fail the other channels.


          James Nord added a comment -

          FWIW the original report has nothing to do with packet corruption - just the channel dying.

          You can get the same results with a "kill -9" on the slave.


          Yoann Dubreuil added a comment -

          Yes, that's right. I found the problem when playing with netem, hence the bug report.

          It's a bug in the Maven plugin: when the upstream channel is closed, the Maven channel stays around. Will post a PR shortly.
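
          A hedged sketch of the shape such a fix could take (hypothetical class name; the actual PR may differ): tie the Maven process channel's lifetime to the upstream slave channel with a hudson.remoting.Channel.Listener, so that closing the slave channel also closes the nested Maven channel and any Request.call() blocked on it can observe isInClosed() and return.

          import hudson.remoting.Channel;
          import java.io.IOException;

          // Sketch only: close the channel to the forked Maven JVM when the
          // upstream slave channel goes away.
          class CloseMavenChannelOnDisconnect extends Channel.Listener {
              private final Channel mavenChannel;

              CloseMavenChannelOnDisconnect(Channel mavenChannel) {
                  this.mavenChannel = mavenChannel;
              }

              @Override
              public void onClosed(Channel slaveChannel, IOException cause) {
                  try {
                      mavenChannel.close(); // wakes threads blocked in Request.call()
                  } catch (IOException ignored) {
                      // best effort: the Maven channel may already be broken
                  }
              }
          }

          // Hypothetical wiring: slaveChannel.addListener(new CloseMavenChannelOnDisconnect(mavenChannel));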


            Assignee: Unassigned
            Reporter: Yoann Dubreuil (ydubreuil)
            Votes: 0
            Watchers: 6