JENKINS-26947

Maven job stuck when slave channel gets disconnected

      I found a way to trigger a remoting problem using TCP fault injection with netem. I'm able to trigger this wait call at hudson.remoting.Request.call(Request.java:146):

      while(response==null && !channel.isInClosed())
        // I don't know exactly when this can happen, as pendingCalls are cleaned up by Channel,
        // but in production I've observed that in rare occasion it can block forever, even after a channel
        // is gone. So be defensive against that.
        wait(30*1000);
      

      When this wait is triggered, the running build is stuck and consumes an executor, looping over and over on the wait.
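
      For illustration, here is a minimal, self-contained sketch (hypothetical class and field names, not the actual remoting sources) of why this wait pattern can spin forever: the loop only exits when a response arrives or the channel is flagged as closed, so if the response was lost and the closed flag is never set on this particular channel, the thread re-waits every 30 seconds indefinitely.

      // Sketch only: StuckWaitDemo, response and inClosed are hypothetical names.
      public class StuckWaitDemo {
          private Object response;           // never set once the reply is lost
          private volatile boolean inClosed; // never set if the channel isn't closed

          public synchronized Object call() throws InterruptedException {
              // Mirrors the loop at Request.call(Request.java:146): wake up every
              // 30s, re-check both conditions, and wait again. With no response
              // and no close notification, this never terminates.
              while (response == null && !inClosed) {
                  wait(30 * 1000);
              }
              return response;
          }
      }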

      To reproduce, set up an SSH slave using the attached Dockerfile, and set up netem on the docker0 bridge like this:

      tc qdisc add dev docker0 root netem
      tc qdisc change dev docker0 root netem corrupt 1
      

      Run the job once before configuring netem: netem settings apply to all network streams, so the build could otherwise fail while downloading Maven dependencies. I just launched a Maven build of an example project to trigger the problem. It might be a Maven-specific problem...

      To remove netem settings, just run tc qdisc del dev docker0 root.

      I've attached the Dockerfile, the command I used to launch it, and a thread dump of a stuck Jenkins master.

        1. Dockerfile
          0.4 kB
        2. launch.sh
          0.0 kB
        3. stacktrace.txt
          44 kB

          [JENKINS-26947] Maven job stuck when slave channel gets disconnected

          Yoann Dubreuil created issue -

          Daniel Beck added a comment -

          Is this a security issue? E.g. is this exploitable by third parties to disrupt a network-reachable Jenkins service?


          Yoann Dubreuil added a comment -

          You must be on the path of the network stream to be able to change the packet content. Even if you are able to get there, the SSH protocol protects the content of the stream. You would only be able to trigger a disconnection, but at that stage, I bet a lot of other network services are in danger as well.

          So for me, it's not a security issue.

          Yoann Dubreuil made changes - reformatted the description (wrapped the code snippets in code blocks)

          James Nord added a comment -

          Have you disabled the PingThread at all?

          AFAICT netem does not kill the connection - the remote end will keep retransmitting the packets - and as such the channel is not closed.
          The PingThread should eventually notice this (10-minute interval + 4-minute timeout), so after at most 14 minutes the connection should be killed and this thread should unblock.
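
          A hedged sketch (not the actual hudson.remoting.PingThread source) of the watchdog behavior described above: ping the remote side on a fixed interval and, if no reply arrives within the timeout, declare the channel dead so blocked callers can unblock.

          // Sketch only: interval/timeout values taken from the comment above.
          abstract class PingWatchdog extends Thread {
              private static final long INTERVAL = 10 * 60 * 1000; // 10 minutes between pings
              private static final long TIMEOUT  =  4 * 60 * 1000; // 4 minutes to answer

              @Override
              public void run() {
                  try {
                      while (true) {
                          Thread.sleep(INTERVAL);
                          long deadline = System.currentTimeMillis() + TIMEOUT;
                          if (!pingSucceededBy(deadline)) {
                              onDead(); // expected to close the channel, waking up waiters
                          }
                      }
                  } catch (InterruptedException e) {
                      // watchdog shut down
                  }
              }

              /** Send a no-op over the channel; true iff it returned by the deadline. */
              protected abstract boolean pingSucceededBy(long deadline);

              /** Remote side missed the deadline; kill the connection. */
              protected abstract void onDead();
          }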


          Yoann Dubreuil added a comment -

          No, I did not disable the ping thread. In fact, I did nothing special, just started a fresh Jenkins instance and connected it to this docker slave. I took the thread dump 30 minutes after the disconnection. I will relaunch the test this afternoon to see whether it ever times out or not.


          Yoann Dubreuil added a comment -

          The executor thread is still blocked 2 hours after the slave was disconnected.

          Stack of the blocked thread is:

          "Executor #0 for dumb-docker-ssh : executing gameoflife #41 / waiting for hudson.remoting.Channel@7b736159:Channel to Maven [java, -cp, /home/jenkins/bis/maven31-agent.jar:/home/jenkins/bis/tools/hudson.tasks.Maven_MavenInstallation/3.2.2/boot/plexus-classworlds-2.5.1.jar:/home/jenkins/bis/tools/hudson.tasks.Maven_MavenInstallation/3.2.2/conf/logging, jenkins.maven3.agent.Maven31Main, /home/jenkins/bis/tools/hudson.tasks.Maven_MavenInstallation/3.2.2, /home/jenkins/bis/slave.jar, /home/jenkins/bis/maven31-interceptor.jar, /home/jenkins/bis/maven3-interceptor-commons.jar, 42104]" daemon prio=10 tid=0x00007f11f0187800 nid=0x6d4f in Object.wait() [0x00007f1245eb2000]
             java.lang.Thread.State: TIMED_WAITING (on object monitor)
          	at java.lang.Object.wait(Native Method)
          	- waiting on <0x00000000879416d8> (a hudson.remoting.UserRequest)
          	at hudson.remoting.Request.call(Request.java:146)
          	- locked <0x00000000879416d8> (a hudson.remoting.UserRequest)
          	at hudson.remoting.Channel.call(Channel.java:751)
          	at hudson.maven.ProcessCache$MavenProcess.call(ProcessCache.java:161)
          	at hudson.maven.MavenModuleSetBuild$MavenModuleSetBuildExecution.doRun(MavenModuleSetBuild.java:840)
          	at hudson.model.AbstractBuild$AbstractBuildExecution.run(AbstractBuild.java:533)
          	at hudson.model.Run.execute(Run.java:1745)
          	at hudson.maven.MavenModuleSetBuild.run(MavenModuleSetBuild.java:529)
          	at hudson.model.ResourceController.execute(ResourceController.java:89)
          	at hudson.model.Executor.run(Executor.java:240)
          


          James Nord added a comment - edited

          Possibly a duplicate of JENKINS-10840.

          Something strange is going on with Docker and tc.

          With 2 freestyle builds I see a failure and the slave is disconnected with:

          java.io.IOException: remote file operation failed: /home/jenkins/data/jenkins-slave.exe at hudson.remoting.Channel@7407d0f5:docker_ssh: hudson.remoting.ChannelClosedException: channel is already closed
          	at hudson.FilePath.act(FilePath.java:985)
          	at hudson.FilePath.act(FilePath.java:967)
          	at hudson.FilePath.exists(FilePath.java:1435)
          	at org.jenkinsci.modules.windows_slave_installer.SlaveExeUpdater$1.call(SlaveExeUpdater.java:46)
          	at org.jenkinsci.modules.windows_slave_installer.SlaveExeUpdater$1.call(SlaveExeUpdater.java:37)
          	at jenkins.util.ContextResettingExecutorService$2.call(ContextResettingExecutorService.java:46)
          	at java.util.concurrent.FutureTask.run(FutureTask.java:262)
          	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
          	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
          	at java.lang.Thread.run(Thread.java:745)
          Caused by: hudson.remoting.ChannelClosedException: channel is already closed
          	at hudson.remoting.Channel.send(Channel.java:549)
          	at hudson.remoting.Request.call(Request.java:129)
          	at hudson.remoting.Channel.call(Channel.java:751)
          	at hudson.FilePath.act(FilePath.java:978)
          	... 9 more
          Caused by: java.io.IOException: Unexpected termination of the channel
          	at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:50)
          Caused by: java.io.EOFException
          	at java.io.ObjectInputStream$PeekInputStream.readFully(ObjectInputStream.java:2325)
          	at java.io.ObjectInputStream$BlockDataInputStream.readShort(ObjectInputStream.java:2794)
          	at java.io.ObjectInputStream.readStreamHeader(ObjectInputStream.java:801)
          	at java.io.ObjectInputStream.<init>(ObjectInputStream.java:299)
          	at hudson.remoting.ObjectInputStreamEx.<init>(ObjectInputStreamEx.java:40)
          	at hudson.remoting.AbstractSynchronousByteArrayCommandTransport.read(AbstractSynchronousByteArrayCommandTransport.java:34)
          	at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:48)
          ERROR: Socket connection to SSH server was lost
          java.io.IOException: Peer sent DISCONNECT message (reason code 2): Packet corrupt
          	at com.trilead.ssh2.transport.TransportManager.receiveLoop(TransportManager.java:766)
          	at com.trilead.ssh2.transport.TransportManager$1.run(TransportManager.java:489)
          	at java.lang.Thread.run(Thread.java:745)
          

          But a single-bit packet corruption should cause the packet to be thrown away by the OS layer due to a TCP checksum mismatch, and not be seen by the application. (Possibly the corruption slips through here because checksum offloading on local virtual interfaces like docker0 can mean the TCP checksum is never actually verified, so the SSH layer's own integrity check is the first to notice.)

          The other interesting thing is that a build can be running fine and only dies when a new build is kicked off - I would not expect an issue in setting up a new channel in the multiplex (from what KK said) to fail the other channels.


          James Nord added a comment -

          FWIW the original report has nothing to do with packet corruption - just the channel dying.

          You can get the same results with a "kill -9" on the slave.


          Yoann Dubreuil added a comment -

          Yes, that's right. I found the problem when playing with netem, hence the bug report.

          It's a bug in the Maven plugin: when the upstream channel is closed, the Maven channel stays around. Will post a PR shortly.
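
          A hedged sketch of the shape such a fix could take (hypothetical class name; the actual PR may differ): tie the Maven process channel's lifetime to the upstream slave channel with a hudson.remoting.Channel.Listener, so that closing the slave channel also closes the nested Maven channel and any Request.call() blocked on it can observe isInClosed() and return.

          import hudson.remoting.Channel;
          import java.io.IOException;

          // Sketch only: close the channel to the forked Maven JVM when the
          // upstream slave channel goes away.
          class CloseMavenChannelOnDisconnect extends Channel.Listener {
              private final Channel mavenChannel;

              CloseMavenChannelOnDisconnect(Channel mavenChannel) {
                  this.mavenChannel = mavenChannel;
              }

              @Override
              public void onClosed(Channel slaveChannel, IOException cause) {
                  try {
                      mavenChannel.close(); // wakes threads blocked in Request.call()
                  } catch (IOException ignored) {
                      // best effort: the Maven channel may already be broken
                  }
              }
          }

          // Hypothetical wiring: slaveChannel.addListener(new CloseMavenChannelOnDisconnect(mavenChannel));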


            Assignee: Unassigned
            Reporter: Yoann Dubreuil (ydubreuil)
            Votes: 0
            Watchers: 6