-
Bug
-
Resolution: Unresolved
-
Critical
-
None
Whenever a remote connection to a NIO JNLP slave is abruptly terminated, Channel.terminate() does not cleanup exported FastPipedOutputStream, which causes FastPipedInputStream to leak, which causes FastPipedInputStream to loop forever waiting for buffer to be filled.
This causes a job hang and Channel leaks along with all their internals.
Example:
"Channel reader thread: Channel to Maven [/home/ec2-user/devhome/current/jdk/bin/java, -Xmx1g, -XX:MaxPermSize=1g, -Djava.awt.headless=true, -cp, /home/ec2-user/maven31-agent.jar:/home/ec2-user/devhome/current/maven/boot/plexus-classworlds-2.5.1.jar:/home/ec2-user/devhome/current/maven/conf/logging, jenkins.maven3.agent.Maven31Main, /home/ec2-user/devhome/current/maven, /tmp/slave.jar, /home/ec2-user/maven31-interceptor.jar, /home/ec2-user/maven3-interceptor-commons.jar, 26853]" prio=10 tid=0x00002b143d5a6800 nid=0xf0cb in Object.wait() [0x00002b1434e0c000] java.lang.Thread.State: TIMED_WAITING (on object monitor) at java.lang.Object.wait(Native Method) - waiting on <0x0000000702df6a88> (a [B) at hudson.remoting.FastPipedInputStream.read(FastPipedInputStream.java:175) - locked <0x0000000702df6a88> (a [B) at java.io.BufferedInputStream.fill(BufferedInputStream.java:235) at java.io.BufferedInputStream.read(BufferedInputStream.java:254) - locked <0x0000000702df5118> (a java.io.BufferedInputStream) at hudson.remoting.FlightRecorderInputStream.read(FlightRecorderInputStream.java:82) at hudson.remoting.ChunkedInputStream.readHeader(ChunkedInputStream.java:72) at hudson.remoting.ChunkedInputStream.readUntilBreak(ChunkedInputStream.java:103) at hudson.remoting.ChunkedCommandTransport.readBlock(ChunkedCommandTransport.java:33) at hudson.remoting.AbstractSynchronousByteArrayCommandTransport.read(AbstractSynchronousByteArrayCommandTransport.java:34) at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:48) "Executor #0 for Primary Koji Slave Build Machine (i-26ed9ecc) : executing Koji - WildFly #13 / waiting for hudson.remoting.Channel@5f9da360:Channel to Maven [/home/ec2-user/devhome/current/jdk/bin/java, -Xmx1g, -XX:MaxPermSize=1g, -Djava.awt.headless=true, -cp, /home/ec2-user/maven31-agent.jar:/home/ec2-user/devhome/current/maven/boot/plexus-classworlds-2.5.1.jar:/home/ec2-user/devhome/current/maven/conf/logging, jenkins.maven3.agent.Maven31Main, /home/ec2-user/devhome/current/maven, /tmp/slave.jar, /home/ec2-user/maven31-interceptor.jar, /home/ec2-user/maven3-interceptor-commons.jar, 26853]" prio=10 tid=0x00002b144b97f000 nid=0xf0bf in Object.wait() [0x00002b1436624000] java.lang.Thread.State: TIMED_WAITING (on object monitor) at java.lang.Object.wait(Native Method) - waiting on <0x0000000702df6400> (a hudson.remoting.UserRequest) at hudson.remoting.Request.call(Request.java:146) - locked <0x0000000702df6400> (a hudson.remoting.UserRequest) at hudson.remoting.Channel.call(Channel.java:751) at hudson.maven.ProcessCache$MavenProcess.call(ProcessCache.java:161) at hudson.maven.MavenModuleSetBuild$MavenModuleSetBuildExecution.doRun(MavenModuleSetBuild.java:840) at hudson.model.AbstractBuild$AbstractBuildExecution.run(AbstractBuild.java:533) at hudson.model.Run.execute(Run.java:1745) at hudson.maven.MavenModuleSetBuild.run(MavenModuleSetBuild.java:529) at hudson.model.ResourceController.execute(ResourceController.java:89) at hudson.model.Executor.run(Executor.java:240)
Slave computer log:
JNLP agent connected from /10.40.1.117 <===[JENKINS REMOTING CAPACITY]===>Slave.jar version: 2.47 This is a Unix slave Slave successfully connected and online JNLP agent connected ERROR: Connection terminated java.io.IOException: Connection aborted: org.jenkinsci.remoting.nio.NioChannelHub$MonoNioTransport@f4c5cfc[name=Primary Koji Slave Build Machine (i-26ed9ecc)] at org.jenkinsci.remoting.nio.NioChannelHub$NioTransport.abort(NioChannelHub.java:211) at org.jenkinsci.remoting.nio.NioChannelHub.run(NioChannelHub.java:631) at jenkins.util.ContextResettingExecutorService$1.run(ContextResettingExecutorService.java:28) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) at java.util.concurrent.FutureTask.run(FutureTask.java:262) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) Caused by: java.io.IOException: Connection reset by peer at sun.nio.ch.FileDispatcherImpl.read0(Native Method) at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39) at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223) at sun.nio.ch.IOUtil.read(IOUtil.java:197) at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:379) at org.jenkinsci.remoting.nio.FifoBuffer$Pointer.receive(FifoBuffer.java:136) at org.jenkinsci.remoting.nio.FifoBuffer.receive(FifoBuffer.java:306) at org.jenkinsci.remoting.nio.NioChannelHub.run(NioChannelHub.java:564) ... 6 more
Discussion:
The problem, ultimately, lies in how the Channel is terminated.
- NioTransport.abort(e) calls ByteArrayReceiver.terminate(e)
- ByteArrayReceiver.terminate(e) calls CommandReceiver.terminate(e)
- CommandReceiver is constructed as anonymous type in Channel constructor in transportSetup and redirects the call to Channel.this.terminate(e);
- Channel.this.terminate(e) closes does NOT cleanup any of the exported Streams in the ExportTable.
- Since channel doesn't clean up FastPipedOutputStream from the ExportTable the FastPipeInputStream.source() always returns a reference to FastPipedOutputStream and the pipe IN side of the pipe will wait on buffer forever.
- Since FastPipeInputStream.read waits for data in buffer forever, the SynchronousCommanTransport$ReaderThread never terminates.
- is related to
-
JENKINS-25672 Distributed deadlock with remote classloading
-
- Open
-
-
JENKINS-25626 Allow an option to start a JNLP agent and terminate SSH connection
-
- Closed
-
The issue is as follows:
The problem is, if Channel A abruptly terminates, Channel B will be stuck in all currently pending read operations (see stack traces above). The only way to terminate those read operations is to interrupt the threads that are blocked in said reads. The only way this can be achieved now is through a manual job abort (or, presumably, a Build Timeout plugin).
There is no easy or short fix for this without a major and altogether backwards-incompatible remoting rewrite.
The problem is two-fold: