Uploaded image for project: 'Jenkins'
  1. Jenkins
  2. JENKINS-25697

Channel A with ProxyInput/OutputStream to another Channel B will hang and leak if Channel B is terminated

      Whenever a remote connection to a NIO JNLP slave is abruptly terminated, Channel.terminate() does not cleanup exported FastPipedOutputStream, which causes FastPipedInputStream to leak, which causes FastPipedInputStream to loop forever waiting for buffer to be filled.

      This causes a job hang and Channel leaks along with all their internals.

      Example:

      "Channel reader thread: Channel to Maven [/home/ec2-user/devhome/current/jdk/bin/java, -Xmx1g, -XX:MaxPermSize=1g, -Djava.awt.headless=true, -cp, /home/ec2-user/maven31-agent.jar:/home/ec2-user/devhome/current/maven/boot/plexus-classworlds-2.5.1.jar:/home/ec2-user/devhome/current/maven/conf/logging, jenkins.maven3.agent.Maven31Main, /home/ec2-user/devhome/current/maven, /tmp/slave.jar, /home/ec2-user/maven31-interceptor.jar, /home/ec2-user/maven3-interceptor-commons.jar, 26853]" prio=10 tid=0x00002b143d5a6800 nid=0xf0cb in Object.wait() [0x00002b1434e0c000]
         java.lang.Thread.State: TIMED_WAITING (on object monitor)
      	at java.lang.Object.wait(Native Method)
      	- waiting on <0x0000000702df6a88> (a [B)
      	at hudson.remoting.FastPipedInputStream.read(FastPipedInputStream.java:175)
      	- locked <0x0000000702df6a88> (a [B)
      	at java.io.BufferedInputStream.fill(BufferedInputStream.java:235)
      	at java.io.BufferedInputStream.read(BufferedInputStream.java:254)
      	- locked <0x0000000702df5118> (a java.io.BufferedInputStream)
      	at hudson.remoting.FlightRecorderInputStream.read(FlightRecorderInputStream.java:82)
      	at hudson.remoting.ChunkedInputStream.readHeader(ChunkedInputStream.java:72)
      	at hudson.remoting.ChunkedInputStream.readUntilBreak(ChunkedInputStream.java:103)
      	at hudson.remoting.ChunkedCommandTransport.readBlock(ChunkedCommandTransport.java:33)
      	at hudson.remoting.AbstractSynchronousByteArrayCommandTransport.read(AbstractSynchronousByteArrayCommandTransport.java:34)
      	at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:48)
      
      "Executor #0 for Primary Koji Slave Build Machine (i-26ed9ecc) : executing Koji - WildFly #13 / waiting for hudson.remoting.Channel@5f9da360:Channel to Maven [/home/ec2-user/devhome/current/jdk/bin/java, -Xmx1g, -XX:MaxPermSize=1g, -Djava.awt.headless=true, -cp, /home/ec2-user/maven31-agent.jar:/home/ec2-user/devhome/current/maven/boot/plexus-classworlds-2.5.1.jar:/home/ec2-user/devhome/current/maven/conf/logging, jenkins.maven3.agent.Maven31Main, /home/ec2-user/devhome/current/maven, /tmp/slave.jar, /home/ec2-user/maven31-interceptor.jar, /home/ec2-user/maven3-interceptor-commons.jar, 26853]" prio=10 tid=0x00002b144b97f000 nid=0xf0bf in Object.wait() [0x00002b1436624000]
         java.lang.Thread.State: TIMED_WAITING (on object monitor)
      	at java.lang.Object.wait(Native Method)
      	- waiting on <0x0000000702df6400> (a hudson.remoting.UserRequest)
      	at hudson.remoting.Request.call(Request.java:146)
      	- locked <0x0000000702df6400> (a hudson.remoting.UserRequest)
      	at hudson.remoting.Channel.call(Channel.java:751)
      	at hudson.maven.ProcessCache$MavenProcess.call(ProcessCache.java:161)
      	at hudson.maven.MavenModuleSetBuild$MavenModuleSetBuildExecution.doRun(MavenModuleSetBuild.java:840)
      	at hudson.model.AbstractBuild$AbstractBuildExecution.run(AbstractBuild.java:533)
      	at hudson.model.Run.execute(Run.java:1745)
      	at hudson.maven.MavenModuleSetBuild.run(MavenModuleSetBuild.java:529)
      	at hudson.model.ResourceController.execute(ResourceController.java:89)
      	at hudson.model.Executor.run(Executor.java:240)
      

      Slave computer log:

      JNLP agent connected from /10.40.1.117
      <===[JENKINS REMOTING CAPACITY]===>Slave.jar version: 2.47
      This is a Unix slave
      Slave successfully connected and online
      JNLP agent connected
      ERROR: Connection terminated
      java.io.IOException: Connection aborted: org.jenkinsci.remoting.nio.NioChannelHub$MonoNioTransport@f4c5cfc[name=Primary Koji Slave Build Machine (i-26ed9ecc)]
      	at org.jenkinsci.remoting.nio.NioChannelHub$NioTransport.abort(NioChannelHub.java:211)
      	at org.jenkinsci.remoting.nio.NioChannelHub.run(NioChannelHub.java:631)
      	at jenkins.util.ContextResettingExecutorService$1.run(ContextResettingExecutorService.java:28)
      	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
      	at java.util.concurrent.FutureTask.run(FutureTask.java:262)
      	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
      	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
      	at java.lang.Thread.run(Thread.java:745)
      Caused by: java.io.IOException: Connection reset by peer
      	at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
      	at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
      	at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
      	at sun.nio.ch.IOUtil.read(IOUtil.java:197)
      	at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:379)
      	at org.jenkinsci.remoting.nio.FifoBuffer$Pointer.receive(FifoBuffer.java:136)
      	at org.jenkinsci.remoting.nio.FifoBuffer.receive(FifoBuffer.java:306)
      	at org.jenkinsci.remoting.nio.NioChannelHub.run(NioChannelHub.java:564)
      	... 6 more
      

      Discussion:

      The problem, ultimately, lies in how the Channel is terminated.

      1. NioTransport.abort(e) calls ByteArrayReceiver.terminate(e)
      2. ByteArrayReceiver.terminate(e) calls CommandReceiver.terminate(e)
      3. CommandReceiver is constructed as anonymous type in Channel constructor in transportSetup and redirects the call to Channel.this.terminate(e);
      4. Channel.this.terminate(e) closes does NOT cleanup any of the exported Streams in the ExportTable.
      5. Since channel doesn't clean up FastPipedOutputStream from the ExportTable the FastPipeInputStream.source() always returns a reference to FastPipedOutputStream and the pipe IN side of the pipe will wait on buffer forever.
      6. Since FastPipeInputStream.read waits for data in buffer forever, the SynchronousCommanTransport$ReaderThread never terminates.

          [JENKINS-25697] Channel A with ProxyInput/OutputStream to another Channel B will hang and leak if Channel B is terminated

          Arcadiy Ivanov added a comment - - edited

          The issue is as follows:

          1. First JNLP NIO creates a Channel A.
          2. Then AbstractMavenProcessFactory.newProcess creates a new Channel B by invoking the following:
                      Channel ch = Channels.forProcess("Channel to Maven " + Arrays.toString(cmds),
                              Computer.threadPoolForRemoting, new BufferedInputStream(con.in), new BufferedOutputStream(con.out),
                              listener.getLogger(), proc);
          

          The problem is, if Channel A abruptly terminates, Channel B will be stuck in all currently pending read operations (see stack traces above). The only way to terminate those read operations is to interrupt the threads that are blocked in said reads. The only way this can be achieved now is through a manual job abort (or, presumably, a Build Timeout plugin).

          There is no easy or short fix for this without a major and altogether backwards-incompatible remoting rewrite.
          The problem is two-fold:

          1. Chaining channels requires upstream channels to track downstream ones. Given the current APIs and that channels are dependent on other channels via arbitrary I/O streams that are neither channel- nor dependency-aware altogether, it is virtually impossible.
          2. Even if the first hurdle can be overcome, all threads blocked in IOPs have to be tracked to be interruptable asynchronously with implementation complexity on par with Java's InterruptibleChannel.

          Arcadiy Ivanov added a comment - - edited The issue is as follows: First JNLP NIO creates a Channel A. Then AbstractMavenProcessFactory.newProcess creates a new Channel B by invoking the following: Channel ch = Channels.forProcess("Channel to Maven " + Arrays.toString(cmds), Computer.threadPoolForRemoting, new BufferedInputStream(con.in), new BufferedOutputStream(con.out), listener.getLogger(), proc); The problem is, if Channel A abruptly terminates, Channel B will be stuck in all currently pending read operations (see stack traces above). The only way to terminate those read operations is to interrupt the threads that are blocked in said reads. The only way this can be achieved now is through a manual job abort (or, presumably, a Build Timeout plugin). There is no easy or short fix for this without a major and altogether backwards-incompatible remoting rewrite. The problem is two-fold: Chaining channels requires upstream channels to track downstream ones. Given the current APIs and that channels are dependent on other channels via arbitrary I/O streams that are neither channel- nor dependency-aware altogether, it is virtually impossible. Even if the first hurdle can be overcome, all threads blocked in IOPs have to be tracked to be interruptable asynchronously with implementation complexity on par with Java's InterruptibleChannel.

          Arcadiy Ivanov added a comment - - edited

          It is possible that evicting all exported FastPipedOutputStream objects on Channel.abort from the channel's ExportTable may eventually unblock all FastPipedInputStream, but since WeakReference is used to track the sides of the pipe and it's possible for FastPipedOutputStream hard references to leak virtually anywhere, such fix may not be reliable if it would work at all.

          Arcadiy Ivanov added a comment - - edited It is possible that evicting all exported FastPipedOutputStream objects on Channel.abort from the channel's ExportTable may eventually unblock all FastPipedInputStream, but since WeakReference is used to track the sides of the pipe and it's possible for FastPipedOutputStream hard references to leak virtually anywhere, such fix may not be reliable if it would work at all.

          Eh, is there really such a need for rewriting remoting?

          Isn't the correct thing to add a Channel.Listener to the launcher's channel

          launcher.getChannel().addListener(...);
          

          so that when the backing channel is terminated the registered listener gets called and can force clear up Channel B... (note that you would need to add a second listener to Channel B that removes the first listener from Channel A when Channel B is closed to prevent a memory leak... but nothing that requires a complete remoting rewrite from what I can see

          Stephen Connolly added a comment - Eh, is there really such a need for rewriting remoting? Isn't the correct thing to add a Channel.Listener to the launcher's channel launcher.getChannel().addListener(...); so that when the backing channel is terminated the registered listener gets called and can force clear up Channel B... (note that you would need to add a second listener to Channel B that removes the first listener from Channel A when Channel B is closed to prevent a memory leak... but nothing that requires a complete remoting rewrite from what I can see

          Oleg Nenashev added a comment -

          Yes, I also agree that it does not require a significant rewrite. But the issue is still there, I'd guess

          Oleg Nenashev added a comment - Yes, I also agree that it does not require a significant rewrite. But the issue is still there, I'd guess

            Unassigned Unassigned
            arcivanov Arcadiy Ivanov
            Votes:
            0 Vote for this issue
            Watchers:
            4 Start watching this issue

              Created:
              Updated: