
Channel hangs due to the infinite loop in FifoBuffer within the lock

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Component/s: core, remoting
    • Labels: None

      I noticed the following "deadlock" that prevents NioChannelHub from serving any channels, which breaks all the slaves.

      NioChannelHub thread is blocked:
      
          "NioChannelHub keys=2 gen=185197: Computer.threadPoolForRemoting [#3]" daemon prio=10 tid=0x00007f872c021800 nid=0x1585 waiting for monitor entry [0x00007f86ce2ba000]
             java.lang.Thread.State: BLOCKED (on object monitor)
      	    at hudson.remoting.Channel.terminate(Channel.java:792)
      	    - waiting to lock <0x00007f874ef76658> (a hudson.remoting.Channel)
      	    at hudson.remoting.Channel$2.terminate(Channel.java:483)
      	    at hudson.remoting.AbstractByteArrayCommandTransport$1.terminate(AbstractByteArrayCommandTransport.java:72)
      	    at org.jenkinsci.remoting.nio.NioChannelHub$NioTransport.abort(NioChannelHub.java:203)
      	    at org.jenkinsci.remoting.nio.NioChannelHub.run(NioChannelHub.java:597)
      	    at jenkins.util.ContextResettingExecutorService$1.run(ContextResettingExecutorService.java:28)
      	    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
      	    at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
      	    at java.util.concurrent.FutureTask.run(FutureTask.java:138)
      	    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
      	    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
      	    at java.lang.Thread.run(Thread.java:662)
      
      ... because of this guy:
      
          "Computer.threadPoolForRemoting [#216] for mac" daemon prio=10 tid=0x00007f86dc0d6800 nid=0x3f34 in Object.wait() [0x00007f87442f1000]
             java.lang.Thread.State: WAITING (on object monitor)
      	    at java.lang.Object.wait(Native Method)
      	    - waiting on <0x00007f874ef76810> (a org.jenkinsci.remoting.nio.FifoBuffer)
      	    at java.lang.Object.wait(Object.java:485)
      	    at org.jenkinsci.remoting.nio.FifoBuffer.write(FifoBuffer.java:336)
      	    - locked <0x00007f874ef76810> (a org.jenkinsci.remoting.nio.FifoBuffer)
      	    at org.jenkinsci.remoting.nio.NioChannelHub$NioTransport.writeBlock(NioChannelHub.java:215)
      	    at hudson.remoting.AbstractByteArrayCommandTransport.write(AbstractByteArrayCommandTransport.java:83)
      	    at hudson.remoting.Channel.send(Channel.java:545)
      	    - locked <0x00007f874ef76658> (a hudson.remoting.Channel)
      	    at hudson.remoting.Request$2.run(Request.java:342)
      	    - locked <0x00007f874ef76658> (a hudson.remoting.Channel)
      	    at hudson.remoting.InterceptingExecutorService$1.call(InterceptingExecutorService.java:72)
      	    at jenkins.util.ContextResettingExecutorService$2.call(ContextResettingExecutorService.java:46)
      	    at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
      	    at java.util.concurrent.FutureTask.run(FutureTask.java:138)
      	    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
      	    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
      	    at java.lang.Thread.run(Thread.java:662)
      

      Full thread dump is here
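
      For reference, a dump like the ones above, including the "- locked <0x...>" monitor information that matters for this issue, can be taken with jstack <pid> against the Jenkins process, or programmatically through the standard java.lang.management API. A minimal sketch of the programmatic variant (the class name here is invented):

          import java.lang.management.ManagementFactory;
          import java.lang.management.MonitorInfo;
          import java.lang.management.ThreadInfo;
          import java.lang.management.ThreadMXBean;

          // Minimal sketch: print every thread's state, stack trace, and the
          // monitors it holds, i.e. the information used in the traces above.
          public class ThreadDumpSketch {
              public static void main(String[] args) {
                  ThreadMXBean mx = ManagementFactory.getThreadMXBean();
                  // lockedMonitors = true so owned monitors are reported per stack frame
                  for (ThreadInfo ti : mx.dumpAllThreads(true, true)) {
                      System.out.printf("\"%s\" id=%d state=%s%n",
                              ti.getThreadName(), ti.getThreadId(), ti.getThreadState());
                      StackTraceElement[] stack = ti.getStackTrace();
                      for (int depth = 0; depth < stack.length; depth++) {
                          System.out.println("    at " + stack[depth]);
                          for (MonitorInfo mi : ti.getLockedMonitors()) {
                              if (mi.getLockedStackDepth() == depth)
                                  System.out.println("    - locked " + mi);
                          }
                      }
                      System.out.println();
                  }
              }
          }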

          [JENKINS-25218] Channel hangs due to the infinite loop in FifoBuffer within the lock

          Stephen Connolly added a comment -

          Also seeing similar live-locks with the following thread holding the lock:

          "Computer.threadPoolForRemoting [#173620] : IO ID=7851 : seq#=7850" #2080996 daemon prio=5 os_prio=0 tid=0x0000000088306800 nid=0x15e0 in Object.wait() [0x00000000f209f000]
             java.lang.Thread.State: WAITING (on object monitor)
              at java.lang.Object.wait(Native Method)
              at java.lang.Object.wait(Object.java:502)
              at org.jenkinsci.remoting.nio.FifoBuffer.write(FifoBuffer.java:336)
              - locked <0x000000044d3408b8> (a org.jenkinsci.remoting.nio.FifoBuffer)
              at org.jenkinsci.remoting.nio.NioChannelHub$NioTransport.writeBlock(NioChannelHub.java:220)
              at hudson.remoting.AbstractByteArrayCommandTransport.write(AbstractByteArrayCommandTransport.java:83)
              at hudson.remoting.Channel.send(Channel.java:576)
              - locked <0x000000044d3407a0> (a hudson.remoting.Channel)
              at hudson.remoting.ProxyOutputStream$Chunk$1.run(ProxyOutputStream.java:260)
              at hudson.remoting.PipeWriter$1.run(PipeWriter.java:158)
              at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
              at java.util.concurrent.FutureTask.run(FutureTask.java:266)
              at hudson.remoting.SingleLaneExecutorService$1.run(SingleLaneExecutorService.java:112)
              at hudson.remoting.InterceptingExecutorService$1.call(InterceptingExecutorService.java:68)
              at jenkins.util.ContextResettingExecutorService$2.call(ContextResettingExecutorService.java:46)
              at java.util.concurrent.FutureTask.run(FutureTask.java:266)
              at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
              at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
              at java.lang.Thread.run(Thread.java:745)
          

          The NioChannelHub thread has the same stack trace (modulo line numbers)


          Oleg Nenashev added a comment -

          I have found the cause of this issue in the FifoBuffer implementation. It happens when:

          1) FifoBuffer is full (sz == limit)
          2) We want to write a new message to the buffer
          3) So we get into this code: https://github.com/jenkinsci/remoting/blob/master/src/main/java/org/jenkinsci/remoting/nio/FifoBuffer.java#L342-L348

              synchronized (lock) {
                  while ((chunk = Math.min(len,writable()))==0) {
                      if (closed)
                          throw new IOException("closed during write operation");

                      lock.wait(100);
                  }
                  // ...
              }

          4) We end up in an infinite loop that waits for some space to become free in the buffer, but no space will ever be freed while the buffer stays full:

          • the receive() operation cannot run, since that code is synchronized on the same lock
          • we cannot close the channel either, because modification of the "closed" field is guarded by the same lock. Sad panda (see the sketch after this list)
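
          To make the lock interaction concrete, here is a minimal, self-contained sketch of the pattern described above (an illustration only; the class, fields, and method bodies are invented and greatly simplified, not the actual Channel/FifoBuffer code):

              public class ChannelWriteDeadlockSketch {
                  private final Object channelLock = new Object();   // stands in for the Channel monitor
                  private final Object bufferLock  = new Object();   // stands in for the FifoBuffer lock
                  private boolean closed = false;
                  private int free = 0;                              // 0 == buffer is already full

                  // Writer path (the "for mac" thread above): Channel.send() -> FifoBuffer.write()
                  void send(byte[] payload) throws java.io.IOException, InterruptedException {
                      synchronized (channelLock) {                   // held across the whole write
                          synchronized (bufferLock) {
                              while (free < payload.length) {        // never becomes true: nothing drains the buffer
                                  if (closed)
                                      throw new java.io.IOException("closed during write operation");
                                  bufferLock.wait(100);              // polls forever while still owning channelLock
                              }
                              free -= payload.length;
                          }
                      }
                  }

                  // Abort path (the NioChannelHub thread above): terminating needs the channel monitor first
                  void terminate() {
                      synchronized (channelLock) {                   // BLOCKED: the writer already owns this
                          synchronized (bufferLock) {
                              closed = true;                         // would end the writer's loop, but is never reached
                              bufferLock.notifyAll();
                          }
                      }
                  }
              }

          If send() is running on one thread when terminate() is called from another, the two stack shapes in the description fall out directly: the writer keeps polling inside the buffer wait loop while the hub thread stays BLOCKED on the channel monitor and can never set closed.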

          Fixing the issue right now
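
          For illustration only, a minimal sketch of the general hardening direction for this kind of bounded buffer (all names below are invented, and this is not necessarily the change that was actually made in remoting): keep the abort path on the buffer's own lock so it can never queue behind a writer's caller, and bound how long a writer will wait for free space before failing the write.

              import java.io.IOException;
              import java.io.InterruptedIOException;

              public class BoundedBufferSketch {
                  private final Object lock = new Object();
                  private final byte[] ring;
                  private int used;
                  private boolean closed;

                  public BoundedBufferSketch(int capacity) {
                      this.ring = new byte[capacity];
                  }

                  // Abort path: needs only the buffer's own lock, so it cannot be
                  // blocked behind a writer's caller (e.g. a channel monitor).
                  public void close() {
                      synchronized (lock) {
                          closed = true;
                          lock.notifyAll();   // wake any writer stuck waiting for space
                      }
                  }

                  // Writer path: waits for free space, but gives up after a bounded
                  // time instead of polling forever.
                  public void write(byte[] data, long timeoutMillis)
                          throws IOException, InterruptedException {
                      long deadline = System.currentTimeMillis() + timeoutMillis;
                      synchronized (lock) {
                          while (ring.length - used < data.length) {
                              if (closed)
                                  throw new IOException("closed during write operation");
                              long remaining = deadline - System.currentTimeMillis();
                              if (remaining <= 0)
                                  throw new InterruptedIOException(
                                          "buffer stayed full for " + timeoutMillis + " ms; giving up the write");
                              lock.wait(Math.min(remaining, 100L));
                          }
                          used += data.length;   // a real buffer would also copy the bytes here
                      }
                  }
              }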


          Oleg Nenashev added a comment - edited

          A partial fix has been applied via JENKINS-32825, which is released in remoting 2.54. Jenkins cores should get this version starting from 1.651.x. But we're still at risk in particular corner cases.

          I'm working on additional fixes around this behavior.


          Oleg Nenashev added a comment -

          Created additional pull request: https://github.com/jenkinsci/remoting/pull/100


          Oleg Nenashev added a comment -

          Additional fixes have been integrated into remoting-2.62.4 and remoting-3.3.


          SCM/JIRA link daemon added a comment -

          Code changed in jenkins
          User: Oleg Nenashev
          Path:
          pom.xml
          http://jenkins-ci.org/commit/jenkins/ef588be4f264b5ba285110f472f031e2bd771c71
          Log:
          Update Jenkins remoting to 3.3 (#2671)

          • JENKINS-25218 - Hardening of FifoBuffer operation logic. The change improves the original fix in `remoting-2.54`.
          • JENKINS-39547 - Corrupt agent JAR cache causes agents to malfunction.

          Improvements:

          • JENKINS-40491 - Improve diagnostics of the preliminary FifoBuffer termination.
          • ProxyException now retains any suppressed exceptions.


          Chris Phillips added a comment - edited

          I have just been investigating a problem in our Jenkins setup that I think might be related to this. We're using the EC2 plugin and running builds that generate quite large logs (230 MB). At some point during the build, the master loses track of the log and just starts logging the same block of text from the log over and over for as long as I let it. The build completes successfully on the slave and nothing bad appears in the Node log in the Jenkins UI, but the master continues to fill up the filesystem with the same repeated text forever. I changed the build to log much less and now this isn't happening. We're running 2.46.2. Could this potentially be one of the edge cases? https://issues.jenkins-ci.org/browse/JENKINS-44483


          Oleg Nenashev added a comment -

          As we discussed in JENKINS-44483, it is likely a Pipeline-specific issue.

          Regarding this ticket, I am going to close it, since additional patches have been applied. If somebody still sees it with remoting 3.5+, please let me know.


            Assignee: Oleg Nenashev (oleg_nenashev)
            Reporter: Kohsuke Kawaguchi (kohsuke)
            Votes: 7
            Watchers: 13
