JENKINS-25218

Channel hangs due to the infinite loop in FifoBuffer within the lock


Details

    Description

      I noticed the following "deadlock" that prevents NioChannelHub from serving any channels, which breaks all the slaves.

      NioChannelHub thread is blocked:
      
          "NioChannelHub keys=2 gen=185197: Computer.threadPoolForRemoting [#3]" daemon prio=10 tid=0x00007f872c021800 nid=0x1585 waiting for monitor entry [0x00007f86ce2ba000]
             java.lang.Thread.State: BLOCKED (on object monitor)
      	    at hudson.remoting.Channel.terminate(Channel.java:792)
      	    - waiting to lock <0x00007f874ef76658> (a hudson.remoting.Channel)
      	    at hudson.remoting.Channel$2.terminate(Channel.java:483)
      	    at hudson.remoting.AbstractByteArrayCommandTransport$1.terminate(AbstractByteArrayCommandTransport.java:72)
      	    at org.jenkinsci.remoting.nio.NioChannelHub$NioTransport.abort(NioChannelHub.java:203)
      	    at org.jenkinsci.remoting.nio.NioChannelHub.run(NioChannelHub.java:597)
      	    at jenkins.util.ContextResettingExecutorService$1.run(ContextResettingExecutorService.java:28)
      	    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
      	    at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
      	    at java.util.concurrent.FutureTask.run(FutureTask.java:138)
      	    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
      	    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
      	    at java.lang.Thread.run(Thread.java:662)
      
      ... because of this thread, which holds the channel lock while blocked in FifoBuffer.write():
      
          "Computer.threadPoolForRemoting [#216] for mac" daemon prio=10 tid=0x00007f86dc0d6800 nid=0x3f34 in Object.wait() [0x00007f87442f1000]
             java.lang.Thread.State: WAITING (on object monitor)
      	    at java.lang.Object.wait(Native Method)
      	    - waiting on <0x00007f874ef76810> (a org.jenkinsci.remoting.nio.FifoBuffer)
      	    at java.lang.Object.wait(Object.java:485)
      	    at org.jenkinsci.remoting.nio.FifoBuffer.write(FifoBuffer.java:336)
      	    - locked <0x00007f874ef76810> (a org.jenkinsci.remoting.nio.FifoBuffer)
      	    at org.jenkinsci.remoting.nio.NioChannelHub$NioTransport.writeBlock(NioChannelHub.java:215)
      	    at hudson.remoting.AbstractByteArrayCommandTransport.write(AbstractByteArrayCommandTransport.java:83)
      	    at hudson.remoting.Channel.send(Channel.java:545)
      	    - locked <0x00007f874ef76658> (a hudson.remoting.Channel)
      	    at hudson.remoting.Request$2.run(Request.java:342)
      	    - locked <0x00007f874ef76658> (a hudson.remoting.Channel)
      	    at hudson.remoting.InterceptingExecutorService$1.call(InterceptingExecutorService.java:72)
      	    at jenkins.util.ContextResettingExecutorService$2.call(ContextResettingExecutorService.java:46)
      	    at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
      	    at java.util.concurrent.FutureTask.run(FutureTask.java:138)
      	    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
      	    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
      	    at java.lang.Thread.run(Thread.java:662)
      

      The full thread dump is attached to this issue.
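
      To make the interaction concrete, here is a minimal, self-contained sketch of the pattern shown by the two stack traces above (class, field, and thread names are hypothetical; this is not the actual remoting code): one thread holds the channel monitor while blocking inside a full buffer's write loop, and the hub thread that could drain the buffer or abort the channel is stuck waiting for that same channel monitor.

          public class ChannelWriteHangSketch {
              private static final int LIMIT = 8;                      // buffer capacity
              private static final Object channelLock = new Object();  // stands in for hudson.remoting.Channel
              private static final Object bufferLock = new Object();   // stands in for FifoBuffer's internal lock
              private static int used = LIMIT;                         // buffer is already full

              public static void main(String[] args) throws Exception {
                  Thread writer = new Thread(() -> {
                      synchronized (channelLock) {           // Channel.send() holds the channel monitor
                          synchronized (bufferLock) {         // FifoBuffer.write()
                              try {
                                  while (LIMIT - used == 0) { // no writable space
                                      bufferLock.wait(100);   // loops forever: nobody can free space
                                  }
                              } catch (InterruptedException ignored) { }
                          }
                      }
                  }, "writer-holding-channel-lock");

                  Thread hub = new Thread(() -> {
                      synchronized (channelLock) {            // Channel.terminate() needs the channel monitor
                          synchronized (bufferLock) {
                              used = 0;                       // would drain/abort the buffer, but never runs
                              bufferLock.notifyAll();
                          }
                      }
                  }, "hub-draining-or-terminating");

                  writer.start();
                  Thread.sleep(200);                          // let the writer grab the channel lock first
                  hub.start();                                // blocks on channelLock; the JVM now hangs
              }
          }

      Running this, the "writer" thread keeps looping in wait(100) and the "hub" thread stays BLOCKED on the channel monitor, which matches the two thread states in the dump above.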

          Activity

            oleg_nenashev Oleg Nenashev added a comment -

            As we discussed in JENKINS-44483, it is likely a Pipeline-specific issue.

            Regarding this ticket, I am going to close it since additional patches have been applied. If somebody still sees it with remoting 3.5+, please let me know.

            chris_phillips Chris Phillips added a comment - - edited

            I have just been investigating a problem in our Jenkins setup that I think might be related to this. We're using the EC2 plugin and running builds that generate quite large logs (230 MB). At some point during the build, the master loses track of the log and just starts logging the same block of text from the log over and over for as long as I let it. The build completes successfully on the slave and nothing bad appears in the Node log in the Jenkins UI, but the master continues to fill up the filesystem with the same repeated text forever. I changed the build to log much less and now this isn't happening. We're running Jenkins 2.46.2. Could this potentially be one of the edge cases? https://issues.jenkins-ci.org/browse/JENKINS-44483


            scm_issue_link SCM/JIRA link daemon added a comment -

            Code changed in jenkins
            User: Oleg Nenashev
            Path:
            pom.xml
            http://jenkins-ci.org/commit/jenkins/ef588be4f264b5ba285110f472f031e2bd771c71
            Log:
            Update Jenkins remoting to 3.3 (#2671)

            • JENKINS-25218 - Hardening of FifoBuffer operation logic. The change improves the original fix in `remoting-2.54`.
            • JENKINS-39547 - Corrupt agent JAR cache causes agents to malfunction.

            Improvements:

            • JENKINS-40491 - Improve diagnostics of the preliminary FifoBuffer termination.
            • ProxyException now retains any suppressed exceptions.
            oleg_nenashev Oleg Nenashev added a comment -

            Additional fixes have been integrated into remoting-2.62.4 and remoting-3.3.

            oleg_nenashev Oleg Nenashev added a comment -

            Created additional pull request: https://github.com/jenkinsci/remoting/pull/100

            oleg_nenashev Oleg Nenashev added a comment - - edited

            A partial fix has been applied in JENKINS-32825, which was released in remoting 2.54. Jenkins cores should get this version starting from 1.651.x, but we're still at risk in particular corner cases.

            I'm working on additional fixes for this behavior.

            oleg_nenashev Oleg Nenashev added a comment -

            I have found the cause of this issue in the FifoBuffer implementation. It happens when:

            1) FifoBuffer is full (sz == limit)
            2) We want to write a new message to the buffer
            3) So we get into this code... https://github.com/jenkinsci/remoting/blob/master/src/main/java/org/jenkinsci/remoting/nio/FifoBuffer.java#L342-L348

                        synchronized (lock) {
                            while ((chunk = Math.min(len, writable())) == 0) {
                                if (closed)
                                    throw new IOException("closed during write operation");

                                lock.wait(100);
                            }
                            // ...
                        }


            4) This is an infinite loop: the code waits until some space becomes available for writing, but space will never appear while the buffer stays full, because:

            • the receive() operation won't happen, since that code is synchronized on lock as well
            • we won't be able to close the channel, because modification of the "closed" field is also guarded by lock. Sad panda

            Fixing the issue right now
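
            For illustration only (this is not the actual patch that went into remoting-2.54 or remoting-3.3; the class name and the timeout parameter are made up), one way such a write loop can be hardened is to re-check the closed flag on every pass and to give up after an overall deadline instead of waiting indefinitely for space that may never appear:

                import java.io.IOException;

                class BoundedWriteSketch {
                    private final Object lock = new Object();
                    private final int limit = 1024;   // capacity of the ring buffer
                    private int sz;                   // bytes currently stored
                    private boolean closed;

                    private int writable() { return limit - sz; }

                    void write(byte[] buf, int start, int len, long timeoutMillis)
                            throws IOException, InterruptedException {
                        long deadline = System.currentTimeMillis() + timeoutMillis;
                        while (len > 0) {
                            int chunk;
                            synchronized (lock) {
                                while ((chunk = Math.min(len, writable())) == 0) {
                                    if (closed)
                                        throw new IOException("closed during write operation");
                                    if (System.currentTimeMillis() >= deadline)
                                        throw new IOException("buffer full for too long; aborting write");
                                    lock.wait(100); // releases 'lock' while waiting, re-checks on wakeup
                                }
                                sz += chunk;        // stand-in for copying bytes from 'buf' into the buffer
                                lock.notifyAll();   // wake a reader waiting for data
                            }
                            start += chunk;
                            len -= chunk;
                        }
                    }

                    void close() {
                        synchronized (lock) {
                            closed = true;
                            lock.notifyAll();       // let any blocked writer observe 'closed' and bail out
                        }
                    }
                }

            In this sketch, close() sets the flag and notifies under the same lock, and a writer parked in lock.wait(100) (which releases the lock while waiting) observes it on the next wakeup instead of spinning forever.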


            stephenconnolly Stephen Connolly added a comment -

            Also seeing similar live-locks with the following thread holding the lock:

            "Computer.threadPoolForRemoting [#173620] : IO ID=7851 : seq#=7850" #2080996 daemon prio=5 os_prio=0 tid=0x0000000088306800 nid=0x15e0 in Object.wait() [0x00000000f209f000]
               java.lang.Thread.State: WAITING (on object monitor)
                at java.lang.Object.wait(Native Method)
                at java.lang.Object.wait(Object.java:502)
                at org.jenkinsci.remoting.nio.FifoBuffer.write(FifoBuffer.java:336)
                - locked <0x000000044d3408b8> (a org.jenkinsci.remoting.nio.FifoBuffer)
                at org.jenkinsci.remoting.nio.NioChannelHub$NioTransport.writeBlock(NioChannelHub.java:220)
                at hudson.remoting.AbstractByteArrayCommandTransport.write(AbstractByteArrayCommandTransport.java:83)
                at hudson.remoting.Channel.send(Channel.java:576)
                - locked <0x000000044d3407a0> (a hudson.remoting.Channel)
                at hudson.remoting.ProxyOutputStream$Chunk$1.run(ProxyOutputStream.java:260)
                at hudson.remoting.PipeWriter$1.run(PipeWriter.java:158)
                at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
                at java.util.concurrent.FutureTask.run(FutureTask.java:266)
                at hudson.remoting.SingleLaneExecutorService$1.run(SingleLaneExecutorService.java:112)
                at hudson.remoting.InterceptingExecutorService$1.call(InterceptingExecutorService.java:68)
                at jenkins.util.ContextResettingExecutorService$2.call(ContextResettingExecutorService.java:46)
                at java.util.concurrent.FutureTask.run(FutureTask.java:266)
                at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
                at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
                at java.lang.Thread.run(Thread.java:745)
            

            The NioChannelHub thread has the same stack trace (modulo line numbers)


            People

              Assignee: oleg_nenashev Oleg Nenashev
              Reporter: kohsuke Kohsuke Kawaguchi
              Votes: 7
              Watchers: 13
