  Jenkins / JENKINS-53569

Remoting deadlock observed after upgrading to 3.26


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Critical
    • Resolution: Fixed
    • Component/s: remoting, swarm-plugin
    • Labels: None
    • Environment: Server: Jenkins 2.138.1 LTS (Remoting 3.25)
      Client: Swarm Client 3.14 (Remoting 3.26)
    • Released As: Remoting 3.27, Jenkins 2.144

    Description

      After upgrading my Jenkins master and the Swarm Client to the latest stable versions, I am seeing a new deadlock on the Swarm Client side when trying to connect to the master.

      The relevant output from jstack:

      Found one Java-level deadlock:
      =============================
      "pool-1-thread-3":
        waiting to lock monitor 0x0000000000d12970 (object 0x0000000784a00fc0, a org.jenkinsci.remoting.protocol.impl.ConnectionHeadersFilterLayer),
        which is held by "Thread-2"
      "Thread-2":
        waiting for ownable synchronizer 0x0000000784a4ac68, (a java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync),
        which is held by "pool-1-thread-3"
      
      Java stack information for the threads listed above:
      ===================================================
      "pool-1-thread-3":
              at org.jenkinsci.remoting.protocol.FilterLayer.onRecvRemoved(FilterLayer.java:134)
              - waiting to lock <0x0000000784a00fc0> (a org.jenkinsci.remoting.protocol.impl.ConnectionHeadersFilterLayer)
              at org.jenkinsci.remoting.protocol.ProtocolStack$Ptr.getNextRecv(ProtocolStack.java:929)
              at org.jenkinsci.remoting.protocol.ProtocolStack$Ptr.onRecv(ProtocolStack.java:663)
              at org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer.processRead(SSLEngineFilterLayer.java:369)
              at org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer.onRecv(SSLEngineFilterLayer.java:117)
              at org.jenkinsci.remoting.protocol.ProtocolStack$Ptr.onRecv(ProtocolStack.java:669)
              at org.jenkinsci.remoting.protocol.NetworkLayer.onRead(NetworkLayer.java:136)
              at org.jenkinsci.remoting.protocol.impl.BIONetworkLayer.access$2200(BIONetworkLayer.java:48)
              at org.jenkinsci.remoting.protocol.impl.BIONetworkLayer$Reader.run(BIONetworkLayer.java:283)
              at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
              at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
              at hudson.remoting.Engine$1.lambda$newThread$0(Engine.java:93)
              at hudson.remoting.Engine$1$$Lambda$5/613009671.run(Unknown Source)
              at java.lang.Thread.run(Thread.java:748)
      "Thread-2":
              at sun.misc.Unsafe.park(Native Method)
              - parking to wait for  <0x0000000784a4ac68> (a java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync)
              at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
              at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
              at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireShared(AbstractQueuedSynchronizer.java:967)
              at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireShared(AbstractQueuedSynchronizer.java:1283)
              at java.util.concurrent.locks.ReentrantReadWriteLock$ReadLock.lock(ReentrantReadWriteLock.java:727)
              at org.jenkinsci.remoting.protocol.ProtocolStack$Ptr.remove(ProtocolStack.java:755)
              at org.jenkinsci.remoting.protocol.FilterLayer.completed(FilterLayer.java:108)
              - locked <0x0000000784a00fc0> (a org.jenkinsci.remoting.protocol.impl.ConnectionHeadersFilterLayer)
              at org.jenkinsci.remoting.protocol.impl.ConnectionHeadersFilterLayer.complete(ConnectionHeadersFilterLayer.java:363)
              at org.jenkinsci.remoting.protocol.impl.ConnectionHeadersFilterLayer.doSend(ConnectionHeadersFilterLayer.java:499)
              - locked <0x0000000784a00fc0> (a org.jenkinsci.remoting.protocol.impl.ConnectionHeadersFilterLayer)
              at org.jenkinsci.remoting.protocol.ProtocolStack$Ptr.doSend(ProtocolStack.java:692)
              at org.jenkinsci.remoting.protocol.ApplicationLayer.write(ApplicationLayer.java:157)
              at org.jenkinsci.remoting.protocol.impl.ChannelApplicationLayer.start(ChannelApplicationLayer.java:230)
              at org.jenkinsci.remoting.protocol.ProtocolStack.init(ProtocolStack.java:201)
              at org.jenkinsci.remoting.protocol.ProtocolStack.access$700(ProtocolStack.java:106)
              at org.jenkinsci.remoting.protocol.ProtocolStack$Builder.build(ProtocolStack.java:554)
              at org.jenkinsci.remoting.engine.JnlpProtocol4Handler.connect(JnlpProtocol4Handler.java:179)
              at org.jenkinsci.remoting.engine.JnlpProtocolHandler.connect(JnlpProtocolHandler.java:157)
              at hudson.remoting.Engine.innerRun(Engine.java:573)
              at hudson.remoting.Engine.run(Engine.java:474)
      
      Found 1 deadlock.
      

      After encountering this deadlock, the Swarm Client never finishes connecting to the master. The master is unable to use the Swarm Client as a node when it reaches this hung state.
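      The cycle in the trace is a classic inverted lock-acquisition order: "Thread-2" holds the ConnectionHeadersFilterLayer monitor and then wants the ProtocolStack read/write lock, while "pool-1-thread-3" holds that read/write lock and then wants the monitor. The following is a minimal standalone sketch of the same shape; the lock roles and names are illustrative stand-ins, not Remoting's actual code, and it polls the JVM's built-in detector, the same facility jstack uses.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class DeadlockSketch {

    /** Forces the inverted acquisition order and returns the deadlocked-thread count. */
    public static int run() throws InterruptedException {
        // Stand-ins for the two locks in the trace (names are illustrative only):
        // the filter-layer monitor and the protocol stack's read/write lock.
        final Object monitor = new Object();
        final ReentrantReadWriteLock stackLock = new ReentrantReadWriteLock();
        final CountDownLatch bothHoldFirstLock = new CountDownLatch(2);

        // Like "Thread-2": takes the layer monitor, then wants the stack lock.
        Thread t1 = new Thread(() -> {
            synchronized (monitor) {
                bothHoldFirstLock.countDown();
                awaitQuietly(bothHoldFirstLock);
                stackLock.readLock().lock();   // blocks forever: t2 holds the write lock
            }
        }, "Thread-2");

        // Like "pool-1-thread-3": takes the stack lock, then wants the layer monitor.
        Thread t2 = new Thread(() -> {
            stackLock.writeLock().lock();
            bothHoldFirstLock.countDown();
            awaitQuietly(bothHoldFirstLock);
            synchronized (monitor) {           // blocks forever: t1 holds the monitor
            }
        }, "pool-1-thread-3");

        t1.setDaemon(true);                    // daemon, so the stuck pair cannot pin the JVM
        t2.setDaemon(true);
        t1.start();
        t2.start();

        // Poll the JVM's deadlock detector; findDeadlockedThreads() covers cycles
        // that mix monitors and ownable synchronizers like ReentrantReadWriteLock.
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        long[] deadlocked = null;
        while (deadlocked == null) {
            Thread.sleep(100);
            deadlocked = mx.findDeadlockedThreads();
        }
        return deadlocked.length;
    }

    private static void awaitQuietly(CountDownLatch latch) {
        try {
            latch.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("Found " + run() + " deadlocked threads");
    }
}
```

      Because neither thread can release its first lock while waiting for its second, the detector always reports the pair, which mirrors why the Swarm Client can never finish connecting once it enters this state.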

    Attachments

    Issue Links

    Activity

            basil Basil Crow added a comment -

            I've attached the full jstack output, the Swarm Client's standard output, and FINEST-level Swarm Client log files to the bug. Note that, unlike a successful connection, "Connected" is never printed to standard out. Even though the deadlock happens in the Swarm Client, the stack trace implicates Remoting.

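            For reference, a dump like the one attached can be captured with something along these lines. This is only a sketch: the jps pattern "swarm-client" is an assumption about how the client was launched, and the output filename is arbitrary.

```shell
# Find the Swarm Client JVM; the "swarm-client" pattern assumes the client
# was started from swarm-client.jar (adjust to your launch command).
pid=$(jps -l 2>/dev/null | awk '/swarm-client/ {print $1; exit}')
if [ -n "$pid" ]; then
  jstack -l "$pid" > swarm-client-jstack.txt
  # jstack appends any "Found one Java-level deadlock" report at the end of the dump
  grep "Java-level deadlock" swarm-client-jstack.txt || echo "no deadlock reported"
else
  echo "no swarm-client JVM found"
fi
```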
            basil Basil Crow added a comment -

            I looked through the recent commits and didn't find anything remotely related to locking and thread notifications besides JENKINS-51841. Could that change be related to this issue?

            jthompson Jeff Thompson added a comment -

            basil, it is unlikely that any of the recent changes caused the behavior you are seeing. The one you reference shouldn't have caused this, as it was a refactoring and rearrangement to give the remoting-kafka plugin access to some pieces.

            I don't have any insight into what might be going on in your system. I'm not familiar with any other reports like this. I'll try to take a little deeper look at your report when I get the chance.

            jthompson Jeff Thompson added a comment -

            I haven't had a chance to examine your report any further, but I ran across something elsewhere and wondered if it might be similar to yours. From what I've read, JENKINS-42187 can possibly cause hangs relating to Docker and swarms. It sounds like your environment might be similar, so I thought I'd pass this along to see if it provides any help to you.

            jthompson Jeff Thompson added a comment -

            No, it doesn't look like that Docker issue has anything to do with it. I got a little time to take a look at this and yes, it's a regular old Java threading deadlock. I'm not yet certain of the sequence that causes this deadlock, or why it doesn't occur in other cases. I have an idea for a change, which may solve the problem and doesn't seem to cause any other problems covered by the automated tests. Unfortunately, as usual, those tests don't cover threading, locking, and deadlocking very well.

            jthompson Jeff Thompson added a comment -

            Released Remoting 3.27, which contains a fix to avoid this deadlock. The potential deadlock has been around for a while and wasn't specific to 3.26. Something may have tweaked the timing in some environments that made it occur more. This should go into a weekly release soon.

            basil Basil Crow added a comment -

            Thank you! I appreciate this.


            People

              Assignee: jthompson Jeff Thompson
              Reporter: basil Basil Crow
              Votes: 0
              Watchers: 3
