Jenkins / JENKINS-63520

Agent remoting deadlock after reboot


    Details

    • Type: Bug
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Component/s: remoting

      Description

      When we upgrade and reboot the Jenkins agents, sometimes they hang on startup. We have about 50 agents and we upgrade/reboot them twice a day. About 1/100 times an agent will get stuck on startup.

      On the Jenkins master, we see this error message from the hung agent's logs:

      ERROR: Connection terminated
      java.nio.channels.ClosedChannelException
      	at org.jenkinsci.remoting.protocol.NetworkLayer.onRecvClosed(NetworkLayer.java:154)
      	at org.jenkinsci.remoting.protocol.impl.NIONetworkLayer.ready(NIONetworkLayer.java:142)
      	at org.jenkinsci.remoting.protocol.IOHub$OnReady.run(IOHub.java:795)
      	at jenkins.util.ContextResettingExecutorService$1.run(ContextResettingExecutorService.java:28)
      	at jenkins.security.ImpersonatingExecutorService$1.run(ImpersonatingExecutorService.java:59)
      	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
      	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
      	at java.lang.Thread.run(Thread.java:748)
      

      From the hung agent, the attached jstack thread dump reports a deadlock: two threads are each waiting on a lock the other holds. After hitting this deadlock, the agent never finishes connecting to the master, and the master is unable to use the agent as a node while it is in this hung state.

      Could the difference in Java versions contribute to this problem? The master runs 1.8.0_252-8u252-b09-1~18.04-b09, whereas the agents run 1.8.0_265-8u265-b01-0.

        Attachments

          Activity

          docwhat Christian Höltje added a comment -

          The agents that are broken have this as their log:

          Aug 27, 2020 8:00:02 AM hudson.remoting.jnlp.Main$CuiListener status
          INFO: Agent discovery successful
            Agent address: jenkins.example.com
            Agent port:    33123
            Identity:      xx:xx:xx:xx:xx:xx:xx:xx:xx:xx:xx:xx:xx:xx:xx:xx
          Aug 27, 2020 8:00:02 AM hudson.remoting.jnlp.Main$CuiListener status
          INFO: Handshaking
          Aug 27, 2020 8:00:02 AM hudson.remoting.jnlp.Main$CuiListener status
          INFO: Connecting to jenkins.example.com:33123
          Aug 27, 2020 8:00:02 AM hudson.remoting.jnlp.Main$CuiListener status
          INFO: Trying protocol: JNLP4-connect
          Aug 27, 2020 8:00:02 AM hudson.remoting.jnlp.Main$CuiListener status
          INFO: Remote identity confirmed: xx:xx:xx:xx:xx:xx:xx:xx:xx:xx:xx:xx:xx:xx:xx:xx
          

          Note how it just stops after remote identity is confirmed.

          Agents that are NOT hung continue on with this:

          Aug 27, 2020 8:00:03 AM hudson.remoting.jnlp.Main$CuiListener status
          INFO: Connected
          
          anhuong Anh Uong added a comment -

          Command we used to get the jstack dump to find deadlock: sudo -u <user> -H jstack -l <pid>
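As a complement to jstack -l, the same deadlock detection is exposed programmatically through java.lang.management, so a watchdog running inside the JVM could check for it directly. A minimal sketch — the class name DeadlockCheck is mine, not part of Jenkins or remoting:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

public class DeadlockCheck {
    public static void main(String[] args) {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        // Covers both monitor deadlocks and ownable-synchronizer
        // (java.util.concurrent lock) deadlocks; returns null if none exist.
        long[] ids = mx.findDeadlockedThreads();
        if (ids == null) {
            System.out.println("no deadlock");
            return;
        }
        // Report each deadlocked thread with its held monitors and synchronizers.
        for (ThreadInfo info : mx.getThreadInfo(ids, true, true)) {
            System.out.println(info);
        }
    }
}
```

In a healthy JVM this prints "no deadlock"; on an agent stuck in the state described above, it would list the two blocked threads much like the jstack output.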

          anhuong Anh Uong added a comment -

          We are still seeing this issue on our production server. Note that we have upgraded our Java versions and they are now the same on the agent and the master, so that is not the problem.

          We dug around in the source code and are wondering how it is possible that this.nextRecv.layer could be the same object as this.nextSend.layer.

          That is the only way the synchronized lock in the FilterLayer could be the same one (0x000000076eb4fb58).

          Found one Java-level deadlock:
          =============================
          "pool-1-thread-3":
            waiting to lock monitor 0x00007f4a68003738 (object 0x000000076eb4fb58, a org.jenkinsci.remoting.protocol.impl.ConnectionHeadersFilterLayer),
            which is held by "Thread-1"
          "Thread-1":
            waiting for ownable synchronizer 0x000000076eb7d9e0, (a java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync),
            which is held by "pool-1-thread-3"
          
          Java stack information for the threads listed above:
          ===================================================
          "pool-1-thread-3":
          	at org.jenkinsci.remoting.protocol.FilterLayer.onRecvRemoved(FilterLayer.java:134)
          	- waiting to lock <0x000000076eb4fb58> (a org.jenkinsci.remoting.protocol.impl.ConnectionHeadersFilterLayer)
          	at org.jenkinsci.remoting.protocol.ProtocolStack$Ptr.getNextRecv(ProtocolStack.java:913)
          	at org.jenkinsci.remoting.protocol.ProtocolStack$Ptr.onRecv(ProtocolStack.java:662)
          	at org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer.processRead(SSLEngineFilterLayer.java:369)
          	at org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer.onRecv(SSLEngineFilterLayer.java:117)
          	at org.jenkinsci.remoting.protocol.ProtocolStack$Ptr.onRecv(ProtocolStack.java:668)
          	at org.jenkinsci.remoting.protocol.NetworkLayer.onRead(NetworkLayer.java:136)
          	at org.jenkinsci.remoting.protocol.impl.BIONetworkLayer.access$2200(BIONetworkLayer.java:48)
          	at org.jenkinsci.remoting.protocol.impl.BIONetworkLayer$Reader.run(BIONetworkLayer.java:283)
          	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
          	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
          	at hudson.remoting.Engine$1.lambda$newThread$0(Engine.java:117)
          	at hudson.remoting.Engine$1$$Lambda$7/869651373.run(Unknown Source)
          	at java.lang.Thread.run(Thread.java:748)
          "Thread-1":
          	at sun.misc.Unsafe.park(Native Method)
          	- parking to wait for  <0x000000076eb7d9e0> (a java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync)
          	at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
          	at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
          	at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireShared(AbstractQueuedSynchronizer.java:967)
          	at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireShared(AbstractQueuedSynchronizer.java:1283)
          	at java.util.concurrent.locks.ReentrantReadWriteLock$ReadLock.lock(ReentrantReadWriteLock.java:727)
          	at org.jenkinsci.remoting.protocol.ProtocolStack$Ptr.getNextSend(ProtocolStack.java:841)
          	at org.jenkinsci.remoting.protocol.ProtocolStack$Ptr.doSend(ProtocolStack.java:685)
          	at org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer.processWrite(SSLEngineFilterLayer.java:518)
          	at org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer.processQueuedWrites(SSLEngineFilterLayer.java:248)
          	- locked <0x000000076eb40348> (a java.lang.Object)
          	at org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer.doSend(SSLEngineFilterLayer.java:200)
          	at org.jenkinsci.remoting.protocol.ProtocolStack$Ptr.doSend(ProtocolStack.java:691)
          	at org.jenkinsci.remoting.protocol.impl.ConnectionHeadersFilterLayer.doSend(ConnectionHeadersFilterLayer.java:497)
          	- locked <0x000000076eb4fb58> (a org.jenkinsci.remoting.protocol.impl.ConnectionHeadersFilterLayer)
          	at org.jenkinsci.remoting.protocol.ProtocolStack$Ptr.doSend(ProtocolStack.java:691)
          	at org.jenkinsci.remoting.protocol.ApplicationLayer.write(ApplicationLayer.java:156)
          	at org.jenkinsci.remoting.protocol.impl.ChannelApplicationLayer.start(ChannelApplicationLayer.java:230)
          	at org.jenkinsci.remoting.protocol.ProtocolStack.init(ProtocolStack.java:201)
          	at org.jenkinsci.remoting.protocol.ProtocolStack.access$700(ProtocolStack.java:106)
          	at org.jenkinsci.remoting.protocol.ProtocolStack$Builder.build(ProtocolStack.java:554)
          	at org.jenkinsci.remoting.engine.JnlpProtocol4Handler.connect(JnlpProtocol4Handler.java:181)
          	at org.jenkinsci.remoting.engine.JnlpProtocolHandler.connect(JnlpProtocolHandler.java:157)
          	at hudson.remoting.Engine.innerRun(Engine.java:743)
          	at hudson.remoting.Engine.run(Engine.java:518)
          
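The cycle in the dump above can be reproduced in isolation: one thread takes an object monitor and then waits for a ReentrantReadWriteLock (as "Thread-1" does via ConnectionHeadersFilterLayer.doSend), while another thread holds that lock and waits for the monitor (as "pool-1-thread-3" does in FilterLayer.onRecvRemoved). A minimal sketch, with all names (DeadlockSketch, filterMonitor, stackLock) my own and not taken from the remoting code:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class DeadlockSketch {
    public static void main(String[] args) throws Exception {
        final Object filterMonitor = new Object();                    // stand-in for the filter layer's monitor
        final ReentrantReadWriteLock stackLock = new ReentrantReadWriteLock(); // stand-in for the stack's lock
        final CountDownLatch bothHold = new CountDownLatch(2);        // ensures each thread holds its first lock

        // Like "Thread-1": holds the filter's monitor, then wants the stack's read lock.
        Thread sender = new Thread(() -> {
            synchronized (filterMonitor) {
                bothHold.countDown();
                await(bothHold);
                stackLock.readLock().lock();   // blocks forever: write lock is held
            }
        });
        // Like "pool-1-thread-3": holds the stack's write lock, then wants the filter's monitor.
        Thread receiver = new Thread(() -> {
            stackLock.writeLock().lock();
            bothHold.countDown();
            await(bothHold);
            synchronized (filterMonitor) { }   // blocks forever: monitor is held
        });
        sender.setDaemon(true);                // daemon threads so the JVM can still exit
        receiver.setDaemon(true);
        sender.start();
        receiver.start();

        Thread.sleep(500);                     // give both threads time to block on their second lock
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        long[] ids = mx.findDeadlockedThreads();  // detects monitor + ownable-synchronizer cycles
        System.out.println(ids != null ? "deadlock detected: " + ids.length + " threads" : "no deadlock");
    }

    private static void await(CountDownLatch l) {
        try { l.await(); } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
    }
}
```

Running this prints "deadlock detected: 2 threads", matching the shape of the dump: a monitor held against a ReentrantReadWriteLock acquired in the opposite order.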
          ulrich_post Uli Post added a comment -

          We see this very same issue (identical deadlock reported in the Java thread dump) every few days on Windows slaves that start automatically after reboot.
          Is there any chance of a fix for this issue?

          jthompson Jeff Thompson added a comment -

          Do you only ever see this at startup? At the exact same point as initially described?

          It might be possible to address this as a startup sequencing issue.

          ulrich_post Uli Post added a comment - - edited

          We see this issue solely at startup (after reboot).
          It is evidently subject to a race condition that occurs only sporadically.

          jthompson Jeff Thompson added a comment -

          I've got a PR up that might resolve this issue. See https://github.com/jenkinsci/remoting/pull/445

          It's still building, but will hopefully complete eventually. If you can provide some testing, that would help motivate a release with this change.

          jthompson Jeff Thompson added a comment -

          There's an incremental build available at https://repo.jenkins-ci.org/incrementals/org/jenkins-ci/main/remoting/4.8-rc2905.dc979cc66ac0/
          ulrich_post Uli Post added a comment -

          Many thanks, Jeff!
          I will check how we can roll that out to all our Windows slaves and test from now on.
          It will probably take a few weeks before we can state that your fix has resolved this issue in a stable fashion.
          Cheers, Uli

          jthompson Jeff Thompson added a comment -

          It would be best to try it first in a test environment. Given what we know of the issue so far, the fix makes sense. I have a limited testing environment and haven't reproduced the issue there, but everything continued to work fine in my tests.

          ulrich_post Uli Post added a comment -

          The fix candidate is now rolled out to a set of 7 test machines, with no problems so far.
          After a few weeks without any deadlock, we will have high confidence that this issue is resolved.

          jthompson Jeff Thompson added a comment -

          Uli Post, what have you observed? Is the proposed fix working well for you?

          ulrich_post Uli Post added a comment -

          Hi Jeff, we have not seen this issue during the last 2 weeks.

          bhartshorn Brandon added a comment -

          Hi Jeff & Uli,

          Thanks for the fix and testing. I'm part of the team that reported the issue, but we don't have the bandwidth to test a custom build at the moment. Is it looking like this patch will make it into a release soon?

          jthompson Jeff Thompson added a comment -

          It's expected to be in the next release (4.8) as soon as we can get the release out.

          bhartshorn Brandon added a comment -

          I see 4.8 made it into weekly ~2 weeks ago, and this issue has the lts-candidate label. I also see that LTS 2.289.1 is due in less than a week, so obviously won't include this. Can I put in a request that it be backported soon? Again, many thanks!


            People

            Assignee:
            jthompson Jeff Thompson
            Reporter:
            anhuong Anh Uong
             Votes:
             2
             Watchers:
             5
