Jenkins / JENKINS-63520

Agent remoting deadlock after reboot


Details

    • Type: Bug
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Component: remoting

    Description

      When we upgrade and reboot the Jenkins agents, they sometimes hang on startup. We have about 50 agents and we upgrade/reboot them twice a day. About 1 in 100 times, an agent gets stuck on startup.

      On the Jenkins master, we see this error message in the hung agent's log:

      ERROR: Connection terminated
      java.nio.channels.ClosedChannelException
      	at org.jenkinsci.remoting.protocol.NetworkLayer.onRecvClosed(NetworkLayer.java:154)
      	at org.jenkinsci.remoting.protocol.impl.NIONetworkLayer.ready(NIONetworkLayer.java:142)
      	at org.jenkinsci.remoting.protocol.IOHub$OnReady.run(IOHub.java:795)
      	at jenkins.util.ContextResettingExecutorService$1.run(ContextResettingExecutorService.java:28)
      	at jenkins.security.ImpersonatingExecutorService$1.run(ImpersonatingExecutorService.java:59)
      	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
      	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
      	at java.lang.Thread.run(Thread.java:748)
      

      On the hung agent, we captured the attached jstack thread dump, which reports a deadlock: two threads are each waiting on a lock held by the other. After hitting this deadlock, the agent never finishes connecting to the master, and the master cannot use the agent as a node while it is in this hung state.

      Could the difference in Java versions contribute to this problem? The master has version 1.8.0_252-8u252-b09-1~18.04-b09, whereas the agents have Java version 1.8.0_265-8u265-b01-0.

      Attachments

        Activity

          docwhat Christian Höltje added a comment -

          The agents that are broken have this as their log:

          Aug 27, 2020 8:00:02 AM hudson.remoting.jnlp.Main$CuiListener status
          INFO: Agent discovery successful
            Agent address: jenkins.example.com
            Agent port:    33123
            Identity:      xx:xx:xx:xx:xx:xx:xx:xx:xx:xx:xx:xx:xx:xx:xx:xx
          Aug 27, 2020 8:00:02 AM hudson.remoting.jnlp.Main$CuiListener status
          INFO: Handshaking
          Aug 27, 2020 8:00:02 AM hudson.remoting.jnlp.Main$CuiListener status
          INFO: Connecting to jenkins.example.com:33123
          Aug 27, 2020 8:00:02 AM hudson.remoting.jnlp.Main$CuiListener status
          INFO: Trying protocol: JNLP4-connect
          Aug 27, 2020 8:00:02 AM hudson.remoting.jnlp.Main$CuiListener status
          INFO: Remote identity confirmed: xx:xx:xx:xx:xx:xx:xx:xx:xx:xx:xx:xx:xx:xx:xx:xx
          

          Note how it just stops after the remote identity is confirmed.

          Agents that are NOT hung continue on with this:

          Aug 27, 2020 8:00:03 AM hudson.remoting.jnlp.Main$CuiListener status
          INFO: Connected
          
          anhuong Anh Uong added a comment -

          The command we used to get the jstack dump and find the deadlock: sudo -u <user> -H jstack -l <pid>
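
          As an aside (not from the original report): a similar Java-level deadlock report can also be produced from inside a JVM using the standard ThreadMXBean API. The sketch below only inspects the process it runs in, so it is a convenience for instrumented test builds rather than a replacement for running jstack against the agent PID; the class name is made up.

          import java.lang.management.ManagementFactory;
          import java.lang.management.ThreadInfo;
          import java.lang.management.ThreadMXBean;

          // Prints any Java-level deadlock in the current JVM, roughly the same
          // information "jstack -l <pid>" shows in its "Found one Java-level deadlock" section.
          public class DeadlockCheck {
              public static void main(String[] args) {
                  ThreadMXBean threads = ManagementFactory.getThreadMXBean();
                  long[] ids = threads.findDeadlockedThreads(); // covers monitors and ownable synchronizers
                  if (ids == null) {
                      System.out.println("No deadlocked threads found.");
                      return;
                  }
                  // Dump name, state, held locks, and stack trace for each deadlocked thread.
                  for (ThreadInfo info : threads.getThreadInfo(ids, true, true)) {
                      System.out.println(info);
                  }
              }
          }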

          anhuong Anh Uong added a comment -

          We are still seeing this issue on our production server. Note that we have since upgraded our Java version, so the agent and master now run the same version; that is not the problem.

          We dug around in the source code and are wondering how it is possible for this.nextRecv.layer to be the same object as this.nextSend.layer.

          That is the only way the synchronized lock in the FilterLayer could be the same monitor (0x000000076eb4fb58).

          Found one Java-level deadlock:
          =============================
          "pool-1-thread-3":
            waiting to lock monitor 0x00007f4a68003738 (object 0x000000076eb4fb58, a org.jenkinsci.remoting.protocol.impl.ConnectionHeadersFilterLayer),
            which is held by "Thread-1"
          "Thread-1":
            waiting for ownable synchronizer 0x000000076eb7d9e0, (a java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync),
            which is held by "pool-1-thread-3"
          
          Java stack information for the threads listed above:
          ===================================================
          "pool-1-thread-3":
          	at org.jenkinsci.remoting.protocol.FilterLayer.onRecvRemoved(FilterLayer.java:134)
          	- waiting to lock <0x000000076eb4fb58> (a org.jenkinsci.remoting.protocol.impl.ConnectionHeadersFilterLayer)
          	at org.jenkinsci.remoting.protocol.ProtocolStack$Ptr.getNextRecv(ProtocolStack.java:913)
          	at org.jenkinsci.remoting.protocol.ProtocolStack$Ptr.onRecv(ProtocolStack.java:662)
          	at org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer.processRead(SSLEngineFilterLayer.java:369)
          	at org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer.onRecv(SSLEngineFilterLayer.java:117)
          	at org.jenkinsci.remoting.protocol.ProtocolStack$Ptr.onRecv(ProtocolStack.java:668)
          	at org.jenkinsci.remoting.protocol.NetworkLayer.onRead(NetworkLayer.java:136)
          	at org.jenkinsci.remoting.protocol.impl.BIONetworkLayer.access$2200(BIONetworkLayer.java:48)
          	at org.jenkinsci.remoting.protocol.impl.BIONetworkLayer$Reader.run(BIONetworkLayer.java:283)
          	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
          	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
          	at hudson.remoting.Engine$1.lambda$newThread$0(Engine.java:117)
          	at hudson.remoting.Engine$1$$Lambda$7/869651373.run(Unknown Source)
          	at java.lang.Thread.run(Thread.java:748)
          "Thread-1":
          	at sun.misc.Unsafe.park(Native Method)
          	- parking to wait for  <0x000000076eb7d9e0> (a java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync)
          	at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
          	at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
          	at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireShared(AbstractQueuedSynchronizer.java:967)
          	at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireShared(AbstractQueuedSynchronizer.java:1283)
          	at java.util.concurrent.locks.ReentrantReadWriteLock$ReadLock.lock(ReentrantReadWriteLock.java:727)
          	at org.jenkinsci.remoting.protocol.ProtocolStack$Ptr.getNextSend(ProtocolStack.java:841)
          	at org.jenkinsci.remoting.protocol.ProtocolStack$Ptr.doSend(ProtocolStack.java:685)
          	at org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer.processWrite(SSLEngineFilterLayer.java:518)
          	at org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer.processQueuedWrites(SSLEngineFilterLayer.java:248)
          	- locked <0x000000076eb40348> (a java.lang.Object)
          	at org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer.doSend(SSLEngineFilterLayer.java:200)
          	at org.jenkinsci.remoting.protocol.ProtocolStack$Ptr.doSend(ProtocolStack.java:691)
          	at org.jenkinsci.remoting.protocol.impl.ConnectionHeadersFilterLayer.doSend(ConnectionHeadersFilterLayer.java:497)
          	- locked <0x000000076eb4fb58> (a org.jenkinsci.remoting.protocol.impl.ConnectionHeadersFilterLayer)
          	at org.jenkinsci.remoting.protocol.ProtocolStack$Ptr.doSend(ProtocolStack.java:691)
          	at org.jenkinsci.remoting.protocol.ApplicationLayer.write(ApplicationLayer.java:156)
          	at org.jenkinsci.remoting.protocol.impl.ChannelApplicationLayer.start(ChannelApplicationLayer.java:230)
          	at org.jenkinsci.remoting.protocol.ProtocolStack.init(ProtocolStack.java:201)
          	at org.jenkinsci.remoting.protocol.ProtocolStack.access$700(ProtocolStack.java:106)
          	at org.jenkinsci.remoting.protocol.ProtocolStack$Builder.build(ProtocolStack.java:554)
          	at org.jenkinsci.remoting.engine.JnlpProtocol4Handler.connect(JnlpProtocol4Handler.java:181)
          	at org.jenkinsci.remoting.engine.JnlpProtocolHandler.connect(JnlpProtocolHandler.java:157)
          	at hudson.remoting.Engine.innerRun(Engine.java:743)
          	at hudson.remoting.Engine.run(Engine.java:518)
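
          To make the cycle concrete, here is a minimal, self-contained Java sketch (not remoting code; the class and variable names are invented) that reproduces the same shape of deadlock the dump reports: one thread takes a ReentrantReadWriteLock and then synchronizes on a filter-layer object, while the other thread holds that object's monitor and then asks for the read lock. Running it hangs by design, and jstack -l reports the same kind of deadlock.

          import java.util.concurrent.CountDownLatch;
          import java.util.concurrent.locks.ReentrantReadWriteLock;

          public class LockOrderDeadlockSketch {
              // Stand-ins for the two locks in the dump above (names are hypothetical):
              private static final Object filterLayerMonitor = new Object();                        // ~ ConnectionHeadersFilterLayer monitor
              private static final ReentrantReadWriteLock stackLock = new ReentrantReadWriteLock(); // ~ ProtocolStack's read/write lock

              public static void main(String[] args) {
                  CountDownLatch bothLocked = new CountDownLatch(2);

                  Thread recvThread = new Thread(() -> {       // plays the role of "pool-1-thread-3"
                      stackLock.writeLock().lock();            // holds the stack lock first...
                      try {
                          bothLocked.countDown();
                          await(bothLocked);
                          synchronized (filterLayerMonitor) {  // ...then wants the layer monitor -> blocks
                              System.out.println("recv acquired both locks");
                          }
                      } finally {
                          stackLock.writeLock().unlock();
                      }
                  }, "recv-thread");

                  Thread sendThread = new Thread(() -> {       // plays the role of "Thread-1"
                      synchronized (filterLayerMonitor) {      // holds the layer monitor first...
                          bothLocked.countDown();
                          await(bothLocked);
                          stackLock.readLock().lock();         // ...then wants the read lock -> blocks
                          try {
                              System.out.println("send acquired both locks");
                          } finally {
                              stackLock.readLock().unlock();
                          }
                      }
                  }, "send-thread");

                  recvThread.start();
                  sendThread.start();
              }

              private static void await(CountDownLatch latch) {
                  try {
                      latch.await();
                  } catch (InterruptedException e) {
                      Thread.currentThread().interrupt();
                  }
              }
          }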
          
          ulrich_post Uli Post added a comment -

          We see this very same issue (an identical deadlock reported in the Java thread dump) every few days on Windows slaves that start automatically after a reboot.
          Is there any chance of a fix for this issue?

          jthompson Jeff Thompson added a comment -

          Do you only ever see this at startup? At the exact same point as initially described?

          It might be possible to address this as a startup sequencing issue.

          ulrich_post Uli Post added a comment - - edited

          We see this issue solely at startup (after a reboot).
          It is obviously subject to a race condition that occurs only sporadically.

          jthompson Jeff Thompson added a comment -

          I've got a PR up that might resolve this issue. See https://github.com/jenkinsci/remoting/pull/445

          It's still building, but will hopefully complete eventually. If you can provide some testing, that would help motivate a release with this change.

          jthompson Jeff Thompson added a comment -

          There's an incremental build available at https://repo.jenkins-ci.org/incrementals/org/jenkins-ci/main/remoting/4.8-rc2905.dc979cc66ac0/
          ulrich_post Uli Post added a comment -

          Many thanks, Jeff!
          I will check how we can roll that out to all our Windows slaves and start testing from now on.
          It will probably take a few weeks before we can say with confidence that your fix has resolved this issue in a stable fashion.
          Cheers, Uli

          jthompson Jeff Thompson added a comment -

          It would be best if you could try it in a test environment first. Given what we know of the issue so far, the fix makes sense. I have a limited testing environment and have never seen the issue there myself, but everything continued to work fine in my tests with the change.

          ulrich_post Uli Post added a comment -

          The fix candidate is now rolled out to a set of 7 test machines, with no problems so far.
          After a few weeks without any deadlock, we will have high confidence that this issue is resolved.

          jthompson Jeff Thompson added a comment -

          ulrich_post, what have you observed? Is the proposed fix working well for you?

          ulrich_post Uli Post added a comment -

          Hi Jeff, we have not seen this issue during the last 2 weeks.

          bhartshorn Brandon added a comment -

          Hi Jeff & Uli,

          Thanks for the fix and testing. I'm part of the team that reported the issue, but we don't have the bandwidth to test a custom build at the moment. Does it look like this patch will make it into a release soon?

          jthompson Jeff Thompson added a comment -

          It's expected to be in the next release (4.8) as soon as we can get the release out.

          bhartshorn Brandon added a comment -

          I see 4.8 made it into the weekly release about 2 weeks ago, and this issue has the lts-candidate label. I also see that LTS 2.289.1 is due in less than a week, so it obviously won't include this. Can I put in a request that it be backported soon? Again, many thanks!


          People

            jthompson Jeff Thompson
            anhuong Anh Uong
            Votes: 2
            Watchers: 5
