Jenkins / JENKINS-63520

Agent remoting deadlock after reboot


    Details

    • Type: Bug
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Component/s: remoting
    • Labels: None

      Description

      When we upgrade and reboot our Jenkins agents, they sometimes hang on startup. We have about 50 agents and upgrade/reboot them twice a day; roughly 1 in 100 restarts leaves an agent stuck on startup.

      On the Jenkins master, the hung agent's log shows this error:

      ERROR: Connection terminated
      java.nio.channels.ClosedChannelException
      	at org.jenkinsci.remoting.protocol.NetworkLayer.onRecvClosed(NetworkLayer.java:154)
      	at org.jenkinsci.remoting.protocol.impl.NIONetworkLayer.ready(NIONetworkLayer.java:142)
      	at org.jenkinsci.remoting.protocol.IOHub$OnReady.run(IOHub.java:795)
      	at jenkins.util.ContextResettingExecutorService$1.run(ContextResettingExecutorService.java:28)
      	at jenkins.security.ImpersonatingExecutorService$1.run(ImpersonatingExecutorService.java:59)
      	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
      	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
      	at java.lang.Thread.run(Thread.java:748)
      

      From the hung agent, we captured the attached jstack thread dump, which reports a deadlock: it looks like two threads are waiting on each other. After hitting this deadlock, the agent never finishes connecting to the master, and the master cannot use the agent as a node.
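
      For reference, the attached dump boils down to a lock-ordering inversion between an object monitor and a ReentrantReadWriteLock: one thread holds the monitor and waits for the lock, while the other holds the lock and waits for the monitor. Below is a minimal, self-contained Java sketch of that pattern only; the class and field names are hypothetical stand-ins, not the actual remoting code.

      import java.util.concurrent.locks.ReentrantReadWriteLock;

      /**
       * Hypothetical sketch of the lock-ordering inversion reported in the dump.
       * It only reproduces the lock pattern, not the remoting implementation.
       */
      public class LockOrderInversion {
          // Stand-in for the protocol stack's ReentrantReadWriteLock.
          static final ReentrantReadWriteLock stackLock = new ReentrantReadWriteLock();
          // Stand-in for the filter layer whose intrinsic monitor appears in the dump.
          static final Object layerMonitor = new Object();

          public static void main(String[] args) {
              // Role of "Thread-1": holds the layer monitor, then needs the stack lock.
              Thread sender = new Thread(() -> {
                  synchronized (layerMonitor) {
                      sleep(200); // widen the race so both threads commit to their first lock
                      stackLock.readLock().lock(); // blocks while the write lock is held
                      try { /* send */ } finally { stackLock.readLock().unlock(); }
                  }
              }, "sender");

              // Role of "pool-1-thread-3": holds the stack lock, then needs the layer monitor.
              Thread receiver = new Thread(() -> {
                  stackLock.writeLock().lock();
                  try {
                      sleep(200);
                      synchronized (layerMonitor) { /* remove the recv side */ }
                  } finally { stackLock.writeLock().unlock(); }
              }, "receiver");

              sender.start();
              receiver.start();
              // Both threads now usually end up waiting on each other, and jstack
              // reports "Found one Java-level deadlock".
          }

          static void sleep(long ms) {
              try { Thread.sleep(ms); } catch (InterruptedException ignored) { }
          }
      }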

      Could the difference in Java versions contribute to this problem? The master runs 1.8.0_252-8u252-b09-1~18.04-b09 whereas the agents run 1.8.0_265-8u265-b01-0.

        Attachments

          Activity

          docwhat Christian Höltje added a comment -

          The agents that are broken have this as their log:

          Aug 27, 2020 8:00:02 AM hudson.remoting.jnlp.Main$CuiListener status
          INFO: Agent discovery successful
            Agent address: jenkins.example.com
            Agent port:    33123
            Identity:      xx:xx:xx:xx:xx:xx:xx:xx:xx:xx:xx:xx:xx:xx:xx:xx
          Aug 27, 2020 8:00:02 AM hudson.remoting.jnlp.Main$CuiListener status
          INFO: Handshaking
          Aug 27, 2020 8:00:02 AM hudson.remoting.jnlp.Main$CuiListener status
          INFO: Connecting to jenkins.example.com:33123
          Aug 27, 2020 8:00:02 AM hudson.remoting.jnlp.Main$CuiListener status
          INFO: Trying protocol: JNLP4-connect
          Aug 27, 2020 8:00:02 AM hudson.remoting.jnlp.Main$CuiListener status
          INFO: Remote identity confirmed: xx:xx:xx:xx:xx:xx:xx:xx:xx:xx:xx:xx:xx:xx:xx:xx
          

          Note how it just stops after remote identity is confirmed.

          Agents that are NOT hung continue on with this:

          Aug 27, 2020 8:00:03 AM hudson.remoting.jnlp.Main$CuiListener status
          INFO: Connected
          
          anhuong Anh Uong added a comment -

          Command we used to get the jstack dump to find deadlock: sudo -u <user> -H jstack -l <pid>
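
          As a complementary check (not something the agent does today), the JDK's ThreadMXBean can report the same kind of deadlock from inside the JVM. A minimal sketch, assuming it runs in the affected agent's JVM:

          import java.lang.management.ManagementFactory;
          import java.lang.management.ThreadInfo;
          import java.lang.management.ThreadMXBean;

          // Minimal sketch: asks the running JVM for deadlocked threads (covers
          // both monitors and ownable synchronizers) and prints their lock details.
          public class DeadlockProbe {
              public static void main(String[] args) {
                  ThreadMXBean mx = ManagementFactory.getThreadMXBean();
                  long[] ids = mx.findDeadlockedThreads();
                  if (ids == null) {
                      System.out.println("No deadlock detected");
                      return;
                  }
                  // true, true = include locked monitors and locked synchronizers
                  for (ThreadInfo info : mx.getThreadInfo(ids, true, true)) {
                      System.out.print(info);
                  }
              }
          }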

          anhuong Anh Uong added a comment -

          We are still seeing this issue on our production server. Note that we have since upgraded Java, so the agent and master now run the same version; the version difference is not the problem.

          We dug around in the source code and are wondering how it is possible that this.nextRecv.layer could be the same object as this.nextSend.layer.

          That is the only way the synchronized lock in the FilterLayer could be the same monitor (0x000000076eb4fb58).
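
          For reference, a synchronized block locks the monitor of one specific object instance, so two frames can only contend on the same monitor address (the 0x000000076eb4fb58 above) if they synchronize on the very same object. A trivial standalone illustration with hypothetical names:

          public class MonitorIdentity {
              public static void main(String[] args) throws InterruptedException {
                  Object a = new Object();
                  Object b = new Object();

                  synchronized (a) {
                      // A thread locking a different object is never blocked here...
                      Thread t = new Thread(() -> { synchronized (b) { } });
                      t.start();
                      t.join(); // returns immediately
                  }
                  // ...contention (and an identical monitor address in a jstack dump)
                  // only occurs when both threads synchronize on the same instance.
              }
          }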

          Found one Java-level deadlock:
          =============================
          "pool-1-thread-3":
            waiting to lock monitor 0x00007f4a68003738 (object 0x000000076eb4fb58, a org.jenkinsci.remoting.protocol.impl.ConnectionHeadersFilterLayer),
            which is held by "Thread-1"
          "Thread-1":
            waiting for ownable synchronizer 0x000000076eb7d9e0, (a java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync),
            which is held by "pool-1-thread-3"
          
          Java stack information for the threads listed above:
          ===================================================
          "pool-1-thread-3":
          	at org.jenkinsci.remoting.protocol.FilterLayer.onRecvRemoved(FilterLayer.java:134)
          	- waiting to lock <0x000000076eb4fb58> (a org.jenkinsci.remoting.protocol.impl.ConnectionHeadersFilterLayer)
          	at org.jenkinsci.remoting.protocol.ProtocolStack$Ptr.getNextRecv(ProtocolStack.java:913)
          	at org.jenkinsci.remoting.protocol.ProtocolStack$Ptr.onRecv(ProtocolStack.java:662)
          	at org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer.processRead(SSLEngineFilterLayer.java:369)
          	at org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer.onRecv(SSLEngineFilterLayer.java:117)
          	at org.jenkinsci.remoting.protocol.ProtocolStack$Ptr.onRecv(ProtocolStack.java:668)
          	at org.jenkinsci.remoting.protocol.NetworkLayer.onRead(NetworkLayer.java:136)
          	at org.jenkinsci.remoting.protocol.impl.BIONetworkLayer.access$2200(BIONetworkLayer.java:48)
          	at org.jenkinsci.remoting.protocol.impl.BIONetworkLayer$Reader.run(BIONetworkLayer.java:283)
          	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
          	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
          	at hudson.remoting.Engine$1.lambda$newThread$0(Engine.java:117)
          	at hudson.remoting.Engine$1$$Lambda$7/869651373.run(Unknown Source)
          	at java.lang.Thread.run(Thread.java:748)
          "Thread-1":
          	at sun.misc.Unsafe.park(Native Method)
          	- parking to wait for  <0x000000076eb7d9e0> (a java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync)
          	at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
          	at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
          	at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireShared(AbstractQueuedSynchronizer.java:967)
          	at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireShared(AbstractQueuedSynchronizer.java:1283)
          	at java.util.concurrent.locks.ReentrantReadWriteLock$ReadLock.lock(ReentrantReadWriteLock.java:727)
          	at org.jenkinsci.remoting.protocol.ProtocolStack$Ptr.getNextSend(ProtocolStack.java:841)
          	at org.jenkinsci.remoting.protocol.ProtocolStack$Ptr.doSend(ProtocolStack.java:685)
          	at org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer.processWrite(SSLEngineFilterLayer.java:518)
          	at org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer.processQueuedWrites(SSLEngineFilterLayer.java:248)
          	- locked <0x000000076eb40348> (a java.lang.Object)
          	at org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer.doSend(SSLEngineFilterLayer.java:200)
          	at org.jenkinsci.remoting.protocol.ProtocolStack$Ptr.doSend(ProtocolStack.java:691)
          	at org.jenkinsci.remoting.protocol.impl.ConnectionHeadersFilterLayer.doSend(ConnectionHeadersFilterLayer.java:497)
          	- locked <0x000000076eb4fb58> (a org.jenkinsci.remoting.protocol.impl.ConnectionHeadersFilterLayer)
          	at org.jenkinsci.remoting.protocol.ProtocolStack$Ptr.doSend(ProtocolStack.java:691)
          	at org.jenkinsci.remoting.protocol.ApplicationLayer.write(ApplicationLayer.java:156)
          	at org.jenkinsci.remoting.protocol.impl.ChannelApplicationLayer.start(ChannelApplicationLayer.java:230)
          	at org.jenkinsci.remoting.protocol.ProtocolStack.init(ProtocolStack.java:201)
          	at org.jenkinsci.remoting.protocol.ProtocolStack.access$700(ProtocolStack.java:106)
          	at org.jenkinsci.remoting.protocol.ProtocolStack$Builder.build(ProtocolStack.java:554)
          	at org.jenkinsci.remoting.engine.JnlpProtocol4Handler.connect(JnlpProtocol4Handler.java:181)
          	at org.jenkinsci.remoting.engine.JnlpProtocolHandler.connect(JnlpProtocolHandler.java:157)
          	at hudson.remoting.Engine.innerRun(Engine.java:743)
          	at hudson.remoting.Engine.run(Engine.java:518)
          

            People

            Assignee:
            jthompson Jeff Thompson
            Reporter:
            anhuong Anh Uong
            Votes:
            2
            Watchers:
            3
