  Jenkins / JENKINS-53569

Remoting deadlock observed after upgrading to 3.26


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Critical
    • Resolution: Fixed
    • Component/s: remoting, swarm-plugin
    • Labels: None
    • Environment: Server: Jenkins 2.138.1 LTS (Remoting 3.25)
      Client: Swarm Client 3.14 (Remoting 3.26)
    • Released As: Remoting 3.27, Jenkins 2.144

    Description

      After upgrading my Jenkins master and the Swarm Client to the latest stable versions, I am seeing a new deadlock on the Swarm Client side when trying to connect to the master.

      The relevant output from jstack:

      Found one Java-level deadlock:
      =============================
      "pool-1-thread-3":
        waiting to lock monitor 0x0000000000d12970 (object 0x0000000784a00fc0, a org.jenkinsci.remoting.protocol.impl.ConnectionHeadersFilterLayer),
        which is held by "Thread-2"
      "Thread-2":
        waiting for ownable synchronizer 0x0000000784a4ac68, (a java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync),
        which is held by "pool-1-thread-3"
      
      Java stack information for the threads listed above:
      ===================================================
      "pool-1-thread-3":
              at org.jenkinsci.remoting.protocol.FilterLayer.onRecvRemoved(FilterLayer.java:134)
              - waiting to lock <0x0000000784a00fc0> (a org.jenkinsci.remoting.protocol.impl.ConnectionHeadersFilterLayer)
              at org.jenkinsci.remoting.protocol.ProtocolStack$Ptr.getNextRecv(ProtocolStack.java:929)
              at org.jenkinsci.remoting.protocol.ProtocolStack$Ptr.onRecv(ProtocolStack.java:663)
              at org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer.processRead(SSLEngineFilterLayer.java:369)
              at org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer.onRecv(SSLEngineFilterLayer.java:117)
              at org.jenkinsci.remoting.protocol.ProtocolStack$Ptr.onRecv(ProtocolStack.java:669)
              at org.jenkinsci.remoting.protocol.NetworkLayer.onRead(NetworkLayer.java:136)
              at org.jenkinsci.remoting.protocol.impl.BIONetworkLayer.access$2200(BIONetworkLayer.java:48)
              at org.jenkinsci.remoting.protocol.impl.BIONetworkLayer$Reader.run(BIONetworkLayer.java:283)
              at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
              at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
              at hudson.remoting.Engine$1.lambda$newThread$0(Engine.java:93)
              at hudson.remoting.Engine$1$$Lambda$5/613009671.run(Unknown Source)
              at java.lang.Thread.run(Thread.java:748)
      "Thread-2":
              at sun.misc.Unsafe.park(Native Method)
              - parking to wait for  <0x0000000784a4ac68> (a java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync)
              at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
              at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
              at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireShared(AbstractQueuedSynchronizer.java:967)
              at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireShared(AbstractQueuedSynchronizer.java:1283)
              at java.util.concurrent.locks.ReentrantReadWriteLock$ReadLock.lock(ReentrantReadWriteLock.java:727)
              at org.jenkinsci.remoting.protocol.ProtocolStack$Ptr.remove(ProtocolStack.java:755)
              at org.jenkinsci.remoting.protocol.FilterLayer.completed(FilterLayer.java:108)
              - locked <0x0000000784a00fc0> (a org.jenkinsci.remoting.protocol.impl.ConnectionHeadersFilterLayer)
              at org.jenkinsci.remoting.protocol.impl.ConnectionHeadersFilterLayer.complete(ConnectionHeadersFilterLayer.java:363)
              at org.jenkinsci.remoting.protocol.impl.ConnectionHeadersFilterLayer.doSend(ConnectionHeadersFilterLayer.java:499)
              - locked <0x0000000784a00fc0> (a org.jenkinsci.remoting.protocol.impl.ConnectionHeadersFilterLayer)
              at org.jenkinsci.remoting.protocol.ProtocolStack$Ptr.doSend(ProtocolStack.java:692)
              at org.jenkinsci.remoting.protocol.ApplicationLayer.write(ApplicationLayer.java:157)
              at org.jenkinsci.remoting.protocol.impl.ChannelApplicationLayer.start(ChannelApplicationLayer.java:230)
              at org.jenkinsci.remoting.protocol.ProtocolStack.init(ProtocolStack.java:201)
              at org.jenkinsci.remoting.protocol.ProtocolStack.access$700(ProtocolStack.java:106)
              at org.jenkinsci.remoting.protocol.ProtocolStack$Builder.build(ProtocolStack.java:554)
              at org.jenkinsci.remoting.engine.JnlpProtocol4Handler.connect(JnlpProtocol4Handler.java:179)
              at org.jenkinsci.remoting.engine.JnlpProtocolHandler.connect(JnlpProtocolHandler.java:157)
              at hudson.remoting.Engine.innerRun(Engine.java:573)
              at hudson.remoting.Engine.run(Engine.java:474)
      
      Found 1 deadlock.
      

      After encountering this deadlock, the Swarm Client never finishes connecting to the master. The master is unable to use the Swarm Client as a node when it reaches this hung state.
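      The cycle in the trace is a classic inverted lock-acquisition order: "Thread-2" holds the ConnectionHeadersFilterLayer monitor and then wants the ProtocolStack read/write lock, while "pool-1-thread-3" holds that read/write lock and then wants the monitor. The following is a minimal standalone sketch of the same shape; the lock roles and names are illustrative stand-ins, not Remoting's actual code, and it polls the JVM's built-in detector, the same facility jstack uses.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class DeadlockSketch {

    /** Forces the inverted acquisition order and returns the deadlocked-thread count. */
    public static int run() throws InterruptedException {
        // Stand-ins for the two locks in the trace (names are illustrative only):
        // the filter-layer monitor and the protocol stack's read/write lock.
        final Object monitor = new Object();
        final ReentrantReadWriteLock stackLock = new ReentrantReadWriteLock();
        final CountDownLatch bothHoldFirstLock = new CountDownLatch(2);

        // Like "Thread-2": takes the layer monitor, then wants the stack lock.
        Thread t1 = new Thread(() -> {
            synchronized (monitor) {
                bothHoldFirstLock.countDown();
                awaitQuietly(bothHoldFirstLock);
                stackLock.readLock().lock();   // blocks forever: t2 holds the write lock
            }
        }, "Thread-2");

        // Like "pool-1-thread-3": takes the stack lock, then wants the layer monitor.
        Thread t2 = new Thread(() -> {
            stackLock.writeLock().lock();
            bothHoldFirstLock.countDown();
            awaitQuietly(bothHoldFirstLock);
            synchronized (monitor) {           // blocks forever: t1 holds the monitor
            }
        }, "pool-1-thread-3");

        t1.setDaemon(true);                    // daemon, so the stuck pair cannot pin the JVM
        t2.setDaemon(true);
        t1.start();
        t2.start();

        // Poll the JVM's deadlock detector; findDeadlockedThreads() covers cycles
        // that mix monitors and ownable synchronizers like ReentrantReadWriteLock.
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        long[] deadlocked = null;
        while (deadlocked == null) {
            Thread.sleep(100);
            deadlocked = mx.findDeadlockedThreads();
        }
        return deadlocked.length;
    }

    private static void awaitQuietly(CountDownLatch latch) {
        try {
            latch.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("Found " + run() + " deadlocked threads");
    }
}
```

      Because neither thread can release its first lock while waiting for its second, the detector always reports the pair, which mirrors why the Swarm Client can never finish connecting once it enters this state.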

    Attachments

    Issue Links

    Activity

            basil Basil Crow added a comment -

            I've attached the full jstack output, the Swarm Client's standard output, and FINEST-level Swarm Client log files to the bug. Note that, unlike a successful connection, "Connected" is never printed to standard out. Even though the deadlock happens in the Swarm Client, the stack trace implicates Remoting.

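            For reference, a dump like the one attached can be captured with something along these lines. This is only a sketch: the jps pattern "swarm-client" is an assumption about how the client was launched, and the output filename is arbitrary.

```shell
# Find the Swarm Client JVM; the "swarm-client" pattern assumes the client
# was started from swarm-client.jar (adjust to your launch command).
pid=$(jps -l 2>/dev/null | awk '/swarm-client/ {print $1; exit}')
if [ -n "$pid" ]; then
  jstack -l "$pid" > swarm-client-jstack.txt
  # jstack appends any "Found one Java-level deadlock" report at the end of the dump
  grep "Java-level deadlock" swarm-client-jstack.txt || echo "no deadlock reported"
else
  echo "no swarm-client JVM found"
fi
```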
            basil Basil Crow added a comment -

            I looked through the recent commits and didn't find anything remotely related to locking and thread notifications besides JENKINS-51841. Could that change be related to this issue?

            jthompson Jeff Thompson added a comment -

            basil, it is unlikely that any of the recent changes caused the behavior you are seeing. The one you reference shouldn't have caused this, as it was a refactoring and rearrangement to give the remoting-kafka plugin access to some pieces.

            I don't have any insight into what might be going on in your system. I'm not familiar with any other reports like this. I'll try to take a little deeper look at your report when I get the chance.

            jthompson Jeff Thompson added a comment -

            I haven't had a chance to examine your report any further, but I ran across something elsewhere and wondered if it might be similar to yours. From what I've read, JENKINS-42187 can possibly cause hangs relating to Docker and swarms. It sounds like your environment might be similar, so I thought I'd pass this along to see if it provides any help to you.

            jthompson Jeff Thompson added a comment -

            No, it doesn't look like that Docker issue has anything to do with it. I got a little time to take a look at this and yes, it's a regular old Java threading deadlock. I'm not yet certain of the sequence that causes this deadlock, or why it doesn't occur in other cases. I have an idea for a change, which may solve the problem and doesn't seem to cause any other problems covered by the automated tests. Unfortunately, as usual, those tests don't cover threading, locking, and deadlocking very well.

            jthompson Jeff Thompson added a comment -

            Released Remoting 3.27, which contains a fix to avoid this deadlock. The potential deadlock has been around for a while and wasn't specific to 3.26. Something may have tweaked the timing in some environments that made it occur more. This should go into a weekly release soon.

            basil Basil Crow added a comment -

            Thank you! I appreciate this.


            People

              Assignee: jthompson Jeff Thompson
              Reporter: basil Basil Crow
              Votes: 0
              Watchers: 3
