[JENKINS-73575] Jenkins master threads stuck on waiting - Jenkins Jira

Type: Bug
Resolution: Unresolved
Priority: Minor
Component/s: core
Labels:

Similar Issues:
Powered by SuggestiMate

Show

Recently on my Jenkins I encountered a kind of thread lock, which turned out to be an indefinite wait which happened after restart with connection to one of the nodes, which was no longer existing, but jenkins master was waiting to connect to it anyway. Due to this wait, the master couldn't properly communicate with other nodes about ssh steps in the pipelines (none of the 'sh' steps in the pipelines worked). Additionally when trying to check logs from /logs/warning the endpoint was not responding. All the while the cpu, and memory load weren't high on the master instance.

The thread dump (2 blocked, waiting for non responding Channel@1a0969c9, which wasn't present in the master thread dump):

Channel reader thread: node_124
"Channel reader thread: node_124" Id=24000 Group=main WAITING on com.trilead.ssh2.channel.Channel@1a0969c9
at java.base@11.0.22/java.lang.Object.wait(Native Method)

waiting on com.trilead.ssh2.channel.Channel@1a0969c9
at java.base@11.0.22/java.lang.Object.wait(Unknown Source)
at com.trilead.ssh2.channel.FifoBuffer.read(FifoBuffer.java:212)
at com.trilead.ssh2.channel.Channel$Output.read(Channel.java:127)
at com.trilead.ssh2.channel.ChannelManager.getChannelData(ChannelManager.java:935)
at com.trilead.ssh2.channel.ChannelInputStream.read(ChannelInputStream.java:58)
at com.trilead.ssh2.channel.ChannelInputStream.read(ChannelInputStream.java:79)
at hudson.remoting.FlightRecorderInputStream.read(FlightRecorderInputStream.java:94)
at hudson.remoting.ChunkedInputStream.readHeader(ChunkedInputStream.java:74)
at hudson.remoting.ChunkedInputStream.readUntilBreak(ChunkedInputStream.java:105)
at hudson.remoting.ChunkedCommandTransport.readBlock(ChunkedCommandTransport.java:39)
at hudson.remoting.AbstractSynchronousByteArrayCommandTransport.read(AbstractSynchronousByteArrayCommandTransport.java:34)
at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:61)

javamelody
"javamelody" Id=165 Group=main WAITING on com.trilead.ssh2.channel.Channel@1a0969c9
at java.base@11.0.22/java.lang.Object.wait(Native Method)

waiting on com.trilead.ssh2.channel.Channel@1a0969c9
at java.base@11.0.22/java.lang.Object.wait(Unknown Source)
at com.trilead.ssh2.channel.ChannelManager.sendData(ChannelManager.java:385)
at com.trilead.ssh2.channel.ChannelOutputStream.write(ChannelOutputStream.java:63)
at com.trilead.ssh2.channel.ChannelOutputStream.write(ChannelOutputStream.java:68)
at hudson.remoting.ChunkedOutputStream.sendFrame(ChunkedOutputStream.java:93)
at hudson.remoting.ChunkedOutputStream.drain(ChunkedOutputStream.java:89)
at hudson.remoting.ChunkedOutputStream.write(ChunkedOutputStream.java:58)
at java.base@11.0.22/java.io.OutputStream.write(Unknown Source)
at hudson.remoting.ChunkedCommandTransport.writeBlock(ChunkedCommandTransport.java:45)
at hudson.remoting.AbstractSynchronousByteArrayCommandTransport.write(AbstractSynchronousByteArrayCommandTransport.java:46)
at hudson.remoting.Channel.send(Channel.java:768)
locked hudson.remoting.Channel@741652e2
at hudson.remoting.Request.callAsync(Request.java:238)
at hudson.remoting.Channel.callAsync(Channel.java:1032)
at net.bull.javamelody.RemoteCallHelper.collectDataByNodeName(RemoteCallHelper.java:189)
at net.bull.javamelody.RemoteCallHelper.collectJavaInformationsListByName(RemoteCallHelper.java:214)
at net.bull.javamelody.NodesCollector.collectWithoutErrorsNow(NodesCollector.java:159)
at net.bull.javamelody.NodesCollector.collectWithoutErrors(NodesCollector.java:147)
at net.bull.javamelody.NodesCollector$2.run(NodesCollector.java:115)
at java.base@11.0.22/java.util.TimerThread.mainLoop(Unknown Source)
at java.base@11.0.22/java.util.TimerThread.run(Unknown Source)

The issue looks like a rare race condition when the agent is deleted midway communication, causing the master to be stuck on 'wait' method.

Proposed solution:
Could I/you update the parts:
https://github.com/jenkinsci/trilead-ssh2/blob/main/src/com/trilead/ssh2/channel/FifoBuffer.java#L212
and
https://github.com/jenkinsci/trilead-ssh2/blob/main/src/com/trilead/ssh2/channel/ChannelManager.java#L385
so that it will be: 'wait(900000)' 15min ?
Or create a new variable like: DEFAULT_CONNECTION_TIMEOUT_SECONDS, or use any other existing one, as long as it will have some timeout eventually.

There's total 8 uses of 'wait' function, so it'd be good to update all of them.

Unfortunately I wasn't able to reproduce the error so far as it's very time sensitive bug to produce.

- - Sort By Name
  - Sort By Date
  - Ascending
  - Descending
  - Thumbnails
  - List
  - Download All

image-2024-08-19-15-16-48-527.png
153 kB
2024-08-19 13:16

Details

Description

Attachments

Attachments

Activity

People

Dates