JENKINS-73575 (project: Jenkins)

Jenkins master threads stuck in an indefinite wait


      Recently my Jenkins master hit what looked like a thread lock. It turned out to be an indefinite wait that started after a restart: the master was waiting to connect to a node that no longer existed. Because of this wait, the master could not properly communicate with the other nodes about pipeline steps (none of the 'sh' steps in the pipelines worked). In addition, the /logs/warning endpoint did not respond when I tried to check the logs. All the while, CPU and memory load on the master instance were not high.

      The thread dump (two threads blocked, both waiting on the unresponsive Channel@1a0969c9, which was not otherwise present in the master thread dump):
       
      Channel reader thread: node_124
      "Channel reader thread: node_124" Id=24000 Group=main WAITING on com.trilead.ssh2.channel.Channel@1a0969c9
      at java.base@11.0.22/java.lang.Object.wait(Native Method)

      •  waiting on com.trilead.ssh2.channel.Channel@1a0969c9
        at java.base@11.0.22/java.lang.Object.wait(Unknown Source)
        at com.trilead.ssh2.channel.FifoBuffer.read(FifoBuffer.java:212)
        at com.trilead.ssh2.channel.Channel$Output.read(Channel.java:127)
        at com.trilead.ssh2.channel.ChannelManager.getChannelData(ChannelManager.java:935)
        at com.trilead.ssh2.channel.ChannelInputStream.read(ChannelInputStream.java:58)
        at com.trilead.ssh2.channel.ChannelInputStream.read(ChannelInputStream.java:79)
        at hudson.remoting.FlightRecorderInputStream.read(FlightRecorderInputStream.java:94)
        at hudson.remoting.ChunkedInputStream.readHeader(ChunkedInputStream.java:74)
        at hudson.remoting.ChunkedInputStream.readUntilBreak(ChunkedInputStream.java:105)
        at hudson.remoting.ChunkedCommandTransport.readBlock(ChunkedCommandTransport.java:39)
        at hudson.remoting.AbstractSynchronousByteArrayCommandTransport.read(AbstractSynchronousByteArrayCommandTransport.java:34)
        at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:61)

      javamelody
      "javamelody" Id=165 Group=main WAITING on com.trilead.ssh2.channel.Channel@1a0969c9
      at java.base@11.0.22/java.lang.Object.wait(Native Method)

      •  waiting on com.trilead.ssh2.channel.Channel@1a0969c9
        at java.base@11.0.22/java.lang.Object.wait(Unknown Source)
        at com.trilead.ssh2.channel.ChannelManager.sendData(ChannelManager.java:385)
        at com.trilead.ssh2.channel.ChannelOutputStream.write(ChannelOutputStream.java:63)
        at com.trilead.ssh2.channel.ChannelOutputStream.write(ChannelOutputStream.java:68)
        at hudson.remoting.ChunkedOutputStream.sendFrame(ChunkedOutputStream.java:93)
        at hudson.remoting.ChunkedOutputStream.drain(ChunkedOutputStream.java:89)
        at hudson.remoting.ChunkedOutputStream.write(ChunkedOutputStream.java:58)
        at java.base@11.0.22/java.io.OutputStream.write(Unknown Source)
        at hudson.remoting.ChunkedCommandTransport.writeBlock(ChunkedCommandTransport.java:45)
        at hudson.remoting.AbstractSynchronousByteArrayCommandTransport.write(AbstractSynchronousByteArrayCommandTransport.java:46)
        at hudson.remoting.Channel.send(Channel.java:768)
      •  locked hudson.remoting.Channel@741652e2
        at hudson.remoting.Request.callAsync(Request.java:238)
        at hudson.remoting.Channel.callAsync(Channel.java:1032)
        at net.bull.javamelody.RemoteCallHelper.collectDataByNodeName(RemoteCallHelper.java:189)
        at net.bull.javamelody.RemoteCallHelper.collectJavaInformationsListByName(RemoteCallHelper.java:214)
        at net.bull.javamelody.NodesCollector.collectWithoutErrorsNow(NodesCollector.java:159)
        at net.bull.javamelody.NodesCollector.collectWithoutErrors(NodesCollector.java:147)
        at net.bull.javamelody.NodesCollector$2.run(NodesCollector.java:115)
        at java.base@11.0.22/java.util.TimerThread.mainLoop(Unknown Source)
        at java.base@11.0.22/java.util.TimerThread.run(Unknown Source)
         
        The issue looks like a rare race condition that occurs when an agent is deleted in the middle of communication, leaving the master stuck in the 'wait' method.
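
The failure mode described above can be illustrated in isolation. The sketch below is not trilead-ssh2 code; the class and method names (LostNotifyDemo, stateOfAbandonedWaiter) are made up for illustration. It only shows that a thread calling the untimed Object.wait() on a monitor that nobody notifies stays in the WAITING state, which is the state both threads show in the dump, and that such a thread is idle while it waits.

```java
/**
 * Hypothetical demo, not taken from trilead-ssh2: a thread that calls the
 * untimed Object.wait() on a monitor nobody ever notifies stays WAITING.
 */
final class LostNotifyDemo {

    /**
     * Starts a daemon thread that waits on a private monitor, gives it up to
     * settleMillis to reach Object.wait(), and returns the state it is in.
     */
    static Thread.State stateOfAbandonedWaiter(long settleMillis) {
        final Object monitor = new Object();
        Thread waiter = new Thread(() -> {
            synchronized (monitor) {
                try {
                    while (true) {
                        monitor.wait(); // untimed: only notify or interrupt ends it
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }
        }, "demo-waiter");
        waiter.setDaemon(true);
        waiter.start();

        // Poll until the waiter has parked itself or the settle time is over.
        long deadline = System.nanoTime() + settleMillis * 1_000_000L;
        while (waiter.getState() != Thread.State.WAITING
                && System.nanoTime() < deadline) {
            Thread.yield();
        }
        Thread.State observed = waiter.getState();
        waiter.interrupt(); // clean up the demo thread
        return observed;
    }

    public static void main(String[] args) {
        System.out.println(stateOfAbandonedWaiter(2000));
    }
}
```

With a timed wait the same thread would report TIMED_WAITING instead and would get a chance to re-check its condition once the timeout elapsed.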
         
        Proposed solution:
        Could I (or a maintainer) update these two places:
        https://github.com/jenkinsci/trilead-ssh2/blob/main/src/com/trilead/ssh2/channel/FifoBuffer.java#L212
        and
        https://github.com/jenkinsci/trilead-ssh2/blob/main/src/com/trilead/ssh2/channel/ChannelManager.java#L385
        so that they use a timed wait, for example 'wait(900000)' (15 minutes)?
        Alternatively, introduce a new variable such as DEFAULT_CONNECTION_TIMEOUT_SECONDS, or reuse an existing one, as long as the wait eventually times out.
         
        There are 8 uses of the 'wait' function in total, so it would be good to update all of them.
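
A minimal sketch of what such a bounded wait could look like, assuming a simplified stand-in for the real buffer. BoundedWaitBuffer, awaitData and offer are hypothetical names and do not exist in trilead-ssh2; only the constant name comes from the suggestion above. Because 'wait(timeout)' can return early (notification or spurious wakeup) and does not report whether the timeout elapsed, the sketch tracks a deadline and re-checks the condition in a loop.

```java
import java.io.IOException;
import java.io.InterruptedIOException;

/**
 * Hypothetical stand-in for a channel buffer, used only to illustrate a
 * wait with an overall deadline. Not the actual trilead-ssh2 implementation.
 */
final class BoundedWaitBuffer {
    // Name taken from the proposal above; the value (15 minutes) is an assumption.
    static final int DEFAULT_CONNECTION_TIMEOUT_SECONDS = 15 * 60;

    private final Object lock = new Object();
    private int available = 0;

    /** Blocks until data is available or timeoutMillis has elapsed. */
    int awaitData(long timeoutMillis) throws IOException {
        final long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
        synchronized (lock) {
            while (available == 0) {
                long remainingNanos = deadline - System.nanoTime();
                if (remainingNanos <= 0) {
                    throw new IOException("Timed out after " + timeoutMillis
                            + " ms waiting for channel data");
                }
                try {
                    // wait(0) means "wait forever", so never pass less than 1 ms.
                    lock.wait(Math.max(1L, remainingNanos / 1_000_000L));
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    throw new InterruptedIOException(
                            "Interrupted while waiting for channel data");
                }
            }
            int n = available;
            available = 0;
            return n;
        }
    }

    /** Makes bytes available and wakes up any waiting reader. */
    void offer(int bytes) {
        synchronized (lock) {
            available += bytes;
            lock.notifyAll();
        }
    }
}
```

A caller could use it as 'buffer.awaitData(DEFAULT_CONNECTION_TIMEOUT_SECONDS * 1000L)'. Throwing an IOException on timeout is a design choice of this sketch: it lets the reader thread fail and release resources instead of parking forever.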

      Unfortunately, I have not been able to reproduce the error so far, as the bug is very timing-sensitive.

            experrior Mateusz