Recently my Jenkins instance hit what looked like a thread lock. It turned out to be an indefinite wait that began after a restart: the Jenkins master was waiting on a connection to a node that no longer existed. Because of this wait, the master could not communicate properly with the other nodes for the shell steps in the pipelines (none of the 'sh' steps worked). In addition, the /logs/warning endpoint did not respond when I tried to check the logs. Throughout, CPU and memory load on the master instance were not high.

      The thread dump (two threads blocked, waiting on the non-responding Channel@1a0969c9, which did not otherwise appear in the master thread dump):
       
      Channel reader thread: node_124
      "Channel reader thread: node_124" Id=24000 Group=main WAITING on com.trilead.ssh2.channel.Channel@1a0969c9
      at java.base@11.0.22/java.lang.Object.wait(Native Method)

      •  waiting on com.trilead.ssh2.channel.Channel@1a0969c9
        at java.base@11.0.22/java.lang.Object.wait(Unknown Source)
        at com.trilead.ssh2.channel.FifoBuffer.read(FifoBuffer.java:212)
        at com.trilead.ssh2.channel.Channel$Output.read(Channel.java:127)
        at com.trilead.ssh2.channel.ChannelManager.getChannelData(ChannelManager.java:935)
        at com.trilead.ssh2.channel.ChannelInputStream.read(ChannelInputStream.java:58)
        at com.trilead.ssh2.channel.ChannelInputStream.read(ChannelInputStream.java:79)
        at hudson.remoting.FlightRecorderInputStream.read(FlightRecorderInputStream.java:94)
        at hudson.remoting.ChunkedInputStream.readHeader(ChunkedInputStream.java:74)
        at hudson.remoting.ChunkedInputStream.readUntilBreak(ChunkedInputStream.java:105)
        at hudson.remoting.ChunkedCommandTransport.readBlock(ChunkedCommandTransport.java:39)
        at hudson.remoting.AbstractSynchronousByteArrayCommandTransport.read(AbstractSynchronousByteArrayCommandTransport.java:34)
        at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:61)

      javamelody
      "javamelody" Id=165 Group=main WAITING on com.trilead.ssh2.channel.Channel@1a0969c9
      at java.base@11.0.22/java.lang.Object.wait(Native Method)

      •  waiting on com.trilead.ssh2.channel.Channel@1a0969c9
        at java.base@11.0.22/java.lang.Object.wait(Unknown Source)
        at com.trilead.ssh2.channel.ChannelManager.sendData(ChannelManager.java:385)
        at com.trilead.ssh2.channel.ChannelOutputStream.write(ChannelOutputStream.java:63)
        at com.trilead.ssh2.channel.ChannelOutputStream.write(ChannelOutputStream.java:68)
        at hudson.remoting.ChunkedOutputStream.sendFrame(ChunkedOutputStream.java:93)
        at hudson.remoting.ChunkedOutputStream.drain(ChunkedOutputStream.java:89)
        at hudson.remoting.ChunkedOutputStream.write(ChunkedOutputStream.java:58)
        at java.base@11.0.22/java.io.OutputStream.write(Unknown Source)
        at hudson.remoting.ChunkedCommandTransport.writeBlock(ChunkedCommandTransport.java:45)
        at hudson.remoting.AbstractSynchronousByteArrayCommandTransport.write(AbstractSynchronousByteArrayCommandTransport.java:46)
        at hudson.remoting.Channel.send(Channel.java:768)
      •  locked hudson.remoting.Channel@741652e2
        at hudson.remoting.Request.callAsync(Request.java:238)
        at hudson.remoting.Channel.callAsync(Channel.java:1032)
        at net.bull.javamelody.RemoteCallHelper.collectDataByNodeName(RemoteCallHelper.java:189)
        at net.bull.javamelody.RemoteCallHelper.collectJavaInformationsListByName(RemoteCallHelper.java:214)
        at net.bull.javamelody.NodesCollector.collectWithoutErrorsNow(NodesCollector.java:159)
        at net.bull.javamelody.NodesCollector.collectWithoutErrors(NodesCollector.java:147)
        at net.bull.javamelody.NodesCollector$2.run(NodesCollector.java:115)
        at java.base@11.0.22/java.util.TimerThread.mainLoop(Unknown Source)
        at java.base@11.0.22/java.util.TimerThread.run(Unknown Source)
         
        The issue looks like a rare race condition: when an agent is deleted in the middle of communication, the master gets stuck in the 'wait' method.
         
        Proposed solution:
        Could I, or a maintainer, update these two places:
        https://github.com/jenkinsci/trilead-ssh2/blob/main/src/com/trilead/ssh2/channel/FifoBuffer.java#L212
        and
        https://github.com/jenkinsci/trilead-ssh2/blob/main/src/com/trilead/ssh2/channel/ChannelManager.java#L385
        so that they use a timed wait such as 'wait(900000)' (15 minutes)?
        Alternatively, introduce a new constant such as DEFAULT_CONNECTION_TIMEOUT_SECONDS, or reuse an existing one, as long as the wait eventually times out.
         
        There are 8 uses of the 'wait' function in total, so it would be good to update all of them.

      Unfortunately, I have not been able to reproduce the error so far, because the bug is very timing-sensitive.
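To make the proposal concrete, here is a minimal sketch of a bounded wait in Java. It is not the trilead-ssh2 code: the class `BoundedWaitSketch`, its methods, and the constant `DEFAULT_WAIT_TIMEOUT_MILLIS` are hypothetical names used only for illustration. It shows the usual pattern of waiting against a deadline so that spurious wakeups do not extend the total wait, and so that `wait(0)` (which means "wait forever") is never called by accident.

```java
import java.io.IOException;
import java.io.InterruptedIOException;
import java.util.concurrent.TimeUnit;

/** Hypothetical stand-in for a buffer whose reader blocks until data is signalled. */
class BoundedWaitSketch {
    // Hypothetical constant; the issue suggests 15 minutes (900000 ms).
    static final long DEFAULT_WAIT_TIMEOUT_MILLIS = 900_000L;

    private final Object lock = new Object();
    private boolean dataAvailable = false;

    /** Blocks until data is signalled, or throws once timeoutMillis has elapsed. */
    void awaitData(long timeoutMillis) throws IOException {
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
        synchronized (lock) {
            while (!dataAvailable) {
                long remainingMillis = TimeUnit.NANOSECONDS.toMillis(deadline - System.nanoTime());
                // wait(0) would block indefinitely, so stop before calling it.
                if (remainingMillis <= 0) {
                    throw new IOException("Timed out after " + timeoutMillis + " ms waiting for data");
                }
                try {
                    lock.wait(remainingMillis);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt(); // preserve the interrupt flag
                    throw new InterruptedIOException("Interrupted while waiting for data");
                }
            }
        }
    }

    /** Called by the producer side when data has arrived. */
    void signalData() {
        synchronized (lock) {
            dataAvailable = true;
            lock.notifyAll();
        }
    }

    public static void main(String[] args) {
        BoundedWaitSketch buffer = new BoundedWaitSketch();
        try {
            buffer.awaitData(200); // nobody signals, so this gives up after roughly 200 ms
        } catch (IOException e) {
            System.out.println("Gave up: " + e.getMessage());
        }
    }
}
```

One design question a real patch would have to answer is what to do when the timeout fires: a healthy but idle channel can legitimately have nothing to read for longer than any fixed limit, so failing the read outright may not always be the right reaction.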

          [JENKINS-73575] Jenkins master threads stuck on waiting

          Mateusz created issue -
          Mateusz made changes -
          Summary Original: Jenkins master thread stuck on waiting New: Jenkins master threads stuck on waiting
          Mateusz made changes -
          Attachment New: image-2024-08-19-15-16-48-527.png [ 63169 ]

          Mateusz added a comment -

          I've recently identified another issue with com.trilead.ssh2:

          On a long-running Jenkins with many agents, the monitoring plugin stopped recording graphs for quite a long time.

          Waiting thread:

          "javamelody" daemon prio=5 WAITING
              java.lang.Object.wait(Native Method)
              java.lang.Object.wait(Object.java:502)
              com.trilead.ssh2.channel.ChannelManager.sendData(ChannelManager.java:383)
              com.trilead.ssh2.channel.ChannelOutputStream.write(ChannelOutputStream.java:63)
              com.trilead.ssh2.channel.ChannelOutputStream.write(ChannelOutputStream.java:68)
              hudson.remoting.ChunkedOutputStream.sendFrame(ChunkedOutputStream.java:89)
              hudson.remoting.ChunkedOutputStream.drain(ChunkedOutputStream.java:85)
              hudson.remoting.ChunkedOutputStream.write(ChunkedOutputStream.java:54)
              java.io.OutputStream.write(OutputStream.java:75)
              hudson.remoting.ChunkedCommandTransport.writeBlock(ChunkedCommandTransport.java:45)
              hudson.remoting.AbstractSynchronousByteArrayCommandTransport.write(AbstractSynchronousByteArrayCommandTransport.java:46)
              hudson.remoting.Channel.send(Channel.java:764)
              hudson.remoting.Request.callAsync(Request.java:238)
              hudson.remoting.Channel.callAsync(Channel.java:1028)
              net.bull.javamelody.RemoteCallHelper.collectDataByNodeName(RemoteCallHelper.java:188)
              net.bull.javamelody.RemoteCallHelper.collectJavaInformationsListByName(RemoteCallHelper.java:213)
              net.bull.javamelody.NodesCollector.collectWithoutErrorsNow(NodesCollector.java:159)
              net.bull.javamelody.NodesCollector.collectWithoutErrors(NodesCollector.java:147)
              net.bull.javamelody.NodesCollector$2.run(NodesCollector.java:115)
              java.util.TimerThread.mainLoop(Timer.java:555)
              java.util.TimerThread.run(Timer.java:505)

          Again, the thread is stuck forever because the wait in this method has no timeout.
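Threads stuck this way can also be found programmatically rather than by reading a full dump. The sketch below uses the standard `java.lang.management.ThreadMXBean` API to list threads that are in the `WAITING` state (an untimed wait) and have a stack frame in a given class-name prefix, for example `com.trilead.ssh2`. The class `StuckThreadFinder` and the helper `blockOn` are hypothetical names for illustration; `blockOn` exists only so the demo has a thread to find.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.ArrayList;
import java.util.List;

class StuckThreadFinder {
    /** Returns the names of threads in an untimed wait with a frame whose class name starts with classPrefix. */
    static List<String> findWaitingIn(String classPrefix) {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        List<String> names = new ArrayList<>();
        // false, false: skip lock details, which not every JVM supports
        for (ThreadInfo info : mx.dumpAllThreads(false, false)) {
            if (info == null || info.getThreadState() != Thread.State.WAITING) {
                continue;
            }
            for (StackTraceElement frame : info.getStackTrace()) {
                if (frame.getClassName().startsWith(classPrefix)) {
                    names.add(info.getThreadName());
                    break;
                }
            }
        }
        return names;
    }

    /** Demo helper: blocks the calling thread in an untimed wait on the given lock. */
    static void blockOn(Object lock) {
        synchronized (lock) {
            try {
                while (true) {
                    lock.wait();
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
    }

    public static void main(String[] args) throws InterruptedException {
        Object lock = new Object();
        Thread waiter = new Thread(() -> blockOn(lock), "demo-waiter");
        waiter.setDaemon(true);
        waiter.start();
        while (waiter.getState() != Thread.State.WAITING) {
            Thread.sleep(10); // give the demo thread time to reach wait()
        }
        System.out.println(findWaitingIn(StuckThreadFinder.class.getName()));
    }
}
```

On a Jenkins controller the same check could be run with the prefix `com.trilead.ssh2`, with the caveat that a thread waiting there is not necessarily stuck: a channel reader on an idle but healthy agent also sits in an untimed wait.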

          Mateusz made changes -
          Assignee New: Mateusz [ experrior ]
          Mateusz made changes -
          Status Original: Open [ 1 ] New: In Progress [ 3 ]

          Ivan Fernandez Calvo added a comment -

          If you configure the TCP stack of your Jenkins controller and Jenkins agents with a proper TCP keepalive time, the dead connection will be dropped and the waiting threads released. The default value in Linux is 7200 seconds; I usually recommend setting it to 120 seconds.

          Ivan Fernandez Calvo added a comment -

          This is usually a symptom of other issues. Configuring logs on both sides and tuning the TCP stack helps to find the real issue:

          https://github.com/jenkinsci/ssh-agents-plugin/blob/main/doc/TROUBLESHOOTING.md#enable-ssh-keepalive-traffic

          Mateusz added a comment -

          I've checked /etc/ssh/ssh_config and /proc/sys/net/ipv4/tcp_keepalive_time (all at the default of 7200) on both the master and the agents. That timeout is far shorter than 2 weeks, which is how long the 'javamelody' thread was stuck.
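As a rough check of the numbers in this exchange, the sketch below computes how long the Linux kernel takes, in the worst case, to give up on an idle connection whose peer has vanished: the idle time before the first probe plus all unanswered probes. The class and method names are hypothetical. The values 75 seconds for `tcp_keepalive_intvl` and 9 for `tcp_keepalive_probes` are the usual Linux defaults and are an assumption here, since the thread only quotes `tcp_keepalive_time`. The calculation also assumes the socket has keepalive (`SO_KEEPALIVE`) enabled, which the kernel settings alone do not guarantee.

```java
class KeepaliveDetectionTime {
    /** Worst-case seconds before an idle, dead connection is dropped: idle time plus every unanswered probe. */
    static long worstCaseSeconds(long keepaliveTime, long keepaliveIntvl, int keepaliveProbes) {
        return keepaliveTime + keepaliveIntvl * keepaliveProbes;
    }

    public static void main(String[] args) {
        // tcp_keepalive_time=7200 (from the thread); intvl=75 and probes=9 are assumed defaults
        System.out.println(worstCaseSeconds(7200, 75, 9)); // 7875 s, about 2 h 11 min
        // With the recommended tcp_keepalive_time=120
        System.out.println(worstCaseSeconds(120, 75, 9));  // 795 s, about 13 min
    }
}
```

Two weeks is 1,209,600 seconds, so even with the default settings keepalive would have fired long before then if it had been active on this connection.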

