Recently I ran into what looked like a thread lock on my Jenkins master. It turned out to be an indefinite wait that appeared after a restart: the master kept trying to connect to a node that no longer existed. Because of this wait, the master couldn't properly communicate with the other nodes (none of the 'sh' steps in the pipelines worked). Additionally, when trying to check the logs via /logs/warning, the endpoint did not respond. All the while, CPU and memory load on the master instance stayed low.

      The thread dump (two threads blocked, both waiting on the non-responding com.trilead.ssh2.channel.Channel@1a0969c9, which no longer appeared anywhere else in the master thread dump):
       
      Channel reader thread: node_124
      "Channel reader thread: node_124" Id=24000 Group=main WAITING on com.trilead.ssh2.channel.Channel@1a0969c9
      at java.base@11.0.22/java.lang.Object.wait(Native Method)

      •  waiting on com.trilead.ssh2.channel.Channel@1a0969c9
        at java.base@11.0.22/java.lang.Object.wait(Unknown Source)
        at com.trilead.ssh2.channel.FifoBuffer.read(FifoBuffer.java:212)
        at com.trilead.ssh2.channel.Channel$Output.read(Channel.java:127)
        at com.trilead.ssh2.channel.ChannelManager.getChannelData(ChannelManager.java:935)
        at com.trilead.ssh2.channel.ChannelInputStream.read(ChannelInputStream.java:58)
        at com.trilead.ssh2.channel.ChannelInputStream.read(ChannelInputStream.java:79)
        at hudson.remoting.FlightRecorderInputStream.read(FlightRecorderInputStream.java:94)
        at hudson.remoting.ChunkedInputStream.readHeader(ChunkedInputStream.java:74)
        at hudson.remoting.ChunkedInputStream.readUntilBreak(ChunkedInputStream.java:105)
        at hudson.remoting.ChunkedCommandTransport.readBlock(ChunkedCommandTransport.java:39)
        at hudson.remoting.AbstractSynchronousByteArrayCommandTransport.read(AbstractSynchronousByteArrayCommandTransport.java:34)
        at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:61)

      javamelody
      "javamelody" Id=165 Group=main WAITING on com.trilead.ssh2.channel.Channel@1a0969c9
      at java.base@11.0.22/java.lang.Object.wait(Native Method)

      •  waiting on com.trilead.ssh2.channel.Channel@1a0969c9
        at java.base@11.0.22/java.lang.Object.wait(Unknown Source)
        at com.trilead.ssh2.channel.ChannelManager.sendData(ChannelManager.java:385)
        at com.trilead.ssh2.channel.ChannelOutputStream.write(ChannelOutputStream.java:63)
        at com.trilead.ssh2.channel.ChannelOutputStream.write(ChannelOutputStream.java:68)
        at hudson.remoting.ChunkedOutputStream.sendFrame(ChunkedOutputStream.java:93)
        at hudson.remoting.ChunkedOutputStream.drain(ChunkedOutputStream.java:89)
        at hudson.remoting.ChunkedOutputStream.write(ChunkedOutputStream.java:58)
        at java.base@11.0.22/java.io.OutputStream.write(Unknown Source)
        at hudson.remoting.ChunkedCommandTransport.writeBlock(ChunkedCommandTransport.java:45)
        at hudson.remoting.AbstractSynchronousByteArrayCommandTransport.write(AbstractSynchronousByteArrayCommandTransport.java:46)
        at hudson.remoting.Channel.send(Channel.java:768)
      •  locked hudson.remoting.Channel@741652e2
        at hudson.remoting.Request.callAsync(Request.java:238)
        at hudson.remoting.Channel.callAsync(Channel.java:1032)
        at net.bull.javamelody.RemoteCallHelper.collectDataByNodeName(RemoteCallHelper.java:189)
        at net.bull.javamelody.RemoteCallHelper.collectJavaInformationsListByName(RemoteCallHelper.java:214)
        at net.bull.javamelody.NodesCollector.collectWithoutErrorsNow(NodesCollector.java:159)
        at net.bull.javamelody.NodesCollector.collectWithoutErrors(NodesCollector.java:147)
        at net.bull.javamelody.NodesCollector$2.run(NodesCollector.java:115)
        at java.base@11.0.22/java.util.TimerThread.mainLoop(Unknown Source)
        at java.base@11.0.22/java.util.TimerThread.run(Unknown Source)
         
        The issue looks like a rare race condition: when the agent is deleted midway through communication, the master is left stuck in an unbounded 'wait' call.
         
        Proposed solution:
        Could the following places be updated:
        https://github.com/jenkinsci/trilead-ssh2/blob/main/src/com/trilead/ssh2/channel/FifoBuffer.java#L212
        and
        https://github.com/jenkinsci/trilead-ssh2/blob/main/src/com/trilead/ssh2/channel/ChannelManager.java#L385
        so that they call 'wait(900000)' (15 minutes) instead of the unbounded 'wait()'?
        Alternatively, introduce a new constant such as DEFAULT_CONNECTION_TIMEOUT_SECONDS, or reuse any existing one, as long as the wait eventually times out; see the sketch below.
         
        There are 8 uses of the 'wait' function in total, so it would be good to update all of them.
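
        For illustration, here is a minimal, self-contained sketch of the bounded-wait idea (this is not the actual FifoBuffer/ChannelManager code; the class, field and constant names are assumptions made up for this example):

        import java.io.InterruptedIOException;

        /**
         * Illustrative sketch only -- not the trilead-ssh2 source. Shows how an unbounded
         * Object.wait() can be replaced by a deadline-bounded wait so that a peer that
         * never answers cannot block the calling thread forever.
         */
        class BoundedChannelWait {
            // Assumed name and value; 15 minutes as proposed above.
            private static final long DEFAULT_CONNECTION_TIMEOUT_MILLIS = 15 * 60 * 1000L;

            private final Object lock = new Object();
            private boolean dataAvailable; // stand-in for "the channel buffer has bytes to read"

            int readBlocking() throws InterruptedException, InterruptedIOException {
                synchronized (lock) {
                    long deadline = System.currentTimeMillis() + DEFAULT_CONNECTION_TIMEOUT_MILLIS;
                    while (!dataAvailable) {
                        long remaining = deadline - System.currentTimeMillis();
                        if (remaining <= 0) {
                            // Surface the stall to the caller instead of hanging forever.
                            throw new InterruptedIOException("Timed out waiting for SSH channel data");
                        }
                        lock.wait(remaining); // bounded wait; the condition is re-checked on wake-up
                    }
                    dataAvailable = false;
                    return 1; // placeholder for the data actually read
                }
            }

            void onDataArrived() {
                synchronized (lock) {
                    dataAvailable = true;
                    lock.notifyAll();
                }
            }
        }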

      Unfortunately I haven't been able to reproduce the error so far, as it is a very timing-sensitive bug to trigger.

          [JENKINS-73575] Jenkins master threads stuck on waiting

          Mateusz added a comment -

          I've recently identified another issue with com.trilead.ssh2:

          On a long-running Jenkins with many agents, there was a problem with the Monitoring plugin: it stopped recording graphs for quite a long time:

           

           

          Waiting thread:

          "javamelody" daemon prio=5 WAITING
              java.lang.Object.wait(Native Method)
              java.lang.Object.wait(Object.java:502)
              com.trilead.ssh2.channel.ChannelManager.sendData(ChannelManager.java:383)
              com.trilead.ssh2.channel.ChannelOutputStream.write(ChannelOutputStream.java:63)
              com.trilead.ssh2.channel.ChannelOutputStream.write(ChannelOutputStream.java:68)
              hudson.remoting.ChunkedOutputStream.sendFrame(ChunkedOutputStream.java:89)
              hudson.remoting.ChunkedOutputStream.drain(ChunkedOutputStream.java:85)
              hudson.remoting.ChunkedOutputStream.write(ChunkedOutputStream.java:54)
              java.io.OutputStream.write(OutputStream.java:75)
              hudson.remoting.ChunkedCommandTransport.writeBlock(ChunkedCommandTransport.java:45)
              hudson.remoting.AbstractSynchronousByteArrayCommandTransport.write(AbstractSynchronousByteArrayCommandTransport.java:46)
              hudson.remoting.Channel.send(Channel.java:764)
              hudson.remoting.Request.callAsync(Request.java:238)
              hudson.remoting.Channel.callAsync(Channel.java:1028)
              net.bull.javamelody.RemoteCallHelper.collectDataByNodeName(RemoteCallHelper.java:188)
              net.bull.javamelody.RemoteCallHelper.collectJavaInformationsListByName(RemoteCallHelper.java:213)
              net.bull.javamelody.NodesCollector.collectWithoutErrorsNow(NodesCollector.java:159)
              net.bull.javamelody.NodesCollector.collectWithoutErrors(NodesCollector.java:147)
              net.bull.javamelody.NodesCollector$2.run(NodesCollector.java:115)
              java.util.TimerThread.mainLoop(Timer.java:555)
              java.util.TimerThread.run(Timer.java:505)

          Again, the thread is stuck forever because the method it is waiting in has no timeout.


          If you configure the TCP stack of your Jenkins controller and Jenkins agents with a proper TCP keepalive time, the connection will die and so will the threads. The default value in Linux is 7200 seconds; I usually recommend setting it to 120 seconds.
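
          For context on the mechanism: the kernel keepalive settings only apply to sockets that have SO_KEEPALIVE enabled, and the probe timing comes from the tcp_keepalive_* sysctls mentioned in this thread. A minimal Java sketch of the per-socket side (the host and port below are placeholders, not taken from this setup):

          import java.net.InetSocketAddress;
          import java.net.Socket;

          // Minimal sketch: enabling SO_KEEPALIVE so the kernel's tcp_keepalive_* settings
          // apply to this connection. Host and port are placeholders.
          public class KeepAliveSketch {
              public static void main(String[] args) throws Exception {
                  try (Socket socket = new Socket()) {
                      socket.connect(new InetSocketAddress("agent.example.org", 22), 10_000);
                      socket.setKeepAlive(true); // without this, the kernel never probes an idle peer
                      // ... a half-closed peer is then eventually detected, and blocked
                      // reads/writes fail instead of hanging indefinitely
                  }
              }
          }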


          This is usually a symptom of other issues; configuring logs on both sides and tuning the TCP stack helps to find the real issue.

          https://github.com/jenkinsci/ssh-agents-plugin/blob/main/doc/TROUBLESHOOTING.md#enable-ssh-keepalive-traffic


          Mateusz added a comment -

          I've checked /etc/ssh/ssh_config and /proc/sys/net/ipv4/tcp_keepalive_time (all at the default of 7200) on both master and agents; the timeout was much shorter than 2 weeks, which is how long the 'javamelody' thread was stuck.


          Again, you have an issue with half-closed SSH connections, which ends up with tons of threads blocked forever. The solution is to configure the TCP stack to end these connections. The change to the wait that you propose in the PR could help, but the half-closed connections will still be there if you do not configure your TCP stack.


          Mateusz added a comment - edited

          I'll look further into the TCP stack configuration to see whether there's an extremely large timeout or no keepalive set, and I'll get back once I find something.


          Mateusz added a comment -

          Based on the TCP / SSH configuration on the master and agents there shouldn't be any problems. The only new thing I've found is that:
          hudson.slaves.ChannelPinger.pingIntervalSeconds
          hudson.slaves.ChannelPinger.pingTimeoutSeconds
          were set to -1 in the original case, which could be the cause of that issue.
          Yet that doesn't explain the second issue with the Monitoring plugin threads hanging, where the values
          are set to 300 and 900, so that shouldn't be the cause of the problems there. A hypothetical sketch of what the -1 setting implies follows below.
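
          A hypothetical sketch of how such property values are typically read and what -1 implies (this is not the Jenkins source; the fallback defaults used here are placeholders):

          // Hypothetical illustration, not Jenkins source: with a non-positive ping interval
          // the controller never pings the agent channel, so a half-closed connection is only
          // noticed once a thread blocks on it, as in the dumps above.
          public class ChannelPingerSettingsCheck {
              public static void main(String[] args) {
                  int intervalSeconds = Integer.getInteger("hudson.slaves.ChannelPinger.pingIntervalSeconds", 300);
                  int timeoutSeconds  = Integer.getInteger("hudson.slaves.ChannelPinger.pingTimeoutSeconds", 240);
                  if (intervalSeconds <= 0) {
                      System.out.println("Channel pinging disabled -- dead channels are not detected proactively.");
                  } else {
                      System.out.printf("Channel ping every %ds, timeout %ds%n", intervalSeconds, timeoutSeconds);
                  }
              }
          }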

