JENKINS-68656

SSH Slaves Plugin Deadlock while spinning up a new agent

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Environment: Jenkins 2.332.3, OpenJDK 11.0.15, running on Ubuntu 20.04
      SSH Slaves Plugin 1.814.vc82988f54b_10 (tested with 1.33.0 as well)
      Anka Build Plugin 2.7.0
    • Released As: 1.821.vd834f8a_c390e

      The error observed is agents simply hanging while starting. This happens to about 5% of the VMs started in this manner.

      The Anka Build plugin is used, and the VM it spins up is 100% functional.

      Investigating the thread dump shows a deadlock between the launch and tearDownConnection methods in SSHLauncher.

      I have attached the stack traces of both threads as files.

      The launch method seems to be hanging while executing this:
      java.lang.Thread.State: TIMED_WAITING (on object monitor)
      at java.lang.Object.wait(java.base@11.0.15/Native Method)

      • waiting on <no object reference available>
        at hudson.remoting.Request.call(Request.java:177)
      • waiting to re-lock in wait() <0x00000005f9721350> (a hudson.remoting.UserRequest)
        at hudson.remoting.Channel.call(Channel.java:999)
        at hudson.FilePath.act(FilePath.java:1194)
        at hudson.FilePath.act(FilePath.java:1183)
        at hudson.FilePath.exists(FilePath.java:1748)
        at jenkins.branch.WorkspaceLocatorImpl.load(WorkspaceLocatorImpl.java:254)
        at jenkins.branch.WorkspaceLocatorImpl.access$500(WorkspaceLocatorImpl.java:86)
        at jenkins.branch.WorkspaceLocatorImpl$Collector.onOnline(WorkspaceLocatorImpl.java:601)
      • locked <0x00000005f97214e0> (a java.lang.String)
        at hudson.slaves.SlaveComputer.setChannel(SlaveComputer.java:727)
        at hudson.slaves.SlaveComputer.setChannel(SlaveComputer.java:437)
        at hudson.plugins.sshslaves.SSHLauncher.startAgent(SSHLauncher.java:645)
        at hudson.plugins.sshslaves.SSHLauncher.lambda$launch$0(SSHLauncher.java:458)
        at hudson.plugins.sshslaves.SSHLauncher$$Lambda$393/0x0000000840c2c040.call(Unknown Source)
        at java.util.concurrent.FutureTask.run(java.base@11.0.15/FutureTask.java:264)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(java.base@11.0.15/ThreadPoolExecutor.java:1128)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(java.base@11.0.15/ThreadPoolExecutor.java:628)
        at java.lang.Thread.run(java.base@11.0.15/Thread.java:829)
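
      For illustration only (this is not the plugin's code, and the class and lock names below are made up): a deadlock of this kind generally takes the shape of two threads each holding a monitor the other needs. A minimal, runnable Java sketch of that lock-ordering pattern:

      // Minimal illustration of a lock-ordering deadlock. The names only mirror
      // the shape of the reported launch vs. tearDownConnection hang; this is
      // NOT the SSH Slaves plugin's code.
      public class DeadlockSketch {
          private static final Object channelLock = new Object();    // stands in for the monitor held by launch
          private static final Object connectionLock = new Object(); // stands in for the teardown-side state

          public static void main(String[] args) {
              Thread launch = new Thread(() -> {
                  synchronized (channelLock) {           // "launch" takes the channel monitor first...
                      sleep(100);
                      synchronized (connectionLock) {    // ...then needs the connection state
                          System.out.println("launch finished");
                      }
                  }
              }, "launch");

              Thread tearDown = new Thread(() -> {
                  synchronized (connectionLock) {        // "teardown" takes the connection state first...
                      sleep(100);
                      synchronized (channelLock) {       // ...then needs the channel monitor
                          System.out.println("tearDownConnection finished");
                      }
                  }
              }, "tearDownConnection");

              launch.start();
              tearDown.start();                          // with the sleeps, both threads block forever
          }

          private static void sleep(long ms) {
              try { Thread.sleep(ms); } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
          }
      }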


          niv keidan added a comment -

          Also, it seems that this is only happening in Pipelines and does not occur in freestyle jobs... :/


          niv keidan added a comment -

          Also, we have 3 output sets from the "Support Core" plugin from when this issue was happening. Lots of info there. I can attach them if you think that will help.


          Ivan Fernandez Calvo added a comment -

          I do not remember whether the Support plugin anonymizes all the sensitive info about your instance, so better not to attach it here.

          I do not like to mix things in the same issue, but I think both have the same root cause. We are talking about two issues: one is that agents get stuck when they start, and the other is that agents get stuck in the middle of pipeline execution (not sure if it is at the end or at a random point).

          In the stack traces you pasted before, the three threads are using more than 70% of the CPU and are stuck reading from the disk and sending that data to the Jenkins controller. The times I have seen this in the past it was related to poor IO performance on the agent, usually caused by using an NFS filesystem (or another network filesystem) for the agent workspaces; please check the following links:

          https://support.cloudbees.com/hc/en-us/articles/115003461772-IO-Troubleshooting-on-Linux
          https://support.cloudbees.com/hc/en-us/articles/115003442371-Required-Data-IO-issues-on-Linux


          niv keidan added a comment -

          I may be misunderstanding, but I am seeing all 3 stack traces stuck on "SocketWrite", so why are you saying it's reading? And why do you say it's reading from disk?


          Ivan Fernandez Calvo added a comment -

          IIRC this is the part that grabs the classes from the Jenkins controller and stores them in the agent's local cache so they can run locally, so it reads over the network and writes to disk.

          at hudson.remoting.Util.copy(Util.java:58)
          at hudson.remoting.JarLoaderImpl.writeJarTo(JarLoaderImpl.java:57)

          https://github.com/daniel-beck/jenkins-remoting/blob/master/src/main/java/hudson/remoting/JarLoaderImpl.java#L31-L39
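
          As a rough illustration of why that call pattern touches both the network and the disk (a simplified sketch, not the actual hudson.remoting code; the stream sources here are placeholders):

          import java.io.ByteArrayInputStream;
          import java.io.ByteArrayOutputStream;
          import java.io.IOException;
          import java.io.InputStream;
          import java.io.OutputStream;

          // Simplified sketch of the "copy a jar from the controller to the agent's
          // local cache" pattern: each read blocks on the remoting channel (network)
          // and each write blocks on the agent's filesystem (disk), so slow IO on
          // either side stalls the loop. This is NOT the actual Util.copy code.
          public final class CopySketch {
              static void copy(InputStream fromController, OutputStream toLocalCache) throws IOException {
                  byte[] buf = new byte[8192];
                  int n;
                  while ((n = fromController.read(buf)) != -1) { // network read
                      toLocalCache.write(buf, 0, n);             // disk write
                  }
                  toLocalCache.flush();
              }

              public static void main(String[] args) throws IOException {
                  // Demo with in-memory streams; in remoting the source would be the channel.
                  ByteArrayInputStream in = new ByteArrayInputStream("demo-bytes".getBytes());
                  ByteArrayOutputStream out = new ByteArrayOutputStream();
                  copy(in, out);
                  System.out.println(out.size() + " bytes copied");
              }
          }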


          Ivan Fernandez Calvo added a comment -

          I just remembered that I used the Anka provider from MacStadium about 2 years ago; at least at that time their performance was really poor.


          niv keidan added a comment -

          Yeah, since major version 2 it is much better.

          In any case, this is relevant: https://github.com/jenkinsci/ssh-slaves-plugin/pull/304


          Ivan Fernandez Calvo added a comment - edited

          The PR will kill the connection in the best case, but the underlying issue of the connection being stuck in a native IO operation will still be there, so the plugin will try to reconnect, which may or may not work.
          In the case of your pipelines getting stuck on an IO operation, the fix in the PR will not apply if the channel is not broken.
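
          For context, the usual shape of that kind of guard is to run the launch step under a timeout and give up on it when the timeout fires; whether that actually unblocks anything depends on the stuck IO being interruptible. A minimal sketch of the pattern (hypothetical names, not the PR's actual code):

          import java.util.concurrent.ExecutorService;
          import java.util.concurrent.Executors;
          import java.util.concurrent.Future;
          import java.util.concurrent.TimeUnit;
          import java.util.concurrent.TimeoutException;

          // Sketch of a "give up on a hung launch" guard: run the step in a worker
          // thread, wait with a timeout, and cancel if it never returns. The fake
          // workload below stands in for a launch stuck in native IO.
          public class LaunchTimeoutSketch {
              public static void main(String[] args) throws Exception {
                  ExecutorService executor = Executors.newSingleThreadExecutor();
                  Future<?> launch = executor.submit(() -> {
                      try {
                          Thread.sleep(60_000); // pretend the launch never finishes
                      } catch (InterruptedException e) {
                          Thread.currentThread().interrupt(); // cancellation lands here only if the work is interruptible
                      }
                  });
                  try {
                      launch.get(5, TimeUnit.SECONDS);     // wait for the launch, but not forever
                  } catch (TimeoutException e) {
                      launch.cancel(true);                 // best case: interrupt the worker and tear the connection down
                      System.out.println("launch timed out; a reconnect attempt may follow");
                  } finally {
                      executor.shutdownNow();
                  }
              }
          }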


          Ivan Fernandez Calvo added a comment -

          Did the new version fix the deadlock at start time?


          Nathan added a comment -

          Hi Ivan, yep! tomekjarosik is unblocked using the new code.


            Assignee: Ivan Fernandez Calvo (ifernandezcalvo)
            Reporter: niv keidan (niv_keidan_veertu)
            Votes: 1
            Watchers: 3
