SSH Slaves should pass connection timeout to connection.connect() if there is an agent startup timeout

    • Type: Bug
    • Resolution: Fixed
    • Priority: Minor
    • ssh-slaves-plugin
    • None

      Infinite hanging of connections is likely a root cause of JENKINS-48613.

      "SSHLauncher.launch for 'myagent' node [#1]" #2565 prio=5 os_prio=0 tid=0x00007f080c1b1000 nid=0x35c runnable [0x00007f07b2c5c000]
         java.lang.Thread.State: RUNNABLE
          at java.net.SocketInputStream.socketRead0(Native Method)
          at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
          at java.net.SocketInputStream.read(SocketInputStream.java:171)
          at java.net.SocketInputStream.read(SocketInputStream.java:141)
          at java.net.SocketInputStream.read(SocketInputStream.java:224)
          at com.trilead.ssh2.transport.ClientServerHello.readLineRN(ClientServerHello.java:31)
          at com.trilead.ssh2.transport.ClientServerHello.<init>(ClientServerHello.java:68)
          at com.trilead.ssh2.transport.TransportManager.initialize(TransportManager.java:487)
          at com.trilead.ssh2.Connection.connect(Connection.java:774)
          - locked <0x0000000594003de0> (a com.trilead.ssh2.Connection)
          at com.trilead.ssh2.Connection.connect(Connection.java:703)
          - locked <0x0000000594003de0> (a com.trilead.ssh2.Connection)
          at com.trilead.ssh2.Connection.connect(Connection.java:617)
          - locked <0x0000000594003de0> (a com.trilead.ssh2.Connection)
          at hudson.plugins.sshslaves.SSHLauncher.openConnection(SSHLauncher.java:1302)
          at hudson.plugins.sshslaves.SSHLauncher$2.call(SSHLauncher.java:814)
          at hudson.plugins.sshslaves.SSHLauncher$2.call(SSHLauncher.java:803)
          at java.util.concurrent.FutureTask.run(FutureTask.java:266)
          at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
          at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
          at java.lang.Thread.run(Thread.java:748)
      

      The Trilead SSH API allows passing timeouts to connect(), so we should use that, at least when an agent startup timeout is specified.
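
The thread dump above is blocked in a socket read while waiting for the server's identification line, with no deadline. As a rough illustration of why a timeout helps, here is a minimal sketch that uses plain java.net sockets rather than Trilead. It is not the plugin's actual change, which per the commit log passes the launch timeout to Trilead as the connection and KEx timeout. The class and method names (BannerTimeoutDemo, readBanner, timesOutAgainstSilentServer) are hypothetical.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not part of ssh-slaves-plugin or Trilead.
public class BannerTimeoutDemo {

    /**
     * Connects to host:port and reads the first line the server sends
     * (for SSH, the identification string). The timeout bounds the TCP
     * connect and, via SO_TIMEOUT, each individual blocking read.
     */
    public static String readBanner(String host, int port, int timeoutMillis) throws IOException {
        try (Socket socket = new Socket()) {
            // Bound the TCP connect itself.
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            // Without this, read() can block forever, as in the thread dump.
            socket.setSoTimeout(timeoutMillis);
            BufferedReader in = new BufferedReader(
                    new InputStreamReader(socket.getInputStream(), StandardCharsets.US_ASCII));
            return in.readLine();
        }
    }

    /**
     * Opens a local listening socket that never accepts or writes anything,
     * then checks that readBanner gives up instead of hanging.
     */
    public static boolean timesOutAgainstSilentServer(int timeoutMillis) throws IOException {
        try (ServerSocket silent = new ServerSocket(0, 1, InetAddress.getLoopbackAddress())) {
            try {
                readBanner(silent.getInetAddress().getHostAddress(), silent.getLocalPort(), timeoutMillis);
                return false; // a banner arrived or the stream closed; not expected here
            } catch (SocketTimeoutException expected) {
                return true; // the read was abandoned after timeoutMillis
            }
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println("timed out: " + timesOutAgainstSilentServer(500));
    }
}
```

Note that SO_TIMEOUT limits each read rather than the whole handshake, so a real launcher would still want an overall deadline such as the agent startup timeout.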

          [JENKINS-48616] SSH Slaves should pass connection timeout to connection.connect() if there is an agent startup timeout

          Oleg Nenashev added a comment - https://github.com/jenkinsci/ssh-slaves-plugin/pull/80

          Code changed in jenkins
          User: Oleg Nenashev
          Path:
          src/main/java/hudson/plugins/sshslaves/SSHLauncher.java
          http://jenkins-ci.org/commit/ssh-slaves-plugin/618b3b366753dd2c607f4e34ec1d62e3c9996580
          Log:
          JENKINS-48616 - Pass launch timeout as Connection and KEx timeout to TrileadSSH

          Code changed in jenkins
          User: Jesse Glick
          Path:
          src/main/java/hudson/plugins/sshslaves/SSHLauncher.java
          http://jenkins-ci.org/commit/ssh-slaves-plugin/ce90954922228fe4106ad1977a7d3ecce599db70
          Log:
          Merge pull request #80 from oleg-nenashev/bug/JENKINS-48616

          JENKINS-48616 - Pass launch timeout as Connection and KEx timeout to TrileadSSH

          Compare: https://github.com/jenkinsci/ssh-slaves-plugin/compare/e2c87cf2e65e...ce9095492222

          Gregor Philp added a comment -

          Hi

          I upgraded this plugin to the latest version and we now have a problem with slaves randomly losing the SSH connection, which kills the jobs. I have rolled the plugin back.

          java.util.concurrent.TimeoutException: Ping started at 1515519675182 hasn't completed by 1515519915183
              at hudson.remoting.PingThread.ping(PingThread.java:134)
              at hudson.remoting.PingThread.ping(PingThread.java:90)

          It seems that this issue is not resolved, or maybe another one was introduced. We are running Jenkins version 2.89.2.

          thanks

          Gregor

          Oleg Nenashev added a comment -

          Hi Gregor,

          What was the previous plugin version? Could you also provide agent stack dumps for the timeframe before the outage?

          So far I am not sure it is related to this issue.

          Gregor Philp added a comment -

          Hi Oleg

          I can try to upgrade our test stack again and see if I can duplicate the issue. It seemed fairly random, but the slaves are terminated, so I cannot get any info from them. We use the EC2 plugin to launch slaves in AWS on demand, and then they are terminated.

          We have several stacks. On this one the previous plugin version was 1.22, so I rolled back to that.

          I have other stacks that are on 1.24 and I've not seen the problem there. Since they were not upgraded to 1.25, I assumed it was that version that broke.

          I'll post if I can duplicate the issue in our test stack.

          Oleg Nenashev added a comment -

          gjphilp, your issue sounds very similar to JENKINS-48865, though you do not use the weekly core. Maybe it is just a coincidence.

          Gregor Philp added a comment -

          Yeah, these are not VM or container slaves. They are physical EC2 Linux boxes in AWS. They are only terminated if not used for one hour. If anything runs on the slave, they are not terminated. We've run this setup for 2 years now and only saw that issue in the last couple of days, after some plugin updates.

          Oleg Nenashev added a comment -

          Fixed in 1.25

            oleg_nenashev Oleg Nenashev
            oleg_nenashev Oleg Nenashev
            Votes: 0
            Watchers: 4
