JENKINS-48616

SSH Slaves should pass connection timeout to connection.connect() if there is an agent startup timeout

    Details

      Description

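      Connections that hang indefinitely are likely a root cause of JENKINS-48613. The thread dump below shows the launcher thread blocked in a socket read while Trilead waits for the server's hello line: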

      "SSHLauncher.launch for 'myagent' node [#1]" #2565 prio=5 os_prio=0 tid=0x00007f080c1b1000 nid=0x35c runnable [0x00007f07b2c5c000]
         java.lang.Thread.State: RUNNABLE
          at java.net.SocketInputStream.socketRead0(Native Method)
          at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
          at java.net.SocketInputStream.read(SocketInputStream.java:171)
          at java.net.SocketInputStream.read(SocketInputStream.java:141)
          at java.net.SocketInputStream.read(SocketInputStream.java:224)
          at com.trilead.ssh2.transport.ClientServerHello.readLineRN(ClientServerHello.java:31)
          at com.trilead.ssh2.transport.ClientServerHello.<init>(ClientServerHello.java:68)
          at com.trilead.ssh2.transport.TransportManager.initialize(TransportManager.java:487)
          at com.trilead.ssh2.Connection.connect(Connection.java:774)
          - locked <0x0000000594003de0> (a com.trilead.ssh2.Connection)
          at com.trilead.ssh2.Connection.connect(Connection.java:703)
          - locked <0x0000000594003de0> (a com.trilead.ssh2.Connection)
          at com.trilead.ssh2.Connection.connect(Connection.java:617)
          - locked <0x0000000594003de0> (a com.trilead.ssh2.Connection)
          at hudson.plugins.sshslaves.SSHLauncher.openConnection(SSHLauncher.java:1302)
          at hudson.plugins.sshslaves.SSHLauncher$2.call(SSHLauncher.java:814)
          at hudson.plugins.sshslaves.SSHLauncher$2.call(SSHLauncher.java:803)
          at java.util.concurrent.FutureTask.run(FutureTask.java:266)
          at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
          at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
          at java.lang.Thread.run(Thread.java:748)
      

      The Trilead SSH API allows passing timeouts, so we should use that, at least when an agent startup timeout is specified.
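
To illustrate the failure mode in the thread dump, here is a minimal sketch that uses plain java.net.Socket rather than Trilead or the plugin code. It assumes a local server that accepts the TCP connection but never sends its identification line, which is roughly the situation in which ClientServerHello.readLineRN blocks. The class and method names (BannerReadTimeoutDemo, bannerReadTimesOut) are hypothetical and exist only for this example. With a read timeout set, the blocked read fails with SocketTimeoutException instead of hanging forever.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not part of the SSH Slaves plugin or Trilead.
public class BannerReadTimeoutDemo {

    /**
     * Connects to a local server that completes the TCP handshake but never
     * writes anything, then tries to read one line from it.
     * Returns true if the read was aborted by the socket read timeout.
     */
    static boolean bannerReadTimesOut(int connectTimeoutMs, int readTimeoutMs) throws IOException {
        // Port 0 asks the OS for any free port. The server never calls accept()
        // or writes data; the kernel still completes the handshake via the backlog.
        try (ServerSocket silentServer = new ServerSocket(0, 1, InetAddress.getLoopbackAddress());
             Socket client = new Socket()) {

            // Bounds the TCP connection attempt.
            client.connect(
                    new InetSocketAddress(silentServer.getInetAddress(), silentServer.getLocalPort()),
                    connectTimeoutMs);

            // Bounds every blocking read. The default of 0 means "wait forever",
            // which corresponds to the hang seen in socketRead0.
            client.setSoTimeout(readTimeoutMs);

            BufferedReader in = new BufferedReader(
                    new InputStreamReader(client.getInputStream(), StandardCharsets.US_ASCII));
            try {
                in.readLine(); // stands in for reading the server's hello line
                return false;  // the server answered or closed the connection
            } catch (SocketTimeoutException e) {
                return true;   // the read gave up after readTimeoutMs
            }
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println("read timed out: " + bannerReadTimesOut(2000, 300));
    }
}
```

In the plugin itself the equivalent change would be to call a Connection.connect() overload that accepts timeouts, with values derived from the agent startup timeout; how those values are chosen is not specified in this issue.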

            Activity

            oleg_nenashev Oleg Nenashev added a comment -

            Hi Gregor,

            What was the previous plugin version? Could you also provide agent stack dumps for the timeframe before the outage?

            So far I am not sure it is related to this issue.

            gjphilp Gregor Philp added a comment -

            Hi Oleg

            I can try to upgrade our test stack again and see if I can duplicate the issue. It seemed fairly random, but those slaves are terminated, so I cannot get any info from them. We use the EC2 plugin to launch slaves in AWS on demand and then they are terminated.

            We have several stacks; on this one the previous plugin version was 1.22, so I rolled back to that.

            I have other stacks that are on 1.24 and I've not seen the problem there. Since they were not upgraded to 1.25, I assumed it was that version that broke.

            I'll post if I can duplicate the issue in our test stack.

            oleg_nenashev Oleg Nenashev added a comment -

            Gregor Philp, your issue sounds very similar to JENKINS-48865, though you do not use the weekly core. Maybe it is just a coincidence.

            gjphilp Gregor Philp added a comment -

            Yeah, these are not VM or container slaves. They are physical EC2 Linux boxes in AWS. They are only terminated if not used for one hour. If anything runs on the slave, they are not terminated. We've run this setup for 2 years now and only saw that issue in the last couple of days, after some plugin updates.

            oleg_nenashev Oleg Nenashev added a comment -

            Fixed in 1.25


              People

              Assignee:
              oleg_nenashev Oleg Nenashev
              Reporter:
              oleg_nenashev Oleg Nenashev