JENKINS-48616

SSH Slaves should pass connection timeout to connection.connect() if there is an agent startup timeout

Details

    Description

      Infinite hanging of connections is likely a root cause of JENKINS-48613.

      "SSHLauncher.launch for 'myagent' node [#1]" #2565 prio=5 os_prio=0 tid=0x00007f080c1b1000 nid=0x35c runnable [0x00007f07b2c5c000]
         java.lang.Thread.State: RUNNABLE
          at java.net.SocketInputStream.socketRead0(Native Method)
          at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
          at java.net.SocketInputStream.read(SocketInputStream.java:171)
          at java.net.SocketInputStream.read(SocketInputStream.java:141)
          at java.net.SocketInputStream.read(SocketInputStream.java:224)
          at com.trilead.ssh2.transport.ClientServerHello.readLineRN(ClientServerHello.java:31)
          at com.trilead.ssh2.transport.ClientServerHello.<init>(ClientServerHello.java:68)
          at com.trilead.ssh2.transport.TransportManager.initialize(TransportManager.java:487)
          at com.trilead.ssh2.Connection.connect(Connection.java:774)
          - locked <0x0000000594003de0> (a com.trilead.ssh2.Connection)
          at com.trilead.ssh2.Connection.connect(Connection.java:703)
          - locked <0x0000000594003de0> (a com.trilead.ssh2.Connection)
          at com.trilead.ssh2.Connection.connect(Connection.java:617)
          - locked <0x0000000594003de0> (a com.trilead.ssh2.Connection)
          at hudson.plugins.sshslaves.SSHLauncher.openConnection(SSHLauncher.java:1302)
          at hudson.plugins.sshslaves.SSHLauncher$2.call(SSHLauncher.java:814)
          at hudson.plugins.sshslaves.SSHLauncher$2.call(SSHLauncher.java:803)
          at java.util.concurrent.FutureTask.run(FutureTask.java:266)
          at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
          at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
          at java.lang.Thread.run(Thread.java:748)
      

      The Trilead SSH API allows passing timeouts to connect(), so we should use them, at least when an agent startup timeout is specified.
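
      The thread dump above is blocked in SocketInputStream.socketRead0 while waiting for the server's SSH identification line. The sketch below reproduces that situation with plain java.net sockets only (it is not the plugin's code and does not use Trilead): a peer completes the TCP handshake but never sends anything, and a read timeout set via Socket.setSoTimeout turns the indefinite hang into a SocketTimeoutException. The class name, method names, and timeout values are hypothetical and chosen for illustration.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

// Illustrative sketch only: plain sockets standing in for the SSH banner read.
class BannerReadTimeoutSketch {

    /**
     * Connects to the peer and waits for its first byte (the "banner").
     * Returns true if the wait was cut short by the read timeout.
     */
    static boolean firstReadTimesOut(InetSocketAddress peer,
                                     int connectTimeoutMs,
                                     int readTimeoutMs) throws IOException {
        try (Socket socket = new Socket()) {
            // Bound the TCP connect itself.
            socket.connect(peer, connectTimeoutMs);
            // Bound every subsequent blocking read; without this call,
            // read() blocks indefinitely if the peer stays silent.
            socket.setSoTimeout(readTimeoutMs);
            InputStream in = socket.getInputStream();
            try {
                in.read();
                return false; // the peer sent data or closed the stream
            } catch (SocketTimeoutException e) {
                return true;  // the peer stayed silent past the deadline
            }
        }
    }

    /** Runs the client against a local listener that never sends a banner. */
    static boolean demo() {
        // The listener never calls accept(); the OS backlog still completes
        // the TCP handshake, so the client connects and then hears nothing.
        try (ServerSocket silentPeer =
                 new ServerSocket(0, 1, InetAddress.getLoopbackAddress())) {
            InetSocketAddress peer = new InetSocketAddress(
                silentPeer.getInetAddress(), silentPeer.getLocalPort());
            return firstReadTimesOut(peer, 1000, 200);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("read timed out: " + demo());
    }
}
```

      In the plugin, the analogous change would be to hand the configured agent startup timeout to the Trilead connect() call rather than setting it on a raw socket. Which connect() overload to use depends on the Trilead version and is not shown here.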

          Activity

            oleg_nenashev Oleg Nenashev added a comment -

            Hi Gregor,

            What was the previous plugin version? Could you also provide agent stack dumps for the timeframe before the outage?

            I am not sure it is related to this issue so far

            gjphilp Gregor Philp added a comment -

            Hi Oleg

            I can try to upgrade our test stack again and see if I can duplicate the issue.  It seemed fairly random, but the slaves are terminated, so I cannot get any info from them.  We use the EC2 plugin to launch slaves in AWS on demand, and then they are terminated.

            We have several stacks; on this one the previous plugin version was 1.22, so I rolled back to that.

            I have other stacks that are on 1.24 and I've not seen the problem and since they were not upgraded to 1.25, I assumed it was that version that broke.

             

            I'll post if I can duplicate the issue in our test stack.

            oleg_nenashev Oleg Nenashev added a comment -

            gjphilp your issue sounds very similar to JENKINS-48865, though you do not use the weekly core. Maybe it is just a coincidence.

            gjphilp Gregor Philp added a comment -

            Yeah, these are not VM or container slaves.  They are physical EC2 Linux boxes in AWS.  They are only terminated if not used for one hour.  If anything runs on the slave, they are not terminated.  We've run this setup for 2 years now and only saw that issue in the last couple of days, after some plugin updates.

            oleg_nenashev Oleg Nenashev added a comment -

            Fixed in 1.25


            People

              oleg_nenashev Oleg Nenashev