Jenkins / JENKINS-53810

Launch Agents fails with ERROR: null java.util.concurrent.CancellationException

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • ssh-slaves-1.30.0

      Launching node/agent fails with

      ERROR: null
      java.util.concurrent.CancellationException

      We have a large number of jobs in the queue which get assigned to slaves created by the Docker plugin. Even if we create a slave ourselves and try to launch the agent, it fails.
      Note: the slave image adheres to all the requirements and works well when there is no huge queue.

      Executor Status
      SSHLauncher{host='9.47.78.144', port=32870, credentialsId='slave-test', jvmOptions='', javaPath='', prefixStartSlaveCmd='', suffixStartSlaveCmd='', launchTimeoutSeconds=210, maxNumRetries=10, retryWaitTime=15, sshHostKeyVerificationStrategy=hudson.plugins.sshslaves.verifiers.NonVerifyingKeyVerificationStrategy, tcpNoDelay=true, trackCredentials=true}
      [09/27/18 02:53:32] [SSH] Opening SSH connection to 9.47.78.144:32870.
      [09/27/18 02:53:32] [SSH] WARNING: SSH Host Keys are not being verified. Man-in-the-middle attacks may be possible against this connection.
      [09/27/18 02:53:32] [SSH] Authentication successful.
      [09/27/18 02:53:32] [SSH] The remote user's environment is:
      BASH=/usr/bin/bash
      BASHOPTS=cmdhist:expand_aliases:extquote:force_fignore:hostcomplete:interactive_comments:progcomp:promptvars:sourcepath
      BASH_ALIASES=()
      BASH_ARGC=()
      BASH_ARGV=()
      BASH_CMDS=()
      BASH_EXECUTION_STRING=set
      BASH_LINENO=()
      BASH_SOURCE=()
      BASH_VERSINFO=([0]="4" [1]="2" [2]="46" [3]="1" [4]="release" [5]="s390x-ibm-linux-gnu")
      BASH_VERSION='4.2.46(1)-release'
      DIRSTACK=()
      EUID=1000
      GROUPS=()
      HOME=/home/test
      HOSTNAME=01695f4aae73
      HOSTTYPE=s390x
      IFS=$' \t\n'
      LESSOPEN='||/usr/bin/lesspipe.sh %s'
      LOGNAME=test
      MACHTYPE=s390x-ibm-linux-gnu
      MAIL=/var/mail/test
      OPTERR=1
      OPTIND=1
      OSTYPE=linux-gnu
      PATH=/usr/local/bin:/usr/bin
      PIPESTATUS=([0]="0")
      PPID=13
      PS4='+ '
      PWD=/home/test
      SHELL=/bin/bash
      SHELLOPTS=braceexpand:hashall:interactive-comments
      SHLVL=1
      SSH_CLIENT='9.42.27.56 44378 22'
      SSH_CONNECTION='9.42.27.56 44378 172.17.0.2 22'
      TERM=dumb
      UID=1000
      USER=test
      _=sudo
      [09/27/18 02:53:32] [SSH] Checking java version of /home/test/jdk/bin/java
      Couldn't figure out the Java version of /home/test/jdk/bin/java
      bash: /home/test/jdk/bin/java: No such file or directory
      
      [09/27/18 02:53:33] [SSH] Checking java version of java
      [09/27/18 02:53:34] [SSH] java -version returned 1.8.0_151.
      [09/27/18 02:53:34] [SSH] Starting sftp client.
      [09/27/18 02:53:34] [SSH] Copying latest remoting.jar...
      [09/27/18 02:53:36] [SSH] Copied 776,265 bytes.
      Expanded the channel window size to 4MB
      [09/27/18 02:53:36] [SSH] Starting agent process: cd "/home/test" && java  -jar remoting.jar -workDir /home/test
      Sep 27, 2018 6:54:09 AM org.jenkinsci.remoting.engine.WorkDirManager initializeWorkDir
      INFO: Using /home/test/remoting as a remoting work directory
      Both error and output logs will be printed to /home/test/remoting
      ERROR: null
      java.util.concurrent.CancellationException
      	at java.util.concurrent.FutureTask.report(FutureTask.java:121)
      	at java.util.concurrent.FutureTask.get(FutureTask.java:192)
      	at hudson.plugins.sshslaves.SSHLauncher.launch(SSHLauncher.java:904)
      	at hudson.slaves.DelegatingComputerLauncher.launch(DelegatingComputerLauncher.java:64)
      	at io.jenkins.docker.connector.DockerComputerConnector$1.launch(DockerComputerConnector.java:117)
      	at hudson.slaves.SlaveComputer$1.call(SlaveComputer.java:294)
      	at jenkins.util.ContextResettingExecutorService$2.call(ContextResettingExecutorService.java:46)
      	at jenkins.security.ImpersonatingExecutorService$2.call(ImpersonatingExecutorService.java:71)
      	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
      	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
      	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
      	at java.lang.Thread.run(Thread.java:748)
      [09/27/18 02:57:02] Launch failed - cleaning up connection
      Slave JVM has not reported exit code. Is it still running?
      [09/27/18 02:57:02] [SSH] Connection closed.
      
      
      


          Oleg Nenashev added a comment -

          Also CC ifernandezcalvo. It looks rather like an SSH Slaves plugin issue


          Ivan Fernandez Calvo added a comment - I am 90% sure that it is related to https://issues.jenkins-ci.org/browse/JENKINS-49235; there is a workaround: https://github.com/jenkinsci/ssh-slaves-plugin/blob/master/doc/TROUBLESHOOTING.md#threads-stuck-at-credentialsprovidertrackall
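The linked workaround amounts to passing a JVM system property to the Jenkins controller at startup; a minimal sketch, assuming a plain `java` launch (the war path is illustrative and depends on the installation):

```shell
# Workaround from the ssh-slaves TROUBLESHOOTING doc: disable
# credentials tracking in the SSH launcher. The war path below is
# illustrative; packaged installs set JVM options in the service config.
java -Dhudson.plugins.sshslaves.SSHLauncher.trackCredentials=false \
     -jar /usr/share/jenkins/jenkins.war
```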

          Durgadas Kamath added a comment -

          ifernandezcalvo I tried the above workaround but that didn't solve the problem.

          So you set the property `-Dhudson.plugins.sshslaves.SSHLauncher.trackCredentials=false` in your Jenkins instance's JVM parameters and the issue persists. In that case I need this info to try to understand/replicate whatever is happening: https://github.com/jenkinsci/ssh-slaves-plugin/blob/master/doc/TROUBLESHOOTING.md#common-info-needed-to-troubleshooting-a-bug. What I see in the log is that the agent tries to connect and after 4 minutes it is killed (probably by the ping thread), but the connection never seems to end. You said that this happens when you have a huge queue; we will probably need a thread dump of the instance when the issue happens, to see which threads are blocked.

          https://wiki.jenkins.io/display/JENKINS/Obtaining+a+thread+dump
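Capturing the dump typically comes down to the JDK tools described on that wiki page; a sketch, assuming a JDK (not just a JRE) is on the PATH:

```shell
# Locate the Jenkins JVM, then capture a thread dump with jstack.
jps -l                                         # lists Java process IDs
jstack -l <jenkins-pid> > jenkins-threads.txt  # <jenkins-pid> from jps output
# Alternatively, 'kill -3 <jenkins-pid>' writes the dump to the JVM's stdout log.
```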


          yong wu added a comment -

          I also ran into a similar problem while adding a node (SLES 12.3). The Java version on the slave was 1.8; not sure if this is related to the ssh-slaves plugin or not...


          Ivan Fernandez Calvo added a comment -

          xman_pires Could you grab a thread dump while the agent is stuck trying to start? I see in the log that you have about a minute to get it.

          https://wiki.jenkins.io/display/JENKINS/Obtaining+a+thread+dump


          Gerd Hoffmann added a comment -

          I see the agent fail with the same stacktrace, thread dump below

          --------------------------------- cut here ------------------------------
          [11/29/18 14:35:45] [SSH] Checking java version of java
          [11/29/18 14:35:45] [SSH] java -version returned 1.8.0_191.
          [11/29/18 14:35:45] [SSH] Starting sftp client.
          [11/29/18 14:35:45] [SSH] Copying latest remoting.jar...
          [11/29/18 14:35:45] [SSH] Copied 776,717 bytes.
          Expanded the channel window size to 4MB
          [11/29/18 14:35:45] [SSH] Starting agent process: cd "/home/jenkins" && java -jar remoting.jar -workDir /home/jenkins

          Nov 29, 2018 2:35:50 PM org.jenkinsci.remoting.engine.WorkDirManager initializeWorkDir
          INFO: Using /home/jenkins/remoting as a remoting work directory
          Both error and output logs will be printed to /home/jenkins/remoting

          2018-11-29 14:36:18
          Full thread dump OpenJDK Zero VM (25.191-b12 interpreted mode):

          "Service Thread" #5 daemon prio=9 os_prio=0 tid=0xb62d2cb8 nid=0x6692 runnable [0x00000000]
          java.lang.Thread.State: RUNNABLE

          "Signal Dispatcher" #4 daemon prio=9 os_prio=0 tid=0xb62d0530 nid=0x6691 waiting on condition [0x00000000]
          java.lang.Thread.State: RUNNABLE

          "Finalizer" #3 daemon prio=8 os_prio=0 tid=0xb62b4ba0 nid=0x6690 in Object.wait() [0xa34fe000]
          java.lang.Thread.State: WAITING (on object monitor)
          at java.lang.Object.wait(Native Method)

          - waiting on <0xa9718618> (a java.lang.ref.ReferenceQueue$Lock)
          at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:144)
          - locked <0xa9718618> (a java.lang.ref.ReferenceQueue$Lock)
          at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:165)
          at java.lang.ref.Finalizer$FinalizerThread.run(Finalizer.java:216)

          "Reference Handler" #2 daemon prio=10 os_prio=0 tid=0xb62b2160 nid=0x668f in Object.wait() [0xa37fe000]
          java.lang.Thread.State: WAITING (on object monitor)
          at java.lang.Object.wait(Native Method)

          - waiting on <0xa97187b8> (a java.lang.ref.Reference$Lock)
          at java.lang.Object.wait(Object.java:502)
          at java.lang.ref.Reference.tryHandlePending(Reference.java:191)
          - locked <0xa97187b8> (a java.lang.ref.Reference$Lock)
          at java.lang.ref.Reference$ReferenceHandler.run(Reference.java:153)

          "main" #1 prio=5 os_prio=0 tid=0xb6246b10 nid=0x668d runnable [0xb6562000]
          java.lang.Thread.State: RUNNABLE
          at java.util.regex.Pattern$CharProperty.match(Pattern.java:3782)
          at java.util.regex.Pattern$Curly.match0(Pattern.java:4262)
          at java.util.regex.Pattern$Curly.match(Pattern.java:4236)
          at java.util.regex.Pattern$CharProperty.match(Pattern.java:3779)
          at java.util.regex.Pattern$GroupHead.match(Pattern.java:4660)
          at java.util.regex.Pattern$Loop.match(Pattern.java:4787)
          at java.util.regex.Pattern$GroupTail.match(Pattern.java:4719)
          at java.util.regex.Pattern$Branch.match(Pattern.java:4604)
          at java.util.regex.Pattern$Curly.match0(Pattern.java:4274)
          at java.util.regex.Pattern$Curly.match(Pattern.java:4236)
          at java.util.regex.Pattern$CharProperty.match(Pattern.java:3779)
          at java.util.regex.Pattern$GroupHead.match(Pattern.java:4660)
          at java.util.regex.Pattern$Loop.match(Pattern.java:4787)
          at java.util.regex.Pattern$GroupTail.match(Pattern.java:4719)
          at java.util.regex.Pattern$Branch.match(Pattern.java:4604)
          at java.util.regex.Pattern$Curly.match0(Pattern.java:4274)
          at java.util.regex.Pattern$Curly.match(Pattern.java:4236)
          at java.util.regex.Pattern$CharProperty.match(Pattern.java:3779)
          at java.util.regex.Pattern$GroupHead.match(Pattern.java:4660)
          at java.util.regex.Pattern$Loop.match(Pattern.java:4787)
          at java.util.regex.Pattern$GroupTail.match(Pattern.java:4719)
          at java.util.regex.Pattern$Branch.match(Pattern.java:4604)
          at java.util.regex.Pattern$Curly.match0(Pattern.java:4274)
          at java.util.regex.Pattern$Curly.match(Pattern.java:4236)
          at java.util.regex.Pattern$CharProperty.match(Pattern.java:3779)
          at java.util.regex.Pattern$GroupHead.match(Pattern.java:4660)
          at java.util.regex.Pattern$Loop.match(Pattern.java:4787)
          at java.util.regex.Pattern$GroupTail.match(Pattern.java:4719)
          at java.util.regex.Pattern$Branch.match(Pattern.java:4604)
          at java.util.regex.Pattern$Curly.match0(Pattern.java:4274)
          at java.util.regex.Pattern$Curly.match(Pattern.java:4236)
          at java.util.regex.Pattern$CharProperty.match(Pattern.java:3779)
          at java.util.regex.Pattern$GroupHead.match(Pattern.java:4660)
          at java.util.regex.Pattern$Loop.match(Pattern.java:4787)
          at java.util.regex.Pattern$GroupTail.match(Pattern.java:4719)
          at java.util.regex.Pattern$Branch.match(Pattern.java:4604)
          at java.util.regex.Pattern$Curly.match0(Pattern.java:4274)
          at java.util.regex.Pattern$Curly.match(Pattern.java:4236)
          at java.util.regex.Pattern$CharProperty.match(Pattern.java:3779)
          at java.util.regex.Pattern$GroupHead.match(Pattern.java:4660)
          at java.util.regex.Pattern$Loop.match(Pattern.java:4787)
          at java.util.regex.Pattern$GroupTail.match(Pattern.java:4719)
          at java.util.regex.Pattern$BranchConn.match(Pattern.java:4570)
          at java.util.regex.Pattern$GroupTail.match(Pattern.java:4719)
          at java.util.regex.Pattern$BranchConn.match(Pattern.java:4570)
          at java.util.regex.Pattern$Slice.match(Pattern.java:3974)
          at java.util.regex.Pattern$Branch.match(Pattern.java:4606)
          at java.util.regex.Pattern$GroupHead.match(Pattern.java:4660)
          at java.util.regex.Pattern$Branch.match(Pattern.java:4606)
          at java.util.regex.Pattern$Curly.match0(Pattern.java:4274)
          at java.util.regex.Pattern$Curly.match(Pattern.java:4236)
          at java.util.regex.Pattern$CharProperty.match(Pattern.java:3779)
          at java.util.regex.Pattern$GroupHead.match(Pattern.java:4660)
          at java.util.regex.Pattern$Loop.match(Pattern.java:4787)
          at java.util.regex.Pattern$GroupTail.match(Pattern.java:4719)
          at java.util.regex.Pattern$Branch.match(Pattern.java:4604)
          at java.util.regex.Pattern$Curly.match0(Pattern.java:4274)
          at java.util.regex.Pattern$Curly.match(Pattern.java:4236)
          at java.util.regex.Pattern$CharProperty.match(Pattern.java:3779)
          at java.util.regex.Pattern$GroupHead.match(Pattern.java:4660)
          at java.util.regex.Pattern$Loop.match(Pattern.java:4787)
          at java.util.regex.Pattern$GroupTail.match(Pattern.java:4719)
          at java.util.regex.Pattern$Branch.match(Pattern.java:4604)
          at java.util.regex.Pattern$Curly.match0(Pattern.java:4274)
          at java.util.regex.Pattern$Curly.match(Pattern.java:4236)
          at java.util.regex.Pattern$CharProperty.match(Pattern.java:3779)
          at java.util.regex.Pattern$GroupHead.match(Pattern.java:4660)
          at java.util.regex.Pattern$Loop.match(Pattern.java:4787)
          at java.util.regex.Pattern$GroupTail.match(Pattern.java:4719)
          at java.util.regex.Pattern$BranchConn.match(Pattern.java:4570)
          at java.util.regex.Pattern$GroupTail.match(Pattern.java:4719)
          at java.util.regex.Pattern$BranchConn.match(Pattern.java:4570)
          at java.util.regex.Pattern$Slice.match(Pattern.java:3974)
          at java.util.regex.Pattern$Branch.match(Pattern.java:4606)
          at java.util.regex.Pattern$GroupHead.match(Pattern.java:4660)
          at java.util.regex.Pattern$Branch.match(Pattern.java:4606)
          at java.util.regex.Pattern$Curly.match0(Pattern.java:4274)
          at java.util.regex.Pattern$Curly.match(Pattern.java:4236)
          at java.util.regex.Pattern$CharProperty.match(Pattern.java:3779)
          at java.util.regex.Pattern$GroupHead.match(Pattern.java:4660)
          at java.util.regex.Pattern$Loop.match(Pattern.java:4787)
          at java.util.regex.Pattern$GroupTail.match(Pattern.java:4719)
          at java.util.regex.Pattern$Branch.match(Pattern.java:4604)
          at java.util.regex.Pattern$Curly.match0(Pattern.java:4274)
          at java.util.regex.Pattern$Curly.match(Pattern.java:4236)
          at java.util.regex.Pattern$CharProperty.match(Pattern.java:3779)
          at java.util.regex.Pattern$GroupHead.match(Pattern.java:4660)
          at java.util.regex.Pattern$Loop.matchInit(Pattern.java:4803)
          at java.util.regex.Pattern$Prolog.match(Pattern.java:4743)
          at java.util.regex.Pattern$GroupHead.match(Pattern.java:4660)
          at java.util.regex.Pattern$BmpCharProperty.match(Pattern.java:3800)
          at java.util.regex.Pattern$Begin.match(Pattern.java:3527)
          at java.util.regex.Matcher.match(Matcher.java:1270)
          at java.util.regex.Matcher.matches(Matcher.java:604)
          at hudson.remoting.ClassFilter$RegExpClassFilter.add(ClassFilter.java:270)
          at hudson.remoting.ClassFilter$RegExpClassFilter.<init>(ClassFilter.java:264)
          at hudson.remoting.ClassFilter.createDefaultInstance(ClassFilter.java:191)
          at hudson.remoting.ClassFilter.<clinit>(ClassFilter.java:147)
          at hudson.remoting.ChannelBuilder.<init>(ChannelBuilder.java:59)
          at hudson.remoting.Launcher.main(Launcher.java:734)
          at hudson.remoting.Launcher.runWithStdinStdout(Launcher.java:691)
          at hudson.remoting.Launcher.run(Launcher.java:373)
          at hudson.remoting.Launcher.main(Launcher.java:283)

          "VM Thread" os_prio=0 tid=0xb629ad60 nid=0x668e runnable

          "VM Periodic Task Thread" os_prio=0 tid=0xb62d4590 nid=0x6693 waiting on condition

          JNI global references: 316

          Heap
          def new generation total 4992K, used 2007K [0xa4000000, 0xa4560000, 0xa9550000)
          eden space 4480K, 44% used [0xa4000000, 0xa41f5c30, 0xa4460000)
          from space 512K, 0% used [0xa4460000, 0xa4460000, 0xa44e0000)
          to space 512K, 0% used [0xa44e0000, 0xa44e0000, 0xa4560000)
          tenured generation total 10944K, used 2165K [0xa9550000, 0xaa000000, 0xb4000000)
          the space 10944K, 19% used [0xa9550000, 0xa976d588, 0xa976d600, 0xaa000000)
          Metaspace used 5424K, capacity 5565K, committed 5680K, reserved 6448K


          Ivan Fernandez Calvo added a comment -

          jeffret jglick Could the ClassFilter block the agent launch? The stack trace that kraxel72 posted in the previous comment is really near the error, but I am not sure whether it could be the root cause.

          Jesse Glick added a comment -

          Not sure what would cause that. This is within ClassFilter.createDefaultInstance using the DEFAULT_PATTERNS, so it should not behave differently in one environment than another.


          Jeff Thompson added a comment -

          I can't see how ClassFilter could cause that, either. It may take a little bit of time to process those patterns, though it still shouldn't be very long.


          Jesper Jensen added a comment -

          Just had the same problem as described. The workaround I found was to change the timeout to 60 seconds (default 10). The first time it took almost 60 seconds, the second time only a few seconds. Hope this helps!


          Gerd Hoffmann added a comment -

          Interesting. Raising the timeout helped in my case too.


          Ivan Fernandez Calvo added a comment - - edited

          Do you mean "Connection Timeout in Seconds"? The default value is 210 seconds. This time is used across all retries, so it should be enough to cover them all: if you have 6 retries and a wait of 10 seconds between them, this timeout should be 6*10 plus the timeout of each connection. So if you want to wait 30 seconds on each connection, you would set this setting to 6*10 + 30*6 = 240.
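The arithmetic above can be sketched as follows (the function and parameter names are illustrative, not the plugin's actual field names):

```python
# Sketch of the launch-timeout arithmetic described above: the total
# budget must cover the waits between retries plus each attempt's
# own connection timeout.
def total_launch_timeout(retries, retry_wait_s, per_attempt_timeout_s):
    """Total budget = waits between retries + each attempt's timeout."""
    return retries * retry_wait_s + retries * per_attempt_timeout_s

# 6 retries, 10 s between retries, 30 s per connection attempt
print(total_launch_timeout(6, 10, 30))  # -> 240
```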

          On this issue the settings are the defaults, so it is not related to the timeout; see the launchTimeoutSeconds value:

          SSHLauncher{host='9.47.78.144', port=32870, credentialsId='slave-test', jvmOptions='', javaPath='', prefixStartSlaveCmd='', suffixStartSlaveCmd='', launchTimeoutSeconds=210, maxNumRetries=10, retryWaitTime=15, 
          

          There are a couple of improvements that would change this behavior, JENKINS-48617 and JENKINS-48618; for me it is really confusing.

          In any case, 10 seconds is much shorter than that timeout; it may be worth validating that the setting is not lower than 60 and bumping any value below that up to 60 (JENKINS-55858).

          https://github.com/jenkinsci/ssh-slaves-plugin/blob/master/doc/CONFIGURE.md#advanced-settings


Gerd Hoffmann added a comment -

I've actually changed two values: one from 210 to 600. My UI is in German, so I'm not fully sure what the English label is, but "Connection Timeout in Seconds" sounds right.

The other one is two lines below the first, switched from 10 to 60. According to the label, this is the interval between connection attempts.

Michael Zanetti added a comment -

Ok... I ran into the same. I'd like to clarify the confusion about why increasing "Connection Timeout in Seconds" actually helps with this issue, even though it should only affect connection establishment, not already-working connections.

I have noticed that whenever I ran into this issue, there was first a failed connection establishment. Jenkins would then retry, succeed on the second attempt, and everything seems to work; jobs start building. However, after about 3-4 minutes they fail with the connection breakdown above. This does indeed match the 210 seconds, but why?

It seems there must be a bug in the SSH connector code: when a connection attempt fails, Jenkins retries, but it does not seem to stop the timer running for the previous failed attempt. The second attempt may succeed, but the timeout timer for the first is still active, and when it runs out it kills the currently working connection, causing builds to fail.

So, increasing the "Connection Timeout in Seconds" does work around this issue; the actual cause, however, seems to be somewhere in the code that handles SSH connections.
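The bug pattern described above, and the shape of the fix, can be sketched with a watchdog timer per attempt (a hypothetical model of the suspected behavior, not the plugin's actual code): the watchdog scheduled for attempt N must be cancelled when attempt N ends, otherwise it can fire later and tear down the connection that a subsequent attempt established.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;
import java.util.function.IntPredicate;

public class RetryWithWatchdog {
    static String connectWithRetry(int maxRetries, long timeoutMillis,
                                   IntPredicate attemptSucceeds) {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        try {
            for (int attempt = 1; attempt <= maxRetries; attempt++) {
                // Watchdog for THIS attempt; in the suspected bug, this task
                // was left running after a failed attempt.
                ScheduledFuture<?> watchdog = scheduler.schedule(
                        () -> { /* would tear down the channel here */ },
                        timeoutMillis, TimeUnit.MILLISECONDS);
                boolean ok = attemptSucceeds.test(attempt);
                // The crucial step: cancel the watchdog whether the attempt
                // passed or failed, so it cannot kill a later connection.
                watchdog.cancel(false);
                if (ok) {
                    return "connected on attempt " + attempt;
                }
            }
            return "gave up";
        } finally {
            scheduler.shutdownNow();
        }
    }

    public static void main(String[] args) {
        // First attempt fails, second succeeds -> "connected on attempt 2"
        System.out.println(connectWithRetry(10, 210_000, attempt -> attempt >= 2));
    }
}
```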

Ivan Fernandez Calvo added a comment -

This timeout is only used on a connection. The 3-4 minutes you see is the time it takes the ping thread to check the connection; if it cannot communicate with the agent, it breaks the channel. You have to review the exact exception shown in the Jenkins instance and the exceptions in the agent logs. I'd recommend enabling verbose logging in the SSH server to show the exact moment the disconnection happens, because it usually happens before the ping thread checks the connection; in that case, the problem is in the SSH configuration or in the network infrastructure.

Michael Zanetti added a comment - edited

I understand that this timeout is not supposed to do this. Still, I am pretty sure it does...

Ever since I increased it to an insanely high number, the issue is gone.

Ivan Fernandez Calvo added a comment -

mzanetti I've read the whole thing again, and I think this could be related to how the Docker plugin uses the SSHLauncher. I think it reuses the same object to launch all agents, and SSHLauncher is designed to manage one agent, not a fleet. I want to test something pretty similar to your configuration and conditions to see if I can replicate the issue and find a solution. Could you tell me what your agent configurations look like, and under which conditions the issue appears?

Arno Moonen added a comment - edited

We are not using Docker in our setup (we might have the Docker plugin installed, though), so I'm not sure this is really caused by the Docker plugin.

Currently we are not upgrading the SSH Slaves plugin, to prevent these issues (we're still on version 1.26).

Jesse Glick added a comment -

"how the Docker plugin uses the SSHLauncher, I think it reuses the same object to launch all agents"

It is supposed to be using SSHConnector, which is designed for clouds.

Ivan Fernandez Calvo added a comment -

AFAIU it uses the SSHLauncher:

https://github.com/jenkinsci/docker-plugin/blob/cde48970a0ccb2260afa290d020de79dda66ae4d/src/main/java/io/jenkins/docker/connector/DockerComputerSSHConnector.java#L457
https://github.com/jenkinsci/docker-plugin/blob/cde48970a0ccb2260afa290d020de79dda66ae4d/src/main/java/io/jenkins/docker/connector/DockerComputerSSHConnector.java#L264
https://github.com/jenkinsci/docker-plugin/blob/cde48970a0ccb2260afa290d020de79dda66ae4d/src/main/java/io/jenkins/docker/connector/DockerComputerSSHConnector.java#L359
https://github.com/jenkinsci/docker-plugin/blob/cde48970a0ccb2260afa290d020de79dda66ae4d/src/main/java/io/jenkins/docker/connector/DockerComputerSSHConnector.java#L407

It has its own connector, but it is not the one from the ssh-slaves-plugin, nor an extension of it or of ComputerConnector; I didn't really dig too much into the code.

https://github.com/jenkinsci/docker-plugin/blob/master/src/main/java/io/jenkins/docker/connector/DockerComputerSSHConnector.java#L61
https://github.com/jenkinsci/docker-plugin/blob/master/src/main/java/io/jenkins/docker/connector/DockerComputerConnector.java#L28

In any case, I'm changing how the timeout works: since we fixed the trilead-ssh2 lib to manage the timeout, the current behavior is no longer valid. We have to manage the timeout per connection, not per block of retries.

Michael Zanetti added a comment -

ifernandezcalvo

In reply to your question about when exactly this happens for me:

My setup is as follows: I use AWS EC2 instances which are started on demand using the "Launch agents via SSH" method. I also have an additional plugin installed which allows me to execute commands before the SSH connection.

In my custom script I use "aws ec2 launch-instance..." and then "aws ec2 instance-wait ..." to wait until the instance has booted up. This stalls the normal "Launch agents via SSH" procedure until the slaves are up and running. After that, it continues with the actual SSH call.

Now, this works fine in 90% of the cases. However, in some cases aws instance-wait returns before sshd on the slave is actually running, and the connection attempt fails with the message:

[07/29/19 11:57:14] [SSH] Opening SSH connection to ec2-obfuscated.eu-west-1.compute.amazonaws.com:22. Connection refused (Connection refused) SSH Connection failed with IOException: "Connection refused (Connection refused)", retrying in 15 seconds. There are 10 more retries left.

This waits for 15 seconds and tries again. The second attempt succeeds and all is fine; jobs start building.

However, this is the situation where the bug strikes. The SSH connector plugin did not stop the connection timeout for the first, failed connection. That timeout is still running, by default for another 3 minutes. The jobs start building, but when the 3 minutes have passed, the currently working SSH connection is killed because the SSH connector thinks it is still trying to establish the first connection.

Hope this makes it clear enough.

Ivan Fernandez Calvo added a comment -

Yep, it is really helpful. I'm fixing the retry method in https://issues.jenkins-ci.org/browse/JENKINS-58589, and I've found the connection was not closed in some cases, so now it forces the connection to close before retrying. Also, the timeout is passed to the connection and the weird global timeout disappears; now it is a timeout per connection.
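The fix described above can be sketched as a retry loop where each attempt gets its own connection and its own timeout, and the connection is always closed before retrying (an assumed shape for illustration, not the actual JENKINS-58589 patch; the `Conn` interface is a hypothetical stand-in for the SSH connection type):

```java
import java.io.IOException;
import java.util.function.Supplier;

public class PerConnectionRetry {
    // Hypothetical stand-in for the SSH connection; connect() throws on failure.
    interface Conn extends AutoCloseable {
        void connect(int timeoutSeconds) throws IOException;
        @Override void close();
    }

    static int launch(Supplier<Conn> newConnection,
                      int maxRetries, int perConnTimeoutSeconds) {
        for (int attempt = 1; attempt <= maxRetries; attempt++) {
            Conn conn = newConnection.get();
            try {
                conn.connect(perConnTimeoutSeconds); // timeout covers this attempt only
                return attempt; // success (a real implementation would hand back the live connection)
            } catch (IOException failed) {
                conn.close(); // force-close before the next retry, so no stale state survives
            }
        }
        return -1; // all retries exhausted
    }
}
```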

Ivan Fernandez Calvo added a comment -

I am pretty sure this is related to JENKINS-59764. I have made the PR to the docker-plugin with the solution; it needs trilead-ssh2-build-217-jenkins-15 on Jenkins Core 2.190 and earlier, or trilead-api 1.0.5 on Jenkins Core 2.190.1+.

pjdarton added a comment -

I'd be a bit wary of updating the docker-plugin so that it requires a fully up-to-date version of Jenkins. I believe users should be able to use the latest docker-plugin with a core Jenkins that's at least 6 months out of date. If we had a critical security defect that required the latest Jenkins core then, sure, I guess we'd have to do that, but for anything else it'd be best not to force everyone to update.

FYI, what the docker-plugin is trying to do is poll the docker container (that it just created) and wait until the container is ready to accept SSH connections, before allowing the main SSHLauncher code to start trying to connect. If the docker-plugin declared its provisioning "complete" and the slave stayed offline for any period, Jenkins would ask it to make more slave nodes than are required, causing resource wastage.
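The readiness check pjdarton describes can be sketched as a simple TCP probe (a rough, hypothetical helper, not the docker-plugin's actual code): poll the container's SSH port until it accepts a connection or a deadline passes.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

public class SshReadinessProbe {
    // Returns true once the port accepts a TCP connection, false if the
    // deadline passes first. A TCP accept only proves sshd is listening,
    // not that the SSH handshake will succeed, but it filters out the
    // "Connection refused" window seen in the logs above.
    static boolean waitForSshPort(String host, int port, long deadlineMillis,
                                  long pollIntervalMillis) throws InterruptedException {
        long deadline = System.currentTimeMillis() + deadlineMillis;
        while (System.currentTimeMillis() < deadline) {
            try (Socket socket = new Socket()) {
                socket.connect(new InetSocketAddress(host, port), 1000);
                return true; // the port is accepting connections
            } catch (IOException notReadyYet) {
                Thread.sleep(pollIntervalMillis); // back off and try again
            }
        }
        return false;
    }
}
```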

Assignee: Ivan Fernandez Calvo (ifernandezcalvo)
Reporter: Durgadas Kamath (durgadas)
Votes: 4
Watchers: 15
