Jenkins / JENKINS-53810

Launch Agents fails with ERROR: null java.util.concurrent.CancellationException

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • ssh-slaves-1.30.0

      Launching node/agent fails with

      ERROR: null
      java.util.concurrent.CancellationException

      We have a large number of jobs in the queue which get assigned to slaves created by the Docker plugin. Even if we try creating a slave and launching the agent manually, it fails.
      Note: the slave image adheres to all the requirements and works well when there is no large queue.

      Executor Status
      SSHLauncher{host='9.47.78.144', port=32870, credentialsId='slave-test', jvmOptions='', javaPath='', prefixStartSlaveCmd='', suffixStartSlaveCmd='', launchTimeoutSeconds=210, maxNumRetries=10, retryWaitTime=15, sshHostKeyVerificationStrategy=hudson.plugins.sshslaves.verifiers.NonVerifyingKeyVerificationStrategy, tcpNoDelay=true, trackCredentials=true}
      [09/27/18 02:53:32] [SSH] Opening SSH connection to 9.47.78.144:32870.
      [09/27/18 02:53:32] [SSH] WARNING: SSH Host Keys are not being verified. Man-in-the-middle attacks may be possible against this connection.
      [09/27/18 02:53:32] [SSH] Authentication successful.
      [09/27/18 02:53:32] [SSH] The remote user's environment is:
      BASH=/usr/bin/bash
      BASHOPTS=cmdhist:expand_aliases:extquote:force_fignore:hostcomplete:interactive_comments:progcomp:promptvars:sourcepath
      BASH_ALIASES=()
      BASH_ARGC=()
      BASH_ARGV=()
      BASH_CMDS=()
      BASH_EXECUTION_STRING=set
      BASH_LINENO=()
      BASH_SOURCE=()
      BASH_VERSINFO=([0]="4" [1]="2" [2]="46" [3]="1" [4]="release" [5]="s390x-ibm-linux-gnu")
      BASH_VERSION='4.2.46(1)-release'
      DIRSTACK=()
      EUID=1000
      GROUPS=()
      HOME=/home/test
      HOSTNAME=01695f4aae73
      HOSTTYPE=s390x
      IFS=$' \t\n'
      LESSOPEN='||/usr/bin/lesspipe.sh %s'
      LOGNAME=test
      MACHTYPE=s390x-ibm-linux-gnu
      MAIL=/var/mail/test
      OPTERR=1
      OPTIND=1
      OSTYPE=linux-gnu
      PATH=/usr/local/bin:/usr/bin
      PIPESTATUS=([0]="0")
      PPID=13
      PS4='+ '
      PWD=/home/test
      SHELL=/bin/bash
      SHELLOPTS=braceexpand:hashall:interactive-comments
      SHLVL=1
      SSH_CLIENT='9.42.27.56 44378 22'
      SSH_CONNECTION='9.42.27.56 44378 172.17.0.2 22'
      TERM=dumb
      UID=1000
      USER=test
      _=sudo
      [09/27/18 02:53:32] [SSH] Checking java version of /home/test/jdk/bin/java
      Couldn't figure out the Java version of /home/test/jdk/bin/java
      bash: /home/test/jdk/bin/java: No such file or directory
      
      [09/27/18 02:53:33] [SSH] Checking java version of java
      [09/27/18 02:53:34] [SSH] java -version returned 1.8.0_151.
      [09/27/18 02:53:34] [SSH] Starting sftp client.
      [09/27/18 02:53:34] [SSH] Copying latest remoting.jar...
      [09/27/18 02:53:36] [SSH] Copied 776,265 bytes.
      Expanded the channel window size to 4MB
      [09/27/18 02:53:36] [SSH] Starting agent process: cd "/home/test" && java  -jar remoting.jar -workDir /home/test
      Sep 27, 2018 6:54:09 AM org.jenkinsci.remoting.engine.WorkDirManager initializeWorkDir
      INFO: Using /home/test/remoting as a remoting work directory
      Both error and output logs will be printed to /home/test/remoting
      ERROR: null
      java.util.concurrent.CancellationException
      	at java.util.concurrent.FutureTask.report(FutureTask.java:121)
      	at java.util.concurrent.FutureTask.get(FutureTask.java:192)
      	at hudson.plugins.sshslaves.SSHLauncher.launch(SSHLauncher.java:904)
      	at hudson.slaves.DelegatingComputerLauncher.launch(DelegatingComputerLauncher.java:64)
      	at io.jenkins.docker.connector.DockerComputerConnector$1.launch(DockerComputerConnector.java:117)
      	at hudson.slaves.SlaveComputer$1.call(SlaveComputer.java:294)
      	at jenkins.util.ContextResettingExecutorService$2.call(ContextResettingExecutorService.java:46)
      	at jenkins.security.ImpersonatingExecutorService$2.call(ImpersonatingExecutorService.java:71)
      	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
      	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
      	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
      	at java.lang.Thread.run(Thread.java:748)
      [09/27/18 02:57:02] Launch failed - cleaning up connection
      Slave JVM has not reported exit code. Is it still running?
      [09/27/18 02:57:02] [SSH] Connection closed.
      
      
      


          Ivan Fernandez Calvo added a comment -

          This timeout is only used while establishing a connection. The 3-4 minutes you see is the time the Ping Thread takes to check the connection; if it cannot communicate with the agent, it breaks the channel. You have to review the exact exception shown in your Jenkins instance and the exceptions in the agent logs. I'd recommend enabling verbose logging in the SSH server to pinpoint the exact moment when the disconnection happens, because it usually happens before the ping thread checks the connection; in that case, the problem is in the SSH configuration or in the network infrastructure.
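          The ping-thread behaviour described above can be sketched in miniature: a periodic task checks the channel and breaks it when the agent stops answering. This is a hypothetical illustration only; all names (`Channel`, `PingWatchdogDemo`, `watch`) are invented and this is not the actual Jenkins remoting code.

```java
import java.util.concurrent.*;
import java.util.concurrent.atomic.AtomicBoolean;

// Minimal sketch of a ping watchdog: periodically check the channel and
// break it when the agent stops responding. Names are invented for
// illustration; the real logic lives in Jenkins remoting's ping thread.
public class PingWatchdogDemo {
    interface Channel {
        boolean ping();   // true if the agent answered
        void terminate(); // break the channel
    }

    static ScheduledFuture<?> watch(Channel ch, ScheduledExecutorService ses,
                                    long periodMillis) {
        return ses.scheduleAtFixedRate(() -> {
            if (!ch.ping()) {
                ch.terminate(); // no answer -> assume the agent is gone
            }
        }, periodMillis, periodMillis, TimeUnit.MILLISECONDS);
    }

    public static void main(String[] args) throws Exception {
        AtomicBoolean alive = new AtomicBoolean(true);
        AtomicBoolean terminated = new AtomicBoolean(false);
        Channel ch = new Channel() {
            public boolean ping() { return alive.get(); }
            public void terminate() { terminated.set(true); }
        };
        ScheduledExecutorService ses = Executors.newSingleThreadScheduledExecutor();
        watch(ch, ses, 50);
        Thread.sleep(120);      // agent healthy: channel is kept
        System.out.println("healthy: terminated=" + terminated.get());
        alive.set(false);       // agent stops responding
        Thread.sleep(150);      // the next ping check breaks the channel
        System.out.println("dead: terminated=" + terminated.get());
        ses.shutdownNow();
    }
}
```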

          Michael Zanetti added a comment - edited

          I understand that this timeout is not supposed to do this. Still, I am pretty sure it does.

          Ever since I increased it to an insanely high number, the issue has been gone.

          Ivan Fernandez Calvo added a comment -

          mzanetti I've read the whole thing again, and I think this could be related to how the Docker plugin uses SSHLauncher: I think it reuses the same object to launch all agents, while SSHLauncher is designed to manage one agent, not a fleet. I want to set up something similar to your configuration and conditions to see if I can replicate the issue and find a solution. Could you tell me what your agent configuration looks like, and under which conditions the issue appears?

          Arno Moonen added a comment - edited

          We are not using Docker in our setup (though we might have the Docker plugin installed), so I'm not sure this is really caused by the Docker plugin.

          For now we are holding off on upgrading the SSH Slaves plugin to avoid these issues (we're still on version 1.26).

          Jesse Glick added a comment -

          "how the Docker plugin uses the SSHLauncher, I think it reuses the same object to launch all agents"

          It is supposed to be using SSHConnector, which is designed for clouds.

          Ivan Fernandez Calvo added a comment -

          AFAIU it uses the SSHLauncher:
          https://github.com/jenkinsci/docker-plugin/blob/cde48970a0ccb2260afa290d020de79dda66ae4d/src/main/java/io/jenkins/docker/connector/DockerComputerSSHConnector.java#L457
          https://github.com/jenkinsci/docker-plugin/blob/cde48970a0ccb2260afa290d020de79dda66ae4d/src/main/java/io/jenkins/docker/connector/DockerComputerSSHConnector.java#L264
          https://github.com/jenkinsci/docker-plugin/blob/cde48970a0ccb2260afa290d020de79dda66ae4d/src/main/java/io/jenkins/docker/connector/DockerComputerSSHConnector.java#L359
          https://github.com/jenkinsci/docker-plugin/blob/cde48970a0ccb2260afa290d020de79dda66ae4d/src/main/java/io/jenkins/docker/connector/DockerComputerSSHConnector.java#L407

          It has its own connector, but it is not the one from the ssh-slaves plugin, nor an extension of it or of ComputerConnector; I haven't dug too deeply into the code.
          https://github.com/jenkinsci/docker-plugin/blob/master/src/main/java/io/jenkins/docker/connector/DockerComputerSSHConnector.java#L61
          https://github.com/jenkinsci/docker-plugin/blob/master/src/main/java/io/jenkins/docker/connector/DockerComputerConnector.java#L28

          In any case, I'm changing how the timeout works: since we fixed the trilead-ssh2 lib to manage the timeout, the current behavior is no longer valid. We have to manage the timeout per connection, not per block of retries.

          Michael Zanetti added a comment -

          ifernandezcalvo, in reply to your question on when exactly this happens for me:

          My setup is as follows: I use AWS EC2 instances which are started on demand using the "Launch agents via SSH" method. I also have an additional plugin installed which allows me to execute commands before the SSH connection.

          In my custom script I use "aws ec2 launch-instance..." and then "aws ec2 instance-wait ..." to wait until the instance has booted. This stalls the normal "Launch agents via SSH" procedure until the slave is up and running; after that, it continues with the actual SSH call.

          This works fine in 90% of cases. However, sometimes "aws ec2 instance-wait" returns before sshd on the slave is actually running, and the connection attempt fails with:

          [07/29/19 11:57:14] [SSH] Opening SSH connection to ec2-obfuscated.eu-west-1.compute.amazonaws.com:22. Connection refused (Connection refused) SSH Connection failed with IOException: "Connection refused (Connection refused)", retrying in 15 seconds. There are 10 more retries left.

          The launcher waits 15 seconds and tries again. The second attempt succeeds and all seems fine; jobs start building.

          However, this is the situation where the bug strikes. The SSH connector plugin did not stop the connection timeout for the first, failed connection. That timeout is still running, by default for another 3 minutes. The jobs start building, but when the 3 minutes have passed, the currently working SSH connection is killed, because the ssh-connector thinks it is still trying to establish the first connection.

          Hope this makes it clear enough.
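          The failure mode described in this comment can be reproduced in miniature: a timeout scheduled for the first, failed attempt is never cancelled, so it later fires against the healthy connection from the second attempt. This is a hypothetical sketch, not the plugin's actual code; all names are invented.

```java
import java.util.concurrent.*;
import java.util.concurrent.atomic.AtomicBoolean;

// Sketch of the reported bug: the launch timeout scheduled for a failed
// first connection attempt is never cancelled, so it later kills the
// healthy connection made by the second attempt. All names are invented.
public class StaleTimeoutDemo {
    public static void main(String[] args) throws Exception {
        ScheduledExecutorService ses = Executors.newSingleThreadScheduledExecutor();
        AtomicBoolean liveConnectionKilled = new AtomicBoolean(false);

        // Attempt 1: schedule a launch timeout, then fail ("Connection refused").
        // BUG: the timeout task is not cancelled when the attempt fails.
        ScheduledFuture<?> staleTimeout = ses.schedule(
                () -> liveConnectionKilled.set(true), // kills whatever is connected now
                200, TimeUnit.MILLISECONDS);

        // Attempt 2 (after the retry wait) succeeds; jobs start building...
        Thread.sleep(400); // ...until the stale timeout from attempt 1 fires

        System.out.println("killed=" + liveConnectionKilled.get()); // the bug: true
        // The fix is to cancel the timeout when its own attempt ends:
        // staleTimeout.cancel(false);
        ses.shutdownNow();
    }
}
```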

          Ivan Fernandez Calvo added a comment -

          Yep, that is really helpful. I'm fixing the retry method in https://issues.jenkins-ci.org/browse/JENKINS-58589. I've found that the connection was not closed in some cases, so now it is forced closed before retrying. Also, the timeout is now passed to the connection, so the weird global timeout disappears; there is now a timeout per connection.
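          The fix described here (always close the failed connection before retrying, and apply the timeout per connection rather than globally) can be sketched like this. It is an illustration of the strategy using plain sockets, not the actual SSHLauncher code; host, port, and numbers are placeholders.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.ServerSocket;
import java.net.Socket;

// Sketch of the fixed retry strategy: each attempt gets its own connect
// timeout, and the connection is always closed before the next retry.
public class RetryWithPerConnectionTimeout {
    static Socket connect(String host, int port, int maxRetries,
                          int timeoutMillis, long retryWaitMillis)
            throws IOException, InterruptedException {
        IOException last = null;
        for (int attempt = 0; attempt <= maxRetries; attempt++) {
            Socket s = new Socket();
            try {
                // the timeout applies to THIS connection only, not a global clock
                s.connect(new InetSocketAddress(host, port), timeoutMillis);
                return s; // success: the caller owns the socket
            } catch (IOException e) {
                last = e;
                s.close(); // always close the failed connection before retrying
                Thread.sleep(retryWaitMillis);
            }
        }
        throw last;
    }

    public static void main(String[] args) throws Exception {
        // pick a local port that is guaranteed to have nothing listening on it
        int closedPort;
        try (ServerSocket tmp = new ServerSocket(0)) {
            closedPort = tmp.getLocalPort();
        }
        try {
            connect("127.0.0.1", closedPort, 2, 500, 10);
        } catch (IOException e) {
            System.out.println("all retries failed"); // expected on a closed port
        }
    }
}
```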

          Ivan Fernandez Calvo added a comment -

          I am pretty sure this is related to JENKINS-59764. I have made a PR to the docker-plugin with the solution; it needs trilead-ssh2-build-217-jenkins-15 on Jenkins core 2.190- or trilead-api 1.0.5 on Jenkins core 2.190.1+.

          pjdarton added a comment -

          I'd be a bit wary of updating the docker-plugin so that it requires a fully up-to-date version of Jenkins. I believe users should be able to use the latest docker-plugin with a core Jenkins that's at least 6 months out of date. If we had a critical security defect that required the latest Jenkins core then, sure, I guess we'd have to do that, but for anything else it'd be best not to force everyone to update.

          FYI, what the docker-plugin is trying to do is poll the docker container (that it just created) until the container is ready to accept SSH connections, before allowing the main SSHLauncher code to start trying to connect. If the docker-plugin declared its provisioning "complete" and the slave stayed offline for any period, Jenkins would ask it to make more slave nodes than are required, causing resource wastage.
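          The polling pjdarton describes — wait until the container accepts TCP connections on its SSH port before handing off to the SSH launcher — could be sketched as follows. This is a hypothetical illustration, not the docker-plugin's actual implementation; class and method names are invented.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.ServerSocket;
import java.net.Socket;

// Sketch of "wait until the container accepts SSH connections": repeatedly
// try a TCP connect to host:port until it succeeds or the deadline passes.
// Hypothetical; the docker-plugin's real readiness check may differ.
public class WaitForSshPort {
    static boolean waitForPort(String host, int port, long deadlineMillis,
                               long pollMillis) throws InterruptedException {
        long deadline = System.currentTimeMillis() + deadlineMillis;
        while (System.currentTimeMillis() < deadline) {
            try (Socket s = new Socket()) {
                s.connect(new InetSocketAddress(host, port), 250);
                return true; // the port is accepting connections
            } catch (IOException notYet) {
                Thread.sleep(pollMillis); // container still booting; try again
            }
        }
        return false; // do NOT declare provisioning "complete"
    }

    public static void main(String[] args) throws Exception {
        // simulate a container whose sshd is already up, via a local listener
        try (ServerSocket server = new ServerSocket(0)) {
            boolean up = waitForPort("127.0.0.1", server.getLocalPort(), 2000, 50);
            System.out.println("port ready: " + up); // expected: true
        }
    }
}
```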

            ifernandezcalvo Ivan Fernandez Calvo
            durgadas Durgadas Kamath
            Votes: 4
            Watchers: 15