[JENKINS-64520] EC2 node does not start after stop/disconnect with parameter Idle termination time

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Major
    • Component: ec2-plugin
    • Environment: Debian 10
      Jenkins 2.263.1, 2.230
      Amazon EC2 plugin version 1.56, 1.53

      We have a problem: after the EC2 instances are stopped, the nodes do not start again. This is reproducible on three Jenkins servers. One of them is a clean install with only the Amazon EC2 plugin.

      Plugin settings and access to ec2 are the same

      Common settings on plugin EC2

      Idle termination time: 30

      Stop/Disconnect on Idle Timeout: yes

      Minimum number of instances: 0

      Minimum number of spare instances: 0
      Host key Verification Strategy: Off

      Connection strategy: PublicDNS

       

      When I test the credentials and the AMI, the plugin reports success.

       

      Logs from one node when I launch it:

      2020-12-28 12:33:18.949+0000 [id=25] INFO h.p.ec2.EC2RetentionStrategy#internalCheck: Idle timeout of EC2 (Test Amazon connection) - Jenkins PHP70 (same-instance-id) after 30 idle minutes, instance statusRUNNING
      2020-12-28 12:33:18.950+0000 [id=25] INFO h.plugins.ec2.EC2AbstractSlave#idleTimeout: EC2 instance idle time expired: same-instance-id
      2020-12-28 12:33:19.220+0000 [id=25] INFO h.plugins.ec2.EC2AbstractSlave#stop: EC2 instance stop request sent for same-instance-id
      2020-12-28 12:38:01.182+0000 [id=1078] INFO hudson.plugins.ec2.EC2Cloud#log: Launching instance: same-instance-id
      2020-12-28 12:38:01.183+0000 [id=1078] INFO hudson.plugins.ec2.EC2Cloud#log: bootstrap()
      2020-12-28 12:38:01.184+0000 [id=1078] INFO hudson.plugins.ec2.EC2Cloud#log: Getting keypair...
      2020-12-28 12:38:01.184+0000 [id=1078] INFO hudson.plugins.ec2.EC2Cloud#log: Using private key jenkins (SHA-1 fingerprint same-fingerprint)
      2020-12-28 12:38:01.184+0000 [id=1078] INFO hudson.plugins.ec2.EC2Cloud#log: Authenticating as jenkins
      2020-12-28 12:38:01.286+0000 [id=1078] INFO hudson.plugins.ec2.EC2Cloud#log: Connecting to null on port 22, with timeout 10000.
      2020-12-28 12:38:01.287+0000 [id=1078] INFO hudson.plugins.ec2.EC2Cloud#log: Failed to connect via ssh: There was a problem while connecting to null:22
      2020-12-28 12:38:01.288+0000 [id=1078] INFO hudson.plugins.ec2.EC2Cloud#log: Waiting for SSH to come up. Sleeping 5.
      2020-12-28 12:38:06.363+0000 [id=1078] INFO hudson.plugins.ec2.EC2Cloud#log: Connecting to null on port 22, with timeout 10000.
      2020-12-28 12:38:06.364+0000 [id=1078] INFO hudson.plugins.ec2.EC2Cloud#log: Failed to connect via ssh: There was a problem while connecting to null:22
      2020-12-28 12:38:06.364+0000 [id=1078] INFO hudson.plugins.ec2.EC2Cloud#log: Waiting for SSH to come up. Sleeping 5.

       

      On another Jenkins server, the launch log says:

      Dec 28, 2020 7:15:36 AM hudson.plugins.ec2.EC2Cloud
      INFO: Connecting to private-ip-ec2-instance.compute.internal on port 22, with timeout 10000.
      Dec 28, 2020 7:15:36 AM hudson.plugins.ec2.EC2Cloud
      INFO: Failed to connect via ssh: There was a problem while connecting to private-ip-ec2-instance:22
      Dec 28, 2020 7:15:36 AM hudson.plugins.ec2.EC2Cloud
      INFO: Waiting for SSH to come up. Sleeping 5.
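The first log above keeps retrying against "null:22", which suggests the launcher had no address to connect to at all. A minimal Java sketch of a wait loop that fails fast in that case, instead of sleeping and retrying, is shown below. It is not the plugin's code: the class name SshWaitSketch and the method waitForPort are hypothetical, and only the port (22), the connect timeout (10000 ms), and the 5 second sleep are taken from the logs.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

public class SshWaitSketch {
    /**
     * Hypothetical helper: returns true once a TCP connection to host:port
     * succeeds, or false after maxAttempts failed attempts.
     */
    public static boolean waitForPort(String host, int port, int timeoutMs,
                                      long sleepMs, int maxAttempts)
            throws InterruptedException {
        if (host == null || host.isEmpty()) {
            // No address to connect to, so retrying cannot help.
            throw new IllegalStateException(
                    "instance has no address; it may still be stopped");
        }
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try (Socket socket = new Socket()) {
                socket.connect(new InetSocketAddress(host, port), timeoutMs);
                return true; // the port accepted a connection
            } catch (IOException e) {
                // Connection refused, timed out, or host unresolved: wait, then retry.
                if (attempt < maxAttempts) {
                    Thread.sleep(sleepMs);
                }
            }
        }
        return false;
    }
}
```

With the values from the logs, a call would look like waitForPort(host, 22, 10000, 5000, attempts), where the number of attempts is an assumption because the logs do not show a retry limit.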
       


          Raihaan Shouhell added a comment -

          Can you confirm that the servers are reachable via SSH?

          Nikolay added a comment -

          They are reachable via SSH while the servers are running.

          The problem is that they do not start when I click "Launch agent".

          Raihaan Shouhell added a comment -

          The second log looks correct, since it shows the expected host, whereas the first one shows null as the host.

          Do you mean that the nodes have been stopped on the EC2 console and you expect Launch agent to start the EC2 instance and then attempt to connect?

          Nikolay added a comment - edited

          Yes, it worked for me in EC2 plugin version ~1.4.

          Nikolay added a comment -

          In all the cases described in the logs, the VM in EC2 does not start again. I don't see any error logs related to this. As far as I know, the IAM user has full permissions.

          Raihaan Shouhell added a comment -

          AFAIK, the launch agent button has never actually attempted to start a stopped instance.

          Nikolay added a comment - edited

          I'm sure the button used to launch the nodes, because developers used it to start nodes manually. But I'm not sure whether that was a custom solution.

          I will check this on older versions of the plugin and report back here.

          Evan added a comment -

          We're seeing this regression issue as well. The "Launch Agent" button no longer calls the EC2 API with StartInstances, yet Jenkins still tries to connect to the agent, even though it "forgot" to start it. I've verified this in CloudTrail.

          This causes a problem when we want to launch a specific agent from Jenkins. The only workaround is that someone in our team has to log into AWS and manually start it. Once we start it in AWS, Jenkins connects no problem.

          mkozell I think you're right that that PR introduced the regression. As of now, the only other place in the codebase that calls ec2.startInstances is hudson.plugins.ec2.SlaveTemplate.wakeOrphansOrStoppedUp(AmazonEC2 ec2, List<Instance> orphansOrStopped), which I think only gets called when a new build task comes in. This reflects the behavior that we see: Jenkins will (re)start stopped instances, if, and only if, a new build task needs an agent.

          thoulen since you created the PR, would you please be able to take a look at how to bring this functionality back?
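The behaviour requested in this comment, sending a start request for a stopped instance before trying to connect, can be sketched in Java as follows. This is an illustration only and not the plugin's implementation: the Ec2 interface, the FakeEc2 class, and ensureStarted are hypothetical stand-ins, and the real AWS client is left out so the example runs without credentials.

```java
import java.util.ArrayList;
import java.util.List;

public class LaunchSketch {
    public enum State { STOPPED, PENDING, RUNNING }

    /** Hypothetical stand-in for the small part of the EC2 API this sketch needs. */
    public interface Ec2 {
        State describeState(String instanceId);
        void startInstance(String instanceId);
    }

    /**
     * Sends a start request if the instance is stopped.
     * Returns true when a start request was sent, false otherwise.
     */
    public static boolean ensureStarted(Ec2 ec2, String instanceId) {
        if (ec2.describeState(instanceId) == State.STOPPED) {
            ec2.startInstance(instanceId);
            return true;
        }
        return false;
    }

    /** In-memory fake that records calls, so the sketch runs without AWS access. */
    public static class FakeEc2 implements Ec2 {
        public State state;
        public final List<String> calls = new ArrayList<>();

        public FakeEc2(State initial) {
            this.state = initial;
        }

        @Override
        public State describeState(String instanceId) {
            calls.add("describe:" + instanceId);
            return state;
        }

        @Override
        public void startInstance(String instanceId) {
            calls.add("start:" + instanceId);
            state = State.PENDING; // a started instance is not RUNNING immediately
        }
    }

    public static void main(String[] args) {
        FakeEc2 ec2 = new FakeEc2(State.STOPPED);
        System.out.println("start requested: " + ensureStarted(ec2, "i-example"));
        System.out.println("calls: " + ec2.calls);
    }
}
```

A launcher following this pattern would call ensureStarted first, wait for the instance to reach the running state, and only then begin the SSH connection attempts.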


          Matthew Webber added a comment -

          I can confirm that we're seeing the same problem as seen by mkozell and mira_evan - this is with v1.68 of Jenkins' ec2-plugin.

          If the EC2 instance is stopped, then clicking the "Launch Agent" button should call AWS to start it, but nothing happens - the instance remains stopped. Eventually the launch fails (of course), and the instance is then terminated (deleted).

          (This is possibly related to JENKINS-67190 "EC2-plugin not spooling up stopped nodes, starting new nodes instead", so I've linked the two).

          It looks like there are two PRs (both merged) that are involved here:
          https://github.com/jenkinsci/ec2-plugin/pull/252
          https://github.com/jenkinsci/ec2-plugin/pull/294

          I've added a comment to both those, linking back to this ticket, and @mentioning the PR authors.

            thoulen FABRIZIO MANFREDI
            nlopyrev Nikolay
            Votes: 3
            Watchers: 7