
EC2-plugin not spooling up stopped nodes - "still in the queue ... all nodes of label ... are offline"

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Component: ec2-plugin
    • Environment: Jenkins 1.572, EC2 plugin 1.21, Node Iterator API Plugin 1.5

      The Jenkins EC2 plugin no longer launches stopped nodes. Unfortunately I'm not sure exactly when it stopped working - I didn't identify this as the problem until later, because it was masked by unrelated issues with too many nodes spawning and having to be killed.

      If I manually start a stopped EC2 node that a build is waiting on, using Manage Jenkins -> Manage Nodes, the build proceeds.

      Builds succeed when the EC2 plugin spawns a new node for the first time. The problem only occurs once a node has been stopped for idleness - the plugin doesn't seem to restart it.

      Builds get stuck with output like:

      Triggering bdr_linux » x64,debian7
      Triggering bdr_linux » x86,amazonlinux201209
      Triggering bdr_linux » x86,debian6
      Triggering bdr_linux » x64,amazonlinux201209
      Configuration bdr_linux » x86,amazonlinux201209 is still in the queue: Amazon Linux 2012.09 EBS 32-bit  (i-b848fbfa) is offline
      Configuration bdr_linux » x86,amazonlinux201209 is still in the queue: All nodes of label ‘amazonlinux201209&&x86’ are offline
      

      where there's at least one node with that label stopped, ready to start and use.

      There's no sign that any attempt is made to start the node.
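The claim that a stopped node with the right label exists can be checked mechanically. The following is a minimal sketch, not the plugin's or Jenkins' own matching code: it handles only `&&` conjunctions rather than the full label expression grammar, the `Node` class is hypothetical, the labels on i-b848fbfa are assumed from the queue message above, and the second node is invented for contrast.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    labels: frozenset
    state: str  # simplified: "running" or "stopped"


def matches(label_expr, node):
    """True if the node carries every atom of an '&&'-joined label expression."""
    atoms = [atom.strip() for atom in label_expr.split("&&")]
    return all(atom in node.labels for atom in atoms)


def restart_candidates(label_expr, nodes):
    """Stopped nodes that could serve the label expression if restarted."""
    return [n for n in nodes if n.state == "stopped" and matches(label_expr, n)]


nodes = [
    # Labels assumed from the "still in the queue" message in the report.
    Node("Amazon Linux 2012.09 EBS 32-bit (i-b848fbfa)",
         frozenset({"amazonlinux201209", "x86"}), "stopped"),
    # Hypothetical node, included only to show a non-match.
    Node("hypothetical Debian 7 node", frozenset({"debian7", "x64"}), "running"),
]

candidates = restart_candidates("amazonlinux201209&&x86", nodes)
print([n.name for n in candidates])
```

Under these assumptions the stopped node is a valid restart candidate for the queued configuration, which is the situation the report describes.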

          [JENKINS-23787] EC2-plugin not spooling up stopped nodes - "still in the queue ... all nodes of label ... are offline"

          Craig Ringer created issue -

          Craig Ringer added a comment -

          After updating to 1.23 I instead get the behaviour in JENKINS-23788. Manual node launch no longer works.

          Craig Ringer made changes -
          Labels New: demand-launch ec2
          Craig Ringer made changes -
          Labels Original: demand-launch ec2 New: demand-launch ec2 regression

          Andrew Coulton added a comment -

          I think I am seeing a similar issue with EC2 1.23 and Node Iterator 1.5.

          Every time I start a build, Jenkins launches a new slave rather than restarting one of the stopped instances. It does successfully stop the instance when the idle timeout is reached.

          During idle, the log shows lots of this (the _check, idleTimeout, stop entries appear once for every instance currently registered):

          Jul 16, 2014 12:40:47 PM hudson.model.AsyncPeriodicWork$1 run
          INFO: Started EC2 alive slaves monitor
          Jul 16, 2014 12:40:48 PM hudson.model.AsyncPeriodicWork$1 run
          INFO: Finished EC2 alive slaves monitor. 1172 ms
          Jul 16, 2014 12:41:53 PM hudson.plugins.ec2.EC2RetentionStrategy _check
          INFO: Idle timeout: edifestivalsapi build slave (i-ce32a08c)
          Jul 16, 2014 12:41:53 PM hudson.plugins.ec2.EC2AbstractSlave idleTimeout
          INFO: EC2 instance idle time expired: i-ce32a08c
          Jul 16, 2014 12:41:53 PM hudson.plugins.ec2.EC2AbstractSlave stop
          INFO: EC2 instance stopped: i-ce32a08c


          Then a build is triggered and the idle timeout checks run again (again, one set of entries for every instance):

          Jul 16, 2014 12:44:23 PM com.cloudbees.jenkins.GitHubPushTrigger$1 run
          INFO: SCM changes detected in edifestivalsapi-master. Triggering  #36
          Jul 16, 2014 12:45:53 PM hudson.plugins.ec2.EC2RetentionStrategy _check
          INFO: Idle timeout: edifestivalsapi build slave (i-ce32a08c)
          Jul 16, 2014 12:45:53 PM hudson.plugins.ec2.EC2AbstractSlave idleTimeout
          INFO: EC2 instance idle time expired: i-ce32a08c
          Jul 16, 2014 12:45:53 PM hudson.plugins.ec2.EC2AbstractSlave stop
          INFO: EC2 instance stopped: i-ce32a08c


          And then the plugin starts to provision a new instance - apparently without any attempt to restart a stopped slave.

          Jul 16, 2014 12:46:33 PM hudson.plugins.ec2.EC2Cloud provision
          INFO: Excess workload after pending Spot instances: 1
          Jul 16, 2014 12:46:33 PM hudson.plugins.ec2.EC2Cloud addProvisionedSlave
          INFO: Provisioning for AMI ami-57ea3d20; Estimated number of total slaves: 0; Estimated number of slaves for ami ami-57ea3d20: 0
          Launching ami-57ea3d20 for template edifestivalsapi build slave
          Jul 16, 2014 12:46:33 PM hudson.slaves.NodeProvisioner update
          INFO: Started provisioning edifestivalsapi build slave (ami-57ea3d20) from ec2-eu-west-1 with 1 executors. Remaining excess workload:0.0
          Looking for existing instances: {InstanceIds: [],Filters: [{Name: image-id,Values: [ami-57ea3d20]}, {Name: group-name,Values: [jenkins-build-slave]}, {Name: key-name,Values: [build-slave]}, {Name: instance-type,Values: [t1.micro]}, {Name: tag:Name,Values: [edifestivalsapi-build-slave]}, {Name: tag:Project,Values: [edifestivalsapi]}, {Name: instance-state-name,Values: [stopped, stopping]}],}
          No existing instance found - created: {InstanceId: i-eb35a8a9,ImageId: ami-57ea3d20,State: {Code: 0,Name: pending},"**REDACTED**}


          Then another block of the idle timeout checks while the instance is launched, and then this:

          Jul 16, 2014 12:47:44 PM hudson.slaves.NodeProvisioner update
          INFO: edifestivalsapi build slave (ami-57ea3d20) provisioningE successfully completed. We have now 8 computer(s)
          Jul 16, 2014 12:47:47 PM hudson.node_monitors.ResponseTimeMonitor$1 monitor
          WARNING: Making edifestivalsapi build slave (i-ce32a08c) offline because it’s not responding


          The UI shows all the slaves that were launched for previous jobs, but shows them as offline with "Time out for last 5 try". When I manually start the instance (by opening the slave page and clicking "Launch slave agent") I see that the stopped instance is restarted and comes online as expected.

          So my hunch is that Jenkins somehow isn't detecting that it has a stopped instance for the given AMI?
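One way to investigate the "No existing instance found" result is to replay the logged filter list against what is known about a stopped instance. The sketch below is an illustration only, not the plugin's code. It relies on the EC2 DescribeInstances rule that separate filters are ANDed while the values inside one filter are ORed, and it simplifies each instance to a flat dictionary keyed by filter name. The filter list is copied from the "Looking for existing instances" log line; both instance records are hypothetical, since the thread does not show the real attributes of any stopped instance.

```python
# Filters copied from the "Looking for existing instances" log line.
FILTERS = [
    {"Name": "image-id", "Values": ["ami-57ea3d20"]},
    {"Name": "group-name", "Values": ["jenkins-build-slave"]},
    {"Name": "key-name", "Values": ["build-slave"]},
    {"Name": "instance-type", "Values": ["t1.micro"]},
    {"Name": "tag:Name", "Values": ["edifestivalsapi-build-slave"]},
    {"Name": "tag:Project", "Values": ["edifestivalsapi"]},
    {"Name": "instance-state-name", "Values": ["stopped", "stopping"]},
]


def failing_filters(instance, filters):
    """Names of the filters this instance does not satisfy."""
    return [f["Name"] for f in filters
            if instance.get(f["Name"]) not in f["Values"]]


def instance_matches(instance, filters):
    """AND across filters, OR across the values within one filter."""
    return not failing_filters(instance, filters)


# Hypothetical stopped instance that satisfies every filter.
fully_tagged = {
    "image-id": "ami-57ea3d20",
    "group-name": "jenkins-build-slave",
    "key-name": "build-slave",
    "instance-type": "t1.micro",
    "tag:Name": "edifestivalsapi-build-slave",
    "tag:Project": "edifestivalsapi",
    "instance-state-name": "stopped",
}

# Hypothetical variant: same instance, but one tag was never applied.
missing_tag = {k: v for k, v in fully_tagged.items() if k != "tag:Project"}

print(instance_matches(fully_tagged, FILTERS))
print(failing_filters(missing_tag, FILTERS))
```

Because every filter must match, a single attribute that differs from the template (a tag, the key name, the security group) is enough for the lookup to return nothing, so comparing the real instance's attributes against this list filter by filter is a reasonable first diagnostic step.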

          Craig Ringer added a comment -

          "So my hunch is that Jenkins somehow isn't detecting that it has a stopped instance for the given AMI?"

          That used to happen due to a bug when label support was added, but it's fixed in 1.21 IIRC.

          I'm not seeing the same behaviour - rather, it's just waiting indefinitely for the stopped node to start.

          It might be a good idea to file a separate issue for what you're discussing here, then comment to mention the issue number here in case they prove to be related.

          Andrew Coulton made changes -
          Link New: This issue is related to JENKINS-23850 [ JENKINS-23850 ]

          Andrew Coulton added a comment -

          Thanks Craig, I've filed as a separate issue at https://issues.jenkins-ci.org/browse/JENKINS-23850. Do you have an instance cap? It's possible we're seeing the same thing if your Jenkins has hit the instance cap and therefore can't start a new node (so the build stalls), while mine is uncapped so just goes ahead and makes a new one.
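The hypothesis in this comment can be written down as a small decision function. This is a sketch of the hypothesis only, not of the plugin's actual provisioning logic; the function name, its parameters, and the counts used in the example calls are all hypothetical.

```python
def next_action(stopped_matches_seen, instance_count, instance_cap=None):
    """Return 'restart', 'provision', or 'wait' for one queued build.

    stopped_matches_seen: stopped instances the lookup reports as usable
    instance_count:       instances counted against the cap
    instance_cap:         None means uncapped
    """
    if stopped_matches_seen > 0:
        return "restart"
    if instance_cap is None or instance_count < instance_cap:
        return "provision"
    return "wait"


# If the lookup wrongly reports zero stopped instances, the outcome
# depends only on the cap:
print(next_action(0, 8))                   # uncapped
print(next_action(0, 8, instance_cap=8))   # cap reached
print(next_action(1, 8, instance_cap=8))   # lookup works
```

Under this model, one failed lookup would produce a new instance on an uncapped setup and an indefinitely queued build on a setup that has reached its cap, which is why the question about the instance cap matters for deciding whether the two reports share a cause.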

          Craig Ringer added a comment -

          I do have an instance cap, but neither the per-node-type nor the global instance cap is being reached. It's an issue with restarting existing stopped nodes, not with starting new ones.


          Andrew Coulton added a comment -

          OK, does sound like yours is a different issue then.

            Assignee: Francis Upton (francisu)
            Reporter: Craig Ringer (ringerc)
            Votes: 2
            Watchers: 8