Jenkins / JENKINS-23792

PATCH: EC2 plugin idles-down nodes that are still launching

    • 1.62

      The ec2-plugin will treat a still-launching node as idle (because it has no jobs running) and will thus shut it down if its idle timeout expires.

      This means that if the idle timeout for a node is shorter than the time it takes to do one-off configuration on first launch, it will be shut down before it finishes starting up, aborting any init script that might be in progress.

      That's a clear bug - a node isn't idle while it's launching, and shouldn't be idled out until it has first made a connection with its slave agent.


          Craig Ringer created issue -

          Craig Ringer added a comment -

          This also causes a race condition in demand-launched workers, where:

          • ec2-plugin launches a new node when one is requested by the executor
          • the node takes longer than the idle timeout to start
          • ec2-plugin stops the node
          • the executor waits indefinitely for the node to become available

          Jenkins doesn't try to start the node again, because as far as it's concerned the node is still starting up. ec2-plugin doesn't start it because as far as it's concerned it started, went idle, and was stopped.
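The sequence above can be modeled with a short sketch. This is illustrative code, not the plugin's actual source: a retention policy that only compares elapsed idle time against the timeout will fire on a node that is still launching, because a node with no jobs ever run looks idle since creation.

```java
// Minimal model of the race described above: a node that is still
// launching has run no jobs, so an idle-only check sees it as idle
// and stops it mid-launch. Names are illustrative, not plugin code.
public class NaiveIdleCheck {
    // true if a naive idle-only policy would stop the node now
    static boolean shouldStop(long idleStartMillis, long nowMillis,
                              long idleTimeoutMillis) {
        return nowMillis - idleStartMillis > idleTimeoutMillis;
    }

    public static void main(String[] args) {
        long launchTime = 0L;            // node requested at t=0
        long idleTimeout = 5 * 60_000L;  // 5 minute idle timeout
        long now = 6 * 60_000L;          // init script still running at t=6min

        // No executors yet, so the "idle start" is effectively the
        // creation time: the check fires while the node is launching.
        System.out.println(shouldStop(launchTime, now, idleTimeout)); // true
    }
}
```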


          Craig Ringer added a comment -

          As a workaround for this, you can set the idle time longer than the longest period a node might reasonably take to run its init script.

          If you're using start/stop instead of terminate for your nodes, they probably take a while to initialize. So this limits how eagerly you can terminate them. Still, it works.


          Craig Ringer added a comment -

          Actually, that turns out to be insufficient. I just hit the case again with an already-built instance that takes significantly less than the idle timer to boot from stopped.

          It looks like it's an issue if the build time takes longer than the timeout and there's a

          Output for node launch looks like:

          Starting existing instance: i-205dcf62 result:{StartingInstances: [{InstanceId: i-205dcf62,CurrentState: {Code: 0,Name: pending},PreviousState: {Code: 80,Name: stopped}}]}
          Waiting for SSH to come up. Sleeping 5.
          Connecting to 10.0.1.206 on port 22, with timeout 10000.
          Waiting for SSH to come up. Sleeping 5.
          Connecting to 10.0.1.206 on port 22, with timeout 10000.
          

          (repeating endlessly), but when the node state is examined in AWS it's shown as stopped.

          It's not clear from the ec2 plugin logs why this is happening; they look fairly reasonable:

          Jul 16, 2014 7:58:55 AM INFO hudson.plugins.ec2.EC2RetentionStrategy _check
          Idle timeout: Amazon Linux 2012.09 EBS 64-bit  (i-205dcf62)
          Jul 16, 2014 7:58:55 AM INFO hudson.plugins.ec2.EC2AbstractSlave idleTimeout
          EC2 instance idle time expired: i-205dcf62
          Jul 16, 2014 7:58:55 AM INFO hudson.plugins.ec2.EC2AbstractSlave stop
          EC2 instance stopped: i-205dcf62
          

          Those are logs for hudson.plugins.ec2 at log level FINEST.

          I'm still speculating that there's a race - or two - here. It's not clear if this is the same one as above, or if this is another issue with a race between requesting a new instance and idle timeout.

          Craig Ringer made changes -
          Link New: This issue is related to JENKINS-23787 [ JENKINS-23787 ]

          Craig Ringer added a comment -

          I've put together a change to increase the logging detail on this:

          https://github.com/jenkinsci/ec2-plugin/pull/101

          While reading for this I also noticed that in hudson.model.Computer, getIdleStartMilliseconds() returns Long.MIN_VALUE if there are no executors.

          If there are executors, the greatest of each executor's last-executed-job time and the owning computer's getConnectTime() is returned. The executor's last-executed-job time is initialized to the executor's creation time.

          The Computer's connectTime is initialized to zero, but is set to the current time when proactive launch begins.

          So unless a node is actively running a job on an executor, it's reported as having been idle since creation, even if it's busily running init scripts and other setup so the slave agent hasn't connected and created executors yet.

          That's why it's being stopped again before it's started properly.
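The reported behavior can be condensed into a small model. This is not Jenkins core's actual source, just the logic as described above, with illustrative names:

```java
import java.util.List;

// Simplified model of Computer.getIdleStartMilliseconds() per the
// description above; not Jenkins core's real implementation.
public class IdleStartModel {
    static long idleStart(List<Long> executorLastJobTimes, long connectTime) {
        if (executorLastJobTimes.isEmpty())
            return Long.MIN_VALUE;       // no executors: idle "forever"
        long latest = connectTime;
        for (long t : executorLastJobTimes)
            latest = Math.max(latest, t);
        return latest;
    }

    public static void main(String[] args) {
        // Still launching: no executors, so the node appears idle since
        // the beginning of time to any timeout check.
        System.out.println(idleStart(List.of(), 1_000L));             // Long.MIN_VALUE
        // Connected: idle start is the latest of job time and connect time.
        System.out.println(idleStart(List.of(500L, 2_000L), 1_000L)); // 2000
    }
}
```

With an idle start of Long.MIN_VALUE, any comparison of `now - idleStart` against a finite timeout fires immediately, which matches the observed stop-before-connect behavior.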

          I think the best fix for this will be for EC2Computer to @Override getIdleStartMilliseconds(...) and report a time in the future (i.e. busy) while the node is still starting and hasn't yet connected its slave agent.

          Or at least until the init script finishes running, in EC2UnixLauncher.launch(...).

          It looks like testing isOffline() should be sufficient for that - if the node is offline, it's clearly not a candidate for idle shutdown.

          To be really picky, we could instead have the ComputerLauncher set a flag on the Computer when init scripts have finished and the slave agent has been launched. However, that's the point at which a channel is registered and isOffline becomes false, so there seems little point.
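A self-contained sketch of that override idea follows, with a stand-in `BaseComputer` class in place of hudson.model.Computer. The class names and the five-minute pad are illustrative assumptions, not the submitted patch:

```java
import java.util.concurrent.TimeUnit;

// Stand-in for hudson.model.Computer; only what the sketch needs.
class BaseComputer {
    private final long idleSince;
    private final boolean offline;

    BaseComputer(long idleSince, boolean offline) {
        this.idleSince = idleSince;
        this.offline = offline;
    }

    boolean isOffline() { return offline; }

    long getIdleStartMilliseconds() { return idleSince; }
}

// Models the proposed EC2Computer override: while offline (slave agent
// not yet connected), report an idle start in the future so the node
// never appears idle to the timeout check.
public class Ec2ComputerModel extends BaseComputer {
    Ec2ComputerModel(long idleSince, boolean offline) {
        super(idleSince, offline);
    }

    @Override
    long getIdleStartMilliseconds() {
        if (isOffline())
            return System.currentTimeMillis() + TimeUnit.MINUTES.toMillis(5);
        return super.getIdleStartMilliseconds();
    }
}
```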


          Craig Ringer added a comment -

          Actually, it's better to suppress idle timeout in the retention manager, rather than playing with the reported idle time.

          See https://github.com/jenkinsci/ec2-plugin/pull/102

          Craig Ringer made changes -
          Labels Original: demand-launch ec2 idle-shutdown New: demand-launch ec2 idle-shutdown patch
          Summary Original: EC2 plugin idles-down nodes that are still launching New: PATCH: EC2 plugin idles-down nodes that are still launching

          SCM/JIRA link daemon added a comment -

          Code changed in jenkins
          User: Craig Ringer
          Path:
          src/main/java/hudson/plugins/ec2/EC2RetentionStrategy.java
          http://jenkins-ci.org/commit/ec2-plugin/d0cd0db5b845c747607a60759eebb7d31e4ea32b
          Log:
          Prevent nodes that are still starting up from being idle-stopped

          Per JENKINS-23792, the EC2 plugin will shut down nodes that're still starting
          up if the idle timeout is shorter than the time the node takes to go from
          launch request to successfully starting its first job on an executor.

          To prevent this, don't perform idle shutdown on a node that is marked offline.
          When it comes online, executors will be created and the new idle time will
          become the executor creation time, effectively resetting the timer.
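The guard the commit describes can be sketched as follows. This is a simplified, self-contained model with stand-in names, not the actual EC2RetentionStrategy source:

```java
// Simplified model of the committed fix: skip idle shutdown entirely
// while the computer is offline (i.e. still launching). Names are
// stand-ins for the plugin's classes, not the actual source.
public class RetentionCheckModel {
    static boolean shouldIdleStop(boolean offline, long idleStartMillis,
                                  long nowMillis, long idleTimeoutMillis) {
        if (offline)
            return false;  // still launching: never idle-stop
        return nowMillis - idleStartMillis > idleTimeoutMillis;
    }

    public static void main(String[] args) {
        // Offline node, long past the timeout: not stopped.
        System.out.println(shouldIdleStop(true, 0L, 600_000L, 300_000L));  // false
        // Online node past the timeout: stopped as before.
        System.out.println(shouldIdleStop(false, 0L, 600_000L, 300_000L)); // true
    }
}
```

Once the agent connects, executors are created and the idle clock effectively restarts from the executor creation time, matching the commit message above.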

          Francis Upton made changes -
          Resolution New: Fixed [ 1 ]
          Status Original: Open [ 1 ] New: Closed [ 6 ]

            Assignee: Francis Upton (francisu)
            Reporter: Craig Ringer (ringerc)
