Uploaded image for project: 'Jenkins'
  1. Jenkins
  2. JENKINS-23792

PATCH: EC2 plugin idles-down nodes that are still launching

    XMLWordPrintable

Details

    • 1.62

    Description

      The ec2-plugin will treat a still-launching node as idle (because it has no jbs running) and will thus shut it down if its idle timeout expires.

      This means that if the idle timeout for a node is shorter than the time it takes to do one-off configuration on first launch, it will be shut down before it finishes starting up, aborting any init script that might be in progress.

      That's a clear bug - a node isn't idle when it's launching, and shouldn't be idled out until it's first made a connection with its slave agent.

      Attachments

        Issue Links

          Activity

            ringerc Craig Ringer added a comment -

            This also causes a race condition in demand-launched workers, where:

            • ec2-plugin launches a new node when one is requested by the executor
            • the node takes longer than the idle timeout to start
            • ec2-plugin stops the node
            • the executor waits indefinitely for the node to become available

            Jenkins doesn't try to start the node again, because as far as it's concerned the node is still starting up. ec2-plugin doesn't start it because as far as it's concerned it started, went idle, and was stopped.

            ringerc Craig Ringer added a comment - This also causes a race condition in demand-launched workers, where: ec2-plugin launches a new node when one is requested by the executor the node takes longer than the idle timeout to start ec2-plugin stops the node the executor waits indefinitely for the node to become available Jenkins doesn't try to start the node again, because as far as it's concerned the node is still starting up. ec2-plugin doesn't start it because as far as it's concerned it started, went idle, and was stopped.
            ringerc Craig Ringer added a comment -

            As a workaround for this, you can set the idle time longer than the longest period a node might reasonably take to run its init script.

            If you're using start/stop instead of terminate for your nodes, they probably take a while to initialize. So this limits how eagerly you can terminate them. Still, it works.

            ringerc Craig Ringer added a comment - As a workaround for this, you can set the idle time longer than the longest period a node might reasonably take to run its init script. If you're using start/stop instead of terminate for your nodes, they probably take a while to initialize. So this limits how eagerly you can terminate them. Still, it works.
            ringerc Craig Ringer added a comment -

            Actually, that turns out to be insufficient. I just hit the case again with an already-built instance that takes significantly less than the idle timer to boot from stopped.

            It looks like it's an issue if the build time takes longer than the timeout and there's a

            Output for node launch looks like:

            Starting existing instance: i-205dcf62 result:{StartingInstances: [{InstanceId: i-205dcf62,CurrentState: {Code: 0,Name: pending},PreviousState: {Code: 80,Name: stopped}}]}
            Waiting for SSH to come up. Sleeping 5.
            Connecting to 10.0.1.206 on port 22, with timeout 10000.
            Waiting for SSH to come up. Sleeping 5.
            Connecting to 10.0.1.206 on port 22, with timeout 10000.
            

            (repeating endlessly), but when the node state is examined in AWS it's shown as stopped.

            It's not clear from the ec2 plugin logs why this is happening; they look fairly reasonable:

            Jul 16, 2014 7:58:55 AM INFO hudson.plugins.ec2.EC2RetentionStrategy _check
            Idle timeout: Amazon Linux 2012.09 EBS 64-bit  (i-205dcf62)
            Jul 16, 2014 7:58:55 AM INFO hudson.plugins.ec2.EC2AbstractSlave idleTimeout
            EC2 instance idle time expired: i-205dcf62
            Jul 16, 2014 7:58:55 AM INFO hudson.plugins.ec2.EC2AbstractSlave stop
            EC2 instance stopped: i-205dcf62
            

            those are logs for hudson.plugins.ec2 at log level FINEST.

            I'm still speculating that there's a race - or two - here. It's not clear if this is the same one as above, or if this is another issue with a race between requesting a new instance and idle timeout.

            ringerc Craig Ringer added a comment - Actually, that turns out to be insufficient. I just hit the case again with an already-built instance that takes significantly less than the idle timer to boot from stopped. It looks like it's an issue if the build time takes longer than the timeout and there's a Output for node launch looks like: Starting existing instance: i-205dcf62 result:{StartingInstances: [{InstanceId: i-205dcf62,CurrentState: {Code: 0,Name: pending},PreviousState: {Code: 80,Name: stopped}}]} Waiting for SSH to come up. Sleeping 5. Connecting to 10.0.1.206 on port 22, with timeout 10000. Waiting for SSH to come up. Sleeping 5. Connecting to 10.0.1.206 on port 22, with timeout 10000. (repeating endlessly), but when the node state is examined in AWS it's shown as stopped. It's not clear from the ec2 plugin logs why this is happening; they look fairly reasonable: Jul 16, 2014 7:58:55 AM INFO hudson.plugins.ec2.EC2RetentionStrategy _check Idle timeout: Amazon Linux 2012.09 EBS 64-bit (i-205dcf62) Jul 16, 2014 7:58:55 AM INFO hudson.plugins.ec2.EC2AbstractSlave idleTimeout EC2 instance idle time expired: i-205dcf62 Jul 16, 2014 7:58:55 AM INFO hudson.plugins.ec2.EC2AbstractSlave stop EC2 instance stopped: i-205dcf62 those are logs for hudson.plugins.ec2 at log level FINEST. I'm still speculating that there's a race - or two - here. It's not clear if this is the same one as above, or if this is another issue with a race between requesting a new instance and idle timeout.
            ringerc Craig Ringer added a comment -

            I've put together a change to increase the logging detail on this:

            https://github.com/jenkinsci/ec2-plugin/pull/101

            While reading for this I also noticed that in hudson.model.Computer, getIdleStartMilliseconds() returns Long.MIN_VALUE if there are no executors.

            If there are executors, then the greatest of the last executed job time and the executor's owning computer's getConnectTime() are returned. The executor's last executed job time is initialized to the executor's creation time.

            The Computer's connectTime is initialized to zero, but is set to the current time when proactive launch begins.

            So unless a node is actively running a job on an executor, it's reported as having been idle since creation, even if it's busily running init scripts and other setup so the slave agent hasn't connected and created executors yet.

            That's why it's being stopped again before it's started properly.

            I think the best fix for this will be for EC2Computer to @Override getIdleStartMilliseconds(...) and report a time in the future (i.e. busy) while the node is still starting and hasn't yet connected its slave agent.

            Or at least until the init script finishes running, in EC2UnixLauncher.launch(...)

            It looks like testing isOffline() should be sufficient for that - if the node is offline, it's clearly not a candidate for idle shutdown.

            To be really picky, we could instead have the ComputerLauncher set a flag on the Computer when init scripts have finished and the slave agent has been launched. However, that's the point at which a channel is registered and isOffline becomes false, so there seems little point.

            ringerc Craig Ringer added a comment - I've put together a change to increase the logging detail on this: https://github.com/jenkinsci/ec2-plugin/pull/101 While reading for this I also noticed that in hudson.model.Computer, getIdleStartMilliseconds() returns Long.MIN_VALUE if there are no executors. If there are executors, then the greatest of the last executed job time and the executor's owning computer's getConnectTime() are returned. The executor's last executed job time is initialized to the executor's creation time. The Computer's connectTime is initialized to zero, but is set to the current time when proactive launch begins. So unless a node is actively running a job on an executor, it's reported as having been idle since creation, even if it's busily running init scripts and other setup so the slave agent hasn't connected and created executors yet. That's why it's being stopped again before it's started properly. I think the best fix for this will be for EC2Computer to @Override getIdleStartMilliseconds(...) and report a time in the future (i.e. busy) while the node is still starting and hasn't yet connected its slave agent. Or at least until the init script finishes running, in EC2UnixLauncher.launch(...) It looks like testing isOffline() should be sufficient for that - if the node is offline, it's clearly not a candidate for idle shutdown. To be really picky, we could instead have the ComputerLauncher set a flag on the Computer when init scripts have finished and the slave agent has been launched. However, that's the point at which a channel is registered and isOffline becomes false, so there seems little point.
            ringerc Craig Ringer added a comment -

            Actually, it's better to suppress idle timeout in the retention manager, rather than playing with the reported idle time.

            See https://github.com/jenkinsci/ec2-plugin/pull/102

            ringerc Craig Ringer added a comment - Actually, it's better to suppress idle timeout in the retention manager, rather than playing with the reported idle time. See https://github.com/jenkinsci/ec2-plugin/pull/102

            Code changed in jenkins
            User: Craig Ringer
            Path:
            src/main/java/hudson/plugins/ec2/EC2RetentionStrategy.java
            http://jenkins-ci.org/commit/ec2-plugin/d0cd0db5b845c747607a60759eebb7d31e4ea32b
            Log:
            Prevent nodes that are still starting up from being idle-stopped

            Per JENKINS-23792, the EC2 plugin will shut down nodes that're still starting
            up if the idle timeout is shorter than the time the node takes to go from
            launch request to successfully starting its first job on an executor.

            To prevent this, don't perform idle shutdown on a node that is marked offline.
            When it comes online, executors will be created and the new idle time will
            become the executor creation time, effectively resetting the timer.

            scm_issue_link SCM/JIRA link daemon added a comment - Code changed in jenkins User: Craig Ringer Path: src/main/java/hudson/plugins/ec2/EC2RetentionStrategy.java http://jenkins-ci.org/commit/ec2-plugin/d0cd0db5b845c747607a60759eebb7d31e4ea32b Log: Prevent nodes that are still starting up from being idle-stopped Per JENKINS-23792 , the EC2 plugin will shut down nodes that're still starting up if the idle timeout is shorter than the time the node takes to go from launch request to successfully starting its first job on an executor. To prevent this, don't perform idle shutdown on a node that is marked offline. When it comes online, executors will be created and the new idle time will become the executor creation time, effectively resetting the timer.
            unicolet Umberto Nicoletti added a comment - - edited

            We are seeing this problem again:

            the EC2 plugins starts a node, the instance starts. If the EC2 plugin checks the node quickly enough it will shut it down again because the node is offline, and also its idle time will be calculated since the node was last used (usually way more than 1 hour). Our idle timeout is set to 1h.

            In the logs we are seeing nodes being started and shutdown in a loop as short as a few seconds. If we are lucky and the node manages to complete boot and connect to Jenkins, then the plugin will not try to shut it down. However in our experience nodes being started and immediately stopped are very frequent. The following logs illustrate a case:

             

            {"log":"2021-06-07T04:31:19+0000 INFO: [hudson.plugins.ec2.SlaveTemplate] SlaveTemplate{description='builder', labels='slave'}. Found stopped instances - will start it: {AmiLaunchIndex: 0,ImageId: ami-XXXXXXXXXXXXXX,InstanceId: i-05323ff7f119b6ad0,[omissis]","stream":"stderr","time":"2021-06-07T04:31:19.394868109Z"}
            {"log":"2021-06-07T04:31:20+0000 INFO: [hudson.plugins.ec2.SlaveTemplate] SlaveTemplate{description='builder', labels='slave'}. Result of starting stopped instances:{StartingInstances: [omissis]","stream":"stderr","time":"2021-06-07T04:31:20.411850985Z"}
            {"log":"2021-06-07T04:31:22+0000 INFO: [hudson.plugins.ec2.SlaveTemplate] SlaveTemplate{description='builder', labels='slave'}. Return instance: {AmiLaunchIndex: 0,ImageId: ami-XXXXXXXXXXXXXX,InstanceId: i-05323ff7f119b6ad0,[omissis]","stream":"stderr","time":"2021-06-07T04:31:22.387817199Z"}
            {"log":"2021-06-07T04:31:29+0000 INFO: [hudson.plugins.ec2.EC2RetentionStrategy] Idle timeout of EC2 (proemion) - builder (i-05323ff7f119b6ad0) after 1243 idle minutes, instance statusRUNNING\n","stream":"stderr","time":"2021-06-07T04:31:29.893754707Z"}
            {"log":"2021-06-07T04:31:29+0000 INFO: [hudson.plugins.ec2.EC2AbstractSlave] EC2 instance idle time expired: i-05323ff7f119b6ad0\n","stream":"stderr","time":"2021-06-07T04:31:29.893781858Z"}
            {"log":"2021-06-07T04:31:29+0000 INFO: [hudson.plugins.ec2.EC2Cloud] SlaveTemplate{description='builder', labels='slave'} Node EC2 (proemion) - builder (i-05323ff7f119b6ad0) moved to RUNNING state in 9 seconds and is ready to be connected by Jenkins\n","stream":"stderr","time":"2021-06-07T04:31:29.897879692Z"}

            I've pinpointed the problematic change to https://github.com/jenkinsci/ec2-plugin/commit/9acfd64aab6ccba0da5f18620c79452632c8be5a : notice how the if (computer.isOffline()){ now allows for the code to continue until the next if where the node can shutdown if the idle timeout has expired. Before that commit, the node being offline would always cause the check to end.

            We cannot revert the plugin to a version earlier than 1.55 because we're blocked by https://github.com/jenkinsci/ec2-plugin/pull/538

            I'd be more than happy to submit a PR with a fix and tests. I'd appreciate an expedite turnaround, if possible.

             

            unicolet Umberto Nicoletti added a comment - - edited We are seeing this problem again: the EC2 plugins starts a node, the instance starts. If the EC2 plugin checks the node quickly enough it will shut it down again because the node is offline, and also its idle time will be calculated since the node was last used (usually way more than 1 hour). Our idle timeout is set to 1h. In the logs we are seeing nodes being started and shutdown in a loop as short as a few seconds. If we are lucky and the node manages to complete boot and connect to Jenkins, then the plugin will not try to shut it down. However in our experience nodes being started and immediately stopped are very frequent. The following logs illustrate a case:   { "log" : "2021-06-07T04:31:19+0000 INFO: [hudson.plugins.ec2.SlaveTemplate] SlaveTemplate{description= 'builder' , labels= 'slave' }. Found stopped instances - will start it: {AmiLaunchIndex: 0,ImageId: ami-XXXXXXXXXXXXXX,InstanceId: i-05323ff7f119b6ad0,[omissis]" , "stream" : "stderr" , "time" : "2021-06-07T04:31:19.394868109Z" } { "log" : "2021-06-07T04:31:20+0000 INFO: [hudson.plugins.ec2.SlaveTemplate] SlaveTemplate{description= 'builder' , labels= 'slave' }. Result of starting stopped instances:{StartingInstances: [omissis]" , "stream" : "stderr" , "time" : "2021-06-07T04:31:20.411850985Z" } { "log" : "2021-06-07T04:31:22+0000 INFO: [hudson.plugins.ec2.SlaveTemplate] SlaveTemplate{description= 'builder' , labels= 'slave' }. Return instance: {AmiLaunchIndex: 0,ImageId: ami-XXXXXXXXXXXXXX,InstanceId: i-05323ff7f119b6ad0,[omissis]" , "stream" : "stderr" , "time" : "2021-06-07T04:31:22.387817199Z" } { "log" : "2021-06-07T04:31:29+0000 INFO: [hudson.plugins.ec2.EC2RetentionStrategy] Idle timeout of EC2 (proemion) - builder (i-05323ff7f119b6ad0) after 1243 idle minutes, instance statusRUNNING\n" , "stream" : "stderr" , "time" : "2021-06-07T04:31:29.893754707Z" } { "log" : "2021-06-07T04:31:29+0000 INFO: [hudson.plugins.ec2.EC2AbstractSlave] EC2 instance idle time expired: i-05323ff7f119b6ad0\n" , "stream" : "stderr" , "time" : "2021-06-07T04:31:29.893781858Z" } { "log" : "2021-06-07T04:31:29+0000 INFO: [hudson.plugins.ec2.EC2Cloud] SlaveTemplate{description= 'builder' , labels= 'slave' } Node EC2 (proemion) - builder (i-05323ff7f119b6ad0) moved to RUNNING state in 9 seconds and is ready to be connected by Jenkins\n" , "stream" : "stderr" , "time" : "2021-06-07T04:31:29.897879692Z" } I've pinpointed the problematic change to https://github.com/jenkinsci/ec2-plugin/commit/9acfd64aab6ccba0da5f18620c79452632c8be5a  : notice how the if (computer.isOffline()){ now allows for the code to continue until the next if where the node can shutdown if the idle timeout has expired. Before that commit, the node being offline would always cause the check to end. We cannot revert the plugin to a version earlier than 1.55 because we're blocked by  https://github.com/jenkinsci/ec2-plugin/pull/538 I'd be more than happy to submit a PR with a fix and tests. I'd appreciate an expedite turnaround, if possible.  
            unicolet Umberto Nicoletti added a comment - PR submitted:  https://github.com/jenkinsci/ec2-plugin/pull/632

            version 1.62 which contains the fix in the previous comment solved the problem for us, this issue can be closed

            unicolet Umberto Nicoletti added a comment - version 1.62 which contains the fix in the previous comment solved the problem for us, this issue can be closed

            version 1.62 contains the fix, we have been running it for 2 days and we never had an occurrence of a node being shutdown while still launching. Thanks for the quick turnaround

            unicolet Umberto Nicoletti added a comment - version 1.62 contains the fix, we have been running it for 2 days and we never had an occurrence of a node being shutdown while still launching. Thanks for the quick turnaround

            People

              francisu Francis Upton
              ringerc Craig Ringer
              Votes:
              0 Vote for this issue
              Watchers:
              4 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: