  1. Jenkins
  2. JENKINS-23792

PATCH: EC2 plugin idles-down nodes that are still launching

Details

    • 1.62

    Description

      The ec2-plugin will treat a still-launching node as idle (because it has no jobs running) and will thus shut it down if its idle timeout expires.

      This means that if the idle timeout for a node is shorter than the time it takes to do one-off configuration on first launch, it will be shut down before it finishes starting up, aborting any init script that might be in progress.

      That's a clear bug - a node isn't idle when it's launching, and shouldn't be idled out until it's first made a connection with its slave agent.

          Activity

            unicolet Umberto Nicoletti added a comment - Version 1.62 contains the fix. We have been running it for 2 days and have not had a single occurrence of a node being shut down while still launching. Thanks for the quick turnaround.

            unicolet Umberto Nicoletti added a comment - Version 1.62, which contains the fix from the previous comment, solved the problem for us. This issue can be closed.
            unicolet Umberto Nicoletti added a comment - PR submitted:  https://github.com/jenkinsci/ec2-plugin/pull/632
            unicolet Umberto Nicoletti added a comment - - edited

            We are seeing this problem again:

            The EC2 plugin starts a node and the instance starts. If the plugin checks the node quickly enough, it will shut it down again because the node is offline and its idle time is calculated from when the node was last used (usually well over 1 hour). Our idle timeout is set to 1h.

            In the logs we see nodes being started and shut down in a loop as short as a few seconds. If we are lucky and the node manages to finish booting and connect to Jenkins, the plugin will not try to shut it down. However, in our experience nodes being started and immediately stopped are very frequent. The following logs illustrate one case:

             

            {"log":"2021-06-07T04:31:19+0000 INFO: [hudson.plugins.ec2.SlaveTemplate] SlaveTemplate{description='builder', labels='slave'}. Found stopped instances - will start it: {AmiLaunchIndex: 0,ImageId: ami-XXXXXXXXXXXXXX,InstanceId: i-05323ff7f119b6ad0,[omissis]","stream":"stderr","time":"2021-06-07T04:31:19.394868109Z"}
            {"log":"2021-06-07T04:31:20+0000 INFO: [hudson.plugins.ec2.SlaveTemplate] SlaveTemplate{description='builder', labels='slave'}. Result of starting stopped instances:{StartingInstances: [omissis]","stream":"stderr","time":"2021-06-07T04:31:20.411850985Z"}
            {"log":"2021-06-07T04:31:22+0000 INFO: [hudson.plugins.ec2.SlaveTemplate] SlaveTemplate{description='builder', labels='slave'}. Return instance: {AmiLaunchIndex: 0,ImageId: ami-XXXXXXXXXXXXXX,InstanceId: i-05323ff7f119b6ad0,[omissis]","stream":"stderr","time":"2021-06-07T04:31:22.387817199Z"}
            {"log":"2021-06-07T04:31:29+0000 INFO: [hudson.plugins.ec2.EC2RetentionStrategy] Idle timeout of EC2 (proemion) - builder (i-05323ff7f119b6ad0) after 1243 idle minutes, instance statusRUNNING\n","stream":"stderr","time":"2021-06-07T04:31:29.893754707Z"}
            {"log":"2021-06-07T04:31:29+0000 INFO: [hudson.plugins.ec2.EC2AbstractSlave] EC2 instance idle time expired: i-05323ff7f119b6ad0\n","stream":"stderr","time":"2021-06-07T04:31:29.893781858Z"}
            {"log":"2021-06-07T04:31:29+0000 INFO: [hudson.plugins.ec2.EC2Cloud] SlaveTemplate{description='builder', labels='slave'} Node EC2 (proemion) - builder (i-05323ff7f119b6ad0) moved to RUNNING state in 9 seconds and is ready to be connected by Jenkins\n","stream":"stderr","time":"2021-06-07T04:31:29.897879692Z"}

            I've pinpointed the problematic change to https://github.com/jenkinsci/ec2-plugin/commit/9acfd64aab6ccba0da5f18620c79452632c8be5a : notice how the if (computer.isOffline()){ check now allows the code to continue to the next if, where the node can be shut down if the idle timeout has expired. Before that commit, the node being offline would always cause the check to end.
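
The control-flow difference described in this comment can be sketched roughly as follows. This is a simplified model, not the plugin's actual code: the class IdleCheckSketch, its Node type, and both method names are hypothetical, and the only values taken from the report are the 1h idle timeout and the 1243 idle minutes shown in the log.

```java
/** Hypothetical, simplified model of the idle check; not the real EC2RetentionStrategy. */
final class IdleCheckSketch {

    /** Stand-in for a Jenkins node: whether it is offline and how long it has been idle. */
    static final class Node {
        final boolean offline;
        final long idleMinutes;

        Node(boolean offline, long idleMinutes) {
            this.offline = offline;
            this.idleMinutes = idleMinutes;
        }
    }

    /** Behaviour described as "before that commit": an offline node always ends the check. */
    static boolean stopsBeforeCommit(Node node, long idleTimeoutMinutes) {
        if (node.offline) {
            return false; // still launching (or disconnected): never idle-stop
        }
        return node.idleMinutes >= idleTimeoutMinutes;
    }

    /** Behaviour described as the regression: the offline branch no longer returns. */
    static boolean stopsAfterCommit(Node node, long idleTimeoutMinutes) {
        if (node.offline) {
            // Assumed: some offline-specific handling happens here, but execution
            // falls through to the idle-timeout comparison below.
        }
        return node.idleMinutes >= idleTimeoutMinutes;
    }

    public static void main(String[] args) {
        // A node that was just restarted: offline, with idle time counted since last use.
        Node booting = new Node(true, 1243);
        System.out.println("before commit: stop=" + stopsBeforeCommit(booting, 60));
        System.out.println("after commit:  stop=" + stopsAfterCommit(booting, 60));
    }
}
```

In this model the freshly restarted node is left alone by the first variant and stopped by the second, which matches the start/stop loop visible in the logs.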

            We cannot revert the plugin to a version earlier than 1.55 because we're blocked by https://github.com/jenkinsci/ec2-plugin/pull/538

            I'd be more than happy to submit a PR with a fix and tests. I'd appreciate an expedited turnaround, if possible.

             


            scm_issue_link SCM/JIRA link daemon added a comment -

            Code changed in jenkins
            User: Craig Ringer
            Path:
            src/main/java/hudson/plugins/ec2/EC2RetentionStrategy.java
            http://jenkins-ci.org/commit/ec2-plugin/d0cd0db5b845c747607a60759eebb7d31e4ea32b
            Log:
            Prevent nodes that are still starting up from being idle-stopped

            Per JENKINS-23792, the EC2 plugin will shut down nodes that're still starting
            up if the idle timeout is shorter than the time the node takes to go from
            launch request to successfully starting its first job on an executor.

            To prevent this, don't perform idle shutdown on a node that is marked offline.
            When it comes online, executors will be created and the new idle time will
            become the executor creation time, effectively resetting the timer.
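
The approach in the commit message (skip idle shutdown while the node is offline, and let the idle timer restart once it comes online) might look roughly like the sketch below. It is an illustration under stated assumptions, not the code from the commit: OfflineGuardSketch, markOnline, and shouldIdleStop are hypothetical names, and time is modelled as plain minute counters rather than real timestamps.

```java
/** Hypothetical model of the fix: offline nodes are exempt and the timer resets on connect. */
final class OfflineGuardSketch {
    private boolean offline = true; // a freshly launched node starts out offline
    private long idleSinceMinute;   // minute at which the current idle period began

    OfflineGuardSketch(long launchMinute) {
        this.idleSinceMinute = launchMinute;
    }

    /** Agent connected: executors are created, so the idle period restarts from now. */
    void markOnline(long nowMinute) {
        offline = false;
        idleSinceMinute = nowMinute;
    }

    /** Returns true only for an online node whose idle period has reached the timeout. */
    boolean shouldIdleStop(long nowMinute, long idleTimeoutMinutes) {
        if (offline) {
            return false; // still starting up: never idle-stop
        }
        return nowMinute - idleSinceMinute >= idleTimeoutMinutes;
    }

    public static void main(String[] args) {
        long timeout = 5;                                      // idle timeout shorter than launch time
        OfflineGuardSketch node = new OfflineGuardSketch(0);   // launch requested at minute 0
        System.out.println(node.shouldIdleStop(10, timeout));  // minute 10, still offline
        node.markOnline(10);                                   // agent connects at minute 10
        System.out.println(node.shouldIdleStop(12, timeout));  // 2 idle minutes since connect
        System.out.println(node.shouldIdleStop(15, timeout));  // 5 idle minutes since connect
    }
}
```

The point of the design is that a slow first launch can never exceed the timeout, because idle time only starts counting once the agent has connected.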

            People

              francisu Francis Upton
              ringerc Craig Ringer