• Bug
    • Resolution: Fixed
    • Minor
    • ec2-plugin
    • ec2-plugin 1.40.1

      Stopped ec2 instances are no longer being resumed when needed.  Instead, new instances are spun up.   

      Instances were correctly resumed in version 1.39 but not in 1.40.1.   Reverting to 1.39 fixed the issue for me.

      This recent pull request appears to have removed the logic necessary to resume stopped instances.

      https://github.com/jenkinsci/ec2-plugin/pull/252

      Specifically the changes here: 

      https://github.com/jenkinsci/ec2-plugin/pull/252/files#diff-f2115e33148d3db7c133fe014ad9dfddR419
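
      To make the expected behaviour concrete, here is a minimal sketch of "resume a stopped instance before launching a new one". It is not the plugin's code: the `Instance` and `State` types and the `nextAction` helper are hypothetical and only illustrate the decision that worked in 1.39.

```java
import java.util.List;
import java.util.Optional;

public class ResumeBeforeProvision {
    // Hypothetical, simplified instance states; not the plugin's or the AWS SDK's types.
    enum State { PENDING, RUNNING, STOPPED, TERMINATED }

    // Hypothetical, simplified view of an EC2 instance.
    static final class Instance {
        final String id;
        final State state;

        Instance(String id, State state) {
            this.id = id;
            this.state = state;
        }
    }

    /** Returns the first stopped instance that could be started again, if any. */
    static Optional<Instance> findStopped(List<Instance> instances) {
        return instances.stream()
                .filter(i -> i.state == State.STOPPED)
                .findFirst();
    }

    /** Describes what a provisioner would do: resume a stopped instance, else launch a new one. */
    static String nextAction(List<Instance> instances) {
        return findStopped(instances)
                .map(i -> "start " + i.id)
                .orElse("launch new instance");
    }

    public static void main(String[] args) {
        List<Instance> fleet = List.of(
                new Instance("i-aaa", State.TERMINATED),
                new Instance("i-bbb", State.STOPPED));
        System.out.println(nextAction(fleet));      // prints "start i-bbb"
        System.out.println(nextAction(List.of()));  // prints "launch new instance"
    }
}
```

      In a real provisioner the "start" branch would call the EC2 API to start the stopped instance; that call is left out here on purpose.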

       

          [JENKINS-54071] EC2-plugin not spooling up stopped nodes

          Oliver Pereira added a comment - I am testing the attached snapshot now, and it just waits for the slaves to become available even though the servers are all up and running. There is a lot of over-provisioning (the same as with PR252), which was not the case with release 1.40.1, even though that version had other bugs. I will continue to test.

          Oliver Pereira added a comment - It's the same script, taken from here: https://gist.github.com/vrivellino/97954495938e38421ba4504049fd44ea

          Oliver Pereira added a comment -

          The plugin just waits forever for the slaves to come up, even though a few machines have already started.

          I can see the following error in the logs.

          Oct 18, 2018 11:53:49 AM jenkins.metrics.api.Metrics$HealthChecker execute
          WARNING: Some health checks are reporting as unhealthy: [thread-deadlock : [

          jenkins.util.Timer [#7] locked on hudson.plugins.ec2.AmazonEC2Cloud@5a66924 (owned by jenkins.util.Timer [#10]):
            at hudson.plugins.ec2.EC2Cloud.connect(EC2Cloud.java:687)
            at hudson.plugins.ec2.CloudHelper.getInstance(CloudHelper.java:47)
            at hudson.plugins.ec2.CloudHelper.getInstanceWithRetry(CloudHelper.java:25)
            at hudson.plugins.ec2.EC2Computer.getState(EC2Computer.java:127)
            at hudson.plugins.ec2.EC2RetentionStrategy.internalCheck(EC2RetentionStrategy.java:112)
            at hudson.plugins.ec2.EC2RetentionStrategy.check(EC2RetentionStrategy.java:90)
            at hudson.plugins.ec2.EC2RetentionStrategy.check(EC2RetentionStrategy.java:48)
            at hudson.slaves.ComputerRetentionWork$1.run(ComputerRetentionWork.java:72)
            at hudson.model.Queue._withLock(Queue.java:1380)
            at hudson.model.Queue.withLock(Queue.java:1257)
            at hudson.slaves.ComputerRetentionWork.doRun(ComputerRetentionWork.java:63)
            at hudson.triggers.SafeTimerTask.run(SafeTimerTask.java:72)
            at jenkins.security.ImpersonatingScheduledExecutorService$1.run(ImpersonatingScheduledExecutorService.java:58)
            at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
            at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
            at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
            at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
            at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
            at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
            at java.lang.Thread.run(Thread.java:748),

          Computer.threadPoolForRemoting [#34] locked on hudson.plugins.ec2.AmazonEC2Cloud@5a66924 (owned by jenkins.util.Timer [#10]):
            at hudson.plugins.ec2.EC2Cloud.connect(EC2Cloud.java:687)
            at hudson.plugins.ec2.CloudHelper.getInstance(CloudHelper.java:47)
            at hudson.plugins.ec2.CloudHelper.getInstanceWithRetry(CloudHelper.java:25)
            at hudson.plugins.ec2.EC2Computer.updateInstanceDescription(EC2Computer.java:117)
            at hudson.plugins.ec2.ssh.EC2UnixLauncher.getEC2HostAddress(EC2UnixLauncher.java:365)
            at hudson.plugins.ec2.ssh.EC2UnixLauncher.connectToSsh(EC2UnixLauncher.java:319)
            at hudson.plugins.ec2.ssh.EC2UnixLauncher.bootstrap(EC2UnixLauncher.java:283)
            at hudson.plugins.ec2.ssh.EC2UnixLauncher.launchScript(EC2UnixLauncher.java:130)
            at hudson.plugins.ec2.EC2ComputerLauncher.launch(EC2ComputerLauncher.java:48)
            at hudson.slaves.SlaveComputer$1.call(SlaveComputer.java:294)
            at jenkins.util.ContextResettingExecutorService$2.call(ContextResettingExecutorService.java:46)
            at jenkins.security.ImpersonatingExecutorService$2.call(ImpersonatingExecutorService.java:71)
            at java.util.concurrent.FutureTask.run(FutureTask.java:266)
            at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
            at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
            at java.lang.Thread.run(Thread.java:748),

          jenkins.util.Timer [#10] locked on java.util.concurrent.locks.ReentrantLock$NonfairSync@779d1b9d (owned by jenkins.util.Timer [#7]):
            at sun.misc.Unsafe.park(Native Method)
            at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
            at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
            at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:870)
            at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1199)
            at java.util.concurrent.locks.ReentrantLock$NonfairSync.lock(ReentrantLock.java:209)
            at java.util.concurrent.locks.ReentrantLock.lock(ReentrantLock.java:285)
            at hudson.model.Queue._withLock(Queue.java:1437)
            at hudson.model.Queue.withLock(Queue.java:1300)
            at jenkins.model.Nodes.updateNode(Nodes.java:193)
            at jenkins.model.Jenkins.updateNode(Jenkins.java:2080)
            at hudson.model.Node.save(Node.java:140)
            at hudson.util.PersistedList.onModified(PersistedList.java:173)
            at hudson.util.PersistedList.replaceBy(PersistedList.java:85)
            at hudson.model.Slave.<init>(Slave.java:198)
            at hudson.plugins.ec2.EC2AbstractSlave.<init>(EC2AbstractSlave.java:136)
            at hudson.plugins.ec2.EC2OndemandSlave.<init>(EC2OndemandSlave.java:49)
            at hudson.plugins.ec2.EC2OndemandSlave.<init>(EC2OndemandSlave.java:42)
            at hudson.plugins.ec2.SlaveTemplate.newOndemandSlave(SlaveTemplate.java:954)
            at hudson.plugins.ec2.SlaveTemplate.toSlaves(SlaveTemplate.java:659)
            at hudson.plugins.ec2.SlaveTemplate.provisionOndemand(SlaveTemplate.java:631)
            at hudson.plugins.ec2.SlaveTemplate.provision(SlaveTemplate.java:462)
            at hudson.plugins.ec2.EC2Cloud.getNewOrExistingAvailableSlave(EC2Cloud.java:548)
            at hudson.plugins.ec2.EC2Cloud.provision(EC2Cloud.java:563)
            at hudson.slaves.NodeProvisioner$StandardStrategyImpl.apply(NodeProvisioner.java:715)
            at hudson.slaves.NodeProvisioner.update(NodeProvisioner.java:320)
            at hudson.slaves.NodeProvisioner.access$000(NodeProvisioner.java:61)
            at hudson.slaves.NodeProvisioner$NodeProvisionerInvoker.doRun(NodeProvisioner.java:809)
            at hudson.triggers.SafeTimerTask.run(SafeTimerTask.java:72)
            at jenkins.security.ImpersonatingScheduledExecutorService$1.run(ImpersonatingScheduledExecutorService.java:58)
            at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
            at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
            at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
            at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
            at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
            at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
            at java.lang.Thread.run(Thread.java:748)]]
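
          The health check above reports that Timer [#7] is waiting for the AmazonEC2Cloud monitor owned by Timer [#10], while Timer [#10] is waiting for a ReentrantLock owned by Timer [#7]; the two locks are being taken in opposite orders. The sketch below shows the general remedy of acquiring both locks in one fixed order. It is not the plugin's fix: `queueLock`, `cloudMonitor`, and `withBothLocks` are hypothetical stand-ins for the Jenkins queue lock and the cloud object's monitor.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.locks.ReentrantLock;

public class LockOrderSketch {
    // Hypothetical stand-ins for the two locks that appear in the trace.
    private static final ReentrantLock queueLock = new ReentrantLock();
    private static final Object cloudMonitor = new Object();

    private static final AtomicInteger completed = new AtomicInteger();

    /** Every code path takes queueLock first and cloudMonitor second, so no cycle can form. */
    static void withBothLocks(Runnable work) {
        queueLock.lock();
        try {
            synchronized (cloudMonitor) {
                work.run();
            }
        } finally {
            queueLock.unlock();
        }
    }

    /** Runs two tasks that need both locks concurrently and returns how many finished. */
    static int runBoth() {
        completed.set(0);
        ExecutorService pool = Executors.newFixedThreadPool(2);
        // One task plays the retention check, the other the provisioning path.
        pool.submit(() -> withBothLocks(completed::incrementAndGet));
        pool.submit(() -> withBothLocks(completed::incrementAndGet));
        pool.shutdown();
        try {
            pool.awaitTermination(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return completed.get();
    }

    public static void main(String[] args) {
        System.out.println("tasks completed: " + runBoth());
    }
}
```

          If one path took `cloudMonitor` first and the other took `queueLock` first, each thread could end up holding one lock while waiting for the other, which is the situation the trace describes.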

          Oliver Pereira added a comment - I have to restart Jenkins to clear up my queue. Going to roll back for now.

          FABRIZIO MANFREDI added a comment - Thanks, I am checking. I have received another notification about similar issues with nodes hanging on the AWS side. Can you tell me the reason for the unhealthy status? Is the disk full? We have a loop in these corner cases.

          Oliver Pereira added a comment - It just waits for the servers to respond but it works fine after rolling back to the PR252 version.

          Tobias Krönke added a comment - Could someone confirm whether this was fixed in 1.41? I made sure to add iam:ListInstanceProfilesForRole to the master's role policy. Then I terminated the old slave (only one currently). Building a job now correctly spins up a new instance (the cap for this AMI is set to 1). The instance is then stopped after 1 minute of idling. For the next build, however, it only seems to attempt to provision another machine (which of course fails because of the cap limit). See the attached ec2.log.

          Paul Bovbel added a comment -

          I believe I see the same issue as Tobias in 1.41

          FABRIZIO MANFREDI added a comment - - edited

          paulbovbel, can you check whether you have reached the cap? There is a bug in the EC2 plugin that counts stopped instances as running, which means that once you reach the cap it is not able to restart instances in the stopped state (duplicate of JENKINS-53920).
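
          As an illustration of the counting problem described in this comment, here is a minimal sketch of a cap check that ignores stopped instances. It is an assumption about what correct counting could look like, not the plugin's implementation; the `State` enum and both helper methods are hypothetical.

```java
import java.util.List;

public class CapCountSketch {
    // Hypothetical, simplified instance states; not the plugin's or the AWS SDK's types.
    enum State { PENDING, RUNNING, STOPPED }

    /** Counts only pending or running instances toward the cap; stopped ones are ignored. */
    static long countTowardCap(List<State> states) {
        return states.stream()
                .filter(s -> s == State.PENDING || s == State.RUNNING)
                .count();
    }

    /** True if one more instance can be brought up (resumed or launched) within the cap. */
    static boolean canStartOne(List<State> states, int cap) {
        return countTowardCap(states) < cap;
    }

    public static void main(String[] args) {
        // With a cap of 1, a single stopped instance leaves room to resume it.
        System.out.println(canStartOne(List.of(State.STOPPED), 1)); // prints "true"
        // A single running instance already uses the whole cap.
        System.out.println(canStartOne(List.of(State.RUNNING), 1)); // prints "false"
    }
}
```

          If stopped instances were counted as running, the first call would return false, which matches the symptom of a stopped instance never being restarted once the cap is reached.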

           

          Paul Bovbel added a comment -

          That sounds about right Fabrizio, I'll keep an eye on https://issues.jenkins-ci.org/browse/JENKINS-53920.

            thoulen FABRIZIO MANFREDI
            brycedrennan Bryce Drennan
            Votes: 7
            Watchers: 9
