JENKINS-54071

EC2-plugin not spooling up stopped nodes

Details

    Description

      Stopped EC2 instances are no longer resumed when needed. Instead, new instances are spun up.

      Instances were correctly resumed in version 1.39 but not in 1.40.1.   Reverting to 1.39 fixed the issue for me.

      This recent pull request appears to have removed the logic necessary to resume stopped instances.

      https://github.com/jenkinsci/ec2-plugin/pull/252

      Specifically the changes here: 

      https://github.com/jenkinsci/ec2-plugin/pull/252/files#diff-f2115e33148d3db7c133fe014ad9dfddR419

       

          Activity

            thoulen FABRIZIO MANFREDI added a comment - - edited

            PR-252 improved performance and also fixed the double connection to the node. Unfortunately, the automatic connection by the core is now triggered only for new nodes and not for existing ones.

            The fix will be in 1.41 (PR-311).

            Can you try the latest 1.41 snapshot:

            https://repo.jenkins-ci.org/snapshots/org/jenkins-ci/plugins/ec2/1.41-SNAPSHOT/ec2-1.41-20181017.001057-2.hpi

            Don't forget to add the new permissions to the role:

            {
              "Sid": "VisualEditor1",
              "Effect": "Allow",
              "Action": [
                "iam:ListInstanceProfilesForRole",
                "iam:PassRole"
              ],
              "Resource": "*"
            }

            oliverp Oliver Pereira added a comment -

            Has some constructor changed in the 1.41-snapshot version?

            We use a groovy script to configure the ec2-plugin and I am getting the following error now:

            groovy.lang.GroovyRuntimeException: Could not find matching constructor for: hudson.plugins.ec2.UnixData(java.lang.String, java.lang.String, java.lang.String)

            thoulen FABRIZIO MANFREDI added a comment -

            Yes, some interfaces have been changed to support round-robin allocation of nodes across multiple subnets, but not UnixData. It is still:

            @DataBoundConstructor
            public UnixData(String rootCommandPrefix, String slaveCommandPrefix, String sshPort) {
                this.rootCommandPrefix = rootCommandPrefix;
                this.slaveCommandPrefix = slaveCommandPrefix;
                this.sshPort = sshPort;

                this.readResolve();
            }

            Can you share a bit more of your script (possibly through a private channel)?
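One way to narrow down a "Could not find matching constructor" error is to ask the loaded class which public constructors it actually exposes. The following is only a sketch of that check using standard Java reflection. `UnixDataStandIn` is a hypothetical stand-in that copies the three-String shape quoted above; it is not the plugin's class, and the helper names are made up for illustration. On a real Jenkins instance the same reflection calls could presumably be pointed at `hudson.plugins.ec2.UnixData` instead.

```java
import java.lang.reflect.Constructor;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class ConstructorCheckSketch {

    // Hypothetical stand-in with the same three-String constructor shape
    // as the one quoted in the comment above. NOT the plugin's class.
    static class UnixDataStandIn {
        final String rootCommandPrefix;
        final String slaveCommandPrefix;
        final String sshPort;

        public UnixDataStandIn(String rootCommandPrefix, String slaveCommandPrefix, String sshPort) {
            this.rootCommandPrefix = rootCommandPrefix;
            this.slaveCommandPrefix = slaveCommandPrefix;
            this.sshPort = sshPort;
        }
    }

    // True if the class has a public constructor with exactly these parameter types.
    static boolean hasConstructor(Class<?> type, Class<?>... parameterTypes) {
        try {
            type.getConstructor(parameterTypes);
            return true;
        } catch (NoSuchMethodException e) {
            return false;
        }
    }

    // Lists the parameter types of every public constructor, one string per constructor.
    static List<String> describeConstructors(Class<?> type) {
        List<String> signatures = new ArrayList<>();
        for (Constructor<?> c : type.getConstructors()) {
            signatures.add(Arrays.toString(c.getParameterTypes()));
        }
        return signatures;
    }

    public static void main(String[] args) {
        System.out.println(describeConstructors(UnixDataStandIn.class));
        System.out.println(hasConstructor(UnixDataStandIn.class, String.class, String.class, String.class));
    }
}
```

If the three-String lookup succeeds on the installed plugin, the mismatch is more likely in how the script is invoked or in which plugin version is actually loaded than in the constructor itself.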

            oliverp Oliver Pereira added a comment -

            I am testing the attached snapshot now, and it just waits for the slaves to become available even though the servers are all up and running.

            There is a lot of over-provisioning (same as PR252), which wasn't the case with release 1.40.1, even though that version had other bugs.

            I will continue to test.
            oliverp Oliver Pereira added a comment -

            It's the same script taken from here: https://gist.github.com/vrivellino/97954495938e38421ba4504049fd44ea

            oliverp Oliver Pereira added a comment -

            The plugin just waits forever for the slaves to come up, even though a few machines have already started.

            I can see the following error in the logs.

            Oct 18, 2018 11:53:49 AM jenkins.metrics.api.Metrics$HealthChecker execute
            WARNING: Some health checks are reporting as unhealthy: [thread-deadlock : [
            jenkins.util.Timer [#7] locked on hudson.plugins.ec2.AmazonEC2Cloud@5a66924 (owned by jenkins.util.Timer [#10]):
              at hudson.plugins.ec2.EC2Cloud.connect(EC2Cloud.java:687)
              at hudson.plugins.ec2.CloudHelper.getInstance(CloudHelper.java:47)
              at hudson.plugins.ec2.CloudHelper.getInstanceWithRetry(CloudHelper.java:25)
              at hudson.plugins.ec2.EC2Computer.getState(EC2Computer.java:127)
              at hudson.plugins.ec2.EC2RetentionStrategy.internalCheck(EC2RetentionStrategy.java:112)
              at hudson.plugins.ec2.EC2RetentionStrategy.check(EC2RetentionStrategy.java:90)
              at hudson.plugins.ec2.EC2RetentionStrategy.check(EC2RetentionStrategy.java:48)
              at hudson.slaves.ComputerRetentionWork$1.run(ComputerRetentionWork.java:72)
              at hudson.model.Queue._withLock(Queue.java:1380)
              at hudson.model.Queue.withLock(Queue.java:1257)
              at hudson.slaves.ComputerRetentionWork.doRun(ComputerRetentionWork.java:63)
              at hudson.triggers.SafeTimerTask.run(SafeTimerTask.java:72)
              at jenkins.security.ImpersonatingScheduledExecutorService$1.run(ImpersonatingScheduledExecutorService.java:58)
              at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
              at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
              at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
              at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
              at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
              at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
              at java.lang.Thread.run(Thread.java:748),
            Computer.threadPoolForRemoting [#34] locked on hudson.plugins.ec2.AmazonEC2Cloud@5a66924 (owned by jenkins.util.Timer [#10]):
              at hudson.plugins.ec2.EC2Cloud.connect(EC2Cloud.java:687)
              at hudson.plugins.ec2.CloudHelper.getInstance(CloudHelper.java:47)
              at hudson.plugins.ec2.CloudHelper.getInstanceWithRetry(CloudHelper.java:25)
              at hudson.plugins.ec2.EC2Computer.updateInstanceDescription(EC2Computer.java:117)
              at hudson.plugins.ec2.ssh.EC2UnixLauncher.getEC2HostAddress(EC2UnixLauncher.java:365)
              at hudson.plugins.ec2.ssh.EC2UnixLauncher.connectToSsh(EC2UnixLauncher.java:319)
              at hudson.plugins.ec2.ssh.EC2UnixLauncher.bootstrap(EC2UnixLauncher.java:283)
              at hudson.plugins.ec2.ssh.EC2UnixLauncher.launchScript(EC2UnixLauncher.java:130)
              at hudson.plugins.ec2.EC2ComputerLauncher.launch(EC2ComputerLauncher.java:48)
              at hudson.slaves.SlaveComputer$1.call(SlaveComputer.java:294)
              at jenkins.util.ContextResettingExecutorService$2.call(ContextResettingExecutorService.java:46)
              at jenkins.security.ImpersonatingExecutorService$2.call(ImpersonatingExecutorService.java:71)
              at java.util.concurrent.FutureTask.run(FutureTask.java:266)
              at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
              at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
              at java.lang.Thread.run(Thread.java:748),
            jenkins.util.Timer [#10] locked on java.util.concurrent.locks.ReentrantLock$NonfairSync@779d1b9d (owned by jenkins.util.Timer [#7]):
              at sun.misc.Unsafe.park(Native Method)
              at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
              at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
              at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:870)
              at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1199)
              at java.util.concurrent.locks.ReentrantLock$NonfairSync.lock(ReentrantLock.java:209)
              at java.util.concurrent.locks.ReentrantLock.lock(ReentrantLock.java:285)
              at hudson.model.Queue._withLock(Queue.java:1437)
              at hudson.model.Queue.withLock(Queue.java:1300)
              at jenkins.model.Nodes.updateNode(Nodes.java:193)
              at jenkins.model.Jenkins.updateNode(Jenkins.java:2080)
              at hudson.model.Node.save(Node.java:140)
              at hudson.util.PersistedList.onModified(PersistedList.java:173)
              at hudson.util.PersistedList.replaceBy(PersistedList.java:85)
              at hudson.model.Slave.<init>(Slave.java:198)
              at hudson.plugins.ec2.EC2AbstractSlave.<init>(EC2AbstractSlave.java:136)
              at hudson.plugins.ec2.EC2OndemandSlave.<init>(EC2OndemandSlave.java:49)
              at hudson.plugins.ec2.EC2OndemandSlave.<init>(EC2OndemandSlave.java:42)
              at hudson.plugins.ec2.SlaveTemplate.newOndemandSlave(SlaveTemplate.java:954)
              at hudson.plugins.ec2.SlaveTemplate.toSlaves(SlaveTemplate.java:659)
              at hudson.plugins.ec2.SlaveTemplate.provisionOndemand(SlaveTemplate.java:631)
              at hudson.plugins.ec2.SlaveTemplate.provision(SlaveTemplate.java:462)
              at hudson.plugins.ec2.EC2Cloud.getNewOrExistingAvailableSlave(EC2Cloud.java:548)
              at hudson.plugins.ec2.EC2Cloud.provision(EC2Cloud.java:563)
              at hudson.slaves.NodeProvisioner$StandardStrategyImpl.apply(NodeProvisioner.java:715)
              at hudson.slaves.NodeProvisioner.update(NodeProvisioner.java:320)
              at hudson.slaves.NodeProvisioner.access$000(NodeProvisioner.java:61)
              at hudson.slaves.NodeProvisioner$NodeProvisionerInvoker.doRun(NodeProvisioner.java:809)
              at hudson.triggers.SafeTimerTask.run(SafeTimerTask.java:72)
              at jenkins.security.ImpersonatingScheduledExecutorService$1.run(ImpersonatingScheduledExecutorService.java:58)
              at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
              at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
              at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
              at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
              at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
              at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
              at java.lang.Thread.run(Thread.java:748)]]
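Reading the health check output, Timer [#7] appears to hold the queue's ReentrantLock while waiting for the AmazonEC2Cloud monitor, and Timer [#10] appears to hold that monitor while waiting for the queue lock, which is the usual shape of a lock-ordering deadlock. The sketch below is not the plugin's code and not its fix. It only illustrates the general remedy of taking two locks in one fixed order on every code path. `queueLock`, `cloudMonitor`, and the method names are hypothetical stand-ins.

```java
import java.lang.management.ManagementFactory;
import java.util.concurrent.locks.ReentrantLock;

class LockOrderSketch {

    // Hypothetical stand-ins for the two locks named in the trace.
    private static final ReentrantLock queueLock = new ReentrantLock();
    private static final Object cloudMonitor = new Object();
    private static int completed = 0;

    // Path 1: takes queueLock first, then cloudMonitor.
    static void retentionCheck() {
        queueLock.lock();
        try {
            synchronized (cloudMonitor) {
                completed++;
            }
        } finally {
            queueLock.unlock();
        }
    }

    // Path 2: same order as path 1. Swapping the order here would
    // recreate the cycle reported by the health check.
    static void provision() {
        queueLock.lock();
        try {
            synchronized (cloudMonitor) {
                completed++;
            }
        } finally {
            queueLock.unlock();
        }
    }

    // Runs both paths concurrently and returns how many calls finished.
    static int runBoth(int iterations) {
        completed = 0;
        Thread a = new Thread(() -> { for (int i = 0; i < iterations; i++) retentionCheck(); });
        Thread b = new Thread(() -> { for (int i = 0; i < iterations; i++) provision(); });
        a.start();
        b.start();
        try {
            a.join();
            b.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return completed;
    }

    // Asks the JVM whether any threads are deadlocked on monitors or
    // ownable synchronizers; findDeadlockedThreads returns null when none are.
    static boolean deadlockDetected() {
        return ManagementFactory.getThreadMXBean().findDeadlockedThreads() != null;
    }

    public static void main(String[] args) {
        System.out.println("completed calls: " + runBoth(1000));
        System.out.println("deadlock detected: " + deadlockDetected());
    }
}
```

With a consistent order, whichever thread gets the first lock can always go on to take the second, so no cycle can form. Whether that is the right remedy for the plugin depends on details the trace alone does not show.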

            oliverp Oliver Pereira added a comment -

            I have to restart Jenkins to clear up my queue.

            Going to roll back for now.

            thoulen FABRIZIO MANFREDI added a comment -

            Thanks, I am checking. I received another notification about similar issues with nodes hanging on the AWS side. Can you tell me the reason for the unhealthy status? Disk full? We have a loop in these corner cases.

            oliverp Oliver Pereira added a comment -

            It just waits for the servers to respond, but it works fine after rolling back to the PR252 version.

            tuky Tobias Krönke added a comment -

            Could someone confirm that this was fixed in 1.41? I made sure I added iam:ListInstanceProfilesForRole to the master's role policy. Then I terminated the old slave (only 1 currently). Building a job now correctly spins up a new instance (the cap for this AMI is set to 1). The instance is then stopped after 1 minute of idling. For the next build, it only seems to attempt to provision another machine (which of course fails because of the cap). See the attached ec2.log.
            paulbovbel Paul Bovbel added a comment -

            I believe I see the same issue as Tobias in 1.41

            thoulen FABRIZIO MANFREDI added a comment - - edited

            paulbovbel, can you check whether you have reached the cap? There is a bug in the EC2 plugin that counts stopped instances as running, which means that once you reach the cap it is not able to restart instances in the stopped state (duplicate of JENKINS-53920).
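The behaviour described in this thread (stopped instances should be resumed rather than replaced, and should not count against the cap) can be modelled in a few lines. This is a simplified, hypothetical model for illustration only; the `State` and `Action` enums and the `decide` method are invented here and do not correspond to the plugin's classes or to the actual fix.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

class CapDecisionSketch {

    // Hypothetical, simplified instance states and outcomes.
    enum State { RUNNING, STOPPED }
    enum Action { RESUME_STOPPED, PROVISION_NEW, NONE }

    // Decides what to do when a build needs a node.
    // Only RUNNING instances count toward the cap; a STOPPED instance
    // is resumed in preference to provisioning a new one.
    static Action decide(List<State> instances, int cap) {
        int running = 0;
        boolean hasStopped = false;
        for (State s : instances) {
            if (s == State.RUNNING) {
                running++;
            } else if (s == State.STOPPED) {
                hasStopped = true;
            }
        }
        if (running >= cap) {
            return Action.NONE; // cap reached by running instances
        }
        return hasStopped ? Action.RESUME_STOPPED : Action.PROVISION_NEW;
    }

    public static void main(String[] args) {
        // The scenario reported above: cap of 1 and a single stopped instance.
        System.out.println(decide(Arrays.asList(State.STOPPED), 1));
        System.out.println(decide(Collections.<State>emptyList(), 1));
        System.out.println(decide(Arrays.asList(State.RUNNING), 1));
    }
}
```

If stopped instances were counted as running, the first case would hit the cap and neither resume nor provision, which matches the symptom reported above.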

             

            paulbovbel Paul Bovbel added a comment -

            That sounds about right Fabrizio, I'll keep an eye on https://issues.jenkins-ci.org/browse/JENKINS-53920.


            People

              thoulen FABRIZIO MANFREDI
              brycedrennan Bryce Drennan
              Votes: 7
              Watchers: 9