-
New Feature
-
Resolution: Fixed
-
Major
-
Jenkins 2.18
Plugin v1.5
Ubuntu 14.04
Java 8u101
Running on an AWS EC2 c4.xlarge
-
An invalid list of Jenkins build agents that were created in ECS remains even after the job completes.
To reproduce:
- Configure the AWS ECS Plugin
- Configure a new freestyle job
- Restrict the job to the configured ECS cluster
- Build the job and observe completion (pass/fail state does not matter)
- Immediately build the job again
- Observe the completion and two offline nodes in the build executor list
[JENKINS-37597] ECS nodes not removed from build executor list
Is someone working on this? Plugin doesn't really seem production ready. Too many open bugs? No updates? Abandoned?
I too am seeing similar behaviour, with build executors gradually accumulating in an offline state. Jobs continue to execute okay, but this accumulation of offline executors is problematic, especially on a busy server.
Jenkins version 2.121 (based on the official jenkins/jenkins:alpine Docker image) with plugin version 1.14, JNLP slaves using the jenkinsci/jnlp-slave Docker image.
WARNING: jenkins.util.Timer [#3] for ci-jenkins-build-executors-42b6bb8c663f terminated
java.nio.channels.ClosedChannelException
    at org.jenkinsci.remoting.protocol.impl.ChannelApplicationLayer.onReadClosed(ChannelApplicationLayer.java:209)
    at org.jenkinsci.remoting.protocol.ApplicationLayer.onRecvClosed(ApplicationLayer.java:222)
    at org.jenkinsci.remoting.protocol.ProtocolStack$Ptr.onRecvClosed(ProtocolStack.java:832)
    at org.jenkinsci.remoting.protocol.FilterLayer.onRecvClosed(FilterLayer.java:287)
    at org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer.onRecvClosed(SSLEngineFilterLayer.java:181)
    at org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer.switchToNoSecure(SSLEngineFilterLayer.java:283)
    at org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer.processWrite(SSLEngineFilterLayer.java:503)
    at org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer.processQueuedWrites(SSLEngineFilterLayer.java:248)
    at org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer.doSend(SSLEngineFilterLayer.java:200)
    at org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer.doCloseSend(SSLEngineFilterLayer.java:213)
    at org.jenkinsci.remoting.protocol.ProtocolStack$Ptr.doCloseSend(ProtocolStack.java:800)
    at org.jenkinsci.remoting.protocol.ApplicationLayer.doCloseWrite(ApplicationLayer.java:173)
    at org.jenkinsci.remoting.protocol.impl.ChannelApplicationLayer$ByteBufferCommandTransport.closeWrite(ChannelApplicationLayer.java:314)
    at hudson.remoting.Channel.close(Channel.java:1450)
    at hudson.remoting.Channel.close(Channel.java:1403)
    at hudson.slaves.SlaveComputer.closeChannel(SlaveComputer.java:746)
    at hudson.slaves.SlaveComputer.kill(SlaveComputer.java:713)
    at hudson.model.AbstractCIBase.killComputer(AbstractCIBase.java:88)
    at hudson.model.AbstractCIBase.updateComputerList(AbstractCIBase.java:227)
    at jenkins.model.Jenkins.updateComputerList(Jenkins.java:1551)
    at jenkins.model.Nodes$6.run(Nodes.java:261)
    at hudson.model.Queue._withLock(Queue.java:1378)
    at hudson.model.Queue.withLock(Queue.java:1255)
    at jenkins.model.Nodes.removeNode(Nodes.java:252)
    at jenkins.model.Jenkins.removeNode(Jenkins.java:2065)
    at hudson.slaves.AbstractCloudSlave.terminate(AbstractCloudSlave.java:70)
    at com.cloudbees.jenkins.plugins.amazonecs.ECSSlave$1.check(ECSSlave.java:82)
    at com.cloudbees.jenkins.plugins.amazonecs.ECSSlave$1.check(ECSSlave.java:70)
    at hudson.slaves.ComputerRetentionWork$1.run(ComputerRetentionWork.java:72)
    at hudson.model.Queue._withLock(Queue.java:1378)
    at hudson.model.Queue.withLock(Queue.java:1255)
    at hudson.slaves.ComputerRetentionWork.doRun(ComputerRetentionWork.java:63)
    at hudson.triggers.SafeTimerTask.run(SafeTimerTask.java:72)
    at jenkins.security.ImpersonatingScheduledExecutorService$1.run(ImpersonatingScheduledExecutorService.java:58)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
WARNING: Computer.threadPoolForRemoting [#8281] for ci-jenkins-build-executors-429623dffe91 terminated
java.nio.channels.ClosedChannelException
    at org.jenkinsci.remoting.protocol.impl.ChannelApplicationLayer.onReadClosed(ChannelApplicationLayer.java:209)
    at org.jenkinsci.remoting.protocol.ApplicationLayer.onRecvClosed(ApplicationLayer.java:222)
    at org.jenkinsci.remoting.protocol.ProtocolStack$Ptr.onRecvClosed(ProtocolStack.java:832)
    at org.jenkinsci.remoting.protocol.FilterLayer.onRecvClosed(FilterLayer.java:287)
    at org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer.onRecvClosed(SSLEngineFilterLayer.java:181)
    at org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer.switchToNoSecure(SSLEngineFilterLayer.java:283)
    at org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer.processWrite(SSLEngineFilterLayer.java:503)
    at org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer.processQueuedWrites(SSLEngineFilterLayer.java:248)
    at org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer.doSend(SSLEngineFilterLayer.java:200)
    at org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer.doCloseSend(SSLEngineFilterLayer.java:213)
    at org.jenkinsci.remoting.protocol.ProtocolStack$Ptr.doCloseSend(ProtocolStack.java:800)
    at org.jenkinsci.remoting.protocol.ApplicationLayer.doCloseWrite(ApplicationLayer.java:173)
    at org.jenkinsci.remoting.protocol.impl.ChannelApplicationLayer$ByteBufferCommandTransport.closeWrite(ChannelApplicationLayer.java:314)
    at hudson.remoting.Channel.close(Channel.java:1450)
    at hudson.remoting.Channel.close(Channel.java:1403)
    at hudson.slaves.SlaveComputer.closeChannel(SlaveComputer.java:746)
    at hudson.slaves.SlaveComputer.access$800(SlaveComputer.java:99)
    at hudson.slaves.SlaveComputer$3.run(SlaveComputer.java:664)
    at jenkins.util.ContextResettingExecutorService$1.run(ContextResettingExecutorService.java:28)
    at jenkins.security.ImpersonatingExecutorService$1.run(ImpersonatingExecutorService.java:59)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
We see this issue when more than 30 agents are connected. It only affects ECS tasks; EC2 agents seem to spawn fine. I've also noted that only some of our ECS agent 'types' fail; other ECS agent types launch and connect fine on the same underlying ECS cluster.
Here's a cleanup script that may help those with busy servers.
import jenkins.model.Jenkins

// Only agents whose name contains this prefix are considered.
String agentPrefix = "example"
String agentList = ""
Integer agentTotal = 0
Integer ofType = 0
Integer ofDeleted = 0

// Iterate over all nodes
Jenkins.instance.nodes.each {
    //println "Checking agent: $it.nodeName"
    if (it.nodeName.contains(agentPrefix)) {
        //println it.computer.offlineCause.toString()
        // Null-safe: a node may have no computer, and an online computer has no offline cause
        if (it.computer?.offlineCause?.toString()?.contains('Time out for last 5 try')) {
            agentList += it.nodeName + "\n"
            it.computer.doDoDelete()
            ofDeleted += 1
        }
        ofType += 1
    }
    agentTotal += 1
}
println "Deleted Agent list: \n" + agentList
println "Total: ${agentTotal} Total type: ${ofType} Deleted: ${ofDeleted}"
So there are a couple of different issues where they don't get cleaned up automatically.
INFO: Created Slave: cidd-9bd0fafd9be3
INFO: Running task definition arn:aws:ecs:us-east-1:123456789012:task-definition/cidd-t2-small-generic:1 on slave cidd-9bd0fafd9be3
INFO: Slave cidd-9bd0fafd9be3 - Slave Task Started : arn:aws:ecs:us-east-1:123456789012:task/example-omited
INFO: ECS Slave cidd-9bd0fafd9be3 (ecs task arn:aws:ecs:us-east-1:123456789012:task/example-omited) connected
WARNING: Computer.threadPoolForRemoting [#1754] for cidd-9bd0fafd9be3 terminated
WARNING: Making cidd-9bd0fafd9be3 offline because it’s not responding
Even though it was terminated, the node/agent remained in the list.
INFO: Created Slave: cidd-9b44dbcdddde
INFO: Running task definition arn:aws:ecs:us-east-1:123456789012:task-definition/cidd-t2-small-generic:1 on slave cidd-9b44dbcdddde
INFO: Slave cidd-9b44dbcdddde - Slave Task Started : arn:aws:ecs:us-east-1:123456789012:task/example-omitted
INFO: ECS Slave cidd-9b44dbcdddde (ecs task arn:aws:ecs:us-east-1:123456789012:task/example-omitted) connected
WARNING: Making cidd-9b44dbcdddde offline temporarily due to the use of an old slave.jar
WARNING: Computer.threadPoolForRemoting [#1753] for cidd-9b44dbcdddde terminated
WARNING: Making cidd-9b44dbcdddde offline because it’s not responding
(the same "not responding" warning repeats about 30 times in total)
Similar issue: the node/agent was not removed from the list. However, I believe this node, at least, is using a newer version of remoting than the master.
Will downgrade this to match and try again.
Still an issue with Jenkins 2.60.3 and plugin version 1.11.
When launching a batch of parallel jobs, the plugin starts launching agents as expected. But when jobs are still queued after the first ones finish, the agents stay in the list as offline, even though the container tasks have been stopped and the containers deleted from the ECS cluster as expected. Because the plugin thinks the offline agents are using all available ECS CPU capacity, no new agents are launched, so the remaining jobs never run. This is a blocking bug for us: the plugin is virtually unusable, as we would need to keep deleting the offline agents manually.
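To illustrate the starvation described above, here is a minimal sketch of the accounting as I understand it from the symptoms. It is a simplified model and not the plugin's actual code: the class `EcsCapacityModel`, the `Agent` record, and the CPU figures (1024 units per agent, 4096 per cluster) are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/** Simplified model of the capacity starvation; NOT the plugin's real implementation. */
public class EcsCapacityModel {

    /** Hypothetical agent entry: name, reserved CPU units, and whether it is online. */
    static final class Agent {
        final String name;
        final int cpuUnits;
        final boolean online;

        Agent(String name, int cpuUnits, boolean online) {
            this.name = name;
            this.cpuUnits = cpuUnits;
            this.online = online;
        }
    }

    /** CPU units considered in use; stale offline entries are included when countOffline is true. */
    static int reservedCpu(List<Agent> agents, boolean countOffline) {
        int total = 0;
        for (Agent a : agents) {
            if (a.online || countOffline) {
                total += a.cpuUnits;
            }
        }
        return total;
    }

    /** True if one more agent of requestedCpu units fits into clusterCpu. */
    static boolean canLaunch(List<Agent> agents, int clusterCpu, int requestedCpu, boolean countOffline) {
        return reservedCpu(agents, countOffline) + requestedCpu <= clusterCpu;
    }

    public static void main(String[] args) {
        // Four stale offline entries whose ECS tasks are already gone (assumed figures).
        List<Agent> agents = new ArrayList<>();
        for (int i = 0; i < 4; i++) {
            agents.add(new Agent("stale-" + i, 1024, false));
        }
        // Counting the stale entries blocks any new launch on a 4096-unit cluster.
        System.out.println(canLaunch(agents, 4096, 1024, true));   // prints false
        // Ignoring (or deleting) them frees the capacity again.
        System.out.println(canLaunch(agents, 4096, 1024, false));  // prints true
    }
}
```

If the real accounting works anything like this, removing the stale offline entries should be enough to let queued jobs get agents again, which matches what we see after deleting them manually.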