Type: New Feature
Resolution: Fixed
Priority: Major
Environment: Jenkins 2.18, Plugin v1.5, Ubuntu 14.04, Java 8u101, running on an AWS EC2 c4.xlarge
Stale Jenkins build agents that were created in ECS remain in the build executor list even after the job completes.
To reproduce:
- Configure the AWS ECS Plugin
- Configure a new freestyle job
- Restrict the job to the configured ECS cluster
- Build the job and observe completion (pass/fail state does not matter)
- Immediately build the job again
- Observe the completion and two offline nodes in the build executor list
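The lingering agents in the last step can also be enumerated from the Jenkins script console. The sketch below is illustrative only; the `ecs-` name prefix is an assumption and should be replaced with whatever label prefix your ECS agent template uses.

```groovy
// Sketch: list agents that are offline but still registered with the master.
// Run from Manage Jenkins > Script Console. The "ecs-" prefix is a
// hypothetical placeholder for your ECS template's agent name prefix.
import jenkins.model.Jenkins

Jenkins.instance.nodes.each { node ->
    def computer = node.toComputer()
    // toComputer() can return null for nodes with no executor slot yet
    if (node.nodeName.startsWith("ecs-") && computer != null && computer.offline) {
        println "Lingering offline agent: ${node.nodeName} (cause: ${computer.offlineCause})"
    }
}
```

After reproducing the issue, the two completed builds should each leave one entry in this listing.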
[JENKINS-37597] ECS nodes not removed from build executor list
Nope. I gave up, didn't use ECS, and just ran my own Docker host instead. Sorry. It would be great if this worked, because I would love to defer this to ECS.
Me too. But I'm using Docker Swarm plus a Docker registry, and it works much like ECS. I've found a script for Jenkins that could be useful for deleting offline nodes, but that's not the right way. Waiting for this feature in the Jenkins plugin.
I think this issue is related to Jenkins's JNLP slave functionality. I tried to use JNLP slaves with a Docker Swarm cluster and hit the same issue.
I'm able to reproduce this problem with plugin version 1.6, but only when build duration is less than 5 seconds or so. Longer builds result in the node being removed after build completion.
Perhaps there is a race condition that causes node removal to fail if a build finishes before the node it runs on is completely registered, or something along those lines.
Hmm, this is interesting
I updated to Jenkins 2.41 this morning, one of the 'enhancements' of which in that release is JNLP4 for all agents (https://issues.jenkins-ci.org/browse/JENKINS-40886)
As soon as I updated, any agents created by the ECS plugin were left in an idle state after their builds. I occasionally saw them go into a suspended state, but then they would come back out and go idle again.
Because all the ECS cluster resources were in use, no more containers would spawn.
Rolling back to 2.40 immediately corrected the issue.
Still an issue with Jenkins 2.60.3 and plugin version 1.11.
When launching a bunch of parallel jobs, I see the plugin start launching agents as expected. But since there are still jobs in the queue after the first jobs finish, the agents stay on the list as offline, even though the container tasks are stopped and the containers deleted from the ECS cluster as expected. And because the plugin thinks the offline agents are consuming all available ECS CPU capacity, no new agents are launched, so the rest of the jobs never get run. This is a blocking bug for us; the plugin is virtually unusable, as we would need to constantly delete the offline agents manually.
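The mismatch described above (Jenkins showing offline agents while ECS has already reaped the containers) can be confirmed from the ECS side. A sketch, assuming the AWS CLI is configured; the cluster name is a hypothetical placeholder, and a canned sample response stands in for the live call below:

```shell
# Live check (requires AWS credentials; cluster name is hypothetical):
#   aws ecs list-tasks --cluster my-jenkins-cluster --desired-status RUNNING
# Below, a sample response is filtered the same way the live output would be.
cat <<'EOF' > /tmp/tasks.json
{"taskArns": []}
EOF
# An empty taskArns list means ECS has already stopped and removed the
# containers, so any offline agents still shown in Jenkins are stale entries.
python3 -c 'import json; print(len(json.load(open("/tmp/tasks.json"))["taskArns"]))'
```

If this prints 0 while Jenkins still lists offline ECS agents, the plugin's capacity accounting is working from stale state.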
Is someone working on this? Plugin doesn't really seem production ready. Too many open bugs? No updates? Abandoned?
I too am seeing something similar, with build executors gradually accumulating in an offline state. Jobs continue to execute okay, but this accumulation of offline executors is problematic, especially on a busy server.
Jenkins version 2.121 (based on the official jenkins/jenkins:alpine Docker image) with plugin version 1.14, JNLP slaves using the jenkinsci/jnlp-slave Docker image.
WARNING: jenkins.util.Timer [#3] for ci-jenkins-build-executors-42b6bb8c663f terminated
java.nio.channels.ClosedChannelException
	at org.jenkinsci.remoting.protocol.impl.ChannelApplicationLayer.onReadClosed(ChannelApplicationLayer.java:209)
	at org.jenkinsci.remoting.protocol.ApplicationLayer.onRecvClosed(ApplicationLayer.java:222)
	at org.jenkinsci.remoting.protocol.ProtocolStack$Ptr.onRecvClosed(ProtocolStack.java:832)
	at org.jenkinsci.remoting.protocol.FilterLayer.onRecvClosed(FilterLayer.java:287)
	at org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer.onRecvClosed(SSLEngineFilterLayer.java:181)
	at org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer.switchToNoSecure(SSLEngineFilterLayer.java:283)
	at org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer.processWrite(SSLEngineFilterLayer.java:503)
	at org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer.processQueuedWrites(SSLEngineFilterLayer.java:248)
	at org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer.doSend(SSLEngineFilterLayer.java:200)
	at org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer.doCloseSend(SSLEngineFilterLayer.java:213)
	at org.jenkinsci.remoting.protocol.ProtocolStack$Ptr.doCloseSend(ProtocolStack.java:800)
	at org.jenkinsci.remoting.protocol.ApplicationLayer.doCloseWrite(ApplicationLayer.java:173)
	at org.jenkinsci.remoting.protocol.impl.ChannelApplicationLayer$ByteBufferCommandTransport.closeWrite(ChannelApplicationLayer.java:314)
	at hudson.remoting.Channel.close(Channel.java:1450)
	at hudson.remoting.Channel.close(Channel.java:1403)
	at hudson.slaves.SlaveComputer.closeChannel(SlaveComputer.java:746)
	at hudson.slaves.SlaveComputer.kill(SlaveComputer.java:713)
	at hudson.model.AbstractCIBase.killComputer(AbstractCIBase.java:88)
	at hudson.model.AbstractCIBase.updateComputerList(AbstractCIBase.java:227)
	at jenkins.model.Jenkins.updateComputerList(Jenkins.java:1551)
	at jenkins.model.Nodes$6.run(Nodes.java:261)
	at hudson.model.Queue._withLock(Queue.java:1378)
	at hudson.model.Queue.withLock(Queue.java:1255)
	at jenkins.model.Nodes.removeNode(Nodes.java:252)
	at jenkins.model.Jenkins.removeNode(Jenkins.java:2065)
	at hudson.slaves.AbstractCloudSlave.terminate(AbstractCloudSlave.java:70)
	at com.cloudbees.jenkins.plugins.amazonecs.ECSSlave$1.check(ECSSlave.java:82)
	at com.cloudbees.jenkins.plugins.amazonecs.ECSSlave$1.check(ECSSlave.java:70)
	at hudson.slaves.ComputerRetentionWork$1.run(ComputerRetentionWork.java:72)
	at hudson.model.Queue._withLock(Queue.java:1378)
	at hudson.model.Queue.withLock(Queue.java:1255)
	at hudson.slaves.ComputerRetentionWork.doRun(ComputerRetentionWork.java:63)
	at hudson.triggers.SafeTimerTask.run(SafeTimerTask.java:72)
	at jenkins.security.ImpersonatingScheduledExecutorService$1.run(ImpersonatingScheduledExecutorService.java:58)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
WARNING: Computer.threadPoolForRemoting [#8281] for ci-jenkins-build-executors-429623dffe91 terminated
java.nio.channels.ClosedChannelException
	at org.jenkinsci.remoting.protocol.impl.ChannelApplicationLayer.onReadClosed(ChannelApplicationLayer.java:209)
	at org.jenkinsci.remoting.protocol.ApplicationLayer.onRecvClosed(ApplicationLayer.java:222)
	at org.jenkinsci.remoting.protocol.ProtocolStack$Ptr.onRecvClosed(ProtocolStack.java:832)
	at org.jenkinsci.remoting.protocol.FilterLayer.onRecvClosed(FilterLayer.java:287)
	at org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer.onRecvClosed(SSLEngineFilterLayer.java:181)
	at org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer.switchToNoSecure(SSLEngineFilterLayer.java:283)
	at org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer.processWrite(SSLEngineFilterLayer.java:503)
	at org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer.processQueuedWrites(SSLEngineFilterLayer.java:248)
	at org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer.doSend(SSLEngineFilterLayer.java:200)
	at org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer.doCloseSend(SSLEngineFilterLayer.java:213)
	at org.jenkinsci.remoting.protocol.ProtocolStack$Ptr.doCloseSend(ProtocolStack.java:800)
	at org.jenkinsci.remoting.protocol.ApplicationLayer.doCloseWrite(ApplicationLayer.java:173)
	at org.jenkinsci.remoting.protocol.impl.ChannelApplicationLayer$ByteBufferCommandTransport.closeWrite(ChannelApplicationLayer.java:314)
	at hudson.remoting.Channel.close(Channel.java:1450)
	at hudson.remoting.Channel.close(Channel.java:1403)
	at hudson.slaves.SlaveComputer.closeChannel(SlaveComputer.java:746)
	at hudson.slaves.SlaveComputer.access$800(SlaveComputer.java:99)
	at hudson.slaves.SlaveComputer$3.run(SlaveComputer.java:664)
	at jenkins.util.ContextResettingExecutorService$1.run(ContextResettingExecutorService.java:28)
	at jenkins.security.ImpersonatingExecutorService$1.run(ImpersonatingExecutorService.java:59)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
Seeing this issue occur when we have more than 30 agents connected; it's only affecting ECS tasks, while EC2 agents seem to spawn fine. What I've noted is that only some of the ECS agent 'types' we have fail; other ECS agent types seem to launch and connect fine on the same underlying ECS cluster.
Here's a cleanup script that may help those with busy servers.
String agentList = ""
String agentPrefix = "example"
Integer agentTotal = 0
Integer ofType = 0
Integer ofDeleted = 0
// Iterate over all nodes
Jenkins.instance.nodes.each {
    //println "Checking agent: $it.nodeName"
    if (it.nodeName.contains(agentPrefix)) {
        // offlineCause is null for healthy nodes, so guard with ?.
        if (it.computer.offlineCause?.toString()?.contains('Time out for last 5 try')) {
            agentList += it.nodeName + "\n"
            it.computer.doDoDelete()
            ofDeleted += 1
        }
        ofType += 1
    }
    agentTotal += 1
}
println "Deleted agent list:\n" + agentList
println "Total: ${agentTotal} Total type: ${ofType} Deleted: ${ofDeleted}"
So there are a couple of different issues where they don't get cleaned up automatically.
INFO: Created Slave: cidd-9bd0fafd9be3
INFO: Running task definition arn:aws:ecs:us-east-1:123456789012:task-definition/cidd-t2-small-generic:1 on slave cidd-9bd0fafd9be3
INFO: Slave cidd-9bd0fafd9be3 - Slave Task Started : arn:aws:ecs:us-east-1:123456789012:task/example-omited
INFO: ECS Slave cidd-9bd0fafd9be3 (ecs task arn:aws:ecs:us-east-1:123456789012:task/example-omited) connected
WARNING: Computer.threadPoolForRemoting [#1754] for cidd-9bd0fafd9be3 terminated
WARNING: Making cidd-9bd0fafd9be3 offline because it’s not responding
Even though it was terminated, the node/agent remained in the list.
INFO: Created Slave: cidd-9b44dbcdddde
INFO: Running task definition arn:aws:ecs:us-east-1:123456789012:task-definition/cidd-t2-small-generic:1 on slave cidd-9b44dbcdddde
INFO: Slave cidd-9b44dbcdddde - Slave Task Started : arn:aws:ecs:us-east-1:123456789012:task/example-omitted
INFO: ECS Slave cidd-9b44dbcdddde (ecs task arn:aws:ecs:us-east-1:123456789012:task/example-omitted) connected
WARNING: Making cidd-9b44dbcdddde offline temporarily due to the use of an old slave.jar
WARNING: Computer.threadPoolForRemoting [#1753] for cidd-9b44dbcdddde terminated
WARNING: Making cidd-9b44dbcdddde offline because it’s not responding
(the last warning repeated approximately 30 more times)
Similar issue: the node/agent was not removed from the list. But I believe at least this node is using a newer version of remoting than the master.
Will downgrade this to match and try again.
Have the same issue. ericgoedtel, have you found any workaround?