Type: Bug
Resolution: Unresolved
Priority: Major
UPDATE 2016-08-31:
I saw it happen today and was able to reproduce it on a completely new Jenkins running the latest Docker image.
This is now easily reproducible for me after doing the following:
- Set up a fresh ECS cluster and add an instance to it.
- Set up a fresh Jenkins (2.7.2) using the latest Docker image. Choose to install the default plugins.
- After initial setup, install Amazon ECS plugin (1.5).
- Configure the plugin with valid credentials and a template for the default jnlp-slave image. I picked 256 MB for memory but it probably doesn't matter.
- Create a job that just runs "sleep 1000" and is set to allow concurrent builds. Do not restrict where the build can be executed.
- Start the job enough times to fill every executor on the master, and then one more time (three times in total by default).
- Watch the log (or your executor list) as the ECS plugin creates a slave, waits for it to connect, then creates another slave, waits for it to connect, and so on. The slaves never run the job, but as long as each new slave connects, the plugin keeps creating more. The slaves stay running indefinitely (as far as I know).
If you then stop the jobs, stop all the tasks via AWS, clear out all the now-disconnected nodes, restart Jenkins, and modify the job to be restricted to master, the issue goes away. Regardless, it is surely unexpected behavior for the plugin to keep creating unused slaves until the cluster runs out of resources.
ORIGINAL:
Occasionally, I'll notice in the Jenkins web interface that a bunch (6-12) of ECS slaves are online but not doing anything. They don't get pruned and don't stop on their own.
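For anyone who wants to check for this state without clicking through the web interface, here is a minimal sketch that lists nodes that are online but idle. It assumes the standard Jenkins JSON remote API at /computer/api/json and its "displayName", "idle", and "offline" fields; the sample data and the agent names in it are made up for illustration, and a secured Jenkins would also need authentication, which is not shown.

```python
import json
from urllib.request import urlopen


def fetch_computer_api(base_url):
    """Fetch the node list from the Jenkins JSON remote API (no auth handling)."""
    with urlopen(base_url.rstrip("/") + "/computer/api/json") as resp:
        return json.load(resp)


def idle_online_agents(computer_api, master_name="master"):
    """Return display names of nodes that are online but not running anything.

    master_name is an assumption: the built-in node is skipped by display name.
    """
    return [
        node["displayName"]
        for node in computer_api.get("computer", [])
        if node.get("idle")
        and not node.get("offline")
        and node.get("displayName") != master_name
    ]


# Hand-written sample in the shape of /computer/api/json (names are hypothetical).
sample = {
    "computer": [
        {"displayName": "master", "idle": False, "offline": False},
        {"displayName": "ecs-slave-a", "idle": True, "offline": False},
        {"displayName": "ecs-slave-b", "idle": True, "offline": True},
        {"displayName": "ecs-slave-c", "idle": False, "offline": False},
    ]
}
print(idle_online_agents(sample))  # ['ecs-slave-a']
```

Against a live instance, something like idle_online_agents(fetch_computer_api("http://localhost:8080")) should give the same kind of list; the URL is a placeholder.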
I've just been deleting them manually by clearing out all running tasks from ECS when I know no jobs are running, but we're hoping to implement autoscaling for our ECS slaves soon and that's not going to work well if we have a bunch of slaves taking up resources but not really being used.
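The manual cleanup could be scripted roughly as follows. This is only a sketch of the workaround, not a fix: it assumes a boto3 ECS client (its list_tasks and stop_task calls), and the cluster name in the usage comment is a placeholder. It stops every running task in the cluster, so it is only safe when no jobs are running.

```python
def running_task_arns(ecs_client, cluster):
    """Collect the ARNs of all RUNNING tasks in the cluster, following pagination."""
    arns = []
    kwargs = {"cluster": cluster, "desiredStatus": "RUNNING"}
    while True:
        page = ecs_client.list_tasks(**kwargs)
        arns.extend(page.get("taskArns", []))
        token = page.get("nextToken")
        if not token:
            return arns
        kwargs["nextToken"] = token


def stop_all_running_tasks(ecs_client, cluster, reason="Clearing idle Jenkins slaves"):
    """Stop every running task in the cluster and return the ARNs that were stopped."""
    arns = running_task_arns(ecs_client, cluster)
    for arn in arns:
        ecs_client.stop_task(cluster=cluster, task=arn, reason=reason)
    return arns


# Usage (cluster name is hypothetical; requires boto3 and AWS credentials):
#   import boto3
#   stop_all_running_tasks(boto3.client("ecs"), "jenkins-slaves")
```

The disconnected nodes would still have to be removed from Jenkins afterwards.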
It happened today when a job that allows concurrent builds was running twice at the same time. I'm not sure whether that's related. I'm certainly happy to do more troubleshooting, but I'm not sure exactly what to look for.
I've got the log for today here that shows it happening. The relevant area starts on line 77, when the first job kicks off (Jenkins had been idle prior to that). The jobs finished, but by line 421 the log reports that it's up to 13 computers. After that, there's basically a loop where they all drop at once and then reconnect.
Let me know if there's anything I can do to help.