JENKINS-33002: When the local job queue fills, the plugin incorrectly spawns unlimited workers even though the queued jobs can never be scheduled on those nodes.

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Component: amazon-ecs-plugin

      When the local job queue fills on the master, this plugin will start spinning up slaves. However, if those jobs have a restriction on where they can run, they will never be scheduled on those slaves.

      As a result, with 4 executors on the master and 6 queued jobs, this will spin up hundreds of slaves. The ECS cluster only has capacity for about 30 of them, so the excess slaves never actually spin up, leaving the node list in a mess for a while.

      Attached is an example of what this looks like. All of the scheduled jobs are restricted to run on the master, because they are building Docker containers.
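
      To see why the spun-up slaves never pick up the queued work, the label restriction on each queued item can be inspected from the Script Console; a minimal diagnostic sketch, not part of the plugin:

        import jenkins.model.Jenkins

        // Print each queued item together with the label expression it is restricted to.
        // Items pinned to the master (or to any label no ECS slave provides) will sit in
        // the queue no matter how many ECS slaves the plugin provisions.
        Jenkins.instance.queue.items.each { item ->
          println(item.task.name + ' -> restricted to label: ' + item.getAssignedLabel())
        }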

        1. Nodes__Jenkins_.png
          500 kB
          George Shammas
        2. screenshot-1.png
          81 kB
          Dima Shmakov


          George Shammas added a comment -

          For those who get hit with this and whose nodes aren't expiring, here is a quick Groovy script that will delete all offline ECS nodes. It should not delete any other slaves under any circumstance.

          To run it, just go to the /script page (the Script Console) of your Jenkins install.

          // Iterate over every slave known to this Jenkins master.
          for (aSlave in hudson.model.Hudson.instance.slaves) {
            println('====================');
            println('Name: ' + aSlave.name);
            // Only touch computers created by the amazon-ecs plugin, and only when they are offline.
            if (aSlave.getComputer().toString() =~ /amazonecs.ECSComputer@/ &&
                aSlave.getComputer().isOffline()) {
              println('Deleting node!!!!');
              // Take the node offline first so nothing new is scheduled on it, then delete it.
              aSlave.getComputer().setTemporarilyOffline(true, null);
              aSlave.getComputer().doDoDelete();
            } else {
              println('Not an offline ECS node');
            }
          }
          


          Dima Shmakov added a comment - edited

          Hello georgemb, thank you for this fix, it worked.
          I also noticed that while I had 200 Docker slaves cluttering my Jenkins, after running this only once (it deleted them all), they now start "expiring" after each build! This is SO weird. It didn't affect the plugin in any way, right? I don't get how the slave nodes started behaving properly now =)

          But what's more interesting, those Docker slave nodes were always stuck both in Jenkins AND on my ECS instances (stuck as 'stopped' containers), like on this screenshot I just took from my CLI history (before this Groovy run):

          And now those containers also disappear after the build finishes o_O, some kind of voodoo magic it is... I was sure this plugin had a bug (it cluttered instances with containers that don't "--rm" themselves). What could it be? I made no changes to my config anywhere, just tried this script.


          George Shammas added a comment -

          dima_shmakov The script only affects Jenkins and makes no calls to ECS. ECS will clear out the stopped Docker containers it starts after a period. I don't remember what the default is, but it is configurable. The dead containers sticking around shouldn't be a problem, but you can remove them manually if you want.
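
          For reference, the clean-up interval mentioned above is controlled by the ECS container agent on each instance; a minimal sketch, assuming the standard agent config file location and an illustrative 10-minute value:

            # /etc/ecs/ecs.config on the container instance
            # (restart the ECS agent for the change to take effect)
            ECS_ENGINE_TASK_CLEANUP_WAIT_DURATION=10m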


            Assignee: Nicolas De Loof (ndeloof)
            Reporter: George Shammas (georgemb)
            Votes: 0
            Watchers: 3
