JENKINS-33002: When the local job queue fills, the plugin incorrectly spawns unlimited workers even though the queued jobs can never be scheduled on those nodes.

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Component: amazon-ecs-plugin

      When the local job queue fills on the master, this plugin will start spinning up slaves. However, if those jobs have a restriction on where they can run, they will never be scheduled on those slaves.

      As a result, with 4 executors on the master and 6 queued jobs, this will spin up hundreds of slaves. The ECS cluster only has capacity for about 30 of them, so the excess slaves never actually spin up, leaving the node list in a mess for a while.

      Attached is an example of what this looks like. All of the scheduled jobs are restricted to run on the master, because they are building Docker containers.
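
      To see why the spun-up slaves never pick up the queued work, the label restriction on each queued item can be inspected from the Script Console; a minimal diagnostic sketch, not part of the plugin:

        import jenkins.model.Jenkins

        // Print each queued item together with the label expression it is restricted to.
        // Items pinned to the master (or to any label no ECS slave provides) will sit in
        // the queue no matter how many ECS slaves the plugin provisions.
        Jenkins.instance.queue.items.each { item ->
          println(item.task.name + ' -> restricted to label: ' + item.getAssignedLabel())
        }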

        1. Nodes__Jenkins_.png
          500 kB
          George Shammas
        2. screenshot-1.png
          81 kB
          Dima Shmakov


          George Shammas added a comment -

          For those who get hit with this and whose nodes aren't expiring, here is a quick Groovy script that will delete all offline ECS nodes. It should not delete any other slaves under any circumstance.

          To run it, just go to the /script page (the Script Console) of your Jenkins install.

          // Iterate over every slave known to this Jenkins master.
          for (aSlave in hudson.model.Hudson.instance.slaves) {
            println('====================');
            println('Name: ' + aSlave.name);
            // Only touch computers created by the amazon-ecs plugin, and only when they are offline.
            if (aSlave.getComputer().toString() =~ /amazonecs.ECSComputer@/ &&
                aSlave.getComputer().isOffline()) {
              println('Deleting node!!!!');
              // Take the node offline first so nothing new is scheduled on it, then delete it.
              aSlave.getComputer().setTemporarilyOffline(true, null);
              aSlave.getComputer().doDoDelete();
            } else {
              println('Not an offline ECS node');
            }
          }
          


          Dima Shmakov added a comment - edited

          Hello georgemb, thank you for this fix, it worked.
          I also noticed that while I had 200 Docker slaves cluttering my Jenkins, after running this only once (it deleted them all), they now start "expiring" after each build! This is SO weird. It didn't affect the plugin in any way, right? I don't get how the slave nodes started behaving properly now =)

          But what's more interesting, those Docker slave nodes were always stuck both in Jenkins AND on my ECS instances (stuck as 'stopped' containers), like on this screenshot I just took from my CLI history (before this Groovy run):

          And now those containers also disappear after the build finishes o_O, some kind of voodoo magic it is... I was sure this plugin had a bug (it cluttered instances with containers that don't "--rm" themselves). What could it be? I made no changes to my config anywhere, just tried this script.


          George Shammas added a comment -

          dima_shmakov The script only affects Jenkins and makes no calls to ECS. ECS will clear out the stopped Docker containers it starts after a period. I don't remember what the default is, but it is configurable. The dead containers sticking around shouldn't be a problem, but you can remove them manually if you want.
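
          For reference, the clean-up interval mentioned above is controlled by the ECS container agent on each instance; a minimal sketch, assuming the standard agent config file location and an illustrative 10-minute value:

            # /etc/ecs/ecs.config on the container instance
            # (restart the ECS agent for the change to take effect)
            ECS_ENGINE_TASK_CLEANUP_WAIT_DURATION=10m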


            Assignee: Nicolas De Loof (ndeloof)
            Reporter: George Shammas (georgemb)
            Votes: 0
            Watchers: 3
