JENKINS-25832

Launch multiple slaves in parallel for jobs with same node label


Details

    • Type: New Feature
    • Status: Resolved
    • Priority: Critical
    • Resolution: Fixed
    • Component/s: ec2-plugin
    • Labels: None

    Description

      I use the Build Flow plugin to kick off 200+ jobs. Each of these jobs has the same label. In this scenario, 200+ jobs are queued up. The ec2 plugin seems to process one job at a time, and this is a painfully slow process.

      It seems that the AWS Java SDK allows launching multiple EC2 instances in a single request. Can the ec2 plugin detect the number of jobs with the same label in the queue, and then launch that many EC2 instances in parallel?

      http://docs.aws.amazon.com/AWSJavaSDK/latest/javadoc/com/amazonaws/services/ec2/model/RunInstancesRequest.html#RunInstancesRequest%28java.lang.String,%20java.lang.Integer,%20java.lang.Integer%29

      private EC2AbstractSlave provisionOndemand(TaskListener listener) throws AmazonClientException, IOException {
          PrintStream logger = listener.getLogger();
          AmazonEC2 ec2 = getParent().connect();
          try {
              String msg = "Launching " + ami + " for template " + description;
              logger.println(msg);
              LOGGER.info(msg);
              KeyPair keyPair = getKeyPair(ec2);
              // Note: the min and max instance counts are hardcoded to 1, so each
              // provisioning call requests exactly one instance.
              RunInstancesRequest riRequest = new RunInstancesRequest(ami, 1, 1);
              // ... (snippet truncated)
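
      For illustration, a minimal sketch (not the plugin's actual code; the class name, helper name, and parameters are hypothetical) of how the same RunInstancesRequest constructor could ask EC2 for several instances in one call, e.g. sized to the number of queued jobs carrying the label:

      import com.amazonaws.AmazonClientException;
      import com.amazonaws.services.ec2.AmazonEC2;
      import com.amazonaws.services.ec2.model.Instance;
      import com.amazonaws.services.ec2.model.RunInstancesRequest;
      import com.amazonaws.services.ec2.model.RunInstancesResult;

      import java.util.List;

      class BatchProvisionSketch {
          // Hypothetical helper: request up to excessWorkload instances of the AMI
          // in a single RunInstances call instead of one instance per call.
          static List<Instance> provisionBatch(AmazonEC2 ec2, String ami, int excessWorkload)
                  throws AmazonClientException {
              // minCount = 1, maxCount = excessWorkload: EC2 launches as many
              // instances as it can, up to the requested maximum.
              RunInstancesRequest riRequest = new RunInstancesRequest(ami, 1, excessWorkload);
              RunInstancesResult result = ec2.runInstances(riRequest);
              // A single reservation covers the whole request; each Instance in it
              // would still need to be wrapped as a slave and attached to Jenkins.
              return result.getReservation().getInstances();
          }
      }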
      

      Attachments

        Activity

          arazauci arazauci created issue -
          lbeder Leonid Beder added a comment -

          Hi,
          I'm experiencing exactly the same issue, which prevents me from fully parallelizing my builds.

          Is there an expected fix or a workaround for this issue?

          I've tried the following, but nothing worked:
          1. Adding another label restriction to the job and adding another AMI with this label - still, every subsequent job was queued.
          2. Adding another label restriction to the job and adding that label as an additional label on the same AMI.

          I'd really appreciate your help,
          Leonid

          joshma Joshua Ma added a comment -

          Agreed! Right now I'm running a few permanent slaves to help with this, but it'd be great if the EC2 plugin's response scaled with the number of pending builds. Otherwise, it seems like it'll wait a few minutes to determine a new machine is needed, launch it, then wait a few more minutes to determine the next one is needed, and so on.

          krism82 Kris Massey added a comment -

          We're being hit by the same issue. Could this be made a configuration option? I can see why some people may want to start nodes sequentially to fully ensure the node limit is not exceeded; however, when you've got 10+ slaves, having to wait for each slave to start sequentially is extremely slow.

          vthakur Vikas Thakur added a comment -

          We were facing the same issue earlier, and then we tuned our JVM with these 2 settings:

          -Dhudson.model.LoadStatistics.decay=0.1
          -Dhudson.model.LoadStatistics.clock=1000

          and then this issue was gone


          maxdrib Maxwell Dribinsky added a comment -

          Vikas,

          How did you go about changing the -Dhudson.model.LoadStatistics.decay and -Dhudson.model.LoadStatistics.clock parameters? Does this require a change in the source code, or is there an easier way to set it?

          vthakur Vikas Thakur added a comment -

          These are JVM settings; they don't require a code change.

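          As an illustration only (exact paths vary by installation): these are startup arguments for the Jenkins master's JVM, not plugin options. On Debian/Ubuntu package installs they are typically appended to the JAVA_ARGS line in /etc/default/jenkins, and on Windows to the <arguments> element of jenkins.xml, followed by a Jenkins restart. For example:

          JAVA_ARGS="-Djava.awt.headless=true -Dhudson.model.LoadStatistics.decay=0.1 -Dhudson.model.LoadStatistics.clock=1000"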
          cheecheeo J C added a comment -

          To add my experience: modifying the LoadStatistics parameters did not help for bursty queues. Looking at the (quite complicated) LoadStatistics code, maybe these parameters help over a longer period of time. I ended up writing a shell script that uses Jenkins' REST API to look at the queue length and the number of offline agents, and then provisions nodes immediately.

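          A rough sketch of that kind of external watcher (written in Java here rather than shell, purely for illustration; the URL, JSON field handling, and thresholds are placeholders, not part of the plugin or its API):

          import java.io.IOException;
          import java.io.InputStream;
          import java.net.HttpURLConnection;
          import java.net.URL;
          import java.util.Scanner;

          public class QueueWatcher {
              // Fetch a Jenkins REST API endpoint and return the raw JSON text.
              static String get(String url) throws IOException {
                  HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
                  try (InputStream in = conn.getInputStream();
                       Scanner s = new Scanner(in, "UTF-8").useDelimiter("\\A")) {
                      return s.hasNext() ? s.next() : "";
                  }
              }

              public static void main(String[] args) throws Exception {
                  String jenkins = "http://jenkins.example.com";        // placeholder URL
                  String queueJson = get(jenkins + "/queue/api/json");
                  // Crude queue-length estimate: count queued items by their "why" field.
                  int queued = queueJson.split("\"why\"", -1).length - 1;
                  System.out.println("Items waiting in queue: " + queued);
                  // A real watcher would also query /computer/api/json for idle or
                  // offline executors and trigger provisioning when the queue length
                  // exceeds the available capacity.
              }
          }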
          maxdrib Maxwell Dribinsky added a comment - - edited

          Thank you all. Changing the JVM parameters worked for me. For future reference, instructions on setting JVM options for Jenkins can be found here: http://stackoverflow.com/questions/14762162/how-do-i-give-jenkins-more-heap-space-when-its-running-as-a-daemon-on-ubuntu. Now if there are multiple jobs in the queue, Jenkins will boot up a new instance, and once it boots, it will re-evaluate the queue. If more jobs remain, it will immediately boot up another instance.

          I would also like to note that the link below suggests NOT using LoadStatistics.clock values of less than 3000. I was able to launch new slaves within ~15 seconds using a clock value of 3000:
          https://cloudbees.zendesk.com/hc/en-us/articles/204690520-Why-do-slaves-show-as-suspended-while-jobs-wait-in-the-queue-

          vthakur Vikas Thakur made changes -
          Field Original Value New Value
          Resolution Fixed [ 1 ]
          Status Open [ 1 ] Resolved [ 5 ]

          maxdrib Maxwell Dribinsky added a comment -

          The workaround doesn't launch slaves for every queued job at once; it still launches them one at a time, just slightly faster. The original issue persists.

          maxdrib Maxwell Dribinsky made changes -
          Resolution Fixed [ 1 ]
          Status Resolved [ 5 ] Reopened [ 4 ]
          rtyler R. Tyler Croy made changes -
          Workflow JNJira [ 159813 ] JNJira + In-Review [ 186249 ]
          vintrojan Vinay Sharma added a comment -

          vthakur: Hi,
          Regarding the JVM settings:
          -Dhudson.model.LoadStatistics.decay=0.1
          -Dhudson.model.LoadStatistics.clock=3000

          Did you add these JVM settings in the jenkins.xml file or in the EC2 plugin configuration (textbox) in Jenkins?

          timguy2 Tom Lachner made changes -
          Attachment jenkins_18minForNewSlave.png [ 37013 ]
          timguy2 Tom Lachner added a comment -

          Hi,

          I made the suggested JVM changes (vintrojan: I did it in /etc/init.d/jenkins - there is no GUI for this), but it didn't help me.

          So like user "J C" I didn't have luck.

          It takes 18 minutes to get a new slave up and running.

          Grey is the queue and red is the build processors, so I would like to see the red line follow the grey line very quickly.

          See the attached picture.

          Any suggestions for other LoadStatistics values to try?

          jimilian Alexander A added a comment -

          The original issue was fixed in https://github.com/jenkinsci/ec2-plugin/pull/217

          Now you are facing another issue that was introduced after 1.29 (I can make this assumption based on the fact that we are using 1.29 plus the fix from this PR, and everything works fine even with the default LoadStatistics parameters: 40+ new nodes can be brought up in 5 minutes).


          pashastryhelski Pavel Stryhelski added a comment -

          Hi, we are facing the same issue. With 6 jobs in parallel, the last job starts ~50 minutes after the main build is run.


          luispiedra Luis Piedra-Márquez added a comment -

          Hi,

          The plugin is designed to launch instances until the excess workload is satisfied, but it seems this behavior was broken at some point.

          I have filed a pull request with a fix:

          https://github.com/jenkinsci/ec2-plugin/pull/241

          It's working fine for me.


          gtunon Guiomar Tuñón added a comment -

          Hi, is there any news on this?

          We have a critical testing process that consumes quite a lot of time, and parallel slaves are mandatory for it.

          jimilian Alexander A added a comment -

          gtunon, see https://github.com/jenkinsci/ec2-plugin/pull/252

          raihaan Raihaan Shouhell added a comment -

          This has already been fixed in 1.40.

          raihaan Raihaan Shouhell made changes -
          Resolution Fixed [ 1 ]
          Status Reopened [ 4 ] Resolved [ 5 ]

          People

            Assignee:
            francisu Francis Upton
            Reporter:
            arazauci arazauci
            Votes:
            16
            Watchers:
            27
