Jenkins / JENKINS-47724

Pipeline with parallel jobs performance issue with EC2 slaves

      Description

We are in the process of converting our freestyle matrix jobs to parallel pipeline, and we are seeing a few performance issues with pipeline. We can reproduce the issues using sample pipeline code very similar to JENKINS-45553. We also use the EC2 plugin to start and stop slaves. In all of the testing below, the slaves were started outside of the pipeline jobs so that slave startup time would not skew the results.

      Parallel Pipeline code:

      def stepsForParallel = [:]
      for (int i = 0; i < 150; i++) {
        def s = "subjob_${i}"
        stepsForParallel[s] = {
          node("JENKINS-SLAVE-NODE-LABEL") {
            sh '''
            date +%c
            '''
          }
        }
      }
      timestamps {
        parallel stepsForParallel
      }
      

      1. Freestyle Matrix vs. Parallel Pipeline

      Using 16 slave VMs:
      I have a freestyle matrix axis job that runs the date command on slaves across 355 slave nodes. This build took 28 seconds to run after slave startup and around 23 seconds thereafter.

      Using the sample parallel pipeline above with 355 branches, it takes 4 minutes 49 seconds to complete after the slaves are started, and around 2 minutes 47 seconds thereafter.

      I'm unsure why the time discrepancy is so large when both jobs are performing the same work with the same output.
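A back-of-the-envelope comparison of the steady-state numbers above (my arithmetic, not a figure from any Jenkins tooling): matrix finishes 355 branches in 23 s, pipeline in 2m 47s (167 s).

```shell
# Extra wall-clock overhead the pipeline build adds per branch, using the
# steady-state figures quoted above (matrix: 23 s, pipeline: 167 s, 355 branches).
awk 'BEGIN { printf "%.2f s of extra time per branch\n", (167 - 23) / 355 }'
```

Roughly 0.4 s of serialized overhead per branch would be enough to account for the whole gap.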

      2. The first parallel pipeline job on a slave is slower

      I'm noticing that after the 16 slaves are started, the first pipeline job that runs takes a lot longer than subsequent runs. See below.
      First build after slaves started: 4m 49s
      Subsequent runs: 2m 47s, 2m 44s, 2m 43s, 2m 32s
      After the Jenkins slaves were stopped and then started again (without restarting Jenkins), the parallel pipeline job took 6m 36s to complete.
      With extra debugging enabled I saw the following.

      Pipeline performance after slave restart:

      Slaves started: 4m 49s
      timings for Owner[JOBNAME/34:JOBNAME #34]:
      {classLoad=79ms, flowNode=29209ms, parse=49ms, run=287301ms, saveProgram=9880ms}

      Build rerun: 2m 44s
      timings for Owner[JOBNAME/35:JOBNAME #35]:
      {classLoad=8ms, flowNode=34032ms, parse=21ms, run=162601ms, saveProgram=9192ms}

      Slaves restarted: 5m 4s
      timings for Owner[JOBNAME/36:JOBNAME #36]:
      {classLoad=108ms, flowNode=33247ms, parse=71ms, run=301825ms, saveProgram=8836ms}
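To see where the first-build time goes, the timing fields can be expressed as a share of the reported run time (my arithmetic on the quoted numbers, nothing more):

```shell
# Share of the first build's run time (287301 ms) spent creating flow nodes
# and persisting program state, per the timings quoted above.
awk 'BEGIN {
  run = 287301; flowNode = 29209; saveProgram = 9880
  printf "flowNode: %.1f%% of run, saveProgram: %.1f%% of run\n",
         100 * flowNode / run, 100 * saveProgram / run
}'
```

Flow-node bookkeeping and program persistence together account for well over 10% of the first build, and the flowNode cost stays high on subsequent builds too.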

      I tried first running a freestyle job and then a parallel pipeline job after slaves were restarted but it didn't help. I'm unsure why there is a performance penalty for being the first pipeline job to run on a slave after it is started.

      3. Parallel pipeline job appears to pause when starting on slaves

      This is a major issue for us. When running a parallel pipeline job, I can see on the left nav that the slaves each show part of the pipeline job, along with a growing queue. The job appears to hang or pause for a significantly long time. I verified that the slaves are idle and not doing anything, and the Jenkins master shows little CPU and disk I/O load. The issue seems to get worse as the slave count increases.

      ||Duration||Branches||Slave VMs||
      |14m|100|100|
      |3m|100|16|
      |1m 13s|100|4|
      |29m|1000|16|
      |34m|1000|4|
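Normalizing the 100-branch rows above to seconds per branch makes the inverse scaling explicit (my arithmetic on the table values):

```shell
# Wall-clock seconds per branch for the 100-branch runs: more slave VMs
# should mean less time per branch, but the measured trend is the opposite.
awk 'BEGIN {
  printf "100 VMs: %.2f s/branch\n", (14 * 60) / 100   # 14m
  printf " 16 VMs: %.2f s/branch\n", (3 * 60) / 100    # 3m
  printf "  4 VMs: %.2f s/branch\n", 73 / 100          # 1m 13s
}'
```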

      I would expect parallel pipeline jobs to run faster with more slaves, not slower. I'm attaching thread dumps that I took during the pause.

      Log files during the pause period:

      In the debug log, the following line repeats slowly:

      Oct 30, 2017 2:05:30 AM FINEST org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep

      JENKINS-34021: DurableTaskStep.Execution.listener present in CpsStepContext[471:sh]:Owner[JOBNAME #83]

      In the console log I see the following. Note that a two-minute pause occurs after the first line when watching the console log in real time:

      02:05:35 [subjob_149] Waiting for next available executor on JENKINS-SLAVE-LABEL
      02:05:30 [subjob_0] + date +%c
      02:05:39 [subjob_6] + date +%c
      02:05:49 [subjob_5] + date +%c
      02:05:58 [subjob_1] + date +%c
      02:06:07 [subjob_2] + date +%c
      02:06:16 [subjob_4] + date +%c
      02:06:25 [subjob_3] + date +%c
      02:06:34 [subjob_10] + date +%c
      02:06:44 [subjob_7] + date +%c
      02:06:53 [subjob_8] + date +%c
      02:07:02 [subjob_15] + date +%c
      02:07:11 [subjob_12] + date +%c
      02:07:20 [subjob_9] + date +%c
      02:07:29 [subjob_11] + date +%c
      02:07:39 [subjob_13] + date +%c
      02:07:48 [subjob_14] + date +%c
      ..

      Notes:

      I ran the sample parallel pipeline code on the Jenkins master with 100 branches and 100 executors and couldn't reproduce the problem. I also ran the sample pipeline code with 'echo Hello' instead of 'sh "date +%c"' and again couldn't reproduce the problem. Adding additional sh commands to run on the slaves didn't add significant time to the builds. I'm not sure whether this issue is specific to EC2 plugin slaves. The slaves were started and fully running before any of the above tests were performed, to minimize EC2 plugin involvement.

      Tickets that may be related:
      JENKINS-45553
      JENKINS-33761
      JENKINS-39489
      JENKINS-47170

       

      4. High CPU on Jenkins Masters

      I'm seeing higher CPU usage on our Jenkins masters that run parallel pipeline code compared to freestyle. In addition, the more parallel branches that are run, the higher the load. I tried to quantify the differences in the table below using a freestyle matrix job and a parallel pipeline job that perform nearly the same work, on Jenkins master instances sized at 2 CPUs / 8 GB RAM.

      ||Type||Number of branches||Number of slave agents||Peak 1m load avg||Peak 5m load avg||15m load avg when job finished||
      |Freestyle|259|55|2.0|0.53|0.08|
      |Pipeline|259|60|3.19|1.7|0.85|
      |Pipeline|259|120|4.38|1.7|0.89|
      |Pipeline|259|190|4.34|2.03|1.06|
      |Pipeline|355|190|6.26|2.12|0.98|

      In summary, when using parallel pipeline I see higher load averages on the Jenkins masters as the number of branches and slave agents increases, even though all of the work runs on the slave agents. After the initial peak, matrix freestyle jobs generate a fraction of the Jenkins master load that parallel pipeline does.

      UPDATE: Running two parallel pipeline jobs with a total of 110 slave agents pushed the Jenkins master instance to a peak 1m load avg of 10.77 and a peak 5m load avg of 7.18.
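For context (my arithmetic, assuming the 2-CPU master size stated above): dividing the peak 1-minute load averages by the core count gives a rough oversubscription factor.

```shell
# Peak 1m load average relative to the master's 2 cores, for the worst cases above.
awk 'BEGIN {
  cores = 2
  printf "Pipeline, 355 branches x 190 agents: %.2fx cores\n", 6.26 / cores
  printf "Two pipeline jobs, 110 agents:       %.2fx cores\n", 10.77 / cores
}'
```

A sustained load of 3-5x the core count on a box that is only orchestrating remote work suggests the master itself is the bottleneck.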

       

        Attachments

          Activity

          mkozell Mike Kozell created issue -
          mkozell Mike Kozell made changes -
          Field Original Value New Value
          Description We are in the process of converting our freestyle matrix jobs over to parallel pipeline and we are seeing a few performance issues when using pipeline. We are able to reproduce the issue using sample pipeline code that is very similar to ---JENKINS-45553---. We are also using the EC2 plugin to start and stop slaves after idle timeout. In all of my testing below, the slaves were started outside of pipeline jobs to prevent slave startup time from skewing any results.
          h2. Parallel Pipeline code:
          {code:java}
          def stepsForParallel = [:]
          for (int i = 0; i < 150; i++) {
            def s = "subjob_${i}"
            stepsForParallel[s] = {
              node("JENKINS-SLAVE-NODE-LABEL") {
                sh '''
                date +%c
                '''
              }
            }
          }
          timestamps {
          parallel stepsForParallel
          }
          {code}
          h2. 1. Freestyle Matrix VS Parallel Pipeline

          Using 16 slave VMs:
           I have a freestyle matrix axis job that runs the date command on slaves across 355 slave nodes. This build took 28 seconds to run after slave startup and around 23 seconds thereafter.

          Using the sample parallel pipeline above with 355 branches, it takes 4 mins and 49 seconds to complete after the slaves are started and then around 2 minus 47 seconds thereafter.

          I'm unsure why the time discrepancy is so large when both jobs are performing the same work with the same output.
          h2. 2. The first parallel pipeline job on a slave is slower

          I'm noticing that after the 16 slaves are started, the first pipeline job that runs takes a lot longer then subsequent runs. See below.
           *First build after slaves started:* 4m 49s
           *Subsequent runs:* 2m 47s, 2m 44s, 2m 43s, 2m 32s
           After Jenkins slaves were stopped and then started without restarting Jenkins
           the parallel pipeline job took 6m 36s to complete.
           With extra debugging I did see this.
          {panel:title=pipeline performance after slave restart}
          _Slaves started:_
           4m 49s
           timings for Owner[JOBNAME/34:JOBNAME #34|#34]:
           Unknown macro: \{classLoad=79ms, flowNode=29209ms, parse=49ms, run=287301ms, saveProgram=9880ms}
           _Build Reran:_
           2m 44s
           timings for Owner[JOBNAME/35:JOBNAME #35|#35]:
           Unknown macro: \{classLoad=8ms, flowNode=34032ms, parse=21ms, run=162601ms, saveProgram=9192ms}
           _Slaves restarted:_
           5m 4s
           timings for Owner[JOBNAME/36:JOBNAME #36|#36]:
           Unknown macro: \{classLoad=108ms, flowNode=33247ms, parse=71ms, run=301825ms, saveProgram=8836ms}
          {panel}
          I tried first running a freestyle job and then a parallel pipeline job after slaves were restarted but it didn't help. I'm unsure why there is a performance penalty for being the first pipeline job to run on a slave after it is started.
          h2. 3. Parallel pipeline job appears to pause when starting on slaves

          This is a major issue for us. When running a parallel pipeline job, I can see the slaves on the left nav all show part of the pipeline job and an increasing queue. This process appears to hang or pause for a significant large amount of time. I verified on the slaves that they are idle and not doing anything. The Jenkins master shows some CPU and disk I/O load. This issue seems to get worse with an increased slave count.
          ||Duration||Branches||Slave VMs||
          |14m|100|100|
          |3m|100|16|
          |1m 13s|100|4|
          |29m|1000|16|
          |34m|1000|4|

          I would expect to see that parallel pipeline jobs run faster with more slaves not slower. I'm attaching thread dumps that I took during the pause.

           

           
          {panel:title=log files}
          In the debug log I see this repeat slowly

          Oct 30, 2017 2:05:30 AM FINEST org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep

          -JENKINS-34021-: DurableTaskStep.Execution.listener present in CpsStepContext[471:sh]:Owner[JOBNAME #83|#83]
          {panel}
           
          {panel}
          In the console log I see the following. Note a 2 minute pause occurs after the first line when watching the console log in real time

          02:05:35 [subjob_149] Waiting for next available executor on JENKINS-SLAVE-LABEL
           02:05:30 [subjob_0] + date +%c
           02:05:39 [subjob_6] + date +%c
           02:05:49 [subjob_5] + date +%c
           02:05:58 [subjob_1] + date +%c
           02:06:07 [subjob_2] + date +%c
           02:06:16 [subjob_4] + date +%c
           02:06:25 [subjob_3] + date +%c
           02:06:34 [subjob_10] + date +%c
           02:06:44 [subjob_7] + date +%c
           02:06:53 [subjob_8] + date +%c
           02:07:02 [subjob_15] + date +%c
           02:07:11 [subjob_12] + date +%c
           02:07:20 [subjob_9] + date +%c
           02:07:29 [subjob_11] + date +%c
           02:07:39 [subjob_13] + date +%c
           02:07:48 [subjob_14] + date +%c
           ..
          {panel}
           

           
          h2. Notes:

          I ran the sample parallel pipeline code on the Jenkins master and with 100 branches and 100 executors, I couldn't reproduce the problem. I also ran the sample pipeline code with ' echo Hello ' instead of ' sh "date +%c" ' and also couldn't reproduce the problem. Adding additional sh commands to run on the slaves didn't add any significant time to the builds. I'm not sure if this issue is specific to EC2 plugin slaves. The slaves were started and fully running before any of the above tests were performed to minimize EC2 plugin involvement.

          Tickets that maybe related:
           ---JENKINS-45553---
           JENKINS-33761
           JENKINS-39489
           JENKINS-47170
          We are in the process of converting our freestyle matrix jobs over to parallel pipeline and we are seeing a few performance issues when using pipeline. We are able to reproduce the issue using sample pipeline code that is very similar to ----JENKINS-45553----. We are also using the EC2 plugin to start and stop slaves. In all of my testing below, the slaves were started outside of pipeline jobs to prevent slave startup time from skewing any results.
          h2. Parallel Pipeline code:
          {code:java}
          def stepsForParallel = [:]
          for (int i = 0; i < 150; i++) {
            def s = "subjob_${i}"
            stepsForParallel[s] = {
              node("JENKINS-SLAVE-NODE-LABEL") {
                sh '''
                date +%c
                '''
              }
            }
          }
          timestamps {
          parallel stepsForParallel
          }
          {code}
          h2. 1. Freestyle Matrix VS Parallel Pipeline

          Using 16 slave VMs:
           I have a freestyle matrix axis job that runs the date command on slaves across 355 slave nodes. This build took 28 seconds to run after slave startup and around 23 seconds thereafter.

          Using the sample parallel pipeline above with 355 branches, it takes 4 mins and 49 seconds to complete after the slaves are started and then around 2 minus 47 seconds thereafter.

          I'm unsure why the time discrepancy is so large when both jobs are performing the same work with the same output.
          h2. 2. The first parallel pipeline job on a slave is slower

          I'm noticing that after the 16 slaves are started, the first pipeline job that runs takes a lot longer then subsequent runs. See below.
           *First build after slaves started:* 4m 49s
           *Subsequent runs:* 2m 47s, 2m 44s, 2m 43s, 2m 32s
           After Jenkins slaves were stopped and then started without restarting Jenkins
           the parallel pipeline job took 6m 36s to complete.
           With extra debugging I did see this.
          {panel:title=pipeline performance after slave restart}
          _Slaves started:_
           4m 49s
           timings for Owner[JOBNAME/34:JOBNAME #34|#34]:
           Unknown macro: \{classLoad=79ms, flowNode=29209ms, parse=49ms, run=287301ms, saveProgram=9880ms}
           _Build Reran:_
           2m 44s
           timings for Owner[JOBNAME/35:JOBNAME #35|#35]:
           Unknown macro: \{classLoad=8ms, flowNode=34032ms, parse=21ms, run=162601ms, saveProgram=9192ms}
           _Slaves restarted:_
           5m 4s
           timings for Owner[JOBNAME/36:JOBNAME #36|#36]:
           Unknown macro: \{classLoad=108ms, flowNode=33247ms, parse=71ms, run=301825ms, saveProgram=8836ms}
          {panel}
          I tried first running a freestyle job and then a parallel pipeline job after slaves were restarted but it didn't help. I'm unsure why there is a performance penalty for being the first pipeline job to run on a slave after it is started.
          h2. 3. Parallel pipeline job appears to pause when starting on slaves

          This is a major issue for us. When running a parallel pipeline job, I can see the slaves on the left nav all show part of the pipeline job and an increasing queue. This process appears to hang or pause for a significant large amount of time. I verified on the slaves that they are idle and not doing anything. The Jenkins master shows some CPU and disk I/O load. This issue seems to get worse with an increased slave count.
          ||Duration||Branches||Slave VMs||
          |14m|100|100|
          |3m|100|16|
          |1m 13s|100|4|
          |29m|1000|16|
          |34m|1000|4|

          I would expect to see that parallel pipeline jobs run faster with more slaves not slower. I'm attaching thread dumps that I took during the pause.

           

           
          {panel:title=log files}
          In the debug log I see this repeat slowly

          Oct 30, 2017 2:05:30 AM FINEST org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep

          --JENKINS-34021--: DurableTaskStep.Execution.listener present in CpsStepContext[471:sh]:Owner[JOBNAME #83|#83]
          {panel}
           
          {panel}
          In the console log I see the following. Note a 2 minute pause occurs after the first line when watching the console log in real time

          02:05:35 [subjob_149] Waiting for next available executor on JENKINS-SLAVE-LABEL
           02:05:30 [subjob_0] + date +%c
           02:05:39 [subjob_6] + date +%c
           02:05:49 [subjob_5] + date +%c
           02:05:58 [subjob_1] + date +%c
           02:06:07 [subjob_2] + date +%c
           02:06:16 [subjob_4] + date +%c
           02:06:25 [subjob_3] + date +%c
           02:06:34 [subjob_10] + date +%c
           02:06:44 [subjob_7] + date +%c
           02:06:53 [subjob_8] + date +%c
           02:07:02 [subjob_15] + date +%c
           02:07:11 [subjob_12] + date +%c
           02:07:20 [subjob_9] + date +%c
           02:07:29 [subjob_11] + date +%c
           02:07:39 [subjob_13] + date +%c
           02:07:48 [subjob_14] + date +%c
           ..
          {panel}
           

           
          h2. Notes:

          I ran the sample parallel pipeline code on the Jenkins master and with 100 branches and 100 executors, I couldn't reproduce the problem. I also ran the sample pipeline code with ' echo Hello ' instead of ' sh "date +%c" ' and also couldn't reproduce the problem. Adding additional sh commands to run on the slaves didn't add any significant time to the builds. I'm not sure if this issue is specific to EC2 plugin slaves. The slaves were started and fully running before any of the above tests were performed to minimize EC2 plugin involvement.

          Tickets that maybe related:
           ----JENKINS-45553----
           JENKINS-33761
           JENKINS-39489
           JENKINS-47170
          mkozell Mike Kozell made changes -
          Description We are in the process of converting our freestyle matrix jobs over to parallel pipeline and we are seeing a few performance issues when using pipeline. We are able to reproduce the issue using sample pipeline code that is very similar to ----JENKINS-45553----. We are also using the EC2 plugin to start and stop slaves. In all of my testing below, the slaves were started outside of pipeline jobs to prevent slave startup time from skewing any results.
          h2. Parallel Pipeline code:
          {code:java}
          def stepsForParallel = [:]
          for (int i = 0; i < 150; i++) {
            def s = "subjob_${i}"
            stepsForParallel[s] = {
              node("JENKINS-SLAVE-NODE-LABEL") {
                sh '''
                date +%c
                '''
              }
            }
          }
          timestamps {
          parallel stepsForParallel
          }
          {code}
          h2. 1. Freestyle Matrix VS Parallel Pipeline

          Using 16 slave VMs:
           I have a freestyle matrix axis job that runs the date command on slaves across 355 slave nodes. This build took 28 seconds to run after slave startup and around 23 seconds thereafter.

          Using the sample parallel pipeline above with 355 branches, it takes 4 mins and 49 seconds to complete after the slaves are started and then around 2 minus 47 seconds thereafter.

          I'm unsure why the time discrepancy is so large when both jobs are performing the same work with the same output.
          h2. 2. The first parallel pipeline job on a slave is slower

          I'm noticing that after the 16 slaves are started, the first pipeline job that runs takes a lot longer then subsequent runs. See below.
           *First build after slaves started:* 4m 49s
           *Subsequent runs:* 2m 47s, 2m 44s, 2m 43s, 2m 32s
           After Jenkins slaves were stopped and then started without restarting Jenkins
           the parallel pipeline job took 6m 36s to complete.
           With extra debugging I did see this.
          {panel:title=pipeline performance after slave restart}
          _Slaves started:_
           4m 49s
           timings for Owner[JOBNAME/34:JOBNAME #34|#34]:
           Unknown macro: \{classLoad=79ms, flowNode=29209ms, parse=49ms, run=287301ms, saveProgram=9880ms}
           _Build Reran:_
           2m 44s
           timings for Owner[JOBNAME/35:JOBNAME #35|#35]:
           Unknown macro: \{classLoad=8ms, flowNode=34032ms, parse=21ms, run=162601ms, saveProgram=9192ms}
           _Slaves restarted:_
           5m 4s
           timings for Owner[JOBNAME/36:JOBNAME #36|#36]:
           Unknown macro: \{classLoad=108ms, flowNode=33247ms, parse=71ms, run=301825ms, saveProgram=8836ms}
          {panel}
          I tried first running a freestyle job and then a parallel pipeline job after slaves were restarted but it didn't help. I'm unsure why there is a performance penalty for being the first pipeline job to run on a slave after it is started.
          h2. 3. Parallel pipeline job appears to pause when starting on slaves

          This is a major issue for us. When running a parallel pipeline job, I can see the slaves on the left nav all show part of the pipeline job and an increasing queue. This process appears to hang or pause for a significant large amount of time. I verified on the slaves that they are idle and not doing anything. The Jenkins master shows some CPU and disk I/O load. This issue seems to get worse with an increased slave count.
          ||Duration||Branches||Slave VMs||
          |14m|100|100|
          |3m|100|16|
          |1m 13s|100|4|
          |29m|1000|16|
          |34m|1000|4|

          I would expect to see that parallel pipeline jobs run faster with more slaves not slower. I'm attaching thread dumps that I took during the pause.

           

           
          {panel:title=log files}
          In the debug log I see this repeat slowly

          Oct 30, 2017 2:05:30 AM FINEST org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep

          --JENKINS-34021--: DurableTaskStep.Execution.listener present in CpsStepContext[471:sh]:Owner[JOBNAME #83|#83]
          {panel}
           
          {panel}
          In the console log I see the following. Note a 2 minute pause occurs after the first line when watching the console log in real time

          02:05:35 [subjob_149] Waiting for next available executor on JENKINS-SLAVE-LABEL
           02:05:30 [subjob_0] + date +%c
           02:05:39 [subjob_6] + date +%c
           02:05:49 [subjob_5] + date +%c
           02:05:58 [subjob_1] + date +%c
           02:06:07 [subjob_2] + date +%c
           02:06:16 [subjob_4] + date +%c
           02:06:25 [subjob_3] + date +%c
           02:06:34 [subjob_10] + date +%c
           02:06:44 [subjob_7] + date +%c
           02:06:53 [subjob_8] + date +%c
           02:07:02 [subjob_15] + date +%c
           02:07:11 [subjob_12] + date +%c
           02:07:20 [subjob_9] + date +%c
           02:07:29 [subjob_11] + date +%c
           02:07:39 [subjob_13] + date +%c
           02:07:48 [subjob_14] + date +%c
           ..
          {panel}
           

           
          h2. Notes:

          I ran the sample parallel pipeline code on the Jenkins master and with 100 branches and 100 executors, I couldn't reproduce the problem. I also ran the sample pipeline code with ' echo Hello ' instead of ' sh "date +%c" ' and also couldn't reproduce the problem. Adding additional sh commands to run on the slaves didn't add any significant time to the builds. I'm not sure if this issue is specific to EC2 plugin slaves. The slaves were started and fully running before any of the above tests were performed to minimize EC2 plugin involvement.

          Tickets that maybe related:
           ----JENKINS-45553----
           JENKINS-33761
           JENKINS-39489
           JENKINS-47170
          We are in the process of converting our freestyle matrix jobs over to parallel pipeline and we are seeing a few performance issues when using pipeline. We are able to reproduce the issue using sample pipeline code that is very similar to -----JENKINS-45553-----. We are also using the EC2 plugin to start and stop slaves. In all of my testing below, the slaves were started outside of pipeline jobs to prevent slave startup time from skewing any results.
          h2. Parallel Pipeline code:
          {code:java}
          def stepsForParallel = [:]
          for (int i = 0; i < 150; i++) {
            def s = "subjob_${i}"
            stepsForParallel[s] = {
              node("JENKINS-SLAVE-NODE-LABEL") {
                sh '''
                date +%c
                '''
              }
            }
          }
          timestamps {
          parallel stepsForParallel
          }
          {code}
          h2. 1. Freestyle Matrix VS Parallel Pipeline

          Using 16 slave VMs:
           I have a freestyle matrix axis job that runs the date command on slaves across 355 slave nodes. This build took 28 seconds to run after slave startup and around 23 seconds thereafter.

          Using the sample parallel pipeline above with 355 branches, it takes 4 mins and 49 seconds to complete after the slaves are started and then around 2 minus 47 seconds thereafter.

          I'm unsure why the time discrepancy is so large when both jobs are performing the same work with the same output.
          h2. 2. The first parallel pipeline job on a slave is slower

          I'm noticing that after the 16 slaves are started, the first pipeline job that runs takes a lot longer than subsequent runs. See below.
           *First build after slaves started:* 4m 49s
           *Subsequent runs:* 2m 47s, 2m 44s, 2m 43s, 2m 32s
           After Jenkins slaves were stopped and then started without restarting Jenkins
           the parallel pipeline job took 6m 36s to complete.
           With extra debugging I did see this.
          {panel:title=pipeline performance after slave restart}
          _Slaves started:_
           4m 49s
           timings for Owner[JOBNAME/34:JOBNAME #34|#34]:
           Unknown macro: \{classLoad=79ms, flowNode=29209ms, parse=49ms, run=287301ms, saveProgram=9880ms}
           _Build Reran:_
           2m 44s
           timings for Owner[JOBNAME/35:JOBNAME #35|#35]:
           Unknown macro: \{classLoad=8ms, flowNode=34032ms, parse=21ms, run=162601ms, saveProgram=9192ms}
           _Slaves restarted:_
           5m 4s
           timings for Owner[JOBNAME/36:JOBNAME #36|#36]:
           Unknown macro: \{classLoad=108ms, flowNode=33247ms, parse=71ms, run=301825ms, saveProgram=8836ms}
          {panel}
          I tried first running a freestyle job and then a parallel pipeline job after slaves were restarted but it didn't help. I'm unsure why there is a performance penalty for being the first pipeline job to run on a slave after it is started.
          h2. 3. Parallel pipeline job appears to pause when starting on slaves

          This is a major issue for us. When running a parallel pipeline job, I can see the slaves on the left nav all show part of the pipeline job and an increasing queue. This process appears to hang or pause for a significant large amount of time. I verified on the slaves that they are idle and not doing anything. The Jenkins master shows some CPU and disk I/O load. This issue seems to get worse with an increased slave count.
          ||Duration||Branches||Slave VMs||
          |14m|100|100|
          |3m|100|16|
          |1m 13s|100|4|
          |29m|1000|16|
          |34m|1000|4|

          I would expect to see that parallel pipeline jobs run faster with more slaves not slower. I'm attaching thread dumps that I took during the pause.

           

           
          {panel:title=log files}
          In the debug log I see this repeat slowly

          Oct 30, 2017 2:05:30 AM FINEST org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep

          ---JENKINS-34021---: DurableTaskStep.Execution.listener present in CpsStepContext[471:sh]:Owner[JOBNAME #83|#83]
          {panel}
           
          {panel}
          In the console log I see the following. Note a 2 minute pause occurs after the first line when watching the console log in real time

          02:05:35 [subjob_149] Waiting for next available executor on JENKINS-SLAVE-LABEL
           02:05:30 [subjob_0] + date +%c
           02:05:39 [subjob_6] + date +%c
           02:05:49 [subjob_5] + date +%c
           02:05:58 [subjob_1] + date +%c
           02:06:07 [subjob_2] + date +%c
           02:06:16 [subjob_4] + date +%c
           02:06:25 [subjob_3] + date +%c
           02:06:34 [subjob_10] + date +%c
           02:06:44 [subjob_7] + date +%c
           02:06:53 [subjob_8] + date +%c
           02:07:02 [subjob_15] + date +%c
           02:07:11 [subjob_12] + date +%c
           02:07:20 [subjob_9] + date +%c
           02:07:29 [subjob_11] + date +%c
           02:07:39 [subjob_13] + date +%c
           02:07:48 [subjob_14] + date +%c
           ..
          {panel}
           

           
          With extra debugging enabled, I saw the following:
          {panel:title=pipeline performance after slave restart}
          _Slaves started:_
           4m 49s
           timings for Owner[JOBNAME/34:JOBNAME #34]:
           \{classLoad=79ms, flowNode=29209ms, parse=49ms, run=287301ms, saveProgram=9880ms}
          _Build reran:_
           2m 44s
           timings for Owner[JOBNAME/35:JOBNAME #35]:
           \{classLoad=8ms, flowNode=34032ms, parse=21ms, run=162601ms, saveProgram=9192ms}
          _Slaves restarted:_
           5m 4s
           timings for Owner[JOBNAME/36:JOBNAME #36]:
           \{classLoad=108ms, flowNode=33247ms, parse=71ms, run=301825ms, saveProgram=8836ms}
          {panel}
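          The {{timings for Owner[...]}} lines above look like {{java.util.logging}} output from the workflow-cps plugin. As a sketch only, the logger name below is my assumption (it may differ by plugin version; verify against your installation), and FINE logging for it can be enabled from the script console or via a custom log recorder:
          {code:groovy}
          // Script console sketch: raise the (assumed) Pipeline timing logger to FINE.
          // ASSUMPTION: the logger name matches the workflow-cps version in use.
          import java.util.logging.Level
          import java.util.logging.Logger

          Logger.getLogger('org.jenkinsci.plugins.workflow.cps.CpsFlowExecution.timing')
                .setLevel(Level.FINE)
          {code}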
          I tried running a freestyle job first and then a parallel pipeline job after the slaves were restarted, but it didn't help. I'm unsure why there is a performance penalty for being the first pipeline job to run on a slave after it starts.
          h2. 3. Parallel pipeline job appears to pause when starting on slaves

          This is a major issue for us. When running a parallel pipeline job, the left-hand navigation shows each slave executing part of the pipeline job, along with a growing queue. The job then appears to hang or pause for a significantly long time. I verified on the slaves that they are idle and not doing anything, and the Jenkins master shows little CPU and disk I/O load. This issue seems to get worse as the slave count increases.
          ||Duration||Branches||Slave VMs||
          |14m|100|100|
          |3m|100|16|
          |1m 13s|100|4|
          |29m|1000|16|
          |34m|1000|4|

          I would expect parallel pipeline jobs to run faster with more slaves, not slower. I'm attaching thread dumps that I took during the pause.
          {panel:title=log files during pause period}
          In the debug log, I see the following repeated slowly:

          Oct 30, 2017 2:05:30 AM FINEST org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep
          JENKINS-34021: DurableTaskStep.Execution.listener present in CpsStepContext[471:sh]:Owner[JOBNAME #83 |#83]

          In the console log, I see the following. Note that, watching the console log in real time, a 2-minute pause occurs after the first line:
           02:05:35 [subjob_149] Waiting for next available executor on JENKINS-SLAVE-LABEL
           02:05:30 [subjob_0] + date +%c
           02:05:39 [subjob_6] + date +%c
           02:05:49 [subjob_5] + date +%c
           02:05:58 [subjob_1] + date +%c
           02:06:07 [subjob_2] + date +%c
           02:06:16 [subjob_4] + date +%c
           02:06:25 [subjob_3] + date +%c
           02:06:34 [subjob_10] + date +%c
           02:06:44 [subjob_7] + date +%c
           02:06:53 [subjob_8] + date +%c
           02:07:02 [subjob_15] + date +%c
           02:07:11 [subjob_12] + date +%c
           02:07:20 [subjob_9] + date +%c
           02:07:29 [subjob_11] + date +%c
           02:07:39 [subjob_13] + date +%c
           02:07:48 [subjob_14] + date +%c
           ..
          {panel}
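To narrow down where the pause is spent, one diagnostic variant (a sketch I'm proposing, not part of the original reproduction) is to time each {{sh}} step from inside its branch, so the per-branch overhead shows up directly instead of having to be read off the interleaved timestamps:
{code:java}
// Diagnostic sketch: same fan-out as the reproduction pipeline, but each
// branch records the wall-clock duration of its sh step. A large gap between
// this number and the near-zero time `date` itself needs would point at
// durable-task start-up/polling overhead rather than the workload.
def stepsForParallel = [:]
for (int i = 0; i < 150; i++) {
  def s = "subjob_${i}"
  stepsForParallel[s] = {
    node("JENKINS-SLAVE-NODE-LABEL") {
      long start = System.currentTimeMillis()
      sh 'date +%c'
      echo "${s}: sh step took ${System.currentTimeMillis() - start} ms"
    }
  }
}
timestamps {
  parallel stepsForParallel
}
{code}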
          h2. Notes:

          I ran the sample parallel pipeline code on the Jenkins master itself with 100 branches and 100 executors and couldn't reproduce the problem. I also ran the sample pipeline code with 'echo Hello' instead of 'sh "date +%c"' and again couldn't reproduce it. Adding additional sh commands to run on the slaves didn't add any significant time to the builds. I'm not sure whether this issue is specific to EC2 plugin slaves. The slaves were started and fully running before any of the above tests were performed, to minimize EC2 plugin involvement.
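For completeness, the {{echo Hello}} control variant mentioned above looks like this (a sketch of the substitution; {{echo}} is a plain Pipeline step executed in the CPS VM on the master, so it exercises the parallel machinery without any durable task on the slave):
{code:java}
// Control variant: identical fan-out, but echo instead of sh. Because echo
// involves no durable task or per-step process on the agent, it isolates
// the parallel/CPS overhead from the sh-step overhead.
def stepsForParallel = [:]
for (int i = 0; i < 150; i++) {
  def s = "subjob_${i}"
  stepsForParallel[s] = {
    node("JENKINS-SLAVE-NODE-LABEL") {
      echo 'Hello'
    }
  }
}
timestamps {
  parallel stepsForParallel
}
{code}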

          Tickets that may be related:
           JENKINS-45553
           JENKINS-33761
           JENKINS-39489
           JENKINS-47170
          svanoort Sam Van Oort added a comment -

          Jesse Glick: Looks like a bottleneck in the Durable Task execution to me. Something with the async model and periodic checks, perhaps not being set up to efficiently execute tasks that complete immediately?

          Probably easy to reproduce with the Scalability lab environment (though I'll need to give the executors longer-running work probably).

          h2. 4. High CPU on Jenkins Masters

          I'm seeing higher CPU usage on our Jenkins masters that run parallel pipeline code than on those running freestyle jobs. In addition, the more parallel branches that are run, the higher the load. I tried to quantify the differences in the table below, using a freestyle matrix job and a parallel pipeline job that perform roughly the same work on Jenkins master instances sized at 2 CPUs / 8 GB RAM.

          ||Type||Number of branches||Number of slave agents||Peak 1m load avg||Peak 5m load avg||15m load avg when job finished||
          |Freestyle|259|55|2.0|0.53|0.08|
          |Pipeline|259|60|3.19|1.7|0.85|
          |Pipeline|259|120|4.38|1.7|0.89|
          |Pipeline|259|190|4.34|2.03|1.06|
          |Pipeline|355|190|6.26|2.12|0.98|

          In summary, when using parallel pipeline I see a higher load average on the Jenkins master as the number of branches and slave agents increases, even though all of the test work runs on the slave agents. Compared to parallel pipeline, freestyle matrix jobs put only a fraction of the load on the Jenkins master after the initial peak.
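The load figures above were read off manually; a rough way to capture comparable peaks automatically is sketched below (assumes a Linux master with /proc/loadavg; {{durationSec}} is an illustrative parameter, not from the original measurements):
{code:java}
// Measurement sketch: sample /proc/loadavg every 10 seconds for roughly one
// build's duration and report the peak 1m and 5m load averages seen.
int durationSec = 300          // illustrative value, tune to the build length
double peak1m = 0.0d
double peak5m = 0.0d
for (int t = 0; t < durationSec; t += 10) {
    // /proc/loadavg fields: 1m, 5m, 15m load averages, then scheduler info
    def fields = new File('/proc/loadavg').text.tokenize()
    peak1m = Math.max(peak1m, fields[0] as double)
    peak5m = Math.max(peak5m, fields[1] as double)
    Thread.sleep(10 * 1000L)
}
println "peak 1m load avg: ${peak1m}, peak 5m load avg: ${peak5m}"
{code}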

           
          mkozell Mike Kozell made changes -
          Description We are in the process of converting our freestyle matrix jobs over to parallel pipeline and we are seeing a few performance issues when using pipeline. We are able to reproduce the issue using sample pipeline code that is very similar to -JENKINS-45553-. We are also using the EC2 plugin to start and stop slaves. In all of my testing below, the slaves were started outside of pipeline jobs to prevent slave startup time from skewing any results.
          h2. Parallel Pipeline code:
          {code:java}
          def stepsForParallel = [:]
          for (int i = 0; i < 150; i++) {
            def s = "subjob_${i}"
            stepsForParallel[s] = {
              node("JENKINS-SLAVE-NODE-LABEL") {
                sh '''
                date +%c
                '''
              }
            }
          }
          timestamps {
          parallel stepsForParallel
          }
          {code}
          h2. 1. Freestyle Matrix VS Parallel Pipeline

          Using 16 slave VMs:
           I have a freestyle matrix axis job that runs the date command on slaves across 355 slave nodes. This build took 28 seconds to run after slave startup and around 23 seconds thereafter.

          Using the sample parallel pipeline above with 355 branches, it takes 4 mins and 49 seconds to complete after the slaves are started and then around 2 minus 47 seconds thereafter.

          I'm unsure why the time discrepancy is so large when both jobs are performing the same work with the same output.
          h2. 2. The first parallel pipeline job on a slave is slower

          I'm noticing that after the 16 slaves are started, the first pipeline job that runs takes a lot longer than subsequent runs. See below.
           *First build after slaves started:* 4m 49s
           *Subsequent runs:* 2m 47s, 2m 44s, 2m 43s, 2m 32s
           After Jenkins slaves were stopped and then started without restarting Jenkins
           the parallel pipeline job took 6m 36s to complete.
           With extra debugging I did see this.
          {panel:title=pipeline performance after slave restart}
          _Slaves started:_
           4m 49s
           timings for Owner[JOBNAME/34:JOBNAME #34|#34]:
           Unknown macro: \{classLoad=79ms, flowNode=29209ms, parse=49ms, run=287301ms, saveProgram=9880ms}
           _Build Reran:_
           2m 44s
           timings for Owner[JOBNAME/35:JOBNAME #35|#35]:
           Unknown macro: \{classLoad=8ms, flowNode=34032ms, parse=21ms, run=162601ms, saveProgram=9192ms}
           _Slaves restarted:_
           5m 4s
           timings for Owner[JOBNAME/36:JOBNAME #36|#36]:
           Unknown macro: \{classLoad=108ms, flowNode=33247ms, parse=71ms, run=301825ms, saveProgram=8836ms}
          {panel}
          I tried first running a freestyle job and then a parallel pipeline job after slaves were restarted but it didn't help. I'm unsure why there is a performance penalty for being the first pipeline job to run on a slave after it is started.
          h2. 3. Parallel pipeline job appears to pause when starting on slaves

          This is a major issue for us. When running a parallel pipeline job, I can see the slaves on the left nav all show part of the pipeline job and an increasing queue. This process appears to hang or pause for a significant large amount of time. I verified on the slaves that they are idle and not doing anything. The Jenkins master shows little CPU and disk I/O load. This issue seems to get worse with an increased slave count.
          ||Duration||Branches||Slave VMs||
          |14m|100|100|
          |3m|100|16|
          |1m 13s|100|4|
          |29m|1000|16|
          |34m|1000|4|

          I would expect to see that parallel pipeline jobs run faster with more slaves not slower. I'm attaching thread dumps that I took during the pause.

           

           
          {panel:title=log files during pause period}
          In the debug log I see the following repeat slowly:

          Oct 30, 2017 2:05:30 AM FINEST org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep

          JENKINS-34021: DurableTaskStep.Execution.listener present in CpsStepContext[471:sh]:Owner[JOBNAME #83|#83]

          In the console log I see the following. Note that a 2-minute pause occurs after the first line when watching the console log in real time:

          02:05:35 [subjob_149] Waiting for next available executor on JENKINS-SLAVE-LABEL
           02:05:30 [subjob_0] + date +%c
           02:05:39 [subjob_6] + date +%c
           02:05:49 [subjob_5] + date +%c
           02:05:58 [subjob_1] + date +%c
           02:06:07 [subjob_2] + date +%c
           02:06:16 [subjob_4] + date +%c
           02:06:25 [subjob_3] + date +%c
           02:06:34 [subjob_10] + date +%c
           02:06:44 [subjob_7] + date +%c
           02:06:53 [subjob_8] + date +%c
           02:07:02 [subjob_15] + date +%c
           02:07:11 [subjob_12] + date +%c
           02:07:20 [subjob_9] + date +%c
           02:07:29 [subjob_11] + date +%c
           02:07:39 [subjob_13] + date +%c
           02:07:48 [subjob_14] + date +%c
           ..
          {panel}
           

           
          h2. Notes:

          I ran the sample parallel pipeline code on the Jenkins master with 100 branches and 100 executors and couldn't reproduce the problem. I also ran the sample pipeline code with {{echo 'Hello'}} instead of {{sh 'date +%c'}} and couldn't reproduce the problem either. Adding additional sh commands to run on the slaves didn't add any significant time to the builds. I'm not sure whether this issue is specific to EC2 plugin slaves. The slaves were started and fully running before any of the above tests were performed to minimize EC2 plugin involvement.

          Tickets that may be related:
           -JENKINS-45553-
           JENKINS-33761
           JENKINS-39489
           JENKINS-47170

           
          h2. 4. High CPU on Jenkins Masters

          I'm seeing higher CPU usage on the Jenkins masters that run parallel pipeline code compared to freestyle.  In addition, the more parallel branches that are run, the higher the load.  I tried to quantify the differences in the table below using a freestyle matrix job and a parallel pipeline job that perform close to the same code on Jenkins master instances sized at 2 CPU / 8 GB RAM.

           
          ||Type||number of branches||number of slave agents||peak 1m load avg||peak 5m load avg||15m load avg when job finished||
          |Freestyle|259|55|2.0|.53|.08|
          |Pipeline|259|60|3.19|1.7|.85|
          |Pipeline|259|120|4.38|1.7|.89|
          |Pipeline|259|190|4.34|2.03|1.06|
          |Pipeline|355|190|6.26|2.12|.98|

          In summary, when using parallel pipeline I'm seeing higher load averages on the Jenkins masters as the number of branches and slave agents increases, even though all of the test work occurs on the Jenkins slave agents.  After the initial peak, matrix freestyle jobs produce a fraction of the Jenkins master load that parallel pipeline does.

           
          UPDATE: Running two parallel pipeline jobs with a total of 110 slave agents pushed the Jenkins master instance to a peak 1m load avg of 10.77 and peak 5m load avg to 7.18.

           
          svanoort Sam Van Oort added a comment -

          Mike Kozell  I will attempt to reproduce this (using an internal framework I've put together for scaling tests), but my guess is a couple things are involved:

          1. Pipeline is optimized for cases where users have a set of Shell/Batch steps that do long-running work, and a limited set of parallel branches.  It also provides some recoverability guarantees that FreeStyle builds don't, i.e. you should be able to restart the master much of the time without the build failing – but that does come with some costs.
          2. Parallel in pipeline isn't "really" parallel for everything because it is internally using a GreenThreads system – only async steps can operate fully in parallel (generally things like shell/batch steps).  The assumption here is that the majority of the time-consuming work should be occurring in those steps – this assumption is due to the need to be able to persist state in a predictable way for resume.
          3. There are a limited number of threads allocated for interacting with build agents in pipeline, and with very large numbers of active agents this may be a bottleneck (we've seen past issues also if agents disconnect in specific ways,  being addressed by https://issues.jenkins-ci.org/browse/JENKINS-44785).   
          4. There is a set of updates coming in the next few months that should reduce Pipeline overheads (some smaller ones have already landed).   Much of this is captured in the https://issues.jenkins-ci.org/browse/JENKINS-47170 epic I'm working on.  This will offer options to realize large improvements in I/O performance and should cut a lot of the bottlenecks to make pipeline more competitive with FreeStyle builds even in atypical cases such as this.
          5. We know that Pipeline isn't well-optimized for very large branch counts – it's been improved a lot in recent releases, but clearly there's some ongoing work to be done here (investigation pending here). 

          For the moment, some concrete advice:

          • Your best performance will probably be realized by using a limited number of branches/agents running larger chunks of work.  This minimizes the impact of pipeline overheads. Using the Parallel Test Executor plugin will make this more efficient – https://wiki.jenkins.io/display/JENKINS/Parallel+Test+Executor+Plugin 
          • Run a few LONG shell steps, not MANY short ones – i.e. if you're running shell steps that complete in <15s the overheads may become significant.
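As an editor's illustration of the advice above (not code from this ticket), the 150 short tasks from the description could be batched into a handful of long-running branches; the {{BATCHES}} value and the {{worker}} label are hypothetical placeholders:

```groovy
// Sketch: instead of 150 branches that each run one short sh step,
// split the work across a few branches that each run one long sh step.
def BATCHES = 8          // hypothetical branch count; tune to executor count
def TOTAL_TASKS = 150
def stepsForParallel = [:]
for (int b = 0; b < BATCHES; b++) {
  def batch = b          // capture the loop variable for the closure
  stepsForParallel["batch_${batch}"] = {
    node('worker') {     // hypothetical label
      // One long-running shell step per branch; each branch handles
      // tasks batch, batch+BATCHES, batch+2*BATCHES, ...
      sh """
        for i in \$(seq ${batch} ${BATCHES} \$((${TOTAL_TASKS} - 1))); do
          date +%c   # stand-in for the real per-task work
        done
      """
    }
  }
}
parallel stepsForParallel
```

This keeps each durable-task (sh) step alive long enough that the per-step overhead is amortized, which is the trade-off the comment above describes.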
          jimilian Alexander A added a comment -

          Five small comments:

          • Because of the green-thread nature of Pipeline jobs, every full-GC pause comes with a HUGE cost. Sam Van Oort even wrote an article on how to mitigate this problem a little: https://www.cloudbees.com/blog/joining-big-leagues-tuning-jenkins-gc-responsiveness-and-stability. You definitely need to measure how much time the JVM spends in GC pauses. Based on my experience, this is the real number-one bottleneck for a heavily loaded Pipeline Jenkins.
          • If you are using 2.86, you are probably using ec2 with the Java ssh implementation instead of the native one - that can impact performance as well.
          • https://issues.jenkins-ci.org/browse/JENKINS-48348 should help you as well, because the more nodes you have, the more `md5`s you need to calculate.
          • The sandbox comes with a lock, so if you do not really need the sandbox, just disable it. Then every line of your script is no longer checked against the known trusted signatures.
          • Disable the UI for this kind of test; it will help you get more reproducible results.

          In our case, the pipeline job's memory footprint comes from compilation (10%), class loading (25%), actual execution (5%), and the toString used to show the step currently executing (25%).
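One way to act on the GC-measurement advice above is to enable GC logging on the master JVM. This is an editor's sketch for a Java 8 HotSpot JVM; the variable name and log path are assumptions that vary by install (e.g. /etc/default/jenkins vs. a systemd unit):

```shell
# Hypothetical Jenkins startup options (Java 8 flags); adjust the log path
# for your install. The resulting log records every GC pause so you can
# total the time the JVM spends stopped.
JENKINS_JAVA_OPTIONS="-XX:+PrintGCDetails -XX:+PrintGCDateStamps \
  -XX:+PrintGCApplicationStoppedTime \
  -Xloggc:/var/log/jenkins/gc.log \
  -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=5 -XX:GCLogFileSize=20m"
echo "$JENKINS_JAVA_OPTIONS"
```

On Java 9+ the unified `-Xlog:gc*` flag replaces most of these options.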

          jglick Jesse Glick added a comment -

          Looks like a bottleneck in the Durable Task execution to me - something with the async model and periodic checks.

          Note that the communication in this model is inverted in JENKINS-38381, for better or worse.

          svanoort Sam Van Oort made changes -
          Priority: Critical [ 2 ] → Major [ 3 ]
          svanoort Sam Van Oort added a comment -

          Downgrading the severity a bit even though I badly want to tackle this, because it seems to only apply to the rarer case of a job using truly massive numbers of agents.  Also, we're about to release the durability work (which should help), and have a couple of other things coming up shortly that should mitigate this as well.

          svanoort Sam Van Oort added a comment -

          I think we may be able to consider this one resolved after the latest round of plugin updates. 

          Please apply the latest updates and let me know if you still see the issue, Mike Kozell, Florian Meser, Alexander A:

          • First, please grab the absolute LATEST versions of the pipeline plugins and structs plugin (will need LTS 2.73.x+) and take a look at the new Project Cheetah features for scaling – docs here: https://jenkins.io/doc/book/pipeline/scaling-pipeline/  – the performance-optimized durability mode can improve performance significantly by removing gratuitous persistence calls. 
          • The Structs Plugin update in v1.13 is verified to significantly reduce classloading and per-step overheads.  I don't have an end-to-end before-and-after benchmark yet (pending), but smaller benchmarks show it greatly reduces the overheads associated with steps and thus with parallels.
          • The Durable Task plugin got a fairly recent update - it may be helping here.
          • Alexander A the performance bottleneck with sandbox (Script Security) will be addressed shortly by https://github.com/jenkinsci/script-security-plugin/pull/180 (in progress) – it's especially an issue with many concurrent Pipelines.

          I've added a similar testcase to my internal Pipeline end-to-end benchmarking tool: 127 parallel branches, with each obtaining a node and workspace then running 3 trivial shell steps.  With I/O throttling to ~200 IOPS, using 1 local Swarm build agent and 4-16 executors, it'll complete in ~40-49 seconds (depending on Durability Setting somewhat) using the latest core and plugins, or ~16/30 seconds with 128 executors in performance-optimized/maximum durability mode.

          With 256 branches, 128 executors, and maximum durability mode, I see a runtime of 1 minute 11 seconds - again, not quite linear scaling, because there's some overhead in Pipeline and in managing so many parallel connections/agents.  I think that's pretty okay though, considering this is all running in a Dockerized local environment with only 4 physical CPU cores to use.

           

          mkozell Mike Kozell added a comment -

          Sam Van Oort,

          I apologize for taking so long to get back to you.  I upgraded Jenkins and all of the plugins and retested the items in this ticket.

          Jenkins 2.89.4
          Pipeline 2.5
          Pipeline API 2.26
          Pipeline Nodes and Processes 2.19
          Pipeline Step API 2.14
          Script Security 1.41
          durabilityHint=PERFORMANCE_OPTIMIZED
          org.jenkinsci.plugins.workflow.job.properties.DisableResumeJobProperty
          Groovy Sandbox = disabled
          Java = 1.8.0_162

          3. Parallel pipeline job appears to pause when starting agents

          This issue still occurs. I started 100 agents and used the sample pipeline code in the description of this ticket (except I used 100 branches instead of 150) and the build took 14 minutes to complete. This is consistent with what I saw last year.

          2. The first parallel pipeline job on an agent is slower

          This also still occurs on the newest versions of the plugins. As I stated above, it took 14 minutes for the pipeline job to complete after starting 100 Jenkins agents. I then ran the exact same build immediately afterwards and it took 26 seconds. I'm unsure why there is a 14-minute pause on the first build run on our Jenkins agents after they are started. This doesn't occur with freestyle, and the sample code just does {{sh 'date +%c'}}.

          4. High CPU on Jenkins Masters

          This issue appears to have been resolved. In previous tests, our Jenkins masters reached a load average of 10 when they were busy. After performing the upgrade and setting durability to performance optimized, I see the load average hover around 1.

          On a different Jenkins installation that uses a lot of master executors but no agents, we also saw the load average hit 10 on Jenkins 2.60.3. That installation was recently upgraded and the load average there also remains under 1 most of the time. Java memory usage may have also decreased.

          Our build durations remain about the same between Jenkins versions, with 2.89.4 maybe being slightly faster. Given the pause that occurs on the first build on the agents, throwing more agents at our builds doesn't yield faster build times. Please let me know if you have any suggestions.

          wreggyl Jan Bottek added a comment -

          Hi,

          we are running into the same limitation: parallel slows down as the number of parallel steps grows. We tried all of the solutions from above, but none of them helped - as Mike Kozell also mentioned.

          The recommended solution of using the Parallel Test Executor Plugin is not working for us, because it requires knowing how many executors are available when creating the parallel testGroups. Normally I would provide a NODE label and any node that becomes available would grab some work from the queue - that no longer works with the Parallel Test Executor Plugin - or am I wrong?

          Thanks in advance.

           

          vivek Vivek Pandey made changes -
          Issue Type: Bug [ 1 ] → Improvement [ 4 ]
          jglick Jesse Glick added a comment -

          I would check whether {{-Dorg.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep.USE_WATCHING=true}} (JENKINS-52165) has any effect in this context. But in general I do not think anyone is actively tracking down performance issues like this.
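For reference, an editor's sketch of how that system property would typically be passed to the master JVM; the variable name and the file it lives in (e.g. /etc/default/jenkins or a systemd unit) are assumptions that vary by install:

```shell
# Hypothetical: append the property to the Jenkins master's JVM options,
# then restart Jenkins for it to take effect.
JENKINS_JAVA_OPTIONS="$JENKINS_JAVA_OPTIONS \
  -Dorg.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep.USE_WATCHING=true"
echo "$JENKINS_JAVA_OPTIONS"
```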

          jknurek J Knurek added a comment -

          We're experiencing this issue of degrading performance as the number of parallel agents increases.

          I've modified the above example to fit our Kubernetes use case and to show the difference between the time taken to run a step and the time reported by Jenkins:

          {code:java}
          def stepsForParallel = [:]
          def parallelCount = 100
          for (int i = 0; i < parallelCount; i++) {
            def s = "subjob_${i}"
            stepsForParallel[s] = {
              node('worker') {
                sh '''#!/bin/bash
                  time date +%c
                '''
              }
            }
          }
          parallel stepsForParallel
          {code}

          As we increase the parallelCount from 10 to 30 to 100, the total time to execute the build increases from 20s to 50s to 2m 40s.

          • These times are approximate averages; the job was run multiple times to make sure the env was up, running, and warmed up.
          • The `node('worker')` label maps to an agent/pod template that maps 1:1 to k8s nodes (so the agents themselves don't have a performance impact on each other).
          • I've replaced the `sh` with `echo 'hello'`, which speeds up execution, but the same performance degradation is visible.
          • The real problems arise when running a full workload in the parallel steps, with multiple commands and long-running tests, to the point that running concurrent builds is not possible (in the example pipeline, concurrent builds do affect the time, but not so noticeably).

           

          The versions of plugins mentioned in this ticket:

          Jenkins 2.263.3
          Durable Task Plugin 1.35 
          Structs Plugin 1.21 

Also, using `-Dorg.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep.USE_WATCHING=true` had no performance benefit.

xieqigao xie added a comment - edited

Go to Manage Jenkins -> Configure System -> Pipeline Speed/Durability Settings and select "Performance-optimized: much faster (requires clean shutdown to save running pipelines)".

Restart Jenkins, then try again. The build log should show:

Running in Durability level: PERFORMANCE_OPTIMIZED

The job is now much faster.

           

          Jenkins 2.263.1
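The same setting can also be applied per pipeline instead of globally. If I remember the workflow-job plugin's symbol correctly it is `durabilityHint`, but treat that name as an assumption and confirm it in your instance's Pipeline Syntax snippet generator:

```groovy
// Sketch: request the performance-optimized durability level for this job only.
// The 'durabilityHint' symbol and the 'worker' label are assumptions here.
properties([durabilityHint('PERFORMANCE_OPTIMIZED')])

node('worker') {
    sh 'date +%c'
}
```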

           


  People

  Assignee: Unassigned
  Reporter: mkozell Mike Kozell
  Votes: 7
  Watchers: 18
