• Type: Bug
    • Resolution: Fixed
    • Priority: Critical

      Execution of parallel blocks scales poorly for values of N > 100.  With ~50 nodes (each with 4 executors, for a total of ~200 slots), the following pipeline job takes extraordinarily long to execute:

      // Build a map of SUB_JOBS identical branches, each of which just grabs a
      // "darwin" executor and prints a message.
      def stepsForParallel = [:]
      for (int i = 0; i < Integer.valueOf(params.SUB_JOBS); i++) {
        def s = "subjob_${i}"
        stepsForParallel[s] = {
          node("darwin") {
            echo "hello"
          }
        }
      }
      // Run all branches concurrently.
      parallel stepsForParallel
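
      For reference, wall-clock numbers like those in the table below could be collected by replacing the final parallel call with a timed version – a minimal sketch, assumed for illustration and not part of the original report:

      // Hypothetical timing wrapper around the parallel step above.
      long startMs = System.currentTimeMillis()
      parallel stepsForParallel
      echo "SUB_JOBS=${params.SUB_JOBS} finished in ${System.currentTimeMillis() - startMs} ms"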

      SUB_JOBS   Time (sec)
      ---------------------
       100         10
       200         40
       300         96
       400        214
       500        392
       600        660
       700        960
       800       1500
       900       2220
      1000       gave up...
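
      Reading the table: doubling SUB_JOBS at least quadruples the time (100 → 200 goes from 10 s to 40 s, i.e. 4x; 400 → 800 goes from 214 s to 1500 s, i.e. ~7x), so growth is at least quadratic in the number of branches and gets worse as N increases.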

      At no point does the underlying system become taxed (CPU utilization is very low, as this is a very beefy system – 28 cores, 128 GB RAM, SSDs).

      CPU and Thread CPU Time Sampling (via VisualVM) are attached for reference.


          [JENKINS-45553] Parallel pipeline execution scales poorly

          Sam Van Oort added a comment -

          manschwetus / florian_meser We appreciate your patience - releases are now cut and this should reflect a fairly comprehensive improvement for your case plus several related ones. It requires Pipeline API Plugin v2.22 and Pipeline Supporting APIs version 2.15 to be installed to get the full benefits.

          Please give them a try and let us know how they work out for you – based on our testing you should see a tremendous performance improvement from these changes!

          Puneeth Nanjundaswamy added a comment -

          Sep 27, 2017 8:09:24 PM io.jenkins.blueocean.rest.impl.pipeline.PipelineNodeGraphVisitor parallelStart
          SEVERE: nestedBranches size: 1 not equal to parallelBranchEndNodes: 0
          Sep 27, 2017 8:09:24 PM io.jenkins.blueocean.rest.impl.pipeline.PipelineNodeGraphVisitor parallelStart
          SEVERE: nestedBranches size: 16 not equal to parallelBranchEndNodes: 15
          Sep 27, 2017 8:09:24 PM io.jenkins.blueocean.rest.impl.pipeline.PipelineNodeGraphVisitor parallelStart
          SEVERE: nestedBranches size: 20 not equal to parallelBranchEndNodes: 19
          Sep 27, 2017 8:09:24 PM io.jenkins.blueocean.rest.impl.pipeline.PipelineNodeGraphVisitor parallelStart
          SEVERE: nestedBranches size: 23 not equal to parallelBranchEndNodes: 22
          Sep 27, 2017 8:09:24 PM io.jenkins.blueocean.rest.impl.pipeline.PipelineNodeGraphVisitor parallelStart
          SEVERE: nestedBranches size: 1 not equal to parallelBranchEndNodes: 0
          Sep 27, 2017 8:09:24 PM io.jenkins.blueocean.rest.impl.pipeline.PipelineNodeGraphVisitor parallelStart
          SEVERE: nestedBranches size: 16 not equal to parallelBranchEndNodes: 15
          Sep 27, 2017 8:09:24 PM io.jenkins.blueocean.rest.impl.pipeline.PipelineNodeGraphVisitor parallelStart
          SEVERE: nestedBranches size: 20 not equal to parallelBranchEndNodes: 19
          Sep 27, 2017 8:09:24 PM io.jenkins.blueocean.rest.impl.pipeline.PipelineNodeGraphVisitor parallelStart
          SEVERE: nestedBranches size: 23 not equal to parallelBranchEndNodes: 22
          Sep 27, 2017 8:09:24 PM io.jenkins.blueocean.rest.impl.pipeline.PipelineNodeGraphVisitor parallelStart
          SEVERE: nestedBranches size: 1 not equal to parallelBranchEndNodes: 0
          Sep 27, 2017 8:09:24 PM io.jenkins.blueocean.rest.impl.pipeline.PipelineNodeGraphVisitor parallelStart
          SEVERE: nestedBranches size: 16 not equal to parallelBranchEndNodes: 15

          Jenkins is spamming the logs with this. Jenkins 2.73.1. Pipeline API Plugin v2.22 and Pipeline Supporting APIs version 2.15. How do I fix this?

          Sam Van Oort added a comment -

          puneeth_n You'll need to open a new bug and provide the pipeline that triggers this error.  It could be an issue in one of two places and there's nothing in that log to permit me to reproduce this.

          SCM/JIRA link daemon added a comment -

          Code changed in jenkins
          User: Sam Van Oort
          Path:
          pom.xml
          src/main/java/org/jenkinsci/plugins/workflow/flow/FlowExecution.java
          src/main/java/org/jenkinsci/plugins/workflow/graph/BlockStartNode.java
          src/main/java/org/jenkinsci/plugins/workflow/graph/FlowNode.java
          src/main/java/org/jenkinsci/plugins/workflow/graph/GraphLookupView.java
          src/main/java/org/jenkinsci/plugins/workflow/graph/StandardGraphLookupView.java
          src/main/java/org/jenkinsci/plugins/workflow/graphanalysis/FlowScanningUtils.java
          src/main/java/org/jenkinsci/plugins/workflow/graphanalysis/NodeStepNamePredicate.java
          src/test/java/org/jenkinsci/plugins/workflow/graph/FlowNodeTest.java
          src/test/java/org/jenkinsci/plugins/workflow/graphanalysis/FlowScannerTest.java
          http://jenkins-ci.org/commit/workflow-cps-plugin/88ffdfc69c43bd4dde21a6578b5ac466999b4fd4
          Log:
          Revert "JENKINS-37573 / JENKINS-45553 Provide a fast view of block structures in the flow graph"

          SCM/JIRA link daemon added a comment -

          Code changed in jenkins
          User: Sam Van Oort
          Path:
          pom.xml
          src/main/java/org/jenkinsci/plugins/workflow/flow/FlowExecution.java
          src/main/java/org/jenkinsci/plugins/workflow/graph/BlockStartNode.java
          src/main/java/org/jenkinsci/plugins/workflow/graph/FlowNode.java
          src/main/java/org/jenkinsci/plugins/workflow/graph/GraphLookupView.java
          src/main/java/org/jenkinsci/plugins/workflow/graph/StandardGraphLookupView.java
          src/main/java/org/jenkinsci/plugins/workflow/graphanalysis/FlowScanningUtils.java
          src/main/java/org/jenkinsci/plugins/workflow/graphanalysis/NodeStepNamePredicate.java
          src/test/java/org/jenkinsci/plugins/workflow/graph/FlowNodeTest.java
          src/test/java/org/jenkinsci/plugins/workflow/graphanalysis/FlowScannerTest.java
          http://jenkins-ci.org/commit/workflow-cps-plugin/c0daeb5ce9ba55e6f51cb6c8db903cc5fbba324b
          Log:
          Merge pull request #52 from jenkinsci/revert-50-jenkins-27395-block-structure-lookup

          Revert "JENKINS-37573 / JENKINS-45553 Provide a fast view of block structures in the flow graph"

          Florian Meser added a comment -

          Hello svanoort, as you suggested above I tested the new versions and there is definitely an improvement. I updated shortly after you wrote that comment and I'm still using those versions. We rely heavily on this feature, since our whole test infrastructure depends on deploying data on nodes for many branches, so we essentially have a 24/7 running Jenkins (with up to 1-2k executors in the queue).

          Nevertheless, the scaling cannot be considered stable. We have many tests that need ~2 min but wait ~10-15 min (worst case) to be processed by Jenkins. As mentioned in https://issues.jenkins-ci.org/browse/JENKINS-45876, there seems to be a roughly quadratic or exponential correlation. That means even with the big improvement, it reaches its limits once that edge is crossed.

          In my opinion there is still room for further improvements so that large Jenkins environments also become more effective.

          Sam Van Oort added a comment -

          florian_meser I agree completely that there is some room for further optimization of massively-parallel pipeline execution – the best place to follow the work and investigations is now https://issues.jenkins-ci.org/browse/JENKINS-47724. That ticket also includes some concrete advice that may help with your scenario.

          If you'd like to add some quantitative scaling observations to help identify where the bottleneck is, that might be of some assistance – I also expect the work currently in beta release from JENKINS-47170 will help a bit (reduces the per-flownode overheads associated with pipelines quite significantly – that's a small component of parallel execution).

          Very likely you'll see a big improvement from the next phase of that work, https://issues.jenkins-ci.org/browse/JENKINS-38381, which was the culprit here for a lot of the nonlinear behaviors – that's slated to be my next strategic push on performance, along with some tactical fixes that may help with your scenario.
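
          If the JENKINS-47170 work mentioned above refers to the Pipeline durability/speed settings (an assumption here – the comment does not spell that out), a job can opt into the reduced-persistence mode once suitable plugin versions are installed, for example:

          // Assumed example of the per-job durability hint; requires plugin
          // releases that ship the Pipeline durability settings feature.
          properties([durabilityHint('PERFORMANCE_OPTIMIZED')])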

          Sam Van Oort added a comment -

          One other comment: the bottleneck appears to be only with massive parallels in a single pipeline – if you break your job into smaller ones with fewer parallel branches in each, the per-branch overheads will be less important.

          Pipeline is also never going to achieve fully linear scale-out with large numbers of executors, because only some parts of the execution can take full advantage of parallel execution – primarily the shell/batch/powershell steps that should be doing the bulk of work.  Our work is primarily focused on reducing the other overheads so it can spend more time executing those steps. 

          Amdahl's Law in spades, basically.
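
          To put the "break your job into smaller ones" suggestion in Pipeline terms, here is a minimal sketch – the downstream job name, batch size, and parameters are hypothetical, not from this ticket – that spreads the branches over several smaller jobs instead of one massive parallel step:

          // Hypothetical split: trigger one downstream job per batch of branches
          // instead of running all of them as a single massive parallel.
          int CHUNK = 50                             // assumed branches per downstream job
          int n = Integer.valueOf(params.SUB_JOBS)
          def launches = [:]
          for (int start = 0; start < n; start += CHUNK) {
            def first = start                        // capture the per-iteration value
            launches["batch_${first}"] = {
              // 'parallel-subset' is a hypothetical downstream pipeline that runs
              // CHUNK branches selected by its FIRST_INDEX/COUNT parameters.
              build job: 'parallel-subset',
                    parameters: [string(name: 'FIRST_INDEX', value: first.toString()),
                                 string(name: 'COUNT', value: CHUNK.toString())]
            }
          }
          parallel launches                          // now only N/CHUNK branches here

          Spelling out the Amdahl's Law point: if a fraction f of the run is inherently serial (the Groovy/CPS bookkeeping on the master) and the remainder spreads across E executors, the best possible speedup is 1 / (f + (1 - f) / E), so shrinking f is what the fixes discussed above target.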

          Florian Meser added a comment - edited

          svanoort I'm currently trying to implement some time measurements to get quantitative scaling observations. I don't have much time to spend on that at the moment, though. As soon as I have something I'll let you know.

          I don't know if this is off-topic, but it seems another showstopper just came in. Hence the question: are there any observations regarding the Meltdown/Spectre Windows 7 updates, which again seem to dramatically reduce the performance of our so-called "massive parallels in a single pipeline"?

          I'm observing a dramatic loss of performance, although no changes that would explain this symptom were made to our Jenkins pipelines. KB4056894 definitely was a patch containing Meltdown/Spectre fixes. I'm quite curious whether I'm the only one having this kind of trouble.

          Sam Van Oort added a comment -

          florian_meser I'm not sure what the performance impact of the Meltdown/Spectre updates is on Windows – I'm not really set up for scaling tests on Windows, but it might be related to changes in I/O performance.

          Please try out the advice I just added in the latest comment on  https://issues.jenkins-ci.org/browse/JENKINS-47724 – this should help considerably.  The last few months have been heavily focused on performance improvements to Pipeline and it should show in a big way.

            jglick Jesse Glick
            tskrainar Tom Skrainar
            Votes: 4
            Watchers: 13