JENKINS-54073

"Pipeline: Job" plugin causes Jenkins Master CPU to peg to ~100% during pipelines with parallel steps

    • workflow-api 2.31

      The attached graph more clearly depicts what's going on. I'm the only one using this Jenkins server today...

      Around 5 PM, I kicked off a pipeline. The pipeline builds a container and then uses that container to run some pytests...

      It is during that last "postbuild" phase that the CPU runs really hot. While the pytest shards are running, they are just doing a pytest run, and then I capture the junit.xml file.

      The reason this is a problem is that with too many of these running at the same time, Jenkins blips out and we cannot reach the web interface; once the CPU is pegged, it basically crashes the master.
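
      For context, here is a minimal, hypothetical sketch of the kind of pipeline described above (the image name, shard count, and paths are illustrative, not taken from the actual Jenkinsfile):

          // Scripted Pipeline sketch: build a test image, then run pytest shards in
          // parallel, collecting a junit.xml from each shard.
          stage('Build image') {
              node {
                  checkout scm
                  sh 'docker build -t myapp-test .'        // illustrative image name
              }
          }
          stage('Parallel pytest shards') {
              def shards = [:]
              for (int i = 0; i < 8; i++) {                // shard count is an assumption
                  def n = i
                  shards["pytest-shard-${n}"] = {
                      node {
                          checkout scm
                          sh "docker run --rm -v \$PWD:/work -w /work myapp-test pytest --junitxml=junit-${n}.xml"
                          junit "junit-${n}.xml"           // capture this shard's junit.xml
                      }
                  }
              }
              parallel shards
          }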

      Here are details about my Jenkins; my installation is 100% current:

       


          Patrick Ruckstuhl added a comment -

          Problem still occurs for me. After starting a couple of jobs which run some parallel stuff, I'm getting a huge number of "Computer.threadPoolForRemoting" threads until it reaches 30k, and then everything breaks down.

          After the first OOM I see lots of errors like

          SEVERE: This command is created here
          Command UserRequest:org.jenkinsci.plugins.workflow.steps.TimeoutStepExecution$ResetTimer@7e26e8e5 created at
                  at hudson.remoting.Command.<init>(Command.java:79)
                  at hudson.remoting.Request.<init>(Request.java:112)
                  at hudson.remoting.Request.<init>(Request.java:107)
                  at hudson.remoting.UserRequest.<init>(UserRequest.java:77)
                  at hudson.remoting.Channel.callAsync(Channel.java:985)
                  at org.jenkinsci.plugins.workflow.steps.TimeoutStepExecution$ConsoleLogFilterImpl$1.eol(TimeoutStepExecution.java:288)
                  at hudson.console.LineTransformationOutputStream.eol(LineTransformationOutputStream.java:60)
                  at hudson.console.LineTransformationOutputStream.write(LineTransformationOutputStream.java:56)
                  at hudson.console.LineTransformationOutputStream.write(LineTransformationOutputStream.java:74)
                  at org.jenkinsci.plugins.pipeline.maven.console.MaskSecretsOutputStream.eol(MaskSecretsOutputStream.java:36)
                  at hudson.console.LineTransformationOutputStream.eol(LineTransformationOutputStream.java:60)
                  at hudson.console.LineTransformationOutputStream.write(LineTransformationOutputStream.java:56)
                  at hudson.console.LineTransformationOutputStream.write(LineTransformationOutputStream.java:74)
                  at hudson.tasks._maven.MavenConsoleAnnotator.eol(MavenConsoleAnnotator.java:75)
                  at hudson.console.LineTransformationOutputStream.eol(LineTransformationOutputStream.java:60)
                  at hudson.console.LineTransformationOutputStream.write(LineTransformationOutputStream.java:56)
                  at hudson.console.LineTransformationOutputStream.write(LineTransformationOutputStream.java:74)
                  at org.apache.commons.io.IOUtils.copyLarge(IOUtils.java:1793)
                  at org.apache.commons.io.IOUtils.copyLarge(IOUtils.java:1769)
                  at org.apache.commons.io.IOUtils.copy(IOUtils.java:1744)
                  at org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep$HandlerImpl.output(DurableTaskStep.java:582)
                  at org.jenkinsci.plugins.durabletask.FileMonitoringTask$Watcher.run(FileMonitoringTask.java:477)
                  at java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
                  at java.util.concurrent.FutureTask.run(Unknown Source)
                  at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(Unknown Source)
                  at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(Unknown Source)
                  at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
                  at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
                  at java.lang.Thread.run(Unknown Source)
          

          I'm wondering if the fact that we're wrapping our jobs with a

          timeout(activity: true, time: 2, unit: 'HOURS') {
              // ... job body ...
          }


          could be involved in this as well (as I guess this has to reset a counter on each log output).
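
          (For reference, the two forms of the step are sketched below with a placeholder body; the stack trace above does suggest that the activity form issues a ResetTimer remoting call for each line of console output.)

          // Plain timeout: aborts after 2 hours of total wall-clock time.
          timeout(time: 2, unit: 'HOURS') {
              sh './run-tests.sh'                // placeholder step
          }

          // Activity-based timeout: aborts after 2 hours with no new log output,
          // so the timer is reset every time a line is written.
          timeout(activity: true, time: 2, unit: 'HOURS') {
              sh './run-tests.sh'
          }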

          Another thing is that a couple of seconds before it runs out of threads I see a huge number of messages like

          Oct 25, 2018 8:24:18 AM org.jenkinsci.plugins.workflow.flow.FlowExecutionList unregister
          WARNING: Owner[Core Mainline Status/PR-2684/1058:Core Mainline Status/PR-2684 #1058] was not in the list to begin with: [Owner[Core Integration/PR-2968/6:Core Integration/PR-2968 #6], Owner[Core Acceptance Installer/release%2F18_1_x/52:Core Acceptance Installer/release%2F18_1_x #52], Owner[Core Acceptance Installer/master/79:Core Acceptance Installer/master #79], Owner[Core Performance/master/30:Core Performance/master #30], .......]
          


          Lars Bilke added a comment -

          For me the fix works so far. No problems. We also use timeout() but without the activity argument.

          Thanks a lot!


          Patrick Ruckstuhl added a comment -

          Maybe also important: we're on LTS 2.138.2.


          Jesse Glick added a comment -

          tario you seem to have a distinct issue from everyone else. I suspect there is something amiss in my fix of JENKINS-54078 specific to using activity: true.


          Patrick Ruckstuhl added a comment -

          OK, that could very well be, as I had to install your patch because otherwise nothing was working.


          Devin Nusbaum added a comment -

          A fix for this was just released in version 2.31 of the Pipeline API Plugin. (Note that this bug should only affect you if you are running workflow-job 2.26.)


          Jon B added a comment -

          Sorry for having disappeared on this JIRA... I will upgrade this plugin tonight, rerun my pipeline to see if it still pegs the CPU to 100%, and confirm by tomorrow. Thank you for your hard work on this!


          Jon B added a comment -

          Question: I see this message: "2.27 (2018 Nov 01) - WARNING: Although major issues have been fixed since 2.26, this version carries extra risk and is not fully backwards compatible with 2.25 or older; consider waiting a few days to update in critical environments."

          If I upgrade my plugins and it pegs to 100% CPU again, will I be able to downgrade back to a healthy state with the Jenkins downgrade option? Or should I snapshot my master just in case I have to do a full restore of the master?


          Devin Nusbaum added a comment -

          piratejohnny I would always encourage you to back up your master before updating plugins, just in case. As far as I am aware, the compatibility notice is there because if you downgrade from workflow-job 2.26 or 2.27 back to 2.25 or older, the per-step logs for builds you ran in versions 2.26-2.27 will be unreadable, the full build log will be viewable but its formatting will be a little messed up, and any in-progress builds will fail. I have not specifically tested a downgrade, though, so there may be other issues I am not aware of.


          Jon B added a comment -

          Sounds good...

          A scheduled task snapshots the master each day but I'll run an extra one before doing this upgrade.

          Thank you.


            Assignee: Jesse Glick (jglick)
            Reporter: Jon B (piratejohnny)
            Votes: 2
            Watchers: 13
