Jenkins / JENKINS-41791

Build cannot be resumed if parallel was used with Kubernetes plugin

    • Released as: workflow-cps 2.66

      I have a lot of Pipeline jobs that are waiting for user input before deploying to production. It is quite normal that pipelines remain in this state for several days.

      After a Jenkins restart (e.g. because of a Jenkins update), the pipelines are still in the running state, but the user-input controls are missing.
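
      For illustration only, a minimal scripted-pipeline sketch of the scenario described above (this is not the reporter's actual Jenkinsfile; stage names and commands are placeholders):

        node {
          stage('Build') {
            // Placeholder for the real build work.
            sh 'make'
          }
        }

        stage('Approve prod') {
          // The build parks here, potentially for days, until a user clicks Proceed or Abort.
          input message: 'Deploy to production?'
        }

        node {
          stage('Deploy to production') {
            // Placeholder for the real deployment steps.
            sh 'make deploy'
          }
        }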

      Screenshots attached: log.png (57 kB), pipeline view.png (68 kB), paused for input.png (138 kB), blue ocean.png (39 kB).

      As the "paused for input" screenshot shows, the build is paused for input, but the controls are missing.

          Jesse Glick added a comment -

          “The pod does actually go away though outside of the node blocks.”

          Not in my experience, although I did see it get killed later for unknown reasons. Not sure if that is related to the bug.

          Finally managed to reproduce. FTR:

          • Run Microk8s.
          • Run: java -Dhudson.Main.development=true -jar jenkins-war-2.164.1.war --prefix=/jenkins --httpPort=8090 --httpListenAddress=10.1.1.1
          • Load Jenkins via http://10.1.1.1:8090/jenkins/ which should also set its default URL (needed by the Kubernetes jnlp container callback).
          • Install the blueocean, pipeline-stage-view, and kubernetes plugins.
          • Add a Kubernetes cloud (no further configuration needed).
          • Create a Pipeline job based on yours but with that stuff uncommented, and with the resources section removed, to wit:
            def label = "jenkins-input-repro-${UUID.randomUUID().toString()}"
            def scmVars;
            
            def endToEndTests(target) {
              stage('End to end tests') {
                container('ci') {
                  parallel '1': {
                    sh 'sleep 10'
                  }, '2': {
                    sh 'sleep 10'
                  }, '3': {
                    sh 'sleep 10'
                  }
                }
              }
            }
            
            def deploy(target) {
              stage("Deploy to ${target}") {
                container('ci') {
                  sh 'sleep 10'
                }
              }
            }
            
            podTemplate(label: label, podRetention: onFailure(), activeDeadlineSeconds: 600, yaml: """
            apiVersion: v1
            kind: Pod
            spec:
              containers:
                - name: ci
                  image: golang:latest
                  tty: true
            """
              ) {
            
              node(label) {
                this.deploy('staging')
                this.endToEndTests('staging')
              }
            
              stage('Approve prod') {
                input message: 'Deploy to prod?'
              }
            
              node(label) {
                this.deploy('prod')
                this.endToEndTests('prod')
              }
            }
            
          • Start a build.
          • Wait for it to get to approval.
          • /safeRestart
          • When Jenkins comes back up, try to Proceed.

          Not sure exactly how endToEndTests plays into this, but compared to the passing case there is no “Ready to run at …” message, and http://10.1.1.1:8090/jenkins/job/p/1/threadDump/ displays

          Program is not yet loaded
          	Looking for path named ‘/home/jenkins/workspace/p’ on computer named ‘jenkins-input-repro-15556b06-275f-494f-84d3-4ba9c006f0c1--8b27g’
          

          and jstack shows various threads waiting for a monitor on InputAction.getExecutions, with one locked like

          "Handling GET /jenkins/job/p/wfapi/runs from 10.1.1.1 : …" … waiting on condition […]
             java.lang.Thread.State: TIMED_WAITING (parking)
          	at sun.misc.Unsafe.park(Native Method)
          	- parking to wait for  <…> (a com.google.common.util.concurrent.AbstractFuture$Sync)
          	at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
          	at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedNanos(AbstractQueuedSynchronizer.java:1037)
          	at java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1328)
          	at com.google.common.util.concurrent.AbstractFuture$Sync.get(AbstractFuture.java:258)
          	at com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:91)
          	at org.jenkinsci.plugins.workflow.support.steps.input.InputAction.loadExecutions(InputAction.java:71)
          	- locked <…> (a org.jenkinsci.plugins.workflow.support.steps.input.InputAction)
          	at org.jenkinsci.plugins.workflow.support.steps.input.InputAction.getExecutions(InputAction.java:146)
          	- locked <…> (a org.jenkinsci.plugins.workflow.support.steps.input.InputAction)
          	at com.cloudbees.workflow.rest.external.RunExt.isPendingInput(RunExt.java:347)
          	at com.cloudbees.workflow.rest.external.RunExt.initStatus(RunExt.java:377)
          	at com.cloudbees.workflow.rest.external.RunExt.createMinimal(RunExt.java:241)
          	at com.cloudbees.workflow.rest.external.RunExt.createNew(RunExt.java:317)
          	at com.cloudbees.workflow.rest.external.RunExt.create(RunExt.java:309)
          	at com.cloudbees.workflow.rest.external.JobExt.create(JobExt.java:131)
          	at com.cloudbees.workflow.rest.endpoints.JobAPI.doRuns(JobAPI.java:69)
          	at …
          

          all of which does point to JENKINS-37998 as a root cause. The other contributing issue is a durability problem in the kubernetes plugin: the agent has either gone offline or restarted without a persistent workspace.
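
          Purely as an illustration of the "persistent workspace" point above (not something tested in this thread), a sketch of a pod template whose workspace is backed by a pre-existing PersistentVolumeClaim via the kubernetes plugin's workspaceVolume option; the claim name jenkins-ws is hypothetical:

            // Sketch only: 'jenkins-ws' is a hypothetical, pre-existing PVC in the agent namespace.
            def label = "persistent-ws-${UUID.randomUUID().toString()}"
            podTemplate(label: label,
                containers: [containerTemplate(name: 'ci', image: 'golang:latest', ttyEnabled: true)],
                workspaceVolume: persistentVolumeClaimWorkspaceVolume(claimName: 'jenkins-ws', readOnly: false)) {
              node(label) {
                // The workspace directory now lives on the PVC rather than an ephemeral emptyDir,
                // so its contents can survive the pod being restarted.
                sh 'ls -la'
              }
            }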

          Jesse Glick added a comment -

          I had assumed that a workaround in this case would be to actually release the pod before pausing the input (as you ought to do anyway):

          def endToEndTests(target) {
            stage('End to end tests') {
              container('ci') {
                parallel '1': {
                  sh 'sleep 10'
                }, '2': {
                  sh 'sleep 10'
                }, '3': {
                  sh 'sleep 10'
                }
              }
            }
          }
          
          def deploy(target) {
            stage("Deploy to ${target}") {
              container('ci') {
                sh 'sleep 10'
              }
            }
          }
          
          def everything(target) {
              def label = "jenkins-input-repro-${UUID.randomUUID().toString()}"
              podTemplate(label: label, podRetention: onFailure(), activeDeadlineSeconds: 600, yaml: '''
          apiVersion: v1
          kind: Pod
          spec:
            containers:
              - name: ci
                image: golang:latest
                tty: true
          ''') {
                  node(label) {
                      deploy(target)
                      endToEndTests(target)
                  }
              }
          }
          
          everything 'staging'
          
          stage('Approve prod') {
              input message: 'Deploy to prod?'
          }
          
          everything 'prod'
          

          Unfortunately it does not seem to work. The virtual thread dump shows that the program load is trying to deserialize a FilePath from jenkins-input-repro-390fc568-acbd-434d-9083-8de662760e28-khwqp, while the currently active agent is named jenkins-input-repro-390fc568-acbd-434d-9083-8de662760e28-5l5jr. Why the kubernetes plugin is reconstructing this agent, I have no idea (it should have been deleted as soon as the first node block exited, and in fact the second agent does disconnect on its own after a while); but that is less important than the fact that program.dat includes a FilePathPickle for a node block which has already closed.

          Tim Myers added a comment - edited

          Hmm, interesting. My pods have definitely been destroyed and recreated anew for each node() scope block, whether that's expected/intended or not. As you said though, that could well be related to the bug.

          Jesse Glick added a comment -

          Removing the parallel step and running the same sh steps sequentially does work around the issue, confirming that on top of everything else there is a serial form leak bug in workflow-cps.
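
          For concreteness, a sketch (not from the comment above) of the sequential variant of the endToEndTests helper from the repro script, with the parallel step dropped and the same sh steps run one after another:

            def endToEndTests(target) {
              stage('End to end tests') {
                container('ci') {
                  // Same three steps as the parallel version, just run one after another.
                  sh 'sleep 10'
                  sh 'sleep 10'
                  sh 'sleep 10'
                }
              }
            }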

          Tim Myers added a comment -

          Haha, awesome. Well, maybe I can work around it for now by just removing any usage of parallel. I'll give that a shot on the actual workflow.

          Jesse Glick added a comment -

          I was hoping that my proposed fix of JENKINS-41854 would solve the symptom in at least some cases, by allowing a PickleDynamicContext to be saved in program.dat rather than the actual FilePath in a DryCapsule. Unfortunately it does not seem to work—something is apparently still trying to rehydrate the bogus pickle—though at least the override of TryRepeatedly.getOwner from FilePathPickle makes the problem a bit more apparent, as the resumed build will repeatedly print

          Still trying to load Looking for path named ‘/home/jenkins/workspace/workaround’ on computer named ‘jenkins-input-repro-48e88ac1-40ab-4d74-9ba2-1f25e728be3c--3h6s8’
          

          Jesse Glick added a comment -

          Digging into the program state confirms that all kinds of stuff including the ContainerExecDecorator is still there even after the container step exited. Seems like a bug in ParallelStep.

          Jesse Glick added a comment -

          JENKINS-53709 was a similar issue.

          Jesse Glick added a comment -

          Have a fix for the workflow-cps problem, and it seems to correct the symptom as reported here. JENKINS-37998 is still valid, since it could affect other scenarios, but this seems to be the important fix.

          Devin Nusbaum added a comment -

          A fix for this issue was just released in version 2.66 of the Pipeline: Groovy plugin (workflow-cps).

            Assignee: Jesse Glick (jglick)
            Reporter: Daniel Wilmer (dawi)
            Votes: 14
            Watchers: 16