After restarting Jenkins with a running flow that has some "pickled" object references (such as slave/workspace pairs from the node step), the flow does not resume until all pickles are resolved. This delay could be long, and the user may have no idea what is happening, because nothing is shown in the console.

          [JENKINS-26130] Print progress of pending pickles

          Jesse Glick added a comment -

          Need to test the common scenario that a node block cannot resume because its slave was ephemeral. Can we recover with a hard kill?


          Jesse Glick added a comment -

          A hard kill from JENKINS-25550 works around it. Still, should behave better: should show what it is trying to resume, and a regular interruption should stop that. Currently you are given no information, and have to escalate from a regular interrupt, to a step termination, to a hard kill, and then also separately kill the unschedulable queue item.

          By the design of pickle resolution, we cannot really recover the build, unfortunately. I argued during the initial design of Workflow that the node step’s state should include just the slave name and workspace path, and its onResume should be responsible for trying to get that workspace back (so that this step could handle stop gracefully, for example by throwing an exception up that the script could catch and handle); but kohsuke overrode me, insisting that the serialized program state should include a representation of the FilePath, and that script execution shall not resume until that pickle is successfully rehydrated (even if there were other branches able to proceed, etc.).

          So the best we can do is display more clearly what is wrong and offer a hard kill right away. With respect to the queue item, PlaceholderTask.run can tell via StepContext.isReady that it is still being unpickled, but it cannot use that to differentiate the case of a normal startup when we are waiting for a sluggish slave to come back online; it cannot even find the Run to tell whether it was already aborted, since it cannot call get yet. Probably it will need to persist a Run.externalizableId to implement run, and use that also instead of accessControlled. If the Run turns out to be finished, it could use Queue.getItems(Task) to cancel itself, so that cleanup from the whole process reduces to pressing the stop button once on the console page or on the flyweight executor in the executor widget.
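The self-cancellation idea can be illustrated with a minimal plain-Java sketch. It does not use the Jenkins API: `RunRecord`, `Placeholder`, and `cancelOrphans` are hypothetical stand-ins for a build looked up by a persisted ID, a queue item that remembers that ID, and the cleanup pass. It only shows the rule that a queued placeholder is dropped once its owning run is known to be finished.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical model of the cleanup rule; none of these types are Jenkins classes. */
public class PlaceholderCleanupSketch {

    /** Stand-in for a build that can be looked up by a persisted ID. */
    public static final class RunRecord {
        final boolean finished;
        public RunRecord(boolean finished) { this.finished = finished; }
    }

    /** Stand-in for a queue item that persists the ID of the run it belongs to. */
    public static final class Placeholder {
        final String runId;
        public Placeholder(String runId) { this.runId = runId; }
    }

    /**
     * Removes every placeholder whose run is known and already finished.
     * Placeholders whose run cannot be found are left alone, since nothing
     * can be concluded about them. Returns the number of items removed.
     */
    public static int cancelOrphans(List<Placeholder> queue, Map<String, RunRecord> runs) {
        int before = queue.size();
        queue.removeIf(p -> {
            RunRecord r = runs.get(p.runId);
            return r != null && r.finished;
        });
        return before - queue.size();
    }

    public static void main(String[] args) {
        Map<String, RunRecord> runs = new HashMap<>();
        runs.put("job#1", new RunRecord(true));   // aborted or otherwise finished
        runs.put("job#2", new RunRecord(false));  // still running

        List<Placeholder> queue = new ArrayList<>();
        queue.add(new Placeholder("job#1"));
        queue.add(new Placeholder("job#2"));

        System.out.println("cancelled: " + cancelOrphans(queue, runs));
    }
}
```

In a real implementation the lookup would presumably go through the persisted externalizable ID, and the removal through the queue, as the comment suggests.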


          Jesse Glick added a comment -

          Would be possible to differentiate between an EphemeralNode and a slave which is simply slow to come back online by checking whether the node is defined while it is offline—if so, continue to wait, if not, just abort right away. A slave using an inappropriate RetentionStrategy is trickier since it might still be defined after a restart, but will soon be killed. I suppose in that case it will be removed after a few minutes and the pickle can abort itself.
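The distinction described here reduces to a small decision rule, sketched below in plain Java. The names `PickleResumeDecision`, `Action`, and `decide` are hypothetical and not part of Jenkins. The sketch assumes the check is re-evaluated periodically, so a node that is later removed by its retention strategy eventually falls into the abort case.

```java
/** Hypothetical decision rule; not a Jenkins class. */
public class PickleResumeDecision {

    public enum Action { PROCEED, WAIT, ABORT }

    /**
     * @param nodeDefined whether the node still exists in the configuration
     * @param nodeOnline  whether the node is currently connected
     */
    public static Action decide(boolean nodeDefined, boolean nodeOnline) {
        if (!nodeDefined) {
            // The node is gone (for example an ephemeral node after a restart):
            // waiting cannot help, so give up right away.
            return Action.ABORT;
        }
        if (nodeOnline) {
            // The node is back; the pickle can be resolved.
            return Action.PROCEED;
        }
        // Defined but offline: possibly just slow to reconnect, so keep waiting.
        // If it is removed later, the next check returns ABORT.
        return Action.WAIT;
    }

    public static void main(String[] args) {
        System.out.println(decide(false, false));
        System.out.println(decide(true, false));
        System.out.println(decide(true, true));
    }
}
```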


          rsandell added a comment -

          Interesting behaviour in 1.13

          After a restart it does now report that it can't reconnect to the node. So I aborted and after doing a Hard Kill it did abort, but it left some fragment in the Build Queue after it had "successfully" aborted and completed.
          {{
          Aborted by anonymous
          Resuming build
          [ath-oss-split-1] Could not connect to 4fd7ec39 to send interrupt signal to process
          Aborted by Robert Sandell
          Click here to forcibly terminate running steps
          Click here to forcibly kill entire build
          Hard kill!
          Finished: ABORTED}}

          After this, the item (or some fragment of it) was still in the queue.


          Jesse Glick added a comment -

          Have a rudimentary implementation. Have not yet tried to implement cancellability of pickle rehydration.


          Jesse Glick added a comment -

          Implementation complete, except for the last suggestion to automatically abort an ExecutorPickle which is determined to be unloadable due to a deleted ephemeral node. This could be done later if desired. In the meantime, it is now much easier to identify and cancel builds affected by such issues.


          Jesse Glick added a comment -

          Released as five plugin updates.


          Code changed in jenkins
          User: Jesse Glick
          Path:
          src/main/java/org/jenkinsci/plugins/workflow/flow/FlowExecutionOwner.java
          src/main/java/org/jenkinsci/plugins/workflow/pickles/Pickle.java
          http://jenkins-ci.org/commit/workflow-cps-plugin/cbd00d462ebfb8d320e87228f88baf6d6f0f90f3
          Log:
          JENKINS-26130 Way to print progress from pickles.


          Code changed in jenkins
          User: Jesse Glick
          Path:
          src/main/java/org/jenkinsci/plugins/workflow/flow/FlowExecutionList.java
          src/main/java/org/jenkinsci/plugins/workflow/flow/FlowExecutionOwner.java
          src/main/java/org/jenkinsci/plugins/workflow/pickles/Pickle.java
          http://jenkins-ci.org/commit/workflow-cps-plugin/d85f3c2d60f7d4a7f0f4e2687f6e31069b6d0f28
          Log:
          Merge pull request #5 from jglick/PPPP-JENKINS-26130

          JENKINS-26130 Way to print progress from pickles


          Code changed in jenkins
          User: Jesse Glick
          Path:
          src/main/java/org/jenkinsci/plugins/workflow/pickles/Pickle.java
          http://jenkins-ci.org/commit/workflow-cps-plugin/3705e3f13c84ae5a29a1cba0884e5e67f54e7caa
          Log:
          JENKINS-26130 JENKINS-31842 Request that pickle futures be printable.


            jglick Jesse Glick