          [JENKINS-26130] Print progress of pending pickles

After restarting Jenkins with a running flow that has some "pickled" object references (such as slave/workspace pairs from the node step), the flow does not resume until all pickles are resolved. This delay could be long, and the user may have no idea what is happening, because nothing is shown in the console.
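          As a rough illustration of what this issue asks for (the PendingPickle type and the wiring below are assumptions, not the actual workflow-cps API): a small reporter that periodically prints any still-unresolved pickles to the build console, so the user can see what the resumed flow is waiting on.

          import hudson.model.TaskListener;
          import java.util.List;
          import java.util.concurrent.Executors;
          import java.util.concurrent.ScheduledExecutorService;
          import java.util.concurrent.TimeUnit;

          class PendingPickleReporter {
              /** Stand-in for whatever handle the pickle-resolution layer exposes. */
              interface PendingPickle {
                  boolean isResolved();
                  String describe();
              }

              private final ScheduledExecutorService timer = Executors.newSingleThreadScheduledExecutor();

              /** Every 30 seconds, print the pickles that have not yet been resolved. */
              void start(TaskListener listener, List<PendingPickle> pending) {
                  timer.scheduleWithFixedDelay(() -> {
                      for (PendingPickle p : pending) {
                          if (!p.isResolved()) {
                              listener.getLogger().println("Still waiting to resume: " + p.describe());
                          }
                      }
                  }, 30, 30, TimeUnit.SECONDS);
              }

              void stop() {
                  timer.shutdownNow();
              }
          }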

          Jesse Glick created issue -
          Jesse Glick made changes -
          Link New: This issue is blocking JENKINS-25890 [ JENKINS-25890 ]
          Jesse Glick made changes -
          Labels New: testing

          Jesse Glick added a comment -

          The lack of this can be a serious problem, since there is just no information in either the build log or the thread dump about what is (not) happening. I am seeing random failures such as

          "Executing resumeTwice(com.cloudbees.workflow.cps.checkpoint.CheckpointTest)" #1 … in Object.wait() […]
             java.lang.Thread.State: WAITING (on object monitor)
          	at java.lang.Object.wait(Native Method)
          	at java.lang.Object.wait(Object.java:502)
          	at hudson.remoting.AsyncFutureImpl.get(AsyncFutureImpl.java:73)
          	- locked <0x…> (a hudson.model.queue.FutureImpl)
          	at java_util_concurrent_Future$get.call(Unknown Source)
          	at com.cloudbees.workflow.cps.checkpoint.CheckpointTest$_resumeTwice_closure5.doCall(CheckpointTest.groovy:181)
          

          with no diagnostic information available.
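          For reference, the hang above is an unbounded Future.get(). As a test-side stopgap, independent of the fix this issue asks for, the wait could be bounded so the test fails with a timeout instead of hanging silently; a minimal sketch, where buildFuture stands in for whatever future the test blocks on:

          import java.util.concurrent.Future;
          import java.util.concurrent.TimeUnit;
          import java.util.concurrent.TimeoutException;

          class BoundedWait {
              // Sketch only: buildFuture is a placeholder, not taken from CheckpointTest.
              static <T> T waitBounded(Future<T> buildFuture) throws Exception {
                  try {
                      // Fail after five minutes rather than waiting forever in Object.wait().
                      return buildFuture.get(5, TimeUnit.MINUTES);
                  } catch (TimeoutException x) {
                      throw new AssertionError("Build never resumed; pending pickles?", x);
                  }
              }
          }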

          Jesse Glick made changes -
          Labels Original: testing New: diagnostics robustness testing
          Jesse Glick made changes -
          Link New: This issue is related to JENKINS-29705 [ JENKINS-29705 ]

          Jesse Glick added a comment -

          Need to test the common scenario where a node block cannot resume because its slave was ephemeral. Can we recover with a hard kill?

          Jesse Glick made changes -
          Link New: This issue is related to JENKINS-25550 [ JENKINS-25550 ]

          Jesse Glick added a comment -

          A hard kill from JENKINS-25550 works around it. Still, this should behave better: the build should show what it is trying to resume, and a regular interruption should stop that. Currently you are given no information, and have to escalate from a regular interrupt, to a step termination, to a hard kill, and then also separately kill the unschedulable queue item.

          By the design of pickle resolution, we cannot really recover the build, unfortunately. I argued during the initial design of Workflow that the node step’s state should include just the slave name and workspace path, and its onResume should be responsible for trying to get that workspace back (so that this step could handle stop gracefully, for example by throwing an exception up that the script could catch and handle); but kohsuke overrode me, insisting that the serialized program state should include a representation of the FilePath, and that script execution shall not resume until that pickle is successfully rehydrated (even if there were other branches able to proceed, etc.).
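          A minimal sketch of that alternative design (WorkspaceState and its use from onResume are illustrative assumptions, not the actual node step implementation): persist only the node name and workspace path, and on resume look the node up again, failing in a way the script could catch.

          import hudson.FilePath;
          import hudson.model.Node;
          import jenkins.model.Jenkins;

          class WorkspaceState implements java.io.Serializable {
              final String nodeName;       // e.g. "linux-agent-1" (hypothetical)
              final String workspacePath;  // e.g. "/home/jenkins/workspace/job" (hypothetical)

              WorkspaceState(String nodeName, String workspacePath) {
                  this.nodeName = nodeName;
                  this.workspacePath = workspacePath;
              }

              /** Would be called from a hypothetical onResume(): rehydrate the workspace or fail visibly. */
              FilePath rehydrate() throws java.io.IOException {
                  Node node = Jenkins.get().getNode(nodeName);
                  if (node == null) {
                      // The agent is gone (e.g. it was ephemeral); surface an exception the
                      // script could catch rather than blocking the whole program forever.
                      throw new java.io.IOException("Node " + nodeName + " no longer exists");
                  }
                  FilePath ws = node.createPath(workspacePath);
                  if (ws == null) {
                      throw new java.io.IOException("Node " + nodeName + " is not currently online");
                  }
                  return ws;
              }
          }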

          So the best we can do is display more clearly what is wrong and offer a hard kill right away. W.r.t. the queue item, PlaceholderTask.run can tell via StepContext.isReady that it is still being unpickled, but it cannot use that to distinguish this from a normal startup in which we are simply waiting for a sluggish slave to come back online; it cannot even find the Run to tell whether it was already aborted, since it cannot call get yet. Probably it will need to persist a Run.externalizableId to implement run, and use that also instead of accessControlled. If the Run turns out to be finished, it could use Queue.getItems(Task) to cancel itself, so that cleanup from the whole process reduces to pressing the stop button once on the console page or on the flyweight executor in the executor widget.
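          Roughly, assuming the persisted externalizableId and a PlaceholderTask-like Queue.Task are available (the surrounding wiring here is assumed, not taken from the plugin):

          import hudson.model.Queue;
          import hudson.model.Run;
          import java.util.List;

          class QueueCleanup {
              /** If the owning run is already finished, cancel any queue items for this task. */
              static void cancelIfRunFinished(String externalizableId, Queue.Task task) {
                  Run<?, ?> run = Run.fromExternalizableId(externalizableId);
                  if (run != null && !run.isBuilding()) {
                      Queue queue = Queue.getInstance();
                      List<Queue.Item> items = queue.getItems(task);
                      for (Queue.Item item : items) {
                          queue.cancel(item); // one press of the stop button now suffices for cleanup
                      }
                  }
              }
          }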


          Jesse Glick added a comment -

          It would be possible to differentiate between an EphemeralNode and a slave which is simply slow to come back online by checking whether the node is still defined while it is offline: if so, continue to wait; if not, just abort right away. A slave using an inappropriate RetentionStrategy is trickier, since it might still be defined after a restart but will soon be killed. I suppose in that case it will be removed after a few minutes and the pickle can abort itself.
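          A sketch of that check (the abort/keep-waiting plumbing around it is an assumption, not the actual pickle API): if the agent is no longer defined at all, abort immediately; if it is defined but offline, keep waiting for it to reconnect.

          import hudson.model.Node;
          import jenkins.model.Jenkins;

          class NodeResumeCheck {
              enum Decision { KEEP_WAITING, ABORT }

              static Decision check(String nodeName) {
                  Node node = Jenkins.get().getNode(nodeName);
                  if (node == null) {
                      // The node definition is gone, e.g. an EphemeralNode: waiting is pointless.
                      return Decision.ABORT;
                  }
                  // The node is still defined; it may simply be slow to come back after the restart.
                  return Decision.KEEP_WAITING;
              }
          }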

