JENKINS-26130: Print progress of pending pickles

Details

    Description

      After restarting Jenkins with a running flow that has some "pickled" object references (such as slave/workspace pairs from the node step), the flow does not resume until all pickles are resolved. This delay could be long, and the user may have no idea what is happening, because nothing is shown in the console.
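
      A minimal sketch of the kind of reporting requested here, assuming each pending pickle is represented by a labelled Future. The class PendingPickleReporter, its methods, and the message text are hypothetical and are not part of any Jenkins plugin API.

```java
import java.io.PrintStream;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Future;

// Hypothetical helper for illustration only; not Jenkins plugin code.
class PendingPickleReporter {

    /** Returns the labels of all futures that have not completed yet. */
    static List<String> pending(Map<String, ? extends Future<?>> pickles) {
        List<String> labels = new ArrayList<>();
        for (Map.Entry<String, ? extends Future<?>> e : pickles.entrySet()) {
            if (!e.getValue().isDone()) {
                labels.add(e.getKey());
            }
        }
        return labels;
    }

    /** Writes one line per unresolved pickle to the given log. */
    static void report(Map<String, ? extends Future<?>> pickles, PrintStream log) {
        for (String label : pending(pickles)) {
            log.println("Still waiting to resume: " + label);
        }
    }

    public static void main(String[] args) {
        Map<String, CompletableFuture<String>> pickles = new LinkedHashMap<>();
        // Unresolved: stands in for a workspace whose node is not back yet.
        pickles.put("workspace on agent-1", new CompletableFuture<>());
        // Already resolved: should not be reported.
        pickles.put("workspace on agent-2", CompletableFuture.completedFuture("ok"));
        report(pickles, System.out);
    }
}
```

      Calling something like report periodically while the flow is blocked would give the user a visible reason for the delay in the build console.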

          Activity

            jglick Jesse Glick added a comment -

            The lack of this can be a serious problem, since there is just no information in either the build log or the thread dump about what is (not) happening. I am seeing random failures

            "Executing resumeTwice(com.cloudbees.workflow.cps.checkpoint.CheckpointTest)" #1 … in Object.wait() […]
               java.lang.Thread.State: WAITING (on object monitor)
            	at java.lang.Object.wait(Native Method)
            	at java.lang.Object.wait(Object.java:502)
            	at hudson.remoting.AsyncFutureImpl.get(AsyncFutureImpl.java:73)
            	- locked <0x…> (a hudson.model.queue.FutureImpl)
            	at java_util_concurrent_Future$get.call(Unknown Source)
            	at com.cloudbees.workflow.cps.checkpoint.CheckpointTest$_resumeTwice_closure5.doCall(CheckpointTest.groovy:181)
            

            with no diagnostic information available.

            jglick Jesse Glick added a comment -

            Need to test the common scenario that a node block cannot resume because its slave was ephemeral. Can we recover with a hard kill?

            jglick Jesse Glick added a comment -

            A hard kill from JENKINS-25550 works around it. Still, it should behave better: it should show what it is trying to resume, and a regular interruption should stop that. Currently you are given no information, and have to escalate from a regular interrupt, to a step termination, to a hard kill, and then also separately kill the unschedulable queue item.

            By the design of pickle resolution, we cannot really recover the build, unfortunately. I argued during the initial design of Workflow that the node step’s state should include just the slave name and workspace path, and its onResume should be responsible for trying to get that workspace back (so that this step could handle stop gracefully, for example by throwing an exception up that the script could catch and handle); but kohsuke overrode me, insisting that the serialized program state should include a representation of the FilePath, and that script execution shall not resume until that pickle is successfully rehydrated (even if there were other branches able to proceed, etc.).

            So the best we can do is display more clearly what is wrong and offer a hard kill right away. W.r.t. the queue item, PlaceholderTask.run can tell via StepContext.isReady that it is still being unpickled, but it cannot use that to differentiate the case of a normal startup when we are waiting for a sluggish slave to come back online; it cannot even find the Run to tell whether it was already aborted, since it cannot call get yet. Probably it will need to persist a Run.externalizableId to implement run, and use that also instead of accessControlled. If the Run turns out to be finished, it could use Queue.getItems(Task) to cancel itself, so that cleanup from the whole process reduces to pressing the stop button once on the console page or on the flyweight executor in the executor widget.
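
            The self-cancelling queue item proposed above could look roughly like the following sketch. It does not use Jenkins classes: the run ID string stands in for a persisted Run.externalizableId, the map stands in for loading the Run and checking whether it is finished, and the list stands in for the build queue. All names here are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model of a queue placeholder; not the real PlaceholderTask.
class SelfCancellingPlaceholder {

    /** Stand-in for a persisted Run.externalizableId. */
    final String runId;

    SelfCancellingPlaceholder(String runId) {
        this.runId = runId;
    }

    /**
     * Removes this placeholder's entries from the queue if its run has finished.
     *
     * @param finishedRuns run ID mapped to true once that build has completed
     * @param queue        stand-in for the build queue
     * @return number of queue entries that were cancelled
     */
    int cancelIfRunFinished(Map<String, Boolean> finishedRuns,
                            List<SelfCancellingPlaceholder> queue) {
        if (!Boolean.TRUE.equals(finishedRuns.get(runId))) {
            return 0; // run still in progress or unknown: leave the queue alone
        }
        int before = queue.size();
        queue.removeIf(item -> item.runId.equals(runId));
        return before - queue.size();
    }

    public static void main(String[] args) {
        SelfCancellingPlaceholder task = new SelfCancellingPlaceholder("job#3");
        List<SelfCancellingPlaceholder> queue = new ArrayList<>();
        queue.add(task);
        Map<String, Boolean> finishedRuns = new HashMap<>();
        finishedRuns.put("job#3", true); // the build was already aborted
        System.out.println("cancelled: " + task.cancelIfRunFinished(finishedRuns, queue));
    }
}
```

            Because the placeholder only needs the persisted ID, it can make this decision before the step context is ready, which is the limitation the comment describes.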

            jglick Jesse Glick added a comment -

            It would be possible to differentiate between an EphemeralNode and a slave which is simply slow to come back online by checking whether the node is still defined while it is offline: if so, continue to wait; if not, abort right away. A slave using an inappropriate RetentionStrategy is trickier, since it might still be defined after a restart but will soon be killed. I suppose in that case it will be removed after a few minutes and the pickle can then abort itself.
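
            The check described above can be sketched as follows. The map of node names to online flags is a hypothetical stand-in for looking the node up in Jenkins; NodeResumeCheck and Decision are illustrative names, not plugin API.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical decision helper for illustration only.
class NodeResumeCheck {

    enum Decision { RESUME, KEEP_WAITING, ABORT }

    /**
     * Decides what a pending pickle should do for the given node.
     *
     * @param knownNodes node name mapped to whether that node is currently online;
     *                   a node that is no longer defined has no entry at all
     */
    static Decision decide(Map<String, Boolean> knownNodes, String nodeName) {
        Boolean online = knownNodes.get(nodeName);
        if (online == null) {
            return Decision.ABORT;        // node is gone, e.g. it was ephemeral
        }
        return online ? Decision.RESUME   // node is back: the workspace can be reacquired
                      : Decision.KEEP_WAITING; // defined but offline: may just be slow
    }

    public static void main(String[] args) {
        Map<String, Boolean> nodes = new HashMap<>();
        nodes.put("slow-agent", false);
        System.out.println(decide(nodes, "slow-agent"));
        System.out.println(decide(nodes, "deleted-agent"));
    }
}
```

            A node with an inappropriate RetentionStrategy would first yield KEEP_WAITING and only later ABORT, once it has been removed, so the check would need to be repeated rather than run once.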

            rsandell rsandell added a comment -

            Interesting behaviour in 1.13

            After a restart it does now report that it can't reconnect to the node. So I aborted and after doing a Hard Kill it did abort, but it left some fragment in the Build Queue after it had "successfully" aborted and completed.

                Aborted by anonymous
                Resuming build
                [ath-oss-split-1] Could not connect to 4fd7ec39 to send interrupt signal to process
                Aborted by Robert Sandell
                Click here to forcibly terminate running steps
                Click here to forcibly kill entire build
                Hard kill!
                Finished: ABORTED

            After this, the item (or some fragment of it) was still in the queue.

            jglick Jesse Glick added a comment -

            Have a rudimentary implementation. Have not yet tried to implement cancellability of pickle rehydration.

            jglick Jesse Glick added a comment -

            Implementation complete, except for the last suggestion to automatically abort an ExecutorPickle which is determined to be unloadable due to a deleted ephemeral node. This could be done later if desired. In the meantime, it is now much easier to identify and cancel builds affected by such issues.

            jglick Jesse Glick added a comment -

            Released as five plugin updates.


            scm_issue_link SCM/JIRA link daemon added a comment -

            Code changed in jenkins
            User: Jesse Glick
            Path:
            src/main/java/org/jenkinsci/plugins/workflow/flow/FlowExecutionOwner.java
            src/main/java/org/jenkinsci/plugins/workflow/pickles/Pickle.java
            http://jenkins-ci.org/commit/workflow-cps-plugin/cbd00d462ebfb8d320e87228f88baf6d6f0f90f3
            Log:
            JENKINS-26130 Way to print progress from pickles.

            scm_issue_link SCM/JIRA link daemon added a comment -

            Code changed in jenkins
            User: Jesse Glick
            Path:
            src/main/java/org/jenkinsci/plugins/workflow/flow/FlowExecutionList.java
            src/main/java/org/jenkinsci/plugins/workflow/flow/FlowExecutionOwner.java
            src/main/java/org/jenkinsci/plugins/workflow/pickles/Pickle.java
            http://jenkins-ci.org/commit/workflow-cps-plugin/d85f3c2d60f7d4a7f0f4e2687f6e31069b6d0f28
            Log:
            Merge pull request #5 from jglick/PPPP-JENKINS-26130

            JENKINS-26130 Way to print progress from pickles

            scm_issue_link SCM/JIRA link daemon added a comment -

            Code changed in jenkins
            User: Jesse Glick
            Path:
            src/main/java/org/jenkinsci/plugins/workflow/pickles/Pickle.java
            http://jenkins-ci.org/commit/workflow-cps-plugin/3705e3f13c84ae5a29a1cba0884e5e67f54e7caa
            Log:
            JENKINS-26130 JENKINS-31842 Request that pickle futures be printable.

            People

              jglick Jesse Glick
              jglick Jesse Glick
              Votes: 0
              Watchers: 5