
[JENKINS-60507] Pipeline stuck when allocating machine | node block appears to be neither running nor scheduled

Our build system sometimes shows the following in the thread dump of a Pipeline while it is waiting for free executors:

      Thread #94
      at DSL.node(node block appears to be neither running nor scheduled)
      at WorkflowScript.runOnNode(WorkflowScript:1798)
      at DSL.timeout(body has another 3 hr 14 min to run)
      at WorkflowScript.runOnNode(WorkflowScript:1783)
      at DSL.retry(Native Method)
      at WorkflowScript.runOnNode(WorkflowScript:1781)
      at WorkflowScript.getClosure(WorkflowScript:1901)
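For context, the dump above corresponds to a retry step wrapping a timeout step wrapping a node step. A minimal, hypothetical sketch of a runOnNode helper with that shape (only the label expression comes from the report below; the retry count and timeout value are made-up examples):

    // Hypothetical reconstruction of the call structure shown in the thread dump.
    // Only the label expression is taken from the report; the numbers are placeholders.
    def runOnNode(Closure body) {
        retry(3) {                                 // DSL.retry
            timeout(time: 4, unit: 'HOURS') {      // DSL.timeout
                node('pr&&prod&&mac&&build') {     // DSL.node – this is where the build hangs
                    body()
                }
            }
        }
    }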

       
In Blue Ocean the following message appears, even though the build queue is empty and executors with those labels are available:

      Still waiting to schedule task
      Waiting for next available executor on pr&&prod&&mac&&build

       

The job can only be completed by aborting it or by waiting for the timeout step to do its work.

We first observed this on v2.121.3 (workflow-durable-task-step v2.19); we have since updated to v2.190.1 (workflow-durable-task-step v2.28) and are still seeing pipelines stuck while waiting for executors.

The only reference I could find was in the last comment of https://issues.jenkins-ci.org/browse/JENKINS-42556, and we have no way to reproduce it. We noticed the fix made by jglick but are not sure whether it will help us. We tried turning on Anonymous for a week and still saw the problem.

Please let me know if there is more information or logs I can provide to help track down the cause of this. Thanks.

I've attached FINEST-level logs for hudson.model.Queue; I'm not sure whether that will help much.
Our Jenkins runs on Red Hat, on Tomcat 9.0.14 with Java 1.8.0_171.

Attachments:
  1. plugins_versions.txt (5 kB)
  2. queue.logs.zip (1.26 MB)
  3. screenshot-1.png (40 kB)


          Konstantin Demenkov added a comment - - edited

I have the same issue on the latest 2.204.1 LTS. It happens quite often (about 10% of jobs) when working with Proxmox agents via the Proxmox cloud plugin and JNLP. I suspect some incompatibility in the timeout/connection logic between the controller and the Proxmox agents, but I really don't know why it happens.


Mihai Stoichitescu added a comment -

We are still being hit by this issue from time to time; any ideas, workarounds, or help with debugging would be appreciated. Thanks.

          Jesse Glick added a comment -

Encountered in https://ci.jenkins.io/job/Tools/job/bom/job/PR-1832/1/threadDump/, a massively parallel build that hit a number of agent retries:

          Thread #1482
          	at DSL.node(node block appears to be neither running nor scheduled)
          	at WorkflowScript.mavenEnv(WorkflowScript:8)
          	at DSL.retry(Native Method)
          	at WorkflowScript.mavenEnv(WorkflowScript:6)
          	at WorkflowScript.run(WorkflowScript:53)
          

Still no clue about the root cause, but to align with JENKINS-49707 it would make sense to detect this anomalous condition after a few minutes and fail the step, letting the build either fail or go into a retry.


Christoph Kulla added a comment -

While doing some performance testing of our Jenkins setup I see jobs stuck in a similar state; the thread dump looks a bit different here:
          Thread #4
          at DSL.node(waiting for part of <job-name> #1 to be scheduled; blocked: Stopping part of <job-name> #1)
          at WorkflowScript.run(WorkflowScript:87)
          at DSL.timestamps(Native Method)
          at WorkflowScript.run(WorkflowScript:85)
          at DSL.timeout(body has another 1 day 23 hr to run)
          at WorkflowScript.run(WorkflowScript:84)
But sometimes (rarely) I also get this message (here the job is aborted after some time; see the linked PR):
          Thread #4
          at DSL.node(node block appears to be neither running nor scheduled)
          at WorkflowScript.run(WorkflowScript:87)
          at DSL.timestamps(Native Method)
          at WorkflowScript.run(WorkflowScript:85)
          at DSL.timeout(body has another 1 day 23 hr to run)
          at WorkflowScript.run(WorkflowScript:84)
           
Any hints on how to proceed with this? Which logs would help to analyze the problem?


          Jesse Glick added a comment - - edited

          org.jenkinsci.plugins.workflow.support.steps.ExecutorStepExecution mainly; FINE logs should suffice. Please make sure you are running the latest version of workflow-durable-task-step as there are frequent logic changes so it would not be helpful to debug an old release. (Versions of Jenkins core or other plugins are unlikely to matter, though it is possible.) Was there any controller restart in the middle of the build, or was this all inside one Jenkins controller session?
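For reference, a minimal script-console sketch to turn on that logger programmatically (assuming the default java.util.logging setup; the built-in "Manage Jenkins » System Log" recorder achieves the same thing and keeps the output in the UI):

    import java.util.logging.ConsoleHandler
    import java.util.logging.Level
    import java.util.logging.Logger

    // Capture FINE messages from the executor step implementation.
    def logger = Logger.getLogger('org.jenkinsci.plugins.workflow.support.steps.ExecutorStepExecution')
    logger.level = Level.FINE
    def handler = new ConsoleHandler()   // or a FileHandler, if a separate file is preferred
    handler.level = Level.FINE
    logger.addHandler(handler)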

          Note that PR-299 attempts to clean up periodically (AnomalousStatus) but you may need to wait up to 30m. Obviously it would be better to avoid getting into the broken state to begin with, which should be possible if someone can figure out how to reproduce from scratch.

          The first virtual thread dump suggests PlaceholderTask.stopping is set, yet there is a Queue.Item for it. This is anomalous because methods which set stopping (there are several possible reasons) also attempt to cancel any associated queue item. Did it fail to do so (Queue.cancel(Task) returns false)? Were there (somehow) multiple items with the same PlaceholderTask and only one got cancelled? One problem point is that Callback.finished does not check the return value of cancel and so neglects to log a warning if it failed, which I can fix: https://github.com/jenkinsci/workflow-durable-task-step-plugin/pull/357


          Christoph Kulla added a comment - - edited

For the first thread dump (waiting for part of <job-name> #1 to be scheduled; blocked: Stopping part of <job-name> #1): I see that this is caused by leftovers in ExecutorStepExecution.RunningTasks from previous runs. My use case is as follows:

          1. Create jobs
          2. Run a sequence of jobs, some of the builds are aborted while others complete. The aborting is intentionally part of the test sequence.
          3. RunningTasks consists of entries for the aborted builds (stopping set to true). I see that these are removed later by the anomalous removal task, but I don't want to wait that long.
          4. Delete all jobs
          5. Repeat 1. and 2.
6. Now we see jobs getting stuck as shown in the first thread dump (I guess because the same job names and build numbers are reused, and therefore the contexts compare equal!?).

          The aborted jobs are stopped immediately after starting them. I see aborts happening while loading pipeline libraries, before the node step is executed. I have to admit the sequence is a bit weird, but it's part of an automated test sequence.

          If I clear ExecutorStepExecution.RunningTasks before repeating, the jobs run as expected again and the node step does not hang.
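For anyone who wants to look at that state themselves, here is a script-console sketch along the lines of what is described above; the exact field name, its type, and whether it is static are assumptions and may differ between workflow-durable-task-step versions, so treat this purely as an illustration:

    import org.jenkinsci.plugins.workflow.support.steps.ExecutorStepExecution

    // Look for Map-typed fields on the class and print their contents.
    ExecutorStepExecution.declaredFields
            .findAll { java.util.Map.isAssignableFrom(it.type) }
            .each { field ->
                field.accessible = true
                def entries = field.get(null)   // assumes a static field
                println "${field.name} -> ${entries}"
                // entries.clear()              // clearing is a last resort, as described above
            }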


          Jesse Glick added a comment -

          Run a sequence of jobs

          Do you really mean a sequence of jobs? Or a sequence of builds of a single job?

          aborts happening while loading pipeline libraries, before the node step is executed

          Are you sure? It seems more likely the abort is processed after the node step starts and schedules a queue item but before an executor is allocated and the body of the node step starts running.

          Does each (successful) build run node exactly once, or is there some serial/parallel repetition involved?

          Hmm, if by

          Create jobs

          Delete all jobs

          you literally mean create & delete WorkflowJob’s then the problem might be that you are reusing build numbers (1, 2, etc.) from now-deleted projects. RunningTasks is keyed by CpsStepContext which in turn encodes both a step counter within the build and a WorkflowRun.Owner which is a pair of project (full) name + build number, so (without an intervening controller restart) I suppose it is possible for an entry to remain that interferes with a recreated project. Normally this would not matter because the entry would be cleaned up when the node step is aborted, and project deletion waits for all running builds to complete. Maybe RunningTasks.remove needs to be called from ExecutorStepExecution.stop in the first clause (Queue.cancel).

I suppose there should also be some safety mechanism to remove any leftover entries when a build completes for any reason. That would probably require https://github.com/jenkinsci/workflow-support-plugin/pull/252 so that CpsStepContext.get(Run.class) can return a value regardless of the condition of the build, even if the Groovy program is irrecoverably broken.
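To make the build-number reuse concrete, a small script-console sketch (the job name and script are hypothetical): after deleting and recreating a project with the same name, the first build is again #1, so a stale entry keyed by full name plus build number can collide with the new project's first build.

    import jenkins.model.Jenkins
    import org.jenkinsci.plugins.workflow.cps.CpsFlowDefinition
    import org.jenkinsci.plugins.workflow.job.WorkflowJob

    def jenkins = Jenkins.get()
    def job = jenkins.createProject(WorkflowJob, 'demo')            // hypothetical name
    job.definition = new CpsFlowDefinition("node { echo 'hi' }", true)
    job.scheduleBuild2(0).get()                                     // runs build #1
    job.delete()

    def recreated = jenkins.createProject(WorkflowJob, 'demo')
    println recreated.nextBuildNumber                               // prints 1 – same key as before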


The problem is caused by aborting a build via run.finish(hudson.model.Result.ABORTED, new java.io.IOException("Aborting build")). If I use run.doStop() instead, no entries are left in RunningTasks. What is the correct way to stop a build programmatically from a Groovy script?

Calling the finish() method gives the following warnings (while doStop() produces no such log entries):
          WARNING: Refusing to build ExecutorStepExecution.PlaceholderTask{label=master,context=CpsStepContext[7:node]:Owner<job-name>/2:<job-name> #2} and going to cancel it, even though it was supposedly stopped already, because associated build is complete
          org.jenkinsci.plugins.workflow.support.steps.ExecutorStepExecution$PlaceholderTask lambda$getCauseOfBlockage$6
          WARNING: Refusing to build ExecutorStepExecution.PlaceholderTask{label=master,context=CpsStepContext[7:node]:Owner<job-name>/2:<job-name> #2} and cancelling it because associated build is complete
          org.jenkinsci.plugins.workflow.support.steps.ExecutorStepExecution$PlaceholderTask lambda$getCauseOfBlockage$6
          WARNING: Refusing to build ExecutorStepExecution.PlaceholderTask{label=master,context=CpsStepContext[7:node]:Owner<job-name>/2:<job-name> #2} because associated build is complete, but failed to cancel it
org.jenkinsci.plugins.workflow.support.steps.ExecutorStepExecution$PlaceholderTask getCauseOfBlockage


          Jesse Glick added a comment -

          You should certainly not call WorkflowRun.finish directly. WorkflowRun.doStop is OK, though more generic and idiomatic would be run.executor.interrupt() (or one of its overloads if desired).
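A minimal script-console sketch of that idiomatic approach (the job name and build number are hypothetical; Pipeline builds run on a flyweight executor, hence the fallback to getOneOffExecutor):

    import hudson.model.Result
    import jenkins.model.Jenkins
    import org.jenkinsci.plugins.workflow.job.WorkflowJob

    def job = Jenkins.get().getItemByFullName('my-folder/my-pipeline', WorkflowJob)  // hypothetical name
    def build = job?.getBuildByNumber(42)                                            // hypothetical number
    def executor = build?.getExecutor() ?: build?.getOneOffExecutor()
    executor?.interrupt(Result.ABORTED)   // equivalent to pressing the stop button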

          I had managed to reproduce some misbehavior at the level of fine logging when cancelling a PlaceholderTask (leading to the owning build being aborted) and then deleting and recreating a project, but not the reported blocked: Stopping part of … bug. It may be possible to reproduce the reported bug using for example WorkflowRun.doKill, which is not exactly realistic (as this option would not be offered until two previous attempts at terminating the build cleanly had failed) but would be useful for exercising fallback logic.


Assignee: Jesse Glick (jglick)
Reporter: Mihai Stoichitescu (stoiky)
Votes: 1
Watchers: 6