Jenkins / JENKINS-60507

Pipeline stuck when allocating machine | node block appears to be neither running nor scheduled

Our build system sometimes shows the following in the thread dump of a Pipeline while it is waiting for free executors:

      Thread #94
      at DSL.node(node block appears to be neither running nor scheduled)
      at WorkflowScript.runOnNode(WorkflowScript:1798)
      at DSL.timeout(body has another 3 hr 14 min to run)
      at WorkflowScript.runOnNode(WorkflowScript:1783)
      at DSL.retry(Native Method)
      at WorkflowScript.runOnNode(WorkflowScript:1781)
      at WorkflowScript.getClosure(WorkflowScript:1901)
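
      For context, the call structure implied by this trace (retry at WorkflowScript:1781 wrapping timeout at :1783 wrapping node at :1798) would look roughly like the sketch below; the retry count, timeout value, and label are hypothetical, not taken from our actual script:

      {code:groovy}
      // Hypothetical reconstruction of runOnNode, matching the nesting in the thread dump:
      // retry (DSL.retry) -> timeout (DSL.timeout) -> node (DSL.node, stuck waiting for an executor).
      def runOnNode(String label, Closure body) {
          retry(3) {
              timeout(time: 4, unit: 'HOURS') {
                  node(label) {
                      body()
                  }
              }
          }
      }
      {code}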

       
      In Blue Ocean the following message appears, but the build queue is empty and executors with those labels are available:

      Still waiting to schedule task
      Waiting for next available executor on pr&&prod&&mac&&build

       

      The job can only be completed by aborting it or by waiting for the timeout step to do its work.

      We first observed this on v2.121.3 (workflow-durable-task-step v2.19), but we recently updated to v2.190.1 (workflow-durable-task-step v2.28) and are still seeing pipelines stuck waiting for executors.

      The only reference I could find was in the last comment of JENKINS-42556 (https://issues.jenkins-ci.org/browse/JENKINS-42556), and there is no way we can reproduce it. We noticed the fix made by jglick, but we are not sure whether it will help us. We also tried turning on Anonymous access for a week and still saw the problem.

      Please let me know if there is any more information or logs I can provide to help track down the cause of this. Thanks.

      I've attached FINEST-level logs for hudson.model.Queue; I'm not sure how much they will help.
      Our Jenkins runs on Red Hat, under Tomcat 9.0.14 with Java 1.8.0_171.
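
      For anyone who wants to capture the same data, a FINEST logger for hudson.model.Queue can be set up as a log recorder under Manage Jenkins » System Log, or ad hoc from the Script Console with plain java.util.logging; a minimal sketch:

      {code:groovy}
      // Script Console sketch: raise hudson.model.Queue logging to FINEST
      // and mirror it to the console so it shows up in the Jenkins process output.
      import java.util.logging.ConsoleHandler
      import java.util.logging.Level
      import java.util.logging.Logger

      def queueLogger = Logger.getLogger('hudson.model.Queue')
      queueLogger.level = Level.FINEST

      def handler = new ConsoleHandler()
      handler.level = Level.FINEST
      queueLogger.addHandler(handler)
      {code}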

        Attachments:
        1. screenshot-1.png (40 kB)
        2. plugins_versions.txt (5 kB)
        3. queue.logs.zip (1.26 MB)


          Mihai Stoichitescu created issue -
          Mihai Stoichitescu made changes -
          Description: edited (formatting only; content unchanged)
          Mihai Stoichitescu made changes -
          Attachment New: screenshot-1.png [ 49785 ]

          Konstantin Demenkov added a comment - edited

          I have the same issue on the latest 2.204.1 LTS. It happens fairly often (about 10% of jobs) when working with Proxmox slaves via the Proxmox cloud plugin and JNLP. I suspect some incompatibility in the timeout/connection logic between the master and the Proxmox slaves, but I really don't know why it happens.

          Konstantin Demenkov made changes -
          Priority Original: Minor [ 4 ] New: Major [ 3 ]

          Mihai Stoichitescu added a comment -

          We are still being hit by this issue from time to time; any ideas, workarounds, or help with debugging would be appreciated. Thanks.
          Jesse Glick made changes -
          Link New: This issue relates to JENKINS-42556 [ JENKINS-42556 ]
          Jesse Glick made changes -
          Component/s Original: core [ 15593 ]

          Jesse Glick added a comment -

          Encountered in https://ci.jenkins.io/job/Tools/job/bom/job/PR-1832/1/threadDump/, a massively parallel build that hit a bunch of agent retries:

          Thread #1482
          	at DSL.node(node block appears to be neither running nor scheduled)
          	at WorkflowScript.mavenEnv(WorkflowScript:8)
          	at DSL.retry(Native Method)
          	at WorkflowScript.mavenEnv(WorkflowScript:6)
          	at WorkflowScript.run(WorkflowScript:53)
          

          Still no clue about the root cause, but to align with JENKINS-49707 it would make sense to detect this anomalous condition after a few minutes and fail the step, letting the build either fail or go into a retry.
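
          Until such a detection exists, the closest user-side approximation is roughly what runOnNode above already does, only with a tighter per-attempt bound: wrap the node block in a short timeout and retry on failure, so a stuck allocation gets abandoned and re-queued instead of hanging for hours. A sketch (the times, retry count, and label here are illustrative, and the timeout also limits how long the body may run):

          {code:groovy}
          // Hypothetical interim workaround: bound each allocation attempt and retry it.
          // Note that the timeout covers the whole node block, body included.
          retry(3) {
              timeout(time: 30, unit: 'MINUTES') {
                  node('pr&&prod&&mac&&build') {
                      // ... build steps ...
                  }
              }
          }
          {code}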

          Jesse Glick made changes -
          Link New: This issue relates to JENKINS-49707 [ JENKINS-49707 ]

            Assignee: Jesse Glick (jglick)
            Reporter: Mihai Stoichitescu (stoiky)
            Votes: 1
            Watchers: 6
