Jenkins / JENKINS-60507

Pipeline stuck when allocating machine | node block appears to be neither running nor scheduled

Our build system sometimes shows the following in the thread dump of a Pipeline while it is waiting for free executors:

      Thread #94
      at DSL.node(node block appears to be neither running nor scheduled)
      at WorkflowScript.runOnNode(WorkflowScript:1798)
      at DSL.timeout(body has another 3 hr 14 min to run)
      at WorkflowScript.runOnNode(WorkflowScript:1783)
      at DSL.retry(Native Method)
      at WorkflowScript.runOnNode(WorkflowScript:1781)
      at WorkflowScript.getClosure(WorkflowScript:1901)
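
      For context, the call structure implied by this trace (retry at WorkflowScript:1781 wrapping timeout at :1783 wrapping node at :1798) would look roughly like the sketch below; the retry count, timeout value, and label are hypothetical, not taken from our actual script:

      {code:groovy}
      // Hypothetical reconstruction of runOnNode, matching the nesting in the thread dump:
      // retry (DSL.retry) -> timeout (DSL.timeout) -> node (DSL.node, stuck waiting for an executor).
      def runOnNode(String label, Closure body) {
          retry(3) {
              timeout(time: 4, unit: 'HOURS') {
                  node(label) {
                      body()
                  }
              }
          }
      }
      {code}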

       
      In Blue Ocean the following message appears, but the build queue is empty and executors with those labels are available:

      Still waiting to schedule task
      Waiting for next available executor on pr&&prod&&mac&&build

       

      The job can only be completed by aborting it or by waiting for the timeout step to do its work.

      We first observed this on v2.121.3 (workflow-durable-task-step v2.19), but we recently updated to v2.190.1 (workflow-durable-task-step v2.28) and are still seeing pipelines stuck waiting for executors.

      The only reference I could find was in the last comment of JENKINS-42556 (https://issues.jenkins-ci.org/browse/JENKINS-42556), and there is no way we can reproduce it. We noticed the fix made by jglick, but we are not sure whether it will help us. We also tried turning on Anonymous access for a week and still saw the problem.

      Please let me know if there is any more information or logs I can provide to help track down the cause of this. Thanks.

      I've attached FINEST-level logs for hudson.model.Queue; I'm not sure how much they will help.
      Our Jenkins runs on Red Hat, under Tomcat 9.0.14 with Java 1.8.0_171.
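
      For anyone who wants to capture the same data, a FINEST logger for hudson.model.Queue can be set up as a log recorder under Manage Jenkins » System Log, or ad hoc from the Script Console with plain java.util.logging; a minimal sketch:

      {code:groovy}
      // Script Console sketch: raise hudson.model.Queue logging to FINEST
      // and mirror it to the console so it shows up in the Jenkins process output.
      import java.util.logging.ConsoleHandler
      import java.util.logging.Level
      import java.util.logging.Logger

      def queueLogger = Logger.getLogger('hudson.model.Queue')
      queueLogger.level = Level.FINEST

      def handler = new ConsoleHandler()
      handler.level = Level.FINEST
      queueLogger.addHandler(handler)
      {code}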

        Attachments:
        1. screenshot-1.png (40 kB)
        2. plugins_versions.txt (5 kB)
        3. queue.logs.zip (1.26 MB)


          Mihai Stoichitescu created issue -
          Mihai Stoichitescu made changes -
          Description: edited (formatting only; content unchanged)
          Mihai Stoichitescu made changes -
          Attachment New: screenshot-1.png [ 49785 ]

          Konstantin Demenkov added a comment - edited

          I have the same issue on the latest 2.204.1 LTS. It happens fairly often (about 10% of jobs) when working with Proxmox slaves via the Proxmox cloud plugin and JNLP. I suspect some incompatibility in the timeout/connection logic between the master and the Proxmox slaves, but I really don't know why it happens.

          Konstantin Demenkov made changes -
          Priority Original: Minor [ 4 ] New: Major [ 3 ]

          Mihai Stoichitescu added a comment -

          We are still being hit by this issue from time to time; any ideas, workarounds, or help with debugging would be appreciated. Thanks.
          Jesse Glick made changes -
          Link New: This issue relates to JENKINS-42556 [ JENKINS-42556 ]
          Jesse Glick made changes -
          Component/s Original: core [ 15593 ]

          Jesse Glick added a comment -

          Encountered in https://ci.jenkins.io/job/Tools/job/bom/job/PR-1832/1/threadDump/, a massively parallel build that hit a bunch of agent retries:

          Thread #1482
          	at DSL.node(node block appears to be neither running nor scheduled)
          	at WorkflowScript.mavenEnv(WorkflowScript:8)
          	at DSL.retry(Native Method)
          	at WorkflowScript.mavenEnv(WorkflowScript:6)
          	at WorkflowScript.run(WorkflowScript:53)
          

          Still no clue about the root cause, but to align with JENKINS-49707 it would make sense to detect this anomalous condition after a few minutes and fail the step, letting the build either fail or go into a retry.
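
          Until such a detection exists, the closest user-side approximation is roughly what runOnNode above already does, only with a tighter per-attempt bound: wrap the node block in a short timeout and retry on failure, so a stuck allocation gets abandoned and re-queued instead of hanging for hours. A sketch (the times, retry count, and label here are illustrative, and the timeout also limits how long the body may run):

          {code:groovy}
          // Hypothetical interim workaround: bound each allocation attempt and retry it.
          // Note that the timeout covers the whole node block, body included.
          retry(3) {
              timeout(time: 30, unit: 'MINUTES') {
                  node('pr&&prod&&mac&&build') {
                      // ... build steps ...
                  }
              }
          }
          {code}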

          Jesse Glick made changes -
          Link New: This issue relates to JENKINS-49707 [ JENKINS-49707 ]

            Assignee: Jesse Glick (jglick)
            Reporter: Mihai Stoichitescu (stoiky)
            Votes: 1
            Watchers: 6
