I doubt this one is reproducible. 

      Going to yourjiraurlhere/computer/api/json?pretty=true&tree=computer[oneOffExecutors[likelyStuck,currentExecutable[result,url]]]{0}

      gives you the jobs currently running.

      One of my jobs is marked as "likely stuck", but its result is "SUCCESS" (and has been "SUCCESS" for 2h30), which makes me doubt the veracity of the "likely stuck" label. 

      The job isn't running either. It's completed, but is still somehow showing as "likely stuck". 
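
      For reference, a minimal script-console sketch that pulls the same information as the REST query above (assuming the standard hudson.model APIs; run from Manage Jenkins » Script Console):

          import hudson.model.Run
          import jenkins.model.Jenkins

          // Print each one-off executor's build, whether it is still building,
          // its result, and the likelyStuck flag reported by the executor.
          Jenkins.instance.computers.each { c ->
            c.oneOffExecutors.each { e ->
              def item = e.currentExecutable
              if (item instanceof Run) {
                println "${c.name}: ${item} building=${item.isBuilding()} result=${item.result} likelyStuck=${e.isLikelyStuck()}"
              }
            }
          }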

          [JENKINS-45571] "likely stuck" job is not actually stuck.

          Daniel Beck added a comment -

          Builds are successful until they're not, even while running; a SUCCESS result doesn't mean the build finished. In fact, while an executor is in use, its build is very likely not finished.

          Daniel Beck added a comment -

          From another source I got a few logs of something that looks a lot like this – could you check whether the listed builds appear in the Jenkins system log as having failed to resume after restart, or similar?

          Devin Nusbaum added a comment -

          The best theory we have is that something is causing the OneOffExecutors to not be cleaned up correctly, and that it might be related to resuming pipelines at startup. zeal_iskander Are you still seeing this issue? If so, what versions of the workflow-cps and workflow-job plugins do you have installed? Do you see any log messages about the builds that completed but are showing as likely stuck?

          Stark Gabriel added a comment - edited

          I wouldn't know; I don't work at that company anymore. Sorry!

          Devin Nusbaum added a comment -

          zeal_iskander No problem, thanks for replying!

          Sam Van Oort added a comment -

          dnusbaum danielbeck IIUC what is being described, this actually maps to a really obnoxious bug I've been investigating for several weeks that has been blocking release of a fix/improvement to persistence (with threading implications due to the use of synchronization during I/O).

          It seems to be on the Pipeline end itself, though – it relates to how the Pipeline job interacts with the OneOffExecutor created when it throws an AsynchronousExecution upon running. The Pipeline may even get marked as completed, but somehow the listener that terminates the Pipeline is not invoked. I'm simplifying grossly here, of course; in reality there is a very complex asynchronous chain of events, with a complex threading model underlying all of it.

          The behavior can be traced to the upon-resume situation if state was incompletely persisted AFAICT, but it requires somewhat precise timing to trigger the events.

          Some situations that cause this behavior have probably been solved by prior fixes, but clearly not all of them.

          Anna Tikhonova added a comment -

          I'm seeing this issue as well. Lots of executors listed in /computer/api/json?pretty=true&tree=computer[oneOffExecutors[likelyStuck,currentExecutable[building,result,url]]]{0} are in the following state:

          {
            "currentExecutable" : {
              "_class" : "org.jenkinsci.plugins.workflow.job.WorkflowRun",
              "building" : false,
              "result" : "SUCCESS",
              "url" : url
            },
            "likelyStuck" : true
          }

          However, in my case it doesn't seem to be related to resuming pipelines at Jenkins startup. I have written a script to clean up such executors. I haven't restarted Jenkins since the script ran, and I still see new executors like those.
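
          (Anna's cleanup script is not attached to the issue; a minimal script-console sketch of that kind of cleanup, assuming it is acceptable to interrupt the leaked executors, might look like this:)

              import hudson.model.Run
              import jenkins.model.Jenkins

              // Interrupt one-off executors whose build has already completed
              // (building == false, result set) but whose slot was never freed.
              Jenkins.instance.computers.each { c ->
                c.oneOffExecutors.each { e ->
                  def run = e.currentExecutable
                  if (run instanceof Run && !run.isBuilding() && run.result != null) {
                    println "Interrupting leaked one-off executor on ${c.name} for ${run}"
                    e.interrupt()
                  }
                }
              }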

          Anna Tikhonova added a comment -

          Why this bug could be of more interest is that it interferes with Throttle Concurrent Builds (TCB) plugin scheduling. TCB prevents scheduling more builds because it counts those hanging executors as running builds. Once there are more hanging executors than the maximum total concurrent builds configured for a job (N), the job is stuck forever ("pending—Already running N builds across all nodes").
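
          (To make the mismatch concrete, a hedged script-console sketch comparing the two counts: occupied one-off executor slots, which is what the throttling sees, versus runs that are actually building:)

              import hudson.model.Run
              import jenkins.model.Jenkins

              def computers = Jenkins.instance.computers
              // What slot counting sees: every one-off executor, leaked or not.
              int slots = computers.sum { it.oneOffExecutors.size() } ?: 0
              // What the builds actually report: only runs still building.
              int building = computers.sum { c ->
                c.oneOffExecutors.count { e ->
                  e.currentExecutable instanceof Run && e.currentExecutable.isBuilding()
                }
              } ?: 0
              println "one-off slots=${slots} actually building=${building}"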

          Devin Nusbaum added a comment -

          atikhonova The fact that you are seeing the issue without restarting Jenkins is very interesting. Do you have a pipeline which is able to reproduce the problem consistently?

          Sam Van Oort added a comment -

          Note from investigation: separate from JENKINS-50199, there appears to be a different but related failure mode:

          1. The symptoms described by Anna will be reproduced if the build completes (WorkflowRun#finish is called) but the copyLogsTask never gets invoked or fails, since that task is what actually removes the FlyweightTask and kills the OneOffExecutor. See the CopyLogsTask logic - https://github.com/jenkinsci/workflow-job-plugin/blob/master/src/main/java/org/jenkinsci/plugins/workflow/job/WorkflowRun.java#L403
          2. If the AsynchronousExecution is never completed, we'll see a "likelyStuck" executor for each OneOffExecutor (see the sketch below).

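          For context, a grossly simplified sketch of the AsynchronousExecution contract involved here (illustrative only, not the actual WorkflowRun logic):

              import jenkins.model.queue.AsynchronousExecution

              // An executable's run() can throw an AsynchronousExecution to free
              // its thread while keeping the executor slot occupied. The slot is
              // only released when completed(...) is called on the execution.
              def execution = new AsynchronousExecution() {
                void interrupt(boolean forShutdown) { /* ask the build to stop */ }
                boolean blocksRestart() { return false }
                boolean displaysCell() { return true }
              }
              // ... run() throws `execution`; the OneOffExecutor now waits ...
              // When the build truly finishes, something must call:
              execution.completed(null) // if this call is lost, the executor leaks as "likelyStuck"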

          Anna Tikhonova added a comment - edited

          dnusbaum Unfortunately, I don't. I've got a few 1000+ LOC pipelines running continuously, and I do not know how to tell which one leaves executors behind, or when.

          A Pipeline build that has such a "likelyStuck" executor looks completed on its build page (no progress bars, and the build status is set), but I can still see a matching OneOffExecutor on master:

                "_class" : "hudson.model.Hudson$MasterComputer",
                "oneOffExecutors" : [
                  {
                    "currentExecutable" : {
                      "_class" : "org.jenkinsci.plugins.workflow.job.WorkflowRun",
                      "building" : false,    // always false for these lost executors
                      "result" : "SUCCESS",    // always set to some valid build status != null
                      "url" : "JENKINS/job/PIPELINE/BUILD_NUMBER/"
                    },
                    "likelyStuck" : false    // can be true or false
                  }, ...
          

          Devin Nusbaum added a comment - edited

          atikhonova Are you able to upload the build directory of the build matching the stuck executor? Specifically, it would be helpful to see build.xml and the xml file(s) in the workflow directory. EDIT: I see now that you can't easily tell which are stuck and which are good. If you can find an executor with likelyStuck: true whose build looks like it has otherwise completed or is stuck, that would be a great candidate.

          Another note: JENKINS-38381 will change the control flow here significantly.

          Jesse Glick added a comment -

          gives you the jobs currently running

          This is not really an appropriate API query to use for that question. If your interest is limited to all Pipeline builds, FlowExecutionList is likely to be more useful. If you are looking at builds of a particular job (Pipeline or not), I think that information is available from the endpoint for that job.
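
          A minimal sketch of that suggestion (script console; FlowExecutionList is provided by the workflow-api plugin):

              import org.jenkinsci.plugins.workflow.flow.FlowExecutionList

              // FlowExecutionList tracks the flow executions of running Pipeline
              // builds, so iterating it enumerates exactly the running Pipelines.
              FlowExecutionList.get().each { execution ->
                println "Running Pipeline build: ${execution.owner.executable}"
              }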

          Jesse Glick added a comment -

          TCB prevents scheduling more builds because it counts those hanging executors as running builds.

          Offhand this sounds like a flaw in TCB. This PR introduced that behavior, purportedly to support the build-flow plugin (a conceptual predecessor of Pipeline née Workflow). If TCB intends to throttle builds per se (rather than work done by those builds—typically node blocks for Pipeline), then there are more direct ways of doing this than counting Executor slots.

          Basil Crow added a comment -

          Offhand this sounds like a flaw in TCB.

          I am attempting to fix this flaw in jenkinsci/throttle-concurrent-builds-plugin#57.

          Basil Crow added a comment -

          I am attempting to fix this flaw in jenkinsci/throttle-concurrent-builds-plugin#57.

          This PR has been merged, and the master branch of Throttle Concurrent Builds now uses FlowExecutionList to calculate the number of running Pipeline jobs, which should work around the issue described in this bug. I have yet to release a new version of Throttle Concurrent Builds with this fix, but there is an incremental build available here. atikhonova, are you interested in testing this incremental build before I do an official release?
