• workflow-api 1108.v57edf648f5d4 and workflow-durable-task-step 1107.v5dab75aaccbd

      Following the update to the latest LTS version, my Jenkins instance would hang during startup, and the process became unresponsive: neither systemctl stop nor a plain kill would terminate it. The logs contained an error message about a thread deadlock (see below). In case it is relevant, a job was in progress and got suspended when the controller was stopped for the upgrade.

      I tried restarting several times, but the same thing happened each time. I then tried downgrading the jenkins package to the previous version, but that hit the same error. Restoring from a snapshot allowed me to return to the previous version.

       

      The following error would appear in the logs:

      WARNING j.m.api.Metrics$HealthChecker#execute: Some health checks are reporting as unhealthy: [thread-deadlock : [AtmostOneTaskExecutor[Periodic Jenkins queue maintenance] [#26] locked on hudson.model.RunMap@166af3a7 (owned by CpsStepContext.isReady [#2]):
      	at jenkins.model.lazy.AbstractLazyLoadRunMap.getByNumber(AbstractLazyLoadRunMap.java:376)
      	at jenkins.model.lazy.LazyBuildMixIn.getBuildByNumber(LazyBuildMixIn.java:228)
      	at org.jenkinsci.plugins.workflow.job.WorkflowJob.getBuildByNumber(WorkflowJob.java:233)
      	at org.jenkinsci.plugins.workflow.job.WorkflowJob.getBuildByNumber(WorkflowJob.java:104)
      	at jenkins.model.PeepholePermalink.resolve(PeepholePermalink.java:103)
      	at hudson.model.Job.getLastSuccessfulBuild(Job.java:947)
      	at hudson.model.Job.getEstimatedDurationCandidates(Job.java:1019)
      	at hudson.model.Job.getEstimatedDuration(Job.java:1053)
      	at hudson.model.Run.getEstimatedDuration(Run.java:2496)
      	at org.jenkinsci.plugins.workflow.support.steps.ExecutorStepExecution$PlaceholderTask.getEstimatedDuration(ExecutorStepExecution.java:696)
      	at hudson.model.queue.MappingWorksheet.<init>(MappingWorksheet.java:327)
      	at hudson.model.queue.MappingWorksheet.<init>(MappingWorksheet.java:312)
      	at hudson.model.Queue.maintain(Queue.java:1645)
      	at hudson.model.Queue$1.call(Queue.java:325)
      	at hudson.model.Queue$1.call(Queue.java:322)
      	at jenkins.util.AtmostOneTaskExecutor$1.call(AtmostOneTaskExecutor.java:107)
      	at jenkins.util.AtmostOneTaskExecutor$1.call(AtmostOneTaskExecutor.java:97)
      	at jenkins.security.ImpersonatingExecutorService$2.call(ImpersonatingExecutorService.java:80)
      	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
      	at hudson.remoting.AtmostOneThreadExecutor$Worker.run(AtmostOneThreadExecutor.java:121)
      	at java.lang.Thread.run(Thread.java:748)
      , CpsStepContext.isReady [#2] locked on java.util.concurrent.locks.ReentrantLock$NonfairSync@18965682 (owned by AtmostOneTaskExecutor[Periodic Jenkins queue maintenance] [#26]):
      	at sun.misc.Unsafe.park(Native Method)
      	at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
      	at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
      	at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:870)
      	at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1199)
      	at java.util.concurrent.locks.ReentrantLock$NonfairSync.lock(ReentrantLock.java:209)
      	at java.util.concurrent.locks.ReentrantLock.lock(ReentrantLock.java:285)
      	at hudson.model.Queue.schedule2(Queue.java:567)
      	at hudson.model.Queue.schedule2(Queue.java:693)
      	at org.jenkinsci.plugins.workflow.support.steps.ExecutorStepExecution.start(ExecutorStepExecution.java:104)
      	at org.jenkinsci.plugins.workflow.support.steps.ExecutorStepExecution.onResume(ExecutorStepExecution.java:210)
      	at org.jenkinsci.plugins.workflow.flow.FlowExecutionList$ResumeStepExecutionListener$1.onSuccess(FlowExecutionList.java:265)
      	at org.jenkinsci.plugins.workflow.flow.FlowExecutionList$ResumeStepExecutionListener$1.onSuccess(FlowExecutionList.java:243)
      	at com.google.common.util.concurrent.Futures$6.run(Futures.java:975)
      	at org.jenkinsci.plugins.workflow.flow.DirectExecutor.execute(DirectExecutor.java:33)
      	at com.google.common.util.concurrent.ExecutionList$RunnableExecutorPair.execute(ExecutionList.java:149)
      	at com.google.common.util.concurrent.ExecutionList.add(ExecutionList.java:105)
      	at com.google.common.util.concurrent.AbstractFuture.addListener(AbstractFuture.java:155)
      	at com.google.common.util.concurrent.Futures.addCallback(Futures.java:985)
      	at org.jenkinsci.plugins.workflow.flow.FlowExecutionList$ResumeStepExecutionListener.onResumed(FlowExecutionList.java:243)
      	at org.jenkinsci.plugins.workflow.flow.FlowExecutionListener.fireResumed(FlowExecutionListener.java:84)
      	at org.jenkinsci.plugins.workflow.job.WorkflowRun.onLoad(WorkflowRun.java:567)
      	at hudson.model.RunMap.retrieve(RunMap.java:231)
      	at hudson.model.RunMap.retrieve(RunMap.java:58)
      	at jenkins.model.lazy.AbstractLazyLoadRunMap.load(AbstractLazyLoadRunMap.java:506)
      	at jenkins.model.lazy.AbstractLazyLoadRunMap.load(AbstractLazyLoadRunMap.java:488)
      	at jenkins.model.lazy.AbstractLazyLoadRunMap.getByNumber(AbstractLazyLoadRunMap.java:386)
      	at hudson.model.RunMap.getById(RunMap.java:211)
      	at org.jenkinsci.plugins.workflow.job.WorkflowRun$Owner.run(WorkflowRun.java:948)
      	at org.jenkinsci.plugins.workflow.job.WorkflowRun$Owner.get(WorkflowRun.java:959)
      	at org.jenkinsci.plugins.workflow.cps.CpsStepContext.getExecution(CpsStepContext.java:217)
      	at org.jenkinsci.plugins.workflow.cps.CpsStepContext.getThreadGroupSynchronously(CpsStepContext.java:242)
      	at org.jenkinsci.plugins.workflow.cps.CpsStepContext.access$000(CpsStepContext.java:97)
      	at org.jenkinsci.plugins.workflow.cps.CpsStepContext$1.call(CpsStepContext.java:263)
      	at org.jenkinsci.plugins.workflow.cps.CpsStepContext$1.call(CpsStepContext.java:261)
      	at jenkins.util.ContextResettingExecutorService$2.call(ContextResettingExecutorService.java:46)
      	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
      	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
      	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
      	at java.lang.Thread.run(Thread.java:748)
      , Splunk data monitor thread locked on hudson.model.RunMap@166af3a7 (owned by CpsStepContext.isReady [#2]):
      	at jenkins.model.lazy.AbstractLazyLoadRunMap.getByNumber(AbstractLazyLoadRunMap.java:376)
      	at jenkins.model.lazy.LazyBuildMixIn.getBuildByNumber(LazyBuildMixIn.java:228)
      	at org.jenkinsci.plugins.workflow.job.WorkflowJob.getBuildByNumber(WorkflowJob.java:233)
      	at org.jenkinsci.plugins.workflow.job.WorkflowJob.getBuildByNumber(WorkflowJob.java:104)
      	at hudson.model.Run.fromExternalizableId(Run.java:2483)
      	at org.jenkinsci.plugins.workflow.support.steps.ExecutorStepExecution$PlaceholderTask.runForDisplay(ExecutorStepExecution.java:527)
      	at org.jenkinsci.plugins.workflow.support.steps.ExecutorStepExecution$PlaceholderTask.getUrl(ExecutorStepExecution.java:536)
      	at com.splunk.splunkjenkins.HealthMonitor.sendPendingQueue(HealthMonitor.java:110)
      	at com.splunk.splunkjenkins.HealthMonitor.execute(HealthMonitor.java:44)
      	at hudson.model.AsyncPeriodicWork.lambda$doRun$0(AsyncPeriodicWork.java:101)
      	at hudson.model.AsyncPeriodicWork$$Lambda$545/292627145.run(Unknown Source)
      	at java.lang.Thread.run(Thread.java:748)
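
Reading the trace: CpsStepContext.isReady [#2] holds the hudson.model.RunMap monitor and waits for the Queue's ReentrantLock, while the queue maintenance thread holds that ReentrantLock and waits for the same RunMap monitor. The following is a minimal JDK-only sketch of that lock-order inversion, not Jenkins code. The class name, `runMapMonitor`, `queueLock`, and the thread names are hypothetical stand-ins. It asks `ThreadMXBean.findDeadlockedThreads()` to report the cycle, then interrupts one thread so that the demo can exit.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.locks.ReentrantLock;

public class LockOrderInversionDemo {

    /** Provokes the inversion and returns how many threads the JVM reports as deadlocked. */
    public static int provokeAndCount() {
        // Stand-ins (assumptions): a plain monitor for RunMap, a ReentrantLock for Queue.
        final Object runMapMonitor = new Object();
        final ReentrantLock queueLock = new ReentrantLock();
        final CountDownLatch bothHoldFirstLock = new CountDownLatch(2);

        // Takes the monitor first, then wants the queue lock (like the resuming step).
        Thread resumer = new Thread(() -> {
            synchronized (runMapMonitor) {
                bothHoldFirstLock.countDown();
                awaitQuietly(bothHoldFirstLock);
                try {
                    queueLock.lockInterruptibly();
                    queueLock.unlock();
                } catch (InterruptedException e) {
                    // Interrupted by the caller to break the cycle; fall through and release the monitor.
                }
            }
        }, "resumer");

        // Takes the queue lock first, then wants the monitor (like queue maintenance).
        Thread maintainer = new Thread(() -> {
            queueLock.lock();
            try {
                bothHoldFirstLock.countDown();
                awaitQuietly(bothHoldFirstLock);
                synchronized (runMapMonitor) {
                    // Reached only after the resumer gives up the monitor.
                }
            } finally {
                queueLock.unlock();
            }
        }, "queue-maintainer");

        resumer.start();
        maintainer.start();

        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        long[] deadlocked = null;
        try {
            // Poll for up to about five seconds until the JVM sees the cycle.
            for (int i = 0; i < 100 && deadlocked == null; i++) {
                Thread.sleep(50);
                deadlocked = mx.findDeadlockedThreads();
            }
            resumer.interrupt(); // break the cycle so both threads can finish
            resumer.join();
            maintainer.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return deadlocked == null ? 0 : deadlocked.length;
    }

    private static void awaitQuietly(CountDownLatch latch) {
        try {
            latch.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // keep the flag so lockInterruptibly() fails fast
        }
    }

    public static void main(String[] args) {
        System.out.println("Deadlocked threads reported: " + provokeAndCount());
    }
}
```

The sketch can only recover because one side uses `lockInterruptibly()`. In the trace above the equivalent thread is parked in a plain `ReentrantLock.lock()`, which does not respond to interruption.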
      

          [JENKINS-67351] thread deadlock after update to 2.319.1

          Devin Nusbaum added a comment -

          https://github.com/jenkinsci/workflow-durable-task-step-plugin/pull/185 maybe also contributing. What version of this plugin is reporter running?

          Maybe. It is definitely relevant for https://groups.google.com/g/jenkinsci-users/c/nHRWJrjqi74. I think that https://github.com/jenkinsci/workflow-durable-task-step-plugin/pull/185#discussion_r762246581 is wrong because Queue.maintain may run on a Timer thread before FlowExecutionListener$ItemListenerImpl.onLoaded is executed.

          Jesse Glick added a comment -

          a thread named "pool-21-thread-1" created by the Datadog plugin

          Should at a minimum use https://javadoc.jenkins.io/hudson/util/NamingThreadFactory.html.

          Maybe PR-185 needs to be disabled until after https://javadoc.jenkins.io/hudson/init/InitMilestone.html#COMPLETED?
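
As a rough illustration of the thread-naming point above: a pool built with the default factory produces names such as "pool-21-thread-1", which say nothing about the owner in a thread dump. The linked hudson.util.NamingThreadFactory is the Jenkins class intended for this. The sketch below deliberately does not use it and relies only on the JDK, so the `NamedThreads` helper and its name format are assumptions made for the example.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.ThreadFactory;
import java.util.concurrent.atomic.AtomicInteger;

public class NamedThreads {

    /** Wraps the default factory so every new thread carries a descriptive name. */
    public static ThreadFactory named(String prefix) {
        ThreadFactory delegate = Executors.defaultThreadFactory();
        AtomicInteger counter = new AtomicInteger();
        return runnable -> {
            Thread t = delegate.newThread(runnable);
            // Hypothetical format; any stable, recognizable name helps when reading thread dumps.
            t.setName(prefix + " [#" + counter.incrementAndGet() + "]");
            return t;
        };
    }

    public static void main(String[] args) {
        ExecutorService pool = Executors.newSingleThreadExecutor(named("Example monitor"));
        pool.submit(() -> System.out.println(Thread.currentThread().getName()));
        pool.shutdown();
    }
}
```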

          Devin Nusbaum added a comment -

          https://github.com/jenkinsci/workflow-cps-plugin/blob/fc3007f2488397da3632908fcd7a2b1b11f79e7f/src/main/java/org/jenkinsci/plugins/workflow/cps/CpsStepContext.java#L255-L273 FYI

          Yeah, I read through https://github.com/jenkinsci/pipeline-plugin/pull/68 and JENKINS-25890, but I am not sure that the same issue applies here, since the Run and FlowExecution have already been loaded, and so just returning from ResumeStepExecutionListener.onResumed will be enough to release the RunMap lock. It would be simple enough to introduce a new thread pool just in case, though.

          Maybe PR-185 needs to be disabled until after https://javadoc.jenkins.io/hudson/init/InitMilestone.html#COMPLETED?

          Maybe, although if I am reading things correctly, InitMilestone.COMPLETED is reached before we iterate through ItemListener implementations (see here), so I'm not sure it would be enough. Maybe we could introduce some new API like FlowExecutionList.haveExecutionsResumed that returns false until FlowExecutionList$ItemListenerImpl.onLoaded has completed and use that to disable PR 185.
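
A minimal sketch of the kind of gate proposed above, using only the JDK. Only the method name `haveExecutionsResumed` comes from this comment. The `ResumeGate` class, `markResumed`, `estimatedDuration`, and the -1 fallback for "unknown" are assumptions for illustration and are not taken from the plugins.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.LongSupplier;

public class ResumeGate {

    private final AtomicBoolean resumed = new AtomicBoolean(false);

    /** Returns false until the resume pass has completed. */
    public boolean haveExecutionsResumed() {
        return resumed.get();
    }

    /** Hypothetical hook: call once, at the very end of the resume pass. */
    public void markResumed() {
        resumed.set(true);
    }

    /**
     * Guards a lookup that may need to load builds. Before the gate opens, it
     * returns -1 without invoking the lookup at all, so no build loading is
     * triggered from this path during startup.
     */
    public long estimatedDuration(LongSupplier lookupThatMayLoadBuilds) {
        if (!haveExecutionsResumed()) {
            return -1L;
        }
        return lookupThatMayLoadBuilds.getAsLong();
    }

    public static void main(String[] args) {
        ResumeGate gate = new ResumeGate();
        System.out.println(gate.estimatedDuration(() -> 42L)); // gate closed
        gate.markResumed();
        System.out.println(gate.estimatedDuration(() -> 42L)); // gate open
    }
}
```

The point of the design is that the caller never touches the expensive path while the gate is closed. A check that runs after the lookup has started would come too late to avoid lock contention.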

          Jesse Glick added a comment -

          Maybe use @Initializer with its more predictable timeline than ItemListener.onLoaded?

          James Robson added a comment -

          https://github.com/jenkinsci/workflow-durable-task-step-plugin/pull/185 maybe also contributing. What version of this plugin is reporter running?

          I updated all plugins before the update, so it should have been the latest version published as of December 9th, which I believe would include the fix you linked to.

          do you have the full Jenkins log from when the error happened?

          I've attached the log file that shows what happened during one attempted startup.

          Devin Nusbaum added a comment -

          organised_chaos Thanks, those logs indicate that the Pipeline was resuming before Jenkins fully started up, so maybe https://github.com/jenkinsci/workflow-durable-task-step-plugin/pull/185 was also responsible for that even if it's not directly apparent in the stack traces.

          I have potential fixes up in review in https://github.com/jenkinsci/workflow-api-plugin/pull/188 and https://github.com/jenkinsci/workflow-durable-task-step-plugin/pull/188.

          Devin Nusbaum added a comment -

          Fixes for this issue have been released in workflow-api (Pipeline: API) 1108.v57edf648f5d4 and workflow-durable-task-step (Pipeline: Nodes and Processes) 1107.v5dab75aaccbd. Please try updating those plugins, and if you still encounter deadlock, please comment here with a thread dump. Thanks!

          James Robson added a comment -

          Thanks, I'll try updating again in the new year.

          Jesse Glick added a comment -

          dnusbaum, was the fix for this included in the changes that got reverted? If so, should this be reopened as well?

          Devin Nusbaum added a comment -

          This issue was caused by the changes for JENKINS-67164, so given that https://github.com/jenkinsci/workflow-api-plugin/pull/198 reverted all commits related to both issues, I don't think this issue can occur any more. If you are seeing deadlock with the current version of workflow-api, then I think it would be an independent issue.

            dnusbaum Devin Nusbaum
            organised_chaos James Robson