JENKINS-67351

thread deadlock after update to 2.319.1


Details

    • workflow-api 1108.v57edf648f5d4 and workflow-durable-task-step 1107.v5dab75aaccbd

    Description

      Following the update to the latest LTS version, my Jenkins instance would hang during startup, and the process would become so unresponsive that neither systemctl stop nor even a plain kill would remove it. The logs would contain an error message about a thread deadlock (see below). If it's relevant, there was a job in progress which got suspended when the controller was stopped for the upgrade.

      I tried restarting several times, but the same thing happened each time. I then tried downgrading the jenkins package to the previous version but that hit the same error. Restoring from a snapshot allowed me to return to the previous version.

       

      The following error would appear in the logs:

      WARNING j.m.api.Metrics$HealthChecker#execute: Some health checks are reporting as unhealthy: [thread-deadlock : [AtmostOneTaskExecutor[Periodic Jenkins queue maintenance] [#26] locked on hudson.model.RunMap@166af3a7 (owned by CpsStepContext.isReady [#2]):
      	at jenkins.model.lazy.AbstractLazyLoadRunMap.getByNumber(AbstractLazyLoadRunMap.java:376)
      	at jenkins.model.lazy.LazyBuildMixIn.getBuildByNumber(LazyBuildMixIn.java:228)
      	at org.jenkinsci.plugins.workflow.job.WorkflowJob.getBuildByNumber(WorkflowJob.java:233)
      	at org.jenkinsci.plugins.workflow.job.WorkflowJob.getBuildByNumber(WorkflowJob.java:104)
      	at jenkins.model.PeepholePermalink.resolve(PeepholePermalink.java:103)
      	at hudson.model.Job.getLastSuccessfulBuild(Job.java:947)
      	at hudson.model.Job.getEstimatedDurationCandidates(Job.java:1019)
      	at hudson.model.Job.getEstimatedDuration(Job.java:1053)
      	at hudson.model.Run.getEstimatedDuration(Run.java:2496)
      	at org.jenkinsci.plugins.workflow.support.steps.ExecutorStepExecution$PlaceholderTask.getEstimatedDuration(ExecutorStepExecution.java:696)
      	at hudson.model.queue.MappingWorksheet.<init>(MappingWorksheet.java:327)
      	at hudson.model.queue.MappingWorksheet.<init>(MappingWorksheet.java:312)
      	at hudson.model.Queue.maintain(Queue.java:1645)
      	at hudson.model.Queue$1.call(Queue.java:325)
      	at hudson.model.Queue$1.call(Queue.java:322)
      	at jenkins.util.AtmostOneTaskExecutor$1.call(AtmostOneTaskExecutor.java:107)
      	at jenkins.util.AtmostOneTaskExecutor$1.call(AtmostOneTaskExecutor.java:97)
      	at jenkins.security.ImpersonatingExecutorService$2.call(ImpersonatingExecutorService.java:80)
      	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
      	at hudson.remoting.AtmostOneThreadExecutor$Worker.run(AtmostOneThreadExecutor.java:121)
      	at java.lang.Thread.run(Thread.java:748)
      , CpsStepContext.isReady [#2] locked on java.util.concurrent.locks.ReentrantLock$NonfairSync@18965682 (owned by AtmostOneTaskExecutor[Periodic Jenkins queue maintenance] [#26]):
      	at sun.misc.Unsafe.park(Native Method)
      	at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
      	at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
      	at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:870)
      	at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1199)
      	at java.util.concurrent.locks.ReentrantLock$NonfairSync.lock(ReentrantLock.java:209)
      	at java.util.concurrent.locks.ReentrantLock.lock(ReentrantLock.java:285)
      	at hudson.model.Queue.schedule2(Queue.java:567)
      	at hudson.model.Queue.schedule2(Queue.java:693)
      	at org.jenkinsci.plugins.workflow.support.steps.ExecutorStepExecution.start(ExecutorStepExecution.java:104)
      	at org.jenkinsci.plugins.workflow.support.steps.ExecutorStepExecution.onResume(ExecutorStepExecution.java:210)
      	at org.jenkinsci.plugins.workflow.flow.FlowExecutionList$ResumeStepExecutionListener$1.onSuccess(FlowExecutionList.java:265)
      	at org.jenkinsci.plugins.workflow.flow.FlowExecutionList$ResumeStepExecutionListener$1.onSuccess(FlowExecutionList.java:243)
      	at com.google.common.util.concurrent.Futures$6.run(Futures.java:975)
      	at org.jenkinsci.plugins.workflow.flow.DirectExecutor.execute(DirectExecutor.java:33)
      	at com.google.common.util.concurrent.ExecutionList$RunnableExecutorPair.execute(ExecutionList.java:149)
      	at com.google.common.util.concurrent.ExecutionList.add(ExecutionList.java:105)
      	at com.google.common.util.concurrent.AbstractFuture.addListener(AbstractFuture.java:155)
      	at com.google.common.util.concurrent.Futures.addCallback(Futures.java:985)
      	at org.jenkinsci.plugins.workflow.flow.FlowExecutionList$ResumeStepExecutionListener.onResumed(FlowExecutionList.java:243)
      	at org.jenkinsci.plugins.workflow.flow.FlowExecutionListener.fireResumed(FlowExecutionListener.java:84)
      	at org.jenkinsci.plugins.workflow.job.WorkflowRun.onLoad(WorkflowRun.java:567)
      	at hudson.model.RunMap.retrieve(RunMap.java:231)
      	at hudson.model.RunMap.retrieve(RunMap.java:58)
      	at jenkins.model.lazy.AbstractLazyLoadRunMap.load(AbstractLazyLoadRunMap.java:506)
      	at jenkins.model.lazy.AbstractLazyLoadRunMap.load(AbstractLazyLoadRunMap.java:488)
      	at jenkins.model.lazy.AbstractLazyLoadRunMap.getByNumber(AbstractLazyLoadRunMap.java:386)
      	at hudson.model.RunMap.getById(RunMap.java:211)
      	at org.jenkinsci.plugins.workflow.job.WorkflowRun$Owner.run(WorkflowRun.java:948)
      	at org.jenkinsci.plugins.workflow.job.WorkflowRun$Owner.get(WorkflowRun.java:959)
      	at org.jenkinsci.plugins.workflow.cps.CpsStepContext.getExecution(CpsStepContext.java:217)
      	at org.jenkinsci.plugins.workflow.cps.CpsStepContext.getThreadGroupSynchronously(CpsStepContext.java:242)
      	at org.jenkinsci.plugins.workflow.cps.CpsStepContext.access$000(CpsStepContext.java:97)
      	at org.jenkinsci.plugins.workflow.cps.CpsStepContext$1.call(CpsStepContext.java:263)
      	at org.jenkinsci.plugins.workflow.cps.CpsStepContext$1.call(CpsStepContext.java:261)
      	at jenkins.util.ContextResettingExecutorService$2.call(ContextResettingExecutorService.java:46)
      	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
      	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
      	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
      	at java.lang.Thread.run(Thread.java:748)
      , Splunk data monitor thread locked on hudson.model.RunMap@166af3a7 (owned by CpsStepContext.isReady [#2]):
      	at jenkins.model.lazy.AbstractLazyLoadRunMap.getByNumber(AbstractLazyLoadRunMap.java:376)
      	at jenkins.model.lazy.LazyBuildMixIn.getBuildByNumber(LazyBuildMixIn.java:228)
      	at org.jenkinsci.plugins.workflow.job.WorkflowJob.getBuildByNumber(WorkflowJob.java:233)
      	at org.jenkinsci.plugins.workflow.job.WorkflowJob.getBuildByNumber(WorkflowJob.java:104)
      	at hudson.model.Run.fromExternalizableId(Run.java:2483)
      	at org.jenkinsci.plugins.workflow.support.steps.ExecutorStepExecution$PlaceholderTask.runForDisplay(ExecutorStepExecution.java:527)
      	at org.jenkinsci.plugins.workflow.support.steps.ExecutorStepExecution$PlaceholderTask.getUrl(ExecutorStepExecution.java:536)
      	at com.splunk.splunkjenkins.HealthMonitor.sendPendingQueue(HealthMonitor.java:110)
      	at com.splunk.splunkjenkins.HealthMonitor.execute(HealthMonitor.java:44)
      	at hudson.model.AsyncPeriodicWork.lambda$doRun$0(AsyncPeriodicWork.java:101)
      	at hudson.model.AsyncPeriodicWork$$Lambda$545/292627145.run(Unknown Source)
      	at java.lang.Thread.run(Thread.java:748)
      

          Activity

            organised_chaos James Robson created issue -
            markewaite Mark Waite made changes -
            Field: Summary
            Original Value: thread deadlock after update to 2.391.1
            New Value: thread deadlock after update to 2.319.1
            markewaite Mark Waite made changes -
            Field: Environment
            Original Value: jenkins: 2.391.1; OS: ubuntu 20.04; Java: 1.8.0_292
            New Value: jenkins: 2.319.1; OS: ubuntu 20.04; Java: 1.8.0_292
            dnusbaum Devin Nusbaum added a comment - - edited

            Appears to be caused by JENKINS-67164 in workflow-api 1105.v3de5e2efac97 (you may be able to downgrade workflow-api to 2.47 to avoid the issue).

            The "AtmostOneTaskExecutor[Periodic Jenkins queue maintenance] [#26]" thread currently holds the queue lock and is trying to access a build via RunMap, which requires the RunMap lock.

            The "CpsStepContext.isReady [#2]" thread is loading a build and holds the RunMap lock, and is stuck trying to schedule a placeholder task to resume a node step, which requires the queue lock.

            I think the "Splunk data monitor" thread is irrelevant.

            I will look into patching workflow-api to fix this.
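
A minimal sketch of the lock-order inversion described above, using only the JDK and no Jenkins code. The names queueLock, runMap, "queue-maintenance" and "build-loader" are stand-ins chosen for this illustration; they are not the real Jenkins objects. One thread takes a ReentrantLock and then a monitor, the other takes them in the opposite order, and ThreadMXBean.findDeadlockedThreads() reports the resulting cycle, which is the kind of check a thread-deadlock health check can rely on.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.locks.ReentrantLock;

class LockOrderInversionDemo {

    /** Deadlocks two daemon threads on purpose and returns how many of them the JVM reports as deadlocked. */
    static int countDeadlockedThreads() throws InterruptedException {
        final ReentrantLock queueLock = new ReentrantLock(); // stand-in for the queue lock
        final Object runMap = new Object();                  // stand-in for the RunMap monitor
        final CountDownLatch bothHoldFirstLock = new CountDownLatch(2);

        // Order: queue lock first, then the RunMap monitor.
        Thread queueMaintenance = new Thread(() -> {
            queueLock.lock();
            try {
                arriveAndWait(bothHoldFirstLock);
                synchronized (runMap) {
                    // never reached: the other thread owns the monitor
                }
            } finally {
                queueLock.unlock();
            }
        }, "queue-maintenance");

        // Opposite order: RunMap monitor first, then the queue lock.
        Thread buildLoader = new Thread(() -> {
            synchronized (runMap) {
                arriveAndWait(bothHoldFirstLock);
                queueLock.lock(); // never returns: the other thread owns the lock
                queueLock.unlock();
            }
        }, "build-loader");

        // Daemon threads so the stuck pair does not keep the JVM alive.
        queueMaintenance.setDaemon(true);
        buildLoader.setDaemon(true);
        queueMaintenance.start();
        buildLoader.start();

        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        for (int attempt = 0; attempt < 100; attempt++) {
            long[] ids = threads.findDeadlockedThreads(); // null when no cycle is found
            if (ids != null) {
                int ours = 0;
                for (long id : ids) {
                    if (id == queueMaintenance.getId() || id == buildLoader.getId()) {
                        ours++;
                    }
                }
                if (ours > 0) {
                    return ours;
                }
            }
            Thread.sleep(50);
        }
        return 0;
    }

    /** Signals that this thread holds its first lock, then waits until the other thread does too. */
    private static void arriveAndWait(CountDownLatch latch) {
        latch.countDown();
        try {
            latch.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```

The latch only makes the interleaving deterministic for the demo; in the report above the same interleaving happened by timing during startup.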

            dnusbaum Devin Nusbaum made changes -
            Link This issue is caused by JENKINS-67164 [ JENKINS-67164 ]
            dnusbaum Devin Nusbaum made changes -
            Component/s workflow-api-plugin [ 21711 ]
            Component/s core [ 15593 ]
            dnusbaum Devin Nusbaum made changes -
            Assignee Devin Nusbaum [ dnusbaum ]
            jglick Jesse Glick made changes -
            Labels deadlock regression
            jglick Jesse Glick made changes -
            Description (wording unchanged; the stack trace, previously marked up line by line with {{...}}, was wrapped in a single {code:none} block)
            dnusbaum Devin Nusbaum added a comment -

            One interesting thing about the thread dump is that the Pipeline is resuming because something called CpsStepContext.isReady (some method in ExecutorStepExecution.PlaceholderTask?) rather than because FlowExecutionList$ItemListenerImpl.onLoaded ran, which is what we would normally expect. organised_chaos, do you have the full Jenkins log from when the error happened? I am curious to understand the timing of the deadlock in relation to Jenkins starting up.

            I was not able to reproduce the deadlock myself in a test, but from what I can tell switching from MoreExecutors.directExecutor to Timer.get in ResumeStepExecutionListener seems to work.
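
A rough sketch of why the choice of executor matters here, assuming nothing beyond the JDK. Runnable::run plays the role of a direct executor, and a single-thread pool stands in for a shared timer pool; runMap is a placeholder object, not the real RunMap. With the direct executor the callback runs on the calling thread while it still holds the monitor; with the pool it runs on another thread that holds nothing, so the caller can leave the synchronized block without waiting on the callback.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executor;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

class CallbackExecutorDemo {

    /** Returns true if the callback was executed by a thread that holds the runMap monitor. */
    static boolean callbackRunsUnderLock(Executor executor) throws Exception {
        final Object runMap = new Object(); // placeholder for the lock held while a build is loading
        final CompletableFuture<Boolean> heldLock = new CompletableFuture<>();
        synchronized (runMap) {
            // Thread.holdsLock checks the thread that actually runs the callback.
            executor.execute(() -> heldLock.complete(Thread.holdsLock(runMap)));
        }
        return heldLock.get(5, TimeUnit.SECONDS);
    }

    /** Index 0: direct executor, index 1: separate pool thread. */
    static boolean[] compare() throws Exception {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        try {
            return new boolean[] {
                callbackRunsUnderLock(Runnable::run),
                callbackRunsUnderLock(pool)
            };
        } finally {
            pool.shutdown();
        }
    }
}
```

In the direct case, any lock the callback needs is acquired while the monitor is still held, which is the ordering visible in the second stack trace.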

            jglick Jesse Glick added a comment - https://github.com/jenkinsci/workflow-cps-plugin/blob/fc3007f2488397da3632908fcd7a2b1b11f79e7f/src/main/java/org/jenkinsci/plugins/workflow/cps/CpsStepContext.java#L255-L273 FYI
            jglick Jesse Glick added a comment -

            https://github.com/jenkinsci/workflow-durable-task-step-plugin/pull/185 maybe also contributing. What version of this plugin is reporter running?

            dnusbaum Devin Nusbaum added a comment -

            A similar issue was also reported on jenkinsci-users (that's how I noticed this), although in that case the deadlock is between Queue.maintain running on a Timer thread and a thread named "pool-21-thread-1" created by the Datadog plugin here (should probably be switched to Timer, but thankfully newCachedThreadPool will reap threads that are unused for more than 60 seconds), and again the Pipeline is not being resumed by FlowExecutionList$ItemListenerImpl.onLoaded. In that case, the third thread mentioned in the deadlock is the Jenkins initialization thread, and we can see that it is blocked here prior to notifying ItemListener implementations.

            dnusbaum Devin Nusbaum made changes -
            Remote Link This issue links to "Similar report from jenkinsci-users (Web Link)" [ 27290 ]
            dnusbaum Devin Nusbaum added a comment -

            > https://github.com/jenkinsci/workflow-durable-task-step-plugin/pull/185 maybe also contributing. What version of this plugin is reporter running?

            Maybe. It is definitely relevant for https://groups.google.com/g/jenkinsci-users/c/nHRWJrjqi74. I think that https://github.com/jenkinsci/workflow-durable-task-step-plugin/pull/185#discussion_r762246581 is wrong because Queue.maintain may run on a Timer thread before FlowExecutionList$ItemListenerImpl.onLoaded is executed.

            jglick Jesse Glick added a comment -

            > a thread named "pool-21-thread-1" created by the Datadog plugin

            Should at a minimum use https://javadoc.jenkins.io/hudson/util/NamingThreadFactory.html.

            Maybe PR-185 needs to be disabled until after https://javadoc.jenkins.io/hudson/init/InitMilestone.html#COMPLETED?
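
            The thread-naming suggestion above can be illustrated with a minimal, standard-library-only sketch. It is not the Jenkins NamingThreadFactory class and not the Datadog plugin's code; PrefixNamingThreadFactory, NamedPoolSketch, and the "Example plugin worker" prefix are hypothetical names chosen for illustration. The point is only that wrapping the default factory gives pool threads a recognizable name in a thread dump instead of "pool-21-thread-1".

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.ThreadFactory;
import java.util.concurrent.atomic.AtomicInteger;

public class NamedPoolSketch {

    /** Hypothetical helper: wraps a delegate factory and renames each thread it creates. */
    static final class PrefixNamingThreadFactory implements ThreadFactory {
        private final ThreadFactory delegate;
        private final String prefix;
        private final AtomicInteger counter = new AtomicInteger();

        PrefixNamingThreadFactory(ThreadFactory delegate, String prefix) {
            this.delegate = delegate;
            this.prefix = prefix;
        }

        @Override
        public Thread newThread(Runnable r) {
            Thread t = delegate.newThread(r);
            // Replace the generic "pool-N-thread-M" name with a descriptive, numbered one.
            t.setName(prefix + " [#" + counter.incrementAndGet() + "]");
            return t;
        }
    }

    /** Runs one task on a cached pool built with the naming factory and returns the worker's name. */
    static String nameOfWorkerThread(String prefix) {
        ExecutorService pool = Executors.newCachedThreadPool(
                new PrefixNamingThreadFactory(Executors.defaultThreadFactory(), prefix));
        try {
            Callable<String> task = () -> Thread.currentThread().getName();
            return pool.submit(task).get();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(nameOfWorkerThread("Example plugin worker"));
    }
}
```

            With a name like this, a deadlock report identifies the owning component directly rather than requiring a search for whichever plugin created an anonymous pool.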

            dnusbaum Devin Nusbaum added a comment -

            https://github.com/jenkinsci/workflow-cps-plugin/blob/fc3007f2488397da3632908fcd7a2b1b11f79e7f/src/main/java/org/jenkinsci/plugins/workflow/cps/CpsStepContext.java#L255-L273 FYI

            Yeah, I read through https://github.com/jenkinsci/pipeline-plugin/pull/68 and JENKINS-25890, but I am not sure that same issue applies here, since the Run and FlowExecution have already been loaded, and so just returning from ResumeStepExecutionListener.onResumed will be enough to release the RunMap lock. Simple enough to introduce a new thread pool just in case though.

            Maybe PR-185 needs to be disabled until after https://javadoc.jenkins.io/hudson/init/InitMilestone.html#COMPLETED?

            Maybe, although if I am reading things correctly, InitMilestone.COMPLETED is reached before we iterate through ItemListener implementations (see here), so I'm not sure it would be enough. Maybe we could introduce some new API like FlowExecutionList.haveExecutionsResumed that returns false until FlowExecutionList$ItemListenerImpl.onLoaded has completed and use that to disable PR 185.
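
            The gating idea proposed above can be sketched roughly as follows, using only the standard library. This is an assumption-laden illustration, not the change that was later made in workflow-api: ResumeGateSketch, markExecutionsResumed, and guardedLookup are hypothetical names, and the guarded lookup stands in for whatever work must not run until executions have been resumed.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.LongSupplier;

public class ResumeGateSketch {

    // Starts false; flipped exactly once when resumption has finished.
    private final AtomicBoolean executionsResumed = new AtomicBoolean(false);

    /** Hypothetical hook: called at the end of the listener that resumes executions. */
    void markExecutionsResumed() {
        executionsResumed.set(true);
    }

    /** Mirrors the proposed query: false until the resumption hook has completed. */
    boolean haveExecutionsResumed() {
        return executionsResumed.get();
    }

    /**
     * Runs the lookup only after resumption; before that it returns -1 as an
     * "unknown" placeholder without calling the supplier, so no lock the
     * lookup might need is ever requested during startup.
     */
    long guardedLookup(LongSupplier lookup) {
        if (!haveExecutionsResumed()) {
            return -1L;
        }
        return lookup.getAsLong();
    }

    public static void main(String[] args) {
        ResumeGateSketch gate = new ResumeGateSketch();
        System.out.println(gate.guardedLookup(() -> 42L)); // gate still closed
        gate.markExecutionsResumed();
        System.out.println(gate.guardedLookup(() -> 42L)); // gate open
    }
}
```

            The design choice here is that callers on other threads (for example a periodic queue task) get a cheap default instead of blocking, which avoids taking locks in an order that conflicts with the startup thread.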

            jglick Jesse Glick added a comment -

            Maybe use @Initializer with its more predictable timeline than ItemListener.onLoaded?

            dnusbaum Devin Nusbaum made changes -
            Status Open [ 1 ] In Progress [ 3 ]
            dnusbaum Devin Nusbaum made changes -
            Component/s workflow-durable-task-step-plugin [ 21715 ]
            organised_chaos James Robson made changes -
            Attachment jenkins.log.1 [ 56956 ]
            organised_chaos James Robson added a comment -

            https://github.com/jenkinsci/workflow-durable-task-step-plugin/pull/185 maybe also contributing. What version of this plugin is reporter running?

            I updated all plugins before the update, so it should have been the latest version published as of December 9th, which I believe would include the fix you linked to.

             

            do you have the full Jenkins log from when the error happened?

            I've attached the log file, which shows what happened during one attempted startup.

            dnusbaum Devin Nusbaum added a comment -

            organised_chaos Thanks, those logs indicate that the Pipeline was resuming before Jenkins fully started up, so maybe https://github.com/jenkinsci/workflow-durable-task-step-plugin/pull/185 was also responsible for that even if it's not directly apparent in the stack traces.

            I have potential fixes up in review in https://github.com/jenkinsci/workflow-api-plugin/pull/188 and https://github.com/jenkinsci/workflow-durable-task-step-plugin/pull/188.

            dnusbaum Devin Nusbaum made changes -
            Status In Progress [ 3 ] In Review [ 10005 ]
            dnusbaum Devin Nusbaum made changes -
            Remote Link This issue links to "jenkinsci/workflow-durable-task-step-plugin#188 (Web Link)" [ 27296 ]
            dnusbaum Devin Nusbaum made changes -
            Remote Link This issue links to "jenkinsci/workflow-api-plugin#188 (Web Link)" [ 27297 ]
            dnusbaum Devin Nusbaum added a comment -

            Fixes for this issue have been released in workflow-api (Pipeline: API) 1108.v57edf648f5d4 and workflow-durable-task-step (Pipeline: Nodes and Processes) 1107.v5dab75aaccbd. Please try updating those plugins, and if you still encounter deadlock, please comment here with a thread dump. Thanks!

            dnusbaum Devin Nusbaum made changes -
            Released As workflow-api 1108.v57edf648f5d4 and workflow-durable-task-step 1107.v5dab75aaccbd
            Resolution Fixed [ 1 ]
            Status In Review [ 10005 ] Fixed but Unreleased [ 10203 ]
            dnusbaum Devin Nusbaum made changes -
            Status Fixed but Unreleased [ 10203 ] Resolved [ 5 ]
            organised_chaos James Robson added a comment -

            Thanks, I'll try updating again in the new year.

            jglick Jesse Glick added a comment -

            dnusbaum was the fix for this included in the things that got reverted? If so, should this be reopened as well?

            dnusbaum Devin Nusbaum added a comment -

            This issue was caused by the changes for JENKINS-67164, so given that https://github.com/jenkinsci/workflow-api-plugin/pull/198 reverted all commits related to both issues, I don't think this issue can occur any more. If you are seeing deadlock with the current version of workflow-api, then I think it would be an independent issue.


            People

              dnusbaum Devin Nusbaum
              organised_chaos James Robson