JENKINS-61604

Pipeline deadlock caused by fix for JENKINS-59083

    • workflow-job 2.38

      The fix for JENKINS-59083 has been observed to cause a deadlock in some cases. Here is one example. Thread one:

      "SupportPlugin periodic bundle generator: writing <support bundle name>" id=967 (0x3c7) state=BLOCKED cpu=94%
          - waiting to lock <0x71879181> (a java.lang.Object)
            owned by "Jenkins initialization thread" id=36 (0x24)
          at org.jenkinsci.plugins.workflow.job.WorkflowRun.getListener(WorkflowRun.java:217)
          at org.jenkinsci.plugins.workflow.job.WorkflowRun.access$300(WorkflowRun.java:133)
          at org.jenkinsci.plugins.workflow.job.WorkflowRun$Owner.getListener(WorkflowRun.java:970)
          at org.jenkinsci.plugins.workflow.cps.CpsFlowExecution$2.onSuccess(CpsFlowExecution.java:789)
          at org.jenkinsci.plugins.workflow.cps.CpsFlowExecution$2.onSuccess(CpsFlowExecution.java:776)
          at org.jenkinsci.plugins.workflow.support.concurrent.Futures$1.run(Futures.java:150)
          at com.google.common.util.concurrent.MoreExecutors$SameThreadExecutorService.execute(MoreExecutors.java:253)
          at com.google.common.util.concurrent.ExecutionList$RunnableExecutorPair.execute(ExecutionList.java:149)
          at com.google.common.util.concurrent.ExecutionList.add(ExecutionList.java:105)
          at com.google.common.util.concurrent.AbstractFuture.addListener(AbstractFuture.java:155)
          at org.jenkinsci.plugins.workflow.support.concurrent.Futures.addCallback(Futures.java:160)
          at org.jenkinsci.plugins.workflow.support.concurrent.Futures.addCallback(Futures.java:90)
          at org.jenkinsci.plugins.workflow.cps.CpsFlowExecution.loadProgramAsync(CpsFlowExecution.java:773)
          at org.jenkinsci.plugins.workflow.cps.CpsFlowExecution.onLoad(CpsFlowExecution.java:740)
          at org.jenkinsci.plugins.workflow.job.WorkflowRun.getExecution(WorkflowRun.java:668)
          at org.jenkinsci.plugins.workflow.job.WorkflowRun$Owner.get(WorkflowRun.java:940)
          at org.jenkinsci.plugins.workflow.flow.FlowExecutionList$1.computeNext(FlowExecutionList.java:65)
          at org.jenkinsci.plugins.workflow.flow.FlowExecutionList$1.computeNext(FlowExecutionList.java:57)
          at com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:143)
          at com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:138)
          at org.jenkinsci.plugins.workflow.cps.CpsThreadDumpAction$PipelineThreadDump$1.writeTo(CpsThreadDumpAction.java:109)
          at com.cloudbees.jenkins.support.SupportPlugin.writeBundle(SupportPlugin.java:327)
          at com.cloudbees.jenkins.support.SupportPlugin.writeBundle(SupportPlugin.java:279)
          at com.cloudbees.jenkins.support.SupportPlugin$PeriodicWorkImpl.lambda$doRun$0(SupportPlugin.java:855)
          at com.cloudbees.jenkins.support.SupportPlugin$PeriodicWorkImpl$$Lambda$442/1694791994.run(Unknown Source)
          at java.lang.Thread.run(Thread.java:748)
      

      Thread two:

      "Jenkins initialization thread" id=36 (0x24) state=BLOCKED cpu=92%
          - waiting to lock <0x60272c7e> (a org.jenkinsci.plugins.workflow.job.WorkflowRun)
            owned by "SupportPlugin periodic bundle generator: writing <support bundle name>" id=967 (0x3c7)
          at org.jenkinsci.plugins.workflow.job.WorkflowRun.getExecution(WorkflowRun.java:650)
          at org.jenkinsci.plugins.workflow.job.WorkflowRun.onLoad(WorkflowRun.java:529)
          at hudson.model.RunMap.retrieve(RunMap.java:225)
          at hudson.model.RunMap.retrieve(RunMap.java:57)
          at jenkins.model.lazy.AbstractLazyLoadRunMap.load(AbstractLazyLoadRunMap.java:501)
          at jenkins.model.lazy.AbstractLazyLoadRunMap.load(AbstractLazyLoadRunMap.java:483)
          at jenkins.model.lazy.AbstractLazyLoadRunMap.getByNumber(AbstractLazyLoadRunMap.java:381)
          at hudson.model.RunMap.getById(RunMap.java:205)
          at org.jenkinsci.plugins.workflow.job.WorkflowRun$Owner.run(WorkflowRun.java:917)
          at org.jenkinsci.plugins.workflow.job.WorkflowRun$Owner.get(WorkflowRun.java:928)
          at org.jenkinsci.plugins.workflow.flow.FlowExecutionList$1.computeNext(FlowExecutionList.java:65)
          at org.jenkinsci.plugins.workflow.flow.FlowExecutionList$1.computeNext(FlowExecutionList.java:57)
          at com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:143)
          at com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:138)
          at org.jenkinsci.plugins.workflow.flow.FlowExecutionList$ItemListenerImpl.onLoaded(FlowExecutionList.java:178)
          at jenkins.model.Jenkins.<init>(Jenkins.java:1028)
          at hudson.model.Hudson.<init>(Hudson.java:85)
          at hudson.model.Hudson.<init>(Hudson.java:81)
          at hudson.WebAppMain$3.run(WebAppMain.java:233)
      

      I think these are the methods that acquire the locks involved in the deadlock (a minimal sketch of the lock-ordering inversion follows the two lists):

      Thread one:

      1. WorkflowRun.getExecution: Acquires the lock for WorkflowRun
      2. WorkflowRun.getListener: Attempts to acquire the lock for WorkflowRun.getMetadataGuard() held by thread two

      Thread two:

      1. WorkflowRun.onLoad: Acquires the lock for WorkflowRun.getMetadataGuard()
      2. WorkflowRun.getExecution: Attempts to acquire the lock for WorkflowRun held by thread one
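
      As a runnable sketch of that inversion (hypothetical lock objects stand in for the run's own monitor and getMetadataGuard(); the thread bodies and the latch are illustrative, not the real code paths):

      import java.util.concurrent.CountDownLatch;

      public class LockOrderDemo {
          // Hypothetical stand-ins for the two monitors involved.
          static final Object runMonitor = new Object();    // the WorkflowRun instance itself
          static final Object metadataGuard = new Object(); // WorkflowRun.getMetadataGuard()
          // Once both threads hold their first lock, the deadlock is certain.
          static final CountDownLatch bothHoldFirstLock = new CountDownLatch(2);

          public static void main(String[] args) {
              // Thread one's order: getExecution (run monitor), then getListener (metadata guard).
              new Thread(() -> {
                  synchronized (runMonitor) {
                      rendezvous();
                      synchronized (metadataGuard) { }
                  }
              }, "thread-one").start();

              // Thread two's order: onLoad (metadata guard), then getExecution (run monitor).
              new Thread(() -> {
                  synchronized (metadataGuard) {
                      rendezvous();
                      synchronized (runMonitor) { }
                  }
              }, "thread-two").start();
          }

          static void rendezvous() {
              bothHoldFirstLock.countDown();
              try {
                  bothHoldFirstLock.await();
              } catch (InterruptedException e) {
                  Thread.currentThread().interrupt();
              }
          }
      }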

      It looks like we should have considered this during review, given the comment here. I'm not sure I understand the comment on the next line, though, since we are obviously not already synchronizing on getMetadataGuard() in this case.

      I'm not sure about the best way to fix this. We could make WorkflowRun.completed volatile and just drop the synchronization in getListener. IIUC, that would not reintroduce JENKINS-59083, since WorkflowRun.finish sets completed to true before closing the listener. The worst case would be that something calls getListener() while the build is running, gets the unclosed listener, and then has the listener closed out from under it when the build completes.
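
      A rough sketch of that option (simplified and hypothetical; the field names mirror the real class, but the bodies and the handling of the completed case are illustrative, not the actual WorkflowRun source):

      import java.io.IOException;

      class RunSketch {
          private volatile boolean completed;       // volatile read replaces synchronized (getMetadataGuard())
          private volatile AutoCloseable listener;  // stands in for the build log's StreamBuildListener

          AutoCloseable getListener() throws IOException {
              // No lock is taken here, so this path can no longer be part of a lock-ordering cycle.
              if (completed) {
                  throw new IOException("build completed; listener already closed"); // illustrative
              }
              return listener;
          }

          void finish() throws Exception {
              // completed is set before the listener is closed, so the worst case is exactly
              // the one described above: a caller that fetched the listener just before
              // completion races the close while still using it.
              completed = true;
              AutoCloseable l = listener;
              if (l != null) {
                  l.close();
              }
          }
      }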

      I think the obvious fix would be to delete getMetadataGuard() completely and always synchronize on the WorkflowRun itself, but that might cause performance issues. I assume lock contention on WorkflowRun is why getMetadataGuard() was added in the first place, but looking through the history, WorkflowRun seems to have always had two forms of locking: the standard monitor and metadataGuard (formerly logCopyGuard, formerly completed, formerly completionGuard), so I'm not sure.
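
      For comparison, the single-monitor alternative would look roughly like this (again a simplified sketch, not the real source). It cannot deadlock on lock ordering, since there is only one lock, but every caller now contends on it:

      class SingleLockRunSketch {
          private boolean completed;      // guarded by the run's own monitor
          private AutoCloseable listener; // guarded by the run's own monitor

          synchronized AutoCloseable getListener() {
              return completed ? null : listener; // illustrative handling of the completed case
          }

          synchronized void finish() throws Exception {
              completed = true;
              if (listener != null) {
                  listener.close();
              }
          }

          // Cheap, frequent calls like getListener() now queue behind heavy ones like
          // this, which is the lock-contention concern mentioned above.
          synchronized Object getExecution() {
              return null; // would lazily load the flow execution
          }
      }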

            Assignee: Devin Nusbaum
            Reporter: Devin Nusbaum