[JENKINS-36479] Locked resources not freed up by Pipeline job hard kill

      Because LockStepExecution.Callback.finished(context) is never called when a build is hard killed, the resources that build locked can stay locked forever. They can be unlocked manually from the UI, but it would be preferable for the plugin to detect this scenario and unlock resources held by defunct builds.


          Andrew Bayer added a comment -

          Continued further with https://github.com/jenkinsci/lockable-resources-plugin/compare/master...abayer:jenkins-36479

          Andrew Bayer added a comment -

          Note that this approach does nothing for the scenario where builds are already queued up for a resource and the build holding the lock is deleted or hard killed. It only clears the defunct lock when something requests a lock on the resource. In the X-builds-queued-up scenario, that means when the (X+1)th build requests the lock, the first build in the queue is the one that ends up getting it. Still thinking about how to deal with the existing queue without a new lock request...
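The following is a minimal sketch of the behavior described above. It is written in plain Java, not against the plugin's real classes. `LockModel`, `requestLock` and `hardKill` are hypothetical names, and "liveness" is simply a set of build IDs standing in for whatever check the PR actually performs. The sketch shows two things:
- The defunct lock is only noticed when a new request arrives.
- At that point the head of the existing queue gets the lock, not the new requester.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical, simplified model of "clear the defunct lock on the next lock request".
// None of these names come from the lockable-resources plugin.
class LockModel {
    private final Set<String> liveBuilds = new HashSet<>();
    private final Map<String, String> holder = new HashMap<>();          // resource -> holding build
    private final Map<String, Deque<String>> queue = new HashMap<>();   // resource -> waiting builds

    void start(String build) { liveBuilds.add(build); }

    // Hard kill: the build goes away, but no release callback runs,
    // so its entry in 'holder' is left behind.
    void hardKill(String build) { liveBuilds.remove(build); }

    String holderOf(String resource) { return holder.get(resource); }

    /** Returns true if 'build' got the lock, false if it was queued. */
    boolean requestLock(String resource, String build) {
        Deque<String> waiting = queue.computeIfAbsent(resource, r -> new ArrayDeque<>());
        String current = holder.get(resource);
        if (current != null && !liveBuilds.contains(current)) {
            // The defunct lock is detected only now, because someone asked for the resource.
            holder.remove(resource);
            current = null;
        }
        if (current == null) {
            // Builds already waiting take priority over the new requester.
            String next = waiting.isEmpty() ? build : waiting.poll();
            holder.put(resource, next);
            if (!next.equals(build)) {
                waiting.add(build);
                return false;
            }
            return true;
        }
        waiting.add(build);
        return false;
    }
}
```

Usage: start b1 to b4. Let b1 take the lock while b2 and b3 queue, then hard-kill b1. The lock still shows b1 as its holder until b4 calls `requestLock`. That call hands the lock to b2 and queues b4.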


          Andrew Bayer added a comment -

          Interesting - turns out LockStepExecution.stop still gets called in the case of a hard kill, and it's not pleased about it -

          INFO: p #1 completed: ABORTED
          [p #1] Hard kill!
          Jul 06, 2016 11:27:29 AM org.jenkins.plugins.lockableresources.LockStepExecution stop
          WARNING: Cannot remove context from lockable resource witing list. The context is not in the waiting list.
          Jul 06, 2016 11:27:29 AM org.jenkinsci.plugins.workflow.cps.CpsStepContext onFailure
          WARNING: already completed CpsStepContext[3]:Owner[p/1:p #1]
          java.lang.IllegalStateException: org.jenkinsci.plugins.workflow.steps.FlowInterruptedException
          	at org.jenkinsci.plugins.workflow.cps.CpsStepContext.onFailure(CpsStepContext.java:320)
          	at org.jenkins.plugins.lockableresources.LockStepExecution.stop(LockStepExecution.java:90)
          	at org.jenkinsci.plugins.workflow.cps.CpsFlowExecution$6.onSuccess(CpsFlowExecution.java:760)
          	at org.jenkinsci.plugins.workflow.cps.CpsFlowExecution$6.onSuccess(CpsFlowExecution.java:755)
          	at org.jenkinsci.plugins.workflow.support.concurrent.Futures$1.run(Futures.java:150)
          	at com.google.common.util.concurrent.MoreExecutors$SameThreadExecutorService.execute(MoreExecutors.java:253)
          	at com.google.common.util.concurrent.ExecutionList$RunnableExecutorPair.execute(ExecutionList.java:149)
          	at com.google.common.util.concurrent.ExecutionList.execute(ExecutionList.java:134)
          	at com.google.common.util.concurrent.AbstractFuture.set(AbstractFuture.java:170)
          	at com.google.common.util.concurrent.SettableFuture.set(SettableFuture.java:53)
          	at org.jenkinsci.plugins.workflow.cps.CpsFlowExecution$5.onSuccess(CpsFlowExecution.java:644)
          	at org.jenkinsci.plugins.workflow.cps.CpsFlowExecution$5.onSuccess(CpsFlowExecution.java:631)
          	at org.jenkinsci.plugins.workflow.cps.CpsFlowExecution$4$1.run(CpsFlowExecution.java:568)
          	at org.jenkinsci.plugins.workflow.cps.CpsVmExecutorService$1.run(CpsVmExecutorService.java:32)
          	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
          	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
          	at hudson.remoting.SingleLaneExecutorService$1.run(SingleLaneExecutorService.java:112)
          	at jenkins.util.ContextResettingExecutorService$1.run(ContextResettingExecutorService.java:28)
          	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
          	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
          	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
          	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
          	at java.lang.Thread.run(Thread.java:744)
          


          Andrew Bayer added a comment -

          OK, that's just noise - not ideal noise, but just noise. I've pushed a commit with a test that passes now and hangs forever without my changes to LockableResourceManager; the test needs to be fixed so that it actually fails, rather than hangs, in that scenario.

          More tests still needed for the doDelete scenario and for the builds-already-in-queue scenario, but I just wanted to make sure this actually worked in the base case. =)
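One general way to make such a test fail instead of hanging is to put a deadline on every wait. The sketch below is a hypothetical helper built only on `java.util.concurrent.CountDownLatch`. It is not the test from the commit. In a JUnit 4 test, `@Test(timeout = ...)` would be a comparable option.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Hypothetical helper: wait for a "lock acquired" signal with a deadline,
// so a regression shows up as a test failure instead of a test that never ends.
final class BoundedWait {
    private BoundedWait() {}

    static void awaitOrFail(CountDownLatch acquired, long timeout, TimeUnit unit) {
        try {
            // await(timeout, unit) returns false if the count did not reach zero in time.
            if (!acquired.await(timeout, unit)) {
                throw new AssertionError("lock was not acquired within " + timeout + " " + unit);
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt status
            throw new AssertionError("interrupted while waiting for the lock", e);
        }
    }
}
```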


          Andrew Bayer added a comment -

          PR up: https://github.com/jenkinsci/lockable-resources-plugin/pull/34

          Andrew Bayer added a comment -

          I can't see a way to clear the lock without either waiting for a new lock request to come in (as my PR does) or running an async recurring task that periodically checks every locked resource for defunct locks. As far as I can tell from my experiments, LockRunListener doesn't fire on a hard kill or when a build is deleted while running...
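For comparison, the second option (a recurring task that sweeps every locked resource) might look roughly like the sketch below. It rests on several assumptions:
- Resources and builds are plain strings.
- `isBuildAlive` is a hypothetical predicate supplied by the caller.
- `DefunctLockSweeper` is an illustrative name, not plugin code.
- Handing a freed resource to the next queued build is left out.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Predicate;

// Hypothetical sketch of the "async recurring task" alternative; not plugin code.
final class DefunctLockSweeper {
    private final Map<String, String> holders = new ConcurrentHashMap<>(); // resource -> holding build
    private final Predicate<String> isBuildAlive;

    DefunctLockSweeper(Predicate<String> isBuildAlive) { this.isBuildAlive = isBuildAlive; }

    void lock(String resource, String build) { holders.put(resource, build); }

    String holderOf(String resource) { return holders.get(resource); }

    /** One pass over every locked resource; returns how many defunct locks were freed. */
    int sweepOnce() {
        int freed = 0;
        for (Map.Entry<String, String> e : holders.entrySet()) {
            // remove(key, value) only succeeds if the same build still holds the resource.
            if (!isBuildAlive.test(e.getValue()) && holders.remove(e.getKey(), e.getValue())) {
                freed++; // a real implementation would also wake the next queued build here
            }
        }
        return freed;
    }

    /** Runs sweepOnce periodically; the caller must shut down the returned executor. */
    ScheduledExecutorService schedule(long period, TimeUnit unit) {
        ScheduledExecutorService ses = Executors.newSingleThreadScheduledExecutor();
        ses.scheduleAtFixedRate(this::sweepOnce, period, period, unit);
        return ses;
    }
}
```

The trade-off against an event-driven fix has two parts:
- A defunct lock can linger for up to one period.
- Every tick scans all locked resources, even when nothing has been killed.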


          Andrew Bayer added a comment -

          New PR (https://github.com/jenkinsci/lockable-resources-plugin/pull/35) probably supersedes #34 - updating LockRunListener to listen on Run rather than AbstractBuild seems to, well, fix everything!


          Andrew Bayer added a comment -

          Fixed as of the next release (presumably 1.10), which I think should be out shortly.

          The fix makes LockRunListener fire correctly on any Run, not just AbstractBuild. That alone did the trick for both hard-killed builds and builds deleted while in progress, and it doesn't require queuing up a new lock request to clear the defunct lock. Woo.
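A small stand-alone model shows why the listener's type parameter matters. Jenkins' `RunListener` is generic in the kind of `Run` it targets. Pipeline builds (`WorkflowRun`) extend `Run` but not `AbstractBuild`, so a listener typed to `AbstractBuild` is never notified about them. Some caveats about the model:
- The classes below are simplified stand-ins, not the real `hudson.model` types.
- The dispatch loop is our approximation of the target-type check, not Jenkins' actual code.
- Only completion is modeled, although the same filtering would apply to deletion events.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified stand-ins for the Jenkins types; NOT the real hudson.model classes.
class Run {}
class AbstractBuild extends Run {}   // e.g. freestyle builds
class WorkflowRun extends Run {}     // Pipeline builds: a Run, but not an AbstractBuild

// A listener declares which Run subtype it wants to hear about.
abstract class Listener<R extends Run> {
    final Class<R> targetType;

    Listener(Class<R> targetType) { this.targetType = targetType; }

    abstract void onCompleted(R run);
}

final class ListenerRegistry {
    private final List<Listener<?>> listeners = new ArrayList<>();

    void register(Listener<?> l) { listeners.add(l); }

    void fireCompleted(Run run) {
        for (Listener<?> l : listeners) {
            dispatch(l, run);
        }
    }

    // Only listeners whose target type matches the run are notified.
    private static <R extends Run> void dispatch(Listener<R> l, Run run) {
        if (l.targetType.isInstance(run)) {
            l.onCompleted(l.targetType.cast(run));
        }
    }
}
```

In this model, a listener registered with `AbstractBuild.class` stays silent when a `WorkflowRun` completes. The same listener logic registered with `Run.class` fires for both kinds of build, which mirrors the effect of the fix described above.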


          Arnaud Héritier added a comment -

          Cool, abayer.
          Could we also add some documentation/samples about it? It has never been clear to me how this feature should be used or what it is good for.

          Thx

          Andrew Bayer added a comment -

          Talk to amuniz in re docs/samples - I honestly don't know much about how to use the plugin, I just decided to fix the bug. =)


            abayer Andrew Bayer
            abayer Andrew Bayer
            Votes: 1
            Watchers: 6