Uploaded image for project: 'Jenkins'
  1. Jenkins
  2. JENKINS-22395

Run.delete (from LogRotator) failing with "...looks to have already been deleted"

      Sometimes "log rotation" fails with an exception like

      SEVERE  hudson.model.Run#execute: Failed to rotate log
      java.io.IOException: .../jobs/.../modules/...$.../builds/2013-... looks to have already been deleted
          at hudson.model.Run.delete(Run.java:1432)
          at hudson.maven.MavenModuleSetBuild.delete(MavenModuleSetBuild.java:420)
          at hudson.tasks.LogRotator.perform(LogRotator.java:136)
          at hudson.model.Job.logRotate(Job.java:437)
          at hudson.maven.MavenModuleSet.logRotate(MavenModuleSet.java:851)
          at hudson.model.Run.execute(Run.java:1728)
          at hudson.maven.MavenModuleSetBuild.run(MavenModuleSetBuild.java:509)
          at hudson.model.ResourceController.execute(ResourceController.java:88)
          at hudson.model.Executor.run(Executor.java:246)
      

      Usually MavenModuleSetBuild is involved, perhaps suggesting a problem with deleted or skipped module builds (I cannot reproduce in either scenario), though I think I have also seen this happen for freestyle builds. Unclear whether the directory actually exists but File.isDirectory thinks it does not (the code originally tried to delete the dir without checking this and failed with "... is in use"); or whether the directory was actually deleted earlier but this Run was not cleaned up properly for some reason.

      Hypothesis: AbstractLazyLoadRunMap.idOnDisk is not removed by removeValue, called from AbstractProject.removeRun (from Run.delete). Perhaps something is later resurrecting the AbstractBuild from idOnDisk, making it again available for another round of log rotation? But if the actual directory was deleted, it is hard to see how: load would just fail.

      Or perhaps Run.delete is the race condition: the run is removed from the parent after its directory has been deleted. The method is synchronized, but that does not help if two copies of the Run exist, which might happen due to other lazy-loading bugs.

      Diagnostics added to date:

          [JENKINS-22395] Run.delete (from LogRotator) failing with "...looks to have already been deleted"

          Jesse Glick added a comment - Testing build of https://github.com/jenkinsci/jenkins/compare/1.565.1-rc1...JENKINS-22395-dropLinks-backport uploaded: http://repo.jenkins-ci.org/public/org/jenkins-ci/main/jenkins-war/1.565.1.JENKINS-22395-dropLinks/jenkins-war-1.565.1.JENKINS-22395-dropLinks.war

          Code changed in jenkins
          User: Daniel Beck
          Path:
          changelog.html
          http://jenkins-ci.org/commit/jenkins/fdc0b5c5650b3ee849f02e3c5d94d23c12886adc
          Log:
          Noting #1314, #1316, #1308, JENKINS-17667, JENKINS-22395, JENKINS-18065

          SCM/JIRA link daemon added a comment - Code changed in jenkins User: Daniel Beck Path: changelog.html http://jenkins-ci.org/commit/jenkins/fdc0b5c5650b3ee849f02e3c5d94d23c12886adc Log: Noting #1314, #1316, #1308, JENKINS-17667 , JENKINS-22395 , JENKINS-18065

          Code changed in jenkins
          User: Daniel Beck
          Path:
          changelog.html
          http://jenkins-ci.org/commit/jenkins/72f37e414533c02624f9deffc85a2b50f03905e6
          Log:
          Remove JENKINS-22395, JENKINS-18065 from 1.572 section

          SCM/JIRA link daemon added a comment - Code changed in jenkins User: Daniel Beck Path: changelog.html http://jenkins-ci.org/commit/jenkins/72f37e414533c02624f9deffc85a2b50f03905e6 Log: Remove JENKINS-22395 , JENKINS-18065 from 1.572 section

          Code changed in jenkins
          User: Kohsuke Kawaguchi
          Path:
          test/src/test/java/jenkins/model/lazy/LazyBuildMixInTest.java
          http://jenkins-ci.org/commit/jenkins/2baa7caade6923b9a2023e64b60bb537fcc65077
          Log:
          JENKINS-22395

          I'm not sure if I understand how the original fix in PR #1190 (f1430a2) addresses the problem.

          My understanding is of the problem is as follows:

          • someone deletes 'b2' but holds on to its reference (could be different threads)
          • someone calls b2.getPreviousBuild()
          • if b2.previousBuildR is null, then this triggers the loading of b1, and
            that process sets up a bi-di link via b1<>b2 via b1.nextBuildR <> b2.previousBuildR
          • this makes b1.getNextBuild() incorrectly return b2

          Presumably f1430a2 addresses this somehow, but I think I can induce this situation in other ways,
          which is what dropLinksAfterGC2() does.

          In this test,

          initial state:

          b1 <> b2 <> b3

          we load everyone and connect them all up

          after deleting b2:

          b1 <--------> b3
          ^ ^
          ---- b2 ----

          b1 and b3 points each other, and b2 still refers to each side

          after dropping b1:
          b2 --> b3

          now, here, when I do b2.getPreviousBuild(), it loads b1a and it sets b1a.nextBuildR to b2.

          b1a <-> b2 --> b3

          So I claim this is a proof that the fix is incomplete, even for the problem JENKINS-22395 has discovered.

          I don't think that the problem is for the dropLinks call to fail to patch up references.
          The problem is that b2.getPreviousBuild() forcing b1 to point to b2, because if b2 is deleted and assumed to be
          invalid, then no matter what bX this method will find you never want bX.nextBuildR to point to b2.

          (cherry picked from commit 53e95ee517259e1e10d7209bcbaa4314397c1f40)

          SCM/JIRA link daemon added a comment - Code changed in jenkins User: Kohsuke Kawaguchi Path: test/src/test/java/jenkins/model/lazy/LazyBuildMixInTest.java http://jenkins-ci.org/commit/jenkins/2baa7caade6923b9a2023e64b60bb537fcc65077 Log: JENKINS-22395 I'm not sure if I understand how the original fix in PR #1190 (f1430a2) addresses the problem. My understanding is of the problem is as follows: someone deletes 'b2' but holds on to its reference (could be different threads) someone calls b2.getPreviousBuild() if b2.previousBuildR is null, then this triggers the loading of b1, and that process sets up a bi-di link via b1< >b2 via b1.nextBuildR < > b2.previousBuildR this makes b1.getNextBuild() incorrectly return b2 Presumably f1430a2 addresses this somehow, but I think I can induce this situation in other ways, which is what dropLinksAfterGC2() does. In this test, initial state: b1 < > b2 < > b3 we load everyone and connect them all up after deleting b2: b1 <--------> b3 ^ ^ ---- b2 ---- b1 and b3 points each other, and b2 still refers to each side after dropping b1: b2 --> b3 now, here, when I do b2.getPreviousBuild(), it loads b1a and it sets b1a.nextBuildR to b2. b1a <-> b2 --> b3 So I claim this is a proof that the fix is incomplete, even for the problem JENKINS-22395 has discovered. I don't think that the problem is for the dropLinks call to fail to patch up references. The problem is that b2.getPreviousBuild() forcing b1 to point to b2, because if b2 is deleted and assumed to be invalid, then no matter what bX this method will find you never want bX.nextBuildR to point to b2. (cherry picked from commit 53e95ee517259e1e10d7209bcbaa4314397c1f40)

          Code changed in jenkins
          User: Kohsuke Kawaguchi
          Path:
          test/src/test/java/jenkins/model/lazy/LazyBuildMixInTest.java
          http://jenkins-ci.org/commit/jenkins/3f0388379b06b31ebec1bf17cc976001fcd80b7b
          Log:
          JENKINS-22395 correcting the control test

          As I step-executed the code, I discovered b2.getPreviousBuild() was getting invoked
          between BRHF.drop(b1) and b2.dropLinks() call, in PeepholePermalink.RunListenerImpl.onDeleted()
          because Run.delete() calls RunListener.fireDeleted(this).

          Thus in effect the sequence of the call order was as follows:
          assertEquals(1, BRHF.drop(b1));
          b2.getPreviousBuild(); // happens indirectly in PeepholePermalink
          b2.delete();
          FreeStyleBuild b1a = b2.getPreviousBuild();

          This defeats the purpose of the fix, because in this call sequence, by b2.dropLinks()
          as implemented before f1430a2 will correctly fix up b1a.nextBuildR to b3.

          For the test to work as intended, it is important that b2.previousBuildR.get()==null
          during dropLinks(). That is what causes b2.getPreviousBuild() in the next line to go
          load b1a, and sets up b1a.nextBuildR to b2, and trigger the assertion error.

          Added a code to remove all RunListeners. With this change, the test now fails without
          the fix in f1430a2.

          (cherry picked from commit 7b1b50c85a6a4bdb75f1ae5b40885afa45f96b82)

          SCM/JIRA link daemon added a comment - Code changed in jenkins User: Kohsuke Kawaguchi Path: test/src/test/java/jenkins/model/lazy/LazyBuildMixInTest.java http://jenkins-ci.org/commit/jenkins/3f0388379b06b31ebec1bf17cc976001fcd80b7b Log: JENKINS-22395 correcting the control test As I step-executed the code, I discovered b2.getPreviousBuild() was getting invoked between BRHF.drop(b1) and b2.dropLinks() call, in PeepholePermalink.RunListenerImpl.onDeleted() because Run.delete() calls RunListener.fireDeleted(this). Thus in effect the sequence of the call order was as follows: assertEquals(1, BRHF.drop(b1)); b2.getPreviousBuild(); // happens indirectly in PeepholePermalink b2.delete(); FreeStyleBuild b1a = b2.getPreviousBuild(); This defeats the purpose of the fix, because in this call sequence, by b2.dropLinks() as implemented before f1430a2 will correctly fix up b1a.nextBuildR to b3. For the test to work as intended, it is important that b2.previousBuildR.get()==null during dropLinks(). That is what causes b2.getPreviousBuild() in the next line to go load b1a, and sets up b1a.nextBuildR to b2, and trigger the assertion error. Added a code to remove all RunListeners. With this change, the test now fails without the fix in f1430a2. (cherry picked from commit 7b1b50c85a6a4bdb75f1ae5b40885afa45f96b82)

          Code changed in jenkins
          User: Kohsuke Kawaguchi
          Path:
          core/src/main/java/jenkins/model/lazy/BuildReference.java
          core/src/main/java/jenkins/model/lazy/LazyBuildMixIn.java
          http://jenkins-ci.org/commit/jenkins/4bfa16143705e219705ff667c6917a2d6a6a8939
          Log:
          [FIXED JENKINS-22395] redoing the fix in f1430a2

          Based on the last few commits, I proved that the original fix in f1430a2
          doesn't really address the problem.

          That is, once b2 is deleted, and after sufficient garbage collection,
          we can make b2.previousBuild.get() be null, and then
          b2.getPreviousBuild().getNextBuild() ends up incorrectly returning b2.

          In this commit, I roll back that part of f1430a2, and then fix the
          problem differently.

          I started thinking that the main problem we are trying to fix here
          is that the deleted build object should be unreferenceable. That is,
          it should behave almost as if the object has already been GCed.
          The easiest way to do this is to clear a BuildReference object,
          since we always use the same BuildReference object for all inbound
          references.

          This change allows us to clear BuildReference. Code like
          b2.getPreviousBuild() will continue to try to update
          b1.nextBuildR to b2, but it will only end up wiping out the field,
          only to have b1.getNextBuild() recompute the correct value.

          This fix makes both test cases pass in LazyBuildMixInTest.

          (cherry picked from commit b6226ad2d1a332cb661ceb5c5f5b673771118e14)

          SCM/JIRA link daemon added a comment - Code changed in jenkins User: Kohsuke Kawaguchi Path: core/src/main/java/jenkins/model/lazy/BuildReference.java core/src/main/java/jenkins/model/lazy/LazyBuildMixIn.java http://jenkins-ci.org/commit/jenkins/4bfa16143705e219705ff667c6917a2d6a6a8939 Log: [FIXED JENKINS-22395] redoing the fix in f1430a2 Based on the last few commits, I proved that the original fix in f1430a2 doesn't really address the problem. That is, once b2 is deleted, and after sufficient garbage collection, we can make b2.previousBuild.get() be null, and then b2.getPreviousBuild().getNextBuild() ends up incorrectly returning b2. In this commit, I roll back that part of f1430a2, and then fix the problem differently. I started thinking that the main problem we are trying to fix here is that the deleted build object should be unreferenceable. That is, it should behave almost as if the object has already been GCed. The easiest way to do this is to clear a BuildReference object, since we always use the same BuildReference object for all inbound references. This change allows us to clear BuildReference. Code like b2.getPreviousBuild() will continue to try to update b1.nextBuildR to b2, but it will only end up wiping out the field, only to have b1.getNextBuild() recompute the correct value. This fix makes both test cases pass in LazyBuildMixInTest. (cherry picked from commit b6226ad2d1a332cb661ceb5c5f5b673771118e14)

          Code changed in jenkins
          User: Kohsuke Kawaguchi
          Path:
          core/src/main/java/jenkins/model/lazy/LazyBuildMixIn.java
          http://jenkins-ci.org/commit/jenkins/ca156f730b9aedac9072f7243e3f1528330ca68b
          Log:
          JENKINS-22395 more diagnostics

          Just in case the previous fix didn't address the root cause of ZD-11985 (and given the time it takes for changes like this to land in LTS), I'm adding additional diagnostics that let us track how previous/next builds are getting discovered

          (cherry picked from commit aa8e0b4f0e312ba4ef9c647ccf2c093793a9bece)

          SCM/JIRA link daemon added a comment - Code changed in jenkins User: Kohsuke Kawaguchi Path: core/src/main/java/jenkins/model/lazy/LazyBuildMixIn.java http://jenkins-ci.org/commit/jenkins/ca156f730b9aedac9072f7243e3f1528330ca68b Log: JENKINS-22395 more diagnostics Just in case the previous fix didn't address the root cause of ZD-11985 (and given the time it takes for changes like this to land in LTS), I'm adding additional diagnostics that let us track how previous/next builds are getting discovered (cherry picked from commit aa8e0b4f0e312ba4ef9c647ccf2c093793a9bece)

          Code changed in jenkins
          User: Jesse Glick
          Path:
          test/src/test/java/jenkins/model/lazy/LazyBuildMixInTest.java
          http://jenkins-ci.org/commit/jenkins/579753403c63886cf0feff177602d79254829780
          Log:
          JENKINS-22395 Taking advantage of BuildReference.clear (just introduced in b6226ad) to simplify test by not requiring a custom build reference holder just to simulate GC.
          Confirmed that dropLinksAfterGC and dropLinksAfterGC2 both fail in the expected way (b1a.nextBuild == b2) after commenting out the call to createReference().clear() in dropLinks.
          (Also that they fail as expected in assertNotSame if the reference is not cleared at all.)

          (cherry picked from commit b7ec8578f938864af9c8dbfb73a54d67f972d694)

          SCM/JIRA link daemon added a comment - Code changed in jenkins User: Jesse Glick Path: test/src/test/java/jenkins/model/lazy/LazyBuildMixInTest.java http://jenkins-ci.org/commit/jenkins/579753403c63886cf0feff177602d79254829780 Log: JENKINS-22395 Taking advantage of BuildReference.clear (just introduced in b6226ad) to simplify test by not requiring a custom build reference holder just to simulate GC. Confirmed that dropLinksAfterGC and dropLinksAfterGC2 both fail in the expected way (b1a.nextBuild == b2) after commenting out the call to createReference().clear() in dropLinks. (Also that they fail as expected in assertNotSame if the reference is not cleared at all.) (cherry picked from commit b7ec8578f938864af9c8dbfb73a54d67f972d694)

          Sam Van Oort added a comment - - edited

          Reproducible again with a high enough build rate and pipeline (lots of concurrent or fast builds):

          Aug 08, 2017 5:53:08 PM org.jenkinsci.plugins.workflow.job.WorkflowRun finish
          INFO: sample #8540 completed: SUCCESS
          Aug 08, 2017 5:53:08 PM org.jenkinsci.plugins.workflow.job.WorkflowRun finish
          WARNING: failed to save sample #8541 or perform log rotation
          java.io.IOException: sample #8527: /Users/svanoort/Documents/jenkins-plugins-oss/random-job-builder-plugin/work/jobs/sample/builds/8527 looks to have already been deleted; siblings: [.8527, 4568, 4569, 4570, 4571, 8528, 8529, 8530, 8531, 8532, 8533, 8534, 8535, 8536, 8537, 8538, 8539, 8540, 8541, 8542, 8543, 8544, 8545, 8546, 8547, lastFailedBuild, lastStableBuild, lastSuccessfulBuild, lastUnstableBuild, lastUnsuccessfulBuild, legacyIds]
          at hudson.model.Run.delete(Run.java:1480)
          at hudson.tasks.LogRotator.perform(LogRotator.java:130)
          at hudson.model.Job.logRotate(Job.java:478)
          at org.jenkinsci.plugins.workflow.job.WorkflowRun.finish(WorkflowRun.java:597)
          at org.jenkinsci.plugins.workflow.job.WorkflowRun.access$1200(WorkflowRun.java:120)
          at org.jenkinsci.plugins.workflow.job.WorkflowRun$GraphL.onNewHead(WorkflowRun.java:865)
          at org.jenkinsci.plugins.workflow.cps.CpsFlowExecution.notifyListeners(CpsFlowExecution.java:1081)
          at org.jenkinsci.plugins.workflow.cps.CpsThreadGroup$3.run(CpsThreadGroup.java:400)
          at org.jenkinsci.plugins.workflow.cps.CpsVmExecutorService$1.run(CpsVmExecutorService.java:35)
          at hudson.remoting.SingleLaneExecutorService$1.run(SingleLaneExecutorService.java:112)
          at jenkins.util.ContextResettingExecutorService$1.run(ContextResettingExecutorService.java:28)
          at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
          at java.util.concurrent.FutureTask.run(FutureTask.java:266)
          at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
          at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
          at java.lang.Thread.run(Thread.java:745)

          Sam Van Oort added a comment - - edited Reproducible again with a high enough build rate and pipeline (lots of concurrent or fast builds): Aug 08, 2017 5:53:08 PM org.jenkinsci.plugins.workflow.job.WorkflowRun finish INFO: sample #8540 completed: SUCCESS Aug 08, 2017 5:53:08 PM org.jenkinsci.plugins.workflow.job.WorkflowRun finish WARNING: failed to save sample #8541 or perform log rotation java.io.IOException: sample #8527: /Users/svanoort/Documents/jenkins-plugins-oss/random-job-builder-plugin/work/jobs/sample/builds/8527 looks to have already been deleted; siblings: [.8527, 4568, 4569, 4570, 4571, 8528, 8529, 8530, 8531, 8532, 8533, 8534, 8535, 8536, 8537, 8538, 8539, 8540, 8541, 8542, 8543, 8544, 8545, 8546, 8547, lastFailedBuild, lastStableBuild, lastSuccessfulBuild, lastUnstableBuild, lastUnsuccessfulBuild, legacyIds] at hudson.model.Run.delete(Run.java:1480) at hudson.tasks.LogRotator.perform(LogRotator.java:130) at hudson.model.Job.logRotate(Job.java:478) at org.jenkinsci.plugins.workflow.job.WorkflowRun.finish(WorkflowRun.java:597) at org.jenkinsci.plugins.workflow.job.WorkflowRun.access$1200(WorkflowRun.java:120) at org.jenkinsci.plugins.workflow.job.WorkflowRun$GraphL.onNewHead(WorkflowRun.java:865) at org.jenkinsci.plugins.workflow.cps.CpsFlowExecution.notifyListeners(CpsFlowExecution.java:1081) at org.jenkinsci.plugins.workflow.cps.CpsThreadGroup$3.run(CpsThreadGroup.java:400) at org.jenkinsci.plugins.workflow.cps.CpsVmExecutorService$1.run(CpsVmExecutorService.java:35) at hudson.remoting.SingleLaneExecutorService$1.run(SingleLaneExecutorService.java:112) at jenkins.util.ContextResettingExecutorService$1.run(ContextResettingExecutorService.java:28) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745)

          Jesse Glick added a comment -

          Original bug was indeed fixed, backported, etc. so what you are seeing svanoort is something else with similar symptoms. Better to file a fresh issue and link it to this one.

          Jesse Glick added a comment - Original bug was indeed fixed, backported, etc. so what you are seeing svanoort is something else with similar symptoms. Better to file a fresh issue and link it to this one.

            jglick Jesse Glick
            jglick Jesse Glick
            Votes:
            2 Vote for this issue
            Watchers:
            10 Start watching this issue

              Created:
              Updated:
              Resolved: