Uploaded image for project: 'Jenkins'
  1. Jenkins
  2. JENKINS-22395

Run.delete (from LogRotator) failing with "...looks to have already been deleted"

    XMLWordPrintable

Details

    Description

      Sometimes "log rotation" fails with an exception like

      SEVERE  hudson.model.Run#execute: Failed to rotate log
      java.io.IOException: .../jobs/.../modules/...$.../builds/2013-... looks to have already been deleted
          at hudson.model.Run.delete(Run.java:1432)
          at hudson.maven.MavenModuleSetBuild.delete(MavenModuleSetBuild.java:420)
          at hudson.tasks.LogRotator.perform(LogRotator.java:136)
          at hudson.model.Job.logRotate(Job.java:437)
          at hudson.maven.MavenModuleSet.logRotate(MavenModuleSet.java:851)
          at hudson.model.Run.execute(Run.java:1728)
          at hudson.maven.MavenModuleSetBuild.run(MavenModuleSetBuild.java:509)
          at hudson.model.ResourceController.execute(ResourceController.java:88)
          at hudson.model.Executor.run(Executor.java:246)
      

      Usually MavenModuleSetBuild is involved, perhaps suggesting a problem with deleted or skipped module builds (I cannot reproduce in either scenario), though I think I have also seen this happen for freestyle builds. Unclear whether the directory actually exists but File.isDirectory thinks it does not (the code originally tried to delete the dir without checking this and failed with "... is in use"); or whether the directory was actually deleted earlier but this Run was not cleaned up properly for some reason.

      Hypothesis: AbstractLazyLoadRunMap.idOnDisk is not removed by removeValue, called from AbstractProject.removeRun (from Run.delete). Perhaps something is later resurrecting the AbstractBuild from idOnDisk, making it again available for another round of log rotation? But if the actual directory was deleted, it is hard to see how: load would just fail.

      Or perhaps Run.delete is the race condition: the run is removed from the parent after its directory has been deleted. The method is synchronized, but that does not help if two copies of the Run exist, which might happen due to other lazy-loading bugs.

      Diagnostics added to date:

      Attachments

        Issue Links

          Activity

            Code changed in jenkins
            User: Kohsuke Kawaguchi
            Path:
            core/src/main/java/jenkins/model/lazy/BuildReference.java
            core/src/main/java/jenkins/model/lazy/LazyBuildMixIn.java
            http://jenkins-ci.org/commit/jenkins/4bfa16143705e219705ff667c6917a2d6a6a8939
            Log:
            [FIXED JENKINS-22395] redoing the fix in f1430a2

            Based on the last few commits, I proved that the original fix in f1430a2
            doesn't really address the problem.

            That is, once b2 is deleted, and after sufficient garbage collection,
            we can make b2.previousBuild.get() be null, and then
            b2.getPreviousBuild().getNextBuild() ends up incorrectly returning b2.

            In this commit, I roll back that part of f1430a2, and then fix the
            problem differently.

            I started thinking that the main problem we are trying to fix here
            is that the deleted build object should be unreferenceable. That is,
            it should behave almost as if the object has already been GCed.
            The easiest way to do this is to clear a BuildReference object,
            since we always use the same BuildReference object for all inbound
            references.

            This change allows us to clear BuildReference. Code like
            b2.getPreviousBuild() will continue to try to update
            b1.nextBuildR to b2, but it will only end up wiping out the field,
            only to have b1.getNextBuild() recompute the correct value.

            This fix makes both test cases pass in LazyBuildMixInTest.

            (cherry picked from commit b6226ad2d1a332cb661ceb5c5f5b673771118e14)

            scm_issue_link SCM/JIRA link daemon added a comment - Code changed in jenkins User: Kohsuke Kawaguchi Path: core/src/main/java/jenkins/model/lazy/BuildReference.java core/src/main/java/jenkins/model/lazy/LazyBuildMixIn.java http://jenkins-ci.org/commit/jenkins/4bfa16143705e219705ff667c6917a2d6a6a8939 Log: [FIXED JENKINS-22395] redoing the fix in f1430a2 Based on the last few commits, I proved that the original fix in f1430a2 doesn't really address the problem. That is, once b2 is deleted, and after sufficient garbage collection, we can make b2.previousBuild.get() be null, and then b2.getPreviousBuild().getNextBuild() ends up incorrectly returning b2. In this commit, I roll back that part of f1430a2, and then fix the problem differently. I started thinking that the main problem we are trying to fix here is that the deleted build object should be unreferenceable. That is, it should behave almost as if the object has already been GCed. The easiest way to do this is to clear a BuildReference object, since we always use the same BuildReference object for all inbound references. This change allows us to clear BuildReference. Code like b2.getPreviousBuild() will continue to try to update b1.nextBuildR to b2, but it will only end up wiping out the field, only to have b1.getNextBuild() recompute the correct value. This fix makes both test cases pass in LazyBuildMixInTest. (cherry picked from commit b6226ad2d1a332cb661ceb5c5f5b673771118e14)

            Code changed in jenkins
            User: Kohsuke Kawaguchi
            Path:
            core/src/main/java/jenkins/model/lazy/LazyBuildMixIn.java
            http://jenkins-ci.org/commit/jenkins/ca156f730b9aedac9072f7243e3f1528330ca68b
            Log:
            JENKINS-22395 more diagnostics

            Just in case the previous fix didn't address the root cause of ZD-11985 (and given the time it takes for changes like this to land in LTS), I'm adding additional diagnostics that let us track how previous/next builds are getting discovered

            (cherry picked from commit aa8e0b4f0e312ba4ef9c647ccf2c093793a9bece)

            scm_issue_link SCM/JIRA link daemon added a comment - Code changed in jenkins User: Kohsuke Kawaguchi Path: core/src/main/java/jenkins/model/lazy/LazyBuildMixIn.java http://jenkins-ci.org/commit/jenkins/ca156f730b9aedac9072f7243e3f1528330ca68b Log: JENKINS-22395 more diagnostics Just in case the previous fix didn't address the root cause of ZD-11985 (and given the time it takes for changes like this to land in LTS), I'm adding additional diagnostics that let us track how previous/next builds are getting discovered (cherry picked from commit aa8e0b4f0e312ba4ef9c647ccf2c093793a9bece)

            Code changed in jenkins
            User: Jesse Glick
            Path:
            test/src/test/java/jenkins/model/lazy/LazyBuildMixInTest.java
            http://jenkins-ci.org/commit/jenkins/579753403c63886cf0feff177602d79254829780
            Log:
            JENKINS-22395 Taking advantage of BuildReference.clear (just introduced in b6226ad) to simplify test by not requiring a custom build reference holder just to simulate GC.
            Confirmed that dropLinksAfterGC and dropLinksAfterGC2 both fail in the expected way (b1a.nextBuild == b2) after commenting out the call to createReference().clear() in dropLinks.
            (Also that they fail as expected in assertNotSame if the reference is not cleared at all.)

            (cherry picked from commit b7ec8578f938864af9c8dbfb73a54d67f972d694)

            scm_issue_link SCM/JIRA link daemon added a comment - Code changed in jenkins User: Jesse Glick Path: test/src/test/java/jenkins/model/lazy/LazyBuildMixInTest.java http://jenkins-ci.org/commit/jenkins/579753403c63886cf0feff177602d79254829780 Log: JENKINS-22395 Taking advantage of BuildReference.clear (just introduced in b6226ad) to simplify test by not requiring a custom build reference holder just to simulate GC. Confirmed that dropLinksAfterGC and dropLinksAfterGC2 both fail in the expected way (b1a.nextBuild == b2) after commenting out the call to createReference().clear() in dropLinks. (Also that they fail as expected in assertNotSame if the reference is not cleared at all.) (cherry picked from commit b7ec8578f938864af9c8dbfb73a54d67f972d694)
            svanoort Sam Van Oort added a comment - - edited

            Reproducible again with a high enough build rate and pipeline (lots of concurrent or fast builds):

            Aug 08, 2017 5:53:08 PM org.jenkinsci.plugins.workflow.job.WorkflowRun finish
            INFO: sample #8540 completed: SUCCESS
            Aug 08, 2017 5:53:08 PM org.jenkinsci.plugins.workflow.job.WorkflowRun finish
            WARNING: failed to save sample #8541 or perform log rotation
            java.io.IOException: sample #8527: /Users/svanoort/Documents/jenkins-plugins-oss/random-job-builder-plugin/work/jobs/sample/builds/8527 looks to have already been deleted; siblings: [.8527, 4568, 4569, 4570, 4571, 8528, 8529, 8530, 8531, 8532, 8533, 8534, 8535, 8536, 8537, 8538, 8539, 8540, 8541, 8542, 8543, 8544, 8545, 8546, 8547, lastFailedBuild, lastStableBuild, lastSuccessfulBuild, lastUnstableBuild, lastUnsuccessfulBuild, legacyIds]
            at hudson.model.Run.delete(Run.java:1480)
            at hudson.tasks.LogRotator.perform(LogRotator.java:130)
            at hudson.model.Job.logRotate(Job.java:478)
            at org.jenkinsci.plugins.workflow.job.WorkflowRun.finish(WorkflowRun.java:597)
            at org.jenkinsci.plugins.workflow.job.WorkflowRun.access$1200(WorkflowRun.java:120)
            at org.jenkinsci.plugins.workflow.job.WorkflowRun$GraphL.onNewHead(WorkflowRun.java:865)
            at org.jenkinsci.plugins.workflow.cps.CpsFlowExecution.notifyListeners(CpsFlowExecution.java:1081)
            at org.jenkinsci.plugins.workflow.cps.CpsThreadGroup$3.run(CpsThreadGroup.java:400)
            at org.jenkinsci.plugins.workflow.cps.CpsVmExecutorService$1.run(CpsVmExecutorService.java:35)
            at hudson.remoting.SingleLaneExecutorService$1.run(SingleLaneExecutorService.java:112)
            at jenkins.util.ContextResettingExecutorService$1.run(ContextResettingExecutorService.java:28)
            at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
            at java.util.concurrent.FutureTask.run(FutureTask.java:266)
            at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
            at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
            at java.lang.Thread.run(Thread.java:745)

            svanoort Sam Van Oort added a comment - - edited Reproducible again with a high enough build rate and pipeline (lots of concurrent or fast builds): Aug 08, 2017 5:53:08 PM org.jenkinsci.plugins.workflow.job.WorkflowRun finish INFO: sample #8540 completed: SUCCESS Aug 08, 2017 5:53:08 PM org.jenkinsci.plugins.workflow.job.WorkflowRun finish WARNING: failed to save sample #8541 or perform log rotation java.io.IOException: sample #8527: /Users/svanoort/Documents/jenkins-plugins-oss/random-job-builder-plugin/work/jobs/sample/builds/8527 looks to have already been deleted; siblings: [.8527, 4568, 4569, 4570, 4571, 8528, 8529, 8530, 8531, 8532, 8533, 8534, 8535, 8536, 8537, 8538, 8539, 8540, 8541, 8542, 8543, 8544, 8545, 8546, 8547, lastFailedBuild, lastStableBuild, lastSuccessfulBuild, lastUnstableBuild, lastUnsuccessfulBuild, legacyIds] at hudson.model.Run.delete(Run.java:1480) at hudson.tasks.LogRotator.perform(LogRotator.java:130) at hudson.model.Job.logRotate(Job.java:478) at org.jenkinsci.plugins.workflow.job.WorkflowRun.finish(WorkflowRun.java:597) at org.jenkinsci.plugins.workflow.job.WorkflowRun.access$1200(WorkflowRun.java:120) at org.jenkinsci.plugins.workflow.job.WorkflowRun$GraphL.onNewHead(WorkflowRun.java:865) at org.jenkinsci.plugins.workflow.cps.CpsFlowExecution.notifyListeners(CpsFlowExecution.java:1081) at org.jenkinsci.plugins.workflow.cps.CpsThreadGroup$3.run(CpsThreadGroup.java:400) at org.jenkinsci.plugins.workflow.cps.CpsVmExecutorService$1.run(CpsVmExecutorService.java:35) at hudson.remoting.SingleLaneExecutorService$1.run(SingleLaneExecutorService.java:112) at jenkins.util.ContextResettingExecutorService$1.run(ContextResettingExecutorService.java:28) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745)
            jglick Jesse Glick added a comment -

            Original bug was indeed fixed, backported, etc. so what you are seeing svanoort is something else with similar symptoms. Better to file a fresh issue and link it to this one.

            jglick Jesse Glick added a comment - Original bug was indeed fixed, backported, etc. so what you are seeing svanoort is something else with similar symptoms. Better to file a fresh issue and link it to this one.

            People

              jglick Jesse Glick
              jglick Jesse Glick
              Votes:
              2 Vote for this issue
              Watchers:
              10 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: