
Exceptions due to race condition during build deletion

    • jenkins-2.253

      Messages like the following are being logged on ci.jenkins.io after updating to 2.222.1.

      2020-03-25 19:18:34.958+0000 [id=21774]	WARNING	j.m.BackgroundGlobalBuildDiscarder#lambda$processJob$0: An exception occurred when executing Project Build Discarder
      Also:   java.nio.file.NoSuchFileException: /var/jenkins_home/jobs/Infra/jobs/plugin-site-api/branches/generate-data/builds/93295 -> /var/jenkins_home/jobs/Infra/jobs/plugin-site-api/branches/generate-data/builds/.93295
      		at java.base/sun.nio.fs.UnixException.translateToIOException(UnixException.java:92)
      		at java.base/sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:111)
      		at java.base/sun.nio.fs.UnixCopyFile.move(UnixCopyFile.java:417)
      		at java.base/sun.nio.fs.UnixFileSystemProvider.move(UnixFileSystemProvider.java:267)
      		at java.base/java.nio.file.Files.move(Files.java:1421)
      		at hudson.model.Run.delete(Run.java:1621)
      		at hudson.tasks.LogRotator.perform(LogRotator.java:166)
      jenkins.util.io.CompositeIOException: Failed to rotate logs for [Infra/plugin-site-api/generate-data #93295]
      	at hudson.tasks.LogRotator.perform(LogRotator.java:223)
      	at hudson.model.Job.logRotate(Job.java:469)
      	at jenkins.model.JobGlobalBuildDiscarderStrategy.apply(JobGlobalBuildDiscarderStrategy.java:54)
      	at jenkins.model.BackgroundGlobalBuildDiscarder.lambda$processJob$0(BackgroundGlobalBuildDiscarder.java:67)
      	at java.base/java.lang.Iterable.forEach(Iterable.java:75)
      	at jenkins.model.BackgroundGlobalBuildDiscarder.processJob(BackgroundGlobalBuildDiscarder.java:61)
      	at jenkins.model.GlobalBuildDiscarderListener.onFinalized(GlobalBuildDiscarderListener.java:49)
      	at hudson.model.listeners.RunListener.fireFinalized(RunListener.java:255)
      	at hudson.model.Run.onEndBuilding(Run.java:2018)
      	at org.jenkinsci.plugins.workflow.job.WorkflowRun.finish(WorkflowRun.java:617)
      	at org.jenkinsci.plugins.workflow.job.WorkflowRun.access$800(WorkflowRun.java:137)
      	at org.jenkinsci.plugins.workflow.job.WorkflowRun$GraphL.onNewHead(WorkflowRun.java:1018)
      	at org.jenkinsci.plugins.workflow.cps.CpsFlowExecution.notifyListeners(CpsFlowExecution.java:1463)
      	at org.jenkinsci.plugins.workflow.cps.CpsThreadGroup$3.run(CpsThreadGroup.java:488)
      	at org.jenkinsci.plugins.workflow.cps.CpsVmExecutorService$1.run(CpsVmExecutorService.java:38)
      	at hudson.remoting.SingleLaneExecutorService$1.run(SingleLaneExecutorService.java:131)
      	at jenkins.util.ContextResettingExecutorService$1.run(ContextResettingExecutorService.java:28)
      	at jenkins.security.ImpersonatingExecutorService$1.run(ImpersonatingExecutorService.java:59)
      	at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
      	at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
      	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
      	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
      	at java.base/java.lang.Thread.run(Thread.java:834) 

      or

      2020-03-25 19:28:42.053+0000 [id=56]	WARNING	o.j.p.workflow.job.WorkflowRun#lambda$finish$2: failed to perform log rotation after Infra/plugin-site-api/generate-data #93303
      Also:   java.nio.file.NoSuchFileException: /var/jenkins_home/jobs/Infra/jobs/plugin-site-api/branches/generate-data/builds/93296 -> /var/jenkins_home/jobs/Infra/jobs/plugin-site-api/branches/generate-data/builds/.93296
      		at java.base/sun.nio.fs.UnixException.translateToIOException(UnixException.java:92)
      		at java.base/sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:111)
      		at java.base/sun.nio.fs.UnixCopyFile.move(UnixCopyFile.java:417)
      		at java.base/sun.nio.fs.UnixFileSystemProvider.move(UnixFileSystemProvider.java:267)
      		at java.base/java.nio.file.Files.move(Files.java:1421)
      		at hudson.model.Run.delete(Run.java:1621)
      		at hudson.tasks.LogRotator.perform(LogRotator.java:166)
      jenkins.util.io.CompositeIOException: Failed to rotate logs for [Infra/plugin-site-api/generate-data #93296]
      	at hudson.tasks.LogRotator.perform(LogRotator.java:223)
      	at hudson.model.Job.logRotate(Job.java:469)
      	at org.jenkinsci.plugins.workflow.job.WorkflowRun.lambda$finish$2(WorkflowRun.java:612)
      	at jenkins.security.ImpersonatingScheduledExecutorService$1.run(ImpersonatingScheduledExecutorService.java:58)
      	at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
      	at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
      	at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304)
      	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
      	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
      	at java.base/java.lang.Thread.run(Thread.java:834) 

      None of them occurred while background deletion was ongoing.

          [JENKINS-61687] Exceptions due to race condition during build deletion

          Daniel Beck added a comment -

          (I hypothesize that each build runs Project Build Discarder twice in parallel and those race against each other)

          I don't remember how far along I was in the investigation when I posted the comment, but IIRC this is correct.

          Specifically, the "background" build discarders are also run after a build finishes, in addition to the regular build discarder. The idea being, why should it only run periodically, when it could as well run when a build is finished, and not accumulate a lot of builds for its periodic run?

          https://github.com/jenkinsci/jenkins/blob/da90af311587f6c3d37ec4e9c4b4637763924743/core/src/main/java/jenkins/model/GlobalBuildDiscarderListener.java#L46-L50

          Now, if you configure the "job specific build discarder" as part of global/background build discarders (the default), there'll be two build discarder instantiations running after a build is finished, both configured to delete the exact same builds. So this configuration basically creates the circumstances under which the (AFAICT) long-standing concurrency bug occurs. (Strictly speaking, any global build discarder could trigger this, but if the configurations are different, it's likely that only one build discarder will ever find builds to delete.)

          A workaround for this specific case only could be to skip the "project specific build discarder" when we're triggered by a build finishing (i.e. only run it when triggered by the periodic run), but that seems like a hack, and it doesn't cover fixed global build discarders that match the job configuration exactly.

          A more reasonable workaround: don't run multiple build discarders on a job in parallel. There's currently no synchronization there.

          Ideally, however, we'd fix the concurrency issues in build deletion; then neither builds in quick succession nor background build discarders would be a problem.
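
          The "don't run multiple build discarders on a job in parallel" workaround could be sketched roughly as below. This is only an illustration under stated assumptions: `PerJobDiscardLock`, `runExclusively` and `maxObservedConcurrency` are hypothetical names, not Jenkins APIs, and the job is identified by a plain string rather than a real `Job` object. The change made in the actual PR may look quite different.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical helper (not part of Jenkins): serializes discarder runs per job.
final class PerJobDiscardLock {
    // One lock per job name; entries are never evicted in this sketch.
    private static final ConcurrentHashMap<String, ReentrantLock> LOCKS = new ConcurrentHashMap<>();

    /** Runs the discarder while holding the lock that belongs to the given job. */
    static void runExclusively(String jobFullName, Runnable discarder) {
        ReentrantLock lock = LOCKS.computeIfAbsent(jobFullName, k -> new ReentrantLock());
        lock.lock();
        try {
            discarder.run();
        } finally {
            lock.unlock();
        }
    }

    /** Starts several "discarders" for one job and reports how many ever ran at the same time. */
    static int maxObservedConcurrency(int threads) {
        AtomicInteger active = new AtomicInteger();
        AtomicInteger max = new AtomicInteger();
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(() -> runExclusively("Infra/plugin-site-api/generate-data", () -> {
                int now = active.incrementAndGet();
                max.accumulateAndGet(now, Math::max);
                try {
                    Thread.sleep(10); // stand-in for the actual deletion work
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
                active.decrementAndGet();
            }));
            workers[i].start();
        }
        for (Thread t : workers) {
            try {
                t.join();
            } catch (InterruptedException e) {
                throw new IllegalStateException(e);
            }
        }
        return max.get();
    }

    public static void main(String[] args) {
        System.out.println("max concurrent discarders: " + maxObservedConcurrency(4));
    }
}
```

          A per-job lock only hides the race between discarders; it would not help if some other code path deletes the same build at the same time, which is why fixing build deletion itself remains the preferable option.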

          Kalle Niemitalo added a comment -

          Would it be reasonable to specifically catch and swallow java.nio.file.NoSuchFileException from the Files.move call? Or would that just surface more concurrency problems in other parts of hudson.model.Run.delete?
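
          For illustration, catching `NoSuchFileException` around a `Files.move` call might look like the sketch below. `TolerantMove`, `moveIfPresent` and `demo` are hypothetical names, and the code is not taken from `hudson.model.Run.delete`. It treats a missing source as "someone else already moved it", which is an assumption; the same exception can have other causes, and the sketch says nothing about whether the rest of the deletion is safe under concurrency.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.NoSuchFileException;
import java.nio.file.Path;

// Hypothetical helper (not Jenkins code): a move that tolerates a missing source.
final class TolerantMove {
    /** Returns true if this caller performed the move, false if the source no longer existed. */
    static boolean moveIfPresent(Path source, Path target) throws IOException {
        try {
            Files.move(source, target);
            return true;
        } catch (NoSuchFileException e) {
            // Assumption: the source vanished because a concurrent deletion got there first.
            return false;
        }
    }

    /** Moves a temporary "build directory" twice and returns both results. */
    static boolean[] demo() {
        try {
            Path builds = Files.createTempDirectory("builds");
            Path build = Files.createDirectory(builds.resolve("8"));
            Path hidden = builds.resolve(".8");
            boolean first = moveIfPresent(build, hidden);  // source exists, so the move happens
            boolean second = moveIfPresent(build, hidden); // source is gone, exception is swallowed
            Files.deleteIfExists(hidden);
            Files.deleteIfExists(builds);
            return new boolean[] { first, second };
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        boolean[] results = demo();
        System.out.println("first move: " + results[0] + ", second move: " + results[1]);
    }
}
```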

          Zhenlei Huang added a comment -

          danielbeck I managed to reproduce this issue by a simple pipeline job.

          properties ([buildDiscarder(logRotator(numToKeepStr: '1'))])
          
          node {
              echo "Running ..."
          }
          

          Environment : Jenkins 2.235.2, Pipeline 2.6, with `Project Build Discarder` configured.

          Steps to reproduce:

          1. Create a pipeline job with the above script.
          2. Hit `Build Now` for the initial build.
          3. Repeat step 2 and you will get `java.nio.file.NoSuchFileException` in system log.

          Zhenlei Huang added a comment -

          For the issue in the above simple pipeline job, I can conclude it was caused by concurrency in these lines:
          https://github.com/jenkinsci/workflow-job-plugin/blob/81fa9191779d54f0641b3ff0cec4aeeb3dad3bbb/src/main/java/org/jenkinsci/plugins/workflow/job/WorkflowRun.java#L618-L625

                      Timer.get().submit(() -> {
                          try {
                              getParent().logRotate();
                          } catch (Exception x) {
                              LOGGER.log(Level.WARNING, "failed to perform log rotation after " + this, x);
                          }
                      });
                      onEndBuilding();
          

          `onEndBuilding()` calls `RunListener.fireFinalized()`, which eventually calls `GlobalBuildDiscarderListener#onFinalized()`, so two threads end up running `logRotate` concurrently.
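
          The effect of two threads deleting the same build at once can be reproduced outside Jenkins with a small standalone sketch. `DoubleLogRotateRace`, `raceOnce` and `successCount` are hypothetical names; the two tasks stand in for the `Timer` task and the listener call, and each just performs the same rename that appears at the top of the stack traces. Exactly one rename can win; the loser gets an `IOException`, which depending on timing may be a `NoSuchFileException` or a `FileAlreadyExistsException`.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.concurrent.Callable;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical standalone reproduction, not Jenkins code.
final class DoubleLogRotateRace {
    /** Runs two concurrent renames of one directory; each slot is null on success, else the failure. */
    static IOException[] raceOnce() {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        try {
            Path builds = Files.createTempDirectory("builds");
            Path build = Files.createDirectory(builds.resolve("8"));
            Path hidden = builds.resolve(".8");
            CountDownLatch start = new CountDownLatch(1);
            Callable<IOException> delete = () -> {
                start.await(); // release both tasks at the same moment
                try {
                    Files.move(build, hidden);
                    return null;
                } catch (IOException e) {
                    return e;
                }
            };
            Future<IOException> a = pool.submit(delete);
            Future<IOException> b = pool.submit(delete);
            start.countDown();
            IOException[] outcomes = { a.get(), b.get() };
            Files.deleteIfExists(hidden);
            Files.deleteIfExists(builds);
            return outcomes;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdownNow();
        }
    }

    /** Counts how many of the simulated deletions succeeded. */
    static long successCount(IOException[] outcomes) {
        long ok = 0;
        for (IOException e : outcomes) {
            if (e == null) {
                ok++;
            }
        }
        return ok;
    }

    public static void main(String[] args) {
        for (IOException e : raceOnce()) {
            System.out.println(e == null ? "moved" : "failed: " + e);
        }
    }
}
```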

          Zhenlei Huang added a comment -

          danielbeck Draft PR filed: https://github.com/jenkinsci/jenkins/pull/4850

          Sorry, I have not tested this locally, as the change was made online. My internet connection is too slow to clone the jenkins repository.

          I'll report back later when the regression tests are ready.

          Zhenlei Huang added a comment -

          Local test looks good with the fix

          Allan BURDAJEWICZ added a comment -

          danielbeck By the way, isn't the "JobGlobalBuildDiscarderStrategy" supposed to run periodically? According to the documentation:

          Build discarders configured for a job are only run after a build finishes. This option runs jobs' configured build discarders periodically, applying configuration changes even when no new builds are run. This option has no effect if there is no build discarder configured for a job.

          Daniel Beck added a comment -

          allan_burdajewicz

          It does; and additionally it runs global build discarders once on a project when a build finishes. I thought it makes no sense to wait up to an hour (IIRC) to delete builds. I think the result is much nicer this way (except for exposing this bug in many situations, of course). The documentation probably just didn't keep up with the evolution of the feature and could be improved.

          Oleg Nenashev added a comment -

          Created JENKINS-63275 as a follow-up discussed in the PR

          Antonio Petricca added a comment -

          Same problem for me:

          2020-08-26 07:26:05.412+0000 [id=511] WARNING j.m.BackgroundGlobalBuildDiscarder#lambda$processJob$0: An exception occurred when executing Project Build Discarder
          Also:   java.nio.file.NoSuchFileException: /var/jenkins_home/jobs/simp/branches/tasks-BET-31993.du2529/builds/8 -> /var/jenkins_home/jobs/simp/branches/tasks-BET-31993.du2529/builds/.8
                  at sun.nio.fs.UnixException.translateToIOException(UnixException.java:86)
                  at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
                  at sun.nio.fs.UnixCopyFile.move(UnixCopyFile.java:396)
                  at sun.nio.fs.UnixFileSystemProvider.move(UnixFileSystemProvider.java:262)
                  at java.nio.file.Files.move(Files.java:1395)
                  at hudson.model.Run.delete(Run.java:1645)
                  at hudson.tasks.LogRotator.perform(LogRotator.java:166)
          jenkins.util.io.CompositeIOException: Failed to rotate logs for [simp/tasks%2FBET-31993 #8]
              at hudson.tasks.LogRotator.perform(LogRotator.java:223)
              at hudson.model.Job.logRotate(Job.java:469)
              at jenkins.model.JobGlobalBuildDiscarderStrategy.apply(JobGlobalBuildDiscarderStrategy.java:54)
              at jenkins.model.BackgroundGlobalBuildDiscarder.lambda$processJob$0(BackgroundGlobalBuildDiscarder.java:67)
              at java.lang.Iterable.forEach(Iterable.java:75)
              at jenkins.model.BackgroundGlobalBuildDiscarder.processJob(BackgroundGlobalBuildDiscarder.java:61)
              at jenkins.model.GlobalBuildDiscarderListener.onFinalized(GlobalBuildDiscarderListener.java:49)
              at hudson.model.listeners.RunListener.fireFinalized(RunListener.java:255)
              at hudson.model.Run.onEndBuilding(Run.java:2042)
              at org.jenkinsci.plugins.workflow.job.WorkflowRun.finish(WorkflowRun.java:625)
              at org.jenkinsci.plugins.workflow.job.WorkflowRun.access$800(WorkflowRun.java:137)
              at org.jenkinsci.plugins.workflow.job.WorkflowRun$GraphL.onNewHead(WorkflowRun.java:1026)
              at org.jenkinsci.plugins.workflow.cps.CpsFlowExecution.notifyListeners(CpsFlowExecution.java:1463)
              at org.jenkinsci.plugins.workflow.cps.CpsThreadGroup$3.run(CpsThreadGroup.java:489)
              at org.jenkinsci.plugins.workflow.cps.CpsVmExecutorService$1.run(CpsVmExecutorService.java:38)
              at hudson.remoting.SingleLaneExecutorService$1.run(SingleLaneExecutorService.java:131)
              at jenkins.util.ContextResettingExecutorService$1.run(ContextResettingExecutorService.java:28)
              at jenkins.security.ImpersonatingExecutorService$1.run(ImpersonatingExecutorService.java:59)
              at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
              at java.util.concurrent.FutureTask.run(FutureTask.java:266)
              at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
              at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
              at java.lang.Thread.run(Thread.java:748)

            Unassigned Unassigned
            danielbeck Daniel Beck
            Votes: 20
            Watchers: 32