Jenkins / JENKINS-53775

FileNotFoundException for program.dat when running a Pipeline Job concurrently with the Job DSL plugin

    • Type: Bug
    • Resolution: Fixed
    • Priority: Minor
    • job-dsl 1.76

      Unfortunately, I don't have an easy reproducer, but this bug happens to me fairly regularly (at least a few times a month). For this failure to occur:

      • A Freestyle job is running and performing a "Process Job DSLs" step. Part of this involves updating an existing Pipeline job.
      • That existing Pipeline job is also running (or starting) at around the same time as the "Process Job DSLs" step is trying to update it.

      When the timing is just right, the Pipeline job fails with:

      java.io.FileNotFoundException: /var/jenkins_home/jobs/devops-gate/jobs/projects/jobs/dx4linux/jobs/delphix-build-and-snapshots/jobs/ami-snapshots/builds/1173/program.dat (No such file or directory)
      	at java.io.FileInputStream.open0(Native Method)
      	at java.io.FileInputStream.open(FileInputStream.java:195)
      	at java.io.FileInputStream.<init>(FileInputStream.java:138)
      	at org.jenkinsci.plugins.workflow.support.pickles.serialization.RiverReader.openStreamAt(RiverReader.java:188)
      	at org.jenkinsci.plugins.workflow.support.pickles.serialization.RiverReader.restorePickles(RiverReader.java:136)
      	at org.jenkinsci.plugins.workflow.cps.CpsFlowExecution.loadProgramAsync(CpsFlowExecution.java:773)
      	at org.jenkinsci.plugins.workflow.cps.CpsFlowExecution.onLoad(CpsFlowExecution.java:739)
      	at org.jenkinsci.plugins.workflow.job.WorkflowRun.getExecution(WorkflowRun.java:875)
      	at org.jenkinsci.plugins.workflow.job.WorkflowRun.onLoad(WorkflowRun.java:745)
      	at hudson.model.RunMap.retrieve(RunMap.java:225)
      	at hudson.model.RunMap.retrieve(RunMap.java:57)
      	at jenkins.model.lazy.AbstractLazyLoadRunMap.load(AbstractLazyLoadRunMap.java:499)
      	at jenkins.model.lazy.AbstractLazyLoadRunMap.load(AbstractLazyLoadRunMap.java:481)
      	at jenkins.model.lazy.AbstractLazyLoadRunMap.getByNumber(AbstractLazyLoadRunMap.java:379)
      	at jenkins.model.lazy.LazyBuildMixIn.getBuildByNumber(LazyBuildMixIn.java:231)
      	at org.jenkinsci.plugins.workflow.job.WorkflowJob.getBuildByNumber(WorkflowJob.java:234)
      	at org.jenkinsci.plugins.workflow.job.WorkflowJob.getBuildByNumber(WorkflowJob.java:105)
      	at hudson.model.Run.fromExternalizableId(Run.java:2436)
      	at org.jenkinsci.plugins.workflow.support.steps.build.RunWrapper.getRawBuild(RunWrapper.java:71)
      	at org.jenkinsci.plugins.workflow.support.steps.build.RunWrapper.build(RunWrapper.java:75)
      	at org.jenkinsci.plugins.workflow.support.steps.build.RunWrapper.setResult(RunWrapper.java:87)
      	at sun.reflect.GeneratedMethodAccessor820.invoke(Unknown Source)
      	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
      	at java.lang.reflect.Method.invoke(Method.java:498)
      	at org.codehaus.groovy.reflection.CachedMethod.invoke(CachedMethod.java:93)
      	at groovy.lang.MetaMethod.doMethodInvoke(MetaMethod.java:325)
      	at groovy.lang.MetaClassImpl.setProperty(MetaClassImpl.java:2725)
      	at groovy.lang.MetaClassImpl.setProperty(MetaClassImpl.java:3770)
      	at org.codehaus.groovy.runtime.InvokerHelper.setProperty(InvokerHelper.java:201)
      	at org.codehaus.groovy.runtime.ScriptBytecodeAdapter.setProperty(ScriptBytecodeAdapter.java:484)
      	at org.kohsuke.groovy.sandbox.impl.Checker$7.call(Checker.java:347)
      	at org.kohsuke.groovy.sandbox.GroovyInterceptor.onSetProperty(GroovyInterceptor.java:84)
      	at org.jenkinsci.plugins.scriptsecurity.sandbox.groovy.SandboxInterceptor.onSetProperty(SandboxInterceptor.java:197)
      	at org.kohsuke.groovy.sandbox.impl.Checker$7.call(Checker.java:344)
      	at org.kohsuke.groovy.sandbox.impl.Checker.checkedSetProperty(Checker.java:351)
      	at com.cloudbees.groovy.cps.sandbox.SandboxInvoker.setProperty(SandboxInvoker.java:33)
      	at com.cloudbees.groovy.cps.impl.PropertyAccessBlock.rawSet(PropertyAccessBlock.java:24)
      	at com.cloudbees.groovy.cps.impl.PropertyishBlock$ContinuationImpl.set(PropertyishBlock.java:88)
      	at com.cloudbees.groovy.cps.impl.AssignmentBlock$ContinuationImpl.assignAndDone(AssignmentBlock.java:70)
      	at sun.reflect.GeneratedMethodAccessor706.invoke(Unknown Source)
      	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
      	at java.lang.reflect.Method.invoke(Method.java:498)
      	at com.cloudbees.groovy.cps.impl.ContinuationPtr$ContinuationImpl.receive(ContinuationPtr.java:72)
      	at com.cloudbees.groovy.cps.impl.ConstantBlock.eval(ConstantBlock.java:21)
      	at com.cloudbees.groovy.cps.Next.step(Next.java:83)
      	at com.cloudbees.groovy.cps.Continuable$1.call(Continuable.java:174)
      	at com.cloudbees.groovy.cps.Continuable$1.call(Continuable.java:163)
      	at org.codehaus.groovy.runtime.GroovyCategorySupport$ThreadCategoryInfo.use(GroovyCategorySupport.java:122)
      	at org.codehaus.groovy.runtime.GroovyCategorySupport.use(GroovyCategorySupport.java:261)
      	at com.cloudbees.groovy.cps.Continuable.run0(Continuable.java:163)
      	at org.jenkinsci.plugins.workflow.cps.SandboxContinuable.access$101(SandboxContinuable.java:34)
      	at org.jenkinsci.plugins.workflow.cps.SandboxContinuable.lambda$run0$0(SandboxContinuable.java:59)
      	at org.jenkinsci.plugins.scriptsecurity.sandbox.groovy.GroovySandbox.runInSandbox(GroovySandbox.java:108)
      	at org.jenkinsci.plugins.workflow.cps.SandboxContinuable.run0(SandboxContinuable.java:58)
      	at org.jenkinsci.plugins.workflow.cps.CpsThread.runNextChunk(CpsThread.java:174)
      	at org.jenkinsci.plugins.workflow.cps.CpsThreadGroup.run(CpsThreadGroup.java:332)
      	at org.jenkinsci.plugins.workflow.cps.CpsThreadGroup.access$200(CpsThreadGroup.java:83)
      	at org.jenkinsci.plugins.workflow.cps.CpsThreadGroup$2.call(CpsThreadGroup.java:244)
      	at org.jenkinsci.plugins.workflow.cps.CpsThreadGroup$2.call(CpsThreadGroup.java:232)
      	at org.jenkinsci.plugins.workflow.cps.CpsVmExecutorService$2.call(CpsVmExecutorService.java:64)
      Caused: java.io.IOException: Failed to load build state
      	at org.jenkinsci.plugins.workflow.cps.CpsFlowExecution$3.onSuccess(CpsFlowExecution.java:854)
      	at org.jenkinsci.plugins.workflow.cps.CpsFlowExecution$3.onSuccess(CpsFlowExecution.java:852)
      	at org.jenkinsci.plugins.workflow.cps.CpsFlowExecution$4$1.run(CpsFlowExecution.java:906)
      	at org.jenkinsci.plugins.workflow.cps.CpsVmExecutorService$1.run(CpsVmExecutorService.java:35)
      	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
      	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
      	at hudson.remoting.SingleLaneExecutorService$1.run(SingleLaneExecutorService.java:131)
      	at jenkins.util.ContextResettingExecutorService$1.run(ContextResettingExecutorService.java:28)
      	at jenkins.security.ImpersonatingExecutorService$1.run(ImpersonatingExecutorService.java:59)
      	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
      	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
      	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
      	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
      	at java.lang.Thread.run(Thread.java:748)
      Finished: FAILURE
      [Pipeline] stage
      [Pipeline] { (Cloning repos)
      [Pipeline] node
      19:23:07 Running on jenkins-agent3 in /var/tmp/jenkins_slaves/jenkins-ops/workspace/devops-gate/projects/dx4linux/delphix-build-and-snapshots/ami-snapshots
      [Pipeline] {
      [Pipeline] checkout
      19:23:07 Wiping out workspace first.
      19:23:08 Cloning the remote Git repository
      [...]
      [Pipeline] // timestamps
      [Pipeline] End of Pipeline
      Finished: SUCCESS
      

      Several things are unusual:

      • The job continues running, even after the Pipeline plugin printed "Finished: FAILURE."
      • If the job was running from a parent job, the parent job shows the job as failed (even though the child job is still running).
      • The job reaches the finally block of my pipeline and sends me a notification that it failed. However, the value of currentBuild.result is paradoxically SUCCESS, which leads me to believe that the pipeline somehow reached my finally block but not my catch block (which sets currentBuild.result to FAILURE on any exception). While the details of this behavior might be specific to my pipeline, the general behavior is very wonky.

          [JENKINS-53775] FileNotFoundException for program.dat when running a Pipeline Job concurrently with the Job DSL plugin

          Basil Crow added a comment -

          Here is the behavior I have observed:

          1. A WorkflowJob is created using Job DSL with concurrentBuilds false.
          2. The job starts running.
          3. Another build of the WorkflowJob is scheduled (due to an SCM commit) but remains in the queue, blocked on the first build.
          4. A Job DSL job runs again and updates the above WorkflowJob's job definition.
          5. At this point, what should be impossible happens: the second build starts running concurrently with the first build (even though concurrentBuilds is false in the DSL!). I can clearly see both running in the Jenkins classic UI.
          6. The first build fails with the java.io.FileNotFoundException mentioned above (program.dat (No such file or directory)).

          Here is some background information. When Job DSL creates a pipeline job with concurrentBuilds false, it emits the following XML in the flow definition:

          <concurrentBuild>false</concurrentBuild>
          

          Now, note that this field is deprecated in WorkflowJob:

          /** @deprecated replaced by {@link DisableConcurrentBuildsJobProperty} */
          private @CheckForNull Boolean concurrentBuild;
          

          In fact, the getter and setter in WorkflowJob use this deprecated field just to set a DisableConcurrentBuildsJobProperty property on the job:

              @Exported
              @Override public boolean isConcurrentBuild() {
                  return getProperty(DisableConcurrentBuildsJobProperty.class) == null;
              }
          [...]
              public void setConcurrentBuild(boolean b) throws IOException {
                  concurrentBuild = null;
          
                  boolean propertyExists = getProperty(DisableConcurrentBuildsJobProperty.class) != null;
          
                  // If the property exists, concurrent builds are disabled. So if the argument here is true and the
                  // property exists, we need to remove the property, while if the argument is false and the property
                  // does not exist, we need to add the property. Yay for flipping boolean values around!
                  if (propertyExists == b) {
                      BulkChange bc = new BulkChange(this);
                      try {
                          removeProperty(DisableConcurrentBuildsJobProperty.class);
                          if (!b) {
                              addProperty(new DisableConcurrentBuildsJobProperty());
                          }
                          bc.commit();
                      } finally {
                          bc.abort();
                      }
                  }
              }
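
The boolean flip in the setter quoted above is easy to misread, so here is a minimal sketch that enumerates its four cases. It is a standalone model, not Jenkins code: the class and the method names needsChange and propertyAfter are hypothetical, and only the guard `propertyExists == b` and the remove-then-add behavior are taken from the quoted source.

```java
public class ConcurrentBuildTruthTable {
    /** Mirrors the guard in the quoted setter: work is done only when propertyExists == b. */
    static boolean needsChange(boolean propertyExists, boolean b) {
        return propertyExists == b;
    }

    /** Whether the property is present once the setter has run with argument b. */
    static boolean propertyAfter(boolean propertyExists, boolean b) {
        if (needsChange(propertyExists, b)) {
            // The property is removed, then re-added only when concurrent builds are being disabled.
            return !b;
        }
        // State already matches the request, so the setter leaves the property untouched.
        return propertyExists;
    }

    public static void main(String[] args) {
        boolean[] values = {false, true};
        for (boolean propertyExists : values) {
            for (boolean b : values) {
                System.out.printf("propertyExists=%b b=%b -> needsChange=%b propertyAfter=%b%n",
                        propertyExists, b, needsChange(propertyExists, b), propertyAfter(propertyExists, b));
            }
        }
    }
}
```

In every case the property ends up present exactly when b is false, which is the invariant isConcurrentBuild() relies on.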
          

          The deserialization from XML takes place in WorkflowJob#onLoad:

              @Override public void onLoad(ItemGroup<? extends Item> parent, String name) throws IOException {
                  super.onLoad(parent, name);
          
                  if (buildMixIn == null) {
                      buildMixIn = createBuildMixIn();
                  }
                  buildMixIn.onLoad(parent, name);
                  if (triggers != null && !triggers.isEmpty()) {
                      setTriggers(triggers.toList());
                  }
                  if (concurrentBuild != null) {
                      setConcurrentBuild(concurrentBuild);
                  }
          

          We know that Job DSL is writing out the XML with <concurrentBuild>false</concurrentBuild>. So when the job is deserialized, the deprecated field concurrentBuild must be set to false. The onLoad method checks this field, sees that it is not null, and calls WorkflowJob#setConcurrentBuild(false). This changes the value of the field from false to null and adds the DisableConcurrentBuildsJobProperty property to the job in a bulk change.

          My theory is that while this is taking place, another caller concurrently invokes WorkflowJob#isConcurrentBuild. Since the DisableConcurrentBuildsJobProperty is not yet set on the job, this method returns true. Hence the scheduler starts running this job concurrently, erroneously. Then later on, we reach a pathological state in Pipeline and the java.io.FileNotFoundException is thrown.
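
The window in this theory can be modeled with a small, deterministic sketch. ModelJob is a hypothetical stand-in for WorkflowJob (it is not the real class), and the "property" is just a string in a set. The sketch does not prove that the race occurs in Jenkins; it only shows that, if a reader calls isConcurrentBuild() after deserialization but before the onLoad conversion finishes, it sees true even though the XML said false.

```java
import java.util.HashSet;
import java.util.Set;

public class LegacyFieldRaceSketch {
    /** Hypothetical, simplified stand-in for WorkflowJob. */
    static class ModelJob {
        static final String DISABLE_PROP = "DisableConcurrentBuildsJobProperty";

        // Deprecated field, holding whatever was deserialized from the XML.
        private Boolean concurrentBuild;
        private final Set<String> properties = new HashSet<>();

        ModelJob(Boolean legacyValue) {
            this.concurrentBuild = legacyValue;
        }

        // Like the quoted getter, this consults only the property, never the legacy field.
        boolean isConcurrentBuild() {
            return !properties.contains(DISABLE_PROP);
        }

        // Like the quoted onLoad, this converts the legacy field into the property.
        void onLoad() {
            if (concurrentBuild != null) {
                boolean allow = concurrentBuild;
                concurrentBuild = null;
                if (allow) {
                    properties.remove(DISABLE_PROP);
                } else {
                    properties.add(DISABLE_PROP);
                }
            }
        }
    }

    /** Returns {value seen before conversion, value seen after conversion}. */
    static boolean[] observe(Boolean legacyValue) {
        ModelJob job = new ModelJob(legacyValue);
        boolean duringWindow = job.isConcurrentBuild();
        job.onLoad();
        boolean afterLoad = job.isConcurrentBuild();
        return new boolean[] {duringWindow, afterLoad};
    }

    public static void main(String[] args) {
        boolean[] seen = observe(Boolean.FALSE);
        System.out.println("during window: " + seen[0]); // prints "during window: true"
        System.out.println("after onLoad:  " + seen[1]); // prints "after onLoad:  false"
    }
}
```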

          Next, I will try to prove my theory. I'll post updates in this bug, but I welcome any suggestions.


          Basil Crow added a comment -

          I verified my understanding of the above by checking config.xml after Job DSL had generated the job with concurrentBuild false. It had the <concurrentBuild>false</concurrentBuild> element. I then checked how this had been deserialized in the Script Console:

          println job.@concurrentBuild
          println job.concurrentBuild
          

          This printed null and false, which was what I expected. The field had been set to null, and the getter was instead relying on the property. Next, I called job.save() from the Script Console and checked the contents of config.xml again. Now, <concurrentBuild>false</concurrentBuild> was gone. In its place, in the properties section was <org.jenkinsci.plugins.workflow.job.properties.DisableConcurrentBuildsJobProperty/>.
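
The same before/after check on config.xml can be scripted outside Jenkins. The following is a rough sketch, assuming only the two element names quoted above; the class name, the helper names, and the trimmed sample documents are made up for illustration and are not a full Jenkins config.xml.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

public class LegacyConcurrentBuildCheck {
    static final String LEGACY_ELEMENT = "concurrentBuild";
    static final String PROPERTY_ELEMENT =
            "org.jenkinsci.plugins.workflow.job.properties.DisableConcurrentBuildsJobProperty";

    // Parses the XML text and counts elements with the given tag name.
    private static int count(String xml, String tagName) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            return doc.getElementsByTagName(tagName).getLength();
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse config XML", e);
        }
    }

    /** True when the deprecated element is still present in the XML. */
    static boolean hasLegacyField(String xml) {
        return count(xml, LEGACY_ELEMENT) > 0;
    }

    /** True when the new-style property element is present in the XML. */
    static boolean hasNewProperty(String xml) {
        return count(xml, PROPERTY_ELEMENT) > 0;
    }

    public static void main(String[] args) {
        // Trimmed, illustrative samples of the two shapes described in the comment.
        String generated = "<flow-definition><properties/>"
                + "<concurrentBuild>false</concurrentBuild></flow-definition>";
        String resaved = "<flow-definition><properties><" + PROPERTY_ELEMENT
                + "/></properties></flow-definition>";

        System.out.println("generated: legacy=" + hasLegacyField(generated)
                + " property=" + hasNewProperty(generated));
        System.out.println("resaved:   legacy=" + hasLegacyField(resaved)
                + " property=" + hasNewProperty(resaved));
    }
}
```

A job whose config.xml still reports the legacy element is one that would go through the field-to-property conversion the next time it is loaded.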

          Based on the above, I think the fix should be for Job DSL to generate the property rather than setting the deprecated concurrentBuild field. In other words, there seems to be a race in converting the deprecated field to the new-style property. If Job DSL simply used the new-style property in the first place, we would avoid the race and therefore we wouldn't put Pipeline in a pathological state that ultimately results in a FileNotFoundException.

          daspilker, do you have any concerns with this approach? If not, I will prepare a PR.


          Basil Crow added a comment -

          Thinking about this some more, the fix might be even simpler. The Dynamic DSL already supports the new-style property. So all that we need to do in Job DSL is deprecate concurrentBuilds for Pipeline jobs and encourage users to migrate to the new property via the Dynamic DSL. In other words, users should convert this syntax …

          concurrentBuild false
          

          … to this syntax …

          properties {
            disableConcurrentBuilds()
          }
          

          I confirmed that this generates the new-style XML and that the "Do not allow concurrent builds" was checked when I viewed the generated job in the Jenkins UI.


          Nikita Bochenko added a comment - - edited

          basil, thank you for the extensive investigation. It seems highly plausible. I did observe multiple builds running where there should not be, or a job stuck in the running state although it had completed. This could also explain why reloading the configuration sometimes helps to find "missing" jobs that are stuck in this state.

          We do generate some jobs via DSL and some directly from the scripted pipeline, so I'd need to check how concurrentBuild is set up there, I suspect it is also using ye olde way of setting concurrent builds values.

          One additional observation: it also happens on the jobs that are using Throttle Concurrent Builds plugin - the cause could be a similar one. Basically I just set something like these:

          throttleConcurrentBuilds {
            categories(['some-category'])
          }
          
          throttleConcurrentBuilds {   
            maxPerNode(2)
            maxTotal(8)
          }
          

           

          Nikita Bochenko added a comment -

          I need to clarify: I believe we are explicitly setting concurrentBuild to either true or false for most, if not all, builds. Even the ones with throttleConcurrentBuilds - so it might not be related to the plugin itself. I will investigate this when I get some time - right now I am working on some projects that do not give me enough time to investigate this deeper.

          Nikita Bochenko added a comment -

          P.S.

          Had a quick look and in many places we are using setConcurrentProperty which is not marked deprecated in docs. This is not related to DSL, however it seems to me that this is the same bug, just in a different context.

          Nikita Bochenko added a comment -

          P.P.S.

          For freestyle jobs I have to set up concurrentBuild(true). I think by default freestyle jobs are not allowed concurrent execution. Not related to pipeline, of course, but feels inconsistent and may cause confusion?


          Basil Crow added a comment -

          I'm not surprised Throttle Concurrent Builds is on the scene of the crime here, but I don't want to jump to any conclusions yet without knowing more details about your configuration. It may or may not be related to the Pipeline race being described here. As an aside, if you're using Throttle Concurrent Builds with categories and Pipeline, you really should be using my patch from throttle-concurrent-builds-plugin#57, which improves CPU usage drastically and also has some correctness benefits. But that may or may not be related to this bug.


          Basil Crow added a comment -

          PS if you use that Throttle Concurrent Builds patch and find that it helps, please comment on the PR. I've been using it successfully for over 6 months and trying to get it merged/released (including asking for the privileges to merge/release it myself) but so far have received no response.


          Basil Crow added a comment -

          I've been running with the new syntax described in my previous comment for about a month, and this issue hasn't occurred again. Previously it occurred almost every time I deployed changes to a Job DSL pipeline.


            daspilker Daniel Spilker
            basil Basil Crow