JENKINS-53401

Random FileNotFoundException when creating lots of agents in parallel threads

      When creating lots of agents in parallel (cloud provisioning of containers), I sometimes see random exceptions reported while moving temporary files into place as the node's config.xml.

      Also:   java.nio.file.NoSuchFileException: /var/jenkins_home/nodes/myagent-5pr7b/atomic4488666319135941520tmp -> /var/jenkins_home/nodes/myagent-5pr7b/config.xml
      		at sun.nio.fs.UnixException.translateToIOException(UnixException.java:86)
      		at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
      		at sun.nio.fs.UnixCopyFile.move(UnixCopyFile.java:396)
      		at sun.nio.fs.UnixFileSystemProvider.move(UnixFileSystemProvider.java:262)
      		at java.nio.file.Files.move(Files.java:1395)
      		at hudson.util.AtomicFileWriter.commit(AtomicFileWriter.java:191)
      java.nio.file.NoSuchFileException: /var/jenkins_home/nodes/myagent-5pr7b/atomic4488666319135941520tmp
      	at sun.nio.fs.UnixException.translateToIOException(UnixException.java:86)
      	at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
      	at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107)
      	at sun.nio.fs.UnixCopyFile.move(UnixCopyFile.java:409)
      	at sun.nio.fs.UnixFileSystemProvider.move(UnixFileSystemProvider.java:262)
      	at java.nio.file.Files.move(Files.java:1395)
      	at hudson.util.AtomicFileWriter.commit(AtomicFileWriter.java:206)
      	at hudson.XmlFile.write(XmlFile.java:198)
      	at jenkins.model.Nodes.save(Nodes.java:289)
      	at hudson.util.PersistedList.onModified(PersistedList.java:173)
      	at hudson.util.PersistedList.replaceBy(PersistedList.java:85)
      	at hudson.model.Slave.<init>(Slave.java:198)
      	at hudson.slaves.AbstractCloudSlave.<init>(AbstractCloudSlave.java:51)
      	at org.csanchez.jenkins.plugins.kubernetes.KubernetesSlave.<init>(KubernetesSlave.java:116)
      	at org.csanchez.jenkins.plugins.kubernetes.KubernetesSlave$Builder.build(KubernetesSlave.java:408)
      	at com.cloudbees.jenkins.plugins.kube.PlannedKubernetesSlave.call(PlannedKubernetesSlave.java:122)
      	at com.cloudbees.jenkins.plugins.kube.PlannedKubernetesSlave.call(PlannedKubernetesSlave.java:35)
      	at jenkins.util.ContextResettingExecutorService$2.call(ContextResettingExecutorService.java:46)
      	at jenkins.security.ImpersonatingExecutorService$2.call(ImpersonatingExecutorService.java:71)
      	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
      	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
      	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
      	at java.lang.Thread.run(Thread.java:748)
      

      I tracked the root cause down to the nodeProperties field in hudson.model.Slave.

      If a lot of agents are created in different threads, each thread ends up calling Jenkins.get().getNodesObject().save(). That method is not thread-safe and it touches the storage of all nodes. As a result, save() throws an exception in some threads because the node has already been processed by another thread.
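
      To make the race concrete, here is a minimal, self-contained sketch (this is not the actual Jenkins code; the class name, paths and iteration count are made up for illustration). One thread performs the temp-file-then-move write for a node directory that is not registered yet, while a concurrent whole-storage save in another thread treats that directory as stale and removes it, so the final move fails with a NoSuchFileException much like the trace above.

          import java.io.IOException;
          import java.io.UncheckedIOException;
          import java.nio.file.Files;
          import java.nio.file.Path;
          import java.nio.file.StandardCopyOption;
          import java.util.Comparator;
          import java.util.List;
          import java.util.concurrent.ExecutionException;
          import java.util.concurrent.ExecutorService;
          import java.util.concurrent.Executors;
          import java.util.concurrent.Future;
          import java.util.stream.Collectors;
          import java.util.stream.Stream;

          public class NodesSaveRaceSketch {

              // Stand-in for the AtomicFileWriter/XmlFile pattern: write a temp file
              // inside the node directory, then move it over config.xml.
              static void writeConfig(Path nodeDir) throws IOException {
                  Files.createDirectories(nodeDir);
                  Path tmp = Files.createTempFile(nodeDir, "atomic", "tmp");
                  Files.write(tmp, "<slave/>".getBytes());
                  // Fails with NoSuchFileException if another thread wiped nodeDir
                  // (and the temp file with it) between the lines above and this move.
                  Files.move(tmp, nodeDir.resolve("config.xml"), StandardCopyOption.REPLACE_EXISTING);
              }

              // Stand-in for the cleanup a competing whole-storage save performs on a
              // node directory it does not (yet) know about.
              static void deleteUnknownNodeDir(Path nodeDir) {
                  try (Stream<Path> walk = Files.walk(nodeDir)) {
                      List<Path> paths = walk.sorted(Comparator.reverseOrder()).collect(Collectors.toList());
                      for (Path p : paths) {
                          Files.deleteIfExists(p);
                      }
                  } catch (IOException | UncheckedIOException ignored) {
                      // best-effort recursive delete
                  }
              }

              public static void main(String[] args) throws Exception {
                  Path nodeDir = Files.createTempDirectory("nodes").resolve("myagent");
                  ExecutorService pool = Executors.newFixedThreadPool(2);
                  for (int i = 0; i < 10000; i++) {
                      Future<?> writer = pool.submit(() -> { writeConfig(nodeDir); return null; });
                      Future<?> cleaner = pool.submit(() -> deleteUnknownNodeDir(nodeDir));
                      try {
                          writer.get();
                      } catch (ExecutionException e) {
                          System.out.println("Race hit on iteration " + i + ": " + e.getCause());
                          break;
                      }
                      cleaner.get();
                  }
                  pool.shutdownNow();
              }
          }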

      In JENKINS-31055, Stephen made Node implement Saveable, which means the persisted lists can be tied to the node itself instead of to the Nodes object. The corresponding save() operation is fine-grained, so this issue would be avoided completely.
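
      A rough sketch of that fine-grained direction, using simplified stand-ins rather than the real hudson.model.Saveable and hudson.util.PersistedList APIs (the class and property names below are invented for illustration): the persisted list is owned by the node itself, so modifying it rewrites only that node's own config.xml instead of going through the shared Nodes object.

          import java.io.IOException;
          import java.util.ArrayList;
          import java.util.Arrays;
          import java.util.Collection;
          import java.util.List;

          interface Saveable {
              void save() throws IOException;
          }

          class PersistedList<T> {
              private final Saveable owner;
              private final List<T> items = new ArrayList<>();

              PersistedList(Saveable owner) {
                  this.owner = owner;
              }

              void replaceBy(Collection<T> newItems) throws IOException {
                  items.clear();
                  items.addAll(newItems);
                  owner.save();   // only the owner of this particular list is persisted
              }
          }

          class SketchSlave implements Saveable {

              final String name;
              final PersistedList<String> nodeProperties = new PersistedList<>(this);

              SketchSlave(String name, Collection<String> properties) throws IOException {
                  this.name = name;
                  // The list is owned by the node, so this triggers a per-node save
                  // rather than a save of the whole nodes storage.
                  nodeProperties.replaceBy(properties);
              }

              @Override
              public void save() throws IOException {
                  // Write only nodes/<name>/config.xml; other nodes' files stay untouched.
                  System.out.println("saving nodes/" + name + "/config.xml");
              }

              public static void main(String[] args) throws IOException {
                  new SketchSlave("myagent", Arrays.asList("envVars", "toolLocations"));
              }
          }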

          [JENKINS-53401] Random FileNotFoundException when creating lots of agents in parallel threads

          Matt Nuzzaco added a comment -

          This sounds very similar to what I was seeing in a few heavily parallelized jobs. We can easily kick off 200-500 agents in a very short period of time. I've tested v2.143 and so far I haven't seen the failure we were seeing before. Crossing fingers this was the solution. Thanks for the patch.

          Daniel Beck added a comment -

          Addressed in 2.143.

          Greg Smith added a comment - edited

          Please forgive me if this is out of line:

          There are reports of deadlock issues with EC2 slaves after upgrading to Jenkins LTS 2.138.2, and one of the changes between LTS 2.138.1 and 2.138.2 was this one. The issue is reported here: JENKINS-54187

          I don't know the code well enough to say for sure, but this change mentions slaves and thread-safety, and that bug is around the creation of slaves and a deadlock. Knowing nothing more than that, and trying to figure out which change caused the deadlock issue, I thought maybe they were related?

          Vincent Latombe added a comment -

          gregcovertsmith Indeed, it looks like they are related. For other readers: the new save path adds a Queue lock ( https://github.com/jenkinsci/jenkins/blob/9557da32a3550bd98acc9d04728547fcd98b8a15/core/src/main/java/jenkins/model/Nodes.java#L193-L202 ), which wasn't in the previous save path ( https://github.com/jenkinsci/jenkins/blob/9557da32a3550bd98acc9d04728547fcd98b8a15/core/src/main/java/jenkins/model/Nodes.java#L277-L300 ).
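
          For readers unfamiliar with why adding a lock to a save path can matter: if the save now takes the Queue lock while its caller already holds some other lock, and a second thread acquires the same two locks in the opposite order, both threads block forever. The sketch below is purely illustrative; the lock names are hypothetical, not taken from Jenkins core or the EC2 plugin, and the actual mechanism in JENKINS-54187 may well differ.

              public class LockOrderingDeadlockSketch {

                  // Hypothetical locks for illustration only: one stands in for Jenkins'
                  // Queue lock, the other for a lock a cloud plugin holds while creating agents.
                  private static final Object QUEUE_LOCK = new Object();
                  private static final Object PLUGIN_LOCK = new Object();

                  public static void main(String[] args) {
                      Thread provisioner = new Thread(() -> {
                          synchronized (PLUGIN_LOCK) {      // plugin creates an agent under its own lock...
                              sleep(100);
                              synchronized (QUEUE_LOCK) {   // ...and the save path then needs the Queue lock
                                  System.out.println("provisioner saved the node");
                              }
                          }
                      });
                      Thread scheduler = new Thread(() -> {
                          synchronized (QUEUE_LOCK) {       // queue maintenance holds the Queue lock...
                              sleep(100);
                              synchronized (PLUGIN_LOCK) {  // ...and calls back into the plugin
                                  System.out.println("scheduler asked the cloud for capacity");
                              }
                          }
                      });
                      provisioner.start();
                      scheduler.start();                    // with this interleaving, both threads hang
                  }

                  private static void sleep(long millis) {
                      try {
                          Thread.sleep(millis);
                      } catch (InterruptedException e) {
                          Thread.currentThread().interrupt();
                      }
                  }
              }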
