Random FileNotFoundException when creating lots of agents in parallel threads

      When creating many agents in parallel (a cloud provisioning containers), I sometimes see random exceptions reported while a temporary file is being moved to the node's config.xml.

      Also:   java.nio.file.NoSuchFileException: /var/jenkins_home/nodes/myagent-5pr7b/atomic4488666319135941520tmp -> /var/jenkins_home/nodes/myagent-5pr7b/config.xml
      		at sun.nio.fs.UnixException.translateToIOException(UnixException.java:86)
      		at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
      		at sun.nio.fs.UnixCopyFile.move(UnixCopyFile.java:396)
      		at sun.nio.fs.UnixFileSystemProvider.move(UnixFileSystemProvider.java:262)
      		at java.nio.file.Files.move(Files.java:1395)
      		at hudson.util.AtomicFileWriter.commit(AtomicFileWriter.java:191)
      java.nio.file.NoSuchFileException: /var/jenkins_home/nodes/myagent-5pr7b/atomic4488666319135941520tmp
      	at sun.nio.fs.UnixException.translateToIOException(UnixException.java:86)
      	at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
      	at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107)
      	at sun.nio.fs.UnixCopyFile.move(UnixCopyFile.java:409)
      	at sun.nio.fs.UnixFileSystemProvider.move(UnixFileSystemProvider.java:262)
      	at java.nio.file.Files.move(Files.java:1395)
      	at hudson.util.AtomicFileWriter.commit(AtomicFileWriter.java:206)
      	at hudson.XmlFile.write(XmlFile.java:198)
      	at jenkins.model.Nodes.save(Nodes.java:289)
      	at hudson.util.PersistedList.onModified(PersistedList.java:173)
      	at hudson.util.PersistedList.replaceBy(PersistedList.java:85)
      	at hudson.model.Slave.<init>(Slave.java:198)
      	at hudson.slaves.AbstractCloudSlave.<init>(AbstractCloudSlave.java:51)
      	at org.csanchez.jenkins.plugins.kubernetes.KubernetesSlave.<init>(KubernetesSlave.java:116)
      	at org.csanchez.jenkins.plugins.kubernetes.KubernetesSlave$Builder.build(KubernetesSlave.java:408)
      	at com.cloudbees.jenkins.plugins.kube.PlannedKubernetesSlave.call(PlannedKubernetesSlave.java:122)
      	at com.cloudbees.jenkins.plugins.kube.PlannedKubernetesSlave.call(PlannedKubernetesSlave.java:35)
      	at jenkins.util.ContextResettingExecutorService$2.call(ContextResettingExecutorService.java:46)
      	at jenkins.security.ImpersonatingExecutorService$2.call(ImpersonatingExecutorService.java:71)
      	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
      	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
      	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
      	at java.lang.Thread.run(Thread.java:748)
      

      I tracked the root cause down to the nodeProperties field in hudson.model.Slave.

      As the stack trace shows, the Slave constructor goes through PersistedList.replaceBy and onModified, which ends up in Nodes.save(). If many agents are created in different threads, each thread therefore calls Jenkins.get().getNodesObject().save(). This method is not thread-safe, and it touches the storage of all nodes. As a result, save() throws an exception in some threads because the node has already been processed by another thread.
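The symptom can be reproduced in isolation. The trace implies that the temporary file no longer exists by the time commit() moves it; the sketch below simulates that by deleting the file between the write and the move. The deletion stands in for whatever the concurrent save() does, which is an assumption on my part, and the class and method names are hypothetical (this is not the AtomicFileWriter code).

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.NoSuchFileException;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

/** Hypothetical reproduction of the symptom, not the Jenkins implementation. */
final class AtomicCommitRaceSketch {

    /** Returns true if moving the temp file fails with NoSuchFileException. */
    static boolean commitFailsWhenTempFileVanishes() {
        try {
            // Stand-in for /var/jenkins_home/nodes/<agent>/
            Path nodeDir = Files.createTempDirectory("myagent-");
            Path tmp = Files.createTempFile(nodeDir, "atomic", "tmp");
            Files.write(tmp, "<slave/>".getBytes(StandardCharsets.UTF_8));
            Path target = nodeDir.resolve("config.xml");

            // Assumption: another thread's save() gets here first and the
            // temp file is gone before this thread commits.
            Files.delete(tmp);

            try {
                // The "commit" step: move the temp file over config.xml.
                Files.move(tmp, target, StandardCopyOption.REPLACE_EXISTING);
                return false;
            } catch (NoSuchFileException expected) {
                return true;
            } finally {
                Files.deleteIfExists(target);
                Files.deleteIfExists(nodeDir);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("commit failed: " + commitFailsWhenTempFileVanishes());
    }
}
```

The interleaving is forced sequentially here so the result does not depend on thread timing; in the real system it only shows up occasionally, which matches the "random" nature of the report.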

      In JENKINS-31055, Stephen made Node implement Saveable, which means the persisted lists should be tied to the node instead of the Nodes object. The corresponding save() operation is fine-grained, so the issue would be avoided completely.
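To illustrate the difference in granularity, here is a minimal model of the two ownership choices. All types are simplified stand-ins that I made up for the sketch (they only borrow the names Saveable, PersistedList, Nodes and Node); the real Jenkins classes do far more. It records which config.xml files a single nodeProperties update would rewrite.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Simplified, hypothetical model; not the Jenkins classes. */
final class FineGrainedSaveSketch {

    interface Saveable { void save(); }

    /** Every modification saves the owner, whoever that is. */
    static final class PersistedList<T> {
        private final Saveable owner;
        private final List<T> items = new ArrayList<>();
        PersistedList(Saveable owner) { this.owner = owner; }
        void replaceBy(List<T> newItems) {
            items.clear();
            items.addAll(newItems);
            owner.save();
        }
    }

    /** Global store: save() rewrites the config of every registered node. */
    static final class Nodes implements Saveable {
        final Map<String, Node> nodes = new LinkedHashMap<>();
        final List<String> writes = new ArrayList<>();
        @Override public void save() {
            for (String name : nodes.keySet()) writes.add(name + "/config.xml");
        }
    }

    /** Node: save() rewrites only its own config. */
    static final class Node implements Saveable {
        final String name;
        final Nodes store;
        final PersistedList<String> nodeProperties;
        Node(String name, Nodes store, boolean ownedByNode) {
            this.name = name;
            this.store = store;
            Saveable owner = ownedByNode ? this : store;
            this.nodeProperties = new PersistedList<>(owner);
        }
        @Override public void save() { store.writes.add(name + "/config.xml"); }
    }

    /** Files rewritten when agent-b updates its nodeProperties. */
    static List<String> writesTriggeredBy(boolean ownedByNode) {
        Nodes store = new Nodes();
        Node a = new Node("agent-a", store, ownedByNode);
        Node b = new Node("agent-b", store, ownedByNode);
        store.nodes.put(a.name, a);
        store.nodes.put(b.name, b);
        b.nodeProperties.replaceBy(Collections.singletonList("prop"));
        return new ArrayList<>(store.writes);
    }

    public static void main(String[] args) {
        System.out.println("owned by Nodes: " + writesTriggeredBy(false));
        System.out.println("owned by Node:  " + writesTriggeredBy(true));
    }
}
```

With the list owned by the Nodes stand-in, updating agent-b also rewrites agent-a/config.xml; with the list owned by the node, only agent-b/config.xml is written, so threads creating different agents would no longer touch each other's files.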

            Assignee:
            Vincent Latombe
            Reporter:
            Vincent Latombe
            Archiver:
            Jenkins Service Account
