-
Bug
-
Resolution: Fixed
-
Major
-
Code that's creating multiple Slave nodes, in parallel, that have nodeProperties.
e.g. cloud plugins or pipelines running parallel jobs in dynamic slaves.
TL;DR: Slave.setNodeProperties is unreliable because it's doing stuff it shouldn't.
When a cloud plugin's Cloud.provision method gets called, the plugin code has to create one or more instances of hudson.model.Slave and return them to Jenkins.
Prior to returning them to Jenkins, it has to populate those instances with all the data required. This can (e.g. in the case of the docker-plugin) include a call to Slave.setNodeProperties.
Slave.setNodeProperties calls nodeProperties.replaceBy(). nodeProperties are a hudson.util.PersistedList whose owner is Jenkins.getInstance().getNodesObject() (see Slave.java line 150), and PersistedList.replaceBy calls PersistedList.onModified() which then calls Nodes.save() which writes all nodes to disk (which is a non-trivial operation that isn't thread-safe).
i.e. At present, any change to the node properties for any slave (even those not persisted to disk yet) causes all slaves to get (re-)persisted to disk.
This is very inefficient when there are a lot of slaves - it should, at most, persist just the slave in question, and in the case where a slave doesn't belong to anyone yet, it shouldn't get persisted at all.
Worse still, when there's more than one slave being provisioned at a time (e.g. on a busy Jenkins server with a fast cloud), the persistence functionality can throw an exception and cause the entire operation to fail:
java.nio.file.NoSuchFileException: /var/jenkins_home/nodes/docker-22f02f15d4988b/atomic3648298142656815004tmp at sun.nio.fs.UnixException.translateToIOException(UnixException.java:86) at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102) at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107) at sun.nio.fs.UnixCopyFile.move(UnixCopyFile.java:409) at sun.nio.fs.UnixFileSystemProvider.move(UnixFileSystemProvider.java:262) at java.nio.file.Files.move(Files.java:1395) at hudson.util.AtomicFileWriter.commit(AtomicFileWriter.java:206) at hudson.XmlFile.write(XmlFile.java:198) at jenkins.model.Nodes.save(Nodes.java:274) at hudson.util.PersistedList.onModified(PersistedList.java:173) at hudson.util.PersistedList.replaceBy(PersistedList.java:85) at hudson.model.Slave.setNodeProperties(Slave.java:299) at ...
What should happen instead is that changes to Slave.nodeProperties should only cause those changes to be persisted if the slave itself is persisted. Changes to Slaves that are not (yet) part of Jenkins' list of Nodes should not cause a re-save of the configuration to disk.
e.g. much like the logic done in Node.save().
FYI this was discovered while investigating the cause of numerous docker provisioning failures caused by occasional (unpredictable) persistence failures from hudson.util.AtomicFileWriter (see docker-plugin issue 644). While that plugin code is being enhanced to cope with such failures, the core code shouldn't be causing these failures in the first place.
Suggestion: I suspect that this could be fixed by having Slave have nodeProperties.owner set to this, because then it'll benefit from the same optimization logic that's already implemented in Node.save().
Note: Elsewhere, calls to Nodes.save() are wrapped in a Queue.withLock block to prevent concurrent updates; the call from Slave.nodeProperties isn't wrapped like this, which is why the exception happens, but it'd be better to avoid calling it entirely than to simply make it's existing functionality thread-safe.
- is duplicated by
-
JENKINS-53401 Random FileNotFoundException when creating lots of agents in parallel threads
- Resolved
- is related to
-
JENKINS-53401 Random FileNotFoundException when creating lots of agents in parallel threads
- Resolved