-
Bug
-
Resolution: Unresolved
-
Minor
-
None
-
2.462.1 and 2.479.1
When upgrading from 2.462.1 to 2.479.1, we ran into a problem where Jenkins spawned too many threads, hit OS limits, and failed to boot. This turned out to be caused by Jenkins trying to read agent config files from {{/jenkins_home/nodes}}: there were over 8000 directories there, most of them for agents that were terminated and removed from Jenkins a long time ago. After we removed the directories, Jenkins booted successfully.
It appears that in our setup Jenkins is not deleting the directories for terminated agents correctly from under /jenkins_home/nodes. I've been looking at the logs of jenkins.model.Nodes. Jenkins attempts to delete the directories of terminated agents as expected, and no exceptions are thrown. However, the directories are not actually removed from disk, and {{inotifywatch}} suggests that a delete call doesn't make it to the operating system at all. Below is the {{inotifywatch}} output for a directory that Jenkins attempted to delete, but which remained on disk:
total  access  modify  close_write  close_nowrite  open  moved_from  moved_to  create  delete  delete_self  filename
30     6       5       2            6              8     1           1         1       0       0            /jenkins_home/nodes/oci-compute-ec2a342f-foo/
This only happens for some nodes. Some are deleted correctly, some are not. In both cases they are created by the same plugin (Oracle Cloud Infrastructure Compute). The config.xml file for all the leaked ones has this:
<temporaryOfflineCause class="hudson.slaves.OfflineCause$SimpleOfflineCause">
  <timestamp>1732305733233</timestamp>
  <description>
    <holder>
      <owner>hudson.model.Messages</owner>
    </holder>
    <key>Hudson.NodeBeingRemoved</key>
    <args/>
  </description>
</temporaryOfflineCause>
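To get a list of candidate leaked directories, something like the following sketch could be used. It is not part of Jenkins: {{LeakedNodeFinder}} and {{findLeaked}} are hypothetical names, and it assumes that every agent directory holds a {{config.xml}} directly under the nodes directory. It does a plain substring match on the {{Hudson.NodeBeingRemoved}} key and only reports directory names, it deletes nothing. An agent that is legitimately in the middle of being removed carries the same marker, so the output should be cross-checked against the agents Jenkins still knows about.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical helper, not a Jenkins API.
final class LeakedNodeFinder {
    // Marker taken from the config.xml snippet above.
    static final String MARKER = "<key>Hudson.NodeBeingRemoved</key>";

    // Returns the sorted names of agent directories whose config.xml contains the marker.
    static List<String> findLeaked(Path nodesDir) throws IOException {
        List<String> leaked = new ArrayList<>();
        try (DirectoryStream<Path> dirs = Files.newDirectoryStream(nodesDir)) {
            for (Path dir : dirs) {
                Path config = dir.resolve("config.xml");
                if (!Files.isDirectory(dir) || !Files.isRegularFile(config)) {
                    continue; // skip stray files and directories without a config
                }
                String xml = new String(Files.readAllBytes(config), StandardCharsets.UTF_8);
                if (xml.contains(MARKER)) {
                    leaked.add(dir.getFileName().toString());
                }
            }
        }
        Collections.sort(leaked);
        return leaked;
    }

    public static void main(String[] args) throws IOException {
        // Default path is the one from this report; pass another as the first argument.
        Path nodesDir = Paths.get(args.length > 0 ? args[0] : "/jenkins_home/nodes");
        for (String name : findLeaked(nodesDir)) {
            System.out.println(name);
        }
    }
}
```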
Agent churn in our setup is high, which might be a factor: I'm not able to reproduce this on another instance of ours, which is under much less load.
There are two problems here: the leaked directories, and the failed boot. For the latter, maybe there should be a limit on the number of threads that are spawned when the agent files are read during startup.
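To illustrate the kind of limit meant here, below is a minimal sketch that reads every agent {{config.xml}} through a fixed-size thread pool, so the number of worker threads stays at {{maxThreads}} no matter how many directories exist. This is not how Jenkins actually loads nodes: {{BoundedNodeLoader}}, {{loadAll}} and {{maxThreads}} are hypothetical names, and the "load" is reduced to reading the file as a string.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical illustration, not the Jenkins implementation.
final class BoundedNodeLoader {
    // Reads <nodesDir>/<agent>/config.xml for every agent using at most maxThreads workers.
    static Map<String, String> loadAll(Path nodesDir, int maxThreads)
            throws IOException, InterruptedException, ExecutionException {
        // Fixed pool: extra tasks wait in the queue instead of spawning new threads.
        ExecutorService pool = Executors.newFixedThreadPool(maxThreads);
        try {
            Map<String, Future<String>> pending = new LinkedHashMap<>();
            try (DirectoryStream<Path> dirs = Files.newDirectoryStream(nodesDir)) {
                for (Path dir : dirs) {
                    Path config = dir.resolve("config.xml");
                    if (!Files.isRegularFile(config)) {
                        continue; // nothing to load for this entry
                    }
                    Callable<String> task =
                            () -> new String(Files.readAllBytes(config), StandardCharsets.UTF_8);
                    pending.put(dir.getFileName().toString(), pool.submit(task));
                }
            }
            // Collect results; get() rethrows any read failure as ExecutionException.
            Map<String, String> loaded = new TreeMap<>();
            for (Map.Entry<String, Future<String>> entry : pending.entrySet()) {
                loaded.put(entry.getKey(), entry.getValue().get());
            }
            return loaded;
        } finally {
            pool.shutdown();
        }
    }
}
```

With 8000 directories and, say, {{loadAll(nodesDir, 16)}}, only 16 worker threads would exist at any time; the remaining tasks sit in the executor's queue.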
rmoszczy Any chance you could share a sanitized version of these logs with us?
FYI vlatombe and jglick as you worked on some Nodes persistence changes to this class recently, though I am not sure if they are related or not. (My apologies for the unnecessary ping if those changes are unrelated.)