
Agent config files not deleted after agent termination

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Minor
    • core
    • None
    • 2.462.1 and 2.479.1

      When upgrading from 2.462.1 to 2.479.1, we ran into a problem where Jenkins spawned too many threads, hit OS limits, and failed to boot. This turned out to be caused by Jenkins trying to read agent config files from /jenkins_home/nodes. There were over 8000 directories there, most of them for agents that had been terminated and removed from Jenkins long ago. After removing the directories, Jenkins booted successfully.

       

      It appears that in our setup Jenkins is not correctly deleting the directories of terminated agents under /jenkins_home/nodes. I've been looking at the logs of jenkins.model.Nodes. Jenkins attempts to delete the directories of terminated agents as expected, and no exceptions are thrown. However, the directories are not actually removed from disk, and inotifywatch suggests that a delete call never reaches the operating system. Below is the inotifywatch output for a directory that Jenkins attempted to delete, but which remained on disk:

      total  access  modify  close_write  close_nowrite  open  moved_from  moved_to  create  delete  delete_self  filename
      30     6       5       2            6              8     1           1         1       0       0            /jenkins_home/nodes/oci-compute-ec2a342f-foo/ 
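The counters in that table can be read programmatically. The following is a minimal sketch, assuming the whitespace-separated layout shown above with no spaces in the watched path; the function name parse_inotifywatch_table is made up for illustration and is not part of inotify-tools.

```python
# Sample copied from the inotifywatch output quoted in this report.
SAMPLE = """
total  access  modify  close_write  close_nowrite  open  moved_from  moved_to  create  delete  delete_self  filename
30     6       5       2            6              8     1           1         1       0       0            /jenkins_home/nodes/oci-compute-ec2a342f-foo/
"""


def parse_inotifywatch_table(text):
    """Map each watched path to a dict of event name -> count.

    Assumes a header row followed by data rows, all whitespace-separated,
    and that watched paths contain no spaces (hypothetical helper).
    """
    lines = [line.split() for line in text.strip().splitlines() if line.strip()]
    header, rows = lines[0], lines[1:]
    result = {}
    for row in rows:
        record = dict(zip(header, row))
        filename = record.pop("filename")
        result[filename] = {event: int(count) for event, count in record.items()}
    return result


counts = parse_inotifywatch_table(SAMPLE)
node = counts["/jenkins_home/nodes/oci-compute-ec2a342f-foo/"]
# Both delete counters are zero for the directory that stayed on disk.
print(node["delete"], node["delete_self"])
```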

      This only happens for some nodes: some are deleted correctly, some are not. In both cases they are created by the same plugin (Oracle Cloud Infrastructure Compute). The config.xml of every leaked node contains this:

        <temporaryOfflineCause class="hudson.slaves.OfflineCause$SimpleOfflineCause">
          <timestamp>1732305733233</timestamp>
          <description>
            <holder>
              <owner>hudson.model.Messages</owner>
            </holder>
            <key>Hudson.NodeBeingRemoved</key>
            <args/>
          </description>
        </temporaryOfflineCause>
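Since every leaked directory carries this marker, one way to list candidates for cleanup is to scan the nodes directory for it. This is a minimal sketch and not part of Jenkins: the function name find_leaked_node_dirs is hypothetical, and it assumes a plain substring match on the <key> element is good enough, which avoids depending on how the XML declaration in config.xml is parsed. It only reports directory names and deletes nothing.

```python
from pathlib import Path

# Marker taken from the config.xml excerpt above.
MARKER = "<key>Hudson.NodeBeingRemoved</key>"


def find_leaked_node_dirs(nodes_root):
    """Return sorted names of node directories whose config.xml contains MARKER.

    Hypothetical helper: a text match is used instead of XML parsing,
    on the assumption that the marker appears verbatim in the file.
    """
    leaked = []
    for node_dir in Path(nodes_root).iterdir():
        config = node_dir / "config.xml"
        if not config.is_file():
            continue  # skip stray files and directories without a config
        text = config.read_text(encoding="utf-8", errors="replace")
        if MARKER in text:
            leaked.append(node_dir.name)
    return sorted(leaked)


# Example call (path as used in this report):
#   for name in find_leaked_node_dirs("/jenkins_home/nodes"):
#       print(name)
```

The result should still be cross-checked against the agents Jenkins currently knows about before removing anything, because a node that is legitimately in the middle of removal would carry the same marker.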
       

      Agent churn in our setup is quite high, so this might be a factor. I'm not able to reproduce this on another instance of ours, which is under much less load.

       

      One problem here is the leaks, and another is the failed boot – maybe there should be a limit on the number of threads that are spawned when the agent files are read during startup.
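The thread-limit suggestion can be illustrated with a bounded worker pool. This is only a sketch of the general technique in Python, not how Jenkins loads nodes: the function name load_configs_bounded and the default of 8 workers are assumptions chosen for the example. The point is that the number of threads stays fixed no matter how many directories exist.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def load_configs_bounded(nodes_root, max_workers=8):
    """Read every <node>/config.xml using at most max_workers threads.

    Hypothetical illustration: work items queue up behind a fixed-size
    pool, so 8000 directories never translate into 8000 threads.
    """
    config_paths = sorted(Path(nodes_root).glob("*/config.xml"))
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so results line up with config_paths.
        contents = list(pool.map(lambda path: path.read_bytes(), config_paths))
    return {path.parent.name: data for path, data in zip(config_paths, contents)}
```

With a pool like this, a large backlog of stale directories would slow startup down rather than exhaust the OS thread limit.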

          [JENKINS-75099] Agent config files not deleted after agent termination

            Assignee: Unassigned
            Reporter: Radoslaw Moszczynski (rmoszczy)
            Votes: 0
            Watchers: 4
