Agent config files not deleted after agent termination

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Minor
    • Component/s: core
    • Labels: None
    • Environment: 2.462.1 and 2.479.1

      When upgrading from 2.462.1 to 2.479.1, we ran into a problem where Jenkins spawned too many threads, hit OS limits, and failed to boot. The cause turned out to be Jenkins reading agent config files from /jenkins_home/nodes: there were over 8000 directories there, most of them for agents that had been terminated and removed from Jenkins long ago. After we removed the directories, Jenkins booted successfully.

       

      It appears that in our setup Jenkins is not correctly deleting the directories of terminated agents under /jenkins_home/nodes. I've been looking at the logs of jenkins.model.Nodes. Jenkins attempts to delete the directories of terminated agents as expected, and no exceptions are thrown. However, the directories are not actually removed from disk, and inotifywatch suggests that a delete call doesn't reach the operating system at all. Below is the inotifywatch output for a directory that Jenkins attempted to delete but which remained on disk:

      total  access  modify  close_write  close_nowrite  open  moved_from  moved_to  create  delete  delete_self  filename
      30     6       5       2            6              8     1           1         1       0       0            /jenkins_home/nodes/oci-compute-ec2a342f-foo/ 

      This only happens for some nodes. Some are deleted correctly, some are not. In both cases they are created by the same plugin (Oracle Cloud Infrastructure Compute). The config.xml file for all the leaked ones has this:

        <temporaryOfflineCause class="hudson.slaves.OfflineCause$SimpleOfflineCause">
          <timestamp>1732305733233</timestamp>
          <description>
            <holder>
              <owner>hudson.model.Messages</owner>
            </holder>
            <key>Hudson.NodeBeingRemoved</key>
            <args/>
          </description>
        </temporaryOfflineCause>
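
To gauge how many directories have leaked, one option is to scan the nodes directory for config.xml files that still carry this marker. The following is a minimal sketch and not part of Jenkins: the class name `LeakedNodeScanner` and the plain substring match on the `Hudson.NodeBeingRemoved` key are assumptions made for illustration. Anything it reports should be cross-checked against the agents Jenkins currently knows about before being deleted.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not a Jenkins API.
public class LeakedNodeScanner {
    // Key found in the config.xml of the leaked directories described in this report.
    static final String MARKER = "<key>Hudson.NodeBeingRemoved</key>";

    /** Returns the subdirectories of nodesDir whose config.xml contains the marker. */
    public static List<Path> findLeaked(Path nodesDir) {
        List<Path> leaked = new ArrayList<>();
        try (DirectoryStream<Path> dirs = Files.newDirectoryStream(nodesDir)) {
            for (Path dir : dirs) {
                Path config = dir.resolve("config.xml");
                if (!Files.isRegularFile(config)) {
                    continue; // not an agent directory, or no config saved
                }
                String xml = new String(Files.readAllBytes(config), StandardCharsets.UTF_8);
                if (xml.contains(MARKER)) {
                    leaked.add(dir);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return leaked;
    }

    public static void main(String[] args) {
        // Default path taken from the log excerpts in this issue; pass another as the first argument.
        Path nodesDir = Paths.get(args.length > 0 ? args[0] : "/var/jenkins_home/nodes");
        findLeaked(nodesDir).forEach(System.out::println);
    }
}
```

The scan only reads files, so it can be run against a live instance without changing anything on disk.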
       

      The churn of agents in our setup is pretty big, so this might be a factor. I'm not able to reproduce this on another instance that we have, which is under much less load.

       

      One problem here is the leaks, and another is the failed boot – maybe there should be a limit on the number of threads that are spawned when the agent files are read during startup.
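
The suggested limit could look roughly like the following. This is a minimal sketch of reading per-agent config files through a fixed-size thread pool, not the actual startup code in Jenkins core: the class `BoundedNodeLoader`, the `maxThreads` parameter, and the simplification of loading each file into a string are all assumptions made for illustration.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical loader, not a Jenkins API.
public class BoundedNodeLoader {
    /** Reads nodesDir/<agent>/config.xml for every agent, using at most maxThreads threads. */
    public static Map<String, String> loadAll(Path nodesDir, int maxThreads) {
        // A fixed pool caps the thread count no matter how many directories exist.
        ExecutorService pool = Executors.newFixedThreadPool(maxThreads);
        try {
            Map<String, Future<String>> pending = new TreeMap<>();
            try (DirectoryStream<Path> dirs = Files.newDirectoryStream(nodesDir)) {
                for (Path dir : dirs) {
                    Path config = dir.resolve("config.xml");
                    if (Files.isRegularFile(config)) {
                        Callable<String> task =
                            () -> new String(Files.readAllBytes(config), StandardCharsets.UTF_8);
                        pending.put(dir.getFileName().toString(), pool.submit(task));
                    }
                }
            }
            // Collect results; excess tasks wait in the pool's queue instead of spawning threads.
            Map<String, String> loaded = new TreeMap<>();
            for (Map.Entry<String, Future<String>> entry : pending.entrySet()) {
                loaded.put(entry.getKey(), entry.getValue().get());
            }
            return loaded;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        } finally {
            pool.shutdown();
        }
    }
}
```

With a bound like this, 8000 leftover directories would slow startup down rather than exhaust the OS thread limit.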

          [JENKINS-75099] Agent config files not deleted after agent termination

          Basil Crow added a comment (edited)

          I've been looking at the logs of jenkins.model.Nodes. Jenkins attempts to delete the directories of terminated agents as expected, and no exceptions are thrown. However, the directories are not actually removed from disk

          rmoszczy Any chance you could share a sanitized version of these logs with us?

          FYI vlatombe and jglick as you worked on some Nodes persistence changes to this class recently, though I am not sure if they are related or not. (My apologies for the unnecessary ping if those changes are unrelated.)

          Vincent Latombe added a comment

          https://github.com/jenkinsci/jenkins/pull/8979/ could possibly be related (more closely https://github.com/jenkinsci/jenkins/pull/8979/files#diff-500f803e8ff7fd79aac34475870f3901509fad0dbc2d0b3a799c8d9edd7d2c0eL268-L289) though it is unclear how the directory deletion would be prevented here.

          Looking at the Oracle Cloud Infrastructure Compute plugin, there is a fishy AsyncPeriodicWork implementation (https://github.com/jenkinsci/oracle-cloud-infrastructure-compute-plugin/blob/b72cdd84c40354e2a93d90d07b0334d8624b3769/src/main/java/com/oracle/cloud/baremetal/jenkins/BaremetalCloudInstanceMonitor.java#L16) that seems to duplicate what a retention strategy would do.

          Radoslaw Moszczynski added a comment

          basil The logs from jenkins.model.Nodes are not very interesting:

          Jan 10, 2025 12:09:37 PM FINE jenkins.model.Nodes
          deleting /var/jenkins_home/nodes/oci-compute-fc77f674-7747-4503-8a67-fdbbb341974d
          Jan 10, 2025 12:10:40 PM FINE jenkins.model.Nodes
          deleting /var/jenkins_home/nodes/oci-compute-0ab58ad4-022e-408d-9d16-917b4c54cb94
          Jan 10, 2025 12:10:41 PM FINE jenkins.model.Nodes
          deleting /var/jenkins_home/nodes/oci-compute-3dcc0169-5e1a-4cfe-bec8-dac5324de2c4
          Jan 10, 2025 12:10:42 PM FINE jenkins.model.Nodes
          deleting /var/jenkins_home/nodes/oci-compute-7ac655ab-eab6-429d-8046-706c3d997d40

          In the example above, the 1st and 4th directories were not actually deleted, while the 2nd and 3rd ones are gone from disk.

          vlatombe Looking at the logs of BaremetalCloudInstanceMonitor in our instance, it iterates over agents periodically but doesn't attempt to delete any itself, so it probably isn't interfering with any core Jenkins mechanisms.

          Vincent Latombe added a comment (edited)

          These logs would indicate the issue is lower level, since the log is immediately before the actual Util delete call (https://github.com/jenkinsci/jenkins/blob/0235a800b80342d6959ec7de78e41a72f1ae8f03/core/src/main/java/jenkins/model/Nodes.java#L139-L140)
          Or maybe some race condition with another thread saving the node configuration?
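
To make the race hypothesis concrete, the sketch below replays one possible interleaving in a single thread: a directory is deleted, and a save that arrives afterwards recreates it. This is an illustration only and does not claim to reproduce what Jenkins does. `DeleteSaveRace`, `deleteNodeDir`, and `saveNodeConfig` are hypothetical stand-ins, and the assumption that a save recreates a missing parent directory is exactly the part that would need to be confirmed in core. The inotifywatch output in the description shows zero delete events, so this interleaving alone may not explain the leak.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.stream.Stream;

// Hypothetical stand-ins, not Jenkins code.
public class DeleteSaveRace {
    /** Stand-in for node removal: deletes the directory and everything inside it. */
    static void deleteNodeDir(Path dir) {
        try (Stream<Path> paths = Files.walk(dir)) {
            // Reverse order so files are deleted before their parent directory.
            paths.sorted(Comparator.reverseOrder()).forEach(p -> {
                try {
                    Files.delete(p);
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
            });
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Stand-in for a save: ASSUMES the parent directory is created when missing. */
    static void saveNodeConfig(Path dir, String xml) {
        try {
            Files.createDirectories(dir);
            Files.write(dir.resolve("config.xml"), xml.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Replays "delete, then late save" and reports whether the directory is back on disk. */
    public static boolean dirSurvivesLateSave(Path nodesDir) {
        Path dir = nodesDir.resolve("agent-1");
        saveNodeConfig(dir, "<slave/>");                  // agent exists
        deleteNodeDir(dir);                               // thread A: removal
        saveNodeConfig(dir, "<slave>offline</slave>");    // thread B: save lands after the delete
        return Files.isDirectory(dir);
    }
}
```

If saves and deletions of the same node are not serialized, the last writer wins, which would match a directory that was logged as "deleting" but still exists with a recent config.xml.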

          Radoslaw Moszczynski added a comment

          If you have any specific pointers where to look or any suggestions for enabling additional logging, I'd be happy to do some more digging.

            Assignee: Unassigned
            Reporter: Radoslaw Moszczynski (rmoszczy)
            Votes: 0
            Watchers: 4