[JENKINS-68931] nodes/config.xml not cleaned up after failed provisioning

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Critical
    • Component: kubernetes-plugin
    • Labels: None
    • Plugin tags:
      - 3646.va_b_469a_7666b_7
      - 3651.v908e7db_10d06

      Jenkins version: 2.332.2

      Dear community,

      We are currently facing an issue with the two latest Kubernetes plugin versions (see environment): the node configurations `/var/jenkins_home/nodes/$pod/config.xml` are not cleaned up properly when a pod fails to start. Obsolete `config.xml` files accumulate on the filesystem and, in our case, eventually prevent Jenkins from starting up.

       

      We observed the following behaviour:

      1. Create a new job in Jenkins and specify an invalid pod template (e.g. request more memory than is currently available in your resource quota).
      2. Start a new build for this job:
        1. Jenkins creates a new `/var/jenkins_home/nodes/$pod/config.xml` in its filesystem.
        2. The Kubernetes API rejects the pod, as expected.
        3. Jenkins retries creating the pod whenever a new executor is available (with 10 executors, it retries ten times per second).
      3. Now trigger this job multiple times so that the build queue grows (e.g. 50 times).
      4. Jenkins again creates the `config.xml` files, but now it no longer cleans up any of the failed `../nodes/$pod/config.xml` files.
      5. After a while, the number of `config.xml` files from failed builds has grown to an unreasonable size (in our case: 138 thousand).
      6. Once you restart Jenkins, it tries to load all nodes from disk, which does not complete in time (loading 138k obsolete configs takes a while).
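
To watch for the accumulation described in the steps above, a small script can count the per-node `config.xml` files. This is only a minimal sketch: it assumes the `/var/jenkins_home/nodes/$pod/config.xml` layout from this report, and the function name `count_node_configs` is made up for illustration and is not part of Jenkins or the Kubernetes plugin.

```python
from pathlib import Path


def count_node_configs(nodes_dir):
    """Count <nodes_dir>/<node>/config.xml files, looking one level deep."""
    root = Path(nodes_dir)
    if not root.is_dir():
        return 0
    return sum(
        1
        for child in root.iterdir()
        if child.is_dir() and (child / "config.xml").is_file()
    )


if __name__ == "__main__":
    import sys

    # Default path is the one quoted in this report; adjust for your JENKINS_HOME.
    path = sys.argv[1] if len(sys.argv) > 1 else "/var/jenkins_home/nodes"
    print(f"{count_node_configs(path)} node config.xml files under {path}")
```

Running this periodically (for example from cron) and alerting on a sudden increase may reveal the problem before a restart is needed.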

       

      The issue can be prevented by downgrading the Kubernetes plugin to v3600 (https://github.com/jenkinsci/kubernetes-plugin/releases/tag/3600.v144b_cd192ca_a_).

       

      Since there were some changes to how dead nodes are reaped, we suspect that the following pull request introduced this bug: https://github.com/jenkinsci/kubernetes-plugin/pull/1170

       

      If you need any additional information from us, please let us know. We are happy to contribute to the project in any way we can.

       

      Best regards,

      Florian.

          Bruno Köferli added a comment -

          We are facing the same error. For us, this is a big problem, which is why I changed the priority to critical.

          Mohammad added a comment -

          We are also severely impacted by this issue. Our Jenkins master fails to restart with an OOM error because it tries to load all the (undeleted) node config.xml files from disk. Every time, we have to clean up the node config.xml files manually to get Jenkins up and running again.

          Does anyone know of a workaround until there is a permanent fix for this issue?
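
The manual cleanup mentioned in this comment could be scripted roughly as follows. This is a hedged sketch, not an official workaround: it assumes Jenkins is stopped while it runs, that every directory under `nodes/` not named in a caller-supplied `keep` set belongs to a failed pod, and that the helper names `find_stale_nodes` and `remove_stale_nodes` are hypothetical. It defaults to a dry run so nothing is deleted by accident.

```python
import shutil
from pathlib import Path


def find_stale_nodes(nodes_dir, keep):
    """Return node directories that hold a config.xml and are not in `keep`."""
    root = Path(nodes_dir)
    return sorted(
        child
        for child in root.iterdir()
        if child.is_dir()
        and (child / "config.xml").is_file()
        and child.name not in keep
    )


def remove_stale_nodes(nodes_dir, keep, dry_run=True):
    """Delete stale node directories; with dry_run=True only report them."""
    stale = find_stale_nodes(nodes_dir, keep)
    for node in stale:
        if dry_run:
            print(f"would remove {node}")
        else:
            shutil.rmtree(node)
    return [node.name for node in stale]
```

For example, `remove_stale_nodes("/var/jenkins_home/nodes", keep={"my-permanent-agent"})` lists what would be deleted, where `my-permanent-agent` stands in for any statically configured agent you want to preserve. Pass `dry_run=False` only after checking that list.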

          Dennis Keitzel added a comment - edited

          Affected as well.

            Assignee: Unassigned
            Reporter: Florian Buchmeier (fbuchmeier_abi)
            Votes: 2
            Watchers: 5