JENKINS-68931

nodes/config.xml not cleaned up after failed provisioning


    • Type: Bug
    • Resolution: Unresolved
    • Priority: Critical
    • kubernetes-plugin
    • None
    • Plugin tags:
      - 3646.va_b_469a_7666b_7
      - 3651.v908e7db_10d06

      Jenkins version: 2.332.2

      Dear community,

      We are currently facing an issue with the two latest Kubernetes plugin versions (see environment): the node configurations `/var/jenkins_home/nodes/$pod/config.xml` are not cleaned up when a pod fails to start. Obsolete `config.xml` files accumulate on the filesystem and, in our case, eventually prevented Jenkins from starting up.

       

      We observed the following behaviour:

      1. Create a new job in Jenkins and specify an invalid pod template (e.g. request more memory than is currently available in your resource quota).
      2. Start a new build for this job:
        1. Jenkins creates a new `/var/jenkins_home/nodes/$pod/config.xml` on its filesystem.
        2. The Kubernetes API rejects the pod, as expected.
        3. Jenkins retries creating the pod each time a new executor is available (if you have 10 executors, it retries ten times per second).
      3. Now trigger this job multiple times (e.g. 50 times) so that the build queue grows.
      4. Jenkins again creates the `config.xml` files, but now it no longer cleans up any `../nodes/$pod/config.xml` files of failed pods.
      5. After a while, the number of `config.xml` files from failed builds grows to an unreasonable number (in our case: 138 thousand).
      6. Once you restart Jenkins, it tries to load all nodes from disk, which does not finish in time (loading 138k obsolete configs takes a while).
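
To check whether an installation is affected, the number of node configurations on disk can be counted. The following is a minimal sketch, not part of the plugin or of Jenkins: the function name `count_node_configs` is made up for illustration, and the path is the Jenkins home used in this report, which may differ on other installations.

```python
from pathlib import Path


def count_node_configs(nodes_dir):
    """Count immediate subdirectories of nodes_dir that contain a config.xml."""
    root = Path(nodes_dir)
    if not root.is_dir():
        # Nothing to count if the nodes directory does not exist.
        return 0
    return sum(1 for entry in root.iterdir() if (entry / "config.xml").is_file())


if __name__ == "__main__":
    # Path taken from this report; adjust it for your installation.
    print(count_node_configs("/var/jenkins_home/nodes"))
```

A count that keeps growing while the job is stuck in the queue would match the behaviour described in steps 4 and 5.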

       

      The issue can be prevented by downgrading the Kubernetes plugin to v3600 (https://github.com/jenkinsci/kubernetes-plugin/releases/tag/3600.v144b_cd192ca_a_).
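
If the obsolete directories have already piled up, they could in principle be removed by hand before restarting. The sketch below is only an illustration of that idea and is not an official procedure: the helper names and the `live_names` parameter are hypothetical, and it assumes that Jenkins is stopped and that you supply the names of all nodes that must be kept (permanent agents also live under `nodes/`). It defaults to a dry run that deletes nothing.

```python
from pathlib import Path
import shutil


def find_stale_node_dirs(nodes_dir, live_names):
    """Return node directories holding a config.xml whose name is not in live_names."""
    live = set(live_names)
    root = Path(nodes_dir)
    if not root.is_dir():
        return []
    return sorted(
        entry for entry in root.iterdir()
        if entry.name not in live and (entry / "config.xml").is_file()
    )


def remove_stale_node_dirs(nodes_dir, live_names, dry_run=True):
    """Delete stale node directories; with dry_run=True only report them."""
    stale = find_stale_node_dirs(nodes_dir, live_names)
    if not dry_run:
        for entry in stale:
            # Removes the whole node directory, including its config.xml.
            shutil.rmtree(entry)
    return stale
```

Calling `remove_stale_node_dirs("/var/jenkins_home/nodes", keep)` first with the default `dry_run=True` lets you review the returned list before anything is deleted. Take a backup before running it with `dry_run=False`.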

       

      We suspect that the bug was introduced by the changes to how dead nodes are reaped: https://github.com/jenkinsci/kubernetes-plugin/pull/1170

       

      If you need any additional information from us, please let us know. We are happy to contribute to the project in any way we can.

       

      Best regards,

      Florian.

            Assignee: Unassigned
            Reporter: Florian Buchmeier (fbuchmeier_abi)
            Votes: 2
            Watchers: 5
