[JENKINS-67100] KubernetesProvisioningLimits has race condition during initialization

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Major
    • kubernetes-plugin
    • None
    • 3546.v6103d89542d6

      After a Jenkins restart, some of our Jenkins instances get stuck and cannot spawn any new Kubernetes agents. As we had already faced issues with the KubernetesProvisioningLimits class (see https://issues.jenkins.io/browse/JENKINS-66484 and https://github.com/jenkinsci/kubernetes-plugin/pull/1028), I re-created a logger for org.csanchez.jenkins.plugins.kubernetes.KubernetesProvisioningLimits and, indeed, the stuck instances were showing "impossible" data, e.g.:

       

      Nov 10, 2021 2:57:09 AM FINEST org.csanchez.jenkins.plugins.kubernetes.KubernetesProvisioningLimits
      kubernetes global limit reached: 32/4. Cannot add 1 more!

       

      My best guess is that the KubernetesProvisioningLimits initialization phase has a race condition, specifically when the instance is restarted with jobs in the queue. It seems that KubernetesSlave instances are created for the queued items before the KubernetesProvisioningLimits#init method is invoked.
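
      To illustrate how an over-cap count such as 32/4 could appear, here is a minimal sketch. It is not the plugin's code: NaiveLimits, its init and register methods, and the agent names are hypothetical, and it assumes that init adds every already-known agent to the counter without checking the cap. Under that assumption, agents registered before init end up counted twice.

```java
import java.util.List;

// Hypothetical stand-in for a limits tracker; NOT the plugin's actual code.
class NaiveLimits {
    private final int cap;  // configured global cap
    private int count;      // agents currently accounted for

    NaiveLimits(int cap) {
        this.cap = cap;
    }

    // Assumption: init adds every already-known agent without checking the cap.
    synchronized void init(List<String> existingAgents) {
        count += existingAgents.size();
    }

    // Refuses the registration once the cap would be exceeded.
    synchronized boolean register(int n) {
        if (count + n <= cap) {
            count += n;
            return true;
        }
        System.out.println("global limit reached: " + count + "/" + cap
                + ". Cannot add " + n + " more!");
        return false;
    }

    synchronized int count() {
        return count;
    }

    public static void main(String[] args) {
        NaiveLimits limits = new NaiveLimits(4);
        // Three agents are registered for queued items before init runs.
        for (int i = 0; i < 3; i++) {
            limits.register(1);
        }
        // init then counts the same three agents a second time.
        limits.init(List.of("agent-1", "agent-2", "agent-3"));
        // The counter is now 6 with a cap of 4, so this call is refused.
        limits.register(1);
    }
}
```

      In this sketch the last call is refused with a count of 6 against a cap of 4, and nothing ever brings the counter back under the cap, which would match the "stuck" behavior described above.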


          Fred G added a comment - edited

          Ping! This issue randomly affects ~250 Eclipse projects and blocks their CI instance at https://ci.eclipse.org.

           

          See also: https://bugs.eclipse.org/bugs/show_bug.cgi?id=577166


          Mikaël Barbero added a comment -

          It seems that this race condition is a blocker only when combined with Kubernetes quotas on the number of pods. That could explain why it is not experienced more widely by the community.

          AFAICT, KubernetesProvisioningLimits#init can be triggered after some calls to KubernetesProvisioningLimits#register have been made, producing those impossible counts (e.g. 32/4).

          I identified that those registrations happen when some agents were running before Jenkins was terminated. In this case, NodeProvisionerInvoker can be triggered before KubernetesProvisioningLimits#init has run.

          I'll provide a patch shortly.
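
          One possible way to make the ordering of init and register irrelevant is to initialize lazily under the same lock. The sketch below is only an illustration of that idea, not the patch itself: GuardedLimits, ensureInitialized and the IntSupplier that counts pre-existing agents are all hypothetical names. It also assumes the supplier reports only agents that have not already been passed to register.

```java
import java.util.function.IntSupplier;

// Hypothetical sketch of lazy, lock-guarded initialization; NOT the plugin's code.
class GuardedLimits {
    private final int cap;
    private final IntSupplier existingAgentCounter; // assumed: counts agents known at startup
    private boolean initialized;
    private int count;

    GuardedLimits(int cap, IntSupplier existingAgentCounter) {
        this.cap = cap;
        this.existingAgentCounter = existingAgentCounter;
    }

    // Runs at most once, whichever entry point is reached first.
    // Callers must hold the instance lock (all callers are synchronized).
    private void ensureInitialized() {
        if (!initialized) {
            count = existingAgentCounter.getAsInt(); // assign rather than add
            initialized = true;
        }
    }

    synchronized void init() {
        ensureInitialized();
    }

    synchronized boolean register(int n) {
        ensureInitialized();
        if (count + n > cap) {
            return false;
        }
        count += n;
        return true;
    }

    synchronized void unregister(int n) {
        ensureInitialized();
        count = Math.max(0, count - n);
    }

    synchronized int count() {
        ensureInitialized();
        return count;
    }

    public static void main(String[] args) {
        GuardedLimits limits = new GuardedLimits(4, () -> 2);
        limits.register(1); // triggers initialization first, then registers
        limits.init();      // a late init call is now a no-op
        System.out.println("count after late init: " + limits.count());
    }
}
```

          Because every public method goes through ensureInitialized while holding the lock, a late init call cannot add to a counter that register has already touched.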

          Fred G added a comment -

          While it happens a lot less often, we still see issues like:

          Mar 22, 2022 6:00:34 AM FINEST org.csanchez.jenkins.plugins.kubernetes.KubernetesProvisioningLimits
          kubernetes global limit reached: 61/8. Cannot add 1 more!


          Piotrek Zygielo added a comment - https://gitlab.eclipse.org/eclipsefdn/helpdesk/-/issues/3918#note_1293373: 13122/34...

            vlatombe Vincent Latombe
            mbarbero Mikaël Barbero