JENKINS-68409

kubernetes plugin can create a new pod every 10s when something goes wrong

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Major
    • Component: kubernetes-plugin
    • Labels: None
    • Environment: recent Jenkins/plugin running on Linux, Kubernetes on Linux, various versions
    • Plugin version: 3578.vb_9a_92ea_9845a_

      When a build creates a pod on Kubernetes and Jenkins cannot verify the pod, Jenkins deletes it and creates a new one every 10s. We are not aware of any configuration parameter that controls this pace. This can flood the Kubernetes cluster with a huge number of pod creations and deletions. We would like the build to fail after a number of failures instead of deleting and creating pods forever; at a minimum, we would like each new pod to wait progressively longer, similar to a Kubernetes CrashLoopBackOff.

      In production, we had situations where Kubernetes could not report the pod status within the time Jenkins expected, and the resulting flood of pod creations and deletions left each node holding more than 8000 deleted containers while running over the pod count limit. Clearing that would have taken hours even with the Jenkins feed turned off; we eventually restored the nodes from backup. Although this bug is not considered the root cause of the slow response, it caused a "pod storm" that brought the Kubernetes cluster to its knees and required this drastic node restore.

      In testing, we had a situation where the connection to Kubernetes did not support websockets, so Jenkins could not read the pod status via what appears to be a "watch" on the pod, failing on a request path similar to the following in the Kubernetes ingress log: '/api/v1/namespaces/<ns>/pods?<podname>&allowWatchBookmarks=true&watch=true'
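      For context, the plugin talks to the API server through the fabric8 Kubernetes client, and a pod watch of the kind behind that request path looks roughly like the sketch below. The namespace and pod name are placeholders, and the exact client API may differ by version.

      import io.fabric8.kubernetes.api.model.Pod;
      import io.fabric8.kubernetes.client.KubernetesClient;
      import io.fabric8.kubernetes.client.KubernetesClientBuilder;
      import io.fabric8.kubernetes.client.Watch;
      import io.fabric8.kubernetes.client.Watcher;
      import io.fabric8.kubernetes.client.WatcherException;

      public class PodWatchSketch {
          public static void main(String[] args) throws InterruptedException {
              try (KubernetesClient client = new KubernetesClientBuilder().build();
                   // Opens a websocket-backed watch, i.e. a request like
                   // /api/v1/namespaces/<ns>/pods?...&watch=true
                   Watch watch = client.pods().inNamespace("build-ns").withName("my-agent-pod")
                       .watch(new Watcher<Pod>() {
                           @Override
                           public void eventReceived(Action action, Pod pod) {
                               System.out.println(action + " -> " + pod.getStatus().getPhase());
                           }

                           @Override
                           public void onClose(WatcherException cause) {
                               // If the path to the API server (e.g. an ingress) does not
                               // support websockets, the watch fails here and Jenkins
                               // never sees the pod become ready.
                               System.out.println("watch closed: " + cause);
                           }
                       })) {
                  Thread.sleep(60_000);  // keep the watch open for a minute
              }
          }
      }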

      This started the pod creation/deletion loop. In the slightly obfuscated console log attached, the log line "Still waiting to schedule task" appears around the failure on the watch request in the k8s ingress log shown above, and the build keeps recreating the pod every 10s until it is aborted manually.


          peng wu created issue -

gabriel cuadros added a comment -

Hello, I had this issue as well. I have been working on it in my free time and reached this part of the code; I would like to share what I found:

https://github.com/jenkinsci/kubernetes-plugin/blob/master/src/main/java/org/csanchez/jenkins/plugins/kubernetes/KubernetesProvisioningLimits.java#L91

There is an option, podTemplate.getInstanceCap(), that can theoretically be set and could possibly stop the massive pod creation. If it is not set, it defaults to Integer.MAX_VALUE, which is a very large value. I have not found a way to set this value on the pod template yet, but I will report back if I find anything new:

https://github.com/jenkinsci/kubernetes-plugin/blob/master/src/main/java/org/csanchez/jenkins/plugins/kubernetes/PodTemplate.java#L328
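To make the mechanism concrete, here is a heavily simplified, hypothetical model of that per-template counter and cap check. Class and method names are invented for illustration; only getInstanceCap() and the Integer.MAX_VALUE default come from the linked code.

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Heavily simplified, hypothetical model of the provisioning-limits check:
// a per-template pod counter compared against the template's instance cap.
class ProvisioningLimitsSketch {
    private final Map<String, AtomicInteger> podCounts = new ConcurrentHashMap<>();

    /** Returns true if one more pod may be created for this template. */
    boolean register(String templateId, int instanceCap) {
        // instanceCap defaults to Integer.MAX_VALUE when not set on the template.
        AtomicInteger count = podCounts.computeIfAbsent(templateId, k -> new AtomicInteger());
        if (count.incrementAndGet() > instanceCap) {
            count.decrementAndGet();  // cap reached, roll the increment back
            return false;
        }
        return true;
    }

    /** Called when a pod goes away; frees a slot under the cap. */
    void unregister(String templateId) {
        AtomicInteger count = podCounts.get(templateId);
        if (count != null) {
            count.decrementAndGet();
        }
    }
}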

gabriel cuadros added a comment -

I think I found the way while doing some experiments. Try the following: add instanceCap: 1 to your pod template definition. In my lab it accepts the parameter; let me know if it works for you:

podTemplate(inheritFrom: "<yourpodtemplate>", showRawYaml: false, instanceCap: 1){}
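(As far as I can tell, instanceCap caps how many pods may run concurrently for the template, rather than rate-limiting the recreation of failed pods.)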
          gabriel cuadros made changes -
          Assignee New: gabriel cuadros [ gabocuadros ]

peng wu added a comment -

Thanks for the quick response. I added the instanceCap like the following:

podTemplate(
    instanceCap: 2,

...

Then I simulated the issue by deleting the jnlp container of the pod, and Jenkins still kept recreating the pod non-stop. I got the same result with instanceCap: 1.

To simulate, on the only available worker node of the k8s cluster, run the following:

while true; do echo $SECONDS; sudo crictl ps --name jnlp -q | xargs -r sudo crictl rm -f; sleep 1; done
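(The loop above force-removes the jnlp container every second, so the agent can never stay connected and Jenkins keeps deleting and recreating the pod.)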

gabriel cuadros added a comment -

Hello, good morning, I just woke up. Let me check what could cause the difference, and I will get back to you if I find anything useful to share.

gabriel cuadros added a comment -

I have removed the instanceCap, and my instance creation still stops after a while, so I believe something else in my laboratory is stopping the massive pod creation. I saw that we have an admission controller called EventRateLimit with a per-namespace configuration; I believe that is what is actually stopping the pod creation, not the plugin itself:

https://github.com/kubernetes/kubernetes/blob/v1.24.0/plugin/pkg/admission/eventratelimit/apis/eventratelimit/types.go#L54

Anyway, let's focus on the plugin. The problem is that when the pod terminates, it disconnects from Jenkins, which causes an unregister here, decreasing the count back to 0 running instances:

https://github.com/jenkinsci/kubernetes-plugin/blob/master/src/main/java/org/csanchez/jenkins/plugins/kubernetes/KubernetesProvisioningLimits.java#L107

I tried to see whether pod retention would let the failed pod stay connected to Jenkins, so that it counts as a running instance and lets us reach the instance cap, but that only makes the pod stay in Kubernetes; the pod still gets disconnected from Jenkins regardless, because of this part (the disconnection of the agent's jnlp process from the Jenkins controller also causes an unregister):

https://github.com/jenkinsci/kubernetes-plugin/blob/master/src/main/java/org/csanchez/jenkins/plugins/kubernetes/KubernetesSlave.java#L322

At this point, the only solution that comes to mind is to adjust the code to add a counter that tracks how many times a pod has failed and to check it before the next pod creation, around this part of the code:

https://github.com/jenkinsci/kubernetes-plugin/blob/master/src/main/java/org/csanchez/jenkins/plugins/kubernetes/KubernetesLauncher.java#L94

The other solution would be on the Kubernetes side, as a customizable admission controller, but that is unknown territory for me.
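A rough, hypothetical sketch of that proposal: count launch failures per template and delay, then refuse, the next pod creation, which would also give the CrashLoopBackOff-style backoff requested in the description. All class names, thresholds, and hook points below are invented for illustration; this is not the plugin's actual code.

import java.time.Duration;
import java.time.Instant;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch: count launch failures per pod template and back off
// (or give up) before the launcher creates the next pod.
class LaunchFailureTrackerSketch {
    private static final int MAX_FAILURES = 5;                    // made-up limit
    private static final Duration BASE_DELAY = Duration.ofSeconds(10);

    private final Map<String, Integer> failures = new ConcurrentHashMap<>();
    private final Map<String, Instant> nextAttempt = new ConcurrentHashMap<>();

    /** Checked before the launcher creates a pod for this template. */
    boolean mayLaunch(String templateId) {
        if (failures.getOrDefault(templateId, 0) >= MAX_FAILURES) {
            return false;  // give up and let the build fail
        }
        Instant earliest = nextAttempt.getOrDefault(templateId, Instant.MIN);
        return !Instant.now().isBefore(earliest);  // false while still backing off
    }

    /** Called when the agent never connected and the pod was deleted. */
    void recordFailure(String templateId) {
        int n = failures.merge(templateId, 1, Integer::sum);
        // Double the wait on each failure, like a Kubernetes CrashLoopBackOff.
        Duration delay = BASE_DELAY.multipliedBy(1L << Math.min(n, 6));
        nextAttempt.put(templateId, Instant.now().plus(delay));
    }

    /** Called when an agent connects successfully; resets the counter. */
    void recordSuccess(String templateId) {
        failures.remove(templateId);
        nextAttempt.remove(templateId);
    }
}

recordSuccess would be wired to the agent's connect event, so a template that eventually works never hits the limit.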
          gabriel cuadros made changes -
          Attachment New: image-2022-05-05-12-13-17-227.png [ 57979 ]
          gabriel cuadros made changes -
          Attachment New: image-2022-05-05-12-15-37-793.png [ 57980 ]
          gabriel cuadros made changes -
          Attachment New: image-2022-05-05-12-16-17-714.png [ 57981 ]

Assignee: gabriel cuadros (gabocuadros)
Reporter: peng wu (wu105)
Votes: 1
Watchers: 3
