
JENKINS-68409: kubernetes plugin can create a new pod every 10s when something goes wrong

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Major
    • Component: kubernetes-plugin
    • Labels: None
    • Environment: recent Jenkins/plugin running on Linux, Kubernetes on Linux, various versions
    • Released As: 3578.vb_9a_92ea_9845a_

      When a build creates a pod on Kubernetes and Jenkins cannot verify the pod, it will delete it and recreate one every 10s. We are not aware of any configuration parameters that can control the pace. This can overwhelm the Kubernetes cluster with a huge number of pod creations and deletions. We would like the build to fail after a number of failures instead of deleting and recreating pods forever. At a minimum, we would like each new pod to wait progressively longer, similar to the Kubernetes CrashLoopBackOff behavior.

       In production, we had situations where Kubernetes could not report the pod status within the time expected by Jenkins, and the resulting flood of pod creations and deletions left each node holding more than 8000 deleted containers while running over the pod count limit. This would have taken hours to clear even with the Jenkins feed turned off; we eventually restored the nodes from backup. Although this bug is not considered the root cause of the slow responses, it caused a "pod storm" that brought the Kubernetes cluster to its knees and required this drastic node restore.

       In testing, we had a situation where the connection to Kubernetes did not support WebSocket, so Jenkins could not read the pod status via what appears to be a "watch" on the pod, failing on a request path similar to the following in the Kubernetes ingress log: '/api/v1/namespaces/<ns>/pods?<podname>&allowWatchBookmarks=true&watch=true'

       This started the pod creation/deletion loop. In the slightly obfuscated console log attached, the log line "Still waiting to schedule task" appears around the failure on the watch request shown above in the k8s ingress log, and the build recreates the pod every 10s until it is aborted manually.

          [JENKINS-68409] kubernetes plugin can create a new pod every 10s when something goes wrong

          gabriel cuadros added a comment -

          Hello, I had this issue as well. I was working on this in my free time and reached this part of the code; I would like to share what I found:

          https://github.com/jenkinsci/kubernetes-plugin/blob/master/src/main/java/org/csanchez/jenkins/plugins/kubernetes/KubernetesProvisioningLimits.java#L91

           

          There is an option, podTemplate.getInstanceCap(), that can theoretically be set and could possibly stop the massive pod creation. If not set, it defaults to Integer.MAX_VALUE, which is a very big value. I have not yet found the way to set this value on the pod template, but I will report back if I find anything new.

          https://github.com/jenkinsci/kubernetes-plugin/blob/master/src/main/java/org/csanchez/jenkins/plugins/kubernetes/PodTemplate.java#L328
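
          As a minimal, self-contained sketch of what such a per-template cap check amounts to (the class name, field names, and structure below are simplified assumptions for illustration, not the plugin's actual code):

          ```java
          // Simplified illustration of a per-template instance cap, in the spirit of the
          // linked KubernetesProvisioningLimits.register() check; the class and field
          // names here are assumptions, not the plugin's actual API.
          import java.util.Map;
          import java.util.concurrent.ConcurrentHashMap;
          import java.util.concurrent.atomic.AtomicInteger;

          public class InstanceCapSketch {
              private final Map<String, AtomicInteger> podsPerTemplate = new ConcurrentHashMap<>();

              /** Returns true if another pod may be provisioned for this template. */
              public boolean register(String templateName, int instanceCap) {
                  // instanceCap is Integer.MAX_VALUE unless it is set on the pod template
                  AtomicInteger count = podsPerTemplate.computeIfAbsent(templateName, k -> new AtomicInteger());
                  if (count.incrementAndGet() > instanceCap) {
                      count.decrementAndGet(); // roll back and refuse to provision another pod
                      return false;
                  }
                  return true;
              }
          }
          ```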

           


          gabriel cuadros added a comment -

          I think I found the way while doing some experiments. Try the following: add instanceCap: 1 to your pod template definition. In my laboratory it is accepting the parameter; let me know if it works for you.

           

          podTemplate(inheritFrom: "<yourpodtemplate>", showRawYaml: false, instanceCap: 1){}


          peng wu added a comment -

          Thanks for the quick response.  I added the instanceCap like the following:

          podTemplate(
              instanceCap: 2,

          ...

          Then I simulated the issue by deleting the jnlp container of the pod, and Jenkins still keeps recreating the pod non-stop. I got the same result with instanceCap: 1.

           

          The simulation is on the only available worker node of the k8s cluster; run the following:

          while true; do echo $SECONDS; sudo crictl ps --name jnlp -q|xargs -r sudo crictl rm -f ; sleep 1; done

           


          gabriel cuadros added a comment -

          Hello, good morning, I just woke up. Let me check what could be causing the difference and I will get back to you if I find anything useful to share.

          gabriel cuadros added a comment -

          I have erased the instanceCap and my instance creation still stopped after a while, so I believe something else in my laboratory is stopping the massive pod creation. I saw that we have an admission controller called EventRateLimit with a configuration for the namespace; I believe that is what is actually stopping the pod creation, not the plugin itself.

          https://github.com/kubernetes/kubernetes/blob/v1.24.0/plugin/pkg/admission/eventratelimit/apis/eventratelimit/types.go#L54

          Anyway, let's focus on the plugin. The problem is that when the pod terminates it disconnects from Jenkins, which causes an unregister here https://github.com/jenkinsci/kubernetes-plugin/blob/master/src/main/java/org/csanchez/jenkins/plugins/kubernetes/KubernetesProvisioningLimits.java#L107 , decreasing the count back to 0 running instances. I tried to see whether pod retention lets the failed pod stay connected to Jenkins, so that it counts as a running instance and lets us reach the instance cap, but that only keeps the pod in Kubernetes; the pod still gets disconnected from Jenkins no matter what, due to this part (the disconnection of the Jenkins agent JNLP from the Jenkins master also causes an unregister):

          https://github.com/jenkinsci/kubernetes-plugin/blob/master/src/main/java/org/csanchez/jenkins/plugins/kubernetes/KubernetesSlave.java#L322 . At this point, the only solution that comes to mind is to adjust the code to add a counter that tracks how many times a pod has failed and to check it before the next pod creation, around this part of the code: https://github.com/jenkinsci/kubernetes-plugin/blob/master/src/main/java/org/csanchez/jenkins/plugins/kubernetes/KubernetesLauncher.java#L94
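
          A minimal sketch of the kind of per-template failure counter described above, for illustration only; the class, method names, and threshold are hypothetical and are not part of the plugin's actual code:

          ```java
          // Hypothetical sketch of the counter suggested above: count consecutive pod
          // failures per template and refuse to launch another pod once a (made-up)
          // threshold is exceeded. Names and the threshold are illustrative assumptions.
          import java.util.Map;
          import java.util.concurrent.ConcurrentHashMap;
          import java.util.concurrent.atomic.AtomicInteger;

          public class PodFailureTracker {
              private static final int MAX_CONSECUTIVE_FAILURES = 5; // arbitrary example value

              private final Map<String, AtomicInteger> failures = new ConcurrentHashMap<>();

              /** Check before creating a pod; false once a template has failed too many times in a row. */
              public boolean mayLaunch(String templateName) {
                  return failures.computeIfAbsent(templateName, k -> new AtomicInteger()).get()
                          < MAX_CONSECUTIVE_FAILURES;
              }

              /** Record a pod that never came up (for example, the agent never connected). */
              public void recordFailure(String templateName) {
                  failures.computeIfAbsent(templateName, k -> new AtomicInteger()).incrementAndGet();
              }

              /** Reset the streak once a pod connects successfully. */
              public void recordSuccess(String templateName) {
                  failures.remove(templateName);
              }
          }
          ```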

           

           

          The other solution would be on the Kubernetes side and should be a customizable admission controller, but that is unknown territory for me.


          gabriel cuadros added a comment -

          Hello, I cloned the kubernetes plugin and simulated this and other scenarios on it to start solving it, but in the latest version of the plugin this error is not happening: the pod creation stops once it has failed once. Let me know if you need help setting up the plugin locally for your tests. Here is the procedure; once you clone the kubernetes plugin, run the following commands:

           

          mvn verify
          mvn -Dhost="0.0.0.0" hpi:run  # start Jenkins on your local IP (important: a local kubernetes cluster cannot reach Jenkins on localhost)
          

          If you lack the hpi dependency, you need to manually download https://github.com/jenkinsci/maven-hpi-plugin/releases/tag/maven-hpi-plugin-3.26 and run

          mvn clean install  # then repeat the step above

          Then, once your local kubernetes cluster is up (I set mine up easily with Docker Desktop using its GUI option; it creates a single-node cluster and also adds the credentials to your .kube/config), you need to configure the Kubernetes cloud in Jenkins and add a simple pod template with 2 containers (a random container that fails plus the jnlp container). Of course, the jnlp container does not need to fail and will not fail, because we usually do not touch it.

           

          If port 50000 on your local IP does not respond as expected, you need to set the inbound agent TCP port to 50000 on the Configure Global Security (configureSecurity) page in the Jenkins admin UI.

          Here is the pod template I tested in my use case, and the pipeline itself:

          podTemplate(inheritFrom: "base") {
              node(POD_LABEL) {
                  container('jnlp') {
                      stage('test') {
                          echo "sample"
                      }
                  }
              }
          }

           

           

           


          peng wu added a comment -

          Hi Gabriel,

          Thanks for the quick turnaround.

          We upgraded to 3568.vde94f6b_41b_c8 a month ago and 3580.v78271e5631dc is now available for upgrade.

          Do you think I can upgrade and test it instead?

          We have a standard Jenkins installation and are not sure what is involved in performing the procedure you described.

           

          Thanks,

           

          Peng


          gabriel cuadros added a comment -

          Wrong link, my bad; the other PR was too old. This is the correct ticket that implemented this; it seems to be a very new feature: https://issues.jenkins.io/browse/JENKINS-66822

          gabriel cuadros added a comment -

          Hello Peng, give it a try and upgrade the plugin to this version: https://github.com/jenkinsci/kubernetes-plugin/releases/tag/3578.vb_9a_92ea_9845a_ . According to the release notes, the change was made in that specific version, and the only dependency update after it was the Kubernetes API client; that is all.

          peng wu added a comment -

          Hi Gabriel,  I upgraded to 3580.v78271e5631dc and indeed it is one and done!  You can close this ticket.  Many thanks!


          peng wu added a comment -

          Hi Gabriel,

           

          Can we reopen this? Under a different condition, we still get a new pod every 20s or so. It goes roughly like this: the Jenkins server was under pressure, most likely over the limit of open files, and some builds got into a pod-every-20s loop. When the Jenkins server is restarted (e.g., sudo systemctl restart jenkins), the loop stops; some 15 minutes later, a new pod starts and the build succeeds.

           

          The build log would show the following on the first pod only:

          ```

          Still waiting to schedule task
          ‘jenkins-agent-6c797bae-cd98-48-3vpsb-3p8m8’ is offline

          ```

          Then it goes on to create a new pod every 20 to 30s and dutifully delete it. The loop stops when Jenkins is restarted. Some 15 minutes later, a new pod is created and finishes the build successfully, starting with the following log lines:

          ```

          Ready to run at Mon Jul 04 10:35:12 EDT 2022
          Resuming build at Mon Jul 04 10:35:12 EDT 2022 after Jenkins restart

          ```

          I will include a ConsoleText log from such a build, and the ingress logs from the Kubernetes cluster, to provide timing and detail. In this particular sample, the ingress recorded two more pods than Jenkins, and two pods were created a second time with exactly the same name.

           

          The best solution would be to add progressively longer delays before each new pod. If the build is canceled, Jenkins is likely to retry with a new build and repeat the loop, still adding fuel to the fire.
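
          As a rough illustration of the progressively longer wait being requested (similar in spirit to Kubernetes CrashLoopBackOff, which doubles the delay up to a cap), a backoff schedule could look like the sketch below; the class name, methods, and numbers are assumptions for illustration, not anything the plugin implements today:

          ```java
          // Hypothetical backoff schedule: double the wait after each failed pod, cap it
          // at five minutes, and reset it once a pod connects. All values are examples.
          import java.time.Duration;

          public class PodBackoff {
              private static final Duration INITIAL = Duration.ofSeconds(10); // example starting delay
              private static final Duration CAP = Duration.ofMinutes(5);      // example upper bound
              private int consecutiveFailures = 0;

              /** Delay to wait before creating the next pod. */
              public Duration nextDelay() {
                  long seconds = INITIAL.getSeconds() << Math.min(consecutiveFailures, 10);
                  return seconds >= CAP.getSeconds() ? CAP : Duration.ofSeconds(seconds);
              }

              public void onPodFailed()    { consecutiveFailures++; }
              public void onPodConnected() { consecutiveFailures = 0; }
          }
          ```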

           


          peng wu added a comment -

          jenkins-68409-reopen.pub.txt

          gabriel cuadros added a comment -

          Hello, good morning, I just got back. Let me check it and I will give you feedback, thanks!

          Marcin Cieślak added a comment -

          Could this be the same as JENKINS-47615?

          gabriel cuadros added a comment -

          wu105, is your Jenkins master running outside of Kubernetes by any chance? I could not reproduce the scenario you described. I am moving on to JENKINS-47615 to see if it is related; thanks saper.

          gabriel cuadros added a comment - edited

          Hello wu105, the error you described is a completely different error. Checking the source code, I see that this line appears when the agent cannot connect to the Jenkins master for some reason: the jnlp container cannot connect to the Jenkins master to report its status, so the Jenkins master thinks the pod never came up, destroys it, and creates more and more. This is not a Kubernetes status that can be reported to Jenkins like the ones mentioned in the solution ticket (CreateContainerError, ImagePullBackOff, etc.): https://github.com/jenkinsci/kubernetes-plugin/pull/1118

          Still waiting to schedule task
          jenkins-agent-6c797bae-cd98-48-3vpsb-3p8m8 is offline


          gabriel cuadros added a comment -

          So currently there are 3 ways to get Jenkins to generate pods at crazy rates:

          • creating a good pod template with a bad image (ImagePullBackOff error): fixed with kubernetes plugin versions higher than 3578.vb_9a_92ea_9845a_
          • creating a bad pod template with a good image (CreateContainerError error): fixed with kubernetes plugin versions higher than 3578.vb_9a_92ea_9845a_
          • creating a good pod template with a good image, but jnlp does not connect to the Jenkins master (solution still being investigated)

            Assignee: gabriel cuadros (gabocuadros)
            Reporter: peng wu (wu105)
            Votes: 1
            Watchers: 3