
JENKINS-68409: kubernetes plugin can create a new pod every 10s when something goes wrong

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Major
    • Component: kubernetes-plugin
    • Labels: None
    • Environment: recent Jenkins/plugin running on Linux, Kubernetes on Linux, various versions
    • Released As: 3578.vb_9a_92ea_9845a_

      When a build creates a pod on Kubernetes and Jenkins cannot verify the pod, Jenkins deletes it and creates a replacement every 10s. We are not aware of any configuration parameter that controls this pace. This can overwhelm the Kubernetes cluster with a huge number of pod creations and deletions. We would like the build to fail after a number of failures instead of deleting and creating pods forever; at the very least, we would like each new pod to wait progressively longer, similar to a Kubernetes CrashLoopBackOff.
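      To illustrate the pacing we have in mind, here is a minimal sketch in Java. The class, method names, and numbers are hypothetical, not actual plugin code; it only shows the capped, exponentially backed-off retry we are asking for:

```
import java.util.concurrent.TimeUnit;

public class PodRetryPolicy {
    private static final int MAX_ATTEMPTS = 5;          // fail the build after this many pods
    private static final long BASE_DELAY_SECONDS = 10;  // the plugin's current fixed pace

    public static void main(String[] args) throws InterruptedException {
        for (int attempt = 1; attempt <= MAX_ATTEMPTS; attempt++) {
            if (tryProvisionPod(attempt)) {
                System.out.println("pod is up, build proceeds");
                return;
            }
            // Exponential backoff: 10s, 20s, 40s, ... like a CrashLoopBackOff,
            // instead of a flat 10s between every delete/create cycle.
            long delay = BASE_DELAY_SECONDS << (attempt - 1);
            System.out.printf("pod attempt %d failed, retrying in %ds%n", attempt, delay);
            TimeUnit.SECONDS.sleep(delay);
        }
        // Give up instead of looping forever.
        throw new IllegalStateException("giving up after " + MAX_ATTEMPTS + " pods; failing the build");
    }

    private static boolean tryProvisionPod(int attempt) {
        // Stand-in for "create the pod and verify the agent connected".
        return false;
    }
}
```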

      In production, we had situations where Kubernetes could not report the pod status within the time Jenkins expected, and the resulting flood of pod creations and deletions left each node holding more than 8000 deleted containers while running over the pod count limit, which took hours to clear even with the Jenkins feed turned off - we eventually restored the nodes from backup. Although this bug is not considered the root cause of the slow responses, it caused a "pod storm" that brought the Kubernetes cluster to its knees and required this drastic node restore.

      In testing, we had a situation where the connection to Kubernetes did not support websockets, so Jenkins could not read the pod status via what appears to be a "watch" on the pod, failing on a request path similar to the following in the Kubernetes ingress log: '/api/v1/namespaces/<ns>/pods?<podname>&allowWatchBookmarks=true&watch=true'

      This started the pod creation/deletion loop. In the slightly obfuscated console log attached, the log line "Still waiting to schedule task" appears around the time of the failed watch request in the k8s ingress log shown above, and the build keeps recreating the pod every 10s until it is aborted manually.
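      For reference, a watch like the sketch below exercises the same websocket path and can confirm whether an ingress passes the upgrade through. It is a minimal standalone example using the fabric8 kubernetes-client (the library the plugin is built on), assuming the fabric8 6.x builder API; the namespace and pod name are placeholders:

```
import io.fabric8.kubernetes.api.model.Pod;
import io.fabric8.kubernetes.client.KubernetesClient;
import io.fabric8.kubernetes.client.KubernetesClientBuilder;
import io.fabric8.kubernetes.client.Watcher;
import io.fabric8.kubernetes.client.WatcherException;

public class PodWatchCheck {
    public static void main(String[] args) throws InterruptedException {
        // Connects using the current kubeconfig context, as kubectl would.
        try (KubernetesClient client = new KubernetesClientBuilder().build()) {
            client.pods()
                  .inNamespace("jenkins")          // placeholder namespace
                  .withName("jenkins-agent-test")  // placeholder pod name
                  .watch(new Watcher<Pod>() {
                      @Override
                      public void eventReceived(Action action, Pod pod) {
                          System.out.println(action + ": " + pod.getStatus().getPhase());
                      }

                      @Override
                      public void onClose(WatcherException cause) {
                          // An ingress that rejects the websocket upgrade surfaces here.
                          System.out.println("watch closed: " + cause);
                      }
                  });
            Thread.sleep(60_000); // keep the watch open for a minute
        }
    }
}
```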


          gabriel cuadros added a comment - Wrong link, my bad; the other PR was too old. This is the correct ticket that implemented this; it seems to be a very new feature: https://issues.jenkins.io/browse/JENKINS-66822

          gabriel cuadros added a comment - Hello Peng, please give it a try and upgrade the plugin to this version: https://github.com/jenkinsci/kubernetes-plugin/releases/tag/3578.vb_9a_92ea_9845a_ . According to the release file, the change was made in that specific version, and the only dependency update was to the Kubernetes API client, that is all.

          peng wu added a comment - Hi Gabriel, I upgraded to 3580.v78271e5631dc and indeed it is one and done! You can close this ticket. Many thanks!

          peng wu added a comment -

          Hi Gabriel,

          Can we reopen this? Under a different condition, we still get a new pod every 20s or so. It goes roughly like this: the Jenkins server was under pressure, most likely over its open-file limit, and some builds got into a pod-every-20s loop. When the Jenkins server was restarted (e.g., sudo systemctl restart jenkins), the loop stopped; some 15 minutes later, a new pod started and the build succeeded.

          The build log shows this on the first pod only:

          ```
          Still waiting to schedule task
          ‘jenkins-agent-6c797bae-cd98-48-3vpsb-3p8m8’ is offline
          ```

          Then it goes on to create a new pod every 20 to 30s and dutifully delete it. The loop stops when Jenkins is restarted. Some 15 minutes later, a new pod is created and finishes the build successfully, starting with the following log lines:

          ```
          Ready to run at Mon Jul 04 10:35:12 EDT 2022
          Resuming build at Mon Jul 04 10:35:12 EDT 2022 after Jenkins restart
          ```

          I will attach a consoleText log from such a build, and the ingress logs from the Kubernetes cluster to provide timing and detail. In this particular sample, the ingress recorded two more pods than Jenkins did, and two pods were re-created using exactly the same name.

          The best solution would be to add progressively longer delays before each new pod. If the build is canceled, Jenkins will likely retry with a new build and repeat the cycle, still adding fuel to the fire.

          peng wu added a comment - jenkins-68409-reopen.pub.txt

          gabriel cuadros added a comment - Hello, good morning. I just got back; let me check it and I will get back to you with feedback, thanks!

          Marcin Cieślak added a comment - Could this be the same as JENKINS-47615?

          gabriel cuadros added a comment - wu105, is your Jenkins master running outside of Kubernetes by any chance? I could not find the scenario you described; I am moving over to JENKINS-47615 to see if it is related. Thanks saper.

          gabriel cuadros added a comment (edited) -

          Hello wu105, the error you described is a completely different error. Checking the source code, I see that this line appears when the agent cannot connect to the Jenkins master for some reason: JNLP cannot connect to the Jenkins master to report the agent's status, so the master thinks the pod never came up, destroys it, and creates more and more. This is not a Kubernetes status that can be reported to Jenkins like those mentioned before in the solution ticket (CreateContainerError, ImagePullBackOff, etc.): https://github.com/jenkinsci/kubernetes-plugin/pull/1118

          ```
          Still waiting to schedule task
          jenkins-agent-6c797bae-cd98-48-3vpsb-3p8m8 is offline
          ```

          gabriel cuadros added a comment - So currently there are three ways to get Jenkins to generate pods at crazy rates:

          1. Creating a good pod template with a bad image (error ImagePullBackOff): fixed with kubernetes plugin versions higher than 3578.vb_9a_92ea_9845a_ (a reproduction sketch for this case follows the list).

          2. Creating a bad pod template with a good image (error CreateContainerError): fixed with kubernetes plugin versions higher than 3578.vb_9a_92ea_9845a_.

          3. Creating a good pod template with a good image, but JNLP does not connect to the Jenkins master (solution still being investigated).
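          For the first case, a minimal standalone reproduction sketch using the fabric8 client, outside Jenkins. This assumes the fabric8 6.x-style API; the namespace and image are placeholders, and the image is intentionally unpullable so the pod lands in ImagePullBackOff:

```
import io.fabric8.kubernetes.api.model.Pod;
import io.fabric8.kubernetes.api.model.PodBuilder;
import io.fabric8.kubernetes.client.KubernetesClient;
import io.fabric8.kubernetes.client.KubernetesClientBuilder;

public class ImagePullRepro {
    public static void main(String[] args) {
        try (KubernetesClient client = new KubernetesClientBuilder().build()) {
            // A well-formed pod spec pointing at an image that cannot be pulled.
            Pod pod = new PodBuilder()
                    .withNewMetadata().withGenerateName("jnlp-repro-").endMetadata()
                    .withNewSpec()
                        .addNewContainer()
                            .withName("jnlp")
                            .withImage("example.invalid/no-such-image:latest") // intentionally bad
                        .endContainer()
                        .withRestartPolicy("Never")
                    .endSpec()
                    .build();
            client.pods().inNamespace("jenkins").resource(pod).create(); // placeholder namespace
        }
    }
}
```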


            Assignee: gabriel cuadros (gabocuadros)
            Reporter: peng wu (wu105)
            Votes: 1
            Watchers: 3
