
JENKINS-68409: kubernetes plugin can create a new pod every 10s when something goes wrong

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Major
    • Component: kubernetes-plugin
    • Labels: None
    • Environment: recent Jenkins/plugin running on Linux, Kubernetes on Linux, various versions
    • Released As: 3578.vb_9a_92ea_9845a_

      When a build creates a pod on Kubernetes and Jenkins cannot verify the pod, it will delete it and recreate one every 10s. We are not aware of any configuration parameters that can control the pace. This can overwhelm the Kubernetes cluster with a huge number of pod creations and deletions. We would like the build to fail after a number of failures instead of deleting and recreating pods forever. At a minimum, we would like each new pod to wait progressively longer, similar to the Kubernetes CrashLoopBackOff behavior.

       In production, we had situations where Kubernetes could not report the pod status within the time expected by Jenkins, and the resulting flood of pod creations and deletions left each node holding more than 8000 deleted containers while running over the pod count limit. This would have taken hours to clear even with the Jenkins feed turned off; we eventually restored the nodes from backup. Although this bug is not considered the root cause of the slow responses, it caused a "pod storm" that brought the Kubernetes cluster to its knees and required this drastic node restore.

       In testing, we had a situation where the connection to Kubernetes did not support WebSocket, so Jenkins could not read the pod status via what appears to be a "watch" on the pod, failing on a request path similar to the following in the Kubernetes ingress log: '/api/v1/namespaces/<ns>/pods?<podname>&allowWatchBookmarks=true&watch=true'

       This started the pod creation/deletion loop. In the slightly obfuscated console log attached, the log line "Still waiting to schedule task" appears around the failure on the watch request shown above in the k8s ingress log, and the build recreates the pod every 10s until it is aborted manually.

          [JENKINS-68409] kubernetes plugin can create a new pod every 10s when something goes wrong

          gabriel cuadros added a comment -

          Hello, I had this issue as well. I was working on this in my free time and reached this part of the code; I would like to share what I found:

          https://github.com/jenkinsci/kubernetes-plugin/blob/master/src/main/java/org/csanchez/jenkins/plugins/kubernetes/KubernetesProvisioningLimits.java#L91

           

          There is an option, podTemplate.getInstanceCap(), that can theoretically be set and could possibly stop the massive pod creation. If not set, it defaults to Integer.MAX_VALUE, which is a very big value. I have not yet found the way to set this value on the pod template, but I will report back if I find anything new.

          https://github.com/jenkinsci/kubernetes-plugin/blob/master/src/main/java/org/csanchez/jenkins/plugins/kubernetes/PodTemplate.java#L328
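
          As a minimal, self-contained sketch of what such a per-template cap check amounts to (the class name, field names, and structure below are simplified assumptions for illustration, not the plugin's actual code):

          ```java
          // Simplified illustration of a per-template instance cap, in the spirit of the
          // linked KubernetesProvisioningLimits.register() check; the class and field
          // names here are assumptions, not the plugin's actual API.
          import java.util.Map;
          import java.util.concurrent.ConcurrentHashMap;
          import java.util.concurrent.atomic.AtomicInteger;

          public class InstanceCapSketch {
              private final Map<String, AtomicInteger> podsPerTemplate = new ConcurrentHashMap<>();

              /** Returns true if another pod may be provisioned for this template. */
              public boolean register(String templateName, int instanceCap) {
                  // instanceCap is Integer.MAX_VALUE unless it is set on the pod template
                  AtomicInteger count = podsPerTemplate.computeIfAbsent(templateName, k -> new AtomicInteger());
                  if (count.incrementAndGet() > instanceCap) {
                      count.decrementAndGet(); // roll back and refuse to provision another pod
                      return false;
                  }
                  return true;
              }
          }
          ```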

           


          gabriel cuadros added a comment -

          I think I found the way while doing some experiments. Try the following: add instanceCap: 1 to your pod template definition. In my laboratory it is accepting the parameter; let me know if it works for you.

           

          podTemplate(inheritFrom: "<yourpodtemplate>", showRawYaml: false, instanceCap: 1){}


          peng wu added a comment -

          Thanks for the quick response.  I added the instanceCap like the following:

          podTemplate(
              instanceCap: 2,

          ...

          Then I simulated the issue by deleting the jnlp container of the pod, and Jenkins still keeps recreating the pod non-stop. I got the same result with instanceCap: 1.

           

          The simulation is on the only available worker node of the k8s cluster; run the following:

          while true; do echo $SECONDS; sudo crictl ps --name jnlp -q|xargs -r sudo crictl rm -f ; sleep 1; done

           


          gabriel cuadros added a comment -

          Hello, good morning, I just woke up. Let me check what could be causing the difference and I will get back to you if I find anything useful to share.

          gabriel cuadros added a comment -

          I have erased the instanceCap and my instance creation still stopped after a while, so I believe something else in my laboratory is stopping the massive pod creation. I saw that we have an admission controller called EventRateLimit with a configuration for the namespace; I believe that is what is actually stopping the pod creation, not the plugin itself.

          https://github.com/kubernetes/kubernetes/blob/v1.24.0/plugin/pkg/admission/eventratelimit/apis/eventratelimit/types.go#L54

          Anyway, let's focus on the plugin. The problem is that when the pod terminates it disconnects from Jenkins, which causes an unregister here https://github.com/jenkinsci/kubernetes-plugin/blob/master/src/main/java/org/csanchez/jenkins/plugins/kubernetes/KubernetesProvisioningLimits.java#L107 , decreasing the count back to 0 running instances. I tried to see whether pod retention lets the failed pod stay connected to Jenkins, so that it counts as a running instance and lets us reach the instance cap, but that only keeps the pod in Kubernetes; the pod still gets disconnected from Jenkins no matter what, due to this part (the disconnection of the Jenkins agent JNLP from the Jenkins master also causes an unregister):

          https://github.com/jenkinsci/kubernetes-plugin/blob/master/src/main/java/org/csanchez/jenkins/plugins/kubernetes/KubernetesSlave.java#L322 . At this point, the only solution that comes to mind is to adjust the code to add a counter that tracks how many times a pod has failed and to check it before the next pod creation, around this part of the code: https://github.com/jenkinsci/kubernetes-plugin/blob/master/src/main/java/org/csanchez/jenkins/plugins/kubernetes/KubernetesLauncher.java#L94
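
          A minimal sketch of the kind of per-template failure counter described above, for illustration only; the class, method names, and threshold are hypothetical and are not part of the plugin's actual code:

          ```java
          // Hypothetical sketch of the counter suggested above: count consecutive pod
          // failures per template and refuse to launch another pod once a (made-up)
          // threshold is exceeded. Names and the threshold are illustrative assumptions.
          import java.util.Map;
          import java.util.concurrent.ConcurrentHashMap;
          import java.util.concurrent.atomic.AtomicInteger;

          public class PodFailureTracker {
              private static final int MAX_CONSECUTIVE_FAILURES = 5; // arbitrary example value

              private final Map<String, AtomicInteger> failures = new ConcurrentHashMap<>();

              /** Check before creating a pod; false once a template has failed too many times in a row. */
              public boolean mayLaunch(String templateName) {
                  return failures.computeIfAbsent(templateName, k -> new AtomicInteger()).get()
                          < MAX_CONSECUTIVE_FAILURES;
              }

              /** Record a pod that never came up (for example, the agent never connected). */
              public void recordFailure(String templateName) {
                  failures.computeIfAbsent(templateName, k -> new AtomicInteger()).incrementAndGet();
              }

              /** Reset the streak once a pod connects successfully. */
              public void recordSuccess(String templateName) {
                  failures.remove(templateName);
              }
          }
          ```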

           

           

          The other solution would be on the Kubernetes side and should be a customizable admission controller, but that is unknown territory for me.


          gabriel cuadros added a comment -

          Hello, I cloned the kubernetes plugin and simulated this and other scenarios on it to start solving it, but in the latest version of the plugin this error is not happening: the pod creation stops once it has failed once. Let me know if you need help setting up the plugin locally for your tests. Here is the procedure; once you clone the kubernetes plugin, run the following commands:

           

          mvn verify
          mvn -Dhost="0.0.0.0" hpi:run  # start Jenkins on your local IP (important: a local kubernetes cluster cannot reach Jenkins on localhost)
          

          If you lack the hpi dependency, you need to manually download https://github.com/jenkinsci/maven-hpi-plugin/releases/tag/maven-hpi-plugin-3.26 and run

          mvn clean install  # then repeat the step above

          Then, once your local kubernetes cluster is up (I set mine up easily with Docker Desktop using its GUI option; it creates a single-node cluster and also adds the credentials to your .kube/config), you need to configure the Kubernetes cloud in Jenkins and add a simple pod template with 2 containers (a random container that fails plus the jnlp container). Of course, the jnlp container does not need to fail and will not fail, because we usually do not touch it.

           

          If port 50000 on your local IP does not respond as expected, you need to set the inbound agent TCP port to 50000 on the Configure Global Security (configureSecurity) page in the Jenkins admin UI.

          Here is the pod template I tested in my use case, and the pipeline itself:

          podTemplate(inheritFrom: "base") {
              node(POD_LABEL) {
                  container('jnlp') {
                      stage('test') {
                          echo "sample"
                      }
                  }
              }
          }

           

           

           


          peng wu added a comment -

          Hi Gabriel,

          Thanks for the quick turnaround.

          We upgraded to 3568.vde94f6b_41b_c8 a month ago and 3580.v78271e5631dc is now available for upgrade.

          Do you think I can upgrade and test it instead?

          We have a standard Jenkins installation and are not sure what is involved in performing the procedure you described.

           

          Thanks,

           

          Peng


          gabriel cuadros added a comment -

          Wrong link, my bad; the other PR was too old. This is the correct ticket that implemented this; it seems to be a very new feature: https://issues.jenkins.io/browse/JENKINS-66822

          gabriel cuadros added a comment -

          Hello Peng, give it a try and upgrade the plugin to this version: https://github.com/jenkinsci/kubernetes-plugin/releases/tag/3578.vb_9a_92ea_9845a_ . According to the release notes, the change was made in that specific version, and the only dependency update after it was the Kubernetes API client; that is all.

          peng wu added a comment -

          Hi Gabriel,  I upgraded to 3580.v78271e5631dc and indeed it is one and done!  You can close this ticket.  Many thanks!


          peng wu added a comment -

          Hi Gabriel,

           

          Can we reopen this? Under a different condition, we still get a new pod every 20s or so. It goes roughly like this: the Jenkins server was under pressure, most likely over the limit of open files, and some builds got into a pod-every-20s loop. When the Jenkins server is restarted (e.g., sudo systemctl restart jenkins), the loop stops; some 15 minutes later, a new pod starts and the build succeeds.

           

          The build log would show the following on the first pod only:

          ```

          Still waiting to schedule task
          ‘jenkins-agent-6c797bae-cd98-48-3vpsb-3p8m8’ is offline

          ```

          Then it goes on to create a new pod every 20 to 30s and dutifully delete it. The loop stops when Jenkins is restarted. Some 15 minutes later, a new pod is created and finishes the build successfully, starting with the following log lines:

          ```

          Ready to run at Mon Jul 04 10:35:12 EDT 2022
          Resuming build at Mon Jul 04 10:35:12 EDT 2022 after Jenkins restart

          ```

          I will include a ConsoleText log from such a build, and the ingress logs from the Kubernetes cluster, to provide timing and detail. In this particular sample, the ingress recorded two more pods than Jenkins, and two pods were created a second time with exactly the same name.

           

          The best solution would be to add progressively longer delays before each new pod. If the build is canceled, Jenkins is likely to retry with a new build and repeat the loop, still adding fuel to the fire.
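
          As a rough illustration of the progressively longer wait being requested (similar in spirit to Kubernetes CrashLoopBackOff, which doubles the delay up to a cap), a backoff schedule could look like the sketch below; the class name, methods, and numbers are assumptions for illustration, not anything the plugin implements today:

          ```java
          // Hypothetical backoff schedule: double the wait after each failed pod, cap it
          // at five minutes, and reset it once a pod connects. All values are examples.
          import java.time.Duration;

          public class PodBackoff {
              private static final Duration INITIAL = Duration.ofSeconds(10); // example starting delay
              private static final Duration CAP = Duration.ofMinutes(5);      // example upper bound
              private int consecutiveFailures = 0;

              /** Delay to wait before creating the next pod. */
              public Duration nextDelay() {
                  long seconds = INITIAL.getSeconds() << Math.min(consecutiveFailures, 10);
                  return seconds >= CAP.getSeconds() ? CAP : Duration.ofSeconds(seconds);
              }

              public void onPodFailed()    { consecutiveFailures++; }
              public void onPodConnected() { consecutiveFailures = 0; }
          }
          ```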

           


          peng wu added a comment -

          jenkins-68409-reopen.pub.txt

          gabriel cuadros added a comment -

          Hello, good morning, I just got back. Let me check it and I will give you feedback, thanks!

          Marcin Cieślak added a comment -

          Could this be the same as JENKINS-47615?

          gabriel cuadros added a comment -

          wu105, is your Jenkins master running outside of Kubernetes by any chance? I could not reproduce the scenario you described. I am moving on to JENKINS-47615 to see if it is related; thanks saper.

          gabriel cuadros added a comment - edited

          Hello wu105, the error you described is a completely different error. Checking the source code, I see that this line appears when the agent cannot connect to the Jenkins master for some reason: the jnlp container cannot connect to the Jenkins master to report its status, so the Jenkins master thinks the pod never came up, destroys it, and creates more and more. This is not a Kubernetes status that can be reported to Jenkins like the ones mentioned in the solution ticket (CreateContainerError, ImagePullBackOff, etc.): https://github.com/jenkinsci/kubernetes-plugin/pull/1118

          Still waiting to schedule task
          jenkins-agent-6c797bae-cd98-48-3vpsb-3p8m8 is offline


          gabriel cuadros added a comment -

          So currently there are 3 ways to get Jenkins to generate pods at crazy rates:

          • creating a good pod template with a bad image (ImagePullBackOff error): fixed with kubernetes plugin versions higher than 3578.vb_9a_92ea_9845a_
          • creating a bad pod template with a good image (CreateContainerError error): fixed with kubernetes plugin versions higher than 3578.vb_9a_92ea_9845a_
          • creating a good pod template with a good image, but jnlp does not connect to the Jenkins master (solution still being investigated)

            Assignee: gabriel cuadros (gabocuadros)
            Reporter: peng wu (wu105)
            Votes: 1
            Watchers: 3