JENKINS-68409

kubernetes plugin can create a new pod every 10s when something goes wrong

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Major
    • Component: kubernetes-plugin
    • Labels: None
    • Environment: recent Jenkins/plugin running on Linux, Kubernetes on Linux, various versions
    • Plugin version: 3578.vb_9a_92ea_9845a_

      When a build creates a pod on Kubernetes and Jenkins cannot verify the pod, Jenkins deletes it and creates a new one every 10s. We are not aware of any configuration parameter that controls this pace. This can flood the Kubernetes cluster with a huge number of pod creations and deletions. We would like the build to fail after a number of failures instead of deleting and creating pods forever; at a minimum, we would like each new pod to wait progressively longer, similar to a Kubernetes CrashLoopBackOff.

      In production, we had situations where Kubernetes could not report the pod status within the time Jenkins expected, and the resulting flood of pod creations and deletions left each node holding more than 8000 deleted containers while running over the pod count limit. Clearing that would have taken hours even with the Jenkins feed turned off; we eventually restored the nodes from backup. Although this bug is not considered the root cause of the slow response, it caused a "pod storm" that brought the Kubernetes cluster to its knees and required this drastic node restore.

      In testing, we had a situation where the connection to Kubernetes did not support websockets, so Jenkins could not read the pod status via what appears to be a "watch" on the pod, failing on a request path similar to the following in the Kubernetes ingress log: '/api/v1/namespaces/<ns>/pods?<podname>&allowWatchBookmarks=true&watch=true'
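      For context, the plugin talks to the API server through the fabric8 Kubernetes client, and a pod watch of the kind behind that request path looks roughly like the sketch below. The namespace and pod name are placeholders, and the exact client API may differ by version.

      import io.fabric8.kubernetes.api.model.Pod;
      import io.fabric8.kubernetes.client.KubernetesClient;
      import io.fabric8.kubernetes.client.KubernetesClientBuilder;
      import io.fabric8.kubernetes.client.Watch;
      import io.fabric8.kubernetes.client.Watcher;
      import io.fabric8.kubernetes.client.WatcherException;

      public class PodWatchSketch {
          public static void main(String[] args) throws InterruptedException {
              try (KubernetesClient client = new KubernetesClientBuilder().build();
                   // Opens a websocket-backed watch, i.e. a request like
                   // /api/v1/namespaces/<ns>/pods?...&watch=true
                   Watch watch = client.pods().inNamespace("build-ns").withName("my-agent-pod")
                       .watch(new Watcher<Pod>() {
                           @Override
                           public void eventReceived(Action action, Pod pod) {
                               System.out.println(action + " -> " + pod.getStatus().getPhase());
                           }

                           @Override
                           public void onClose(WatcherException cause) {
                               // If the path to the API server (e.g. an ingress) does not
                               // support websockets, the watch fails here and Jenkins
                               // never sees the pod become ready.
                               System.out.println("watch closed: " + cause);
                           }
                       })) {
                  Thread.sleep(60_000);  // keep the watch open for a minute
              }
          }
      }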

      This started the pod creation/deletion loop. In the slightly obfuscated console log attached, the log line "Still waiting to schedule task" appears around the failure on the watch request in the k8s ingress log shown above, and the build keeps recreating the pod every 10s until it is aborted manually.


          peng wu created issue -

gabriel cuadros added a comment -

Hello, I had this issue as well. I have been working on it in my free time and reached this part of the code; I would like to share what I found:

https://github.com/jenkinsci/kubernetes-plugin/blob/master/src/main/java/org/csanchez/jenkins/plugins/kubernetes/KubernetesProvisioningLimits.java#L91

There is an option, podTemplate.getInstanceCap(), that can theoretically be set and could possibly stop the massive pod creation. If it is not set, it defaults to Integer.MAX_VALUE, which is a very large value. I have not found a way to set this value on the pod template yet, but I will report back if I find anything new:

https://github.com/jenkinsci/kubernetes-plugin/blob/master/src/main/java/org/csanchez/jenkins/plugins/kubernetes/PodTemplate.java#L328
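To make the mechanism concrete, here is a heavily simplified, hypothetical model of that per-template counter and cap check. Class and method names are invented for illustration; only getInstanceCap() and the Integer.MAX_VALUE default come from the linked code.

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Heavily simplified, hypothetical model of the provisioning-limits check:
// a per-template pod counter compared against the template's instance cap.
class ProvisioningLimitsSketch {
    private final Map<String, AtomicInteger> podCounts = new ConcurrentHashMap<>();

    /** Returns true if one more pod may be created for this template. */
    boolean register(String templateId, int instanceCap) {
        // instanceCap defaults to Integer.MAX_VALUE when not set on the template.
        AtomicInteger count = podCounts.computeIfAbsent(templateId, k -> new AtomicInteger());
        if (count.incrementAndGet() > instanceCap) {
            count.decrementAndGet();  // cap reached, roll the increment back
            return false;
        }
        return true;
    }

    /** Called when a pod goes away; frees a slot under the cap. */
    void unregister(String templateId) {
        AtomicInteger count = podCounts.get(templateId);
        if (count != null) {
            count.decrementAndGet();
        }
    }
}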

gabriel cuadros added a comment -

I think I found the way while doing some experiments. Try the following: add instanceCap: 1 to your pod template definition. In my lab it accepts the parameter; let me know if it works for you:

podTemplate(inheritFrom: "<yourpodtemplate>", showRawYaml: false, instanceCap: 1){}
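(As far as I can tell, instanceCap caps how many pods may run concurrently for the template, rather than rate-limiting the recreation of failed pods.)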
          gabriel cuadros made changes -
          Assignee New: gabriel cuadros [ gabocuadros ]

peng wu added a comment -

Thanks for the quick response. I added the instanceCap like the following:

podTemplate(
    instanceCap: 2,

...

Then I simulated the issue by deleting the jnlp container of the pod, and Jenkins still kept recreating the pod non-stop. I got the same result with instanceCap: 1.

To simulate, on the only available worker node of the k8s cluster, run the following:

while true; do echo $SECONDS; sudo crictl ps --name jnlp -q | xargs -r sudo crictl rm -f; sleep 1; done
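(The loop above force-removes the jnlp container every second, so the agent can never stay connected and Jenkins keeps deleting and recreating the pod.)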

gabriel cuadros added a comment -

Hello, good morning, I just woke up. Let me check what could cause the difference, and I will get back to you if I find anything useful to share.

gabriel cuadros added a comment -

I have removed the instanceCap, and my instance creation still stops after a while, so I believe something else in my laboratory is stopping the massive pod creation. I saw that we have an admission controller called EventRateLimit with a per-namespace configuration; I believe that is what is actually stopping the pod creation, not the plugin itself:

https://github.com/kubernetes/kubernetes/blob/v1.24.0/plugin/pkg/admission/eventratelimit/apis/eventratelimit/types.go#L54

Anyway, let's focus on the plugin. The problem is that when the pod terminates, it disconnects from Jenkins, which causes an unregister here, decreasing the count back to 0 running instances:

https://github.com/jenkinsci/kubernetes-plugin/blob/master/src/main/java/org/csanchez/jenkins/plugins/kubernetes/KubernetesProvisioningLimits.java#L107

I tried to see whether pod retention would let the failed pod stay connected to Jenkins, so that it counts as a running instance and lets us reach the instance cap, but that only makes the pod stay in Kubernetes; the pod still gets disconnected from Jenkins regardless, because of this part (the disconnection of the agent's jnlp process from the Jenkins controller also causes an unregister):

https://github.com/jenkinsci/kubernetes-plugin/blob/master/src/main/java/org/csanchez/jenkins/plugins/kubernetes/KubernetesSlave.java#L322

At this point, the only solution that comes to mind is to adjust the code to add a counter that tracks how many times a pod has failed and to check it before the next pod creation, around this part of the code:

https://github.com/jenkinsci/kubernetes-plugin/blob/master/src/main/java/org/csanchez/jenkins/plugins/kubernetes/KubernetesLauncher.java#L94

The other solution would be on the Kubernetes side, as a customizable admission controller, but that is unknown territory for me.
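A rough, hypothetical sketch of that proposal: count launch failures per template and delay, then refuse, the next pod creation, which would also give the CrashLoopBackOff-style backoff requested in the description. All class names, thresholds, and hook points below are invented for illustration; this is not the plugin's actual code.

import java.time.Duration;
import java.time.Instant;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch: count launch failures per pod template and back off
// (or give up) before the launcher creates the next pod.
class LaunchFailureTrackerSketch {
    private static final int MAX_FAILURES = 5;                    // made-up limit
    private static final Duration BASE_DELAY = Duration.ofSeconds(10);

    private final Map<String, Integer> failures = new ConcurrentHashMap<>();
    private final Map<String, Instant> nextAttempt = new ConcurrentHashMap<>();

    /** Checked before the launcher creates a pod for this template. */
    boolean mayLaunch(String templateId) {
        if (failures.getOrDefault(templateId, 0) >= MAX_FAILURES) {
            return false;  // give up and let the build fail
        }
        Instant earliest = nextAttempt.getOrDefault(templateId, Instant.MIN);
        return !Instant.now().isBefore(earliest);  // false while still backing off
    }

    /** Called when the agent never connected and the pod was deleted. */
    void recordFailure(String templateId) {
        int n = failures.merge(templateId, 1, Integer::sum);
        // Double the wait on each failure, like a Kubernetes CrashLoopBackOff.
        Duration delay = BASE_DELAY.multipliedBy(1L << Math.min(n, 6));
        nextAttempt.put(templateId, Instant.now().plus(delay));
    }

    /** Called when an agent connects successfully; resets the counter. */
    void recordSuccess(String templateId) {
        failures.remove(templateId);
        nextAttempt.remove(templateId);
    }
}

recordSuccess would be wired to the agent's connect event, so a template that eventually works never hits the limit.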
          gabriel cuadros made changes -
          Attachment New: image-2022-05-05-12-13-17-227.png [ 57979 ]
          gabriel cuadros made changes -
          Attachment New: image-2022-05-05-12-15-37-793.png [ 57980 ]
          gabriel cuadros made changes -
          Attachment New: image-2022-05-05-12-16-17-714.png [ 57981 ]

Assignee: gabriel cuadros (gabocuadros)
Reporter: peng wu (wu105)
Votes: 1
Watchers: 3
