
JENKINS-68409: kubernetes plugin can create a new pod every 10s when something goes wrong

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Major
    • Component: kubernetes-plugin
    • Labels: None
    • Environment: recent Jenkins/plugin running on Linux, Kubernetes on Linux, various versions
    • Released As: 3578.vb_9a_92ea_9845a_

      When a build creates a pod on Kubernetes and Jenkins cannot verify the pod, Jenkins deletes it and creates a replacement every 10s. We are not aware of any configuration parameter that controls this pace. This can overwhelm the Kubernetes cluster with a huge number of pod creations and deletions. We would like the build to fail after a number of failures instead of deleting and creating pods forever; at the very least, we would like each new pod to wait progressively longer, similar to a Kubernetes CrashLoopBackOff.
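      To illustrate the pacing we have in mind, here is a minimal sketch in Java. The class, method names, and numbers are hypothetical, not actual plugin code; it only shows the capped, exponentially backed-off retry we are asking for:

```
import java.util.concurrent.TimeUnit;

public class PodRetryPolicy {
    private static final int MAX_ATTEMPTS = 5;          // fail the build after this many pods
    private static final long BASE_DELAY_SECONDS = 10;  // the plugin's current fixed pace

    public static void main(String[] args) throws InterruptedException {
        for (int attempt = 1; attempt <= MAX_ATTEMPTS; attempt++) {
            if (tryProvisionPod(attempt)) {
                System.out.println("pod is up, build proceeds");
                return;
            }
            // Exponential backoff: 10s, 20s, 40s, ... like a CrashLoopBackOff,
            // instead of a flat 10s between every delete/create cycle.
            long delay = BASE_DELAY_SECONDS << (attempt - 1);
            System.out.printf("pod attempt %d failed, retrying in %ds%n", attempt, delay);
            TimeUnit.SECONDS.sleep(delay);
        }
        // Give up instead of looping forever.
        throw new IllegalStateException("giving up after " + MAX_ATTEMPTS + " pods; failing the build");
    }

    private static boolean tryProvisionPod(int attempt) {
        // Stand-in for "create the pod and verify the agent connected".
        return false;
    }
}
```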

      In production, we had situations where Kubernetes could not report the pod status within the time Jenkins expected, and the resulting flood of pod creations and deletions left each node holding more than 8000 deleted containers while running over the pod count limit, which took hours to clear even with the Jenkins feed turned off - we eventually restored the nodes from backup. Although this bug is not considered the root cause of the slow responses, it caused a "pod storm" that brought the Kubernetes cluster to its knees and required this drastic node restore.

      In testing, we had a situation where the connection to Kubernetes did not support websockets, so Jenkins could not read the pod status via what appears to be a "watch" on the pod, failing on a request path similar to the following in the Kubernetes ingress log: '/api/v1/namespaces/<ns>/pods?<podname>&allowWatchBookmarks=true&watch=true'

      This started the pod creation/deletion loop. In the slightly obfuscated console log attached, the log line "Still waiting to schedule task" appears around the time of the failed watch request in the k8s ingress log shown above, and the build keeps recreating the pod every 10s until it is aborted manually.
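      For reference, a watch like the sketch below exercises the same websocket path and can confirm whether an ingress passes the upgrade through. It is a minimal standalone example using the fabric8 kubernetes-client (the library the plugin is built on), assuming the fabric8 6.x builder API; the namespace and pod name are placeholders:

```
import io.fabric8.kubernetes.api.model.Pod;
import io.fabric8.kubernetes.client.KubernetesClient;
import io.fabric8.kubernetes.client.KubernetesClientBuilder;
import io.fabric8.kubernetes.client.Watcher;
import io.fabric8.kubernetes.client.WatcherException;

public class PodWatchCheck {
    public static void main(String[] args) throws InterruptedException {
        // Connects using the current kubeconfig context, as kubectl would.
        try (KubernetesClient client = new KubernetesClientBuilder().build()) {
            client.pods()
                  .inNamespace("jenkins")          // placeholder namespace
                  .withName("jenkins-agent-test")  // placeholder pod name
                  .watch(new Watcher<Pod>() {
                      @Override
                      public void eventReceived(Action action, Pod pod) {
                          System.out.println(action + ": " + pod.getStatus().getPhase());
                      }

                      @Override
                      public void onClose(WatcherException cause) {
                          // An ingress that rejects the websocket upgrade surfaces here.
                          System.out.println("watch closed: " + cause);
                      }
                  });
            Thread.sleep(60_000); // keep the watch open for a minute
        }
    }
}
```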


          gabriel cuadros added a comment - Wrong link, my bad; the other PR was too old. This is the correct ticket that implemented this; it seems to be a very new feature: https://issues.jenkins.io/browse/JENKINS-66822

          gabriel cuadros added a comment - Hello Peng, please give it a try and upgrade the plugin to this version: https://github.com/jenkinsci/kubernetes-plugin/releases/tag/3578.vb_9a_92ea_9845a_ . According to the release file, the change was made in that specific version, and the only dependency update was to the Kubernetes API client, that is all.

          peng wu added a comment - Hi Gabriel, I upgraded to 3580.v78271e5631dc and indeed it is one and done! You can close this ticket. Many thanks!

          peng wu added a comment -

          Hi Gabriel,

          Can we reopen this? Under a different condition, we still get a new pod every 20s or so. It goes roughly like this: the Jenkins server was under pressure, most likely over its open-file limit, and some builds got into a pod-every-20s loop. When the Jenkins server was restarted (e.g., sudo systemctl restart jenkins), the loop stopped; some 15 minutes later, a new pod started and the build succeeded.

          The build log shows this on the first pod only:

          ```
          Still waiting to schedule task
          ‘jenkins-agent-6c797bae-cd98-48-3vpsb-3p8m8’ is offline
          ```

          Then it goes on to create a new pod every 20 to 30s and dutifully delete it. The loop stops when Jenkins is restarted. Some 15 minutes later, a new pod is created and finishes the build successfully, starting with the following log lines:

          ```
          Ready to run at Mon Jul 04 10:35:12 EDT 2022
          Resuming build at Mon Jul 04 10:35:12 EDT 2022 after Jenkins restart
          ```

          I will attach a consoleText log from such a build, and the ingress logs from the Kubernetes cluster to provide timing and detail. In this particular sample, the ingress recorded two more pods than Jenkins did, and two pods were re-created using exactly the same name.

          The best solution would be to add progressively longer delays before each new pod. If the build is canceled, Jenkins will likely retry with a new build and repeat the cycle, still adding fuel to the fire.

          peng wu added a comment - jenkins-68409-reopen.pub.txt

          gabriel cuadros added a comment - Hello, good morning. I just got back; let me check it and I will get back to you with feedback, thanks!

          Marcin Cieślak added a comment - Could this be the same as JENKINS-47615?

          gabriel cuadros added a comment - wu105, is your Jenkins master running outside of Kubernetes by any chance? I could not find the scenario you described; I am moving over to JENKINS-47615 to see if it is related. Thanks saper.

          gabriel cuadros added a comment (edited) -

          Hello wu105, the error you described is a completely different error. Checking the source code, I see that this line appears when the agent cannot connect to the Jenkins master for some reason: JNLP cannot connect to the Jenkins master to report the agent's status, so the master thinks the pod never came up, destroys it, and creates more and more. This is not a Kubernetes status that can be reported to Jenkins like those mentioned before in the solution ticket (CreateContainerError, ImagePullBackOff, etc.): https://github.com/jenkinsci/kubernetes-plugin/pull/1118

          ```
          Still waiting to schedule task
          jenkins-agent-6c797bae-cd98-48-3vpsb-3p8m8 is offline
          ```

          gabriel cuadros added a comment - So currently there are three ways to get Jenkins to generate pods at crazy rates:

          1. Creating a good pod template with a bad image (error ImagePullBackOff): fixed with kubernetes plugin versions higher than 3578.vb_9a_92ea_9845a_ (a reproduction sketch for this case follows the list).

          2. Creating a bad pod template with a good image (error CreateContainerError): fixed with kubernetes plugin versions higher than 3578.vb_9a_92ea_9845a_.

          3. Creating a good pod template with a good image, but JNLP does not connect to the Jenkins master (solution still being investigated).
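          For the first case, a minimal standalone reproduction sketch using the fabric8 client, outside Jenkins. This assumes the fabric8 6.x-style API; the namespace and image are placeholders, and the image is intentionally unpullable so the pod lands in ImagePullBackOff:

```
import io.fabric8.kubernetes.api.model.Pod;
import io.fabric8.kubernetes.api.model.PodBuilder;
import io.fabric8.kubernetes.client.KubernetesClient;
import io.fabric8.kubernetes.client.KubernetesClientBuilder;

public class ImagePullRepro {
    public static void main(String[] args) {
        try (KubernetesClient client = new KubernetesClientBuilder().build()) {
            // A well-formed pod spec pointing at an image that cannot be pulled.
            Pod pod = new PodBuilder()
                    .withNewMetadata().withGenerateName("jnlp-repro-").endMetadata()
                    .withNewSpec()
                        .addNewContainer()
                            .withName("jnlp")
                            .withImage("example.invalid/no-such-image:latest") // intentionally bad
                        .endContainer()
                        .withRestartPolicy("Never")
                    .endSpec()
                    .build();
            client.pods().inNamespace("jenkins").resource(pod).create(); // placeholder namespace
        }
    }
}
```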


            Assignee: gabriel cuadros (gabocuadros)
            Reporter: peng wu (wu105)
            Votes: 1
            Watchers: 3
