[JENKINS-47144] Kubernetes pod slaves that never start successfully never get cleaned up

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Major
    • Component: kubernetes-plugin
    • Labels: None
    • Environment: Jenkins 2.73.1, kubernetes-plugin 1.0

      If I define a pod template with an invalid command, so that the container in the pod never becomes ready, then I see the following issues:

      1. The job never times out and provisioning doesn't seem to time out. It spawns pods that continue to fail up to the instance cap.
      2. When I cancel the job, it gets stuck and throws exceptions because the agent is offline; I continuously see termination exceptions.
      3. Eventually forcing the job to cancel works; the agent is removed from Jenkins, but the pod is still left around.
      4. The leftover pod never gets deleted, even with the container cleanup timeout specified.

      I see errors like this in the logs:
      https://gist.github.com/chancez/27c6afdaaff3e91aa82dfe03055273dd

      I'm also seeing logs like `Failed to delete pod for agent jenkins/test-tmp-drvtq: not found` occasionally right after a build finishes, when the pod exists but isn't deleted.

      https://gist.github.com/chancez/4d65118c11af054860f22df76364fa31 is an example of a regular pipeline Jenkinsfile which I created to reproduce this issue.
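      The shape of such a reproducer (a hypothetical sketch, not the contents of the gist above; the label, image and command here are made up) is a pod template whose container command can never start:

```groovy
// Hypothetical sketch: the container command does not exist, so the container
// never becomes ready and the Jenkins agent inside the pod never connects.
podTemplate(label: 'broken-pod', containers: [
    containerTemplate(
        name: 'busybox',
        image: 'busybox',
        command: '/no/such/binary',   // invalid command: the container can never start
        ttyEnabled: true
    )
]) {
    node('broken-pod') {
        sh 'echo this step is never reached'
    }
}
```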


          Chance Zibolski created issue -

          Carlos Sanchez added a comment -

          That is intended, so it doesn't keep spawning pods and allows inspecting the errors.
          There is another JIRA for the "Failed to delete pod" errors.

          Chance Zibolski added a comment -

          Also seeing this on master.

          This really breaks some use cases for us, because we're not giving people full access to the namespace if we can avoid it (they indirectly have access via Jenkins, but that's about it). So the moment they trigger a few bad builds, their instanceCap fills up and their jobs can no longer run until someone manually deletes the pods in the cluster.

          Our podTemplates aren't getting cleaned up either, so unless we change the label on the pod, the next provision often re-uses the broken podTemplate.

          I'm looking at the other JIRA issues and they seem similar. I noticed a PR was merged recently that looks related, but it doesn't seem to help when using a custom build from the current master.

          Chance Zibolski added a comment -

          I'm also curious why the agent immediately goes offline in Jenkins if its container isn't the one failing within the pod. It seems to get connected and immediately terminated over and over. This also seems related to why the running (failing) job can't be cancelled or killed.

          Alex Pliev added a comment -

          We are also affected by this issue; it would be great to have some retry limit and to delete the created pods once it is reached.


          Carlos Sanchez added a comment -

          OK, so there are some things that can be done in the plugin and some that cannot, because they happen in Jenkins core. Let's work on a proposal.

          1. The job never times out and provisioning doesn't seem to time out. It spawns pods that continue to fail up to the instance cap.

          Jobs stay running waiting for an agent to come up, and that seems the right thing to do. We could make the provisioner look at the last pod spawned for an agent template and not launch new ones if the last one errored.

          2. When I cancel the job, it gets stuck and throws exceptions because the agent is offline; I continuously see termination exceptions.

          If you mean ClosedChannelException, I don't think there's anything that can be done here.

          3. Eventually forcing the job to cancel works; the agent is removed from Jenkins, but the pod is still left around.

          Pods in error state are left for inspection. The kubernetes-plugin is currently only called when provisioning, so it may need a service that periodically cleans up (a rough sketch of such a cleanup follows below).

          4. The leftover pod never gets deleted, even with the container cleanup timeout specified.

          If you mean the Kubernetes cleanup, I guess it won't delete it until the pod is deleted.
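          To make the cleanup idea in point 3 concrete, it could look roughly like the following Groovy sketch, run from the Jenkins script console or wrapped in a periodic task. This is only an illustration, not the plugin's implementation: it assumes the fabric8 client returned by the cloud's connect() method, the plugin's default jenkins=slave pod label, and that broken agent pods end up in the Failed or Unknown phase.

```groovy
// Rough sketch: list agent pods by the plugin's default label and delete the
// ones stuck in an error phase. Names, labels and phases here are assumptions.
import jenkins.model.Jenkins
import org.csanchez.jenkins.plugins.kubernetes.KubernetesCloud

def cloud = Jenkins.instance.clouds.find { it instanceof KubernetesCloud }
def client = cloud.connect()                  // fabric8 KubernetesClient
def ns = cloud.namespace ?: 'jenkins'         // assumes 'jenkins' as a fallback namespace

client.pods().inNamespace(ns).withLabel('jenkins', 'slave').list().items.each { pod ->
    def phase = pod.status?.phase
    if (phase == 'Failed' || phase == 'Unknown') {
        println "Deleting pod ${pod.metadata.name} (phase: ${phase})"
        client.pods().inNamespace(ns).withName(pod.metadata.name).delete()
    }
}
```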
          Karl-Philipp Richter added a comment - edited

          Possible duplicate of https://issues.jenkins-ci.org/browse/JENKINS-54540

          Abdennour Toumi added a comment -

          Yes, I confirm, I had the same issue. I am cleaning them up manually now with `k -n jenkins delete pod -ljenkins=slave`.

            Assignee: Unassigned
            Reporter: Chance Zibolski (chancez)
            Votes: 7
            Watchers: 14