[JENKINS-47144] Kubernetes pod slaves that never start successfully never get cleaned up

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Major
    • Component: kubernetes-plugin
    • Labels: None
    • Environment: Jenkins 2.73.1, kubernetes-plugin 1.0

      If I define a pod template with an invalid command, so that the container in the pod never becomes ready, then I see the following issues:

      1. The job never times out and provisioning doesn't seem to time out. It spawns pods that continue to fail up to the instance cap.
      2. When I cancel the job, it gets stuck and throws exceptions because the agent is offline; I continuously see termination exceptions.
      3. Eventually forcing the job to cancel works; the agent is removed from Jenkins, but the pod is still left around.
      4. The leftover pod never gets deleted, even with the container cleanup timeout specified.

      I see errors like this in the logs:
      https://gist.github.com/chancez/27c6afdaaff3e91aa82dfe03055273dd

      I'm also seeing logs like `Failed to delete pod for agent jenkins/test-tmp-drvtq: not found` occasionally right after a build finishes, when the pod exists but isn't deleted.

      https://gist.github.com/chancez/4d65118c11af054860f22df76364fa31 is an example of a regular pipeline Jenkinsfile which I created to reproduce this issue.
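      The shape of such a reproducer (a hypothetical sketch, not the contents of the gist above; the label, image and command here are made up) is a pod template whose container command can never start:

```groovy
// Hypothetical sketch: the container command does not exist, so the container
// never becomes ready and the Jenkins agent inside the pod never connects.
podTemplate(label: 'broken-pod', containers: [
    containerTemplate(
        name: 'busybox',
        image: 'busybox',
        command: '/no/such/binary',   // invalid command: the container can never start
        ttyEnabled: true
    )
]) {
    node('broken-pod') {
        sh 'echo this step is never reached'
    }
}
```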


          Chance Zibolski created issue -

          Carlos Sanchez added a comment -

          That is intended, so it doesn't keep spawning pods and allows inspecting the errors.
          There is another JIRA for the "Failed to delete pod" errors.

          Chance Zibolski added a comment -

          Also seeing this on master.

          This really breaks some use cases for us, because we're not giving people full access to the namespace if we can avoid it (they indirectly have access via Jenkins, but that's about it). So the moment they trigger a few bad builds, their instanceCap fills up and their jobs can no longer run until someone manually deletes the pods in the cluster.

          Our podTemplates aren't getting cleaned up either, so unless we change the label on the pod, the next provision often re-uses the broken podTemplate.

          I'm looking at the other JIRA issues and they seem similar. I noticed a PR was merged recently that looks related, but it doesn't seem to help when using a custom build from the current master.

          Chance Zibolski added a comment -

          I'm also curious why the agent immediately goes offline in Jenkins if its container isn't the one failing within the pod. It seems to get connected and immediately terminated over and over. This also seems related to why the running (failing) job can't be cancelled or killed.

          Alex Pliev added a comment -

          We are also affected by this issue; it would be great to have some retry limit and to delete the created pods once it is reached.


          Carlos Sanchez added a comment -

          OK, so there are some things that can be done in the plugin and some that cannot, because they happen in Jenkins core. Let's work on a proposal.

          1. The job never times out and provisioning doesn't seem to time out. It spawns pods that continue to fail up to the instance cap.

          Jobs stay running waiting for an agent to come up, and that seems the right thing to do. We could make the provisioner look at the last pod spawned for an agent template and not launch new ones if the last one errored.

          2. When I cancel the job, it gets stuck and throws exceptions because the agent is offline; I continuously see termination exceptions.

          If you mean ClosedChannelException, I don't think there's anything that can be done here.

          3. Eventually forcing the job to cancel works; the agent is removed from Jenkins, but the pod is still left around.

          Pods in error state are left for inspection. The kubernetes-plugin is currently only called when provisioning, so it may need a service that periodically cleans up (a rough sketch of such a cleanup follows below).

          4. The leftover pod never gets deleted, even with the container cleanup timeout specified.

          If you mean the Kubernetes cleanup, I guess it won't delete it until the pod is deleted.
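          To make the cleanup idea in point 3 concrete, it could look roughly like the following Groovy sketch, run from the Jenkins script console or wrapped in a periodic task. This is only an illustration, not the plugin's implementation: it assumes the fabric8 client returned by the cloud's connect() method, the plugin's default jenkins=slave pod label, and that broken agent pods end up in the Failed or Unknown phase.

```groovy
// Rough sketch: list agent pods by the plugin's default label and delete the
// ones stuck in an error phase. Names, labels and phases here are assumptions.
import jenkins.model.Jenkins
import org.csanchez.jenkins.plugins.kubernetes.KubernetesCloud

def cloud = Jenkins.instance.clouds.find { it instanceof KubernetesCloud }
def client = cloud.connect()                  // fabric8 KubernetesClient
def ns = cloud.namespace ?: 'jenkins'         // assumes 'jenkins' as a fallback namespace

client.pods().inNamespace(ns).withLabel('jenkins', 'slave').list().items.each { pod ->
    def phase = pod.status?.phase
    if (phase == 'Failed' || phase == 'Unknown') {
        println "Deleting pod ${pod.metadata.name} (phase: ${phase})"
        client.pods().inNamespace(ns).withName(pod.metadata.name).delete()
    }
}
```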
          Karl-Philipp Richter added a comment - edited

          Possible duplicate of https://issues.jenkins-ci.org/browse/JENKINS-54540

          Abdennour Toumi added a comment -

          Yes, I confirm, I had the same issue. I am cleaning them up manually now with `k -n jenkins delete pod -ljenkins=slave`.

            Assignee: Unassigned
            Reporter: Chance Zibolski (chancez)
            Votes: 7
            Watchers: 14