
[JENKINS-47144] Kubernetes pod slaves that never start successfully never get cleaned up

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Major
    • Component: kubernetes-plugin
    • Labels: None
    • Environment: Jenkins 2.73.1, kubernetes-plugin 1.0

      If I define a pod template with an invalid command, so that the container in the pod never becomes ready, I see the following issues:

      1. The job never times out and provisioning doesn't seem to time out either. It keeps spawning pods that continue to fail, up to the instance cap.
      2. When I cancel the job, it gets stuck and throws exceptions because the agent is offline, and I keep getting termination exceptions.
      3. Forcing the job to cancel eventually works and the agent is removed from Jenkins, but the pod is still left around.
      4. The leftover pod never gets deleted, even with a container cleanup timeout specified.

      I see errors like this in the logs:
      https://gist.github.com/chancez/27c6afdaaff3e91aa82dfe03055273dd

      I'm also occasionally seeing logs like `Failed to delete pod for agent jenkins/test-tmp-drvtq: not found` right after a build finishes, while the pod exists but isn't deleted.

      https://gist.github.com/chancez/4d65118c11af054860f22df76364fa31 is an example of a regular pipeline Jenkinsfile that I created to reproduce this issue.
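
      A minimal sketch of such a reproducer (an illustration only, not the exact contents of the gist; it assumes the scripted podTemplate/containerTemplate syntax of kubernetes-plugin 1.x):

          // Pod template whose container command does not exist in the image,
          // so the container never becomes ready and the agent never connects.
          podTemplate(
              label: 'broken-pod',
              containers: [
                  containerTemplate(
                      name: 'broken',
                      image: 'alpine:3.6',        // any small image works
                      command: '/no/such/binary', // invalid command, pod never becomes ready
                      ttyEnabled: true
                  )
              ]
          ) {
              node('broken-pod') {
                  stage('never runs') {
                      sh 'echo this step is never reached'
                  }
              }
          }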


          Carlos Sanchez added a comment -

          That is intended so it doesn't keep spawning pods, and allows inspecting the errors.
          There is another JIRA for the "Failed to delete pod" errors.

          Chance Zibolski added a comment -

          Also seeing this on master.

          This really breaks some use cases for us, because we avoid giving people full access to the namespace (they indirectly have access via Jenkins, but that's about it). So the moment they trigger a few bad builds, their instanceCap fills up and their jobs can no longer run until someone manually deletes the pods in the cluster.

          Our podTemplates aren't getting cleaned up either, so unless we change the label on the pod, the next provision often re-uses the broken podTemplate.

          I'm looking at the other JIRA issues and they seem similar. I noticed a recently merged PR that looks related, but it doesn't seem to help when using a custom build from current master.

          Chance Zibolski added a comment -

          I'm also curious why the agent immediately goes offline in Jenkins when its container isn't the one failing within the pod. It seems to connect and immediately get terminated, over and over. This also seems related to why the running (failing) job can't be cancelled/killed.

          Alex Pliev added a comment -

          We are also affected by this issue. It would be great to have some kind of retry limit, and to delete the created pods once it is reached.

          Carlos Sanchez added a comment -

          OK, so there are some things that can be done in the plugin and some that cannot, because they happen in Jenkins core. Let's work on a proposal.

          1. The job never times out and provisioning doesn't seem to time out either. It keeps spawning pods that continue to fail, up to the instance cap.

          Jobs stay running while waiting for an agent to come up, which seems like the right thing to do. We could make the provisioner look at the last spawned pod for an agent template and not launch new ones if the last one errored.

          2. When I cancel the job, it gets stuck and throws exceptions because the agent is offline, and I keep getting termination exceptions.

          If you mean ClosedChannelException, I don't think there's anything that can be done here.

          3. Forcing the job to cancel eventually works and the agent is removed from Jenkins, but the pod is still left around.

          Pods in an error state are left around for inspection. The kubernetes-plugin is currently only called when provisioning, so this may need a service that periodically cleans up (a rough sketch follows below).

          4. The leftover pod never gets deleted, even with a container cleanup timeout specified.

          If you mean the Kubernetes-side cleanup, I guess it won't kick in until the pod itself is deleted.
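
          A rough sketch of what such a periodic cleanup could look like, for example run from the script console (assumptions, not current plugin behavior: a single cloud named 'kubernetes', agent pods labelled jenkins=slave, and the plugin's connect() helper returning a fabric8 KubernetesClient):

              import jenkins.model.Jenkins
              import org.csanchez.jenkins.plugins.kubernetes.KubernetesCloud

              // Assumed names: adjust cloud name, namespace and label to your setup.
              def cloud  = (KubernetesCloud) Jenkins.instance.getCloud('kubernetes')
              def client = cloud.connect()
              def ns     = 'jenkins'

              client.pods().inNamespace(ns).withLabel('jenkins', 'slave').list().items.each { pod ->
                  def phase = pod.status?.phase
                  // Only touch pods that ended in a terminal error state and will never connect.
                  if (phase == 'Failed' || phase == 'Unknown') {
                      println "Deleting leftover agent pod ${pod.metadata.name} (phase: ${phase})"
                      client.pods().inNamespace(ns).withName(pod.metadata.name).delete()
                  }
              }

          The same logic could instead live in a PeriodicWork extension inside the plugin, so leftover pods get reaped on a schedule rather than manually.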


          Karl-Philipp Richter added a comment - edited

          Possible duplicate of https://issues.jenkins-ci.org/browse/JENKINS-54540

          Abdennour Toumi added a comment -

          Yes, I can confirm I had the same issue. For now I am cleaning the pods up manually with `kubectl -n jenkins delete pod -l jenkins=slave`.

          Björn added a comment - edited

          For us this happened tonight. The agent inside the pod failed with an exception:

          INFO: [JNLP4-connect connection to jenkins-agent/10.32.0.233:50000] Local headers refused by remote: Unknown client name: ephemeral-tools-namespace-expiration-908-7x68s--bv4pb
          Jun 16, 2021 2:42:03 AM hudson.remoting.jnlp.Main$CuiListener status
          INFO: Protocol JNLP4-connect encountered an unexpected exception
          java.util.concurrent.ExecutionException: org.jenkinsci.remoting.protocol.impl.ConnectionRefusalException: Unknown client name: ephemeral-tools-namespace-expiration-908-7x68s--bv4pb
          	at org.jenkinsci.remoting.util.SettableFuture.get(SettableFuture.java:223)
          	at hudson.remoting.Engine.innerRun(Engine.java:743)
          	at hudson.remoting.Engine.run(Engine.java:518)
          Caused by: org.jenkinsci.remoting.protocol.impl.ConnectionRefusalException: Unknown client name: ephemeral-tools-namespace-expiration-908-7x68s--bv4pb
          	at org.jenkinsci.remoting.protocol.impl.ConnectionHeadersFilterLayer.newAbortCause(ConnectionHeadersFilterLayer.java:378)
          	at org.jenkinsci.remoting.protocol.impl.ConnectionHeadersFilterLayer.onRecvClosed(ConnectionHeadersFilterLayer.java:433)
          	at org.jenkinsci.remoting.protocol.ProtocolStack$Ptr.onRecvClosed(ProtocolStack.java:816)
          	at org.jenkinsci.remoting.protocol.FilterLayer.onRecvClosed(FilterLayer.java:287)
          	at org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer.onRecvClosed(SSLEngineFilterLayer.java:172)
          	at org.jenkinsci.remoting.protocol.ProtocolStack$Ptr.onRecvClosed(ProtocolStack.java:816)
          	at org.jenkinsci.remoting.protocol.NetworkLayer.onRecvClosed(NetworkLayer.java:154)
          	at org.jenkinsci.remoting.protocol.impl.BIONetworkLayer.access$1500(BIONetworkLayer.java:48)
          	at org.jenkinsci.remoting.protocol.impl.BIONetworkLayer$Reader.run(BIONetworkLayer.java:247)
          	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
          	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
          	at hudson.remoting.Engine$1.lambda$newThread$0(Engine.java:117)
          	at java.lang.Thread.run(Thread.java:748)
          	Suppressed: java.nio.channels.ClosedChannelException

          java.lang.Exception: The server rejected the connection: None of the protocols were accepted

          After this, Jenkins scheduled a new pod again and again. We discovered it after around 5 hours, by which point we had roughly 6,000 pods in Error, Pending, etc. states. Only an etcd disaster recovery to an older state got us out of it. Please also see the attached picture (it's not the same run, but it's exactly the same behavior).

           

          It would be great if there were a way to specify a maximum number of retries for such a pod-creation action. Our Kubernetes configuration from JCasC is attached as well.

           

           


            Assignee: Unassigned
            Reporter: Chance Zibolski (chancez)
            Votes: 7
            Watchers: 14