Jenkins / JENKINS-71796

Sometimes the Kubernetes plugin just stops creating agents.

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Blocker
    • Component: kubernetes-plugin
    • Labels: None
    • Environment: Kubernetes 1.23, Ubuntu 20.04, Jenkins 2.401.2

      Sometimes we notice that the kubernetes plugin stops creating agents.

      We regularly delete old agents that have been running for a while. When new jobs start, the plugin creates new agents up to the max limit. But for some reason this creation sometimes stops and we are stuck with a limited number of agents.

          [JENKINS-71796] Sometimes the Kubernetes plugin just stops creating agents.

          Lars Berntzon created issue -

          Ritchelle added a comment -

          We are experiencing a similar issue, with the environment:

          • Kubernetes 1.24/1.25
          • Jenkins version 2.414.1
          • kubernetes plugin version 4029.v5712230ccb_f8
          • kubernetes client API plugin version 6.8.1-224.vd388fca_4db_3b_

          Another thing to note is that there are no relevant controller logs so far. Therefore, the issue is hard to debug and predict.

          Dion added a comment - - edited

          Also experiencing this issue. 

          • Kubernetes 1.23/1.24
          • Jenkins version 2.414.1
          • Kubernetes plugin 4007.v633279962016
          • Kubernetes Client API 6.4.1-215.v2ed17097a_8e9

          I'd like to provide some more details from my troubleshooting:

          Occasionally, Jenkins will time out after attempting to create the pod, and the console log looks like:

          12:20:56  Still waiting to schedule task
          12:20:56  All nodes of label ‘REDACTED_278-w4h5t’ are offline
          12:35:34  Created Pod: ns-dev build-ns-dev/REDACTED-278-32670
          

          If I check the Kubernetes event logs or the kube API for the pod in question, I don't see it at all, as if the request never hit the API. You can see that the suffix on the created pod is different from the one it was initially waiting for.
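
          For reference, a minimal fabric8 sketch (not the plugin's own code; `namespace` and `podName` are placeholders taken from the console log above) of the kind of API lookup described here, checking whether the pod ever reached the cluster and what events it produced:

          import io.fabric8.kubernetes.api.model.Pod;
          import io.fabric8.kubernetes.client.KubernetesClient;
          import io.fabric8.kubernetes.client.KubernetesClientBuilder;

          public class PodLookup {
              public static void main(String[] args) {
                  String namespace = "build-ns-dev";       // placeholder
                  String podName = "REDACTED-278-32670";   // pod name from the console log above
                  try (KubernetesClient client = new KubernetesClientBuilder().build()) {
                      // Does the pod exist at all? If not, the create request may never have hit the API.
                      Pod pod = client.pods().inNamespace(namespace).withName(podName).get();
                      System.out.println(pod == null ? "Pod not found" : "Pod phase: " + pod.getStatus().getPhase());
                      // Any events recorded for the pod (pull errors, scheduling problems, etc.)
                      client.v1().events().inNamespace(namespace)
                            .withField("involvedObject.name", podName)
                            .list().getItems()
                            .forEach(e -> System.out.println(e.getReason() + ": " + e.getMessage()));
                  }
              }
          }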

          If I check the Jenkins system logs and search for the string `278-w4h5t`, I find nothing. I'm now attempting to capture this with additional logging enabled (org.csanchez.jenkins.plugins.kubernetes).

          After about 15 minutes of waiting, Jenkins eventually times out and tries again, usually successfully, though sometimes this issue occurs back-to-back, resulting in a 30+ minute delay.

          12:35:34  Created Pod: ns-dev build-ns-dev/REDACTED-278-32670
          12:35:42  Agent REDACTED-278-32670 is provisioned from template REDACTED_278-w4h5t-wq4p4
          12:35:42  ---
          12:35:42  apiVersion: "v1"
          12:35:42  kind: "Pod"
          12:35:42  metadata:
          12:35:42    annotations:
          12:35:42      iam.amazonaws.com/role: "arn:aws:iam::REDACTED:role/REDACTED"
          12:35:42      buildUrl: "https://REDACTED/job/REDACTED/job/REDACTED/job/REDACTED/278/"
          12:35:42      runUrl: "job/REDACTED/job/REDACTED/job/REDACTED/278/"
          12:35:42      jobPattern: "(^job/REDACTED/.*/(.*)/[0-9]+)"
          12:35:42    labels:
          12:35:42      jenkins: "slave"
          12:35:42      jenkins/label-digest: "REDACTED"
          12:35:42      jenkins/label: "REDACTED_278-w4h5t"
          12:35:42    name: "REDACTED-278-32670"
          12:35:42    namespace: "build-ns-dev"
          12:35:42  spec:
          12:35:42    containers:
          12:35:42    - image: "REDACTED"
          12:35:42      name: "REDACTED"
          12:35:42      resources:
          12:35:42        requests:
          12:35:42          cpu: "1"
          12:35:42          memory: "4Gi"
          12:35:42      tty: true
          12:35:42      volumeMounts:
          12:35:42      - mountPath: "/mnt/gpg-certificate"
          12:35:42        name: "gpg-certificate"
          12:35:42      - mountPath: "/mnt/gpg-certificate-phrase"
          12:35:42        name: "gpg-certificate-phrase"
          12:35:42      - mountPath: "/home/jenkins/agent"
          12:35:42        name: "workspace-volume"
          12:35:42        readOnly: false
          12:35:42    - env:
          12:35:42      - name: "JENKINS_SECRET"
          12:35:42        value: "********"
          12:35:42      - name: "JENKINS_AGENT_NAME"
          12:35:42        value: "REDACTED-278-32670"
          12:35:42      - name: "JENKINS_WEB_SOCKET"
          12:35:42        value: "true"
          12:35:42      - name: "JENKINS_NAME"
          12:35:42        value: "REDACTED-278-32670"
          12:35:42      - name: "JENKINS_AGENT_WORKDIR"
          12:35:42        value: "/home/jenkins/agent"
          12:35:42      - name: "JENKINS_URL"
          12:35:42        value: "https://REDACTED/"
          12:35:42      image: "jenkins/inbound-agent:3142.vcfca_0cd92128-1"
          12:35:42      name: "jnlp"
          12:35:42      resources:
          12:35:42        requests:
          12:35:42          memory: "256Mi"
          12:35:42          cpu: "100m"
          12:35:42      volumeMounts:
          12:35:42      - mountPath: "/home/jenkins/agent"
          12:35:42        name: "workspace-volume"
          12:35:42        readOnly: false
          12:35:42    nodeSelector:
          12:35:42      kubernetes.io/os: "linux"
          12:35:42    restartPolicy: "Never"
          12:35:42    serviceAccountName: "am"
          12:35:42    volumes:
          12:35:42    - name: "gpg-certificate"
          12:35:42      secret:
          12:35:42        secretName: "REDACTED"
          12:35:42    - name: "gpg-certificate-phrase"
          12:35:42      secret:
          12:35:42        secretName: "REDACTED"
          12:35:42    - emptyDir:
          12:35:42        medium: ""
          12:35:42      name: "workspace-volume"
          12:35:42  
          12:35:43  Running on REDACTED-278-32670 in /home/jenkins/agent/workspace/REDACTED
          

          I can then see the second call to generate the pod was successful in the event logs.


          Sigi Kiermayer added a comment -

          We have not found any reasonable log output telling us why Jenkins was unable to create a pod, and we also don't see anything in the API itself.

          Even with the org.csanchez.jenkins.plugins.kubernetes logger enabled.

          We should look into the plugin to see what could block it from doing or logging anything. There might be something we just don't see.

          Or, dionj, did you see the timeout or anything in your logs?


          Dion added a comment - - edited

          siegfried, I've had a difficult time gathering logs. I've added the kubernetes logger, but the logs aren't retained for very long, so by the time I see the failure, the logs from the provisioning have already rotated out.

          Anecdotally, I was watching on Friday when it was occurring: the Jenkins UI had quite a lot of builds queued up and the Kubernetes logs were very quiet. No relevant logs that I could find. After some time it suddenly created all of the agents without issue (except for the ones that had timed out before).


          Sigi Kiermayer added a comment -

          dionj there has to be a code path to which we could add logging. I'm not sure yet how easy it will be to contribute.

          There is also the risk that it is correlated with two code locations: either the one where Jenkins executes the provisioning plugin (kubernetes), or the kubernetes plugin itself.


          Dion added a comment - - edited

          Turns out the logs are all saved on disk, but the UI only shows a small portion of them. Managed to capture this in action with the following loggers set to `ALL`:

          • org.csanchez.jenkins.plugins.kubernetes
          • hudson.slaves.NodeProvisioner
          • hudson.slaves.AbstractCloudSlave
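
          Setting those three loggers to ALL is roughly equivalent to the following java.util.logging calls (a sketch only; Jenkins normally configures these through a custom log recorder under Manage Jenkins -> System Log):

          import java.util.logging.Level;
          import java.util.logging.Logger;

          public class DebugLoggers {
              public static void main(String[] args) {
                  // Same three loggers as listed above, cranked to ALL.
                  for (String name : new String[] {
                          "org.csanchez.jenkins.plugins.kubernetes",
                          "hudson.slaves.NodeProvisioner",
                          "hudson.slaves.AbstractCloudSlave"}) {
                      Logger.getLogger(name).setLevel(Level.ALL);
                  }
              }
          }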

          When I logged in this morning, I saw that there was a backup in the build queue. Investigating the jobs in the queue, I happened across one job that had been timing out and retrying every 15 minutes for over 4 hours. During those 15 minutes the build queue grows and no other pods are scheduled. Once it times out, all of the queued items get executed and it goes back to queuing for another 15 minutes.

          Cranked the logs and collected the full 15 minutes following one of the pods: `REDACTED-155-rh-gt9vs`

          It would be difficult to sanitize these logs, but I didn't see much of value, to be honest. They looked normal, and when I went digging through my Kubernetes event logs I could find the pod, but it was quite short-lived.

          Console log looping through these logs as it continues to retry:

          05:54:20  Created Pod: REDACTED
          06:11:00  ERROR: Failed to launch REDACTED
          06:11:00  io.fabric8.kubernetes.client.KubernetesClientTimeoutException: Timed out waiting for [1000000] milliseconds for [Pod] with name:[REDACTED] in namespace [REDACTED].
          06:11:00  	at io.fabric8.kubernetes.client.dsl.internal.BaseOperation.waitUntilCondition(BaseOperation.java:896)
          06:11:00  	at io.fabric8.kubernetes.client.dsl.internal.BaseOperation.waitUntilReady(BaseOperation.java:878)
          06:11:00  	at io.fabric8.kubernetes.client.dsl.internal.BaseOperation.waitUntilReady(BaseOperation.java:93)
          06:11:00  	at org.csanchez.jenkins.plugins.kubernetes.KubernetesLauncher.launch(KubernetesLauncher.java:169)
          06:11:00  	at hudson.slaves.SlaveComputer.lambda$_connect$0(SlaveComputer.java:297)
          06:11:00  	at jenkins.util.ContextResettingExecutorService$2.call(ContextResettingExecutorService.java:46)
          06:11:00  	at jenkins.security.ImpersonatingExecutorService$2.call(ImpersonatingExecutorService.java:80)
          06:11:00  	at java.base/java.util.concurrent.FutureTask.run(Unknown Source)
          06:11:00  	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
          06:11:00  	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
          06:11:00  	at java.base/java.lang.Thread.run(Unknown Source)
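
          For context, the two console timestamps above are exactly 1,000 seconds apart, matching the [1000000] millisecond timeout in the exception. The stack trace corresponds to the fabric8 waitUntilReady call made while launching the agent; a minimal sketch of that wait pattern (placeholder names, not the plugin's actual code):

          import java.util.concurrent.TimeUnit;
          import io.fabric8.kubernetes.client.KubernetesClient;
          import io.fabric8.kubernetes.client.KubernetesClientBuilder;
          import io.fabric8.kubernetes.client.KubernetesClientTimeoutException;

          public class WaitForAgentPod {
              public static void main(String[] args) {
                  String namespace = "REDACTED";              // placeholder
                  String podName = "REDACTED-155-rh-gt9vs";   // pod followed in the logs above
                  try (KubernetesClient client = new KubernetesClientBuilder().build()) {
                      try {
                          // Block until the pod is Ready or the timeout expires; a pod stuck in a
                          // Waiting state never becomes Ready, so in that case this only returns via the exception.
                          client.pods().inNamespace(namespace).withName(podName)
                                .waitUntilReady(1_000_000, TimeUnit.MILLISECONDS);
                      } catch (KubernetesClientTimeoutException e) {
                          System.err.println("Timed out waiting for pod: " + e.getMessage());
                      }
                  }
              }
          }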

          Found this in the Jenkins system log...

          2023-10-10 10:00:42.387+0000 [id=167889] INFO o.c.j.p.k.KubernetesSlave#_terminate: Terminating Kubernetes instance for agent REDACTED
          2023-10-10 10:00:42.402+0000 [id=167962] INFO o.internal.platform.Platform#log: --> DELETE https://172.20.0.1/api/v1/namespaces/REDACTED/pods/REDACTED h2
          2023-10-10 10:00:42.402+0000 [id=167962] INFO o.internal.platform.Platform#log: Authorization: Bearer REDACTED
          2023-10-10 10:00:42.402+0000 [id=167962] INFO o.internal.platform.Platform#log: User-Agent: fabric8-kubernetes-client/6.4.1
          2023-10-10 10:00:42.402+0000 [id=167962] INFO o.internal.platform.Platform#log: Content-Type: application/json; charset=utf-8
          2023-10-10 10:00:42.402+0000 [id=167962] INFO o.internal.platform.Platform#log: Content-Length: 75
          2023-10-10 10:00:42.403+0000 [id=167962] INFO o.internal.platform.Platform#log: Host: 172.20.0.1
          2023-10-10 10:00:42.403+0000 [id=167962] INFO o.internal.platform.Platform#log: Connection: Keep-Alive
          2023-10-10 10:00:42.403+0000 [id=167962] INFO o.internal.platform.Platform#log: Accept-Encoding: gzip
          2023-10-10 10:00:42.403+0000 [id=167962] INFO o.internal.platform.Platform#log:
          2023-10-10 10:00:42.403+0000 [id=167962] INFO o.internal.platform.Platform#log: {"apiVersion":"v1","kind":"DeleteOptions","propagationPolicy":"Background"}
          2023-10-10 10:00:42.403+0000 [id=167962] INFO o.internal.platform.Platform#log: --> END DELETE (75-byte body)
          2023-10-10 10:00:42.457+0000 [id=167962] INFO o.internal.platform.Platform#log: <-- 200 https://172.20.0.1/api/v1/namespaces/REDACTED/pods/REDACTED (54ms)
          2023-10-10 10:00:42.457+0000 [id=167962] INFO o.internal.platform.Platform#log: audit-id: acf7caf0-7f7d-44be-951f-fbf599cbde5c
          2023-10-10 10:00:42.457+0000 [id=167962] INFO o.internal.platform.Platform#log: cache-control: no-cache, private
          2023-10-10 10:00:42.457+0000 [id=167962] INFO o.internal.platform.Platform#log: content-type: application/json
          2023-10-10 10:00:42.457+0000 [id=167962] INFO o.internal.platform.Platform#log: x-kubernetes-pf-flowschema-uid: dec317e9-e558-46c2-bfb7-ce848aaccd93
          2023-10-10 10:00:42.457+0000 [id=167962] INFO o.internal.platform.Platform#log: x-kubernetes-pf-prioritylevel-uid: d25f04bc-e0f0-446b-bff0-210a6f4bd563
          2023-10-10 10:00:42.457+0000 [id=167962] INFO o.internal.platform.Platform#log: date: Tue, 10 Oct 2023 10:00:42 GMT
          2023-10-10 10:00:42.458+0000 [id=167962] INFO o.internal.platform.Platform#log: <-- END HTTP (16645-byte body)
           


          Dion added a comment -

          I am now able to reproduce the issue. Simply try to launch an agent with an invalid secret/configmap configured. Kubernetes will error and Jenkins will sit idle, waiting for the agent to connect. While this is ongoing, the queue builds up until the launch times out and restarts.
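
          A hedged sketch of that reproduction using the fabric8 client (placeholder names; the pod mirrors the describe output below, with an env var sourced from a secret key that does not exist):

          import io.fabric8.kubernetes.api.model.Pod;
          import io.fabric8.kubernetes.api.model.PodBuilder;
          import io.fabric8.kubernetes.client.KubernetesClient;
          import io.fabric8.kubernetes.client.KubernetesClientBuilder;

          public class BadSecretRepro {
              public static void main(String[] args) {
                  // Referencing a missing secret/key drives the container into
                  // CreateContainerConfigError: the pod is "created" but never starts.
                  Pod pod = new PodBuilder()
                          .withNewMetadata().withName("test").endMetadata()
                          .withNewSpec()
                              .addNewContainer()
                                  .withName("test")
                                  .withImage("alpine:latest")
                                  .withCommand("sleep", "3600")
                                  .addNewEnv()
                                      .withName("USERNAME")
                                      .withNewValueFrom()
                                          .withNewSecretKeyRef()
                                              .withName("user")   // secret name
                                              .withKey("test")    // key inside the secret (missing)
                                              .withOptional(false)
                                          .endSecretKeyRef()
                                      .endValueFrom()
                                  .endEnv()
                              .endContainer()
                              .withRestartPolicy("Never")
                          .endSpec()
                          .build();
                  try (KubernetesClient client = new KubernetesClientBuilder().build()) {
                      client.pods().inNamespace("REDACTED").resource(pod).create();
                  }
              }
          }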

          The plugin seems to be unable to catch failures from Kubernetes, and it gets hung if the pod is "created" but never starts and is never terminated. The reproduction pod ends up stuck like this:

          Name:             test
          Namespace:        REDACTED
          Priority:         0
          Service Account:  default
          Node:             REDACTED
          Start Time:       Tue, 10 Oct 2023 10:36:01 -0400
          Annotations:      kubernetes.io/psp: eks.privileged
          Status:           Pending
          IP:               10.224.23.15
          IPs:
            IP:  10.224.23.15
          Containers:
            test:
              Container ID:   
              Image:          alpine:latest
              Image ID:       
              Port:           <none>
              Host Port:      <none>
              State:          Waiting
                Reason:       CreateContainerConfigError
              Ready:          False
              Restart Count:  0
              Environment:
                USERNAME:  <set to the key 'test' in secret 'user'>  Optional: false
              Mounts:
                /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-6gf4b (ro)

          Looks like we need to improve the way that Jenkins monitors the pod after creation so that it can catch these. IIRC previous versions of Kubernetes did not cause containers to enter this Waiting state.
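
          A minimal fabric8 sketch of the kind of post-creation check that could catch this (placeholder names and reason list; this is not the plugin's existing code):

          import java.util.List;
          import io.fabric8.kubernetes.api.model.ContainerStatus;
          import io.fabric8.kubernetes.api.model.Pod;
          import io.fabric8.kubernetes.client.KubernetesClient;
          import io.fabric8.kubernetes.client.KubernetesClientBuilder;

          public class PodWaitingCheck {
              // Waiting reasons assumed to be unrecoverable (assumption, not an exhaustive list).
              private static final List<String> FATAL_REASONS = List.of(
                      "CreateContainerConfigError", "ErrImagePull", "ImagePullBackOff", "InvalidImageName");

              public static void main(String[] args) {
                  String namespace = "REDACTED";  // placeholder
                  String podName = "test";        // the reproduction pod described above
                  try (KubernetesClient client = new KubernetesClientBuilder().build()) {
                      Pod pod = client.pods().inNamespace(namespace).withName(podName).get();
                      if (pod == null || pod.getStatus() == null) {
                          return;
                      }
                      for (ContainerStatus cs : pod.getStatus().getContainerStatuses()) {
                          if (cs.getState() != null && cs.getState().getWaiting() != null
                                  && FATAL_REASONS.contains(cs.getState().getWaiting().getReason())) {
                              // A real fix would terminate the agent here instead of waiting
                              // for the launch timeout, so the build queue can drain.
                              System.err.println("Container " + cs.getName() + " stuck in Waiting: "
                                      + cs.getState().getWaiting().getReason());
                          }
                      }
                  }
              }
          }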


            Assignee: Unassigned
            Reporter: Lars Berntzon (bildrulle)
            Votes: 10
            Watchers: 15