[JENKINS-71796] Some times the kubernetes just stops creating agents.

Ritchelle added a comment - 2023-09-20 20:36

We are experiencing a similar issue, with the environment:

Kubernetes 1.24/1.25
Jenkins version 2.414.1
kubernetes plugin version 4029.v5712230ccb_f8
kubernetes client API plugin version 6.8.1-224.vd388fca_4db_3b_

Another thing to note is that there are no relevant controller logs so far. Therefore, the issue is hard to debug and predict.

Ritchelle added a comment - 2023-09-20 20:36 We are experiencing a similar issue, with the environment: Kubernetes 1.24/1.25 Jenkins version 2.414.1 kubernetes plugin version 4029.v5712230ccb_f8 kubernetes client API plugin version 6.8.1-224.vd388fca_4db_3b_ Another thing to note is that there are no relevant controller logs so far. Therefore, the issue is hard to debug and predict.

Dion added a comment - 2023-10-05 17:13 - edited

Also experiencing this issue.

Kubernetes 1.23/1.24
Jenkins version 2.414.1
Kubernetes plugin 4007.v633279962016
Kubernetes Client API 6.4.1-215.v2ed17097a_8e9

I'd like to provide some more details from my troubleshooting:

Occasionally, Jenkins will timeout after attempting to create the pod and the console logs looks like:

12:20:56  Still waiting to schedule task
12:20:56  All nodes of label ‘REDACTED_278-w4h5t’ are offline
12:35:34  Created Pod: ns-dev build-ns-dev/REDACTED-278-32670

If I check the kubernetes event logs or kubeAPI for the pod in question, I don’t see it at all, as if request never hit the API. You can see that the suffix on the created pod is different than the one it was initially waiting for.

If I check jenkins system logs and search for the string `278-w4h5t`, I find nothing. I'm now attempting to capture this with additional logging enabled (org.csanchez.jenkins.plugins.kubernetes).

After about 15 minutes of waiting, Jenkins eventually times out and tries again and is usually successful, though sometimes this issue may occur back-to-back, resulting in a 30+ min delay.

12:35:34  Created Pod: ns-dev build-ns-dev/REDACTED-278-32670
12:35:42  Agent REDACTED-278-32670 is provisioned from template REDACTED_278-w4h5t-wq4p4
12:35:42  ---
12:35:42  apiVersion: "v1"
12:35:42  kind: "Pod"
12:35:42  metadata:
12:35:42    annotations:
12:35:42      iam.amazonaws.com/role: "arn:aws:iam::REDACTED:role/REDACTED"
12:35:42      buildUrl: "https://REDACTED/job/REDACTED/job/REDACTED/job/REDACTED/278/"
12:35:42      runUrl: "job/REDACTED/job/REDACTED/job/REDACTED/278/"
12:35:42      jobPattern: "(^job/REDACTED/.*/(.*)/[0-9]+)"
12:35:42    labels:
12:35:42      jenkins: "slave"
12:35:42      jenkins/label-digest: "REDACTED"
12:35:42      jenkins/label: "REDACTED_278-w4h5t"
12:35:42    name: "REDACTED-278-32670"
12:35:42    namespace: "build-ns-dev"
12:35:42  spec:
12:35:42    containers:
12:35:42    - image: "REDACTED"
12:35:42      name: "REDACTED"
12:35:42      resources:
12:35:42        requests:
12:35:42          cpu: "1"
12:35:42          memory: "4Gi"
12:35:42      tty: true
12:35:42      volumeMounts:
12:35:42      - mountPath: "/mnt/gpg-certificate"
12:35:42        name: "gpg-certificate"
12:35:42      - mountPath: "/mnt/gpg-certificate-phrase"
12:35:42        name: "gpg-certificate-phrase"
12:35:42      - mountPath: "/home/jenkins/agent"
12:35:42        name: "workspace-volume"
12:35:42        readOnly: false
12:35:42    - env:
12:35:42      - name: "JENKINS_SECRET"
12:35:42        value: "********"
12:35:42      - name: "JENKINS_AGENT_NAME"
12:35:42        value: "REDACTED-278-32670"
12:35:42      - name: "JENKINS_WEB_SOCKET"
12:35:42        value: "true"
12:35:42      - name: "JENKINS_NAME"
12:35:42        value: "REDACTED-278-32670"
12:35:42      - name: "JENKINS_AGENT_WORKDIR"
12:35:42        value: "/home/jenkins/agent"
12:35:42      - name: "JENKINS_URL"
12:35:42        value: "https://REDACTED/"
12:35:42      image: "jenkins/inbound-agent:3142.vcfca_0cd92128-1"
12:35:42      name: "jnlp"
12:35:42      resources:
12:35:42        requests:
12:35:42          memory: "256Mi"
12:35:42          cpu: "100m"
12:35:42      volumeMounts:
12:35:42      - mountPath: "/home/jenkins/agent"
12:35:42        name: "workspace-volume"
12:35:42        readOnly: false
12:35:42    nodeSelector:
12:35:42      kubernetes.io/os: "linux"
12:35:42    restartPolicy: "Never"
12:35:42    serviceAccountName: "am"
12:35:42    volumes:
12:35:42    - name: "gpg-certificate"
12:35:42      secret:
12:35:42        secretName: "REDACTED"
12:35:42    - name: "gpg-certificate-phrase"
12:35:42      secret:
12:35:42        secretName: "REDACTED"
12:35:42    - emptyDir:
12:35:42        medium: ""
12:35:42      name: "workspace-volume"
12:35:42  
12:35:43  Running on REDACTED-278-32670 in /home/jenkins/agent/workspace/REDACTED

I can then see the second call to generate the pod was successful in the event logs.

Dion added a comment - 2023-10-05 17:13 - edited Also experiencing this issue. Kubernetes 1.23/1.24 Jenkins version 2.414.1 Kubernetes plugin 4007.v633279962016 Kubernetes Client API 6.4.1-215.v2ed17097a_8e9 I'd like to provide some more details from my troubleshooting: Occasionally, Jenkins will timeout after attempting to create the pod and the console logs looks like: 12:20:56 Still waiting to schedule task 12:20:56 All nodes of label ‘REDACTED_278-w4h5t’ are offline 12:35:34 Created Pod: ns-dev build-ns-dev/REDACTED-278-32670 If I check the kubernetes event logs or kubeAPI for the pod in question, I don’t see it at all, as if request never hit the API. You can see that the suffix on the created pod is different than the one it was initially waiting for. If I check jenkins system logs and search for the string `278-w4h5t`, I find nothing. I'm now attempting to capture this with additional logging enabled (org.csanchez.jenkins.plugins.kubernetes). After about 15 minutes of waiting, Jenkins eventually times out and tries again and is usually successful, though sometimes this issue may occur back-to-back, resulting in a 30+ min delay. 12:35:34 Created Pod: ns-dev build-ns-dev/REDACTED-278-32670 12:35:42 Agent REDACTED-278-32670 is provisioned from template REDACTED_278-w4h5t-wq4p4 12:35:42 --- 12:35:42 apiVersion: "v1" 12:35:42 kind: "Pod" 12:35:42 metadata: 12:35:42 annotations: 12:35:42 iam.amazonaws.com/role: "arn:aws:iam::REDACTED:role/REDACTED" 12:35:42 buildUrl: "https: //REDACTED/job/REDACTED/job/REDACTED/job/REDACTED/278/" 12:35:42 runUrl: "job/REDACTED/job/REDACTED/job/REDACTED/278/" 12:35:42 jobPattern: "(^job/REDACTED/.*/(.*)/[0-9]+)" 12:35:42 labels: 12:35:42 jenkins: "slave" 12:35:42 jenkins/label-digest: "REDACTED" 12:35:42 jenkins/label: "REDACTED_278-w4h5t" 12:35:42 name: "REDACTED-278-32670" 12:35:42 namespace: "build-ns-dev" 12:35:42 spec: 12:35:42 containers: 12:35:42 - image: "REDACTED" 12:35:42 name: "REDACTED" 12:35:42 resources: 12:35:42 requests: 12:35:42 cpu: "1" 12:35:42 memory: "4Gi" 12:35:42 tty: true 12:35:42 volumeMounts: 12:35:42 - mountPath: "/mnt/gpg-certificate" 12:35:42 name: "gpg-certificate" 12:35:42 - mountPath: "/mnt/gpg-certificate-phrase" 12:35:42 name: "gpg-certificate-phrase" 12:35:42 - mountPath: "/home/jenkins/agent" 12:35:42 name: "workspace-volume" 12:35:42 readOnly: false 12:35:42 - env: 12:35:42 - name: "JENKINS_SECRET" 12:35:42 value: "********" 12:35:42 - name: "JENKINS_AGENT_NAME" 12:35:42 value: "REDACTED-278-32670" 12:35:42 - name: "JENKINS_WEB_SOCKET" 12:35:42 value: " true " 12:35:42 - name: "JENKINS_NAME" 12:35:42 value: "REDACTED-278-32670" 12:35:42 - name: "JENKINS_AGENT_WORKDIR" 12:35:42 value: "/home/jenkins/agent" 12:35:42 - name: "JENKINS_URL" 12:35:42 value: "https: //REDACTED/" 12:35:42 image: "jenkins/inbound-agent:3142.vcfca_0cd92128-1" 12:35:42 name: "jnlp" 12:35:42 resources: 12:35:42 requests: 12:35:42 memory: "256Mi" 12:35:42 cpu: "100m" 12:35:42 volumeMounts: 12:35:42 - mountPath: "/home/jenkins/agent" 12:35:42 name: "workspace-volume" 12:35:42 readOnly: false 12:35:42 nodeSelector: 12:35:42 kubernetes.io/os: "linux" 12:35:42 restartPolicy: "Never" 12:35:42 serviceAccountName: "am" 12:35:42 volumes: 12:35:42 - name: "gpg-certificate" 12:35:42 secret: 12:35:42 secretName: "REDACTED" 12:35:42 - name: "gpg-certificate-phrase" 12:35:42 secret: 12:35:42 secretName: "REDACTED" 12:35:42 - emptyDir: 12:35:42 medium: "" 12:35:42 name: "workspace-volume" 12:35:42 12:35:43 Running on REDACTED-278-32670 in /home/jenkins/agent/workspace/REDACTED I can then see the second call to generate the pod was successful in the event logs.

Sigi Kiermayer added a comment - 2023-10-09 09:47

We have not found any reasonable log output telling us why Jenkins was unable to create a pod and we also don't see anything in the api itself.

Even with setting org.csanchez.jenkins.plugins.kubernetes.

We should look into the plugin to see what could block the plugin to do something or log something. There might be something we just don't see.

Or dionj did you see the timeout or anything in your logs?

Sigi Kiermayer added a comment - 2023-10-09 09:47 We have not found any reasonable log output telling us why Jenkins was unable to create a pod and we also don't see anything in the api itself. Even with setting org.csanchez.jenkins.plugins.kubernetes. We should look into the plugin to see what could block the plugin to do something or log something. There might be something we just don't see. Or dionj did you see the timeout or anything in your logs?

Dion added a comment - 2023-10-09 15:15 - edited

siegfried I've had a difficult time gathering logs. I've added the kubernetes logger, but the logs aren't maintained for very long, so by the time I see the failure, the logs from the provisioning have already rotated out.

Anecdotally, I was watching this Friday when it was occurring and the Jenkins UI had quite a lot of builds queued up and the Kubernetes logs were very quiet. No relevant logs that I could find. After some time it suddenly created all of the agents without issue (except for the ones that timed out before).

Dion added a comment - 2023-10-09 15:15 - edited siegfried I've had a difficult time gathering logs. I've added the kubernetes logger, but the logs aren't maintained for very long, so by the time I see the failure, the logs from the provisioning have already rotated out. Anecdotally, I was watching this Friday when it was occurring and the Jenkins UI had quite a lot of builds queued up and the Kubernetes logs were very quiet. No relevant logs that I could find. After some time it suddenly created all of the agents without issue (except for the ones that timed out before).

Sigi Kiermayer added a comment - 2023-10-09 15:41

dionj there has to be a code path which we could add logging. I'm not sure yet how easy it will be to contribute.

There is also the risk that its correlated to two code positions: either the one were jenkins is executing the provisioning plugin (kubernetes) or the kubernetes plugin itself

Sigi Kiermayer added a comment - 2023-10-09 15:41 dionj there has to be a code path which we could add logging. I'm not sure yet how easy it will be to contribute. There is also the risk that its correlated to two code positions: either the one were jenkins is executing the provisioning plugin (kubernetes) or the kubernetes plugin itself

Dion added a comment - 2023-10-10 11:36 - edited

Turns out the logs are all saved on disk, but the UI only shows a small portion of it. Managed to capture this in action with the following loggers set to `ALL`

org.csanchez.jenkins.plugins.kubernetes
hudson.slaves.NodeProvisioner
hudson.slaves.AbstractCloudSlave

When I logged in this morning, I saw that there was a backup in the build queue. Investigating the jobs in the queue, I happened across this one job which had been timing out and retrying every 15min for over 4hrs. During those 15minutes the Build Queue grows and no other pods are scheduled. Once it times out, all of the queued items get executed and it goes back to queuing for another 15 minutes.

Cranked the logs and collected the full 15minutes following one of the pods: `REDACTED-155-rh-gt9vs`

It would be difficult to sanitize these logs, but I didn't see much of value tbh. It looked normal and when I went digging through my Kubernetes event logs I could find it, but it was quite short-lived.

Console log looping through these logs as it continues to retry:

05:54:20  Created Pod: REDACTED
06:11:00  ERROR: Failed to launch REDACTED
06:11:00  io.fabric8.kubernetes.client.KubernetesClientTimeoutException: Timed out waiting for [1000000] milliseconds for [Pod] with name:[REDACTED] in namespace [REDACTED].
06:11:00  	at io.fabric8.kubernetes.client.dsl.internal.BaseOperation.waitUntilCondition(BaseOperation.java:896)
06:11:00  	at io.fabric8.kubernetes.client.dsl.internal.BaseOperation.waitUntilReady(BaseOperation.java:878)
06:11:00  	at io.fabric8.kubernetes.client.dsl.internal.BaseOperation.waitUntilReady(BaseOperation.java:93)
06:11:00  	at org.csanchez.jenkins.plugins.kubernetes.KubernetesLauncher.launch(KubernetesLauncher.java:169)
06:11:00  	at hudson.slaves.SlaveComputer.lambda$_connect$0(SlaveComputer.java:297)
06:11:00  	at jenkins.util.ContextResettingExecutorService$2.call(ContextResettingExecutorService.java:46)
06:11:00  	at jenkins.security.ImpersonatingExecutorService$2.call(ImpersonatingExecutorService.java:80)
06:11:00  	at java.base/java.util.concurrent.FutureTask.run(Unknown Source)
06:11:00  	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
06:11:00  	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
06:11:00  	at java.base/java.lang.Thread.run(Unknown Source)

Found this in the Jenkins system log...

2023-10-10 10:00:42.387+0000 [id=167889] INFO o.c.j.p.k.KubernetesSlave#_terminate: Terminating Kubernetes instance for agent REDACTED
2023-10-10 10:00:42.402+0000 [id=167962] INFO o.internal.platform.Platform#log: --> DELETE https://172.20.0.1/api/v1/namespaces/REDACTED/pods/REDACTED h2
2023-10-10 10:00:42.402+0000 [id=167962] INFO o.internal.platform.Platform#log: Authorization: Bearer REDACTED
2023-10-10 10:00:42.402+0000 [id=167962] INFO o.internal.platform.Platform#log: User-Agent: fabric8-kubernetes-client/6.4.1
2023-10-10 10:00:42.402+0000 [id=167962] INFO o.internal.platform.Platform#log: Content-Type: application/json; charset=utf-8
2023-10-10 10:00:42.402+0000 [id=167962] INFO o.internal.platform.Platform#log: Content-Length: 75
2023-10-10 10:00:42.403+0000 [id=167962] INFO o.internal.platform.Platform#log: Host: 172.20.0.1
2023-10-10 10:00:42.403+0000 [id=167962] INFO o.internal.platform.Platform#log: Connection: Keep-Alive
2023-10-10 10:00:42.403+0000 [id=167962] INFO o.internal.platform.Platform#log: Accept-Encoding: gzip
2023-10-10 10:00:42.403+0000 [id=167962] INFO o.internal.platform.Platform#log:
2023-10-10 10:00:42.403+0000 [id=167962] INFO o.internal.platform.Platform#log: {"apiVersion":"v1","kind":"DeleteOptions","propagationPolicy":"Background"}
2023-10-10 10:00:42.403+0000 [id=167962] INFO o.internal.platform.Platform#log: --> END DELETE (75-byte body)
2023-10-10 10:00:42.457+0000 [id=167962] INFO o.internal.platform.Platform#log: <-- 200 https://172.20.0.1/api/v1/namespaces/REDACTED/pods/REDACTED (54ms)
2023-10-10 10:00:42.457+0000 [id=167962] INFO o.internal.platform.Platform#log: audit-id: acf7caf0-7f7d-44be-951f-fbf599cbde5c
2023-10-10 10:00:42.457+0000 [id=167962] INFO o.internal.platform.Platform#log: cache-control: no-cache, private
2023-10-10 10:00:42.457+0000 [id=167962] INFO o.internal.platform.Platform#log: content-type: application/json
2023-10-10 10:00:42.457+0000 [id=167962] INFO o.internal.platform.Platform#log: x-kubernetes-pf-flowschema-uid: dec317e9-e558-46c2-bfb7-ce848aaccd93
2023-10-10 10:00:42.457+0000 [id=167962] INFO o.internal.platform.Platform#log: x-kubernetes-pf-prioritylevel-uid: d25f04bc-e0f0-446b-bff0-210a6f4bd563
2023-10-10 10:00:42.457+0000 [id=167962] INFO o.internal.platform.Platform#log: date: Tue, 10 Oct 2023 10:00:42 GMT
2023-10-10 10:00:42.458+0000 [id=167962] INFO o.internal.platform.Platform#log: <-- END HTTP (16645-byte body)

Dion added a comment - 2023-10-10 11:36 - edited Turns out the logs are all saved on disk, but the UI only shows a small portion of it. Managed to capture this in action with the following loggers set to `ALL` org.csanchez.jenkins.plugins.kubernetes hudson.slaves.NodeProvisioner hudson.slaves.AbstractCloudSlave When I logged in this morning, I saw that there was a backup in the build queue. Investigating the jobs in the queue, I happened across this one job which had been timing out and retrying every 15min for over 4hrs. During those 15minutes the Build Queue grows and no other pods are scheduled. Once it times out, all of the queued items get executed and it goes back to queuing for another 15 minutes. Cranked the logs and collected the full 15minutes following one of the pods: `REDACTED-155-rh-gt9vs` It would be difficult to sanitize these logs, but I didn't see much of value tbh. It looked normal and when I went digging through my Kubernetes event logs I could find it, but it was quite short-lived. Console log looping through these logs as it continues to retry: 05:54:20 Created Pod: REDACTED 06:11:00 ERROR: Failed to launch REDACTED 06:11:00 io.fabric8.kubernetes.client.KubernetesClientTimeoutException: Timed out waiting for [1000000] milliseconds for [Pod] with name:[REDACTED] in namespace [REDACTED]. 06:11:00 at io.fabric8.kubernetes.client.dsl.internal.BaseOperation.waitUntilCondition(BaseOperation.java:896) 06:11:00 at io.fabric8.kubernetes.client.dsl.internal.BaseOperation.waitUntilReady(BaseOperation.java:878) 06:11:00 at io.fabric8.kubernetes.client.dsl.internal.BaseOperation.waitUntilReady(BaseOperation.java:93) 06:11:00 at org.csanchez.jenkins.plugins.kubernetes.KubernetesLauncher.launch(KubernetesLauncher.java:169) 06:11:00 at hudson.slaves.SlaveComputer.lambda$_connect$0(SlaveComputer.java:297) 06:11:00 at jenkins.util.ContextResettingExecutorService$2.call(ContextResettingExecutorService.java:46) 06:11:00 at jenkins.security.ImpersonatingExecutorService$2.call(ImpersonatingExecutorService.java:80) 06:11:00 at java.base/java.util.concurrent.FutureTask.run(Unknown Source) 06:11:00 at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) 06:11:00 at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) 06:11:00 at java.base/java.lang. Thread .run(Unknown Source) Found this in the Jenkins system log... 2023-10-10 10:00:42.387+0000 [id=167889] INFO o.c.j.p.k.KubernetesSlave#_terminate: Terminating Kubernetes instance for agent REDACTED 2023-10-10 10:00:42.402+0000 [id=167962] INFO o.internal.platform.Platform#log: --> DELETE https: //172.20.0.1/api/v1/namespaces/REDACTED/pods/REDACTED h2 2023-10-10 10:00:42.402+0000 [id=167962] INFO o.internal.platform.Platform#log: Authorization: Bearer REDACTED 2023-10-10 10:00:42.402+0000 [id=167962] INFO o.internal.platform.Platform#log: User-Agent: fabric8-kubernetes-client/6.4.1 2023-10-10 10:00:42.402+0000 [id=167962] INFO o.internal.platform.Platform#log: Content-Type: application/json; charset=utf-8 2023-10-10 10:00:42.402+0000 [id=167962] INFO o.internal.platform.Platform#log: Content-Length: 75 2023-10-10 10:00:42.403+0000 [id=167962] INFO o.internal.platform.Platform#log: Host: 172.20.0.1 2023-10-10 10:00:42.403+0000 [id=167962] INFO o.internal.platform.Platform#log: Connection: Keep-Alive 2023-10-10 10:00:42.403+0000 [id=167962] INFO o.internal.platform.Platform#log: Accept-Encoding: gzip 2023-10-10 10:00:42.403+0000 [id=167962] INFO o.internal.platform.Platform#log: 2023-10-10 10:00:42.403+0000 [id=167962] INFO o.internal.platform.Platform#log: { "apiVersion" : "v1" , "kind" : "DeleteOptions" , "propagationPolicy" : "Background" } 2023-10-10 10:00:42.403+0000 [id=167962] INFO o.internal.platform.Platform#log: --> END DELETE (75- byte body) 2023-10-10 10:00:42.457+0000 [id=167962] INFO o.internal.platform.Platform#log: <-- 200 https: //172.20.0.1/api/v1/namespaces/REDACTED/pods/REDACTED (54ms) 2023-10-10 10:00:42.457+0000 [id=167962] INFO o.internal.platform.Platform#log: audit-id: acf7caf0-7f7d-44be-951f-fbf599cbde5c 2023-10-10 10:00:42.457+0000 [id=167962] INFO o.internal.platform.Platform#log: cache-control: no-cache, private 2023-10-10 10:00:42.457+0000 [id=167962] INFO o.internal.platform.Platform#log: content-type: application/json 2023-10-10 10:00:42.457+0000 [id=167962] INFO o.internal.platform.Platform#log: x-kubernetes-pf-flowschema-uid: dec317e9-e558-46c2-bfb7-ce848aaccd93 2023-10-10 10:00:42.457+0000 [id=167962] INFO o.internal.platform.Platform#log: x-kubernetes-pf-prioritylevel-uid: d25f04bc-e0f0-446b-bff0-210a6f4bd563 2023-10-10 10:00:42.457+0000 [id=167962] INFO o.internal.platform.Platform#log: date: Tue, 10 Oct 2023 10:00:42 GMT 2023-10-10 10:00:42.458+0000 [id=167962] INFO o.internal.platform.Platform#log: <-- END HTTP (16645- byte body)

Dion added a comment - 2023-10-10 15:05

I am now able to reproduce the issue. Simply try to launch an agent with an invalid secret/configmap configured. Kubernetes will error and Jenkins will sit idle, waiting for it to connect. While this is ongoing, the queue builds up until it times out and restarts.

It seems to be unable to catch failures from Kubernetes and causes it to get hung if the pod is "created" but never starts and doesn't get terminated.

Name:             test
Namespace:        REDACTED
Priority:         0
Service Account:  default
Node:             REDACTED
Start Time:       Tue, 10 Oct 2023 10:36:01 -0400
Annotations:      kubernetes.io/psp: eks.privileged
Status:           Pending
IP:               10.224.23.15
IPs:
  IP:  10.224.23.15
Containers:
  test:
    Container ID:   
    Image:          alpine:latest
    Image ID:       
    Port:           <none>
    Host Port:      <none>
    State:          Waiting
      Reason:       CreateContainerConfigError
    Ready:          False
    Restart Count:  0
    Environment:
      USERNAME:  <set to the key 'test' in secret 'user'>  Optional: false
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-6gf4b (ro)

Looks like we need to improve the way that Jenkins monitors the pod after creation so that it can catch these. IIRC previous versions of Kubernetes did not cause containers to enter this Waiting state.

Dion added a comment - 2023-10-10 15:05 I am now able to reproduce the issue. Simply try to launch an agent with an invalid secret/configmap configured. Kubernetes will error and Jenkins will sit idle, waiting for it to connect. While this is ongoing, the queue builds up until it times out and restarts. It seems to be unable to catch failures from Kubernetes and causes it to get hung if the pod is "created" but never starts and doesn't get terminated. Name: test Namespace: REDACTED Priority: 0 Service Account: default Node: REDACTED Start Time: Tue, 10 Oct 2023 10:36:01 -0400 Annotations: kubernetes.io/psp: eks.privileged Status: Pending IP: 10.224.23.15 IPs: IP: 10.224.23.15 Containers: test: Container ID: Image: alpine:latest Image ID: Port: <none> Host Port: <none> State: Waiting Reason: CreateContainerConfigError Ready: False Restart Count: 0 Environment: USERNAME: <set to the key 'test' in secret 'user' > Optional: false Mounts: / var /run/secrets/kubernetes.io/serviceaccount from kube-api-access-6gf4b (ro) Looks like we need to improve the way that Jenkins monitors the pod after creation so that it can catch these. IIRC previous versions of Kubernetes did not cause containers to enter this Waiting state.

Sigi Kiermayer added a comment - 2023-10-11 08:36

dionj is that a response from the kubernetes api or did you get that output by describing a pending pod?

Sigi Kiermayer added a comment - 2023-10-11 08:36 dionj is that a response from the kubernetes api or did you get that output by describing a pending pod?

Dion added a comment - 2023-10-11 18:56 - edited

siegfried describing the pending pod.

I managed to sanitize all the verbose logs, removing the extraneous information and focusing on the single pod: sanitized_reproduction.log

I'll boil it down further to what I think is the main points here:

Node successfully provisioned and Pod created:

2023-10-10 09:54:20.576+0000 [id=34]  INFO  h.s.NodeProvisioner$StandardStrategyImpl#apply: Started provisioning REDACTED from REDACTED with 1 executors. Remaining excess workload: -02023-10-10 09:54:20.576+0000 [id=34]  FINER hudson.slaves.NodeProvisioner#update: Provisioning strategy hudson.slaves.NodeProvisioner$StandardStrategyImpl@3981c444 declared provisioning complete
2023-10-10 09:54:20.581+0000 [id=162189]  FINEST  o.c.j.p.k.KubernetesCloud#connect: Building connection to Kubernetes REDACTED URL null namespace REDACTED
2023-10-10 09:54:20.581+0000 [id=162189]  FINE  o.c.j.p.k.KubernetesCloud#connect: Connected to Kubernetes REDACTED URL https://172.20.0.1:443/ namespace REDACTED
2023-10-10 09:54:20.587+0000 [id=34]  INFO  hudson.slaves.NodeProvisioner#update: REDACTED provisioning successfully completed. We have now 4 computer(s)
2023-10-10 09:54:20.589+0000 [id=162189]  FINE  o.c.j.p.k.KubernetesLauncher#launch: Creating Pod: REDACTED REDACTED/REDACTED
(100x) 2023-10-10 09:54:20.601+0000 [id=34] FINER  hudson.slaves.NodeProvisioner#update: ran update on REDACTED in 0ms
2023-10-10 09:54:20.850+0000 [id=162189]  INFO  o.c.j.p.k.KubernetesLauncher#launch: Created Pod: REDACTED REDACTED/REDACTED

At this time I check Kubernetes to see the pod and I can see that it's in a Pending status, but is no longer retrying as the container is in a "Waiting" state with reason of `CreateContainerConfigError` after failing to pull a secret that does not exist in the namespace.

Pod is deleted after passing the 5 minute connection timeout

2023-10-10 10:00:42.387+0000 [id=167889] INFO o.c.j.p.k.KubernetesSlave#_terminate: Terminating Kubernetes instance for agent REDACTED
2023-10-10 10:00:42.402+0000 [id=167962] INFO o.internal.platform.Platform#log: --> DELETE https://172.20.0.1/api/v1/namespaces/REDACTED/pods/REDACTED h2
2023-10-10 10:00:42.402+0000 [id=167962] INFO o.internal.platform.Platform#log: Authorization: Bearer REDACTED
2023-10-10 10:00:42.402+0000 [id=167962] INFO o.internal.platform.Platform#log: User-Agent: fabric8-kubernetes-client/6.4.1
2023-10-10 10:00:42.402+0000 [id=167962] INFO o.internal.platform.Platform#log: Content-Type: application/json; charset=utf-8
2023-10-10 10:00:42.402+0000 [id=167962] INFO o.internal.platform.Platform#log: Content-Length: 75
2023-10-10 10:00:42.403+0000 [id=167962] INFO o.internal.platform.Platform#log: Host: 172.20.0.1
2023-10-10 10:00:42.403+0000 [id=167962] INFO o.internal.platform.Platform#log: Connection: Keep-Alive
2023-10-10 10:00:42.403+0000 [id=167962] INFO o.internal.platform.Platform#log: Accept-Encoding: gzip
2023-10-10 10:00:42.403+0000 [id=167962] INFO o.internal.platform.Platform#log:
2023-10-10 10:00:42.403+0000 [id=167962] INFO o.internal.platform.Platform#log: {"apiVersion":"v1","kind":"DeleteOptions","propagationPolicy":"Background"}
2023-10-10 10:00:42.403+0000 [id=167962] INFO o.internal.platform.Platform#log: --> END DELETE (75-byte body)
2023-10-10 10:00:42.457+0000 [id=167962] INFO o.internal.platform.Platform#log: <-- 200 https://172.20.0.1/api/v1/namespaces/REDACTED/pods/REDACTED (54ms)
2023-10-10 10:00:42.457+0000 [id=167962] INFO o.internal.platform.Platform#log: audit-id: acf7caf0-7f7d-44be-951f-fbf599cbde5c
2023-10-10 10:00:42.457+0000 [id=167962] INFO o.internal.platform.Platform#log: cache-control: no-cache, private
2023-10-10 10:00:42.457+0000 [id=167962] INFO o.internal.platform.Platform#log: content-type: application/json
2023-10-10 10:00:42.457+0000 [id=167962] INFO o.internal.platform.Platform#log: x-kubernetes-pf-flowschema-uid: dec317e9-e558-46c2-bfb7-ce848aaccd93
2023-10-10 10:00:42.457+0000 [id=167962] INFO o.internal.platform.Platform#log: x-kubernetes-pf-prioritylevel-uid: d25f04bc-e0f0-446b-bff0-210a6f4bd563
2023-10-10 10:00:42.457+0000 [id=167962] INFO o.internal.platform.Platform#log: date: Tue, 10 Oct 2023 10:00:42 GMT
2023-10-10 10:00:42.458+0000 [id=167962] INFO o.internal.platform.Platform#log: <-- END HTTP (16645-byte body)
2023-10-10 10:00:42.459+0000 [id=167889] INFO o.c.j.p.k.KubernetesSlave#deleteSlavePod: Terminated Kubernetes instance for agent REDACTED/REDACTED
2023-10-10 10:00:42.459+0000 [id=167889] INFO o.c.j.p.k.KubernetesSlave#_terminate: Disconnected computer REDACTED
2023-10-10 10:00:42.461+0000 [id=165373] INFO j.s.DefaultJnlpSlaveReceiver#channelClosed: Jetty (winstone)-165373 for REDACTED terminated: java.nio.channels.ClosedChannelException

The Jenkins console log does not indicate anything at this point, continuing to wait.

During this time, several builds tried to run and got caught in the queue.

And then the 15 min read timeout finally being hit.

2023-10-10 10:11:00.852+0000 [id=162189]  FINER o.c.j.p.k.KubernetesLauncher#launch: Removing Jenkins node: REDACTED
2023-10-10 10:11:00.852+0000 [id=162189]  INFO  o.c.j.p.k.KubernetesSlave#_terminate: Terminating Kubernetes instance for agent REDACTED
2023-10-10 10:11:00.852+0000 [id=162189]  FINEST  o.c.j.p.k.KubernetesCloud#connect: Building connection to Kubernetes REDACTED URL null namespace REDACTED
2023-10-10 10:11:00.852+0000 [id=162189]  FINE  o.c.j.p.k.KubernetesFactoryAdapter#createClient: Autoconfiguring Kubernetes client
2023-10-10 10:11:00.852+0000 [id=162189]  FINE  o.c.j.p.k.KubernetesFactoryAdapter#createClient: Creating Kubernetes client: KubernetesFactoryAdapter [serviceAddress=null, namespace=REDACTED, caCertData=null, credentials=null, skipTlsVerify=false, connectTimeout=5, readTimeout=15]
2023-10-10 10:11:00.852+0000 [id=162189]  FINE  o.c.j.p.k.KubernetesFactoryAdapter#createClient: Proxy Settings for Cloud: false
2023-10-10 10:11:00.859+0000 [id=162189]  FINE  o.c.j.p.k.KubernetesClientProvider#createClient: Created new Kubernetes client: REDACTED io.fabric8.kubernetes.client.impl.KubernetesClientImpl@67588030
2023-10-10 10:11:00.859+0000 [id=162189]  FINE  o.c.j.p.k.KubernetesCloud#connect: Connected to Kubernetes REDACTED URL https://172.20.0.1:443/ namespace REDACTED
2023-10-10 10:11:00.859+0000 [id=162189]  SEVERE  o.c.j.p.k.KubernetesSlave#_terminate: Computer for agent is null: REDACTED
2023-10-10 10:11:00.859+0000 [id=162189]  INFO  hudson.slaves.AbstractCloudSlave#terminate: FATAL: Computer for agent is null: REDACTED

After this, the logs explode and all the queued builds get launched.

Dion added a comment - 2023-10-11 18:56 - edited siegfried describing the pending pod. I managed to sanitize all the verbose logs, removing the extraneous information and focusing on the single pod: sanitized_reproduction.log I'll boil it down further to what I think is the main points here: Node successfully provisioned and Pod created: 2023-10-10 09:54:20.576+0000 [id=34] INFO h.s.NodeProvisioner$StandardStrategyImpl#apply: Started provisioning REDACTED from REDACTED with 1 executors. Remaining excess workload: -02023-10-10 09:54:20.576+0000 [id=34] FINER hudson.slaves.NodeProvisioner#update: Provisioning strategy hudson.slaves.NodeProvisioner$StandardStrategyImpl@3981c444 declared provisioning complete 2023-10-10 09:54:20.581+0000 [id=162189] FINEST o.c.j.p.k.KubernetesCloud#connect: Building connection to Kubernetes REDACTED URL null namespace REDACTED 2023-10-10 09:54:20.581+0000 [id=162189] FINE o.c.j.p.k.KubernetesCloud#connect: Connected to Kubernetes REDACTED URL https: //172.20.0.1:443/ namespace REDACTED 2023-10-10 09:54:20.587+0000 [id=34] INFO hudson.slaves.NodeProvisioner#update: REDACTED provisioning successfully completed. We have now 4 computer(s) 2023-10-10 09:54:20.589+0000 [id=162189] FINE o.c.j.p.k.KubernetesLauncher#launch: Creating Pod: REDACTED REDACTED/REDACTED (100x) 2023-10-10 09:54:20.601+0000 [id=34] FINER hudson.slaves.NodeProvisioner#update: ran update on REDACTED in 0ms 2023-10-10 09:54:20.850+0000 [id=162189] INFO o.c.j.p.k.KubernetesLauncher#launch: Created Pod: REDACTED REDACTED/REDACTED At this time I check Kubernetes to see the pod and I can see that it's in a Pending status, but is no longer retrying as the container is in a "Waiting" state with reason of `CreateContainerConfigError` after failing to pull a secret that does not exist in the namespace. Pod is deleted after passing the 5 minute connection timeout 2023-10-10 10:00:42.387+0000 [id=167889] INFO o.c.j.p.k.KubernetesSlave#_terminate: Terminating Kubernetes instance for agent REDACTED 2023-10-10 10:00:42.402+0000 [id=167962] INFO o.internal.platform.Platform#log: --> DELETE https: //172.20.0.1/api/v1/namespaces/REDACTED/pods/REDACTED h2 2023-10-10 10:00:42.402+0000 [id=167962] INFO o.internal.platform.Platform#log: Authorization: Bearer REDACTED 2023-10-10 10:00:42.402+0000 [id=167962] INFO o.internal.platform.Platform#log: User-Agent: fabric8-kubernetes-client/6.4.1 2023-10-10 10:00:42.402+0000 [id=167962] INFO o.internal.platform.Platform#log: Content-Type: application/json; charset=utf-8 2023-10-10 10:00:42.402+0000 [id=167962] INFO o.internal.platform.Platform#log: Content-Length: 75 2023-10-10 10:00:42.403+0000 [id=167962] INFO o.internal.platform.Platform#log: Host: 172.20.0.1 2023-10-10 10:00:42.403+0000 [id=167962] INFO o.internal.platform.Platform#log: Connection: Keep-Alive 2023-10-10 10:00:42.403+0000 [id=167962] INFO o.internal.platform.Platform#log: Accept-Encoding: gzip 2023-10-10 10:00:42.403+0000 [id=167962] INFO o.internal.platform.Platform#log: 2023-10-10 10:00:42.403+0000 [id=167962] INFO o.internal.platform.Platform#log: { "apiVersion" : "v1" , "kind" : "DeleteOptions" , "propagationPolicy" : "Background" } 2023-10-10 10:00:42.403+0000 [id=167962] INFO o.internal.platform.Platform#log: --> END DELETE (75- byte body) 2023-10-10 10:00:42.457+0000 [id=167962] INFO o.internal.platform.Platform#log: <-- 200 https: //172.20.0.1/api/v1/namespaces/REDACTED/pods/REDACTED (54ms) 2023-10-10 10:00:42.457+0000 [id=167962] INFO o.internal.platform.Platform#log: audit-id: acf7caf0-7f7d-44be-951f-fbf599cbde5c 2023-10-10 10:00:42.457+0000 [id=167962] INFO o.internal.platform.Platform#log: cache-control: no-cache, private 2023-10-10 10:00:42.457+0000 [id=167962] INFO o.internal.platform.Platform#log: content-type: application/json 2023-10-10 10:00:42.457+0000 [id=167962] INFO o.internal.platform.Platform#log: x-kubernetes-pf-flowschema-uid: dec317e9-e558-46c2-bfb7-ce848aaccd93 2023-10-10 10:00:42.457+0000 [id=167962] INFO o.internal.platform.Platform#log: x-kubernetes-pf-prioritylevel-uid: d25f04bc-e0f0-446b-bff0-210a6f4bd563 2023-10-10 10:00:42.457+0000 [id=167962] INFO o.internal.platform.Platform#log: date: Tue, 10 Oct 2023 10:00:42 GMT 2023-10-10 10:00:42.458+0000 [id=167962] INFO o.internal.platform.Platform#log: <-- END HTTP (16645- byte body) 2023-10-10 10:00:42.459+0000 [id=167889] INFO o.c.j.p.k.KubernetesSlave#deleteSlavePod: Terminated Kubernetes instance for agent REDACTED/REDACTED 2023-10-10 10:00:42.459+0000 [id=167889] INFO o.c.j.p.k.KubernetesSlave#_terminate: Disconnected computer REDACTED 2023-10-10 10:00:42.461+0000 [id=165373] INFO j.s.DefaultJnlpSlaveReceiver#channelClosed: Jetty (winstone)-165373 for REDACTED terminated: java.nio.channels.ClosedChannelException The Jenkins console log does not indicate anything at this point, continuing to wait. During this time, several builds tried to run and got caught in the queue. And then the 15 min read timeout finally being hit. 2023-10-10 10:11:00.852+0000 [id=162189] FINER o.c.j.p.k.KubernetesLauncher#launch: Removing Jenkins node: REDACTED 2023-10-10 10:11:00.852+0000 [id=162189] INFO o.c.j.p.k.KubernetesSlave#_terminate: Terminating Kubernetes instance for agent REDACTED 2023-10-10 10:11:00.852+0000 [id=162189] FINEST o.c.j.p.k.KubernetesCloud#connect: Building connection to Kubernetes REDACTED URL null namespace REDACTED 2023-10-10 10:11:00.852+0000 [id=162189] FINE o.c.j.p.k.KubernetesFactoryAdapter#createClient: Autoconfiguring Kubernetes client 2023-10-10 10:11:00.852+0000 [id=162189] FINE o.c.j.p.k.KubernetesFactoryAdapter#createClient: Creating Kubernetes client: KubernetesFactoryAdapter [serviceAddress= null , namespace=REDACTED, caCertData= null , credentials= null , skipTlsVerify= false , connectTimeout=5, readTimeout=15] 2023-10-10 10:11:00.852+0000 [id=162189] FINE o.c.j.p.k.KubernetesFactoryAdapter#createClient: Proxy Settings for Cloud: false 2023-10-10 10:11:00.859+0000 [id=162189] FINE o.c.j.p.k.KubernetesClientProvider#createClient: Created new Kubernetes client: REDACTED io.fabric8.kubernetes.client.impl.KubernetesClientImpl@67588030 2023-10-10 10:11:00.859+0000 [id=162189] FINE o.c.j.p.k.KubernetesCloud#connect: Connected to Kubernetes REDACTED URL https: //172.20.0.1:443/ namespace REDACTED 2023-10-10 10:11:00.859+0000 [id=162189] SEVERE o.c.j.p.k.KubernetesSlave#_terminate: Computer for agent is null : REDACTED 2023-10-10 10:11:00.859+0000 [id=162189] INFO hudson.slaves.AbstractCloudSlave#terminate: FATAL: Computer for agent is null : REDACTED After this, the logs explode and all the queued builds get launched.

Amit added a comment - 2023-10-20 23:24 - edited

dionj siegfried

Experiencing the same problem.

Kubernetes : 1.27
Jenkins Version: 2.414.2
Kubernetes Plugin: 4054.v2da_8e2794884
Kubernetes client API: 6.8.1-224.vd388fca_4db_3b_

I managed to capture logs with the following loggers set to 'ALL'

io.fabric8.kubernetes

I was able to locate the exact stack trace which kicks in when the job is stuck in below stage

Still waiting to schedule task
All nodes of label ‘REDACTED’ are offline

Based on the below trace , looks like the dispatcher was shut down

Trying to configure client from Kubernetes config...
Oct 20, 2023 8:35:30 PM FINE io.fabric8.kubernetes.client.Config tryKubeConfig
Did not find Kubernetes config at: [/var/jenkins_home/.kube/config]. Ignoring.
Oct 20, 2023 8:35:30 PM FINE io.fabric8.kubernetes.client.Config tryServiceAccount
Trying to configure client from service account...
Oct 20, 2023 8:35:30 PM FINE io.fabric8.kubernetes.client.Config tryServiceAccount
Found service account host and port: 172.20.0.1:443
Oct 20, 2023 8:35:30 PM FINE io.fabric8.kubernetes.client.Config tryServiceAccount
Found service account ca cert at: [/var/run/secrets/kubernetes.io/serviceaccount/ca.crt}].
Oct 20, 2023 8:35:30 PM FINE io.fabric8.kubernetes.client.Config tryServiceAccount
Found service account token at: [/var/run/secrets/kubernetes.io/serviceaccount/token].
Oct 20, 2023 8:35:30 PM FINE io.fabric8.kubernetes.client.Config tryNamespaceFromPath
Trying to configure client namespace from Kubernetes service account namespace path...
Oct 20, 2023 8:35:30 PM FINE io.fabric8.kubernetes.client.Config tryNamespaceFromPath
Found service account namespace at: [/var/run/secrets/kubernetes.io/serviceaccount/namespace].
Oct 20, 2023 8:35:30 PM FINE io.fabric8.kubernetes.client.utils.HttpClientUtils getHttpClientFactory
Using httpclient io.fabric8.kubernetes.client.okhttp.OkHttpClientFactory factory
Oct 20, 2023 8:35:30 PM FINE io.fabric8.kubernetes.client.okhttp.OkHttpClientImpl close
Shutting down dispatcher okhttp3.Dispatcher@2effffe9 at the following call stack: 
	at io.fabric8.kubernetes.client.okhttp.OkHttpClientImpl.close(OkHttpClientImpl.java:255)
	at io.fabric8.kubernetes.client.impl.BaseClient.close(BaseClient.java:139)
	at org.csanchez.jenkins.plugins.kubernetes.PodTemplateUtils.parseFromYaml(PodTemplateUtils.java:627)
	at org.csanchez.jenkins.plugins.kubernetes.PodTemplateUtils.validateYamlContainerNames(PodTemplateUtils.java:683)
	at org.csanchez.jenkins.plugins.kubernetes.PodTemplateUtils.validateYamlContainerNames(PodTemplateUtils.java:673)
	at org.csanchez.jenkins.plugins.kubernetes.pipeline.PodTemplateStepExecution.start(PodTemplateStepExecution.java:145)
	at org.jenkinsci.plugins.workflow.cps.DSL.invokeStep(DSL.java:323)
	at org.jenkinsci.plugins.workflow.cps.DSL.invokeMethod(DSL.java:196)
	at org.jenkinsci.plugins.workflow.cps.CpsScript.invokeMethod(CpsScript.java:124)
	at jdk.internal.reflect.GeneratedMethodAccessor1666.invoke(Unknown Source)
	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.base/java.lang.reflect.Method.invoke(Method.java:566)
	at org.codehaus.groovy.reflection.CachedMethod.invoke(CachedMethod.java:98)
	at groovy.lang.MetaMethod.doMethodInvoke(MetaMethod.java:325)
	at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1225)
	at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1034)
	at org.codehaus.groovy.runtime.callsite.PogoMetaClassSite.call(PogoMetaClassSite.java:41)
	at org.codehaus.groovy.runtime.callsite.CallSiteArray.defaultCall(CallSiteArray.java:47)
	at org.codehaus.groovy.runtime.callsite.AbstractCallSite.call(AbstractCallSite.java:116)
	at org.kohsuke.groovy.sandbox.impl.Checker$1.call(Checker.java:180)
	at org.kohsuke.groovy.sandbox.GroovyInterceptor.onMethodCall(GroovyInterceptor.java:23)
	at org.jenkinsci.plugins.scriptsecurity.sandbox.groovy.SandboxInterceptor.onMethodCall(SandboxInterceptor.java:163)
	at org.jenkinsci.plugins.scriptsecurity.sandbox.groovy.SandboxInterceptor.onMethodCall(SandboxInterceptor.java:148)
	at org.kohsuke.groovy.sandbox.impl.Checker$1.call(Checker.java:178)
	at org.kohsuke.groovy.sandbox.impl.Checker.checkedCall(Checker.java:182)
	at com.cloudbees.groovy.cps.sandbox.SandboxInvoker.methodCall(SandboxInvoker.java:17)
	at org.jenkinsci.plugins.workflow.cps.LoggingInvoker.methodCall(LoggingInvoker.java:105)
	at com.cloudbees.groovy.cps.impl.ContinuationGroup.methodCall(ContinuationGroup.java:90)
	at com.cloudbees.groovy.cps.impl.FunctionCallBlock$ContinuationImpl.dispatchOrArg(FunctionCallBlock.java:116)
	at com.cloudbees.groovy.cps.impl.FunctionCallBlock$ContinuationImpl.fixArg(FunctionCallBlock.java:85)
	at jdk.internal.reflect.GeneratedMethodAccessor165.invoke(Unknown Source)
	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.base/java.lang.reflect.Method.invoke(Method.java:566)
	at com.cloudbees.groovy.cps.impl.ContinuationPtr$ContinuationImpl.receive(ContinuationPtr.java:72)
	at com.cloudbees.groovy.cps.impl.ClosureBlock.eval(ClosureBlock.java:46)
	at com.cloudbees.groovy.cps.Next.step(Next.java:83)
	at com.cloudbees.groovy.cps.Continuable$1.call(Continuable.java:152)
	at com.cloudbees.groovy.cps.Continuable$1.call(Continuable.java:146)
	at org.codehaus.groovy.runtime.GroovyCategorySupport$ThreadCategoryInfo.use(GroovyCategorySupport.java:136)
	at org.codehaus.groovy.runtime.GroovyCategorySupport.use(GroovyCategorySupport.java:275)
	at com.cloudbees.groovy.cps.Continuable.run0(Continuable.java:146)
	at org.jenkinsci.plugins.workflow.cps.SandboxContinuable.access$001(SandboxContinuable.java:18)
	at org.jenkinsci.plugins.workflow.cps.SandboxContinuable.run0(SandboxContinuable.java:51)
	at org.jenkinsci.plugins.workflow.cps.CpsThread.runNextChunk(CpsThread.java:187)
	at org.jenkinsci.plugins.workflow.cps.CpsThreadGroup.run(CpsThreadGroup.java:423)
	at org.jenkinsci.plugins.workflow.cps.CpsThreadGroup$2.call(CpsThreadGroup.java:331)
	at org.jenkinsci.plugins.workflow.cps.CpsThreadGroup$2.call(CpsThreadGroup.java:295)
	at org.jenkinsci.plugins.workflow.cps.CpsVmExecutorService$2.call(CpsVmExecutorService.java:97)
	at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
	at hudson.remoting.SingleLaneExecutorService$1.run(SingleLaneExecutorService.java:139)
	at jenkins.util.ContextResettingExecutorService$1.run(ContextResettingExecutorService.java:28)
	at jenkins.security.ImpersonatingExecutorService$1.run(ImpersonatingExecutorService.java:68)
	at jenkins.util.ErrorLoggingExecutorService.lambda$wrap$0(ErrorLoggingExecutorService.java:51)
	at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
	at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
	at java.base/java.lang.Thread.run(Thread.java:829)

The pods are provisioned only after 15 minutes, and I see the below statement for resync

Oct 20, 2023 9:01:07 PM FINE io.fabric8.kubernetes.client.informers.impl.DefaultSharedIndexInformer start
Ready to run resync and reflector for v1/namespaces/tooling-jenkins/pods with resync 0
Oct 20, 2023 9:01:07 PM FINE io.fabric8.kubernetes.client.informers.impl.DefaultSharedIndexInformer scheduleResync
Resync skipped due to 0 full resync period for v1/namespaces/tooling-jenkins/pods
Oct 20, 2023 9:01:07 PM FINEST io.fabric8.kubernetes.client.http.HttpLoggingInterceptor$HttpLogger logStart

Amit added a comment - 2023-10-20 23:24 - edited dionj siegfried Experiencing the same problem. Kubernetes : 1.27 Jenkins Version: 2.414.2 Kubernetes Plugin: 4054.v2da_8e2794884 Kubernetes client API: 6.8.1-224.vd388fca_4db_3b_ I managed to capture logs with the following loggers set to 'ALL' io.fabric8.kubernetes I was able to locate the exact stack trace which kicks in when the job is stuck in below stage Still waiting to schedule task All nodes of label ‘REDACTED’ are offline Based on the below trace , looks like the dispatcher was shut down Trying to configure client from Kubernetes config... Oct 20, 2023 8:35:30 PM FINE io.fabric8.kubernetes.client.Config tryKubeConfig Did not find Kubernetes config at: [/ var /jenkins_home/.kube/config]. Ignoring. Oct 20, 2023 8:35:30 PM FINE io.fabric8.kubernetes.client.Config tryServiceAccount Trying to configure client from service account... Oct 20, 2023 8:35:30 PM FINE io.fabric8.kubernetes.client.Config tryServiceAccount Found service account host and port: 172.20.0.1:443 Oct 20, 2023 8:35:30 PM FINE io.fabric8.kubernetes.client.Config tryServiceAccount Found service account ca cert at: [/ var /run/secrets/kubernetes.io/serviceaccount/ca.crt}]. Oct 20, 2023 8:35:30 PM FINE io.fabric8.kubernetes.client.Config tryServiceAccount Found service account token at: [/ var /run/secrets/kubernetes.io/serviceaccount/token]. Oct 20, 2023 8:35:30 PM FINE io.fabric8.kubernetes.client.Config tryNamespaceFromPath Trying to configure client namespace from Kubernetes service account namespace path... Oct 20, 2023 8:35:30 PM FINE io.fabric8.kubernetes.client.Config tryNamespaceFromPath Found service account namespace at: [/ var /run/secrets/kubernetes.io/serviceaccount/namespace]. Oct 20, 2023 8:35:30 PM FINE io.fabric8.kubernetes.client.utils.HttpClientUtils getHttpClientFactory Using httpclient io.fabric8.kubernetes.client.okhttp.OkHttpClientFactory factory Oct 20, 2023 8:35:30 PM FINE io.fabric8.kubernetes.client.okhttp.OkHttpClientImpl close Shutting down dispatcher okhttp3.Dispatcher@2effffe9 at the following call stack: at io.fabric8.kubernetes.client.okhttp.OkHttpClientImpl.close(OkHttpClientImpl.java:255) at io.fabric8.kubernetes.client.impl.BaseClient.close(BaseClient.java:139) at org.csanchez.jenkins.plugins.kubernetes.PodTemplateUtils.parseFromYaml(PodTemplateUtils.java:627) at org.csanchez.jenkins.plugins.kubernetes.PodTemplateUtils.validateYamlContainerNames(PodTemplateUtils.java:683) at org.csanchez.jenkins.plugins.kubernetes.PodTemplateUtils.validateYamlContainerNames(PodTemplateUtils.java:673) at org.csanchez.jenkins.plugins.kubernetes.pipeline.PodTemplateStepExecution.start(PodTemplateStepExecution.java:145) at org.jenkinsci.plugins.workflow.cps.DSL.invokeStep(DSL.java:323) at org.jenkinsci.plugins.workflow.cps.DSL.invokeMethod(DSL.java:196) at org.jenkinsci.plugins.workflow.cps.CpsScript.invokeMethod(CpsScript.java:124) at jdk.internal.reflect.GeneratedMethodAccessor1666.invoke(Unknown Source) at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.base/java.lang.reflect.Method.invoke(Method.java:566) at org.codehaus.groovy.reflection.CachedMethod.invoke(CachedMethod.java:98) at groovy.lang.MetaMethod.doMethodInvoke(MetaMethod.java:325) at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1225) at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1034) at org.codehaus.groovy.runtime.callsite.PogoMetaClassSite.call(PogoMetaClassSite.java:41) at org.codehaus.groovy.runtime.callsite.CallSiteArray.defaultCall(CallSiteArray.java:47) at org.codehaus.groovy.runtime.callsite.AbstractCallSite.call(AbstractCallSite.java:116) at org.kohsuke.groovy.sandbox.impl.Checker$1.call(Checker.java:180) at org.kohsuke.groovy.sandbox.GroovyInterceptor.onMethodCall(GroovyInterceptor.java:23) at org.jenkinsci.plugins.scriptsecurity.sandbox.groovy.SandboxInterceptor.onMethodCall(SandboxInterceptor.java:163) at org.jenkinsci.plugins.scriptsecurity.sandbox.groovy.SandboxInterceptor.onMethodCall(SandboxInterceptor.java:148) at org.kohsuke.groovy.sandbox.impl.Checker$1.call(Checker.java:178) at org.kohsuke.groovy.sandbox.impl.Checker.checkedCall(Checker.java:182) at com.cloudbees.groovy.cps.sandbox.SandboxInvoker.methodCall(SandboxInvoker.java:17) at org.jenkinsci.plugins.workflow.cps.LoggingInvoker.methodCall(LoggingInvoker.java:105) at com.cloudbees.groovy.cps.impl.ContinuationGroup.methodCall(ContinuationGroup.java:90) at com.cloudbees.groovy.cps.impl.FunctionCallBlock$ContinuationImpl.dispatchOrArg(FunctionCallBlock.java:116) at com.cloudbees.groovy.cps.impl.FunctionCallBlock$ContinuationImpl.fixArg(FunctionCallBlock.java:85) at jdk.internal.reflect.GeneratedMethodAccessor165.invoke(Unknown Source) at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.base/java.lang.reflect.Method.invoke(Method.java:566) at com.cloudbees.groovy.cps.impl.ContinuationPtr$ContinuationImpl.receive(ContinuationPtr.java:72) at com.cloudbees.groovy.cps.impl.ClosureBlock.eval(ClosureBlock.java:46) at com.cloudbees.groovy.cps.Next.step(Next.java:83) at com.cloudbees.groovy.cps.Continuable$1.call(Continuable.java:152) at com.cloudbees.groovy.cps.Continuable$1.call(Continuable.java:146) at org.codehaus.groovy.runtime.GroovyCategorySupport$ThreadCategoryInfo.use(GroovyCategorySupport.java:136) at org.codehaus.groovy.runtime.GroovyCategorySupport.use(GroovyCategorySupport.java:275) at com.cloudbees.groovy.cps.Continuable.run0(Continuable.java:146) at org.jenkinsci.plugins.workflow.cps.SandboxContinuable.access$001(SandboxContinuable.java:18) at org.jenkinsci.plugins.workflow.cps.SandboxContinuable.run0(SandboxContinuable.java:51) at org.jenkinsci.plugins.workflow.cps.CpsThread.runNextChunk(CpsThread.java:187) at org.jenkinsci.plugins.workflow.cps.CpsThreadGroup.run(CpsThreadGroup.java:423) at org.jenkinsci.plugins.workflow.cps.CpsThreadGroup$2.call(CpsThreadGroup.java:331) at org.jenkinsci.plugins.workflow.cps.CpsThreadGroup$2.call(CpsThreadGroup.java:295) at org.jenkinsci.plugins.workflow.cps.CpsVmExecutorService$2.call(CpsVmExecutorService.java:97) at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264) at hudson.remoting.SingleLaneExecutorService$1.run(SingleLaneExecutorService.java:139) at jenkins.util.ContextResettingExecutorService$1.run(ContextResettingExecutorService.java:28) at jenkins.security.ImpersonatingExecutorService$1.run(ImpersonatingExecutorService.java:68) at jenkins.util.ErrorLoggingExecutorService.lambda$wrap$0(ErrorLoggingExecutorService.java:51) at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264) at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) at java.base/java.lang. Thread .run( Thread .java:829) The pods are provisioned only after 15 minutes, and I see the below statement for resync Oct 20, 2023 9:01:07 PM FINE io.fabric8.kubernetes.client.informers.impl.DefaultSharedIndexInformer start Ready to run resync and reflector for v1/namespaces/tooling-jenkins/pods with resync 0 Oct 20, 2023 9:01:07 PM FINE io.fabric8.kubernetes.client.informers.impl.DefaultSharedIndexInformer scheduleResync Resync skipped due to 0 full resync period for v1/namespaces/tooling-jenkins/pods Oct 20, 2023 9:01:07 PM FINEST io.fabric8.kubernetes.client.http.HttpLoggingInterceptor$HttpLogger logStart

Yael added a comment - 2023-11-20 15:10

Also experiencing this issue.

GKE version 1.27.3-gke.100
Jenkins version 2.4111
Kubernetes plugin 4054.v2da_8e2794884

Kubernetes Client API plugin 6.8.1-224.vd388fca_4db_3b_

Occasionally, Jenkins will timeout after attempting to create the pod and the console logs looks like:

 Still waiting to schedule task
All nodes of label ‘X’ are offline

Or:

 Still waiting to schedule task
 `Jenkins` does not have label `X`

When X is the label name.

After some time, Jenkins eventually time out or we have to abort.

Yael added a comment - 2023-11-20 15:10 Also experiencing this issue. GKE version 1.27.3-gke.100 Jenkins version 2.4111 Kubernetes plugin 4054.v2da_8e2794884 Kubernetes Client API plugin 6.8.1-224.vd388fca_4db_3b_ Occasionally, Jenkins will timeout after attempting to create the pod and the console logs looks like: Still waiting to schedule task All nodes of label ‘X’ are offline Or: Still waiting to schedule task `Jenkins` does not have label `X` When X is the label name. After some time, Jenkins eventually time out or we have to abort.

B added a comment - 2023-11-23 21:02

We also saw this issue when trying to update from version 3937.vd7b_82db_e347b_ to kubernetes:4054.v2da_8e2794884 so I think the issue was introduced somewhere in between.

Seeing similar issues to dionj where pods just are not created in the k8s API for up to 15-20 minutes in some cases. It seems to be somewhat sporadic but might be when Jenkins is under higher load?

B added a comment - 2023-11-23 21:02 We also saw this issue when trying to update from version 3937.vd7b_82db_e347b_ to kubernetes:4054.v2da_8e2794884 so I think the issue was introduced somewhere in between. Seeing similar issues to dionj where pods just are not created in the k8s API for up to 15-20 minutes in some cases. It seems to be somewhat sporadic but might be when Jenkins is under higher load?

Robyn added a comment - 2023-12-20 12:29

We are also experiencing this issue. In our case a majority of our pods would not even get created. We found that pods that used yamlMergeStrategy were the pods that hit the issue most of the time.
We had upgrade to LTS 2.414.3 with the following plugins:

kubernetes:4054.v2da_8e2794884
kubernetes-client-api:6.8.1-224.vd388fca_4db_3b_
kubernetes-credentials:0.11
snakeyaml-api:2.2-111.vc6598e30cc65

The only way we were able to get back into a working state was downgrade the plugins to the following versions:
kubernetes:4007.v633279962016
kubernetes-client-api:6.4.1-215.v2ed17097a_8e9

kubernetes-credentials:0.10.0
snakeyaml-api:1.33-95.va_b_a_e3e47b_fa_4

Robyn added a comment - 2023-12-20 12:29 We are also experiencing this issue. In our case a majority of our pods would not even get created. We found that pods that used yamlMergeStrategy were the pods that hit the issue most of the time. We had upgrade to LTS 2.414.3 with the following plugins: kubernetes:4054.v2da_8e2794884 kubernetes-client-api:6.8.1-224.vd388fca_4db_3b_ kubernetes-credentials:0.11 snakeyaml-api:2.2-111.vc6598e30cc65 The only way we were able to get back into a working state was downgrade the plugins to the following versions: kubernetes:4007.v633279962016 kubernetes-client-api:6.4.1-215.v2ed17097a_8e9 kubernetes-credentials:0.10.0 snakeyaml-api:1.33-95.va_b_a_e3e47b_fa_4

Robyn added a comment - 2024-06-05 19:47 - edited

I was wondering if there is any updates here. This is preventing us from update a bunch of plugins we use, and that need to be upgraded due to security issues as well as other issues.

Robyn added a comment - 2024-06-05 19:47 - edited I was wondering if there is any updates here. This is preventing us from update a bunch of plugins we use, and that need to be upgraded due to security issues as well as other issues.

Sigi Kiermayer added a comment - 2024-06-06 12:30

At least from our side, while we have seen this issue very sporadically, we don't have this issue in any way blocking us.

Sigi Kiermayer added a comment - 2024-06-06 12:30 At least from our side, while we have seen this issue very sporadically, we don't have this issue in any way blocking us.

Ofir added a comment - 2024-06-06 13:42

I agree with rsndv this issue is blocking us from upgrading Jenkins (2.387.2) and its plugins.

From my side, the issue occurred when Jenkins was loaded (running 100-300 concurrent jobs) and consumed lots of memory.

Ofir added a comment - 2024-06-06 13:42 I agree with rsndv this issue is blocking us from upgrading Jenkins ( 2.387.2 ) and its plugins. From my side, the issue occurred when Jenkins was loaded (running 100-300 concurrent jobs) and consumed lots of memory.

Ofir added a comment - 2024-06-18 09:12

Hi guys,

Following some tests and coredumps we made when we faced the issue we noticed that Jenkins tried to execute jobs on offline/non-existent pods (JNLP agent) which we suspect is the root cause of this issue.

Following this Beta release https://plugins.jenkins.io/kubernetes/#plugin-content-garbage-collection-beta a new GB mechanism was implemented to clean "left behind" agents and may resolve this issue.

Did someone had a chance to test it?

Ofir added a comment - 2024-06-18 09:12 Hi guys, Following some tests and coredumps we made when we faced the issue we noticed that Jenkins tried to execute jobs on offline/non-existent pods (JNLP agent) which we suspect is the root cause of this issue. Following this Beta release https://plugins.jenkins.io/kubernetes/#plugin-content-garbage-collection-beta a new GB mechanism was implemented to clean "left behind" agents and may resolve this issue. Did someone had a chance to test it?

Jenkins

Details

Description

Attachments

Attachments

Issue Links

Activity

Collapse comment: Ritchelle added a comment - 2023-09-20 20:36

Expand comment: Ritchelle added a comment - 2023-09-20 20:36

Collapse comment: Dion added a comment - 2023-10-05 17:13, Edited by Dion - 2023-10-05 18:27

Expand comment: Dion added a comment - 2023-10-05 17:13, Edited by Dion - 2023-10-05 18:27

Collapse comment: Sigi Kiermayer added a comment - 2023-10-09 09:47

Expand comment: Sigi Kiermayer added a comment - 2023-10-09 09:47

Collapse comment: Dion added a comment - 2023-10-09 15:15, Edited by Dion - 2023-10-09 15:16

Expand comment: Dion added a comment - 2023-10-09 15:15, Edited by Dion - 2023-10-09 15:16

Collapse comment: Sigi Kiermayer added a comment - 2023-10-09 15:41

Expand comment: Sigi Kiermayer added a comment - 2023-10-09 15:41

Collapse comment: Dion added a comment - 2023-10-10 11:36, Edited by Dion - 2023-10-10 14:07

Expand comment: Dion added a comment - 2023-10-10 11:36, Edited by Dion - 2023-10-10 14:07

Collapse comment: Dion added a comment - 2023-10-10 15:05

Expand comment: Dion added a comment - 2023-10-10 15:05

Collapse comment: Sigi Kiermayer added a comment - 2023-10-11 08:36

Expand comment: Sigi Kiermayer added a comment - 2023-10-11 08:36

Collapse comment: Dion added a comment - 2023-10-11 18:56, Edited by Dion - 2023-10-12 11:58

Expand comment: Dion added a comment - 2023-10-11 18:56, Edited by Dion - 2023-10-12 11:58

Collapse comment: Amit added a comment - 2023-10-20 23:24, Edited by Amit - 2023-10-20 23:25

Expand comment: Amit added a comment - 2023-10-20 23:24, Edited by Amit - 2023-10-20 23:25

Collapse comment: Yael added a comment - 2023-11-20 15:10

Expand comment: Yael added a comment - 2023-11-20 15:10

Collapse comment: B added a comment - 2023-11-23 21:02

Expand comment: B added a comment - 2023-11-23 21:02

Collapse comment: Robyn added a comment - 2023-12-20 12:29

Expand comment: Robyn added a comment - 2023-12-20 12:29

Collapse comment: Robyn added a comment - 2024-06-05 19:47, Edited by Robyn - 2024-06-05 19:48

Expand comment: Robyn added a comment - 2024-06-05 19:47, Edited by Robyn - 2024-06-05 19:48

Collapse comment: Sigi Kiermayer added a comment - 2024-06-06 12:30

Expand comment: Sigi Kiermayer added a comment - 2024-06-06 12:30

Collapse comment: Ofir added a comment - 2024-06-06 13:42

Expand comment: Ofir added a comment - 2024-06-06 13:42

Collapse comment: Ofir added a comment - 2024-06-18 09:12

Expand comment: Ofir added a comment - 2024-06-18 09:12

People

Dates