Type: Bug
Resolution: Unresolved
Priority: Major
Component/s: kubernetes-plugin
Labels: None
Environment: Jenkins 2.504.2 LTS, Kubernetes Plugin 4350.va_0283de0d6d6

Before restarting the Jenkins controller, I had some builds running.

I watched one specific build before and after the restart and noticed that it never resumed: after the restart, another build that had been waiting in the queue picked up the original build's pod agent.

The console log of the original build shows:

# WHEN BUILD STARTED
16:18:16  Still waiting to schedule task
16:18:16  Waiting for next available executor on ‘medium’
16:19:35  Agent cluster003-medium-f1jjw is provisioned from template cluster003-medium
16:19:35  Running on cluster003-medium-f1jjw in /home/jenkins/agent/workspace/mybuild

# AFTER RESTART
16:44:27  Resuming build at Fri May 30 16:44:27 EDT 2025 after Jenkins restart
16:44:28  Waiting for reconnection of cluster003-medium-f1jjw before proceeding with build

I then inspected the pod logs and saw that the agent had correctly reconnected to Jenkins after the restart:

cluster003-medium-f1jjw jnlp May 30, 2025 8:42:55 PM hudson.remoting.Launcher$CuiListener status
cluster003-medium-f1jjw jnlp INFO: https://myjenkins.com/jenkins/login is not ready: 503
cluster003-medium-f1jjw jnlp May 30, 2025 8:42:55 PM hudson.remoting.Launcher$CuiListener status
cluster003-medium-f1jjw jnlp INFO: Waiting 10 seconds before retry
cluster003-medium-f1jjw jnlp May 30, 2025 8:43:07 PM hudson.remoting.Launcher$CuiListener status
cluster003-medium-f1jjw jnlp INFO: WebSocket connection open
cluster003-medium-f1jjw jnlp May 30, 2025 8:43:07 PM hudson.remoting.Launcher$CuiListener status
cluster003-medium-f1jjw jnlp INFO: Connected

The agent was indeed listed as connected among the Jenkins computers. However, to my surprise, another build had picked it up:

16:43:43  [Pipeline] Start of Pipeline
16:43:45  [Pipeline] node
16:43:45  Running on cluster003-medium-f1jjw in /home/jenkins/agent/workspace/anotherbuild

That is why the original build could not recover after the restart.

I checked my controller logs and found these related entries:

2025-05-30 20:42:27.655+0000 [id=594]   INFO    o.c.j.p.k.pod.retention.Reaper#watchCloud: set up watcher on cluster003
2025-05-30 20:42:27.655+0000 [id=594]   INFO    o.c.j.p.k.KubernetesLauncher#launch: Agent has already been launched, activating: cluster003-medium-f1jjw
2025-05-30 20:42:27.656+0000 [id=586]   INFO    o.c.j.p.k.p.r.Reaper$CloudPodWatcher#stop: Stopping watch for kubernetes cloud cluster003

There is nothing else relevant in the controller logs.
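To line the three logs up on a single clock, here is a minimal Python sketch. It assumes the build console timestamps are EDT (UTC-4), as the resume message indicates, and that the pod log timestamps are UTC; the latter is an assumption based on them matching the controller's +0000 log times. The helper `at` and the event labels are illustrative only.

```python
from datetime import datetime, timedelta, timezone

EDT = timezone(timedelta(hours=-4), "EDT")  # build console clock (per the resume message)
UTC = timezone.utc                          # controller log clock; assumed for pod logs too

DAY = (2025, 5, 30)  # EDT date; the afternoon events fall on the same UTC date


def at(hms, tz):
    """Build a datetime on DAY from 'HH:MM:SS' in tz and normalise it to UTC."""
    h, m, s = (int(part) for part in hms.split(":"))
    return datetime(*DAY, h, m, s, tzinfo=tz).astimezone(UTC)


events = [
    (at("16:44:27", EDT), "mybuild: Resuming build after Jenkins restart"),
    (at("16:44:28", EDT), "mybuild: Waiting for reconnection of cluster003-medium-f1jjw"),
    (at("20:43:07", UTC), "agent pod: WebSocket connection open, Connected"),
    (at("16:43:43", EDT), "anotherbuild: Start of Pipeline"),
    (at("16:43:45", EDT), "anotherbuild: Running on cluster003-medium-f1jjw"),
    (at("20:42:27", UTC), "controller: KubernetesLauncher activating cluster003-medium-f1jjw"),
]

# Sort by the normalised UTC timestamp to get one ordered timeline.
timeline = sorted(events)
for when, what in timeline:
    print(when.strftime("%H:%M:%S UTC"), what)
```

Under these assumptions, anotherbuild starts running on the agent at 20:43:45 UTC, 38 seconds after the agent reconnects and 42 seconds before mybuild resumes and starts waiting for that same agent.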

Also, I think it is worth mentioning my Kubernetes Cloud configuration:

jenkins: 
  clouds: 
    - kubernetes: 
        name: cluster003
        credentialsId: cluster003-kubeconfig
        namespace: jenkins-agents
        containerCap: 175
        retentionTimeout: 15
        webSocket: true
        templates: 
          - name: cluster003-medium
            id: 52ad9e1d-418e-4d57-b13a-075404c50164
            label: cluster003-medium medium
            showRawYaml: false
            workspaceVolume: 
              genericEphemeralVolume: 
                accessModes: ReadWriteOnce
                requestsSize: 48Gi
                storageClassName: local-path
            yaml: |
              apiVersion: v1
              kind: Pod
              spec: 
                hostNetwork: false
                automountServiceAccountToken: false
                enableServiceLinks: false
                terminationGracePeriodSeconds: 30
                dnsPolicy: Default
                restartPolicy: Never
                containers: 
                  - name: jnlp
                    image: ghcr.io/felipecrs/jenkins-agent-dind:2
                    imagePullPolicy: Always
                    resources: 
                      limits: 
                        cpu: "3.5"
                        memory: 14G
                        ephemeral-storage: 8Gi
                      requests: 
                        cpu: "3.5"
                        memory: 14G
                        ephemeral-storage: 8Gi
                    securityContext: 
                      privileged: true
                    workingDir: /home/jenkins/agent
                    terminationMessagePolicy: FallbackToLogsOnError

For reference, my Jenkinsfiles look like this:

pipeline {
  agent {
    label 'medium'
  }
  // stages omitted
}

I have many concurrent builds requesting the same label "medium".

This issue affects me badly: it leaves me unable to restart Jenkins safely for maintenance.

Any help would be deeply appreciated. Also, please let me know if there's anything else I can do to help get this fixed.

Assignee: Unassigned
Reporter: Felipe Santos
Votes: 0
Watchers: 3
Created: 2025-05-30 21:41
Updated: 2025-06-19 00:04
