Jenkins / JENKINS-75731

After restarting Jenkins controller, a different build picks up the pod agent


    • Type: Bug
    • Resolution: Unresolved
    • Priority: Major
    • kubernetes-plugin
    • None
    • Jenkins 2.504.2 LTS, Kubernetes Plugin 4350.va_0283de0d6d6

      Before restarting the Jenkins controller, I had some builds running.

      I watched one specific build before and after the restart and noticed that it never resumed: after the restart, another build that had been waiting in the queue picked up the original build's pod agent.

      The original build's console log shows:

      # WHEN BUILD STARTED
      16:18:16  Still waiting to schedule task
      16:18:16  Waiting for next available executor on ‘medium’
      16:19:35  Agent cluster003-medium-f1jjw is provisioned from template cluster003-medium
      16:19:35  Running on cluster003-medium-f1jjw in /home/jenkins/agent/workspace/mybuild
      
      # AFTER RESTART
      16:44:27  Resuming build at Fri May 30 16:44:27 EDT 2025 after Jenkins restart
      16:44:28  Waiting for reconnection of cluster003-medium-f1jjw before proceeding with build
      

      I then inspected the pod logs and saw that the agent had correctly reconnected to Jenkins after the restart:

      cluster003-medium-f1jjw jnlp May 30, 2025 8:42:55 PM hudson.remoting.Launcher$CuiListener status
      cluster003-medium-f1jjw jnlp INFO: https://myjenkins.com/jenkins/login is not ready: 503
      cluster003-medium-f1jjw jnlp May 30, 2025 8:42:55 PM hudson.remoting.Launcher$CuiListener status
      cluster003-medium-f1jjw jnlp INFO: Waiting 10 seconds before retry
      cluster003-medium-f1jjw jnlp May 30, 2025 8:43:07 PM hudson.remoting.Launcher$CuiListener status
      cluster003-medium-f1jjw jnlp INFO: WebSocket connection open
      cluster003-medium-f1jjw jnlp May 30, 2025 8:43:07 PM hudson.remoting.Launcher$CuiListener status
      cluster003-medium-f1jjw jnlp INFO: Connected
      

      The agent was indeed listed as connected among the Jenkins computers. However, to my surprise, another build had picked it up:

      16:43:43  [Pipeline] Start of Pipeline
      16:43:45  [Pipeline] node
      16:43:45  Running on cluster003-medium-f1jjw in /home/jenkins/agent/workspace/anotherbuild
      

      And that is why the original build was not able to recover after the restart.

      I checked my controller logs (timestamps in UTC) and found only these related entries:

      2025-05-30 20:42:27.655+0000 [id=594]   INFO    o.c.j.p.k.pod.retention.Reaper#watchCloud: set up watcher on cluster003
      2025-05-30 20:42:27.655+0000 [id=594]   INFO    o.c.j.p.k.KubernetesLauncher#launch: Agent has already been launched, activating: cluster003-medium-f1jjw
      2025-05-30 20:42:27.656+0000 [id=586]   INFO    o.c.j.p.k.p.r.Reaper$CloudPodWatcher#stop: Stopping watch for kubernetes cloud cluster003
      

      There is nothing else relevant in the controller logs.
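The three log excerpts above use different clocks, so the order of events is easier to see once they are normalized. The following is only a rough sketch for lining them up: it assumes the build console times are EDT (UTC-4), as the "Resuming build" line states, and that the pod log times are UTC like the controller log (the pod log itself carries no zone marker). The helper name `at` and the event labels are made up for illustration.

```python
from datetime import datetime, timedelta, timezone

EDT = timezone(timedelta(hours=-4))  # build console logs are stamped in EDT
DAY = "2025-05-30"

def at(hms: str, tz: timezone) -> datetime:
    """Parse HH:MM:SS on the day of the restart and normalize it to UTC."""
    local = datetime.strptime(f"{DAY} {hms}", "%Y-%m-%d %H:%M:%S")
    return local.replace(tzinfo=tz).astimezone(timezone.utc)

events = [
    # Controller log (explicit +0000 offset).
    (at("20:42:27", timezone.utc), "controller: KubernetesLauncher activates cluster003-medium-f1jjw"),
    # Pod log (assumed UTC, 8:43:07 PM).
    (at("20:43:07", timezone.utc), "agent: WebSocket connection open, Connected"),
    # Build console logs (EDT).
    (at("16:43:45", EDT), "anotherbuild: Running on cluster003-medium-f1jjw"),
    (at("16:44:27", EDT), "mybuild: Resuming build after Jenkins restart"),
    (at("16:44:28", EDT), "mybuild: Waiting for reconnection of cluster003-medium-f1jjw"),
]

timeline = sorted(events, key=lambda event: event[0])
for when, what in timeline:
    print(when.strftime("%H:%M:%S UTC"), what)

# How long the other build had held the agent before the original build resumed.
taken = at("16:43:45", EDT)
resumed = at("16:44:27", EDT)
head_start = resumed - taken
print("anotherbuild held the agent for", int(head_start.total_seconds()), "seconds before mybuild resumed")
```

Under those assumptions, the agent reconnected at 20:43:07 UTC, the queued build started on it at 20:43:45 UTC, and the original build only resumed at 20:44:27 UTC, 42 seconds after the agent had already been taken.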

      Also, I think it is worth mentioning my Kubernetes Cloud configuration:

      jenkins: 
        clouds: 
          - kubernetes: 
              name: cluster003
              credentialsId: cluster003-kubeconfig
              namespace: jenkins-agents
              containerCap: 175
              retentionTimeout: 15
              webSocket: true
              templates: 
                - name: cluster003-medium
                  id: 52ad9e1d-418e-4d57-b13a-075404c50164
                  label: cluster003-medium medium
                  showRawYaml: false
                  workspaceVolume: 
                    genericEphemeralVolume: 
                      accessModes: ReadWriteOnce
                      requestsSize: 48Gi
                      storageClassName: local-path
                  yaml: |
                    apiVersion: v1
                    kind: Pod
                    spec: 
                      hostNetwork: false
                      automountServiceAccountToken: false
                      enableServiceLinks: false
                      terminationGracePeriodSeconds: 30
                      dnsPolicy: Default
                      restartPolicy: Never
                      containers: 
                        - name: jnlp
                          image: ghcr.io/felipecrs/jenkins-agent-dind:2
                          imagePullPolicy: Always
                          resources: 
                            limits: 
                              cpu: "3.5"
                              memory: 14G
                              ephemeral-storage: 8Gi
                            requests: 
                              cpu: "3.5"
                              memory: 14G
                              ephemeral-storage: 8Gi
                          securityContext: 
                            privileged: true
                          workingDir: /home/jenkins/agent
                          terminationMessagePolicy: FallbackToLogsOnError
      

      For reference, my Jenkinsfiles request the agent like this:

      pipeline {
        agent {
          label 'medium'
        }
        // stages omitted for brevity
      }
      

      I have many concurrent builds requesting the same label "medium".

      This issue affects me badly: it leaves me unable to restart Jenkins safely for maintenance.

      Any help would be deeply appreciated. Also, please let me know if there's anything else I can do to help get this fixed.

            Assignee: Unassigned
            Reporter: Felipe Santos (felipecassiors)