  1. Jenkins
  2. JENKINS-65957

Kubernetes plugin creates backlog of workloads in GKE when failed scheduling and job aborted


    Details

    • Type: Bug
    • Status: Open
    • Priority: Critical
    • Resolution: Unresolved
    • Component/s: kubernetes-plugin
    • Labels:
      None
    • Environment:
      jenkins 1.277.1
      kubernetes plugin 1.30

      Description

      I started a job that requested more CPU than was available in the GKE cluster. The Jenkins console log for the job showed that the plugin kept retrying, and it created a backlog of 500+ workloads in GKE. When I added more nodes/CPU a few days later, the cluster started churning through that backlog, even though the job had been aborted 3 days earlier. By the time a pod comes up successfully it is too late, because Jenkins no longer cares about it. I'm not sure how long it would take to work through the full backlog.

       

      In the Jenkins console log you'll see something like:

      Created Pod: kubernetes ci-jenkins-server/jenkinsfile-k8s-13-cmm5l-qnfbz-gphrk
      [Warning][ci-jenkins-server/jenkinsfile-k8s-13-cmm5l-qnfbz-gphrk][FailedScheduling] 0/2 nodes are available: 1 Insufficient cpu, 1 node(s) didn't match Pod's node affinity.
      [Warning][ci-jenkins-server/jenkinsfile-k8s-13-cmm5l-qnfbz-gphrk][FailedScheduling] 0/2 nodes are available: 1 Insufficient cpu, 1 node(s) didn't match Pod's node affinity.
      Created Pod: kubernetes ci-jenkins-server/jenkinsfile-k8s-13-cmm5l-qnfbz-tcw8d
      [Warning][ci-jenkins-server/jenkinsfile-k8s-13-cmm5l-qnfbz-tcw8d][FailedScheduling] 0/2 nodes are available: 1 Insufficient cpu, 1 node(s) didn't match Pod's node affinity.
      [Warning][ci-jenkins-server/jenkinsfile-k8s-13-cmm5l-qnfbz-tcw8d][FailedScheduling] 0/2 nodes are available: 1 Insufficient cpu, 1 node(s) didn't match Pod's node affinity.
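
      To gauge how many pods a single build has created, a saved copy of the console log can be tallied. The following is a minimal sketch, assuming the log contains only the two line shapes shown above; the function name `summarize` and the regular expressions are illustrative and not part of the plugin.

```python
import re
from collections import Counter

# Patterns derived from the two line shapes in the console log excerpt.
CREATED = re.compile(r"^Created Pod: \S+ (?P<pod>\S+/\S+)$")
EVENT = re.compile(r"^\[Warning\]\[(?P<pod>[^\]]+)\]\[(?P<reason>[^\]]+)\]")


def summarize(log_text):
    """Return (list of created pods, FailedScheduling warning count per pod)."""
    created = []
    failed = Counter()
    for raw in log_text.splitlines():
        line = raw.strip()
        match = CREATED.match(line)
        if match:
            created.append(match.group("pod"))
            continue
        match = EVENT.match(line)
        if match and match.group("reason") == "FailedScheduling":
            failed[match.group("pod")] += 1
    return created, failed


# The six log lines quoted in this report.
SAMPLE = """\
Created Pod: kubernetes ci-jenkins-server/jenkinsfile-k8s-13-cmm5l-qnfbz-gphrk
[Warning][ci-jenkins-server/jenkinsfile-k8s-13-cmm5l-qnfbz-gphrk][FailedScheduling] 0/2 nodes are available: 1 Insufficient cpu, 1 node(s) didn't match Pod's node affinity.
[Warning][ci-jenkins-server/jenkinsfile-k8s-13-cmm5l-qnfbz-gphrk][FailedScheduling] 0/2 nodes are available: 1 Insufficient cpu, 1 node(s) didn't match Pod's node affinity.
Created Pod: kubernetes ci-jenkins-server/jenkinsfile-k8s-13-cmm5l-qnfbz-tcw8d
[Warning][ci-jenkins-server/jenkinsfile-k8s-13-cmm5l-qnfbz-tcw8d][FailedScheduling] 0/2 nodes are available: 1 Insufficient cpu, 1 node(s) didn't match Pod's node affinity.
[Warning][ci-jenkins-server/jenkinsfile-k8s-13-cmm5l-qnfbz-tcw8d][FailedScheduling] 0/2 nodes are available: 1 Insufficient cpu, 1 node(s) didn't match Pod's node affinity.
"""

if __name__ == "__main__":
    pods, warnings = summarize(SAMPLE)
    print(len(pods), "pods created")
    for pod in pods:
        print(pod, "FailedScheduling warnings:", warnings[pod])
```

      Each `Created Pod` line is a new pod name for the same build, so the length of the first return value approximates the size of the backlog that build left behind.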

       

      In the GKE cluster's Workloads view, searching for the job name shows 500+ workloads for a job that was already aborted.

       

      How to reproduce:

      1. Create a small cluster and start a job that requests more CPU than is available on its node. (I have not verified these exact steps; I assume a node selector that matches no node would also reproduce it.)
      2. Let it run for a while as it tries but fails to schedule the pods.
      3. Check the cluster's workloads under GKE/workload and search for the job / failed pod name.
      4. Abort the job.
      5. Once you add more CPUs, the backlog of workloads starts running.
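
      Until the plugin cleans these up itself, the leftover pods could be identified before capacity is added. The following is a rough sketch of one possible workaround, not plugin behavior: it assumes the pod list has been exported with `kubectl get pods -n ci-jenkins-server -o json` and that the leftover pods share the job's name prefix. The function name, the age threshold, and the timestamps and extra pod names in the sample are made up for illustration.

```python
from datetime import datetime, timedelta, timezone


def stale_pending_pods(pod_list, name_prefix, now, max_age):
    """Names of Pending pods that start with name_prefix and are older than max_age."""
    stale = []
    for item in pod_list.get("items", []):
        meta = item["metadata"]
        if item.get("status", {}).get("phase") != "Pending":
            continue  # only pods that were never scheduled or started
        if not meta["name"].startswith(name_prefix):
            continue  # unrelated workload
        created = datetime.strptime(
            meta["creationTimestamp"], "%Y-%m-%dT%H:%M:%SZ"
        ).replace(tzinfo=timezone.utc)
        if now - created > max_age:
            stale.append(meta["name"])
    return stale


# Hypothetical data in the shape of `kubectl get pods -o json` output.
SAMPLE_PODS = {
    "items": [
        {"metadata": {"name": "jenkinsfile-k8s-13-cmm5l-qnfbz-gphrk",
                      "creationTimestamp": "2021-06-21T10:00:00Z"},
         "status": {"phase": "Pending"}},
        {"metadata": {"name": "jenkinsfile-k8s-13-cmm5l-qnfbz-aaaaa",
                      "creationTimestamp": "2021-06-21T10:05:00Z"},
         "status": {"phase": "Running"}},
        {"metadata": {"name": "jenkinsfile-k8s-13-cmm5l-qnfbz-bbbbb",
                      "creationTimestamp": "2021-06-24T23:30:00Z"},
         "status": {"phase": "Pending"}},
        {"metadata": {"name": "other-job-1-ccccc",
                      "creationTimestamp": "2021-06-21T10:00:00Z"},
         "status": {"phase": "Pending"}},
    ]
}

if __name__ == "__main__":
    now = datetime(2021, 6, 25, 0, 0, 0, tzinfo=timezone.utc)
    for name in stale_pending_pods(
        SAMPLE_PODS, "jenkinsfile-k8s-13-", now, timedelta(hours=1)
    ):
        print(name)
```

      The printed names could then be reviewed and passed to `kubectl delete pod`. Matching on a name prefix is a blunt filter, so the list should be checked against builds that are still running before anything is deleted.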

       

       


            People

            Assignee:
            Unassigned
            Reporter:
            sbeaulie Samuel Beaulieu