Jenkins / JENKINS-65957

Kubernetes plugin creates a backlog of workloads in GKE when scheduling fails and the job is aborted


    • Type: Bug
    • Resolution: Unresolved
    • Priority: Critical
    • Component: kubernetes-plugin
    • Labels: None
    • Environment: Jenkins 1.277.1, Kubernetes plugin 1.30

      I started a job for which there was not enough CPU in the GKE cluster. The Jenkins console log for the job showed that the plugin kept retrying, and it created a backlog of 500+ workloads in GKE. A few days later I added more nodes/CPU, and the plugin started churning through the backlog even though the job had been aborted 3 days earlier. Even when a pod does come up successfully it is too late, because Jenkins no longer cares about it. I'm not sure how long it would take to work through the full workload backlog.

       

      In the Jenkins console log you'll see something like:

      Created Pod: kubernetes ci-jenkins-server/jenkinsfile-k8s-13-cmm5l-qnfbz-gphrk
      [Warning][ci-jenkins-server/jenkinsfile-k8s-13-cmm5l-qnfbz-gphrk][FailedScheduling] 0/2 nodes are available: 1 Insufficient cpu, 1 node(s) didn't match Pod's node affinity.
      [Warning][ci-jenkins-server/jenkinsfile-k8s-13-cmm5l-qnfbz-gphrk][FailedScheduling] 0/2 nodes are available: 1 Insufficient cpu, 1 node(s) didn't match Pod's node affinity.
      Created Pod: kubernetes ci-jenkins-server/jenkinsfile-k8s-13-cmm5l-qnfbz-tcw8d
      [Warning][ci-jenkins-server/jenkinsfile-k8s-13-cmm5l-qnfbz-tcw8d][FailedScheduling] 0/2 nodes are available: 1 Insufficient cpu, 1 node(s) didn't match Pod's node affinity.
      [Warning][ci-jenkins-server/jenkinsfile-k8s-13-cmm5l-qnfbz-tcw8d][FailedScheduling] 0/2 nodes are available: 1 Insufficient cpu, 1 node(s) didn't match Pod's node affinity.

       

      In the GKE cluster's Workloads view I can search for the job name, and it shows 500+ workloads for a job that was aborted.

       

      How to reproduce:

      1. Create a small cluster and start a job that requests more CPU than is available on any node. (A node selector that matches nothing should also reproduce it.) See the sketch after this list.
      2. Let it run for a while as the plugin tries, but fails, to schedule the pods.
      3. Check the cluster's workloads by going to GKE > Workloads and searching for the job / failed pod name.
      4. Abort the job.
      5. Once you add more CPU, the backlog of workloads starts running.
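
      A minimal scripted-pipeline sketch of step 1, assuming the kubernetes plugin's podTemplate step with a raw YAML pod spec; the container name, image, and the 64-CPU request are illustrative, chosen only to exceed what a small cluster's nodes can offer:

      // Hypothetical Jenkinsfile: the cpu request is assumed to be
      // unsatisfiable on the test cluster, so the pod never schedules.
      podTemplate(yaml: '''
      apiVersion: v1
      kind: Pod
      spec:
        containers:
        - name: busybox
          image: busybox
          command: ['sleep', '99999']
          resources:
            requests:
              cpu: "64"
      ''') {
          node(POD_LABEL) {
              // Never reached while the pod is unschedulable.
              sh 'echo pod finally scheduled'
          }
      }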

       

       

            Assignee: Unassigned
            Reporter: Samuel Beaulieu (sbeaulie)
            Votes: 0
            Watchers: 1