[JENKINS-65957] Kubernetes plugin creates backlog of workloads in GKE when failed scheduling and job aborted

Type: Bug
Resolution: Unresolved
Priority: Critical
Component/s: kubernetes-plugin
Labels: None
Environment:
jenkins 1.277.1
kubernetes plugin 1.30

I started a job that needed more CPU than was available in the GKE cluster. The Jenkins console log for the job showed that the plugin kept retrying, and this created a backlog of 500+ workloads in GKE. I added more nodes/CPU a few days later and the cluster started churning through the backlog, even though the job had been aborted 3 days before. By the time a pod comes up successfully it is too late, because Jenkins no longer cares about it. I'm not sure how long it would take to go through the full workload backlog.

In the Jenkins console log you'll see something like:

Created Pod: kubernetes ci-jenkins-server/jenkinsfile-k8s-13-cmm5l-qnfbz-gphrk
[Warning][ci-jenkins-server/jenkinsfile-k8s-13-cmm5l-qnfbz-gphrk][FailedScheduling] 0/2 nodes are available: 1 Insufficient cpu, 1 node(s) didn't match Pod's node affinity.
[Warning][ci-jenkins-server/jenkinsfile-k8s-13-cmm5l-qnfbz-gphrk][FailedScheduling] 0/2 nodes are available: 1 Insufficient cpu, 1 node(s) didn't match Pod's node affinity.
Created Pod: kubernetes ci-jenkins-server/jenkinsfile-k8s-13-cmm5l-qnfbz-tcw8d
[Warning][ci-jenkins-server/jenkinsfile-k8s-13-cmm5l-qnfbz-tcw8d][FailedScheduling] 0/2 nodes are available: 1 Insufficient cpu, 1 node(s) didn't match Pod's node affinity.
[Warning][ci-jenkins-server/jenkinsfile-k8s-13-cmm5l-qnfbz-tcw8d][FailedScheduling] 0/2 nodes are available: 1 Insufficient cpu, 1 node(s) didn't match Pod's node affinity.
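
To get a rough idea of how many pods a job has queued up, the console log can be summarized with a small script. This is a minimal sketch that assumes the log only uses the two line shapes shown in the excerpt above ("Created Pod: ..." and "[Warning][pod][reason] ..."); the function name `summarize` and the regular expressions are illustrative and not part of the plugin.

```python
import re
import sys
from collections import Counter

# Patterns mirror the two line shapes in the console log excerpt.
# They are assumptions based on that excerpt, not a documented log format.
CREATED = re.compile(r"^Created Pod: \S+ (?P<pod>\S+/\S+)$")
WARNING = re.compile(r"^\[Warning\]\[(?P<pod>[^\]]+)\]\[(?P<reason>[^\]]+)\]")


def summarize(log_text):
    """Return (created pod names in log order, FailedScheduling count per pod)."""
    created = []
    failed = Counter()
    for line in log_text.splitlines():
        line = line.strip()
        match = CREATED.match(line)
        if match:
            created.append(match.group("pod"))
            continue
        match = WARNING.match(line)
        if match and match.group("reason") == "FailedScheduling":
            failed[match.group("pod")] += 1
    return created, failed


if __name__ == "__main__":
    # Usage: python summarize_log.py < console.log
    pods, failures = summarize(sys.stdin.read())
    print(f"pods created: {len(pods)}")
    for pod in pods:
        print(f"{pod}: {failures[pod]} FailedScheduling warning(s)")
```

Each "Created Pod" line that is followed only by FailedScheduling warnings is a candidate for the backlog described in this report.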

In the GKE cluster's workload list I can search for the job name, and it shows 500+ workloads for a job that was aborted.

How to reproduce:

I would assume this reproduces if you create a small cluster and run a job that requests more CPU than is available on its node. I assume a node selector that does not match any node would also reproduce it.
Let the job run for a while as it tries, and fails, to schedule the pods.
Check the cluster's workloads by going to GKE/workload and searching for the job or failed pod name.
Abort the job.
Add more CPUs; the backlog of workloads then starts running.
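
The workload check can also be done from the command line instead of the GKE console. The sketch below shells out to `kubectl` and lists pods that are still Pending and whose names start with a given prefix. It assumes `kubectl` is installed and pointed at the cluster, and that the first part of the logged pod path (`ci-jenkins-server`) is the namespace; the helper names and the `jenkinsfile-k8s-13-` prefix are illustrative values taken from the log excerpt in this report.

```python
import json
import subprocess


def pending_pods_cmd(namespace):
    """Build a kubectl command that returns Pending pods in a namespace as JSON."""
    return [
        "kubectl", "get", "pods",
        "--namespace", namespace,
        "--field-selector", "status.phase=Pending",
        "--output", "json",
    ]


def names_with_prefix(pod_list, prefix):
    """Pick pod names starting with `prefix` from a parsed `kubectl get pods -o json` result."""
    return [
        item["metadata"]["name"]
        for item in pod_list.get("items", [])
        if item["metadata"]["name"].startswith(prefix)
    ]


def leftover_pods(namespace, prefix):
    """Run kubectl and return names of Pending pods that match the job's name prefix."""
    result = subprocess.run(
        pending_pods_cmd(namespace), check=True, capture_output=True, text=True
    )
    return names_with_prefix(json.loads(result.stdout), prefix)


if __name__ == "__main__":
    # Namespace and prefix are assumptions based on the log excerpt in this report.
    for name in leftover_pods("ci-jenkins-server", "jenkinsfile-k8s-13-"):
        print(name)
```

Running this after the job has been aborted should show whether pods for that job are still waiting to be scheduled.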

Assignee: Unassigned

Reporter: Samuel Beaulieu

Votes: 0

Watchers: 1

Created: 2021-06-22 20:07

Updated: 2021-06-22 20:07
