I started a job for which there was not enough CPU in the GKE cluster. The Jenkins console log for the job showed that it kept retrying, which created a backlog of 500+ workloads in GKE. A few days later I added more nodes/CPU and the cluster started churning through the backlog, even though the job had been aborted 3 days earlier. When a pod finally comes up, it is also too late, because Jenkins no longer cares about it. I'm not sure how long it would take to work through the full workload backlog.
In the Jenkins console log you'll see something like:
In the GKE cluster's Workloads view I can search for the job name, and it shows 500+ workloads for a job that was already aborted.
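To inspect the same backlog from the command line instead of the GKE Workloads page, something like the sketch below should work. This is illustrative only, not part of the original report: it assumes the official `kubernetes` Python client, a `jenkins` namespace, and that the agent pods still carry the plugin's old default `jenkins=slave` label; adjust the namespace and label selector to your setup.

```python
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() when run in-cluster
v1 = client.CoreV1Api()

# List agent pods that are stuck Pending (the backlog described above).
pending = v1.list_namespaced_pod(
    namespace="jenkins",                 # assumed namespace
    label_selector="jenkins=slave",      # assumed agent-pod label
    field_selector="status.phase=Pending",
)
print(f"{len(pending.items)} pending agent pods")
for pod in pending.items:
    print(pod.metadata.name)
```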
How to reproduce:
- Create a small cluster and have the job request more CPU than is available on any node (I assume a node selector that does not match any node would also reproduce this; see the sketch after this list).
- Let it run for a while as it tries but fails to schedule the pods.
- Then you can check the cluster's workloads by going to GKE > Workloads and searching for the job / failed pod name.
- Abort the job.
- Once you add more CPUs, the backlog of workloads starts running.
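For reference, the "stuck Pending until capacity appears" behavior can also be shown outside Jenkins with a pod whose CPU request no node can satisfy. The sketch below is an assumption of mine, not from the report; the pod name, namespace, and the 999-CPU request are made up, and it again uses the official `kubernetes` Python client.

```python
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

# A pod that requests far more CPU than any node has, so it stays Pending
# until the cluster grows, mirroring the backlogged agent pods.
pod = client.V1Pod(
    metadata=client.V1ObjectMeta(name="unschedulable-demo"),  # hypothetical name
    spec=client.V1PodSpec(
        restart_policy="Never",
        containers=[
            client.V1Container(
                name="sleep",
                image="busybox",
                command=["sleep", "3600"],
                resources=client.V1ResourceRequirements(
                    requests={"cpu": "999"}  # more CPU than any node offers
                ),
            )
        ],
    ),
)
v1.create_namespaced_pod(namespace="default", body=pod)
```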