Bug
Resolution: Unresolved
Major
None
We run jobs that put 600+ builds in the queue. Jenkins starts some of them, then appears to get stuck and stops starting new ones, even though the resources are ready to run (not blocked).
This happens with the k8s plugin: one update loop for a label can take a very long time.
Apr 29, 2022 12:02:38 PM FINER hudson.slaves.NodeProvisioner ran update on k8s-beaker in 3378791ms
While this slow update is running, a lock is held and any suggestion to review the label is ignored. The behaviour we notice is a system with a lot of jobs that could run, but Jenkins does not spin up new k8s pods to fulfil the requests until the update eventually completes, and then a bunch of them start. In some of our scenarios it can be stuck like this for about one hour (the 3378791 ms in the log line above is roughly 56 minutes).
It turns out that the k8s node provisioner calls the class StandardPlannedNodeBuilder.java in a single-threaded manner. In tests with load increasing up to 600+ jobs in the queue, a single .build() call took anywhere from 0 s to 30 or 40 seconds, which makes provisioning hundreds of agents very slow. By adding a thread pool, at least they get created concurrently.
https://github.com/jonathannewman/kubernetes-plugin/commit/33c2022bfa8c539726cbda021dc007072efee92a
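To illustrate the idea only, here is a minimal, self-contained sketch of running slow build calls through a fixed thread pool instead of one after another. It is not the code from the commit above and does not use any Jenkins or kubernetes-plugin API: the class ConcurrentPlannedNodes, the methods buildAll and slowBuild, the pool size, and the delays are all hypothetical names and values chosen for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.IntFunction;

final class ConcurrentPlannedNodes {

    // Hypothetical stand-in for one slow .build() call; it just sleeps.
    static String slowBuild(int index, long delayMillis) {
        try {
            Thread.sleep(delayMillis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while building", e);
        }
        return "agent-" + index;
    }

    // Submits `count` build tasks to a fixed-size pool and collects the
    // results in submission order. The pool size is an assumed tuning knob.
    static <T> List<T> buildAll(int count, int poolSize, IntFunction<T> builder)
            throws InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(poolSize);
        try {
            List<Future<T>> futures = new ArrayList<>();
            for (int i = 0; i < count; i++) {
                final int index = i;
                Callable<T> task = () -> builder.apply(index);
                futures.add(pool.submit(task));
            }
            List<T> results = new ArrayList<>();
            for (Future<T> future : futures) {
                results.add(future.get()); // blocks until that task is done
            }
            return results;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        long start = System.nanoTime();
        List<String> agents = buildAll(20, 10, i -> slowBuild(i, 200));
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        System.out.println(agents.size() + " planned nodes in " + elapsedMs + " ms");
    }
}
```

With 20 simulated builds of 200 ms each and 10 threads, the tasks run in roughly two batches rather than taking about 4 seconds back to back. A real fix would also need to consider what shared state the build calls touch, which this sketch ignores.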