Bug
Resolution: Unresolved
Minor
None
When a Kubernetes resource quota limit is hit (for example, a pod limit), Jenkins nodes are created and then removed over and over, once per queue cycle:
- Node is created
- Launcher tries to launch the pod and fails with a KubernetesClientException (see Evidence)
- Node is removed
If the queue has a lot of items, this can considerably slow down the queue maintenance thread and the start of build executions, because each node operation requires the queue lock.
The Kubernetes Plugin should adapt better to Kubernetes limits to avoid this behavior.
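One way the plugin could adapt is to pause provisioning for a while after a quota failure instead of retrying on every queue cycle. The following is only a sketch of that idea under stated assumptions: the class QuotaBackoff and its methods are hypothetical and are not part of the Kubernetes Plugin, and the delays are arbitrary.

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical helper, not part of the Kubernetes Plugin.
// Tracks a provisioning pause that doubles after each quota failure.
final class QuotaBackoff {
    private final Duration initial;
    private final Duration max;
    private Duration current;                     // null while no failure is recorded
    private Instant blockedUntil = Instant.MIN;   // provisioning allowed at or after this time

    QuotaBackoff(Duration initial, Duration max) {
        this.initial = initial;
        this.max = max;
    }

    // True when no pause is active at the given time.
    synchronized boolean mayProvision(Instant now) {
        return !now.isBefore(blockedUntil);
    }

    // Call when pod creation failed because a quota was exceeded.
    synchronized void recordQuotaFailure(Instant now) {
        if (current == null) {
            current = initial;
        } else {
            Duration doubled = current.multipliedBy(2);
            current = doubled.compareTo(max) > 0 ? max : doubled;
        }
        blockedUntil = now.plus(current);
    }

    // Call when a pod was created successfully; clears the pause.
    synchronized void recordSuccess() {
        current = null;
        blockedUntil = Instant.MIN;
    }
}
```

A caller would check mayProvision before creating a node, so that no node is added and removed (and no queue lock is taken) while the quota is known to be exhausted.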
Evidence
With a resource quota that sets a pod limit, the following exception is thrown on every pod creation failure:
io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: POST at: <KUBERNETES_URL>/api/v1/namespaces/<NAMESPACE>/pods. Message: pods "<AGENTS_NAME>" is forbidden: exceeded quota: pod-limit, requested: pods=1, used: pods=300, limited: pods=300.
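To tell a quota failure apart from other launch errors, the message text above could be parsed. This is a minimal sketch that assumes the message keeps the exact wording shown in this report; the class QuotaExceeded is hypothetical and not part of the plugin or of the fabric8 client.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the Kubernetes Plugin or the fabric8 client.
final class QuotaExceeded {
    // Assumes the wording seen in this report, e.g.
    // "exceeded quota: pod-limit, requested: pods=1, used: pods=300, limited: pods=300"
    private static final Pattern PATTERN = Pattern.compile(
        "exceeded quota: ([^,]+), requested: pods=(\\d+), used: pods=(\\d+), limited: pods=(\\d+)");

    final String quotaName;
    final int requested;
    final int used;
    final int limit;

    private QuotaExceeded(String quotaName, int requested, int used, int limit) {
        this.quotaName = quotaName;
        this.requested = requested;
        this.used = used;
        this.limit = limit;
    }

    // Returns the parsed quota details, or empty if the message is not a pod quota error.
    static Optional<QuotaExceeded> parse(String message) {
        if (message == null) {
            return Optional.empty();
        }
        Matcher m = PATTERN.matcher(message);
        if (!m.find()) {
            return Optional.empty();
        }
        return Optional.of(new QuotaExceeded(
            m.group(1).trim(),
            Integer.parseInt(m.group(2)),
            Integer.parseInt(m.group(3)),
            Integer.parseInt(m.group(4))));
    }
}
```

Matching on message text is fragile, so this is only suitable as an illustration of which fields the error carries: the quota name, the requested count, the current usage, and the limit.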
Typically you'd see many threads removing nodes but waiting on the queue lock:
at hudson.model.Queue._withLock(Queue.java:1408)
at hudson.model.Queue.withLock(Queue.java:1284)
at hudson.model.AbstractCIBase.updateComputerList(AbstractCIBase.java:238)
at jenkins.model.Jenkins.updateComputerList(Jenkins.java:1711)
at jenkins.model.Nodes.removeNode(Nodes.java:297)
at jenkins.model.Jenkins.removeNode(Jenkins.java:2277)
at hudson.slaves.AbstractCloudSlave.terminate(AbstractCloudSlave.java:91)
at org.csanchez.jenkins.plugins.kubernetes.KubernetesLauncher.launch(KubernetesLauncher.java:285)
And depending on the load (queue size and number of nodes), executors that try to execute queued tasks are also stuck on the queue lock:
"Executor #0 for <agentName> : executing <jobFullName> #<buildNumber>" ... waiting on condition [0x00007efd152c3000]
[...]
at hudson.model.Queue._withLock(Queue.java:1408)
at hudson.model.ResourceController.execute(ResourceController.java:104)
at hudson.model.Executor.run(Executor.java:443)
or:
"Executor #0 for <otherAgentName>" .... waiting on condition [0x00007efcd4201000]
[...]
at hudson.model.Queue._withLock(Queue.java:1469)
at hudson.model.Queue.withLock(Queue.java:1327)
at hudson.model.Executor.run(Executor.java:353)
Workaround
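A workaround is to reflect the quota in the Kubernetes Cloud configuration, for example by setting the cloud's concurrency limit (container cap) no higher than the pod quota, so that Jenkins stops provisioning agents before the quota is reached.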
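The effect of such a cap can be illustrated with a small calculation. This is a sketch of the arithmetic only, not the plugin's actual implementation; the class PodCap and its method are hypothetical names.

```java
// Hypothetical illustration of a provisioning cap; not the plugin's code.
final class PodCap {
    // Number of agents that may still be provisioned without exceeding the cap.
    static int allowedToProvision(int requested, int runningPods, int cap) {
        int headroom = Math.max(0, cap - runningPods);
        return Math.min(Math.max(0, requested), headroom);
    }

    public static void main(String[] args) {
        // With a cap of 300 and 300 pods already running, nothing is provisioned,
        // so no node is created only to be removed again.
        System.out.println(allowedToProvision(1, 300, 300));
    }
}
```

When the cap matches the quota, requests beyond the limit are rejected before any node is created, which avoids the create and remove cycle and the associated queue lock contention.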