[JENKINS-73293] Excessive Node creation/deletion when hitting Resource Quotas

Allan BURDAJEWICZ created issue - 2024-06-10 23:25

Allan BURDAJEWICZ made changes - 2024-06-10 23:25

Issue Type

Original: Improvement [ 4 ]

New: Bug [ 1 ]

Allan BURDAJEWICZ made changes - 2024-06-10 23:53

Description

New: When a Kubernetes resource quota is reached (for example, a pod limit), Jenkins nodes are created and then removed over and over, once per queue cycle:

* Node is created
* Launcher tries to launch the pod and fails with a KubernetesClientException (see Evidence)
* Node is removed

If the queue has a lot of items, this can considerably slow down the queue maintenance thread and the start of build executions, because each node operation requires the queue lock.

The Kubernetes Plugin should probably adapt better to the Kubernetes limits to avoid this behavior.
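
One possible way to adapt, sketched below, is to pause provisioning for a growing delay after each quota rejection, so that nodes are not created and removed on every queue cycle. This is only an illustration: the class {{QuotaBackoff}}, its methods, and the delays are hypothetical and are not part of the Kubernetes Plugin or of Jenkins core.

```java
import java.time.Duration;
import java.time.Instant;

/**
 * Hypothetical sketch of a provisioning backoff. Not plugin code:
 * all names and the doubling policy are assumptions for illustration.
 */
final class QuotaBackoff {
    private final Duration initial;
    private final Duration max;
    private Duration current;
    // Instant.MIN means "not blocked".
    private Instant blockedUntil = Instant.MIN;

    QuotaBackoff(Duration initial, Duration max) {
        this.initial = initial;
        this.max = max;
        this.current = initial;
    }

    /** True if a new agent node may be created at the given time. */
    synchronized boolean mayProvision(Instant now) {
        return !now.isBefore(blockedUntil);
    }

    /** Record a quota rejection: block provisioning, then double the next delay up to max. */
    synchronized void onQuotaFailure(Instant now) {
        blockedUntil = now.plus(current);
        Duration doubled = current.multipliedBy(2);
        current = doubled.compareTo(max) > 0 ? max : doubled;
    }

    /** Record a successful pod creation: lift the block and reset the delay. */
    synchronized void onSuccess() {
        current = initial;
        blockedUntil = Instant.MIN;
    }
}
```

A caller would check {{mayProvision}} before creating a node, and call {{onQuotaFailure}} when the launch is rejected with the quota error. The clock is passed in explicitly to keep the sketch easy to test.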

h3. Evidence

With a resource quota that sets a pod limit, the following exception is thrown at every pod creation failure:

{code}
io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: POST at: <KUBERNETES_URL>/api/v1/namespaces/<NAMESPACE>/pods. Message: pods "<AGENTS_NAME>" is forbidden: exceeded quota: pod-limit, requested: pods=1, used: pods=300, limited: pods=300.
{code}
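
The quota details in the message above (quota name, requested, used, limited) could be extracted so that a caller can tell a quota rejection apart from other launch failures. The following is a minimal sketch under assumptions: {{QuotaExceeded}} is a hypothetical helper, not a fabric8 or plugin class, and the pattern only covers the single-resource, integer-valued form shown in this issue.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper, not part of fabric8 or the Kubernetes Plugin. */
final class QuotaExceeded {
    // Matches the tail of the message quoted in this issue, e.g.
    // "exceeded quota: pod-limit, requested: pods=1, used: pods=300, limited: pods=300"
    // Assumption: one resource with integer quantities; other forms are not handled.
    private static final Pattern QUOTA = Pattern.compile(
            "exceeded quota: ([^,]+), requested: (\\w+)=(\\d+), used: \\w+=(\\d+), limited: \\w+=(\\d+)");

    final String quotaName;
    final String resource;
    final long requested;
    final long used;
    final long limited;

    private QuotaExceeded(String quotaName, String resource, long requested, long used, long limited) {
        this.quotaName = quotaName;
        this.resource = resource;
        this.requested = requested;
        this.used = used;
        this.limited = limited;
    }

    /** Returns the parsed quota details, or empty if the message is not a quota rejection of this form. */
    static Optional<QuotaExceeded> parse(String message) {
        if (message == null) {
            return Optional.empty();
        }
        Matcher m = QUOTA.matcher(message);
        if (!m.find()) {
            return Optional.empty();
        }
        return Optional.of(new QuotaExceeded(
                m.group(1), m.group(2),
                Long.parseLong(m.group(3)), Long.parseLong(m.group(4)), Long.parseLong(m.group(5))));
    }
}
```

Matching on message text is fragile. It is used here only because the message is the evidence available in this issue.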

Typically you'd see many threads removing nodes but waiting on the queue lock:

{code}
at hudson.model.Queue._withLock(Queue.java:1408)
at hudson.model.Queue.withLock(Queue.java:1284)
at hudson.model.AbstractCIBase.updateComputerList(AbstractCIBase.java:238)
at jenkins.model.Jenkins.updateComputerList(Jenkins.java:1711)
at jenkins.model.Nodes.removeNode(Nodes.java:297)
at jenkins.model.Jenkins.removeNode(Jenkins.java:2277)
at hudson.slaves.AbstractCloudSlave.terminate(AbstractCloudSlave.java:91)
at org.csanchez.jenkins.plugins.kubernetes.KubernetesLauncher.launch(KubernetesLauncher.java:285)
{code}

Depending on the load (queue size and number of nodes), executors that try to execute queued tasks are also stuck on the queue lock:

{code}
"Executor #0 for <agentName> : executing <jobFullName> #<buildNumber>" ... waiting on condition [0x00007efd152c3000]
[...]
at hudson.model.Queue._withLock(Queue.java:1408)
at hudson.model.ResourceController.execute(ResourceController.java:104)
at hudson.model.Executor.run(Executor.java:443)
{code}

or:

{code}
"Executor #0 for <otherAgentName>" .... waiting on condition [0x00007efcd4201000]
[...]
at hudson.model.Queue._withLock(Queue.java:1469)
at hudson.model.Queue.withLock(Queue.java:1327)
at hudson.model.Executor.run(Executor.java:353)
{code}

h3. Workaround

A workaround is to reflect the quota limit in the Kubernetes Cloud configuration.
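
The reasoning behind the workaround can be sketched as simple arithmetic: if the cloud is capped at the same value as the quota, no pod creation is attempted once the cap is reached, so the create/fail/remove cycle does not start. The issue does not name the setting; the cloud-level concurrency limit is the likely candidate, which is an assumption here. {{AgentCap}} below is a hypothetical illustration, not plugin code.

```java
/** Hypothetical illustration of a provisioning cap; not plugin code. */
final class AgentCap {
    /**
     * Number of new agents that may be started, given a cap that mirrors
     * the Kubernetes pod quota, the number of agents already running,
     * and the number of agents the queue is asking for.
     */
    static int allowedNewAgents(int cap, int running, int demand) {
        if (cap < 0 || running < 0 || demand < 0) {
            throw new IllegalArgumentException("cap, running and demand must be non-negative");
        }
        // Headroom is zero when the cap is already reached or exceeded.
        int headroom = Math.max(0, cap - running);
        return Math.min(demand, headroom);
    }
}
```

With the numbers from the Evidence section (300 pods used, limit 300), the headroom is zero, so no node would be created for the queued items.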

Allan BURDAJEWICZ made changes - 2024-12-17 07:16

Remote Link

New: This issue links to "CloudBees Internal Issue (Web Link)" [ 30423 ]
