JENKINS-73293

Excessive Node creation/deletion when hitting Resource Quotas

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Minor
    • Component: kubernetes-plugin
    • None

      When a Kubernetes resource quota limit is hit (for example, a pod limit), Jenkins nodes are created and then removed over and over, after each queue cycle:

      • Node is created
      • Launcher tries to launch the pod and fails with a quota error (see Evidence)
      • Node is removed

      If the queue has a lot of items, this can considerably slow down the queue maintenance thread and the start of build executions, because each node operation requires the queue lock.

      The Kubernetes plugin should probably adapt better to Kubernetes limits to avoid this behavior.
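
One way the plugin could adapt is to remember a quota rejection and skip provisioning for a while, instead of creating and removing a node on every queue cycle. The sketch below only illustrates that idea in Python. It is not the plugin's code, and `QuotaExceeded`, `QuotaAwareProvisioner`, `create_pod` and `cooldown_seconds` are hypothetical names.

```python
import time
from typing import Callable, Optional


class QuotaExceeded(Exception):
    """Hypothetical error raised when Kubernetes rejects a pod because of a quota."""


class QuotaAwareProvisioner:
    """Skips provisioning for a cooldown period after a quota rejection."""

    def __init__(
        self,
        create_pod: Callable[[str], None],
        cooldown_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._create_pod = create_pod
        self._cooldown = cooldown_seconds
        self._clock = clock
        self._blocked_until: Optional[float] = None

    def try_provision(self, name: str) -> bool:
        """Return True if a pod was created, False if skipped or rejected."""
        now = self._clock()
        if self._blocked_until is not None and now < self._blocked_until:
            # Still cooling down: nothing is created, so nothing has to be removed.
            return False
        try:
            self._create_pod(name)
        except QuotaExceeded:
            # Remember the rejection so later queue cycles do not retry immediately.
            self._blocked_until = now + self._cooldown
            return False
        self._blocked_until = None
        return True
```

With this shape, repeated queue cycles during the cooldown cost one comparison each rather than a node creation, a failed launch and a node removal.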

      Evidence

      With a resource quota that sets a pod limit, the following exception is thrown at every failed pod creation:

      io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: POST at: <KUBERNETES_URL>/api/v1/namespaces/<NAMESPACE>/pods. Message: pods "<AGENTS_NAME>" is forbidden: exceeded quota: pod-limit, requested: pods=1, used: pods=300, limited: pods=300. 
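
The message above carries enough detail to tell a quota rejection apart from other failures. As a rough sketch, assuming the message keeps the `exceeded quota: ..., requested: ..., used: ..., limited: ...` layout shown here, it could be parsed as follows. `parse_quota_error` is a hypothetical helper, not part of any client library.

```python
import re
from typing import Dict, Optional

# Matches the tail of the message shown above; the layout is an assumption
# based on that single example.
_QUOTA_RE = re.compile(
    r"exceeded quota: (?P<quota>[^,]+), "
    r"requested: (?P<requested>.+?), "
    r"used: (?P<used>.+?), "
    r"limited: (?P<limited>.+?)\.?\s*$"
)


def _pairs(text: str) -> Dict[str, str]:
    """Turn 'pods=1' or 'a=1,b=2' into a dict of resource name to value."""
    return dict(item.split("=", 1) for item in text.split(","))


def parse_quota_error(message: str) -> Optional[dict]:
    """Return quota details, or None if the message is not a quota rejection."""
    match = _QUOTA_RE.search(message)
    if match is None:
        return None
    return {
        "quota": match.group("quota"),
        "requested": _pairs(match.group("requested")),
        "used": _pairs(match.group("used")),
        "limited": _pairs(match.group("limited")),
    }
```

Values are kept as strings because quota quantities are not always plain integers.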
      

      Typically you'd see many threads removing nodes but waiting on the queue lock:

      	at hudson.model.Queue._withLock(Queue.java:1408)
      	at hudson.model.Queue.withLock(Queue.java:1284)
      	at hudson.model.AbstractCIBase.updateComputerList(AbstractCIBase.java:238)
      	at jenkins.model.Jenkins.updateComputerList(Jenkins.java:1711)
      	at jenkins.model.Nodes.removeNode(Nodes.java:297)
      	at jenkins.model.Jenkins.removeNode(Jenkins.java:2277)
      	at hudson.slaves.AbstractCloudSlave.terminate(AbstractCloudSlave.java:91)
      	at org.csanchez.jenkins.plugins.kubernetes.KubernetesLauncher.launch(KubernetesLauncher.java:285)
      

      Depending on the load (queue size and number of nodes), executors that try to execute queued tasks are also stuck on the queue lock:

      "Executor #0 for <agentName> : executing <jobFullName> #<buildNumber>" ... waiting on condition  [0x00007efd152c3000]
          [...]
      	at hudson.model.Queue._withLock(Queue.java:1408)
      	at hudson.model.ResourceController.execute(ResourceController.java:104)
      	at hudson.model.Executor.run(Executor.java:443)
      

      or:

      "Executor #0 for <otherAgentName>" .... waiting on condition  [0x00007efcd4201000]
         [...]
      	at hudson.model.Queue._withLock(Queue.java:1469)
      	at hudson.model.Queue.withLock(Queue.java:1327)
      	at hudson.model.Executor.run(Executor.java:353)
      

      Workaround

      A workaround is to reflect the quota limit in the Kubernetes cloud configuration, so that Jenkins stops provisioning agents before the quota is reached.
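
As a rough illustration of choosing the value to configure, the sketch below subtracts pods that are not Jenkins agents, plus some optional headroom, from the quota. `safe_agent_cap` and its parameters are hypothetical. The assumption that the result goes into the cloud's concurrency limit setting, and how the plugin treats an empty or zero limit, should be checked against the plugin documentation.

```python
def safe_agent_cap(quota_pods: int, non_agent_pods: int, headroom: int = 0) -> int:
    """Return an agent cap that keeps the namespace at or below its pod quota.

    quota_pods:     pod limit from the resource quota (300 in the example above)
    non_agent_pods: pods in the namespace that are not Jenkins agents
    headroom:       extra pods to leave free, for example for rolling updates
    """
    if quota_pods < 0 or non_agent_pods < 0 or headroom < 0:
        raise ValueError("inputs must be non-negative")
    return max(0, quota_pods - non_agent_pods - headroom)
```

For a quota of 300 pods with 12 non-agent pods and a headroom of 8, this gives a cap of 280.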

          [JENKINS-73293] Excessive Node creation/deletion when hitting Resource Quotas

          Allan BURDAJEWICZ created issue -
          Allan BURDAJEWICZ made changes -
          Description: removed a stray "****" line before the Evidence heading; otherwise unchanged.
          Allan BURDAJEWICZ made changes -
          Issue Type Original: Improvement [ 4 ] New: Bug [ 1 ]
          Allan BURDAJEWICZ made changes -
          Description: added the Workaround section.
          Allan BURDAJEWICZ made changes -
          Description: corrected the Workaround heading markup from ".h3" to "h3.".
          Allan BURDAJEWICZ made changes -
          Remote Link New: This issue links to "CloudBees Internal Issue (Web Link)" [ 30423 ]

            Assignee: Unassigned
            Reporter: Allan BURDAJEWICZ (allan_burdajewicz)
            Votes: 0
            Watchers: 4