JENKINS-73293

Excessive Node creation/deletion when hitting Resource Quotas


    • Type: Bug
    • Resolution: Unresolved
    • Priority: Minor
    • Component/s: kubernetes-plugin
    • None

      When a Kubernetes resource quota limit is reached (for example a pod limit), Jenkins nodes are created and then removed over and over, once per queue cycle:

      • Node is created
      • Launcher tries to launch the pod and fails because the quota is exceeded
      • Node is removed

      If the queue has a lot of items, this can considerably slow down the queue maintenance thread and the start of build executions, as each node operation requires the queue lock.

      The Kubernetes plugin should probably adapt better to Kubernetes limits to avoid this behavior.
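
One possible way to adapt, shown only as a minimal sketch: after a pod creation is rejected by a quota, hold off further provisioning for an exponentially growing interval instead of creating and removing a node on every queue cycle. The `QuotaBackoff` class and all of its method names are hypothetical and are not part of the Kubernetes plugin. Timestamps are passed in explicitly so the logic can be exercised without a real clock.

```java
/** Hypothetical helper, not part of the Kubernetes plugin. */
final class QuotaBackoff {
    private final long baseMillis;      // delay after the first quota failure
    private final long maxMillis;       // upper bound for the delay
    private int consecutiveFailures = 0;
    private long blockedUntilMillis = 0L;

    QuotaBackoff(long baseMillis, long maxMillis) {
        this.baseMillis = baseMillis;
        this.maxMillis = maxMillis;
    }

    /** Record that a pod creation was rejected by the quota at nowMillis. */
    synchronized void recordQuotaFailure(long nowMillis) {
        consecutiveFailures++;
        // Double the delay for each consecutive failure, capped at maxMillis.
        // The shift is bounded so the multiplication stays well within a long
        // for realistic base delays.
        int shift = Math.min(consecutiveFailures - 1, 20);
        long delay = Math.min(maxMillis, baseMillis << shift);
        blockedUntilMillis = nowMillis + delay;
    }

    /** Record a successful pod creation and lift any provisioning block. */
    synchronized void recordSuccess() {
        consecutiveFailures = 0;
        blockedUntilMillis = 0L;
    }

    /** True if a new node may be provisioned at nowMillis. */
    synchronized boolean mayProvision(long nowMillis) {
        return nowMillis >= blockedUntilMillis;
    }

    public static void main(String[] args) {
        QuotaBackoff backoff = new QuotaBackoff(10_000L, 60_000L);
        long now = System.currentTimeMillis();
        backoff.recordQuotaFailure(now);
        System.out.println("may provision right away: " + backoff.mayProvision(now));
    }
}
```

With a check like `mayProvision` placed before the node is created, a rejected pod would cost one node creation and removal per backoff interval rather than one per queue cycle, which should reduce contention on the queue lock.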

      Evidence

      In the case of a resource quota with a pod limit, the following exception is thrown at every pod creation failure:

      io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: POST at: <KUBERNETES_URL>/api/v1/namespaces/<NAMESPACE>/pods. Message: pods "<AGENTS_NAME>" is forbidden: exceeded quota: pod-limit, requested: pods=1, used: pods=300, limited: pods=300. 
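
To react to this failure specifically, code would first need to recognize it. The following is a minimal sketch that extracts the quota name and pod counts from a message of exactly the shape shown above. The `QuotaMessage` class is hypothetical, and the pattern is an assumption based only on this one message; quota messages that list other or additional resources would not match. In practice the input would presumably be the exception's `getMessage()` text.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical parser for the pod-quota message quoted in this issue. */
final class QuotaMessage {
    // Assumed shape, taken from the message in this issue:
    // "exceeded quota: <name>, requested: pods=N, used: pods=N, limited: pods=N"
    private static final Pattern POD_QUOTA = Pattern.compile(
            "exceeded quota: ([^,]+), requested: pods=(\\d+), used: pods=(\\d+), limited: pods=(\\d+)");

    final String quotaName;
    final int requested;
    final int used;
    final int limited;

    private QuotaMessage(String quotaName, int requested, int used, int limited) {
        this.quotaName = quotaName;
        this.requested = requested;
        this.used = used;
        this.limited = limited;
    }

    /** Returns the parsed quota details, or empty if the message does not match. */
    static Optional<QuotaMessage> parse(String message) {
        if (message == null) {
            return Optional.empty();
        }
        Matcher m = POD_QUOTA.matcher(message);
        if (!m.find()) {
            return Optional.empty();
        }
        return Optional.of(new QuotaMessage(
                m.group(1).trim(),
                Integer.parseInt(m.group(2)),
                Integer.parseInt(m.group(3)),
                Integer.parseInt(m.group(4))));
    }

    /** Number of pods that could still be created under the quota. */
    int remaining() {
        return Math.max(0, limited - used);
    }

    public static void main(String[] args) {
        String msg = "pods \"agent-1\" is forbidden: exceeded quota: pod-limit, "
                + "requested: pods=1, used: pods=300, limited: pods=300.";
        parse(msg).ifPresent(q ->
                System.out.println(q.quotaName + " remaining=" + q.remaining()));
    }
}
```

Matching on message text is fragile; it is used here only because the message is the sole evidence available in this report.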
      

      Typically you'd see many threads removing nodes but waiting on the queue lock:

      	at hudson.model.Queue._withLock(Queue.java:1408)
      	at hudson.model.Queue.withLock(Queue.java:1284)
      	at hudson.model.AbstractCIBase.updateComputerList(AbstractCIBase.java:238)
      	at jenkins.model.Jenkins.updateComputerList(Jenkins.java:1711)
      	at jenkins.model.Nodes.removeNode(Nodes.java:297)
      	at jenkins.model.Jenkins.removeNode(Jenkins.java:2277)
      	at hudson.slaves.AbstractCloudSlave.terminate(AbstractCloudSlave.java:91)
      	at org.csanchez.jenkins.plugins.kubernetes.KubernetesLauncher.launch(KubernetesLauncher.java:285)
      

      And depending on the load (queue size and number of nodes), executors that try to execute queued tasks are also stuck waiting on the queue lock:

      "Executor #0 for <agentName> : executing <jobFullName> #<buildNumber>" ... waiting on condition  [0x00007efd152c3000]
          [...]
      	at hudson.model.Queue._withLock(Queue.java:1408)
      	at hudson.model.ResourceController.execute(ResourceController.java:104)
      	at hudson.model.Executor.run(Executor.java:443)
      

      or:

      "Executor #0 for <otherAgentName>" .... waiting on condition  [0x00007efcd4201000]
         [...]
      	at hudson.model.Queue._withLock(Queue.java:1469)
      	at hudson.model.Queue.withLock(Queue.java:1327)
      	at hudson.model.Executor.run(Executor.java:353)
      

      Workaround

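      A workaround is to reflect the quota in the Kubernetes Cloud configuration, for example by setting the cloud's concurrency limit no higher than the quota's pod limit, so that the plugin stops provisioning before Kubernetes rejects the pod.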

            Assignee: Unassigned
            Reporter: Allan BURDAJEWICZ (allan_burdajewicz)