runATH leads to deadlock of resource consumption for core PR builds



    • Evergreen - Milestone 1

      This weekend ci.jenkins.io suffered a denial of service caused by resource contention from the runATH step in the core Jenkinsfile.

      An executor on the "linux" label was occupied while it blocked, waiting for an executor on "docker&&highmem". When Jenkins couldn't provision a "highmem" agent due to capacity limits, the runATH step held the "linux" executor indefinitely.

      At the bottom of the core Jenkinsfile is code along these lines:

      node('linux') {
        /* some setup */
        // runATH() calls ensureInNode for "linux", then for "docker&&highmem"
        runATH()
      }
      

      In runATH(), the first ensureInNode call ensures that the Pipeline uses only one node, since execution is already on a node carrying the "linux" label.

      When the second ensureInNode call executes, it tries to ensure that execution is on a "docker&&highmem" node, which, of course, it is not. The Pipeline therefore blocks waiting for such a node while still holding the executor allocated by the outer node('linux') block.
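      The following minimal standalone Groovy simulation makes the nesting concrete. It runs with plain `groovy`, not inside Jenkins. `node`, `env`, `ensureInNode` and `runATH` are all hypothetical stand-ins written for illustration. They are not the real Pipeline steps or the pipeline-library code. The script assumes one free "linux" executor and zero free "docker&&highmem" executors. Where the real `node` step would sit in the queue indefinitely, the stub returns a message instead.

```groovy
// Hypothetical simulation of the nested allocation described above.
// None of these closures are the real Jenkins steps or library code.

// Assumed capacity: one free "linux" executor, no "docker&&highmem" executors.
def freeExecutors = ['linux': 1, 'docker&&highmem': 0]
def held = []                 // labels whose executors this build currently holds
def env = [NODE_LABELS: null] // stand-in for the Pipeline env object

// Stub for the `node` step: claim an executor for `label`, run the body,
// then release it. If none is free, the real step would wait in the queue
// forever; here we report the blocked state (snapshotted with toString()).
def node = { String label, Closure body ->
    if (freeExecutors[label] < 1) {
        return "blocked waiting for '${label}' while holding ${held}".toString()
    }
    freeExecutors[label]--
    held << label
    def saved = env.NODE_LABELS
    env.NODE_LABELS = label
    try {
        return body()
    } finally {
        env.NODE_LABELS = saved
        held.remove(label)
        freeExecutors[label]++
    }
}

// Hypothetical ensureInNode: run in place if the current node already has
// every requested label; otherwise allocate a node NESTED in the current one.
def ensureInNode = { String labelExpr, Closure body ->
    def wanted = labelExpr.split('&&') as List
    def current = (env.NODE_LABELS ?: '').split(' ') as List
    if (env.NODE_LABELS != null && current.containsAll(wanted)) {
        return body()
    }
    return node(labelExpr, body)
}

// Hypothetical runATH: two ensureInNode calls in sequence.
def runATH = {
    ensureInNode('linux') { 'setup done on linux' }  // satisfied in place
    ensureInNode('docker&&highmem') { 'ATH ran' }    // needs a second executor
}

def result = node('linux') {
    runATH()
}
println result
```

      Running the script prints `blocked waiting for 'docker&&highmem' while holding [linux]`. The inner allocation cannot proceed, and in a real build the outer "linux" executor stays claimed for as long as that wait lasts.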

      This is a significant problem: it will cause further resource contention whenever more than one or two core PRs are merged in quick succession.

            Assignee:
            Raul Arabaolaza
            Reporter:
            R. Tyler Croy
            Archiver:
            Jenkins Service Account
