Uploaded image for project: 'Jenkins'
  1. Jenkins
  2. JENKINS-50405

runATH leads to deadlock of resource consumption for core PR builds

    XMLWordPrintable

    Details

    • Similar Issues:
    • Sprint:
      Evergreen - Milestone 1

      Description

      This weekend we experienced a denial-of-service on ci.jenkins.io due to this resource contention caused by the runATH step in the core Jenkinsfile.

      Basically, an executor on the "linux" label was occupied while blocking and waiting for an executor on "docker&&highmem". When Jenkins couldn't provision "highmem" due to capacity issues, the runATH step blocks the "linux" executor indefinitely.

      At the bottom of the Jenkinsfile for core, is some code along these lines:

      node('linux') {
        /* some setup */
        runAth()
      }
      

      In runATH(), the first ensureInNode statement ensure that the Pipeline only uses on node, since the execution is already in a "linux" NODE_LABEL.

      When the second ensureInNode executes, it's attempting to ensure that the execution is in docker&&highmem, which it is of course not. This causes Pipeline to block waiting for this node, while occupying the outer "linux" node declaration.

      This is kind of a big problem and will cause additional resource contention whenever more than one or two core PRs are merged in quick succession.

        Attachments

          Issue Links

            Activity

            Hide
            abayer Andrew Bayer added a comment -

            So it appears that ensureInNode in runATH.groovy doesn't handle label expressions (i.e., docker && highmem), just label atoms (i.e., highmem), since it's looking for the literal string docker && highmem in the NODE_LABELS environment variable...which is just a space-delimited list of the individual label atoms on the node. So, e.g., if the only two labels on the node are docker and highmem, then NODE_LABELS is docker highmem. Which obviously doesn't contain docker && highmem.

            This doesn't create a deadlock per se, but it does double up the executor usage per run, with one nested within another. I'm not sure what the solution is, exactly - in this particular case, it's actually pretty simple - just switch to highmem alone, since that'll get you the same thing as docker && highmem, but some smarter logic for determining what node you're on and what node you want to be on would be handy. However, that probably involves diving into the core label logic to do parsing/comparing/etc, and that is not shared library material.

            Show
            abayer Andrew Bayer added a comment - So it appears that ensureInNode in runATH.groovy doesn't handle label expressions (i.e., docker && highmem ), just label atoms (i.e., highmem ), since it's looking for the literal string docker && highmem in the NODE_LABELS environment variable...which is just a space-delimited list of the individual label atoms on the node. So, e.g., if the only two labels on the node are docker and highmem , then NODE_LABELS is docker highmem . Which obviously doesn't contain docker && highmem . This doesn't create a deadlock per se, but it does double up the executor usage per run, with one nested within another. I'm not sure what the solution is, exactly - in this particular case, it's actually pretty simple - just switch to highmem alone, since that'll get you the same thing as docker && highmem , but some smarter logic for determining what node you're on and what node you want to be on would be handy. However, that probably involves diving into the core label logic to do parsing/comparing/etc, and that is not shared library material.
            Hide
            rarabaolaza Raul Arabaolaza added a comment -

            In the meanwhile, I can just make the ensureInNode accept a list of labels and check that all are present in the current node

            Andrew Bayer Do you believe an ensureInNode step able to deal with label expressions could be an interesting addition for workflow.durable-task-step plugin?

            Show
            rarabaolaza Raul Arabaolaza added a comment - In the meanwhile, I can just make the ensureInNode accept a list of labels and check that all are present in the current node Andrew Bayer Do you believe an ensureInNode step able to deal with label expressions could be an interesting addition for  workflow.durable-task-step plugin?
            Hide
            rarabaolaza Raul Arabaolaza added a comment -

            PR#34 should fix the issue, basically what I have done is change ensureInNode so it takes a comma-separated list of labels and checks that all of those labels are present individually in the current node, if not it allocates a new node with the labels joined by "&&".

            Caveats, is still unable to deal with complex labels, but that functionality is not needed to run on the current infra as you can simply enclose the entire runATH call in a node("docker&&highmem") as the Jenkinsfile for core does, whith the changes in #34 no node allocation will be done at all and no node will be blocked waiting for another one 

            Show
            rarabaolaza Raul Arabaolaza added a comment - PR#34  should fix the issue, basically what I have done is change  ensureInNode  so it takes a comma-separated list of labels and checks that all of those labels are present individually in the current node, if not it allocates a new node with the labels joined by "&&". Caveats, is still unable to deal with complex labels, but that functionality is not needed to run on the current infra as you can simply enclose the entire runATH  call in a node("docker&&highmem") as the Jenkinsfile for core does, whith the changes in #34 no node allocation will be done at all and no node will be blocked waiting for another one 
            Hide
            rarabaolaza Raul Arabaolaza added a comment -

            PR is merged and ATH is working again, see  here Also according to logs there are only three nodes used (as expected). One for linux, another for windows and the last one for ath, so it seems the contention issue is also fixed.

             

            I am going to keep this in review two or three days just in case and close if no problems arise

            Show
            rarabaolaza Raul Arabaolaza added a comment - PR is merged and ATH is working again, see  here  Also according to logs there are only three nodes used (as expected). One for linux, another for windows and the last one for ath, so it seems the contention issue is also fixed.   I am going to keep this in review two or three days just in case and close if no problems arise
            Hide
            rtyler R. Tyler Croy added a comment -

            I believe this is safe to close up now

            Show
            rtyler R. Tyler Croy added a comment - I believe this is safe to close up now

              People

              Assignee:
              rarabaolaza Raul Arabaolaza
              Reporter:
              rtyler R. Tyler Croy
              Votes:
              0 Vote for this issue
              Watchers:
              5 Start watching this issue

                Dates

                Created:
                Updated:
                Resolved: