Uploaded image for project: 'Jenkins'
  1. Jenkins
  2. JENKINS-41127

Multiple pipeline instances running concurrently when concurrent execution disabled

    • Icon: Bug Bug
    • Resolution: Fixed
    • Icon: Critical Critical
    • core
    • Jenkins 2.7.1
      Jenkins 2.49
      Pipeline plugin 2.4
    • Jenkins 2.136

      I have configured a Jenkins pipeline to disable concurrent builds:

      properties([
          disableConcurrentBuilds()
      ])
      

      However, I have noticed on some occasions the next 2 builds are pulled from the pipeline's queue and executed concurrently. Why this occurs is not obvious at all.

          [JENKINS-41127] Multiple pipeline instances running concurrently when concurrent execution disabled

          Andrew Bayer added a comment -

          So Queue#maintain() is running twice, one immediately after the other, in some cases - probably race conditiony, not yet sure how the two are getting called. Anyway, the first run is making the first item in the queue buildable and calls makeBuildable on the item, removing said item from blockedProjects, and, via makeFlyweightTaskBuildable and createFlyWeightTaskRunnable, starting the flyweight task and adding the first item to pendings. All is well and good. But then the next run of maintain starts - and it can't find the task for the item we just started (theoretically) and put in pendings on any executor...so it removes the item from pendings. Then it gets to checking the queue again, and the new first item doesn't have anything blocking it (i.e., nothing in buildables or pending) and so...it goes through the same process as the previous item did in the previous maintain run. End result: two builds get started at the same time.

          So - definitely a race condition.

          Andrew Bayer added a comment - So Queue#maintain() is running twice, one immediately after the other, in some cases - probably race conditiony, not yet sure how the two are getting called. Anyway, the first run is making the first item in the queue buildable and calls makeBuildable on the item, removing said item from blockedProjects , and, via makeFlyweightTaskBuildable and createFlyWeightTaskRunnable , starting the flyweight task and adding the first item to pendings . All is well and good. But then the next run of maintain starts - and it can't find the task for the item we just started (theoretically) and put in pendings on any executor...so it removes the item from pendings . Then it gets to checking the queue again, and the new first item doesn't have anything blocking it (i.e., nothing in buildables or pending ) and so...it goes through the same process as the previous item did in the previous maintain run. End result: two builds get started at the same time. So - definitely a race condition.

          Andrew Bayer added a comment -

          fwiw, I think this likely only will happen with a flyweight task - so you could probably brew up a reproduction case with a matrix job, but I doubt you could do so with a freestyle job.

          Andrew Bayer added a comment - fwiw, I think this likely only will happen with a flyweight task - so you could probably brew up a reproduction case with a matrix job, but I doubt you could do so with a freestyle job.

          Sam Van Oort added a comment -

          abayer Could we recategorize to core on the basis of your analysis?

          Sam Van Oort added a comment - abayer Could we recategorize to core on the basis of your analysis?

          Ryan Campbell added a comment -

          Noting the relationship to JENKINS-30231

          Ryan Campbell added a comment - Noting the relationship to JENKINS-30231

          Devin Nusbaum added a comment -

          In my reproductions, the call to Queue#maintain that kicks off the second job concurrently has the following abbreviated state in its initial snapshot:

          Queue.Snapshot { 
              waitingList=[...], 
              blocked=[pipeline #2, ...],
              buildables=[],
              pendings=[pipeline #1]
          }
          

          Interestingly, this is the only call to Queue#maintain out of ~250 builds where pendings is not an empty list.

          Inside of Queue#maintain, pipeline #1 (which is pending) gets removed from pendings, and because the result of makeBuildable on the next line is ignored, pipeline #1 is no longer part of the queue at all, and so nothing is blocking pipeline #2 from being built later on in Queue#maintain.

          I'm not exactly sure why pipeline #1 is removed from the pendings list. Maybe the lostPendings logic is messed up for flyweight tasks? For now I am looking at that logic to see if anything looks wrong. If it looks fine, then I'll try to understand why pipeline #1 is in pendings (maybe the flyweight task is half-started and gets blocked waiting for the Queue lock or something?) .

          Devin Nusbaum added a comment - In my reproductions, the call to Queue#maintain that kicks off the second job concurrently has the following abbreviated state in its initial snapshot: Queue.Snapshot { waitingList=[...], blocked=[pipeline #2, ...], buildables=[], pendings=[pipeline #1] } Interestingly, this is the only call to Queue#maintain out of ~250 builds where pendings is not an empty list. Inside of Queue#maintain , pipeline #1 (which is pending) gets removed from pendings , and because the result of makeBuildable on the next line is ignored, pipeline #1 is no longer part of the queue at all, and so nothing is blocking pipeline #2 from being built later on in Queue#maintain . I'm not exactly sure why pipeline #1 is removed from the pendings list. Maybe the lostPendings logic is messed up for flyweight tasks? For now I am looking at that logic to see if anything looks wrong. If it looks fine, then I'll try to understand why pipeline #1 is in pendings  (maybe the flyweight task is half-started and gets blocked waiting for the Queue lock or something?) .

          Devin Nusbaum added a comment - - edited

          Ok, I think the issue with lostPendings and flyweight tasks is that we loop through executors but not oneOffExecutors, which is where flyweight tasks are executed.

          I will test out looping through both tomorrow to see if that fixes it.

          Devin Nusbaum added a comment - - edited Ok, I think the issue with lostPendings and flyweight tasks is that we loop through executors but not oneOffExecutors, which is where flyweight tasks are executed . I will test out looping through both tomorrow to see if that fixes it.

          Sam Van Oort added a comment -

          If that fixes it, it will probably be a very welcome change.

          Sam Van Oort added a comment - If that fixes it, it will probably be a very welcome change.

          Devin Nusbaum added a comment -

          PR is up: https://github.com/jenkinsci/jenkins/pull/3562. Still looking into creating a regression test for it. I verified the change by running the same reproduction case as Alex/Andrew. Previously, a concurrent build would occur after ~250-750 builds, but after my fix I was able to run 3200 builds without any of them running concurrently.

          Devin Nusbaum added a comment - PR is up: https://github.com/jenkinsci/jenkins/pull/3562 . Still looking into creating a regression test for it. I verified the change by running the same reproduction case as Alex/Andrew. Previously, a concurrent build would occur after ~250-750 builds, but after my fix I was able to run 3200 builds without any of them running concurrently.

          Devin Nusbaum added a comment -

          My best guess as to why this happens so infrequently is that normally after a call to Queue#maintain, the executor owning the flyweight task is the next thread that acquires the Queue's lock (in Executor#run), so Queue.pendings is cleared before the next call to Queue#maintain, but in the problematic case 2 calls to {Queue#maintain happen consecutively without Executor#run being executed yet, so the task is still in Queue.pendings in the second call to {Queue#maintain.

          I wonder if using a fair ordering policy for the Queue's lock would make this less likely, or if the Executor's run method isn't even waiting on the lock yet in the problematic case.

          Devin Nusbaum added a comment - My best guess as to why this happens so infrequently is that normally after a call to Queue#maintain , the executor owning the flyweight task is the next thread that acquires the Queue's lock (in Executor#run ), so Queue.pendings is cleared before the next call to Queue#maintain , but in the problematic case 2 calls to { Queue#maintain happen consecutively without Executor#run being executed yet, so the task is still in Queue.pendings in the second call to { Queue#maintain . I wonder if using a fair ordering policy for the Queue's lock would make this less likely, or if the Executor's run method isn't even waiting on the lock yet in the problematic case.

          Devin Nusbaum added a comment -

          Fixed in Jenkins 2.136. I am marking this as an LTS candidate given the impact and simplicity of the fix, although we will have to give it some time to make sure there are no regressions.

          Devin Nusbaum added a comment - Fixed in Jenkins 2.136 . I am marking this as an LTS candidate given the impact and simplicity of the fix, although we will have to give it some time to make sure there are no regressions.

            dnusbaum Devin Nusbaum
            boon Joe Harte
            Votes:
            3 Vote for this issue
            Watchers:
            14 Start watching this issue

              Created:
              Updated:
              Resolved: