JENKINS-41127

Multiple pipeline instances running concurrently when concurrent execution disabled

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Critical
    • Resolution: Fixed
    • Component: core
    • Environment: Jenkins 2.7.1, Jenkins 2.49, Pipeline plugin 2.4
    • Fixed in: Jenkins 2.136

    Description

      I have configured a Jenkins pipeline to disable concurrent builds:

      properties([
          disableConcurrentBuilds()
      ])
      

      However, I have noticed on some occasions the next 2 builds are pulled from the pipeline's queue and executed concurrently. Why this occurs is not obvious at all.

          Activity

            boon Joe Harte added a comment -

            Bump. Seeing this again. My pipeline is configured to disable concurrent builds, yet I saw two instances running when they were both triggered at exactly the same time, down to the second. Jenkins version now is 2.49.

            boon Joe Harte added a comment - - edited

            Saw this yet again. Haven't seen the problem in hundreds of builds since my last comment above, and today I see that Jenkins pulled the next 2 builds from the queue and ran them concurrently, even though concurrent building is explicitly disabled.

             

            Using latest version of Pipeline plugins available at time of writing, and Jenkins version 2.49

             

            jglick FYI

            jglick Jesse Glick added a comment -

            Without a way to reproduce there is nothing to go on I am afraid.


            nlassai LAKSHMI ANANTHA NALLAMOTHU added a comment -

            Seen this yet again, and more frequently, today. Not quite sure how to reproduce this.

            boon Joe Harte added a comment -

            jglick Is there anything I can do to help diagnose the problem? This is becoming a serious issue for us, as it is critical we only push one build at a time through certain pipelines.

            The pipeline works fine 90% of the time, and then when a build completes Jenkins will (seemingly randomly) pull the next 2 builds from the queue at once and start executing both concurrently, which totally messes up our pipeline environment.

            jglick Jesse Glick added a comment -

            "Is there anything I can do to help diagnose the problem?"

            I guess set breakpoints in, or add logging to, WorkflowJob.isConcurrentBuild or Queue.allowNewBuildableTask.

            A workaround would be to use the lock step instead of job-level granularity.


            cjbosnic Cameron Bosnic added a comment -

            This has just happened to one of my builds. I'll try adding the logging you suggested and look into using lock.

            mstave mstave added a comment -

            There seems to be some sort of race condition, as we're seeing this intermittently.  It may be that it's only happening when there are two queued instances of a job with different parameters.

            ataylor Alex Taylor added a comment -

            jglick I have a consistent way to reproduce this if it helps:

            1. Create a Freestyle job (just to catch when the error happens) called "FreestyleTest" with string parameters for "CurrentBuild", "PreviousBuild", and "QueueStatus"
            2. Create a Pipeline job called "PipelineTest" (the name the script refers to) with "disable concurrent builds" turned on and a string parameter called "TestString", with the following script:
              lock('TestResource') {
                  def item = Jenkins.instance.getItem("PipelineTest")
                  // Keep chaining builds only while the last successful build is the immediately preceding one.
                  if (item.getLastSuccessfulBuild().number == (currentBuild.number.toInteger() - 1)) {
                      def number = params.TestString.toInteger() + 1
                      node() {
                          stage('Build') {
                              sleep 2
                              build job: 'PipelineTest', parameters: [string(name: 'TestString', value: "${number}")], wait: false
                          }
                          stage('Again') {
                              sleep 2
                              number = number + number
                              build job: 'PipelineTest', parameters: [string(name: 'TestString', value: "${number}")], wait: false
                          }
                      }
                  } else {
                      // The previous build has not finished successfully, so it may still be running:
                      // collect every build that is currently running and report it to FreestyleTest.
                      currentBuild.result = "SUCCESS"
                      def RunningBuildsString = ""
                      Jenkins.instance.getAllItems(Job).each {
                          def jobBuilds = it.getBuilds()
                          jobBuilds.each {
                              if (it.isBuilding()) { RunningBuildsString = (RunningBuildsString + it.toString() + " ") }
                          }
                      }
                      build job: 'FreestyleTest', parameters: [string(name: 'PreviousBuild', value: "${item.getLastSuccessfulBuild().number}"), string(name: 'CurrentBuild', value: "${currentBuild.number.toInteger()}"), string(name: 'QueueStatus', value: "${RunningBuildsString}")]
                  }
              }
            3. You will have to run this a couple of times, since some script approval is needed. (The Groovy scripting I am doing is not recommended, but it is needed to check the queue status and to read getLastSuccessfulBuild from the filesystem; I wanted to see whether it was just not updating the filesystem in time.)
            4. Once it is ready to run, you will get one "success", and then you will have to start it one more time, after which it will trigger downstream builds indefinitely. You just need to wait for the next build of FreestyleTest, which shows that the last successful build was not the immediately preceding one and reports the queue status. This process takes around 800 builds for me locally, but it does not use much memory or other resources, which is nice.

            I am still testing if this same issue can happen with freestyle builds. Additionally the lock is not needed but you will see that the lock does not seem to matter either. I can also enable throttle concurrent builds to limit the number of builds per minute and it will also reproduce.

            abayer Andrew Bayer added a comment -

            I'm running a tweaked version of that now to see what happens - had to make some changes due to serialization.

            @NonCPS
            def getLastNum() {
                def item = Jenkins.instance.getItemByFullName("bug-reproduction/jenkins-41127")
                echo "${item}"
                return item.getLastSuccessfulBuild().number
            }
            
            def lastNum = getLastNum()
            if(lastNum == (currentBuild.number.toInteger()-1)) {//finding previous number 
                def number = params.TestString.toInteger()+1
                node () {
                    stage ('Build') {
                        sleep 2
                        build job: 'jenkins-41127', parameters: [string(name: 'TestString', value: "${number}")], wait: false
                    }
                    stage ('Again'){
                        sleep 2
                        number = number+number
                        build job: 'jenkins-41127', parameters: [string(name: 'TestString', value: "${number}")], wait: false
                    }
                }
            }
            else {
                currentBuild.result = "SUCCESS"
                def RunningBuildsString = getRunStr()
                build job: 'jenkins-41127-fs', parameters: [string(name: 'PreviousBuild', value: "${lastNum}"), string(name: 'CurrentBuild', value: "${currentBuild.number.toInteger()}"), string(name: 'QueueStatus', value: "${RunningBuildsString}")]
            }
            
            @NonCPS
            def getRunStr() {
                def RunningBuildsString = ""
                Jenkins.instance.getAllItems(Job).each{
                    def jobBuilds=it.getBuilds()
                    jobBuilds.each{
                        if (it.isBuilding()) { RunningBuildsString = (RunningBuildsString + it.toString() + " ") }
                    }  
                }
                return RunningBuildsString
            }
            
            abayer Andrew Bayer added a comment -

            Got it to reproduce eventually, while I had some extra logging in Queue#getCauseOfBlockageForItem (and some other places, but that's the one that gave me something interesting). For the first few hundred jobs, everything was consistent: all the pending items would not be blocked by either Queue#getCauseOfBlockageForTask or QueueTaskDispatcher, they all were not BuildableItem, and they all had isConcurrentBuild() == false. The first item would not find its task in either buildables or pendings, and so would kick off. All the other pending items would find their tasks in pendings and so would stay queued. Yay, that's how it's supposed to be.

            But eventually...first item fine, many consecutive items fine, and then...one of them couldn't find its task in pendings and so kicked off too. That was followed by the rest of the queued items behaving like normal. I haven't yet navigated the Queue#maintain code enough to be sure what exactly the code path here is, but I'm fairly sure that the first item got removed from pendings before the queue processing was complete. I'm trying it again with some additional logging to try to make it more clear what's happening when.
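
A minimal Java sketch of the check described in this comment, for readers who want to see why an early removal from pendings matters. It is a toy model, not Jenkins code: the class name, the string-based task identity, and the `mayStart` helper are all assumptions made for illustration, and only the names `buildables` and `pendings` come from the comment.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of the "is this task already in flight?" check; hypothetical, not Jenkins code. */
class BlockageCheckSketch {
    // Stand-ins for the queue collections named in the comment above.
    final List<String> buildables = new ArrayList<>();
    final List<String> pendings = new ArrayList<>();

    /** A non-concurrent task may start only if no item of the same task is buildable or pending. */
    boolean mayStart(String task, boolean concurrentBuild) {
        if (concurrentBuild) {
            return true;
        }
        return !buildables.contains(task) && !pendings.contains(task);
    }

    public static void main(String[] args) {
        BlockageCheckSketch queue = new BlockageCheckSketch();
        System.out.println(queue.mayStart("pipeline", false)); // true: nothing is in flight yet
        queue.pendings.add("pipeline");
        System.out.println(queue.mayStart("pipeline", false)); // false: the first item is pending
        queue.pendings.remove("pipeline");
        System.out.println(queue.mayStart("pipeline", false)); // true: the pending entry vanished too early
    }
}
```

The last call is the interesting one: once the pending entry disappears before the build is actually running, nothing in this check holds the next item back.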

            abayer Andrew Bayer added a comment -

            So Queue#maintain() is running twice, one immediately after the other, in some cases - probably race conditiony, not yet sure how the two are getting called. Anyway, the first run makes the first item in the queue buildable and calls makeBuildable on the item, removing said item from blockedProjects, and, via makeFlyweightTaskBuildable and createFlyWeightTaskRunnable, starting the flyweight task and adding the first item to pendings. All is well and good. But then the next run of maintain starts - and it can't find the task for the item we just started (theoretically) and put in pendings on any executor...so it removes the item from pendings. Then it gets to checking the queue again, and the new first item doesn't have anything blocking it (i.e., nothing in buildables or pendings) and so...it goes through the same process as the previous item did in the previous maintain run. End result: two builds get started at the same time.

            So - definitely a race condition.
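
The sequence described above can be replayed in a small single-threaded Java model. This is only a sketch under stated assumptions: every field and method here (`blocked`, `running`, `executorRun`, `startedBuilds`) is a hypothetical stand-in, builds are plain integers, and the real Queue#maintain does far more. It does, however, reproduce the reported outcome when two maintenance passes run back to back.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

/** Toy replay of the race; all names are hypothetical stand-ins, not Jenkins code. */
class MaintainRaceSketch {
    final List<Integer> blocked = new ArrayList<>();         // queued builds of one non-concurrent job
    final List<Integer> pendings = new ArrayList<>();        // handed to an executor, not yet running
    final List<Integer> executors = new ArrayList<>();       // work assigned to regular executors
    final List<Integer> oneOffExecutors = new ArrayList<>(); // work assigned to flyweight executors
    final List<Integer> running = new ArrayList<>();
    int started = 0;

    /** One maintenance pass whose lost-pendings scan consults regular executors only. */
    void maintain() {
        List<Integer> lostPendings = new ArrayList<>(pendings);
        lostPendings.removeAll(executors);   // one-off executors are never consulted
        pendings.removeAll(lostPendings);    // so a pending flyweight item is treated as lost
        Iterator<Integer> it = blocked.iterator();
        while (it.hasNext()) {
            Integer build = it.next();
            if (pendings.isEmpty() && running.isEmpty()) { // nothing of this job is in flight
                it.remove();
                pendings.add(build);
                oneOffExecutors.add(build);
                started++;
            }
        }
    }

    /** The executor thread picks up its work: the item leaves pendings and counts as running. */
    void executorRun() {
        for (Integer build : oneOffExecutors) {
            if (pendings.remove(build)) {
                running.add(build);
            }
        }
    }

    /** Queues builds #1 and #2, runs two passes, and returns how many builds were started. */
    static int startedBuilds(boolean executorRunsBetweenPasses) {
        MaintainRaceSketch queue = new MaintainRaceSketch();
        queue.blocked.add(1);
        queue.blocked.add(2);
        queue.maintain();
        if (executorRunsBetweenPasses) {
            queue.executorRun();
        }
        queue.maintain();
        return queue.started;
    }

    public static void main(String[] args) {
        System.out.println(startedBuilds(true));  // 1: the usual interleaving
        System.out.println(startedBuilds(false)); // 2: two passes back to back
    }
}
```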

            abayer Andrew Bayer added a comment -

            fwiw, I think this likely only will happen with a flyweight task - so you could probably brew up a reproduction case with a matrix job, but I doubt you could do so with a freestyle job.

            svanoort Sam Van Oort added a comment -

            abayer Could we recategorize to core on the basis of your analysis?

            recampbell Ryan Campbell added a comment -

            Noting the relationship to JENKINS-30231

            dnusbaum Devin Nusbaum added a comment -

            In my reproductions, the call to Queue#maintain that kicks off the second job concurrently has the following abbreviated state in its initial snapshot:

            Queue.Snapshot { 
                waitingList=[...], 
                blocked=[pipeline #2, ...],
                buildables=[],
                pendings=[pipeline #1]
            }
            

            Interestingly, this is the only call to Queue#maintain out of ~250 builds where pendings is not an empty list.

            Inside of Queue#maintain, pipeline #1 (which is pending) gets removed from pendings, and because the result of makeBuildable on the next line is ignored, pipeline #1 is no longer part of the queue at all, and so nothing is blocking pipeline #2 from being built later on in Queue#maintain.

            I'm not exactly sure why pipeline #1 is removed from the pendings list. Maybe the lostPendings logic is messed up for flyweight tasks? For now I am looking at that logic to see if anything looks wrong. If it looks fine, then I'll try to understand why pipeline #1 is in pendings (maybe the flyweight task is half-started and gets blocked waiting for the Queue lock or something?).

            dnusbaum Devin Nusbaum added a comment - - edited

            Ok, I think the issue with lostPendings and flyweight tasks is that we loop through executors but not oneOffExecutors, which is where flyweight tasks are executed.

            I will test out looping through both tomorrow to see if that fixes it.
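
A rough Java illustration of the idea in this comment, assuming the lost-pendings computation boils down to "pending items that no scanned executor claims". The class, the method signature, and the string labels are hypothetical; the actual change is the one in the pull request linked later in this thread.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

/** Hypothetical sketch of the lost-pendings scan; not the actual Jenkins implementation. */
class LostPendingsSketch {
    /** Returns the pending items that none of the scanned executors is working on. */
    static List<String> lostPendings(List<String> pendings,
                                     List<String> executors,
                                     List<String> oneOffExecutors,
                                     boolean scanOneOffExecutors) {
        List<String> lost = new ArrayList<>(pendings);
        lost.removeAll(executors);
        if (scanOneOffExecutors) {
            lost.removeAll(oneOffExecutors); // flyweight tasks are found here
        }
        return lost;
    }

    public static void main(String[] args) {
        List<String> pendings = Arrays.asList("pipeline #1");
        List<String> executors = Collections.emptyList();
        List<String> oneOffExecutors = Arrays.asList("pipeline #1");
        // Regular executors only: the flyweight item looks lost.
        System.out.println(lostPendings(pendings, executors, oneOffExecutors, false)); // [pipeline #1]
        // Both lists scanned: nothing is lost, so the item stays in pendings.
        System.out.println(lostPendings(pendings, executors, oneOffExecutors, true));  // []
    }
}
```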

            svanoort Sam Van Oort added a comment -

            If that fixes it, it will probably be a very welcome change.

            svanoort Sam Van Oort added a comment - If that fixes it, it will probably be a very welcome change.
            dnusbaum Devin Nusbaum added a comment -

            PR is up: https://github.com/jenkinsci/jenkins/pull/3562. Still looking into creating a regression test for it. I verified the change by running the same reproduction case as Alex/Andrew. Previously, a concurrent build would occur after ~250-750 builds, but after my fix I was able to run 3200 builds without any of them running concurrently.

            dnusbaum Devin Nusbaum added a comment -

            My best guess as to why this happens so infrequently is that normally, after a call to Queue#maintain, the executor owning the flyweight task is the next thread that acquires the Queue's lock (in Executor#run), so Queue.pendings is cleared before the next call to Queue#maintain. In the problematic case, two calls to Queue#maintain happen consecutively without Executor#run having executed yet, so the task is still in Queue.pendings in the second call to Queue#maintain.

            I wonder if using a fair ordering policy for the Queue's lock would make this less likely, or if the Executor's run method isn't even waiting on the lock yet in the problematic case.
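
For reference, this is what a fair ordering policy looks like with `java.util.concurrent.locks.ReentrantLock`. The sketch assumes nothing about how the Queue's lock is actually declared; the thread names are labels chosen for illustration only. It queues two waiters in a known order and shows that a fair lock hands itself to the longest-waiting thread first.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.locks.ReentrantLock;

/** Illustrative only: shows FIFO hand-off of a fair ReentrantLock; not Jenkins code. */
class FairLockSketch {
    /** Queues two waiters in order on a fair lock and returns the order in which they acquired it. */
    static List<String> acquisitionOrder() {
        ReentrantLock lock = new ReentrantLock(true); // true selects the fair ordering policy
        List<String> order = Collections.synchronizedList(new ArrayList<>());
        Thread executorRun = waiter(lock, order, "Executor#run");
        Thread maintain = waiter(lock, order, "Queue#maintain");
        lock.lock(); // hold the lock so that both waiters have to queue
        try {
            startAndAwaitQueued(lock, executorRun);
            startAndAwaitQueued(lock, maintain);
        } finally {
            lock.unlock();
        }
        try {
            executorRun.join();
            maintain.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return order;
    }

    private static Thread waiter(ReentrantLock lock, List<String> order, String name) {
        return new Thread(() -> {
            lock.lock();
            try {
                order.add(name);
            } finally {
                lock.unlock();
            }
        }, name);
    }

    /** Starts the thread and waits until it is queued on the lock. */
    private static void startAndAwaitQueued(ReentrantLock lock, Thread thread) {
        thread.start();
        try {
            while (!lock.hasQueuedThread(thread)) {
                Thread.sleep(1);
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(acquisitionOrder()); // [Executor#run, Queue#maintain]
    }
}
```

Fairness only decides who wins among threads that are already waiting (and stops newly arriving threads from jumping the queue). As the comment notes, it would not help if Executor#run has not even reached the lock yet.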

            dnusbaum Devin Nusbaum added a comment -

            Fixed in Jenkins 2.136. I am marking this as an LTS candidate given the impact and simplicity of the fix, although we will have to give it some time to make sure there are no regressions.


            People

              Assignee: dnusbaum Devin Nusbaum
              Reporter: boon Joe Harte
              Votes: 3
              Watchers: 14
