Type: Bug
Resolution: Fixed
Priority: Critical
Environment: Jenkins 2.7.1, Jenkins 2.49, Pipeline plugin 2.4
Fixed in: Jenkins 2.136
I have configured a Jenkins pipeline to disable concurrent builds:
properties([ disableConcurrentBuilds() ])
However, I have noticed that on some occasions the next 2 builds are pulled from the pipeline's queue and executed concurrently. Why this occurs is not obvious at all.
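For context, a minimal sketch of the configuration in a scripted pipeline (the node block and stage name here are illustrative, not from the original report):

properties([disableConcurrentBuilds()])

node {
    stage('Build') {
        // With the property in effect, later builds are expected to stay
        // queued until this build finishes.
        echo 'Only one build of this job should be running at a time.'
    }
}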
Issue links:
- relates to: JENKINS-30231 Build creates second workspace@2 for non-concurrent build configuration (Resolved)
[JENKINS-41127] Multiple pipeline instances running concurrently when concurrent execution disabled
Saw this yet again. Haven't seen the problem in hundreds of builds since my last comment above, and today I see that Jenkins pulled the next 2 builds from the queue and ran them concurrently, even though concurrent building is explicitly disabled.
Using the latest version of the Pipeline plugins available at the time of writing, and Jenkins version 2.49.
jglick FYI
Saw this yet again, more frequently, today. Not quite sure how to reproduce it.
jglick Is there anything I can do to help diagnose the problem? This is becoming a serious issue for us, as it is critical we only push one build at a time through certain pipelines.
The pipeline works fine 90% of the time, and then when a build completes Jenkins will (seemingly randomly) pull the next 2 builds from the queue at once and start executing both concurrently, which totally messes up our pipeline environment.
Is there anything I can do to help diagnose the problem?
I guess set breakpoints in, or add logging to, WorkflowJob.isConcurrentBuild or Queue.allowNewBuildableTask.
A workaround would be to use the lock step instead of job-level granularity.
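A rough sketch of that workaround, assuming the Lockable Resources plugin is installed to provide the lock step; the resource name 'my-pipeline-env' is made up for illustration:

properties([disableConcurrentBuilds()])

lock('my-pipeline-env') {
    node {
        stage('Deploy') {
            // Only one build can hold the lock, so even if two builds are
            // started concurrently, the second one waits here rather than
            // entering the critical section.
            echo 'critical section'
        }
    }
}

The lock serializes the protected block itself, so it guards the critical section even when disableConcurrentBuilds() is not honored at the queue level.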
This has just happened to one of my builds. I'll try adding the logging you suggested and look into using lock.
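One way to add that logging without patching Jenkins is a sketch like the following, run from the Script Console; it raises the java.util.logging level for the classes jglick mentioned. Whether those particular methods emit messages at FINE/FINER is an assumption, and a matching log recorder (Manage Jenkins > System Log) is still needed to actually capture the output.

import java.util.logging.Level
import java.util.logging.Logger

// Raise the JUL level for the queue and workflow-job classes involved in the decision.
['hudson.model.Queue',
 'org.jenkinsci.plugins.workflow.job.WorkflowJob'].each { name ->
    Logger.getLogger(name).setLevel(Level.FINER)
}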
There seems to be some sort of race condition, as we're seeing this intermittently. It may be that it's only happening when there are two queued instances of a job with different parameters.
jglick I have a consistent way to reproduce this if it helps:
- Create a Freestyle job (just to catch when the error happens) called "FreestyleTest" with string parameters for "CurrentBuild", "PreviousBuild", and "QueueStatus"
- Create a Pipeline job called "PipelineTest" with the "disable concurrent builds" option turned on and a string parameter called "TestString", using the following script:
lock('TestResource') {
    def item = Jenkins.instance.getItem("PipelineTest")
    if (item.getLastSuccessfulBuild().number == (currentBuild.number.toInteger() - 1)) { // finding previous number
        def number = params.TestString.toInteger() + 1
        node() {
            stage('Build') {
                sleep 2
                build job: 'PipelineTest', parameters: [string(name: 'TestString', value: "${number}")], wait: false
            }
            stage('Again') {
                sleep 2
                number = number + number
                build job: 'PipelineTest', parameters: [string(name: 'TestString', value: "${number}")], wait: false
            }
        }
    } else {
        currentBuild.result = "SUCCESS"
        def RunningBuildsString = ""
        Jenkins.instance.getAllItems(Job).each {
            def jobBuilds = it.getBuilds()
            jobBuilds.each {
                if (it.isBuilding()) {
                    RunningBuildsString = (RunningBuildsString + it.toString() + " ")
                }
            }
        }
        build job: 'FreestyleTest', parameters: [string(name: 'PreviousBuild', value: "${item.getLastSuccessfulBuild().number}"), string(name: 'CurrentBuild', value: "${currentBuild.number.toInteger()}"), string(name: 'QueueStatus', value: "${RunningBuildsString}")]
    }
}
- You will have to run this a couple of times, since there is some script approval you will have to do. (The Groovy scripting I am doing is not recommended, but it is needed to check the queue status and getLastSuccessfulBuild from the filesystem; I wanted to see if it was just not updating the filesystem in time.)
- Once it is ready to run, you will get one "success", and then you will have to start it one more time, at which point it will trigger infinite downstream builds. You just need to wait for the next build of FreestyleTest, which will show you when the previous successful build was not the one immediately before the current build, along with the queue status at that point. This process takes around 800 builds for me locally, but it does not use very much memory or resources, which is nice.
I am still testing whether this same issue can happen with freestyle builds. Additionally, the lock is not needed, but you will see that the lock does not seem to matter either. I can also enable Throttle Concurrent Builds to limit the number of builds per minute, and it will still reproduce.
I'm running a tweaked version of that now to see what happens - had to make some changes due to serialization.
@NonCPS
def getLastNum() {
    def item = Jenkins.instance.getItemByFullName("bug-reproduction/jenkins-41127")
    echo "${item}"
    return item.getLastSuccessfulBuild().number
}

def lastNum = getLastNum()
if (lastNum == (currentBuild.number.toInteger() - 1)) { // finding previous number
    def number = params.TestString.toInteger() + 1
    node() {
        stage('Build') {
            sleep 2
            build job: 'jenkins-41127', parameters: [string(name: 'TestString', value: "${number}")], wait: false
        }
        stage('Again') {
            sleep 2
            number = number + number
            build job: 'jenkins-41127', parameters: [string(name: 'TestString', value: "${number}")], wait: false
        }
    }
} else {
    currentBuild.result = "SUCCESS"
    def RunningBuildsString = getRunStr()
    build job: 'jenkins-41127-fs', parameters: [string(name: 'PreviousBuild', value: "${lastNum}"), string(name: 'CurrentBuild', value: "${currentBuild.number.toInteger()}"), string(name: 'QueueStatus', value: "${RunningBuildsString}")]
}

@NonCPS
def getRunStr() {
    def RunningBuildsString = ""
    Jenkins.instance.getAllItems(Job).each {
        def jobBuilds = it.getBuilds()
        jobBuilds.each {
            if (it.isBuilding()) {
                RunningBuildsString = (RunningBuildsString + it.toString() + " ")
            }
        }
    }
    return RunningBuildsString
}
Got it to reproduce eventually, while I had some extra logging in Queue#getCauseOfBlockageForItem (and some other places, but that's the one that gave me something interesting). For the first few hundred jobs, everything was consistent: none of the pending items was blocked by either Queue#getCauseOfBlockageForTask or a QueueTaskDispatcher, none of them was a BuildableItem, and they all had isConcurrentBuild() == false. The first item would not find its task in either buildables or pendings, and so would kick off. All the other pending items would find their tasks in pendings and so would stay queued. Yay, that's how it's supposed to be.
But eventually...first item fine, many consecutive items fine, and then...one of them couldn't find its task in pendings and so kicked off too. That was followed by the rest of the queued items behaving like normal. I haven't yet navigated the Queue#maintain code enough to be sure what exactly the code path here is, but I'm fairly sure that the first item got removed from pendings before the queue processing was complete. I'm trying it again with some additional logging to try to make it more clear what's happening when.
So Queue#maintain() is running twice, one immediately after the other, in some cases - probably race conditiony, not yet sure how the two are getting called. Anyway, the first run is making the first item in the queue buildable and calls makeBuildable on the item, removing said item from blockedProjects, and, via makeFlyweightTaskBuildable and createFlyWeightTaskRunnable, starting the flyweight task and adding the first item to pendings. All is well and good. But then the next run of maintain starts - and it can't find the task for the item we just started (theoretically) and put in pendings on any executor...so it removes the item from pendings. Then it gets to checking the queue again, and the new first item doesn't have anything blocking it (i.e., nothing in buildables or pending) and so...it goes through the same process as the previous item did in the previous maintain run. End result: two builds get started at the same time.
So - definitely a race condition.
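To make the suspected interleaving concrete, here is a toy model of the sequence described above; it uses plain lists rather than the real Queue classes, and the pruning of "lost" pendings in the second run is an assumption about the behavior, not the actual Jenkins code:

def queue    = ['build #1', 'build #2']  // waiting builds, oldest first
def pendings = []                        // handed to an executor, not yet reported back
def running  = []                        // what the executors actually report as running

def maintain = {
    // Prune "lost" pendings: anything pending that no executor reports as running.
    pendings.removeAll { !(it in running) }
    // Start the oldest queued build that is neither pending nor already running.
    def next = queue.find { !(it in pendings) && !(it in running) }
    if (next != null) {
        queue.remove(next)
        pendings << next   // a flyweight executor has it, but has not reported in yet
        println "started ${next}"
    }
}

maintain()   // starts build #1 and leaves it in pendings
maintain()   // runs again before the executor reports build #1: build #1 is treated as
             // lost and dropped from pendings, so build #2 is started concurrently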
fwiw, I think this likely only will happen with a flyweight task - so you could probably brew up a reproduction case with a matrix job, but I doubt you could do so with a freestyle job.
In my reproductions, the call to Queue#maintain that kicks off the second job concurrently has the following abbreviated state in its initial snapshot:
Queue.Snapshot { waitingList=[...], blocked=[pipeline #2, ...], buildables=[], pendings=[pipeline #1] }
Interestingly, this is the only call to Queue#maintain out of ~250 builds where pendings is not an empty list.
Inside of Queue#maintain, pipeline #1 (which is pending) gets removed from pendings, and because the result of makeBuildable on the next line is ignored, pipeline #1 is no longer part of the queue at all, and so nothing is blocking pipeline #2 from being built later on in Queue#maintain.
I'm not exactly sure why pipeline #1 is removed from the pendings list. Maybe the lostPendings logic is messed up for flyweight tasks? For now I am looking at that logic to see if anything looks wrong. If it looks fine, then I'll try to understand why pipeline #1 is in pendings (maybe the flyweight task is half-started and gets blocked waiting for the Queue lock or something?).
Ok, I think the issue with lostPendings and flyweight tasks is that we loop through executors but not oneOffExecutors, which is where flyweight tasks are executed.
I will test out looping through both tomorrow to see if that fixes it.
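A simplified illustration of that hypothesis; this is not the actual Jenkins source, just the shape of the reconciliation as described above, with made-up helper and property names:

// Hypothetical sketch: collect what is "really running" so the lostPendings
// logic can prune pendings. Consulting only computer.executors misses flyweight
// tasks, which run on one-off executors.
def collectRunningItems(computers) {
    def running = [] as Set
    computers.each { computer ->
        computer.executors.each { e ->
            if (e.currentWorkUnit != null) { running << e.currentWorkUnit.context.item }
        }
        // The change being tested is to also loop through the one-off executors, e.g.:
        // computer.oneOffExecutors.each { e ->
        //     if (e.currentWorkUnit != null) { running << e.currentWorkUnit.context.item }
        // }
    }
    return running
}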
PR is up: https://github.com/jenkinsci/jenkins/pull/3562. Still looking into creating a regression test for it. I verified the change by running the same reproduction case as Alex/Andrew. Previously, a concurrent build would occur after ~250-750 builds, but after my fix I was able to run 3200 builds without any of them running concurrently.
My best guess as to why this happens so infrequently is that normally, after a call to Queue#maintain, the executor owning the flyweight task is the next thread that acquires the Queue's lock (in Executor#run), so Queue.pendings is cleared before the next call to Queue#maintain; but in the problematic case, 2 calls to Queue#maintain happen consecutively without Executor#run being executed yet, so the task is still in Queue.pendings during the second call to Queue#maintain.
I wonder if using a fair ordering policy for the Queue's lock would make this less likely, or if the Executor's run method isn't even waiting on the lock yet in the problematic case.
Fixed in Jenkins 2.136. I am marking this as an LTS candidate given the impact and simplicity of the fix, although we will have to give it some time to make sure there are no regressions.
Bump. Seeing this again. My pipeline is configured to disable concurrent builds, yet I saw two instances running when they were both triggered at exactly the same time, down to the second. Jenkins version now is 2.49.