Type: Bug
Resolution: Fixed
Priority: Critical
Environment: Jenkins 2.7.1, Jenkins 2.49, Pipeline plugin 2.4
Fixed in: Jenkins 2.136
I have configured a Jenkins pipeline to disable concurrent builds:
properties([ disableConcurrentBuilds() ])
However, I have noticed that on some occasions the next 2 builds are pulled from the pipeline's queue and executed concurrently. Why this occurs is not obvious at all.
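For context, a minimal sketch of the configuration in a scripted pipeline (the node block and stage name here are illustrative, not from the original report):

properties([disableConcurrentBuilds()])

node {
    stage('Build') {
        // With the property in effect, later builds are expected to stay
        // queued until this build finishes.
        echo 'Only one build of this job should be running at a time.'
    }
}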
Issue links:
- relates to: JENKINS-30231 Build creates second workspace@2 for non-concurrent build configuration (Resolved)
[JENKINS-41127] Multiple pipeline instances running concurrently when concurrent execution disabled
Saw this yet again. Haven't seen the problem in hundreds of builds since my last comment above, and today I see that Jenkins pulled the next 2 builds from the queue and ran them concurrently, even though concurrent building is explicitly disabled.
Using the latest version of the Pipeline plugins available at the time of writing, and Jenkins version 2.49.
jglick FYI
Saw this yet again, more frequently, today. Not quite sure how to reproduce it.
jglick Is there anything I can do to help diagnose the problem? This is becoming a serious issue for us, as it is critical we only push one build at a time through certain pipelines.
The pipeline works fine 90% of the time, and then when a build completes Jenkins will (seemingly randomly) pull the next 2 builds from the queue at once and start executing both concurrently, which totally messes up our pipeline environment.
Is there anything I can do to help diagnose the problem?
I guess set breakpoints in, or add logging to, WorkflowJob.isConcurrentBuild or Queue.allowNewBuildableTask.
A workaround would be to use the lock step instead of job-level granularity.
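A rough sketch of that workaround, assuming the Lockable Resources plugin is installed to provide the lock step; the resource name 'my-pipeline-env' is made up for illustration:

properties([disableConcurrentBuilds()])

lock('my-pipeline-env') {
    node {
        stage('Deploy') {
            // Only one build can hold the lock, so even if two builds are
            // started concurrently, the second one waits here rather than
            // entering the critical section.
            echo 'critical section'
        }
    }
}

The lock serializes the protected block itself, so it guards the critical section even when disableConcurrentBuilds() is not honored at the queue level.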
This has just happened to one of my builds. I'll try adding the logging you suggested and look into using lock.
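One way to add that logging without patching Jenkins is a sketch like the following, run from the Script Console; it raises the java.util.logging level for the classes jglick mentioned. Whether those particular methods emit messages at FINE/FINER is an assumption, and a matching log recorder (Manage Jenkins > System Log) is still needed to actually capture the output.

import java.util.logging.Level
import java.util.logging.Logger

// Raise the JUL level for the queue and workflow-job classes involved in the decision.
['hudson.model.Queue',
 'org.jenkinsci.plugins.workflow.job.WorkflowJob'].each { name ->
    Logger.getLogger(name).setLevel(Level.FINER)
}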
There seems to be some sort of race condition, as we're seeing this intermittently. It may be that it's only happening when there are two queued instances of a job with different parameters.
jglick I have a consistent way to reproduce this if it helps:
- Create a Freestyle job (just to catch when the error happens) called "FreestyleTest" with string parameters for "CurrentBuild", "PreviousBuild", and "QueueStatus"
- Create a Pipeline job called "PipelineTest" with the "disable concurrent builds" option turned on and a string parameter called "TestString", using the following script:
lock('TestResource') {
    def item = Jenkins.instance.getItem("PipelineTest")
    if (item.getLastSuccessfulBuild().number == (currentBuild.number.toInteger() - 1)) { // finding previous number
        def number = params.TestString.toInteger() + 1
        node() {
            stage('Build') {
                sleep 2
                build job: 'PipelineTest', parameters: [string(name: 'TestString', value: "${number}")], wait: false
            }
            stage('Again') {
                sleep 2
                number = number + number
                build job: 'PipelineTest', parameters: [string(name: 'TestString', value: "${number}")], wait: false
            }
        }
    } else {
        currentBuild.result = "SUCCESS"
        def RunningBuildsString = ""
        Jenkins.instance.getAllItems(Job).each {
            def jobBuilds = it.getBuilds()
            jobBuilds.each {
                if (it.isBuilding()) {
                    RunningBuildsString = (RunningBuildsString + it.toString() + " ")
                }
            }
        }
        build job: 'FreestyleTest', parameters: [string(name: 'PreviousBuild', value: "${item.getLastSuccessfulBuild().number}"), string(name: 'CurrentBuild', value: "${currentBuild.number.toInteger()}"), string(name: 'QueueStatus', value: "${RunningBuildsString}")]
    }
}
- You will have to run this a couple of times, since there is some script approval you will have to do. (The Groovy scripting I am doing is not recommended, but it is needed to check the queue status and getLastSuccessfulBuild from the filesystem; I wanted to see if it was just not updating the filesystem in time.)
- Once it is ready to run, you will get one "success", and then you will have to start it one more time, at which point it will trigger infinite downstream builds. You just need to wait for the next build of FreestyleTest, which will show you when the previous successful build was not the one immediately before the current build, along with the queue status at that point. This process takes around 800 builds for me locally, but it does not use very much memory or resources, which is nice.
I am still testing whether this same issue can happen with freestyle builds. Additionally, the lock is not needed, but you will see that the lock does not seem to matter either. I can also enable Throttle Concurrent Builds to limit the number of builds per minute, and it will still reproduce.
I'm running a tweaked version of that now to see what happens - had to make some changes due to serialization.
@NonCPS
def getLastNum() {
    def item = Jenkins.instance.getItemByFullName("bug-reproduction/jenkins-41127")
    echo "${item}"
    return item.getLastSuccessfulBuild().number
}

def lastNum = getLastNum()
if (lastNum == (currentBuild.number.toInteger() - 1)) { // finding previous number
    def number = params.TestString.toInteger() + 1
    node() {
        stage('Build') {
            sleep 2
            build job: 'jenkins-41127', parameters: [string(name: 'TestString', value: "${number}")], wait: false
        }
        stage('Again') {
            sleep 2
            number = number + number
            build job: 'jenkins-41127', parameters: [string(name: 'TestString', value: "${number}")], wait: false
        }
    }
} else {
    currentBuild.result = "SUCCESS"
    def RunningBuildsString = getRunStr()
    build job: 'jenkins-41127-fs', parameters: [string(name: 'PreviousBuild', value: "${lastNum}"), string(name: 'CurrentBuild', value: "${currentBuild.number.toInteger()}"), string(name: 'QueueStatus', value: "${RunningBuildsString}")]
}

@NonCPS
def getRunStr() {
    def RunningBuildsString = ""
    Jenkins.instance.getAllItems(Job).each {
        def jobBuilds = it.getBuilds()
        jobBuilds.each {
            if (it.isBuilding()) {
                RunningBuildsString = (RunningBuildsString + it.toString() + " ")
            }
        }
    }
    return RunningBuildsString
}
Got it to reproduce eventually, while I had some extra logging in Queue#getCauseOfBlockageForItem (and some other places, but that's the one that gave me something interesting). For the first few hundred jobs, everything was consistent: none of the pending items was blocked by either Queue#getCauseOfBlockageForTask or a QueueTaskDispatcher, none of them was a BuildableItem, and they all had isConcurrentBuild() == false. The first item would not find its task in either buildables or pendings, and so would kick off. All the other pending items would find their tasks in pendings and so would stay queued. Yay, that's how it's supposed to be.
But eventually...first item fine, many consecutive items fine, and then...one of them couldn't find its task in pendings and so kicked off too. That was followed by the rest of the queued items behaving like normal. I haven't yet navigated the Queue#maintain code enough to be sure what exactly the code path here is, but I'm fairly sure that the first item got removed from pendings before the queue processing was complete. I'm trying it again with some additional logging to try to make it more clear what's happening when.
So Queue#maintain() is running twice, one immediately after the other, in some cases - probably race conditiony, not yet sure how the two are getting called. Anyway, the first run is making the first item in the queue buildable and calls makeBuildable on the item, removing said item from blockedProjects, and, via makeFlyweightTaskBuildable and createFlyWeightTaskRunnable, starting the flyweight task and adding the first item to pendings. All is well and good. But then the next run of maintain starts - and it can't find the task for the item we just started (theoretically) and put in pendings on any executor...so it removes the item from pendings. Then it gets to checking the queue again, and the new first item doesn't have anything blocking it (i.e., nothing in buildables or pending) and so...it goes through the same process as the previous item did in the previous maintain run. End result: two builds get started at the same time.
So - definitely a race condition.
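To make the suspected interleaving concrete, here is a toy model of the sequence described above; it uses plain lists rather than the real Queue classes, and the pruning of "lost" pendings in the second run is an assumption about the behavior, not the actual Jenkins code:

def queue    = ['build #1', 'build #2']  // waiting builds, oldest first
def pendings = []                        // handed to an executor, not yet reported back
def running  = []                        // what the executors actually report as running

def maintain = {
    // Prune "lost" pendings: anything pending that no executor reports as running.
    pendings.removeAll { !(it in running) }
    // Start the oldest queued build that is neither pending nor already running.
    def next = queue.find { !(it in pendings) && !(it in running) }
    if (next != null) {
        queue.remove(next)
        pendings << next   // a flyweight executor has it, but has not reported in yet
        println "started ${next}"
    }
}

maintain()   // starts build #1 and leaves it in pendings
maintain()   // runs again before the executor reports build #1: build #1 is treated as
             // lost and dropped from pendings, so build #2 is started concurrently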
fwiw, I think this likely only will happen with a flyweight task - so you could probably brew up a reproduction case with a matrix job, but I doubt you could do so with a freestyle job.
In my reproductions, the call to Queue#maintain that kicks off the second job concurrently has the following abbreviated state in its initial snapshot:
Queue.Snapshot { waitingList=[...], blocked=[pipeline #2, ...], buildables=[], pendings=[pipeline #1] }
Interestingly, this is the only call to Queue#maintain out of ~250 builds where pendings is not an empty list.
Inside of Queue#maintain, pipeline #1 (which is pending) gets removed from pendings, and because the result of makeBuildable on the next line is ignored, pipeline #1 is no longer part of the queue at all, and so nothing is blocking pipeline #2 from being built later on in Queue#maintain.
I'm not exactly sure why pipeline #1 is removed from the pendings list. Maybe the lostPendings logic is messed up for flyweight tasks? For now I am looking at that logic to see if anything looks wrong. If it looks fine, then I'll try to understand why pipeline #1 is in pendings (maybe the flyweight task is half-started and gets blocked waiting for the Queue lock or something?).
Ok, I think the issue with lostPendings and flyweight tasks is that we loop through executors but not oneOffExecutors, which is where flyweight tasks are executed.
I will test out looping through both tomorrow to see if that fixes it.
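A simplified illustration of that hypothesis; this is not the actual Jenkins source, just the shape of the reconciliation as described above, with made-up helper and property names:

// Hypothetical sketch: collect what is "really running" so the lostPendings
// logic can prune pendings. Consulting only computer.executors misses flyweight
// tasks, which run on one-off executors.
def collectRunningItems(computers) {
    def running = [] as Set
    computers.each { computer ->
        computer.executors.each { e ->
            if (e.currentWorkUnit != null) { running << e.currentWorkUnit.context.item }
        }
        // The change being tested is to also loop through the one-off executors, e.g.:
        // computer.oneOffExecutors.each { e ->
        //     if (e.currentWorkUnit != null) { running << e.currentWorkUnit.context.item }
        // }
    }
    return running
}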
PR is up: https://github.com/jenkinsci/jenkins/pull/3562. Still looking into creating a regression test for it. I verified the change by running the same reproduction case as Alex/Andrew. Previously, a concurrent build would occur after ~250-750 builds, but after my fix I was able to run 3200 builds without any of them running concurrently.
My best guess as to why this happens so infrequently is that normally, after a call to Queue#maintain, the executor owning the flyweight task is the next thread that acquires the Queue's lock (in Executor#run), so Queue.pendings is cleared before the next call to Queue#maintain; but in the problematic case, 2 calls to Queue#maintain happen consecutively without Executor#run being executed yet, so the task is still in Queue.pendings during the second call to Queue#maintain.
I wonder if using a fair ordering policy for the Queue's lock would make this less likely, or if the Executor's run method isn't even waiting on the lock yet in the problematic case.
Fixed in Jenkins 2.136. I am marking this as an LTS candidate given the impact and simplicity of the fix, although we will have to give it some time to make sure there are no regressions.
Bump. Seeing this again. My pipeline is configured to disable concurrent builds, yet I saw two instances running when they were both triggered at exactly the same time, down to the second. Jenkins version now is 2.49.