-
Bug
-
Resolution: Fixed
-
Critical
-
Powered by SuggestiMate
Do a fresh install of Jenkins. Make 3+ executors.
Create project A, with downstream B.
Have it `sleep 15`.
Create project B, with downstream C.
Set it to block on building upstream projects.
Have it `sleep 30`.
Create project C.
Set it to block on building upstream projects.
Have it `sleep 30`.
Trigger A, B, and C together, in succession.
As A builds, notice that C is blocked, and waits for the (transitive) upstream project A to finish.
However, once A finished, B and C both start!
Only B should have started. C should have stayed blocked until B finished. Instead, C behaved as if A – and A alone – were the upstream project.
[JENKINS-27871] Block on upstream projects does not work
Could you test Jenkins 1.606 and 1.607 to see whether this was introduced in the latter?
Be aware that downgrading from 1.607+ to 1.605 or earlier will lose slave configuration.
I suspect that this is a bug introduced in Jenkins Core in 1.536 but which would only be visible on a single core non-hyperthreaded master or on a very highly contended master until Jenkins 1.607
It seems similar to the issue in https://issues.jenkins-ci.org/browse/JENKINS-27708
I may have an idea for a hack to solve this type of issue... I'll do some thinking
Yes 1.607 is the first affected version.
@stephenconnolly, I bisected it to 92147c3597308bc05e6448ccc41409fcc7c05fd7.
I'd take a crack at it, but seeing the author of that commit, I'm sure you'd be far better at this
Can you see if the "hack" I have made in http://repo.jenkins-ci.org/snapshots/org/jenkins-ci/main/jenkins-war/1.609-SNAPSHOT/jenkins-war-1.609-20150410.091817-1.war resolves this issue for you?
diff --git a/core/src/main/java/hudson/model/Queue.java b/core/src/main/java/hudson/model/Queue.java index d776a57..6a5e006 100644 --- a/core/src/main/java/hudson/model/Queue.java +++ b/core/src/main/java/hudson/model/Queue.java @@ -1335,7 +1335,10 @@ public class Queue extends ResourceController implements Saveable { final QueueSorter s = sorter; if (s != null) s.sortBuildableItems(buildables); - + + // JENKINS-27708, JENKINS-27871, etc + updateSnapshot(); + // allocate buildable jobs to executors for (BuildableItem p : new ArrayList<BuildableItem>( buildables)) {// copy as we'll mutate the list in the loop @@ -1372,6 +1375,9 @@ public class Queue extends ResourceController implements Saveable { makePending(p); else LOGGER.log(Level.FINE, "BuildableItem {0} with empty work units!?", p); + + // JENKINS-27708, JENKINS-27871, etc + updateSnapshot(); } } finally { updateSnapshot(); } } finally { lock.unlock();
If it is this easy to reproduce, should be easy to write a test for it too, no?
@Paul Draper, try Jenkins 1.536 with a single core non-hyper-threaded master and you should see (I'm 95% certain) that the issue is there but not in 1.534 (as 1.535 was a borked release)
http://repo.jenkins-ci.org/snapshots/org/jenkins-ci/main/jenkins-war/1.609-SNAPSHOT/jenkins-war-1.609-20150411.073620-2.war is an alternative attempt. It would be good if you could confirm that it also resolves the issue.
diff --git a/core/src/main/java/hudson/model/Queue.java b/core/src/main/java/hudson/model/Queue.java index d776a57..3c9367a 100644 --- a/core/src/main/java/hudson/model/Queue.java +++ b/core/src/main/java/hudson/model/Queue.java @@ -844,6 +844,17 @@ public class Queue extends ResourceController implements Saveable { * Gets all the {@link BuildableItem}s that are waiting for an executor in the given {@link Computer}. */ public List<BuildableItem> getBuildableItems(Computer c) { + if (lock.tryLock()) { + // JENKINS-22708 and JENKINS-27871, if we can get the lock, return live + try { + List<BuildableItem> result = new ArrayList<BuildableItem>(); + _getBuildableItems(c, buildables, result); + _getBuildableItems(c, pendings, result); + return result; + } finally { + lock.unlock(); + } + } Snapshot snapshot = this.snapshot; List<BuildableItem> result = new ArrayList<BuildableItem>(); _getBuildableItems(c, snapshot.buildables, result); @@ -865,6 +876,16 @@ public class Queue extends ResourceController implements Saveable { * Gets the snapshot of all {@link BuildableItem}s. */ public List<BuildableItem> getBuildableItems() { + if (lock.tryLock()) { + // JENKINS-22708 and JENKINS-27871, if we can get the lock, return live + try { + ArrayList<BuildableItem> r = new ArrayList<BuildableItem>(buildables); + r.addAll(pendings); + return r; + } finally { + lock.unlock(); + } + } Snapshot snapshot = this.snapshot; ArrayList<BuildableItem> r = new ArrayList<BuildableItem>(snapshot.buildables); r.addAll(snapshot.pendings); @@ -875,6 +896,14 @@ public class Queue extends ResourceController implements Saveable { * Gets the snapshot of all {@link BuildableItem}s. */ public List<BuildableItem> getPendingItems() { + if (lock.tryLock()) { + // JENKINS-22708 and JENKINS-27871, if we can get the lock, return live + try { + return new ArrayList<BuildableItem>(pendings); + } finally { + lock.unlock(); + } + } return new ArrayList<BuildableItem>(snapshot.pendings); } @@ -902,6 +931,19 @@ public class Queue extends ResourceController implements Saveable { * @since 1.402 */ public List<Item> getUnblockedItems() { + if (lock.tryLock()) { + // JENKINS-22708 and JENKINS-27871, if we can get the lock, return live + try { + List<Item> queuedNotBlocked = new ArrayList<Item>(); + queuedNotBlocked.addAll(waitingList); + queuedNotBlocked.addAll(buildables); + queuedNotBlocked.addAll(pendings); + // but not 'blockedProjects' + return queuedNotBlocked; + } finally { + lock.unlock(); + } + } Snapshot snapshot = this.snapshot; List<Item> queuedNotBlocked = new ArrayList<Item>(); queuedNotBlocked.addAll(snapshot.waitingList); @@ -928,6 +970,16 @@ public class Queue extends ResourceController implements Saveable { * Is the given task currently pending execution? */ public boolean isPending(Task t) { + if (lock.tryLock()) { + try { + for (BuildableItem i : pendings) + if (i.task.equals(t)) + return true; + return false; + } finally { + lock.unlock(); + } + } Snapshot snapshot = this.snapshot; for (BuildableItem i : snapshot.pendings) if (i.task.equals(t))
If both work then I will fire each up on my test bed environment and see which passes the concurrency and lock stress... then we commit the simpler fix.
Cool I'll set them both up for a run-off to see if either breaks the issue I was trying to fix and see which is better performing... (I also have a 3rd variant of the 2nd version that should be more performant and less likely to cause a regression elsewhere, but as it's more complex...)
I would expect to get the fix committed towards 1.610... if 1.607-1.609 get selected for the next LTS then the fix for this should be a viable candidate for back-porting as it does not change any exposed APIs
https://github.com/jenkinsci/jenkins/pull/1645 and this should hopefully land for 1.610
Code changed in jenkins
User: Stephen Connolly
Path:
core/src/main/java/hudson/model/Queue.java
http://jenkins-ci.org/commit/jenkins/5880ed830201f9349ae9def6653c19a186e1eb18
Log:
[FIXED JENKINS-27708][FIXED JENKINS-27871] Ensure that identification of blocked tasks is using the live state.
- The creation of a snapshot itself should be relatively cheap given the expected rate of
job execution. You probably would need 100's of jobs starting execution every iteration
of maintain() before this could even start to become an issue and likely the calculation
of isBuildBlocked(p) will become a bottleneck before updateSnapshot() will. Additionally
since the snapshot itself only ever has at most one reference originating outside of the stack
it should remain in the eden space and thus be cheap to GC.
JENKINS-27708comments 225819 and 225906 provide more complex but logically equivalent fixes of
this issue. I am favouring this approach as it is simpler and provides less scope for error as any
new helper methods can just rely on the snapshot being up to date whereas with the other
two candidates if a new helper method is introduced there is the potential to miss adding support
for the live view. The comment 225819 has the risk of introducing extra lock contention while
the comment 225906 version forces every access to the helper methods to pass a second memory
barrier
Code changed in jenkins
User: Oleg Nenashev
Path:
test/src/test/java/hudson/model/QueueTest.java
http://jenkins-ci.org/commit/jenkins/7514e8c6ce35283da4a8bb4422fe885350fc8681
Log:
JENKINS-27871 - Added a direct unit test for the issue
Code changed in jenkins
User: Stephen Connolly
Path:
test/src/test/java/hudson/model/QueueTest.java
http://jenkins-ci.org/commit/jenkins/57efbea3d0326f8b24bbe458fa1d25f6f45589f8
Log:
Merge pull request #2 from oleg-nenashev/JENKINS-27871
JENKINS-27871 - Added a direct unit test for the issue
Code changed in jenkins
User: Stephen Connolly
Path:
core/src/main/java/hudson/model/Queue.java
test/src/test/java/hudson/model/QueueTest.java
http://jenkins-ci.org/commit/jenkins/152d00ad09931c10f02fab4ac8a42e574d622bd3
Log:
Merge pull request #1645 from stephenc/jenkins-27708
JENKINS-27708, JENKINS-27871 Ensure that identification of blocked tasks is using the live state.
Compare: https://github.com/jenkinsci/jenkins/compare/b688047877cf...152d00ad0993
Code changed in jenkins
User: Stephen Connolly
Path:
core/src/main/java/hudson/model/Queue.java
http://jenkins-ci.org/commit/jenkins/11b89d9c0b46f0afad0ca86a11264b503b799c2e
Log:
[FIXED JENKINS-27708][FIXED JENKINS-27871] Ensure that identification of blocked tasks is using the live state.
- The creation of a snapshot itself should be relatively cheap given the expected rate of
job execution. You probably would need 100's of jobs starting execution every iteration
of maintain() before this could even start to become an issue and likely the calculation
of isBuildBlocked(p) will become a bottleneck before updateSnapshot() will. Additionally
since the snapshot itself only ever has at most one reference originating outside of the stack
it should remain in the eden space and thus be cheap to GC.
JENKINS-27708comments 225819 and 225906 provide more complex but logically equivalent fixes of
this issue. I am favouring this approach as it is simpler and provides less scope for error as any
new helper methods can just rely on the snapshot being up to date whereas with the other
two candidates if a new helper method is introduced there is the potential to miss adding support
for the live view. The comment 225819 has the risk of introducing extra lock contention while
the comment 225906 version forces every access to the helper methods to pass a second memory
barrier
(cherry picked from commit 5880ed830201f9349ae9def6653c19a186e1eb18)
Code changed in jenkins
User: Oleg Nenashev
Path:
test/src/test/java/hudson/model/QueueTest.java
http://jenkins-ci.org/commit/jenkins/0d1efdad304df3c9cf5bd9157cce9e9fec5643aa
Log:
JENKINS-27871 - Added a direct unit test for the issue
(cherry picked from commit 7514e8c6ce35283da4a8bb4422fe885350fc8681)
Compare: https://github.com/jenkinsci/jenkins/compare/58edc35abf6c^...0d1efdad304d
Integrated in jenkins_main_trunk #4292
[FIXED JENKINS-27708][FIXED JENKINS-27871] Ensure that identification of blocked tasks is using the live state. (Revision 11b89d9c0b46f0afad0ca86a11264b503b799c2e)
JENKINS-27871 - Added a direct unit test for the issue (Revision 0d1efdad304df3c9cf5bd9157cce9e9fec5643aa)
Result = UNSTABLE
ogondza : 11b89d9c0b46f0afad0ca86a11264b503b799c2e
Files :
- core/src/main/java/hudson/model/Queue.java
ogondza : 0d1efdad304df3c9cf5bd9157cce9e9fec5643aa
Files :
- test/src/test/java/hudson/model/QueueTest.java
I have found that this bug is present in 1.608, but not in 1.596.2.