Uploaded image for project: 'Jenkins'
  1. Jenkins
  2. JENKINS-27708

Concurrent build limits not honored on Jenkins 1.607

      After upgrading to 1.607 we noticed the throttle plugin doesn't always prevent jobs from running in parallel as expected.

      Steps to reproduce:
      1. Create a throttle category with global limit 1, per-node limit 1.
      2. Create 3 jobs using the category, restricted to a node with two executors, sleep 60 as a build step.
      3. Request builds of all 3 jobs.

      What should happen:
      4. Jobs run in sequence; 1 then 2 then 3.

      What actually happens:
      5. Job 1 starts building, jobs 2 and 3 wait in queue (OK).
      6. After job 1 finishes, both job 2 and 3 start running (not OK).

      Plugin version is 1.8.4. Issue does not appear in Jenkins 1.606 with the same version of the plugin. Issue is reproducible on a fresh install.

      We were forced to downgrade back to 1.606 as a workaround (which is unfortunately not trivial due to JENKINS-27700).

          [JENKINS-27708] Concurrent build limits not honored on Jenkins 1.607

          Oleg Nenashev added a comment -

          https://github.com/jenkinsci/jenkins/pull/1596 reworks thread synchronization approaches in the core. It may cause such issue.

          Oleg Nenashev added a comment - https://github.com/jenkinsci/jenkins/pull/1596 reworks thread synchronization approaches in the core. It may cause such issue.

          Actually I suspect that https://github.com/jenkinsci/jenkins/pull/1596 just exposes problems with throttle concurrent builds plugin that were masked.

          Throttle concurrent builds "works" by searching for jobs that are running during the QueueTaskDispatcher.canTake evaluation. In Jenkins core, pre: https://github.com/jenkinsci/jenkins/blob/jenkins-1.606/core/src/main/java/hudson/model/Queue.java#L1178 and post: https://github.com/jenkinsci/jenkins/blob/jenkins-607/core/src/main/java/hudson/model/Queue.java#L1352 the QueueTaskDispatcher methods are called for all Buildable jobs in the Queue before mapping any of them against an executor.

          Thus https://github.com/jenkinsci/throttle-concurrent-builds-plugin/blob/throttle-concurrents-1.8.4/src/main/java/hudson/plugins/throttleconcurrents/ThrottleQueueTaskDispatcher.java#L194 will always fail to count jobs that are being picked off the queue at the same time.

          What the TCB plugin should probably be doing is either:

          The code change in 1.607 just removed some of the threading issues that allowed the TCB plugin to work most of the time by accident this bug represents a logical flaw in TCB that could still occur in 1.536-1.606 (most likely if running on a single core master) if the thread sequencing falls just right as Queue.maintain evaluates all projects in Buildable in one chunk and then maps them against executors.

          With 1.536-1.606 the executor threads could pick up the tasks while the Queue was in the process of dividing out the atomic set of work assignments, and thus - as long as the executor threads were scheduled fast enough - the TCB plugin would see the work assignments in the process of being assigned (but on a heavily loaded system or a single core system this issue would still occur)

          With 1.607 we ensure that the work assignment from Queue.maintain is mapped as an atomic unit, and thus the bug in TCB is exposed always.

          At least that is my analysis

          Stephen Connolly added a comment - Actually I suspect that https://github.com/jenkinsci/jenkins/pull/1596 just exposes problems with throttle concurrent builds plugin that were masked. Throttle concurrent builds "works" by searching for jobs that are running during the QueueTaskDispatcher.canTake evaluation. In Jenkins core, pre: https://github.com/jenkinsci/jenkins/blob/jenkins-1.606/core/src/main/java/hudson/model/Queue.java#L1178 and post: https://github.com/jenkinsci/jenkins/blob/jenkins-607/core/src/main/java/hudson/model/Queue.java#L1352 the QueueTaskDispatcher methods are called for all Buildable jobs in the Queue before mapping any of them against an executor. Thus https://github.com/jenkinsci/throttle-concurrent-builds-plugin/blob/throttle-concurrents-1.8.4/src/main/java/hudson/plugins/throttleconcurrents/ThrottleQueueTaskDispatcher.java#L194 will always fail to count jobs that are being picked off the queue at the same time. What the TCB plugin should probably be doing is either: using ResourceLists https://github.com/jenkinsci/jenkins/blob/jenkins-1.606/core/src/main/java/hudson/model/ResourceController.java#L80 to prevent the build from actually starting or checking against both projects on or when counting builds on a node: https://github.com/jenkinsci/throttle-concurrent-builds-plugin/blob/throttle-concurrents-1.8.4/src/main/java/hudson/plugins/throttleconcurrents/ThrottleQueueTaskDispatcher.java#L194 it should actually be including also the Buildable projects against that computer https://github.com/jenkinsci/jenkins/blob/jenkins-1.607/core/src/main/java/hudson/model/Queue.java#L846 The code change in 1.607 just removed some of the threading issues that allowed the TCB plugin to work most of the time by accident this bug represents a logical flaw in TCB that could still occur in 1.536-1.606 (most likely if running on a single core master) if the thread sequencing falls just right as Queue.maintain evaluates all projects in Buildable in one chunk and then maps them against executors. With 1.536-1.606 the executor threads could pick up the tasks while the Queue was in the process of dividing out the atomic set of work assignments, and thus - as long as the executor threads were scheduled fast enough - the TCB plugin would see the work assignments in the process of being assigned (but on a heavily loaded system or a single core system this issue would still occur) With 1.607 we ensure that the work assignment from Queue.maintain is mapped as an atomic unit, and thus the bug in TCB is exposed always. At least that is my analysis

          Oleg Nenashev added a comment -

          Stephen, thanks for the analysis.
          I confirm the issue and hope to work on it soon if somebody does not take the issue. It seems to be a showstopper for jenkins-1.607+.

          Oleg Nenashev added a comment - Stephen, thanks for the analysis. I confirm the issue and hope to work on it soon if somebody does not take the issue. It seems to be a showstopper for jenkins-1.607+.

          > It seems to be a showstopper for jenkins-1.607+.
          the issue's priority should be changed to blocker then

          Alexander Petrov added a comment - > It seems to be a showstopper for jenkins-1.607+. the issue's priority should be changed to blocker then

          Daniel Beck added a comment -

          Could be worse. This is not breaking your Jenkins install (as in crashes).

          Daniel Beck added a comment - Could be worse. This is not breaking your Jenkins install (as in crashes).

          Can you see if the "hack" I have made in http://repo.jenkins-ci.org/snapshots/org/jenkins-ci/main/jenkins-war/1.609-SNAPSHOT/jenkins-war-1.609-20150410.091817-1.war resolves this issue for you?

          diff --git a/core/src/main/java/hudson/model/Queue.java b/core/src/main/java/hudson/model/Queue.java
          index d776a57..6a5e006 100644
          --- a/core/src/main/java/hudson/model/Queue.java
          +++ b/core/src/main/java/hudson/model/Queue.java
          @@ -1335,7 +1335,10 @@ public class Queue extends ResourceController implements Saveable {
                       final QueueSorter s = sorter;
                       if (s != null)
                           s.sortBuildableItems(buildables);
          -
          +            
          +            // JENKINS-27708, JENKINS-27871, etc
          +            updateSnapshot();
          +            
                       // allocate buildable jobs to executors
                       for (BuildableItem p : new ArrayList<BuildableItem>(
                               buildables)) {// copy as we'll mutate the list in the loop
          @@ -1372,6 +1375,9 @@ public class Queue extends ResourceController implements Saveable {
                               makePending(p);
                           else
                               LOGGER.log(Level.FINE, "BuildableItem {0} with empty work units!?", p);
          +                                
          +                // JENKINS-27708, JENKINS-27871, etc
          +                updateSnapshot();
                       }
                   } finally { updateSnapshot(); } } finally {
                       lock.unlock();
          

          Stephen Connolly added a comment - Can you see if the "hack" I have made in http://repo.jenkins-ci.org/snapshots/org/jenkins-ci/main/jenkins-war/1.609-SNAPSHOT/jenkins-war-1.609-20150410.091817-1.war resolves this issue for you? diff --git a/core/src/main/java/hudson/model/Queue.java b/core/src/main/java/hudson/model/Queue.java index d776a57..6a5e006 100644 --- a/core/src/main/java/hudson/model/Queue.java +++ b/core/src/main/java/hudson/model/Queue.java @@ -1335,7 +1335,10 @@ public class Queue extends ResourceController implements Saveable { final QueueSorter s = sorter; if (s != null ) s.sortBuildableItems(buildables); - + + // JENKINS-27708, JENKINS-27871, etc + updateSnapshot(); + // allocate buildable jobs to executors for (BuildableItem p : new ArrayList<BuildableItem>( buildables)) { // copy as we'll mutate the list in the loop @@ -1372,6 +1375,9 @@ public class Queue extends ResourceController implements Saveable { makePending(p); else LOGGER.log(Level.FINE, "BuildableItem {0} with empty work units!?" , p); + + // JENKINS-27708, JENKINS-27871, etc + updateSnapshot(); } } finally { updateSnapshot(); } } finally { lock.unlock();

          It seems to help. When I run the steps from the bug description on the jenkins.war you posted I get correct behavior, the three builds run sequentially.

          Tomasz Śniatowski added a comment - It seems to help. When I run the steps from the bug description on the jenkins.war you posted I get correct behavior, the three builds run sequentially.

          http://repo.jenkins-ci.org/snapshots/org/jenkins-ci/main/jenkins-war/1.609-SNAPSHOT/jenkins-war-1.609-20150411.073620-2.war is an alternative attempt. It would be good if you could confirm that it also resolves the issue.

          diff --git a/core/src/main/java/hudson/model/Queue.java b/core/src/main/java/hudson/model/Queue.java
          index d776a57..3c9367a 100644
          --- a/core/src/main/java/hudson/model/Queue.java
          +++ b/core/src/main/java/hudson/model/Queue.java
          @@ -844,6 +844,17 @@ public class Queue extends ResourceController implements Saveable {
                * Gets all the {@link BuildableItem}s that are waiting for an executor in the given {@link Computer}.
                */
               public List<BuildableItem> getBuildableItems(Computer c) {
          +        if (lock.tryLock()) {
          +            // JENKINS-22708 and JENKINS-27871, if we can get the lock, return live
          +            try {
          +                List<BuildableItem> result = new ArrayList<BuildableItem>();
          +                _getBuildableItems(c, buildables, result);
          +                _getBuildableItems(c, pendings, result);
          +                return result;
          +            } finally {
          +                lock.unlock();
          +            }
          +        }
                   Snapshot snapshot = this.snapshot;
                   List<BuildableItem> result = new ArrayList<BuildableItem>();
                   _getBuildableItems(c, snapshot.buildables, result);
          @@ -865,6 +876,16 @@ public class Queue extends ResourceController implements Saveable {
                * Gets the snapshot of all {@link BuildableItem}s.
                */
               public List<BuildableItem> getBuildableItems() {
          +        if (lock.tryLock()) {
          +            // JENKINS-22708 and JENKINS-27871, if we can get the lock, return live
          +            try {
          +                ArrayList<BuildableItem> r = new ArrayList<BuildableItem>(buildables);
          +                r.addAll(pendings);
          +                return r;
          +            } finally {
          +                lock.unlock();
          +            }
          +        }
                   Snapshot snapshot = this.snapshot;
                   ArrayList<BuildableItem> r = new ArrayList<BuildableItem>(snapshot.buildables);
                   r.addAll(snapshot.pendings);
          @@ -875,6 +896,14 @@ public class Queue extends ResourceController implements Saveable {
                * Gets the snapshot of all {@link BuildableItem}s.
                */
               public List<BuildableItem> getPendingItems() {
          +        if (lock.tryLock()) {
          +            // JENKINS-22708 and JENKINS-27871, if we can get the lock, return live
          +            try {
          +                return new ArrayList<BuildableItem>(pendings);
          +            } finally {
          +                lock.unlock();
          +            }
          +        }
                   return new ArrayList<BuildableItem>(snapshot.pendings);
               }
           
          @@ -902,6 +931,19 @@ public class Queue extends ResourceController implements Saveable {
                * @since 1.402
                */
               public List<Item> getUnblockedItems() {
          +        if (lock.tryLock()) {
          +            // JENKINS-22708 and JENKINS-27871, if we can get the lock, return live
          +            try {
          +                List<Item> queuedNotBlocked = new ArrayList<Item>();
          +                queuedNotBlocked.addAll(waitingList);
          +                queuedNotBlocked.addAll(buildables);
          +                queuedNotBlocked.addAll(pendings);
          +                // but not 'blockedProjects'
          +                return queuedNotBlocked;
          +            } finally {
          +                lock.unlock();
          +            }
          +        }
                   Snapshot snapshot = this.snapshot;
                   List<Item> queuedNotBlocked = new ArrayList<Item>();
                   queuedNotBlocked.addAll(snapshot.waitingList);
          @@ -928,6 +970,16 @@ public class Queue extends ResourceController implements Saveable {
                * Is the given task currently pending execution?
                */
               public boolean isPending(Task t) {
          +        if (lock.tryLock()) {
          +            try {
          +                for (BuildableItem i : pendings)
          +                    if (i.task.equals(t))
          +                        return true;
          +                return false;
          +            } finally {
          +                lock.unlock();
          +            }
          +        }
                   Snapshot snapshot = this.snapshot;
                   for (BuildableItem i : snapshot.pendings)
                       if (i.task.equals(t))
          

          If both work then I will fire each up on my test bed environment and see which passes the concurrency and lock stress... then we commit the simpler fix.

          Stephen Connolly added a comment - http://repo.jenkins-ci.org/snapshots/org/jenkins-ci/main/jenkins-war/1.609-SNAPSHOT/jenkins-war-1.609-20150411.073620-2.war is an alternative attempt. It would be good if you could confirm that it also resolves the issue. diff --git a/core/src/main/java/hudson/model/Queue.java b/core/src/main/java/hudson/model/Queue.java index d776a57..3c9367a 100644 --- a/core/src/main/java/hudson/model/Queue.java +++ b/core/src/main/java/hudson/model/Queue.java @@ -844,6 +844,17 @@ public class Queue extends ResourceController implements Saveable { * Gets all the {@link BuildableItem}s that are waiting for an executor in the given {@link Computer}. */ public List<BuildableItem> getBuildableItems(Computer c) { + if (lock.tryLock()) { + // JENKINS-22708 and JENKINS-27871, if we can get the lock, return live + try { + List<BuildableItem> result = new ArrayList<BuildableItem>(); + _getBuildableItems(c, buildables, result); + _getBuildableItems(c, pendings, result); + return result; + } finally { + lock.unlock(); + } + } Snapshot snapshot = this .snapshot; List<BuildableItem> result = new ArrayList<BuildableItem>(); _getBuildableItems(c, snapshot.buildables, result); @@ -865,6 +876,16 @@ public class Queue extends ResourceController implements Saveable { * Gets the snapshot of all {@link BuildableItem}s. */ public List<BuildableItem> getBuildableItems() { + if (lock.tryLock()) { + // JENKINS-22708 and JENKINS-27871, if we can get the lock, return live + try { + ArrayList<BuildableItem> r = new ArrayList<BuildableItem>(buildables); + r.addAll(pendings); + return r; + } finally { + lock.unlock(); + } + } Snapshot snapshot = this .snapshot; ArrayList<BuildableItem> r = new ArrayList<BuildableItem>(snapshot.buildables); r.addAll(snapshot.pendings); @@ -875,6 +896,14 @@ public class Queue extends ResourceController implements Saveable { * Gets the snapshot of all {@link BuildableItem}s. */ public List<BuildableItem> getPendingItems() { + if (lock.tryLock()) { + // JENKINS-22708 and JENKINS-27871, if we can get the lock, return live + try { + return new ArrayList<BuildableItem>(pendings); + } finally { + lock.unlock(); + } + } return new ArrayList<BuildableItem>(snapshot.pendings); } @@ -902,6 +931,19 @@ public class Queue extends ResourceController implements Saveable { * @since 1.402 */ public List<Item> getUnblockedItems() { + if (lock.tryLock()) { + // JENKINS-22708 and JENKINS-27871, if we can get the lock, return live + try { + List<Item> queuedNotBlocked = new ArrayList<Item>(); + queuedNotBlocked.addAll(waitingList); + queuedNotBlocked.addAll(buildables); + queuedNotBlocked.addAll(pendings); + // but not 'blockedProjects' + return queuedNotBlocked; + } finally { + lock.unlock(); + } + } Snapshot snapshot = this .snapshot; List<Item> queuedNotBlocked = new ArrayList<Item>(); queuedNotBlocked.addAll(snapshot.waitingList); @@ -928,6 +970,16 @@ public class Queue extends ResourceController implements Saveable { * Is the given task currently pending execution? */ public boolean isPending(Task t) { + if (lock.tryLock()) { + try { + for (BuildableItem i : pendings) + if (i.task.equals(t)) + return true ; + return false ; + } finally { + lock.unlock(); + } + } Snapshot snapshot = this .snapshot; for (BuildableItem i : snapshot.pendings) if (i.task.equals(t)) If both work then I will fire each up on my test bed environment and see which passes the concurrency and lock stress... then we commit the simpler fix.

          Both appear to work for me.

          Tomasz Śniatowski added a comment - Both appear to work for me.

          Here is one final version of a fix, presented for posterity - or if we need to revisit.

          diff --git a/core/src/main/java/hudson/model/Queue.java b/core/src/main/java/hudson/model/Queue.java
          index d776a57..67ba711 100644
          --- a/core/src/main/java/hudson/model/Queue.java
          +++ b/core/src/main/java/hudson/model/Queue.java
          @@ -340,6 +340,25 @@ public class Queue extends ResourceController implements Saveable {
           
               private transient final Condition condition = lock.newCondition();
           
          +    /**
          +     * There are some plugins that determine whether a project is blocked from execution on the basis of the 
          +     * jobs currently executing or currently pending execution. When we ask if a {@link Queue.Task} is blocked
          +     * for execution from {@link #maintain()}, we need to ensure that the {@link Queue.Task#isBuildBlocked(Item)}
          +     * is returning a result based on the live information of jobs scheduled rather than a result based on the
          +     * most recent {@link #snapshot}. This field holds the identity of the current thread that is running 
          +     * {@link #maintain()} and thus we can switch to returning live results when called from that thread.
          +     * 
          +     * Two alternative fixes for this class of issue (JENKINS-22708 and JENKINS-27871) could also be used:
          +     * <ul>
          +     *     <li>Update the {@link #snapshot}</li> after each and every {@link Queue.Task} is assigned for execution.</li>
          +     *     <li>Have {@link #maintain()} return after finding one {@link Queue.Task} to execute.</li>
          +     * </ul>
          +     * @since 1.FIXME
          +     * @see #liveResultRequired() 
          +     */
          +    @CheckForNull
          +    private transient volatile Thread maintainThread = null;
          +
               public Queue(@Nonnull LoadBalancer loadBalancer) {
                   this.loadBalancer =  loadBalancer.sanitize();
                   // if all the executors are busy doing something, then the queue won't be maintained in
          @@ -844,6 +863,19 @@ public class Queue extends ResourceController implements Saveable {
                * Gets all the {@link BuildableItem}s that are waiting for an executor in the given {@link Computer}.
                */
               public List<BuildableItem> getBuildableItems(Computer c) {
          +        if (liveResultRequired() && lock.tryLock()) {
          +            // JENKINS-22708 and JENKINS-27871. Note we should get the lock always as it should be already held
          +            // but the safer code construct is to get the lock anyway rather than assume it is held just by virtue
          +            // of the current thread being the thread running maintain().
          +            try {
          +                List<BuildableItem> result = new ArrayList<BuildableItem>();
          +                _getBuildableItems(c, buildables, result);
          +                _getBuildableItems(c, pendings, result);
          +                return result;
          +            } finally {
          +                lock.unlock();
          +            }
          +        }
                   Snapshot snapshot = this.snapshot;
                   List<BuildableItem> result = new ArrayList<BuildableItem>();
                   _getBuildableItems(c, snapshot.buildables, result);
          @@ -865,6 +897,18 @@ public class Queue extends ResourceController implements Saveable {
                * Gets the snapshot of all {@link BuildableItem}s.
                */
               public List<BuildableItem> getBuildableItems() {
          +        if (liveResultRequired() && lock.tryLock()) {
          +            // JENKINS-22708 and JENKINS-27871. Note we should get the lock always as it should be already held
          +            // but the safer code construct is to get the lock anyway rather than assume it is held just by virtue
          +            // of the current thread being the thread running maintain().
          +            try {
          +                ArrayList<BuildableItem> r = new ArrayList<BuildableItem>(buildables);
          +                r.addAll(pendings);
          +                return r;
          +            } finally {
          +                lock.unlock();
          +            }
          +        }
                   Snapshot snapshot = this.snapshot;
                   ArrayList<BuildableItem> r = new ArrayList<BuildableItem>(snapshot.buildables);
                   r.addAll(snapshot.pendings);
          @@ -875,6 +919,16 @@ public class Queue extends ResourceController implements Saveable {
                * Gets the snapshot of all {@link BuildableItem}s.
                */
               public List<BuildableItem> getPendingItems() {
          +        if (liveResultRequired() && lock.tryLock()) {
          +            // JENKINS-22708 and JENKINS-27871. Note we should get the lock always as it should be already held
          +            // but the safer code construct is to get the lock anyway rather than assume it is held just by virtue
          +            // of the current thread being the thread running maintain().
          +            try {
          +                return new ArrayList<BuildableItem>(pendings);
          +            } finally {
          +                lock.unlock();
          +            }
          +        }
                   return new ArrayList<BuildableItem>(snapshot.pendings);
               }
           
          @@ -902,6 +956,21 @@ public class Queue extends ResourceController implements Saveable {
                * @since 1.402
                */
               public List<Item> getUnblockedItems() {
          +        if (liveResultRequired() && lock.tryLock()) {
          +            // JENKINS-22708 and JENKINS-27871. Note we should get the lock always as it should be already held
          +            // but the safer code construct is to get the lock anyway rather than assume it is held just by virtue
          +            // of the current thread being the thread running maintain().
          +            try {
          +                List<Item> queuedNotBlocked = new ArrayList<Item>();
          +                queuedNotBlocked.addAll(waitingList);
          +                queuedNotBlocked.addAll(buildables);
          +                queuedNotBlocked.addAll(pendings);
          +                // but not 'blockedProjects'
          +                return queuedNotBlocked;
          +            } finally {
          +                lock.unlock();
          +            }
          +        }
                   Snapshot snapshot = this.snapshot;
                   List<Item> queuedNotBlocked = new ArrayList<Item>();
                   queuedNotBlocked.addAll(snapshot.waitingList);
          @@ -928,6 +997,19 @@ public class Queue extends ResourceController implements Saveable {
                * Is the given task currently pending execution?
                */
               public boolean isPending(Task t) {
          +        if (liveResultRequired() && lock.tryLock()) {
          +            // JENKINS-22708 and JENKINS-27871. Note we should get the lock always as it should be already held
          +            // but the safer code construct is to get the lock anyway rather than assume it is held just by virtue
          +            // of the current thread being the thread running maintain().
          +            try {
          +                for (BuildableItem i : pendings)
          +                    if (i.task.equals(t))
          +                        return true;
          +                return false;
          +            } finally {
          +                lock.unlock();
          +            }
          +        }
                   Snapshot snapshot = this.snapshot;
                   for (BuildableItem i : snapshot.pendings)
                       if (i.task.equals(t))
          @@ -1250,6 +1332,18 @@ public class Queue extends ResourceController implements Saveable {
               }
           
               /**
          +     * Some functionality that determines whether builds are blocked depending on what builds are currently running
          +     * needs live information rather than information from the most recently updated snapshot. (See
          +     * JENKINS-22708 and JENKINS-27871)
          +     * @return {@code true} if the current thread is running {@link #maintain()} and thus live results should be
          +     * returned rather than results from {@link #snapshot}
          +     * @since 1.FIXME
          +     */
          +    private boolean liveResultRequired() {
          +        return Thread.currentThread() == maintainThread;
          +    }
          +    
          +    /**
                * Queue maintenance.
                *
                * <p>
          @@ -1264,6 +1358,7 @@ public class Queue extends ResourceController implements Saveable {
               public void maintain() {
                   lock.lock();
                   try { try {
          +            maintainThread = Thread.currentThread();
           
                       LOGGER.log(Level.FINE, "Queue maintenance started {0}", this);
           
          @@ -1373,7 +1468,7 @@ public class Queue extends ResourceController implements Saveable {
                           else
                               LOGGER.log(Level.FINE, "BuildableItem {0} with empty work units!?", p);
                       }
          -        } finally { updateSnapshot(); } } finally {
          +        } finally { maintainThread = null; updateSnapshot(); } } finally {
                       lock.unlock();
                   }
               }
          

          I am going to apply Occam's razor and apply a slightly tweaked version of the first fix to core. Jobs are not scheduled for execution at a particularly onerous rate such that the creation of a new snapshot should not cause undue GC pressure, plus the snapshot itself should be easy for GC to clean up and should not really ever end up being promoted out of the eden pool, so simplest fix works

          Stephen Connolly added a comment - Here is one final version of a fix, presented for posterity - or if we need to revisit. diff --git a/core/src/main/java/hudson/model/Queue.java b/core/src/main/java/hudson/model/Queue.java index d776a57..67ba711 100644 --- a/core/src/main/java/hudson/model/Queue.java +++ b/core/src/main/java/hudson/model/Queue.java @@ -340,6 +340,25 @@ public class Queue extends ResourceController implements Saveable { private transient final Condition condition = lock.newCondition(); + /** + * There are some plugins that determine whether a project is blocked from execution on the basis of the + * jobs currently executing or currently pending execution. When we ask if a {@link Queue.Task} is blocked + * for execution from {@link #maintain()}, we need to ensure that the {@link Queue.Task#isBuildBlocked(Item)} + * is returning a result based on the live information of jobs scheduled rather than a result based on the + * most recent {@link #snapshot}. This field holds the identity of the current thread that is running + * {@link #maintain()} and thus we can switch to returning live results when called from that thread. + * + * Two alternative fixes for this class of issue (JENKINS-22708 and JENKINS-27871) could also be used: + * <ul> + * <li>Update the {@link #snapshot}</li> after each and every {@link Queue.Task} is assigned for execution.</li> + * <li>Have {@link #maintain()} return after finding one {@link Queue.Task} to execute.</li> + * </ul> + * @since 1.FIXME + * @see #liveResultRequired() + */ + @CheckForNull + private transient volatile Thread maintainThread = null ; + public Queue(@Nonnull LoadBalancer loadBalancer) { this .loadBalancer = loadBalancer.sanitize(); // if all the executors are busy doing something, then the queue won't be maintained in @@ -844,6 +863,19 @@ public class Queue extends ResourceController implements Saveable { * Gets all the {@link BuildableItem}s that are waiting for an executor in the given {@link Computer}. */ public List<BuildableItem> getBuildableItems(Computer c) { + if (liveResultRequired() && lock.tryLock()) { + // JENKINS-22708 and JENKINS-27871. Note we should get the lock always as it should be already held + // but the safer code construct is to get the lock anyway rather than assume it is held just by virtue + // of the current thread being the thread running maintain(). + try { + List<BuildableItem> result = new ArrayList<BuildableItem>(); + _getBuildableItems(c, buildables, result); + _getBuildableItems(c, pendings, result); + return result; + } finally { + lock.unlock(); + } + } Snapshot snapshot = this .snapshot; List<BuildableItem> result = new ArrayList<BuildableItem>(); _getBuildableItems(c, snapshot.buildables, result); @@ -865,6 +897,18 @@ public class Queue extends ResourceController implements Saveable { * Gets the snapshot of all {@link BuildableItem}s. */ public List<BuildableItem> getBuildableItems() { + if (liveResultRequired() && lock.tryLock()) { + // JENKINS-22708 and JENKINS-27871. Note we should get the lock always as it should be already held + // but the safer code construct is to get the lock anyway rather than assume it is held just by virtue + // of the current thread being the thread running maintain(). + try { + ArrayList<BuildableItem> r = new ArrayList<BuildableItem>(buildables); + r.addAll(pendings); + return r; + } finally { + lock.unlock(); + } + } Snapshot snapshot = this .snapshot; ArrayList<BuildableItem> r = new ArrayList<BuildableItem>(snapshot.buildables); r.addAll(snapshot.pendings); @@ -875,6 +919,16 @@ public class Queue extends ResourceController implements Saveable { * Gets the snapshot of all {@link BuildableItem}s. */ public List<BuildableItem> getPendingItems() { + if (liveResultRequired() && lock.tryLock()) { + // JENKINS-22708 and JENKINS-27871. Note we should get the lock always as it should be already held + // but the safer code construct is to get the lock anyway rather than assume it is held just by virtue + // of the current thread being the thread running maintain(). + try { + return new ArrayList<BuildableItem>(pendings); + } finally { + lock.unlock(); + } + } return new ArrayList<BuildableItem>(snapshot.pendings); } @@ -902,6 +956,21 @@ public class Queue extends ResourceController implements Saveable { * @since 1.402 */ public List<Item> getUnblockedItems() { + if (liveResultRequired() && lock.tryLock()) { + // JENKINS-22708 and JENKINS-27871. Note we should get the lock always as it should be already held + // but the safer code construct is to get the lock anyway rather than assume it is held just by virtue + // of the current thread being the thread running maintain(). + try { + List<Item> queuedNotBlocked = new ArrayList<Item>(); + queuedNotBlocked.addAll(waitingList); + queuedNotBlocked.addAll(buildables); + queuedNotBlocked.addAll(pendings); + // but not 'blockedProjects' + return queuedNotBlocked; + } finally { + lock.unlock(); + } + } Snapshot snapshot = this .snapshot; List<Item> queuedNotBlocked = new ArrayList<Item>(); queuedNotBlocked.addAll(snapshot.waitingList); @@ -928,6 +997,19 @@ public class Queue extends ResourceController implements Saveable { * Is the given task currently pending execution? */ public boolean isPending(Task t) { + if (liveResultRequired() && lock.tryLock()) { + // JENKINS-22708 and JENKINS-27871. Note we should get the lock always as it should be already held + // but the safer code construct is to get the lock anyway rather than assume it is held just by virtue + // of the current thread being the thread running maintain(). + try { + for (BuildableItem i : pendings) + if (i.task.equals(t)) + return true ; + return false ; + } finally { + lock.unlock(); + } + } Snapshot snapshot = this .snapshot; for (BuildableItem i : snapshot.pendings) if (i.task.equals(t)) @@ -1250,6 +1332,18 @@ public class Queue extends ResourceController implements Saveable { } /** + * Some functionality that determines whether builds are blocked depending on what builds are currently running + * needs live information rather than information from the most recently updated snapshot. (See + * JENKINS-22708 and JENKINS-27871) + * @ return {@code true } if the current thread is running {@link #maintain()} and thus live results should be + * returned rather than results from {@link #snapshot} + * @since 1.FIXME + */ + private boolean liveResultRequired() { + return Thread .currentThread() == maintainThread; + } + + /** * Queue maintenance. * * <p> @@ -1264,6 +1358,7 @@ public class Queue extends ResourceController implements Saveable { public void maintain() { lock.lock(); try { try { + maintainThread = Thread .currentThread(); LOGGER.log(Level.FINE, "Queue maintenance started {0}" , this ); @@ -1373,7 +1468,7 @@ public class Queue extends ResourceController implements Saveable { else LOGGER.log(Level.FINE, "BuildableItem {0} with empty work units!?" , p); } - } finally { updateSnapshot(); } } finally { + } finally { maintainThread = null ; updateSnapshot(); } } finally { lock.unlock(); } } I am going to apply Occam's razor and apply a slightly tweaked version of the first fix to core. Jobs are not scheduled for execution at a particularly onerous rate such that the creation of a new snapshot should not cause undue GC pressure, plus the snapshot itself should be easy for GC to clean up and should not really ever end up being promoted out of the eden pool, so simplest fix works

          It appears we have missed the boat for 1.609...

          Stephen Connolly added a comment - It appears we have missed the boat for 1.609...

          https://github.com/jenkinsci/jenkins/pull/1645 and this should hopefully land for 1.610

          Stephen Connolly added a comment - https://github.com/jenkinsci/jenkins/pull/1645 and this should hopefully land for 1.610

          Oleg Nenashev added a comment -

          Reassigned the issue to Stephen

          Oleg Nenashev added a comment - Reassigned the issue to Stephen

          Code changed in jenkins
          User: Stephen Connolly
          Path:
          core/src/main/java/hudson/model/Queue.java
          http://jenkins-ci.org/commit/jenkins/5880ed830201f9349ae9def6653c19a186e1eb18
          Log:
          [FIXED JENKINS-27708][FIXED JENKINS-27871] Ensure that identification of blocked tasks is using the live state.

          • The creation of a snapshot itself should be relatively cheap given the expected rate of
            job execution. You probably would need 100's of jobs starting execution every iteration
            of maintain() before this could even start to become an issue and likely the calculation
            of isBuildBlocked(p) will become a bottleneck before updateSnapshot() will. Additionally
            since the snapshot itself only ever has at most one reference originating outside of the stack
            it should remain in the eden space and thus be cheap to GC.
          • JENKINS-27708 comments 225819 and 225906 provide more complex but logically equivalent fixes of
            this issue. I am favouring this approach as it is simpler and provides less scope for error as any
            new helper methods can just rely on the snapshot being up to date whereas with the other
            two candidates if a new helper method is introduced there is the potential to miss adding support
            for the live view. The comment 225819 has the risk of introducing extra lock contention while
            the comment 225906 version forces every access to the helper methods to pass a second memory
            barrier

          SCM/JIRA link daemon added a comment - Code changed in jenkins User: Stephen Connolly Path: core/src/main/java/hudson/model/Queue.java http://jenkins-ci.org/commit/jenkins/5880ed830201f9349ae9def6653c19a186e1eb18 Log: [FIXED JENKINS-27708] [FIXED JENKINS-27871] Ensure that identification of blocked tasks is using the live state. The creation of a snapshot itself should be relatively cheap given the expected rate of job execution. You probably would need 100's of jobs starting execution every iteration of maintain() before this could even start to become an issue and likely the calculation of isBuildBlocked(p) will become a bottleneck before updateSnapshot() will. Additionally since the snapshot itself only ever has at most one reference originating outside of the stack it should remain in the eden space and thus be cheap to GC. JENKINS-27708 comments 225819 and 225906 provide more complex but logically equivalent fixes of this issue. I am favouring this approach as it is simpler and provides less scope for error as any new helper methods can just rely on the snapshot being up to date whereas with the other two candidates if a new helper method is introduced there is the potential to miss adding support for the live view. The comment 225819 has the risk of introducing extra lock contention while the comment 225906 version forces every access to the helper methods to pass a second memory barrier

          Code changed in jenkins
          User: Stephen Connolly
          Path:
          core/src/main/java/hudson/model/Queue.java
          test/src/test/java/hudson/model/QueueTest.java
          http://jenkins-ci.org/commit/jenkins/152d00ad09931c10f02fab4ac8a42e574d622bd3
          Log:
          Merge pull request #1645 from stephenc/jenkins-27708

          JENKINS-27708, JENKINS-27871 Ensure that identification of blocked tasks is using the live state.

          Compare: https://github.com/jenkinsci/jenkins/compare/b688047877cf...152d00ad0993

          SCM/JIRA link daemon added a comment - Code changed in jenkins User: Stephen Connolly Path: core/src/main/java/hudson/model/Queue.java test/src/test/java/hudson/model/QueueTest.java http://jenkins-ci.org/commit/jenkins/152d00ad09931c10f02fab4ac8a42e574d622bd3 Log: Merge pull request #1645 from stephenc/jenkins-27708 JENKINS-27708 , JENKINS-27871 Ensure that identification of blocked tasks is using the live state. Compare: https://github.com/jenkinsci/jenkins/compare/b688047877cf...152d00ad0993

          dogfood added a comment -

          Integrated in jenkins_main_trunk #4087

          Result = SUCCESS

          dogfood added a comment - Integrated in jenkins_main_trunk #4087 Result = SUCCESS

          Alan Birtles added a comment -

          Did this make it into 1.610?

          Alan Birtles added a comment - Did this make it into 1.610?

          Daniel Beck added a comment - Yes, see: https://github.com/jenkinsci/jenkins/commit/152d00ad09931c10f02fab4ac8a42e574d622bd3 Unfortunately, https://github.com/jenkinsci/jenkins/pull/1645#issuecomment-94141001

          Oleg Nenashev added a comment -

          Marking as fixed

          Oleg Nenashev added a comment - Marking as fixed

          I think with this latest approach there's some issues when both the Build Blocker Plugin and Throttle Concurrent Builds Plugin are used in conjunction. I've had two instances now where builds get stuck in the queue when nothing that is currently executing should be blocking them.

          Hovering over the queue items, they do not offer a reason for the blockage – only a start reason and waiting time in queue.

          Trying to find a consistent method to reproduce and scope of the issue, but regardless our use of the Build Blocker Plugin in this instance is legacy and due to be removed.

          Matthew DeTullio added a comment - I think with this latest approach there's some issues when both the Build Blocker Plugin and Throttle Concurrent Builds Plugin are used in conjunction. I've had two instances now where builds get stuck in the queue when nothing that is currently executing should be blocking them. Hovering over the queue items, they do not offer a reason for the blockage – only a start reason and waiting time in queue. Trying to find a consistent method to reproduce and scope of the issue, but regardless our use of the Build Blocker Plugin in this instance is legacy and due to be removed.

          It seems to be fixed for me as well (I reported duplicate JENKINS-28084).

          But now the same (as in above comment) happened to me multiple times: jobs get stuck from time to time in queue. Nothing else running, but they just won't start. If I cancel ('x' button) them from queue and start them again manually they do start.

          I don't have a Build Blocker Plugin installed.

          Can I check something more when I see this happening again to pinpoint the cause?
          Should we reopen or create separate bug report?

          Andrei Costescu added a comment - It seems to be fixed for me as well (I reported duplicate JENKINS-28084 ). But now the same (as in above comment) happened to me multiple times: jobs get stuck from time to time in queue. Nothing else running, but they just won't start. If I cancel ('x' button) them from queue and start them again manually they do start. I don't have a Build Blocker Plugin installed. Can I check something more when I see this happening again to pinpoint the cause? Should we reopen or create separate bug report?

          Seems like we have similar issues with only BuildBlocker (not the throttle concurrent builds plugin) installed. But we´re still running on 1610. Just looks like queued blocked builds are not triggert (or the regex check) again after one blocking build finished.

          Dirk von Husen added a comment - Seems like we have similar issues with only BuildBlocker (not the throttle concurrent builds plugin) installed. But we´re still running on 1610. Just looks like queued blocked builds are not triggert (or the regex check) again after one blocking build finished.

          Code changed in jenkins
          User: Stephen Connolly
          Path:
          core/src/main/java/hudson/model/Queue.java
          http://jenkins-ci.org/commit/jenkins/11b89d9c0b46f0afad0ca86a11264b503b799c2e
          Log:
          [FIXED JENKINS-27708][FIXED JENKINS-27871] Ensure that identification of blocked tasks is using the live state.

          • The creation of a snapshot itself should be relatively cheap given the expected rate of
            job execution. You probably would need 100's of jobs starting execution every iteration
            of maintain() before this could even start to become an issue and likely the calculation
            of isBuildBlocked(p) will become a bottleneck before updateSnapshot() will. Additionally
            since the snapshot itself only ever has at most one reference originating outside of the stack
            it should remain in the eden space and thus be cheap to GC.
          • JENKINS-27708 comments 225819 and 225906 provide more complex but logically equivalent fixes of
            this issue. I am favouring this approach as it is simpler and provides less scope for error as any
            new helper methods can just rely on the snapshot being up to date whereas with the other
            two candidates if a new helper method is introduced there is the potential to miss adding support
            for the live view. The comment 225819 has the risk of introducing extra lock contention while
            the comment 225906 version forces every access to the helper methods to pass a second memory
            barrier

          (cherry picked from commit 5880ed830201f9349ae9def6653c19a186e1eb18)

          SCM/JIRA link daemon added a comment - Code changed in jenkins User: Stephen Connolly Path: core/src/main/java/hudson/model/Queue.java http://jenkins-ci.org/commit/jenkins/11b89d9c0b46f0afad0ca86a11264b503b799c2e Log: [FIXED JENKINS-27708] [FIXED JENKINS-27871] Ensure that identification of blocked tasks is using the live state. The creation of a snapshot itself should be relatively cheap given the expected rate of job execution. You probably would need 100's of jobs starting execution every iteration of maintain() before this could even start to become an issue and likely the calculation of isBuildBlocked(p) will become a bottleneck before updateSnapshot() will. Additionally since the snapshot itself only ever has at most one reference originating outside of the stack it should remain in the eden space and thus be cheap to GC. JENKINS-27708 comments 225819 and 225906 provide more complex but logically equivalent fixes of this issue. I am favouring this approach as it is simpler and provides less scope for error as any new helper methods can just rely on the snapshot being up to date whereas with the other two candidates if a new helper method is introduced there is the potential to miss adding support for the live view. The comment 225819 has the risk of introducing extra lock contention while the comment 225906 version forces every access to the helper methods to pass a second memory barrier (cherry picked from commit 5880ed830201f9349ae9def6653c19a186e1eb18)

          Dirk, did you find a solution to the new problem - builds not starting when they should?
          I didn't update Jenkins recently to avoid hitting new bugs.

          Andrei Costescu added a comment - Dirk, did you find a solution to the new problem - builds not starting when they should? I didn't update Jenkins recently to avoid hitting new bugs.

          Dirk von Husen added a comment - - edited

          No, did not find a solution so far. I am going to create an issue here soon. Since we were still using an older version, I wanted to update first, check if the problem is still there and then come back to this issue.
          Unfortunately the problem still exists, even after the update to version 1.613.

          -> please notice JENKINS-28376

          Dirk von Husen added a comment - - edited No, did not find a solution so far. I am going to create an issue here soon. Since we were still using an older version, I wanted to update first, check if the problem is still there and then come back to this issue. Unfortunately the problem still exists, even after the update to version 1.613. -> please notice JENKINS-28376

          I added my vote to it. Thanks.

          Andrei Costescu added a comment - I added my vote to it. Thanks.

          This is fundamentally an issue in t-c-b-p making incorrect assumptions

          Stephen Connolly added a comment - This is fundamentally an issue in t-c-b-p making incorrect assumptions

          Oleg Nenashev added a comment -

          costescuandrei , stephenconnolly, dvh_yxlon
          Is it a subject for a new issue?

          Oleg Nenashev added a comment - costescuandrei , stephenconnolly , dvh_yxlon Is it a subject for a new issue?

          Dirk already created JENKINS-28376 for the new problem.
          Don't know how related it is in code to the current one.

          Andrei Costescu added a comment - Dirk already created JENKINS-28376 for the new problem. Don't know how related it is in code to the current one.

          dogfood added a comment -

          Integrated in jenkins_main_trunk #4292
          [FIXED JENKINS-27708][FIXED JENKINS-27871] Ensure that identification of blocked tasks is using the live state. (Revision 11b89d9c0b46f0afad0ca86a11264b503b799c2e)

          Result = UNSTABLE
          ogondza : 11b89d9c0b46f0afad0ca86a11264b503b799c2e
          Files :

          • core/src/main/java/hudson/model/Queue.java

          dogfood added a comment - Integrated in jenkins_main_trunk #4292 [FIXED JENKINS-27708] [FIXED JENKINS-27871] Ensure that identification of blocked tasks is using the live state. (Revision 11b89d9c0b46f0afad0ca86a11264b503b799c2e) Result = UNSTABLE ogondza : 11b89d9c0b46f0afad0ca86a11264b503b799c2e Files : core/src/main/java/hudson/model/Queue.java

            oleg_nenashev Oleg Nenashev
            tsniatowski Tomasz Śniatowski
            Votes:
            2 Vote for this issue
            Watchers:
            17 Start watching this issue

              Created:
              Updated:
              Resolved: