
Jenkins queue self-locking without apparent reason?

      For some weeks now we have been experiencing problems with the Jenkins queue.

      While looking for dupes before creating this, I found a bunch of similar issues, but I'm not sure whether any of them is the very same issue as this one, because they often mention plugins we are not using at all. Here is a brief list of those "similar" issues, just in case they all turn out to be the same problem: JENKINS-28532, JENKINS-28887, JENKINS-28136, JENKINS-28376, JENKINS-28690...

      One thing they all have in common is that they are really recent and, whatever the problem is, it seems to have started around 1.611. While I don't have the exact version for our case (because we update continuously), I'd say it also started happening recently here.

      Description:

      We have 2 Jenkins servers, a public one (Linux) and a private/testing one (Mac), and we are experiencing the same problem on both. This is the URL of the public one:

      http://integration.moodle.org

      There we have some "chains" of free-form jobs, with all the jobs having both the "Block build when upstream project is building" and "Block build when downstream project is building" settings ticked.

      The first job is always a git-pull-changes one and it starts the "chain" whenever changes are detected in the target branch. We have one chain for every supported branch.

      And this has been working for ages (years). If for any reason a job was manually launched, or the scheduled (every 5 minutes) git job detected new changes, it was never a problem. Those new jobs sat in the queue, waiting for the current "chain" to finish. And, once it had finished, the queue handling was clever enough to pick the first job to execute, also deleting dupes or whatever was needed.

      Basically, the summary is that it never became stuck, no matter how many new jobs were in the queue or how they had landed there (manually or automatically). So far, perfect.

      But some versions ago that changed drastically. Now, if we manually add jobs to the queue, or if multiple changes are detected in a short period of time, those jobs correctly wait in the queue for the current "chain" to end (like they used to; this can be seen by hovering over the queue items). But once the chain has ended, the queue is not able to pick any job to start with, and it stays "locked" forever.

      Right now, if you go to the server above, you'll see 4 jobs, all of them belonging to the "master" view/branch/chain, waiting in the queue, never launched and, worse, preventing new runs on that branch from happening. And the hover information does not show any cause for the wait (screenshots added, showing both manually added jobs while the chain was running and automatic jobs, none of them with a reason for the lock, given that all the executors are idle).

      And those self-locks are really having an impact here, because they are turning our "continuous automatic integration" experience into a "wow, we have not run tests for master for 2 days, wtf, let's kill the queue manually and process all changes together, grrr" thing. Sure you get it, lol.

      Those servers and chains have been working perfectly since the dawn of time and, while we are using various plugins for notification, conditional builds and so on, it seems that the way the queue handles jobs using the core "Block build..." settings has changed recently, easily leading (with both manual and automated changes) to some horrible locks.

      Constantly. And it's a recent "change of behavior". I'm not sure if it's ok to call it a "bug" (although I'm inclined to think so), but I can assure you that it's hurting our integration experience here.

      Finally, we are reproducing this behavior with both 1.617 (testing server) and the older 1.613 (public server).

      Ciao and thanks for all the hard work, you rock

          [JENKINS-28926] Jenkins queue self-locking without apparent reason?

          SCM/JIRA link daemon added a comment -

          Code changed in jenkins
          User: Stephen Connolly
          Path:
          core/src/main/java/hudson/model/Queue.java
          core/src/main/java/hudson/model/queue/QueueSorter.java
          test/src/test/java/hudson/model/QueueTest.java
          http://jenkins-ci.org/commit/jenkins/7929412037ff75f60791cfb23631521f8726c23d
          Log:
          Merge pull request #1743 from stephenc/jenkins-28926

          [FIXED JENKINS-28926] Block while upstream/downstream building cycles never complete

          Compare: https://github.com/jenkinsci/jenkins/compare/482bffa9cb91...7929412037ff

          SCM/JIRA link daemon added a comment -

          Code changed in jenkins
          User: Stephen Connolly
          Path:
          changelog.html
          core/src/main/java/hudson/model/queue/QueueSorter.java
          http://jenkins-ci.org/commit/jenkins/a208dfeac886d67d805505546e49ae52940a191e
          Log:
          JENKINS-28926 Noting merge of #1743

          SCM/JIRA link daemon added a comment -

          Code changed in jenkins
          User: Stephen Connolly
          Path:
          core/src/main/java/hudson/model/Queue.java
          http://jenkins-ci.org/commit/jenkins/8c5b9cd008a4d0fb30dc39d9ee1bd72b95b199f2
          Log:
          JENKINS-28926 Tidy-up TODO for the Java 7+ Jenkins versions

          Kevin Phillips added a comment -

          Can anyone here confirm whether this fix will address the problem described here?

          Further, I have confirmed that the problem (at least as described under JENKINS-28513) is reproducible on the latest LTS release as well (v1.609.1). We finished rolling out this LTS edition into production on our build farm just yesterday and have already had numerous cases where this bug is affecting our production teams.

          As it stands, we've had severely detrimental effects on our development teams as a result of this defect, so the sooner it can be backported the better!

          Eloy Lafuente added a comment -

          Not knowing anything about the internals... if your jobs stayed in the queue forever, never being picked for build, and without any "this is blocked by xxxxx" hover, I'd say the fix here may solve your situation (as far as I understood the discussion on GitHub, it precisely avoids those deadlocks in the queue "without cause").

          But note that I can be 200% wrong; it's just a supposition, based on the "symptom" being the same one I experienced here (even though I do not use the build-blocker-plugin, but the core "Block upstream/downstream" settings instead).

          Surely once 1.618 is out we'll easily know the answer. Ciao


          Kevin Phillips added a comment -

          Since we have experienced severe regression problems with every single Jenkins upgrade we have ever performed, we now have a sandbox environment set up for testing new versions (although apparently our test environment is insufficient to catch all problems, since we still managed to miss this one).

          I only mention that here because I can probably test out 1.618 fairly quickly to see if I can reproduce the problem on our particular configuration, which I would be happy to do if it means we can get the fix backported sooner.

          Just let me know if I can help.

          Stephen Connolly added a comment -

          FYI if you are stuck, killing one of the deadlocked threads (i.e. calling Thread.stop() on the one with Queue.maintain()) from the Groovy console will repair your instance without restarting it.

          We have a CloudBees hotfix for this issue (sadly, only for CloudBees customers) that does just that, i.e. it periodically checks for this type of deadlock and kills the thread with Queue.maintain(), as that is the safe one to kill.

          All the test scenarios we could come up with to reproduce this type of deadlock do not deadlock on 1.618 (but do deadlock on 1.617). That doesn't mean that leedega's deadlock is the same; it may be a different deadlock. Providing the stack trace of the deadlocked threads is the easiest way to confirm or deny.
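For anyone who wants to collect the stack traces asked for above, here is a minimal sketch using the standard JDK management API. It assumes the hang shows up as a lock-level deadlock that the JVM itself can detect, which this thread does not confirm. The class name `DeadlockReport` and the method `describeDeadlocks` are made up for illustration. The sketch only reports and does not stop any thread.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.ArrayList;
import java.util.List;

public class DeadlockReport {

    /** Returns one formatted entry per deadlocked thread, or an empty list if the JVM detects none. */
    public static List<String> describeDeadlocks() {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        long[] ids = mx.findDeadlockedThreads(); // null when no deadlock is detected
        List<String> report = new ArrayList<>();
        if (ids == null) {
            return report;
        }
        // Integer.MAX_VALUE requests the full stack trace of each thread.
        for (ThreadInfo info : mx.getThreadInfo(ids, Integer.MAX_VALUE)) {
            if (info == null) {
                continue; // the thread ended between the two calls
            }
            StringBuilder sb = new StringBuilder();
            sb.append('"').append(info.getThreadName()).append('"')
              .append(" waiting on ").append(info.getLockName())
              .append(" held by ").append(info.getLockOwnerName()).append('\n');
            for (StackTraceElement frame : info.getStackTrace()) {
                sb.append("    at ").append(frame).append('\n');
            }
            report.add(sb.toString());
        }
        return report;
    }

    public static void main(String[] args) {
        List<String> report = describeDeadlocks();
        if (report.isEmpty()) {
            System.out.println("No deadlocked threads detected");
        } else {
            report.forEach(System.out::println);
        }
    }
}
```

Run standalone, this inspects only its own JVM. To inspect a Jenkins instance, the same JDK calls would have to run inside the Jenkins JVM, for example from the Groovy console mentioned above. The thread to look for in the output would be the one whose stack trace contains Queue.maintain().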

          SCM/JIRA link daemon added a comment -

          Code changed in jenkins
          User: Stephen Connolly
          Path:
          core/src/main/java/hudson/model/Queue.java
          core/src/main/java/hudson/model/queue/QueueSorter.java
          http://jenkins-ci.org/commit/jenkins/4f4a64a522ec7bf31f24280827757214e6985f3d
          Log:
          [FIXED JENKINS-28926] Block while upstream/downstream building cycles never complete

          • One could argue that without this change the system is functioning correctly and that previous behaviour
            was a bug. On the other hand, people have come to rely on the previous behaviour.
          • The issue really centres around state changes in the blocked tasks. Since blocking on upstream/downstream
            relies on checking the building projects and the queued (excluding blocked) tasks we need any change in
            the blocked task list to be visible immediately (i.e. update the snapshot)
          • I was able to reliably reproduce this behaviour with a convoluted set of manually configured projects
            but turning this into a test case has not proved quite as easy. Manual testing confirms that the issue is
            fixed for my manual test case
          • I have also added a sorting of the blocked list when probing for tasks to unblock. This should prioritise
            tasks as intended by the QueueSorter

          (cherry picked from commit de87736795898e57f7aca140124c2b1a3d1daf40)
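To illustrate the "update the snapshot" point from the commit message, here is a toy model of why a stale view of the unblocked tasks can keep two mutually blocking jobs cycling forever. This is not the code in Queue.java: the class `BlockedSnapshotSketch`, the job names "up" and "down", and the two-phase loop are all assumptions made for the sake of the example.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class BlockedSnapshotSketch {

    // Hypothetical two-job chain: each job counts as blocked while the other
    // one is queued and not itself blocked.
    public static final Map<String, String> OTHER = Map.of("up", "down", "down", "up");

    /**
     * Runs one maintenance pass over the blocked list and returns the jobs it starts.
     * Jobs that are not started are left in, or moved back into, 'blocked'.
     */
    public static List<String> maintain(List<String> blocked, boolean refreshSnapshot) {
        List<String> buildable = new ArrayList<>();
        Set<String> snapshot = new HashSet<>(); // view of the unblocked queue items

        // Phase 1: move every job that no longer looks blocked to 'buildable'.
        for (String job : new ArrayList<>(blocked)) {
            if (!snapshot.contains(OTHER.get(job))) {
                blocked.remove(job);
                buildable.add(job);
                if (refreshSnapshot) {
                    snapshot = new HashSet<>(buildable); // make the change visible at once
                }
            }
        }

        // Phase 2: re-check each buildable job before starting it.
        snapshot = new HashSet<>(buildable);
        List<String> started = new ArrayList<>();
        for (String job : new ArrayList<>(buildable)) {
            if (snapshot.contains(OTHER.get(job))) {
                buildable.remove(job);
                blocked.add(job); // back to the blocked list
                if (refreshSnapshot) {
                    snapshot = new HashSet<>(buildable);
                }
            } else {
                started.add(job);
            }
        }
        return started;
    }

    public static void main(String[] args) {
        List<String> stale = new ArrayList<>(List.of("up", "down"));
        System.out.println("stale snapshot starts " + maintain(stale, false) + ", still blocked " + stale);

        List<String> fresh = new ArrayList<>(List.of("up", "down"));
        System.out.println("fresh snapshot starts " + maintain(fresh, true) + ", still blocked " + fresh);
    }
}
```

In this model, with a stale snapshot both jobs are unblocked in phase 1, see each other in phase 2, and both end up back in the blocked list, so every later pass repeats the same cycle. With the snapshot refreshed after each move, "up" is started and only "down" keeps waiting. Which job wins depends on the iteration order of the blocked list, which may be related to why the commit also sorts that list.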

          SCM/JIRA link daemon added a comment -

          Code changed in jenkins
          User: Stephen Connolly
          Path:
          test/src/test/java/hudson/model/QueueTest.java
          http://jenkins-ci.org/commit/jenkins/8596004024e9d8a00a99c459b4d7c82c004d1724
          Log:
          JENKINS-28926 Adding test case

          • I was forgetting the call to `rebuildDependencyGraph()` which was why the test didn't work for me

          (cherry picked from commit c44c088442e1821f8cd44f4fdaa146d94dd85910)

          dogfood added a comment -

          Integrated in jenkins_main_trunk #4292
          [FIXED JENKINS-28926] Block while upstream/downstream building cycles never complete (Revision 4f4a64a522ec7bf31f24280827757214e6985f3d)
          JENKINS-28926 Adding test case (Revision 8596004024e9d8a00a99c459b4d7c82c004d1724)

          Result = UNSTABLE
          ogondza : 4f4a64a522ec7bf31f24280827757214e6985f3d
          Files :

          • core/src/main/java/hudson/model/queue/QueueSorter.java
          • core/src/main/java/hudson/model/Queue.java

          ogondza : 8596004024e9d8a00a99c459b4d7c82c004d1724
          Files :

          • test/src/test/java/hudson/model/QueueTest.java


            stephenconnolly Stephen Connolly
            stronk7 Eloy Lafuente