JENKINS-28926: Jenkins queue self-locking without apparent reason?


    Description

For a few weeks now we have been experiencing problems with the Jenkins queue.

While looking for duplicates before creating this, I found a bunch of similar issues, but I'm not sure whether any of them is the very same issue as this one, because they often mention various plugins we are not using at all. Here is a brief list of those "similar" issues, just in case they all turn out to be the same problem in the end: JENKINS-28532, JENKINS-28887, JENKINS-28136, JENKINS-28376, JENKINS-28690...

One thing they all have in common is that they are really recent, and whatever the problem is, it seems to have started around 1.611. While I don't have the exact version for our case (because we update continuously), I'd say it started happening here recently as well.

      Description:

We have two Jenkins servers, a public one (Linux) and a private/testing one (Mac), and we are experiencing the same problem on both. This is the URL of the public one:

      http://integration.moodle.org

There we have some "chains" of free-style jobs, with all the jobs having both the "Block build when upstream project is building" and "Block build when downstream project is building" settings ticked.

      The first job is always a git-pull-changes one and it starts the "chain" whenever changes are detected in the target branch. We have one chain for every supported branch.
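For reference, this is roughly how one of these chains could be wired up from the script console; the job names below are made up for illustration and our real jobs carry a lot more configuration:

{code:groovy}
// Hypothetical sketch of one "chain": a git-polling job that triggers a test job,
// both with the two "Block build..." settings enabled. Job names are invented.
import hudson.model.FreeStyleProject
import hudson.tasks.BuildTrigger
import jenkins.model.Jenkins

def jenkins = Jenkins.getInstance()
def pull  = jenkins.createProject(FreeStyleProject, 'master-git-pull')
def tests = jenkins.createProject(FreeStyleProject, 'master-run-tests')

[pull, tests].each { p ->
    p.setBlockBuildWhenUpstreamBuilding(true)     // "Block build when upstream project is building"
    p.setBlockBuildWhenDownstreamBuilding(true)   // "Block build when downstream project is building"
}

// Chain them: when the pull job finishes it triggers the test job.
pull.getPublishersList().add(new BuildTrigger('master-run-tests', false))
jenkins.rebuildDependencyGraph()   // make the new upstream/downstream relation visible to the queue
{code}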

And this has been working for ages (years). If for any reason a job was launched manually, or the scheduled (every 5 minutes) git job detected new changes, it was never a problem: those new jobs sat in the queue, waiting for the current "chain" to finish, and once it finished, the queue handling was clever enough to pick the first job to execute, also removing duplicates or whatever else was needed.

In short, it never got stuck, no matter how many new jobs were in the queue or how they had landed there (manually or automatically). So far, perfect.

But, as of a few versions ago, that has changed drastically. Now, if we add jobs to the queue manually, or if multiple changes are detected in a short period of time, those jobs in the queue correctly wait for the current "chain" to end (like they used to; this can be seen by hovering over the queue items). But once the chain has ended, the queue is unable to pick any job to start with, and it stays "locked" forever.

Right now, if you go to the server above, you'll see 4 jobs, all of them belonging to the "master" view/branch/chain, waiting in the queue: never launched and, worse, preventing new runs on that branch from happening. The hover information does not show any waiting cause (screenshots attached, showing both jobs added manually while the chain was running and jobs added automatically; none of them shows any reason for the lock, even though all the executors are idle).
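Since the tooltip gives nothing away, the closest we can get to a diagnosis is dumping what the queue itself reports for each pending item from the script console (a rough sketch; the output format is just illustrative):

{code:groovy}
// Print every queued item together with the reason the queue gives for not starting it.
import jenkins.model.Jenkins

def queue = Jenkins.getInstance().getQueue()
queue.getItems().each { item ->
    println "${item.task.getName()} [${item.getClass().getSimpleName()}] " +
            "queued since ${new Date(item.getInQueueSince())}, " +
            "why: ${item.getWhy() ?: 'no reason reported'}"
}
{code}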

These self-locks are really having an impact here, because they are turning our "continuous automatic integration" experience into a "wow, we have not run tests for master in two days, let's kill the queue manually and process all the changes together" routine. I'm sure you get the picture.

Those servers and chains have been working perfectly since the dawn of time and, while we do use various plugins for notification, conditional builds and so on, it seems that the way the queue handles jobs using the core "Block build..." settings has changed recently, easily leading (with both manual and automated changes) to these horrible locks.

Constantly. And it's a recent "change of behaviour". I'm not sure whether it's fair to call it a "bug" (although I'm inclined to think so), but I can assure you it's hurting our integration experience here.

Finally, we can reproduce this behaviour with both 1.617 (testing server) and the older 1.613 (public server).

      Ciao and thanks for all the hard work, you rock


          Activity

Kevin Phillips added a comment:

Since we have experienced severe regression problems with every single Jenkins upgrade we have ever performed, we now have a sandbox environment set up for testing new versions (although apparently our test environment is insufficient to catch all problems, since we still managed to miss this one).

            I only mention that here because I can probably test out 1.618 fairly quickly to see if I can reproduce the problem on our particular configuration, which I would be happy to do if it means we can get the fix backported sooner.

            Just let me know if I can help.


Stephen Connolly added a comment:

FYI, if you are stuck, killing one of the deadlocked threads (i.e. calling Thread.stop() on the one inside Queue.maintain()) from the Groovy console will repair your instance without restarting it.

            We have a CloudBees hotfix for this issue (sadly for CloudBees customers) that does just that, i.e. periodically checks for this type of deadlock and kills the one with Queue.maintain() as that is the safe one to kill.
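A rough sketch of that workaround for the Groovy script console (untested as written, so adapt as needed): it looks for the thread currently inside Queue.maintain() and stops it.

{code:groovy}
// Find the thread stuck inside hudson.model.Queue.maintain() and stop it.
Thread.getAllStackTraces().each { thread, stack ->
    if (stack.any { it.className == 'hudson.model.Queue' && it.methodName == 'maintain' }) {
        println "Stopping ${thread.name}"
        thread.stop()   // deliberate use of the deprecated call, per the workaround above
    }
}
{code}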

None of the test scenarios we could come up with to reproduce this type of deadlock gives rise to a deadlock on 1.618 (though they do deadlock 1.617). That doesn't mean leedega's deadlock is the same one; it may be a different deadlock. Providing the stack traces of the deadlocked threads is the easiest way to confirm or deny it.
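For collecting that information, a minimal script-console sketch (assuming the deadlock is one the JVM's own detector can see; the output format is illustrative):

{code:groovy}
// Ask the JVM for deadlocked threads and print their stack traces.
import java.lang.management.ManagementFactory

def mx = ManagementFactory.getThreadMXBean()
long[] ids = mx.findDeadlockedThreads()
if (ids == null) {
    println 'No deadlocked threads detected by the JVM'
} else {
    mx.getThreadInfo(ids, 64).each { info ->
        println info.getThreadName()
        info.getStackTrace().each { println "    at ${it}" }
    }
}
{code}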


SCM/JIRA link daemon added a comment:

Code changed in jenkins
            User: Stephen Connolly
            Path:
            core/src/main/java/hudson/model/Queue.java
            core/src/main/java/hudson/model/queue/QueueSorter.java
            http://jenkins-ci.org/commit/jenkins/4f4a64a522ec7bf31f24280827757214e6985f3d
            Log:
            [FIXED JENKINS-28926] Block while upstream/downstream building cycles never complete

            • One could argue that without this change the system is functioning correctly and that previous behaviour
              was a bug. On the other hand, people have come to rely on the previous behaviour.
• The issue really centres on state changes in the blocked tasks. Since blocking on upstream/downstream relies on checking the building projects and the queued (excluding blocked) tasks, we need any change in the blocked task list to be visible immediately (i.e. update the snapshot).
            • I was able to reliably reproduce this behaviour with a convoluted set of manually configured projects
              but turning this into a test case has not proved quite as easy. Manual testing confirms that the issue is
              fixed for my manual test case
            • I have also added a sorting of the blocked list when probing for tasks to unblock. This should prioritise
              tasks as intended by the QueueSorter

            (cherry picked from commit de87736795898e57f7aca140124c2b1a3d1daf40)
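For illustration only, a QueueSorter that also orders the blocked list (as the last bullet describes) might look like the sketch below; the sortBlockedItems hook name is assumed from current core rather than copied from this diff:

{code:groovy}
// Sketch of a sorter that keeps both buildable and blocked items oldest-first.
import hudson.Extension
import hudson.model.Queue
import hudson.model.queue.QueueSorter

@Extension
class OldestFirstSorter extends QueueSorter {
    @Override
    void sortBuildableItems(List<Queue.BuildableItem> buildables) {
        buildables.sort { it.getInQueueSince() }   // oldest buildable item first
    }

    @Override
    void sortBlockedItems(List<Queue.BlockedItem> blocked) {
        blocked.sort { it.getInQueueSince() }      // same order when probing for tasks to unblock
    }
}
{code}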


SCM/JIRA link daemon added a comment:

Code changed in jenkins
            User: Stephen Connolly
            Path:
            test/src/test/java/hudson/model/QueueTest.java
            http://jenkins-ci.org/commit/jenkins/8596004024e9d8a00a99c459b4d7c82c004d1724
            Log:
            JENKINS-28926 Adding test case

• I had forgotten the call to `rebuildDependencyGraph()`, which was why the test didn't work for me

            (cherry picked from commit c44c088442e1821f8cd44f4fdaa146d94dd85910)
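For context, this is the kind of programmatic setup such a test needs (a hedged sketch with invented names, not the actual QueueTest code): when projects are wired together in code, the dependency graph has to be rebuilt explicitly or the triggering/blocking relation is simply not seen.

{code:groovy}
// Illustrative JenkinsRule test: an upstream job triggers a downstream job that
// blocks while its upstream is building. Names and assertions are invented.
import hudson.model.FreeStyleProject
import hudson.tasks.BuildTrigger
import org.junit.Rule
import org.junit.Test
import org.jvnet.hudson.test.JenkinsRule

class BlockedChainSketchTest {
    @Rule public JenkinsRule j = new JenkinsRule()

    @Test
    void downstreamIsEventuallyReleased() throws Exception {
        FreeStyleProject up = j.createFreeStyleProject('up')
        FreeStyleProject down = j.createFreeStyleProject('down')
        down.setBlockBuildWhenUpstreamBuilding(true)
        up.getPublishersList().add(new BuildTrigger('down', false))
        j.jenkins.rebuildDependencyGraph()   // without this the new dependency is invisible
        j.buildAndAssertSuccess(up)
        j.waitUntilNoActivity()              // 'down' should run and finish rather than stay blocked
    }
}
{code}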

dogfood added a comment:

            Integrated in jenkins_main_trunk #4292
            [FIXED JENKINS-28926] Block while upstream/downstream building cycles never complete (Revision 4f4a64a522ec7bf31f24280827757214e6985f3d)
            JENKINS-28926 Adding test case (Revision 8596004024e9d8a00a99c459b4d7c82c004d1724)

            Result = UNSTABLE
            ogondza : 4f4a64a522ec7bf31f24280827757214e6985f3d
            Files :

            • core/src/main/java/hudson/model/queue/QueueSorter.java
            • core/src/main/java/hudson/model/Queue.java

            ogondza : 8596004024e9d8a00a99c459b4d7c82c004d1724
            Files :

            • test/src/test/java/hudson/model/QueueTest.java

People

Assignee: Stephen Connolly (stephenconnolly)
Reporter: Eloy Lafuente (stronk7)
Votes: 0
Watchers: 10
