[JENKINS-69850] Queue maintain falls into an infinite recursive loop, preventing all jobs from being executed

    • Type: Bug
    • Resolution: Fixed
    • Priority: Critical
    • Component: core
    • Released As: 2.375, 2.361.4

      Issue

      After upgrading from 2.340 (jdk8 image) to 2.372 (jdk11 image), just after starting Jenkins, Queue.maintain gets into an infinite recursive loop and throws a StackOverflowError, rendering the Queue unusable (jobs can't run).

      The same scenario occurred twice in production. Everything was fine during tests, but the test instance obviously did not have the same jobs in the Queue.

      Hypothesis on the cause

      This looks very much like an edge case not caught by the tests and validations of this change: JENKINS-68780 - https://github.com/jenkinsci/jenkins/pull/6675, which was introduced in 2.361.

      We do have the priority-sorter-plugin, which could be interfering with the Queue, but I verified that the Blocking Items are all traversed anyway in AbstractProject.

      Technical explanation

      Prerequisites

      • Job has blockBuildWhenDownstreamBuilding or blockBuildWhenUpstreamBuilding enabled
      • Job is blocked without an assigned BlockedItem.causeOfBlockage (null)
        I cannot yet explain how BlockedItem.causeOfBlockage came to be null. I'm still investigating that.

      • However, a null BlockedItem.causeOfBlockage is clearly supported by the code; it only causes the infinite loop since the mentioned modification
        • UPDATE 2022-10-17: It doesn't change the fact that a null causeOfBlockage is supported, but here is where it could originate:
          • Restored from the Queue.xml at startup
          • Instantiated indirectly
          • A plugin
          • Another mechanism?

      The issue comes from the fact that BlockedItem.causeOfBlockage can be null. This has been validated with a heap dump.

      Cleaned-up call chain leading to the issue (reconstructed)

      Precondition: there must be a BlockedItem whose causeOfBlockage is null

      // Read from bottom to top like a stacktrace
      
      -- Again, and so on
      
      hudson.model.Queue$BlockedItem.getCauseOfBlockage(Queue.java:2630) [This is where the null causeOfBlockage is important]
      hudson.model.AbstractProject.getBuildingUpstream(AbstractProject.java:1143)
      hudson.model.AbstractProject.getCauseOfBlockage(AbstractProject.java:1094)
      hudson.model.Queue.getCauseOfBlockageForTask(Queue.java:1240)
      hudson.model.Queue.getCauseOfBlockageForItem(Queue.java:1197) 
      
      -- Another recursion of the loop
      
      hudson.model.Queue$BlockedItem.getCauseOfBlockage(Queue.java:2630) [This is where the null causeOfBlockage is important]
      hudson.model.AbstractProject.getBuildingUpstream(AbstractProject.java:1143)
      hudson.model.AbstractProject.getCauseOfBlockage(AbstractProject.java:1094)
      hudson.model.Queue.getCauseOfBlockageForTask(Queue.java:1240)
      hudson.model.Queue.getCauseOfBlockageForItem(Queue.java:1197)
       
      -- Start of infinite recursive loop
      
      hudson.model.Queue.maintain(Queue.java:1539)
      
      -- Starts here
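
To make the chain above concrete, here is a minimal, self-contained Java sketch of the same mutual recursion. It is not the Jenkins code: `Project` and `BlockedItem` are simplified stand-ins, and the setup with two blocked projects that each see the other's queue item as an upstream is an assumption made purely for illustration (the report does not establish which items formed the cycle). The only point it shows is that a stored cause ends the lookup, while a null cause sends it back into the project.

```java
import java.util.ArrayList;
import java.util.List;

public class QueueRecursionSketch {

    /** Simplified stand-in for AbstractProject (hypothetical, not the Jenkins class). */
    static final class Project {
        final String name;
        // Blocked queue items of projects that this project treats as upstream.
        final List<BlockedItem> blockedUpstreamItems = new ArrayList<>();

        Project(String name) {
            this.name = name;
        }

        // Plays the role of AbstractProject.getCauseOfBlockage -> getBuildingUpstream.
        String getCauseOfBlockage() {
            for (BlockedItem item : blockedUpstreamItems) {
                String cause = item.getCauseOfBlockage();
                if (cause != null) {
                    return "Upstream " + item.task.name + " is blocked: " + cause;
                }
            }
            return null;
        }
    }

    /** Simplified stand-in for Queue.BlockedItem (hypothetical). */
    static final class BlockedItem {
        final Project task;
        final String causeOfBlockage; // may be null, as seen in the heap dump

        BlockedItem(Project task, String causeOfBlockage) {
            this.task = task;
            this.causeOfBlockage = causeOfBlockage;
        }

        String getCauseOfBlockage() {
            if (causeOfBlockage != null) {
                return causeOfBlockage; // stored cause: no further lookup
            }
            // Plays the role of Queue.getCauseOfBlockageForItem -> getCauseOfBlockageForTask:
            // with no stored cause, the project is asked again.
            return task.getCauseOfBlockage();
        }
    }

    /** Builds two mutually dependent blocked items and reports whether the lookup overflows. */
    static boolean overflows(String storedCause) {
        Project a = new Project("A");
        Project b = new Project("B");
        a.blockedUpstreamItems.add(new BlockedItem(b, storedCause));
        b.blockedUpstreamItems.add(new BlockedItem(a, storedCause));
        try {
            a.getCauseOfBlockage();
            return false;
        } catch (StackOverflowError e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println("null cause overflows: " + overflows(null));
        System.out.println("stored cause overflows: " + overflows("waiting for executor"));
    }
}
```

In this model the null case recurses until the JVM throws StackOverflowError, and the non-null case returns after a single lookup.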

      Here's the real stack trace of the StackOverflowError:

       

      {"thread_name":"jenkins.util.Timer [#1]","message":"Timer task hudson.model.Queue$MaintainTask@73873351 failed","timestamp":"2022-10-12 23:26:54.557","level":"SEVERE","mdc":{},"container":"master","logger_name":"hudson.triggers.SafeTimerTask","source_host":"bdbf33cd8b7c","exception_class":"java.lang.StackOverflowError","stacktrace":"java.lang.StackOverflowError
       at hudson.model.AbstractProject.getCauseOfBlockage(AbstractProject.java:1077)
       at hudson.model.Queue.getCauseOfBlockageForTask(Queue.java:1240)
       at hudson.model.Queue.getCauseOfBlockageForItem(Queue.java:1197)
       at hudson.model.Queue$BlockedItem.getCauseOfBlockage(Queue.java:2630)
       at hudson.model.AbstractProject.getBuildingUpstream(AbstractProject.java:1143)
       at hudson.model.AbstractProject.getCauseOfBlockage(AbstractProject.java:1094)
       at hudson.model.Queue.getCauseOfBlockageForTask(Queue.java:1240)
       at hudson.model.Queue.getCauseOfBlockageForItem(Queue.java:1197)
       at hudson.model.Queue$BlockedItem.getCauseOfBlockage(Queue.java:2630)
       at hudson.model.AbstractProject.getBuildingUpstream(AbstractProject.java:1143)
       at hudson.model.AbstractProject.getCauseOfBlockage(AbstractProject.java:1094)
       at hudson.model.Queue.getCauseOfBlockageForTask(Queue.java:1240)
       at hudson.model.Queue.getCauseOfBlockageForItem(Queue.java:1197)
       at hudson.model.Queue$BlockedItem.getCauseOfBlockage(Queue.java:2630)
       at hudson.model.AbstractProject.getBuildingUpstream(AbstractProject.java:1143)
       at hudson.model.AbstractProject.getCauseOfBlockage(AbstractProject.java:1094)
       at hudson.model.Queue.getCauseOfBlockageForTask(Queue.java:1240)
       at hudson.model.Queue.getCauseOfBlockageForItem(Queue.java:1197)
       at hudson.model.Queue$BlockedItem.getCauseOfBlockage(Queue.java:2630)
      
      And it goes on and on and on... until the stack overflows
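
For anyone comparing their own logs against this one, the repetition in the trace can be checked mechanically. The following is a small sketch, not part of Jenkins: `cyclePeriod` is a hypothetical helper that returns the length of the repeating block of frames. The sample frames are copied from the trace above, leaving out the topmost frame, which sits at a different line (AbstractProject.java:1077) because that is where the overflow actually hit.

```java
import java.util.List;

public class StackCycleSketch {

    /**
     * Returns the smallest period p (at most half the list) such that every
     * frame equals the frame p positions later, or -1 if there is none.
     */
    static int cyclePeriod(List<String> frames) {
        int n = frames.size();
        for (int p = 1; p <= n / 2; p++) {
            boolean repeats = true;
            for (int i = 0; i + p < n; i++) {
                if (!frames.get(i).equals(frames.get(i + p))) {
                    repeats = false;
                    break;
                }
            }
            if (repeats) {
                return p;
            }
        }
        return -1;
    }

    /** Two consecutive repetitions taken from the trace in this report. */
    static List<String> sampleFrames() {
        return List.of(
            "hudson.model.Queue.getCauseOfBlockageForTask(Queue.java:1240)",
            "hudson.model.Queue.getCauseOfBlockageForItem(Queue.java:1197)",
            "hudson.model.Queue$BlockedItem.getCauseOfBlockage(Queue.java:2630)",
            "hudson.model.AbstractProject.getBuildingUpstream(AbstractProject.java:1143)",
            "hudson.model.AbstractProject.getCauseOfBlockage(AbstractProject.java:1094)",
            "hudson.model.Queue.getCauseOfBlockageForTask(Queue.java:1240)",
            "hudson.model.Queue.getCauseOfBlockageForItem(Queue.java:1197)",
            "hudson.model.Queue$BlockedItem.getCauseOfBlockage(Queue.java:2630)",
            "hudson.model.AbstractProject.getBuildingUpstream(AbstractProject.java:1143)",
            "hudson.model.AbstractProject.getCauseOfBlockage(AbstractProject.java:1094)");
    }

    public static void main(String[] args) {
        System.out.println("repeating block length: " + cyclePeriod(sampleFrames()));
    }
}
```

A period of 5 matches the five-frame loop in the cleaned-up call chain.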

       

          Rolf Offermanns added a comment - Will this make it into the LTS version? Because LTS is affected...

          Rolf Offermanns added a comment - Any chance to get this into the LTS version?

          Alexander Brandes added a comment - Will be included in the next LTS baseline.

          Steve Hill added a comment -

          Thanks for this fix! I would like to propose this be included in a 2.361.4 LTS.

          We've hit this a few times after restarting controllers. No jobs are able to start until builds are manually canceled from the queue.

          Steve Graham added a comment -

          Had to go back to LTS 2.361.2 since this was a real blocking factor.

          Tim Jacomb added a comment -

          This has been released in 2.361.4

          Steve Hill added a comment -

          timja thank you for creating a new LTS release!

          Steve Graham added a comment -

          Installed the 2.361.4 release yesterday. It stopped again last night. No reason in the log file,
          I will have to go back to 2.361.2 again.
          I have some exceptions but only marked as Warnings. No idea what the real cause was.
          ( I did a grep for exception and got 30000 + since 06. Oct )
          [id=19522] WARNING h.p.b.g.GlobalTimeOutConfiguration#timeOutFor: Monitoring/jenkinsNodes#140870 cannot allow individual jobs to overwrite timeout due to ClassCastException

          java.lang.ClassCastException: class hudson.matrix.MatrixProject cannot be cast to class hudson.model.Project (hudson.matrix.MatrixProject is in unnamed module of loader jenkins.util.URLClassLoader2 @3424daf6; hudson.model.Project is in unnamed module of loader org.eclipse.jetty.webapp.WebAppClassLoader @6f0ca692)

          Basil Crow added a comment -

          sgjenkins This is not a general support thread and what you are describing has nothing to do with this ticket. Please open a new ticket detailing how to reproduce your problem from scratch.

          Steve Graham added a comment -

          My Matrix jobs throw a lot of jobs into the queue simultaneously. It looks like this blows up the queue and kills jenkins. No other jobs were running.
          This crash was new with 2.361.3 and continues with 2.361.4.
          There is unfortunately nothing in the log.

          Thanks for the suggestion. I will wait for 2.375.1.

            Assignee: Unassigned
            Reporter: Louis-Rémi Paquet (l_r)