Jenkins / JENKINS-69850

Queue maintain falls into an infinite recursive loop, preventing all jobs from being executed


    • Type: Bug
    • Resolution: Fixed
    • Priority: Critical
    • Component: core
    • 2.375, 2.361.4

      Issue

      After upgrading from 2.340 (jdk8 image) to 2.372 (jdk11 image), the Queue maintain task enters an infinite recursive loop just after Jenkins starts and throws a StackOverflowError, rendering the Queue unusable (jobs can't run).

      The same scenario occurred twice in production. Everything was fine during testing, but the test instance did not have the same jobs in the Queue.

      Hypothesis on the cause

      This looks very much like an edge case not caught by the tests and validations of this change, JENKINS-68780 - https://github.com/jenkinsci/jenkins/pull/6675, which was introduced in 2.361.

      We do have the priority-sorter-plugin, which could be interfering with the Queue, but I verified that the Blocking Items are all traversed in AbstractProject anyway.

      Technical explanation

      Prerequisites

      • Job has blockBuildWhenDownstreamBuilding or blockBuildWhenUpstreamBuilding enabled
      • Job is blocked without an assigned BlockedItem.causeOfBlockage (null)
        I cannot yet explain how BlockedItem.causeOfBlockage came to be null. I'm still investigating that.
      • However, a null BlockedItem.causeOfBlockage is clearly supported by the code, yet it has caused the infinite loop since the change mentioned above
        • UPDATE 2022-10-17: It doesn't change the fact that a null causeOfBlockage is supported, but here is where it could originate:
          • Restored from the Queue.xml at startup
          • Instantiated indirectly
          • A plugin
          • Another mechanism?

      The issue comes from the fact that BlockedItem.causeOfBlockage can be null. This has been validated with a heap dump.

      Cleaned-up call chain leading to the issue (reconstructed)

      There must be a null BlockedItem.causeOfBlockage

      // Read from bottom to top like a stacktrace
      
      -- Again, and so on
      
      hudson.model.Queue$BlockedItem.getCauseOfBlockage(Queue.java:2630) [This is where the null causeOfBlockage is important]
      hudson.model.AbstractProject.getBuildingUpstream(AbstractProject.java:1143)
      hudson.model.AbstractProject.getCauseOfBlockage(AbstractProject.java:1094)
      hudson.model.Queue.getCauseOfBlockageForTask(Queue.java:1240)
      hudson.model.Queue.getCauseOfBlockageForItem(Queue.java:1197) 
      
      -- Another recursion of the loop
      
      hudson.model.Queue$BlockedItem.getCauseOfBlockage(Queue.java:2630) [This is where the null causeOfBlockage is important]
      hudson.model.AbstractProject.getBuildingUpstream(AbstractProject.java:1143)
      hudson.model.AbstractProject.getCauseOfBlockage(AbstractProject.java:1094)
      hudson.model.Queue.getCauseOfBlockageForTask(Queue.java:1240)
      hudson.model.Queue.getCauseOfBlockageForItem(Queue.java:1197)
       
      -- Start of infinite recursive loop
      
      hudson.model.Queue.maintain(Queue.java:1539)
      
      -- Starts here
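
The call chain above can be reproduced in miniature. The following is a toy model only, not the Jenkins source: the class names (ToyQueue, ToyBlockedItem, ToyProject) are hypothetical, the method names are borrowed from the stack trace, and the bodies are simplified assumptions. In particular, the two-project upstream cycle is used purely to make the re-entry visible; this report does not establish which upstream relationship closed the loop in production. What the sketch does illustrate is the difference between a stored cause (returned immediately) and a null cause (recomputed through the queue, which re-enters the project).

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model: method names mirror the stack trace, bodies are simplified assumptions. */
final class BlockageRecursionSketch {

    /** Stand-in for the queue: it can only recompute a cause by asking the item's project. */
    static final class ToyQueue {
        String getCauseOfBlockageForItem(ToyBlockedItem item) {
            return item.project.getCauseOfBlockage();
        }
    }

    /** Stand-in for a blocked queue item whose stored cause may be null. */
    static final class ToyBlockedItem {
        final ToyQueue queue;
        final ToyProject project;
        final String causeOfBlockage; // null triggers the recompute path

        ToyBlockedItem(ToyQueue queue, ToyProject project, String causeOfBlockage) {
            this.queue = queue;
            this.project = project;
            this.causeOfBlockage = causeOfBlockage;
        }

        String getCauseOfBlockage() {
            if (causeOfBlockage != null) {
                return causeOfBlockage;                  // stored cause: no recursion
            }
            return queue.getCauseOfBlockageForItem(this); // null: recompute via the queue
        }
    }

    /** Stand-in for a project that blocks while an upstream project is pending. */
    static final class ToyProject {
        final String name;
        final List<ToyProject> upstream = new ArrayList<>();
        ToyBlockedItem queueItem; // null when the project is not queued

        ToyProject(String name) { this.name = name; }

        String getCauseOfBlockage() {
            ToyProject up = getBuildingUpstream();
            return up == null ? null : "upstream " + up.name + " is pending";
        }

        ToyProject getBuildingUpstream() {
            for (ToyProject up : upstream) {
                // Assumption: a queued upstream counts only if it is not itself blocked,
                // so the upstream item's cause has to be consulted here.
                if (up.queueItem != null && up.queueItem.getCauseOfBlockage() == null) {
                    return up;
                }
            }
            return null;
        }
    }

    /** Returns true if resolving the cause of blockage overflows the stack. */
    static boolean overflows(String storedCause) {
        ToyQueue queue = new ToyQueue();
        ToyProject a = new ToyProject("A");
        ToyProject b = new ToyProject("B");
        a.upstream.add(b);
        b.upstream.add(a); // hypothetical cycle, see the note above
        a.queueItem = new ToyBlockedItem(queue, a, storedCause);
        b.queueItem = new ToyBlockedItem(queue, b, storedCause);
        try {
            a.queueItem.getCauseOfBlockage();
            return false;
        } catch (StackOverflowError e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println("null cause overflows:   " + overflows(null));
        System.out.println("stored cause overflows: " + overflows("already blocked"));
    }
}
```

In this toy, the only thing separating the two runs is whether the item carries a stored cause, which matches the role of the null causeOfBlockage annotated in the call chain.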

      Here is the actual stack trace of the StackOverflowError:

       

      {"thread_name":"jenkins.util.Timer [#1]","message":"Timer task hudson.model.Queue$MaintainTask@73873351 failed","timestamp":"2022-10-12 23:26:54.557","level":"SEVERE","mdc":{},"container":"master","logger_name":"hudson.triggers.SafeTimerTask","source_host":"bdbf33cd8b7c","exception_class":"java.lang.StackOverflowError","stacktrace":"java.lang.StackOverflowError
       at hudson.model.AbstractProject.getCauseOfBlockage(AbstractProject.java:1077)
       at hudson.model.Queue.getCauseOfBlockageForTask(Queue.java:1240)
       at hudson.model.Queue.getCauseOfBlockageForItem(Queue.java:1197)
       at hudson.model.Queue$BlockedItem.getCauseOfBlockage(Queue.java:2630)
       at hudson.model.AbstractProject.getBuildingUpstream(AbstractProject.java:1143)
       at hudson.model.AbstractProject.getCauseOfBlockage(AbstractProject.java:1094)
       at hudson.model.Queue.getCauseOfBlockageForTask(Queue.java:1240)
       at hudson.model.Queue.getCauseOfBlockageForItem(Queue.java:1197)
       at hudson.model.Queue$BlockedItem.getCauseOfBlockage(Queue.java:2630)
       at hudson.model.AbstractProject.getBuildingUpstream(AbstractProject.java:1143)
       at hudson.model.AbstractProject.getCauseOfBlockage(AbstractProject.java:1094)
       at hudson.model.Queue.getCauseOfBlockageForTask(Queue.java:1240)
       at hudson.model.Queue.getCauseOfBlockageForItem(Queue.java:1197)
       at hudson.model.Queue$BlockedItem.getCauseOfBlockage(Queue.java:2630)
       at hudson.model.AbstractProject.getBuildingUpstream(AbstractProject.java:1143)
       at hudson.model.AbstractProject.getCauseOfBlockage(AbstractProject.java:1094)
       at hudson.model.Queue.getCauseOfBlockageForTask(Queue.java:1240)
       at hudson.model.Queue.getCauseOfBlockageForItem(Queue.java:1197)
       at hudson.model.Queue$BlockedItem.getCauseOfBlockage(Queue.java:2630)
      
      And it goes on and on and on... until the stack overflows
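
When a trace like the one above runs to thousands of frames, it can help to confirm mechanically how long the repeating cycle is. The sketch below is a generic diagnostic helper, not part of Jenkins; the class and method names (StackCycleSketch, smallestPeriod) are hypothetical. It returns the smallest period p such that every frame equals the frame p positions later, or -1 if the list does not repeat. The topmost frame of a real trace is often at a different line number than the same method inside the cycle (here AbstractProject.java:1077 versus 1094), so it should be dropped before calling the helper.

```java
import java.util.List;

/** Generic helper: finds the length of the repeating cycle in a list of stack frames. */
final class StackCycleSketch {

    /** Smallest p with frames[i] == frames[i + p] for all valid i, or -1 if none exists. */
    static int smallestPeriod(List<String> frames) {
        int n = frames.size();
        for (int p = 1; p <= n / 2; p++) {       // require at least two full repeats
            boolean repeats = true;
            for (int i = 0; i + p < n; i++) {
                if (!frames.get(i).equals(frames.get(i + p))) {
                    repeats = false;
                    break;
                }
            }
            if (repeats) {
                return p;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // Frames copied from the trace above, top frame (AbstractProject.java:1077) dropped.
        List<String> frames = List.of(
            "hudson.model.Queue.getCauseOfBlockageForTask(Queue.java:1240)",
            "hudson.model.Queue.getCauseOfBlockageForItem(Queue.java:1197)",
            "hudson.model.Queue$BlockedItem.getCauseOfBlockage(Queue.java:2630)",
            "hudson.model.AbstractProject.getBuildingUpstream(AbstractProject.java:1143)",
            "hudson.model.AbstractProject.getCauseOfBlockage(AbstractProject.java:1094)",
            "hudson.model.Queue.getCauseOfBlockageForTask(Queue.java:1240)",
            "hudson.model.Queue.getCauseOfBlockageForItem(Queue.java:1197)",
            "hudson.model.Queue$BlockedItem.getCauseOfBlockage(Queue.java:2630)",
            "hudson.model.AbstractProject.getBuildingUpstream(AbstractProject.java:1143)",
            "hudson.model.AbstractProject.getCauseOfBlockage(AbstractProject.java:1094)");
        System.out.println("cycle length: " + smallestPeriod(frames));
    }
}
```

For the frames listed in this report the cycle is five frames long, which corresponds to the five-method loop in the cleaned-up call chain.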

       

            Assignee: Unassigned
            Reporter: Louis-Rémi Paquet (l_r)
            Votes: 1
            Watchers: 9
