
Matrix flyweight job crashes with NPE if its triggered jobs are in the queue for a long time

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Critical
    • Component: matrix-project-plugin
    • Labels: None
    • Environment: Jenkins 2.36, Matrix Project Plugin 1.7.1

      I have a Matrix project with 50 configurations that run against a label supported by 25 machines, so when this job runs, it immediately creates a build queue. This job has "execute concurrent builds if necessary" enabled, although I believe I have also seen this issue occur when only one instance of this job is running.

      The build queue on this server can sometimes grow very large, preventing these jobs from running for a long time. After its matrix configuration builds have sat in the queue for some time, the flyweight job fails with the following NullPointerException, which causes Jenkins to interrupt all of the configurations in the job:

      Interrupting #1003
      FATAL: null
      java.lang.NullPointerException
      at hudson.matrix.DefaultMatrixExecutionStrategyImpl.waitForCompletion(DefaultMatrixExecutionStrategyImpl.java:288)
      at hudson.matrix.DefaultMatrixExecutionStrategyImpl.run(DefaultMatrixExecutionStrategyImpl.java:162)
      at hudson.matrix.MatrixBuild$MatrixBuildExecution.doRun(MatrixBuild.java:364)
      at hudson.model.AbstractBuild$AbstractBuildExecution.run(AbstractBuild.java:534)
      at hudson.model.Run.execute(Run.java:1729)
      at hudson.matrix.MatrixBuild.run(MatrixBuild.java:313)
      at hudson.model.ResourceController.execute(ResourceController.java:98)
      at hudson.model.Executor.run(Executor.java:404)
      Finished: FAILURE

      Looking at the source at https://github.com/jenkinsci/matrix-project-plugin/blob/master/src/main/java/hudson/matrix/DefaultMatrixExecutionStrategyImpl.java, the code does appear to check whether the queue item is null at line 288. Could this be a race condition in which the job is assigned to a build machine after the code has checked that the queue item is not null but before the print statement has executed?
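
      To illustrate the kind of race being asked about, here is a minimal sketch of a check-then-use pattern and a variant that reads the value only once. This is not the plugin's code: the class QueueItemSnapshot, the queueItem holder, and both methods are hypothetical stand-ins, and a plain String takes the place of a real queue item.

```java
import java.util.concurrent.atomic.AtomicReference;

class QueueItemSnapshot {
    // Hypothetical holder; assume another thread sets it to null
    // once the build leaves the queue.
    static final AtomicReference<String> queueItem = new AtomicReference<>("item-1");

    // Racy: the value is read twice, so the second read can observe null
    // even though the first read did not.
    static int racyLength() {
        if (queueItem.get() != null) {
            return queueItem.get().length(); // may throw NullPointerException
        }
        return -1;
    }

    // Safer: read once into a local variable, then check and use that local.
    static int safeLength() {
        String qi = queueItem.get();
        return qi != null ? qi.length() : -1;
    }
}
```

      If the null check and the later use already go through the same local variable, this particular race cannot be the cause, and the null must come from somewhere further along the call chain.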

      All I see in the main log of Jenkins is the server logging that it aborts all of the associated matrix jobs that the flyweight job created.

          [JENKINS-40578] Matrix flyweight job crashes with NPE if its triggered jobs are in the queue for a long time

          Gabriel Ash added a comment -

          Actually, after thinking about it a bit more, it looks like qi.getCauseOfBlockage() is returning null, which is causing the problem.

          Could the code just catch the NullPointerException and carry on? This is only a print statement in the console of the flyweight job, which isn't read all that often.
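
          As an alternative to catching the exception, an explicit null check on the cause would avoid the crash as well. The sketch below is only an illustration under assumptions: Item and Cause are hypothetical stand-ins for the queue item and its blockage cause, not the Jenkins classes, and the fallback messages are invented.

```java
class BlockageMessage {
    /** Hypothetical stand-in for a blockage cause. */
    interface Cause { String getShortDescription(); }

    /** Hypothetical stand-in for a queue item. */
    interface Item { Cause getCauseOfBlockage(); }

    /** Builds a console message without dereferencing a null item or cause. */
    static String describe(Item qi) {
        if (qi == null) {
            return "not in the queue";
        }
        Cause cause = qi.getCauseOfBlockage(); // read once; may be null
        if (cause == null) {
            return "in the queue, no blockage reason available";
        }
        return cause.getShortDescription();
    }

    public static void main(String[] args) {
        Item blocked = () -> () -> "Waiting for next available executor";
        Item noCause = () -> null;
        System.out.println(describe(blocked));
        System.out.println(describe(noCause));
        System.out.println(describe(null));
    }
}
```

          Compared with a catch block, the explicit check only handles the one null it is meant to handle, so an unrelated NullPointerException elsewhere in the loop would still surface.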


          Gabriel Ash added a comment - edited

          It looks like matrix project plugin pull request 40 addresses this bug, but it is not currently merged into the mainline.


            Assignee: Kohsuke Kawaguchi (kohsuke)
            Reporter: Gabriel Ash (gabrielbash)