
Builds fail because of "slave went offline during the build"

      Several times now I have had builds fail because of a "Looks like the node went offline during the build. Check the slave log for the details." message.

      That slave (a swarm slave) is still connected, but we want to reboot its host. We have switched it to offline because it still had builds running, and we wanted to wait until they finished, but not have it accept new builds (that's the purpose of Offline, yes?)

      Of course that whole purpose is defeated if switching the slave to Offline also causes running builds to fail.

      Expected:

      • Have a way to cleanly shut down and disconnect an existing slave that has builds running, without disturbing its running builds in any way. Currently that is not possible.
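
      For reference, the drain-then-disconnect flow the reporter expects would look roughly like the sketch below against the Jenkins Java API. setTemporarilyOffline is what the "Mark this node temporarily offline" button calls, and it is only supposed to stop new builds from being scheduled. This is a sketch only; the class name, node name, messages, and polling interval are placeholders:

      import hudson.model.Computer;
      import hudson.slaves.OfflineCause;
      import jenkins.model.Jenkins;

      // Sketch only: node name, messages, and polling interval are placeholders.
      public class DrainNode {
          public static void drainAndDisconnect(String nodeName) throws Exception {
              Computer c = Jenkins.getInstance().getComputer(nodeName);

              // Refuse new builds on this node; executors keep running
              // whatever they already have.
              c.setTemporarilyOffline(true, new OfflineCause.ByCLI("preparing host reboot"));

              // Wait until the running builds have finished...
              while (c.countBusy() > 0) {
                  Thread.sleep(10000L);
              }

              // ...then actually close the channel so the host can be rebooted.
              c.disconnect(new OfflineCause.ByCLI("rebooting host")).get();
          }
      }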

          [JENKINS-17590] Builds fail because of "slave went offline during the build"

          Marc Günther created issue -
          Marc Günther made changes -
          Description Original → New: added "but we want to reboot its host" to the second paragraph, changed "we want to wait" to "we wanted to wait", and corrected "without disturbing that running builds" to "without disturbing its running builds". (The updated description appears above.)
          Summary Original: Build fails because of "slave went offline during the build" New: Builds fail because of "slave went offline during the build"

          Marc Günther added a comment -

          For example, this is from a job with a Maven build step followed by the post-build actions Archive Artifact, Publish HTML, and Publish JUnit:

          ...
          [INFO] BUILD SUCCESSFUL
          [INFO] ------------------------------------------------------------------------
          [INFO] Total time: 27 minutes 17 seconds
          [INFO] Finished at: Fri Apr 12 15:20:59 CEST 2013
          [INFO] Final Memory: 50M/496M
          [INFO] ------------------------------------------------------------------------
          Looks like the node went offline during the build. Check the slave log for the details.
          FATAL: null
          java.lang.NullPointerException
          


          Andrew Erickson added a comment -

          I'm also seeing this when running 1.515 and the two or three releases before it. Output from the console log is below. The slave log didn't have any other details.

          ////////

          Looks like the node went offline during the build. Check the slave log for the details.
          FATAL: null
          java.lang.NullPointerException
          at hudson.plugins.timestamper.annotator.TimestampAnnotatorFactory.getOffset(TimestampAnnotatorFactory.java:65)
          at hudson.plugins.timestamper.annotator.TimestampAnnotatorFactory.newInstance(TimestampAnnotatorFactory.java:52)
          at hudson.console.ConsoleAnnotator._for(ConsoleAnnotator.java:143)
          at hudson.console.ConsoleAnnotator.initial(ConsoleAnnotator.java:133)
          at hudson.console.AnnotatedLargeText.createAnnotator(AnnotatedLargeText.java:140)
          at hudson.console.AnnotatedLargeText.writeHtmlTo(AnnotatedLargeText.java:157)
          at hudson.model.AbstractBuild$AbstractBuildExecution.run(AbstractBuild.java:599)
          at hudson.model.Run.execute(Run.java:1575)
          at hudson.model.FreeStyleBuild.run(FreeStyleBuild.java:46)
          at hudson.model.ResourceController.execute(ResourceController.java:88)
          at hudson.model.Executor.run(Executor.java:241)
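
          For what it's worth, the trace shows the NPE being thrown while Jenkins renders the console annotations at the end of the run (AbstractBuildExecution.run → AnnotatedLargeText.writeHtmlTo → ConsoleAnnotator.initial), inside the Timestamper plugin's TimestampAnnotatorFactory.getOffset. One plausible failure mode, assuming getOffset derives the console offset from the current Stapler request, is an unguarded Stapler.getCurrentRequest() call, since that method returns null when the log is rendered outside an HTTP request (for example by the build thread itself). A defensive helper under that assumption might look like this; it is a sketch, not the plugin's actual code:

          import org.kohsuke.stapler.Stapler;
          import org.kohsuke.stapler.StaplerRequest;

          // Hypothetical sketch, not the Timestamper plugin's real implementation:
          // work out which byte offset of the console log should be annotated,
          // tolerating the case where there is no current HTTP request.
          public class ConsoleOffsetHelper {
              public static long getOffset(long logLength) {
                  StaplerRequest request = Stapler.getCurrentRequest();
                  if (request == null) {
                      return 0; // no request context: annotate from the start of the log
                  }
                  String start = request.getParameter("start");
                  if (start == null || start.isEmpty()) {
                      return 0;
                  }
                  long offset = Long.parseLong(start);
                  // a negative value means "this many bytes from the end of the log"
                  return offset < 0 ? Math.max(0, logLength + offset) : offset;
              }
          }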

          Andrew Erickson made changes -
          Affects Version/s New: current [ 10162 ]

          Andrew Erickson added a comment -

          I'm going to up the priority on this to Major. It affects whether jobs pass or fail, which seems like core functionality.

          This used to work fine... the job could still pass even if the node was taken offline.

          We frequently take nodes offline before a job finishes so that we can inspect the system state (with long-running jobs), so we see this fairly often, and it's somewhat painful to have to rerun a build to get a blue/green ball even though it actually succeeded.

          Andrew Erickson made changes -
          Priority Original: Minor [ 4 ] New: Major [ 3 ]

          Andrew Erickson added a comment -

          I think I've tracked this down to the "Timestamper" plugin (https://wiki.jenkins-ci.org/display/JENKINS/Timestamper).

          Test setup:

          • create a new job that just has one shell command 'sleep 30'

          Testing procedure:

          • start the job
          • before the job finishes, take the node offline
          • see whether the job fails (it should succeed)

          This test procedure also fails on our production server (Jenkins 1.515 with Timestamper 1.5.3).

          A build of git head (1.518) on a virgin local installation does not have this bug. Installing the latest version of Timestamper causes the test to fail (the job causes NPEs just like on our production server).
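
          In case it helps, the manual procedure above can also be automated against the Jenkins test harness. The sketch below assumes a JenkinsRule-based test (the class name, job name, and sleep length are made up); on a Jenkins instance with Timestamper installed, the final assertion would presumably fail the same way the production jobs do:

          import hudson.model.FreeStyleBuild;
          import hudson.model.FreeStyleProject;
          import hudson.model.Result;
          import hudson.model.queue.QueueTaskFuture;
          import hudson.slaves.DumbSlave;
          import hudson.tasks.Shell;
          import org.junit.Rule;
          import org.junit.Test;
          import org.jvnet.hudson.test.JenkinsRule;

          import static org.junit.Assert.assertEquals;

          // Sketch of the manual reproduction as a test: names and timings are
          // illustrative, not taken from the issue.
          public class OfflineDuringBuildTest {

              @Rule public JenkinsRule j = new JenkinsRule();

              @Test
              public void buildShouldSurviveTemporaryOffline() throws Exception {
                  DumbSlave slave = j.createOnlineSlave();
                  FreeStyleProject job = j.createFreeStyleProject("sleep-test");
                  job.setAssignedNode(slave);
                  job.getBuildersList().add(new Shell("sleep 30"));

                  // Start the build, then take the node offline while it is still running.
                  QueueTaskFuture<FreeStyleBuild> future = job.scheduleBuild2(0);
                  future.waitForStart();
                  slave.toComputer().setTemporarilyOffline(true, null);

                  // The build should still finish successfully.
                  assertEquals(Result.SUCCESS, future.get().getResult());
              }
          }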


          Andrew Erickson added a comment -

          Steven,

          Do you have any ideas on this?

          I've used Timestamper for a while before this... so something seems to have changed in core that brought this about.

          Thanks,
          Andy

          Andrew Erickson made changes -
          Assignee New: Steven G Brown [ stevengbrown ]

            Assignee: Steven G Brown (stevengbrown)
            Reporter: Marc Günther (marc_guenther)
            Votes: 0
            Watchers: 2
