Parent builds sometimes hang on successful child builds of same type

    • Type: Bug
    • Resolution: Postponed
    • Priority: Major
    • multijob-plugin
    • None
    • jenkins 1.593
      multijob plugin 1.16

      With a job configuration like:

      parent_job:
      child_job
      child_job

      occasionally we'll get output that looks like this:

      Starting build job child_job.
      Starting build job child_job.
      Finished Build : #123 - Job : child_job with status : SUCCESS
      <this hangs forever, so we abort the parent_job>
      Aborting all subjobs.
      Finished Build : #124 - Job : child_job with status : ABORTED

      However, build #124 has always finished successfully by that point. We see this relatively rarely (roughly 5% of parent_job builds). My guess is that a race condition is causing this, since we usually see it during times of high load (i.e., when many child_job instances are being started).
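The report only guesses at a race, and nothing here is taken from the multijob plugin source. As a purely hypothetical illustration of how a parent can wait on several children without missing a completion that arrives early, the sketch below counts completions with a `CountDownLatch` and bounds the wait with a timeout. The class and method names (`ChildWaitSketch`, `childFinished`, `awaitChildren`) are invented for this example.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch, not multijob plugin code.
public class ChildWaitSketch {
    // One count per started child build; a latch remembers completions
    // even if they happen before the parent starts waiting.
    private final CountDownLatch remaining;

    public ChildWaitSketch(int childBuilds) {
        this.remaining = new CountDownLatch(childBuilds);
    }

    /** Called once by each child build when it finishes. */
    public void childFinished() {
        remaining.countDown();
    }

    /** Returns true if every child reported within the timeout, false otherwise. */
    public boolean awaitChildren(long timeout, TimeUnit unit) throws InterruptedException {
        return remaining.await(timeout, unit);
    }

    public static void main(String[] args) throws Exception {
        ChildWaitSketch parent = new ChildWaitSketch(2);
        // Two builds of the same job finishing concurrently.
        Thread first = new Thread(parent::childFinished, "child_job-first");
        Thread second = new Thread(parent::childFinished, "child_job-second");
        first.start();
        second.start();
        System.out.println("all children reported: " + parent.awaitChildren(5, TimeUnit.SECONDS));
    }
}
```

With a bounded wait, a lost result would show up as a `false` return rather than as a parent build that runs indefinitely.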

          [JENKINS-27371] Parent builds sometimes hang on successful child builds of same type

          Simon Weber created issue -

          Mathieu Cantin added a comment -

          Each sub-job in a phase has a "Kill the phase on:" option. When it is set to "Failure", all other builds are killed if this sub-job fails. Select "never" to never kill the other sub-jobs.

          The "Abort all other job" option applies when the job is aborted manually by a user.

          Simon Weber added a comment -

          I'm not sure that has anything to do with the problem? The problem is that multijob will sometimes lose track of a child job and wait forever for its result. That forces us to abort the parent build (or else the parent job would run indefinitely).

          Mathieu Cantin added a comment -

          Sorry, I see the problem. Do you have the "Execute concurrent builds if necessary" checked on child_job?
          Mathieu Cantin made changes -
          Link New: This issue duplicates JENKINS-26678 [ JENKINS-26678 ]

          Simon Weber added a comment -

          > Sorry, I see the problem

          Oh, no worries; I could have done a better job describing the problem.

          > Do you have the "Execute concurrent builds if necessary" checked on child_job?

          Yes; on both child_job and parent_job. I had originally suspected this only happened when job numbers were interleaved due to another concurrent build, but that didn't happen in an example I just triggered now. I'll check the logs to see if there's anything interesting.

          Simon Weber added a comment -

          The logs aren't too informative. My comments are in [ square brackets ].

          parent_job log

          Starting build job child_job.
          Starting build job child_job.
          Finished Build : #6631 of Job : child_job with status : SUCCESS
          Build timed out (after 20 minutes). Marking the build as aborted.  [ this is from a plugin we use to timeout automatically ]
          Aborting all subjobs.
          Finished Build : #6630 of Job : child_job with status : ABORTED [ note that the parent job knows about the correct subjob since it knows which to abort ]
          Build was aborted
          Finished: ABORTED
          

          Jenkins log:

          Mar 12, 2015 2:34:40 PM INFO hudson.model.Run execute
          child_job #6631 main build action completed: SUCCESS
          Mar 12, 2015 2:34:45 PM INFO hudson.model.Run execute
          child_job #6630 main build action completed: SUCCESS  [ this subjob did finish successfully and on time, but the parent job missed its result ]
          
          <snip>
          
          Mar 12, 2015 2:36:52 PM SEVERE hudson.model.Executor run [I don't think this is relevant - it happens later than the job finishes - but I figured I'd include it ]
          Executor threw an exception
          java.util.NoSuchElementException
              at jenkins.model.lazy.LazyLoadRunMapEntrySet$1.next(LazyLoadRunMapEntrySet.java:76)
              at jenkins.model.lazy.LazyLoadRunMapEntrySet$1.next(LazyLoadRunMapEntrySet.java:63)
              at java.util.AbstractMap$2$1.next(AbstractMap.java:385)
              at hudson.util.RunList.subList(RunList.java:137)
              at hudson.tasks.LogRotator.perform(LogRotator.java:124)
              at hudson.model.Job.logRotate(Job.java:449)
              at hudson.model.Run.execute(Run.java:1823)
              at hudson.model.FreeStyleBuild.run(FreeStyleBuild.java:43)
              at hudson.model.ResourceController.execute(ResourceController.java:89)
              at hudson.model.Executor.run(Executor.java:240)
          
          

          Mathieu Cantin added a comment -

          Good news (or bad): this looks like it comes from an API change in Jenkins 1.596.1 (we updated last Friday). I am working on a fix.

          Simon Weber added a comment -

          Is the API change you're thinking of specific to that version? We're still on version 1.593, and I think we saw this behavior on a version < 1.565.1 as well.

          Daniel Beck added a comment -

          If something hangs, thread dumps would be helpful.

          https://wiki.jenkins-ci.org/display/JENKINS/Obtaining+a+thread+dump
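For reference, a minimal sketch of what a thread dump contains, using the standard `Thread.getAllStackTraces()` JDK call. It only covers the JVM it runs in, so for a running Jenkins instance the methods on the wiki page above (or the JDK's `jstack` tool against the Jenkins process) are the practical route. The class name `ThreadDumpSketch` is made up for this example.

```java
import java.util.Map;

// Illustrative only: prints the stack of every live thread in the current JVM.
public class ThreadDumpSketch {
    /** Renders each live thread's name, state, and stack frames as text. */
    public static String dumpAllThreads() {
        StringBuilder out = new StringBuilder();
        for (Map.Entry<Thread, StackTraceElement[]> entry : Thread.getAllStackTraces().entrySet()) {
            Thread thread = entry.getKey();
            out.append('"').append(thread.getName()).append('"')
               .append(" state=").append(thread.getState()).append('\n');
            for (StackTraceElement frame : entry.getValue()) {
                out.append("    at ").append(frame).append('\n');
            }
            out.append('\n');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.print(dumpAllThreads());
    }
}
```

In a hang like the one described here, the interesting entry would be the executor thread running parent_job and whatever it is blocked on.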

            chenc Chen Cohen
            simonmweber Simon Weber
            Votes: 3
            Watchers: 11
