Uploaded image for project: 'Jenkins'
  1. Jenkins
  2. JENKINS-27371

Parent builds sometimes hang on successful child builds of same type

    • Icon: Bug Bug
    • Resolution: Postponed
    • Icon: Major Major
    • multijob-plugin
    • None
    • jenkins 1.593
      multijob plugin 1.16

      With a job configuration like:

      parent_job:
      child_job
      child_job

      occasionally we'll get output that looks like this:

      Starting build job child_job.
      Starting build job child_job.
      Finished Build : #123 - Job : child_job with status : SUCCESS
      <this hangs forever, so we abort the parent_job>
      Aborting all subjobs.
      Finished Build : #124 - Job : child_job with status : ABORTED

      However, job #124 will always have finished successfully. We see this relatively rarely (~5% of parent_job builds?). My guess is that there's a race condition causing this, since we usually see this during times of high load (ie many child_job instances are being started).

          [JENKINS-27371] Parent builds sometimes hang on successful child builds of same type

          You have the option "Kill the phase on:" for each sub-jobs on a phase. When this is on "Failure", this kill all others builds when this sub-job has failed. Select "never" to never kill others sub-jobs.

          The option "Abort all other job" is when the job is aborted manually by an user.

          Mathieu Cantin added a comment - You have the option "Kill the phase on:" for each sub-jobs on a phase. When this is on "Failure", this kill all others builds when this sub-job has failed. Select "never" to never kill others sub-jobs. The option "Abort all other job" is when the job is aborted manually by an user.

          Simon Weber added a comment -

          I'm not sure that has anything to do with the problem? The problem is that multijob will sometimes lose track of a child job and wait forever for its result. That forces us to abort the parent build (or else the parent job would run indefinitely).

          Simon Weber added a comment - I'm not sure that has anything to do with the problem? The problem is that multijob will sometimes lose track of a child job and wait forever for its result. That forces us to abort the parent build (or else the parent job would run indefinitely).

          Sorry, I see the problem. Do you have the "Execute concurrent builds if necessary" checked on child_job ?

          Mathieu Cantin added a comment - Sorry, I see the problem. Do you have the "Execute concurrent builds if necessary" checked on child_job ?

          Simon Weber added a comment -

          > Sorry, I see the problem

          Oh, no worries; I could have done a better job describing the problem.

          > Do you have the "Execute concurrent builds if necessary" checked on child_job?

          Yes; on both child_job and parent_job. I had originally suspected this only happened when job numbers were interleaved due to another concurrent build, but that didn't happen in an example I just triggered now. I'll check the logs to see if there's anything interesting.

          Simon Weber added a comment - > Sorry, I see the problem Oh, no worries; I could have done a better job describing the problem. > Do you have the "Execute concurrent builds if necessary" checked on child_job? Yes; on both child_job and parent_job. I had originally suspected this only happened when job numbers were interleaved due to another concurrent build, but that didn't happen in an example I just triggered now. I'll check the logs to see if there's anything interesting.

          Simon Weber added a comment -

          The logs aren't too informative. My comments are in [ square brackets ].

          parent_job log

          Starting build job child_job.
          Starting build job child_job.
          Finished Build : #6631 of Job : child_job with status : SUCCESS
          Build timed out (after 20 minutes). Marking the build as aborted.  [ this is from a plugin we use to timeout automatically ]
          Aborting all subjobs.
          Finished Build : #6630 of Job : child_job with status : ABORTED [ note that the parent job knows about the correct subjob since it knows which to abort ]
          Build was aborted
          Finished: ABORTED
          

          Jenkins log:

          Mar 12, 2015 2:34:40 PM INFO hudson.model.Run execute
          child_job #6631 main build action completed: SUCCESS
          Mar 12, 2015 2:34:45 PM INFO hudson.model.Run execute
          child_job #6630 main build action completed: SUCCESS  [ this subjob did finish successfully and on time, but the parent job missed its result ]
          
          <snip>
          
          Mar 12, 2015 2:36:52 PM SEVERE hudson.model.Executor run [I don't think this is relevant - it happens later than the job finishes - but I figured I'd include it ]
          Executor threw an exception
          java.util.NoSuchElementException
              at jenkins.model.lazy.LazyLoadRunMapEntrySet$1.next(LazyLoadRunMapEntrySet.java:76)
              at jenkins.model.lazy.LazyLoadRunMapEntrySet$1.next(LazyLoadRunMapEntrySet.java:63)
              at java.util.AbstractMap$2$1.next(AbstractMap.java:385)
              at hudson.util.RunList.subList(RunList.java:137)
              at hudson.tasks.LogRotator.perform(LogRotator.java:124)
              at hudson.model.Job.logRotate(Job.java:449)
              at hudson.model.Run.execute(Run.java:1823)
              at hudson.model.FreeStyleBuild.run(FreeStyleBuild.java:43)
              at hudson.model.ResourceController.execute(ResourceController.java:89)
              at hudson.model.Executor.run(Executor.java:240)
          
          

          Simon Weber added a comment - The logs aren't too informative. My comments are in [ square brackets ]. parent_job log Starting build job child_job. Starting build job child_job. Finished Build : #6631 of Job : child_job with status : SUCCESS Build timed out (after 20 minutes). Marking the build as aborted. [ this is from a plugin we use to timeout automatically ] Aborting all subjobs. Finished Build : #6630 of Job : child_job with status : ABORTED [ note that the parent job knows about the correct subjob since it knows which to abort ] Build was aborted Finished: ABORTED Jenkins log: Mar 12, 2015 2:34:40 PM INFO hudson.model.Run execute child_job #6631 main build action completed: SUCCESS Mar 12, 2015 2:34:45 PM INFO hudson.model.Run execute child_job #6630 main build action completed: SUCCESS [ this subjob did finish successfully and on time, but the parent job missed its result ] <snip> Mar 12, 2015 2:36:52 PM SEVERE hudson.model.Executor run [I don 't think this is relevant - it happens later than the job finishes - but I figured I' d include it ] Executor threw an exception java.util.NoSuchElementException at jenkins.model.lazy.LazyLoadRunMapEntrySet$1.next(LazyLoadRunMapEntrySet.java:76) at jenkins.model.lazy.LazyLoadRunMapEntrySet$1.next(LazyLoadRunMapEntrySet.java:63) at java.util.AbstractMap$2$1.next(AbstractMap.java:385) at hudson.util.RunList.subList(RunList.java:137) at hudson.tasks.LogRotator.perform(LogRotator.java:124) at hudson.model.Job.logRotate(Job.java:449) at hudson.model.Run.execute(Run.java:1823) at hudson.model.FreeStyleBuild.run(FreeStyleBuild.java:43) at hudson.model.ResourceController.execute(ResourceController.java:89) at hudson.model.Executor.run(Executor.java:240)

          Good new (or bad). This seen a change on the API of Jenkins 1.596.1 (we have updated last friday). I work on a fix.

          Mathieu Cantin added a comment - Good new (or bad). This seen a change on the API of Jenkins 1.596.1 (we have updated last friday). I work on a fix.

          Simon Weber added a comment -

          Is the api change you're thinking of in that specific version? We're still on version 1.593, and I think we saw this behavior on a version < 1.565.1 as well.

          Simon Weber added a comment - Is the api change you're thinking of in that specific version? We're still on version 1.593, and I think we saw this behavior on a version < 1.565.1 as well.

          Daniel Beck added a comment -

          If something hangs, thread dumps would be helpful.

          https://wiki.jenkins-ci.org/display/JENKINS/Obtaining+a+thread+dump

          Daniel Beck added a comment - If something hangs, thread dumps would be helpful. https://wiki.jenkins-ci.org/display/JENKINS/Obtaining+a+thread+dump

          Simon Weber added a comment -

          I finally got a thread dump! It looks like the problem is at this line: https://github.com/jenkinsci/tikal-multijob-plugin/blob/jenkins-multijob-plugin-1.16/src/main/java/com/tikal/jenkins/plugins/multijob/MultiJobBuilder.java#L216.

          Here's the relevant thread dump:

          "Executor #13 for qa-jenkins-psuedorunner-016 : executing platform_pr #2036" Id=257890 Group=main TIMED_WAITING on java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@385f18de
          	at sun.misc.Unsafe.park(Native Method)
          	-  waiting on java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@385f18de
          	at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:226)
          	at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2082)
          	at java.util.concurrent.ArrayBlockingQueue.poll(ArrayBlockingQueue.java:389)
          	at com.tikal.jenkins.plugins.multijob.MultiJobBuilder.perform(MultiJobBuilder.java:216)
          	at org.jenkins_ci.plugins.run_condition.BuildStepRunner$2.run(BuildStepRunner.java:110)
          	at org.jenkins_ci.plugins.run_condition.BuildStepRunner$DontRun.conditionalRun(BuildStepRunner.java:264)
          	at org.jenkins_ci.plugins.run_condition.BuildStepRunner.perform(BuildStepRunner.java:105)
          	at org.jenkinsci.plugins.conditionalbuildstep.singlestep.SingleConditionalBuilder.perform(SingleConditionalBuilder.java:108)
          	at hudson.tasks.BuildStepMonitor$1.perform(BuildStepMonitor.java:20)
          	at hudson.model.AbstractBuild$AbstractBuildExecution.perform(AbstractBuild.java:770)
          	at hudson.model.Build$BuildExecution.build(Build.java:199)
          	at hudson.model.Build$BuildExecution.doRun(Build.java:160)
          	at hudson.model.AbstractBuild$AbstractBuildExecution.run(AbstractBuild.java:533)
          	at com.tikal.jenkins.plugins.multijob.MultiJobBuild$MultiJobRunnerImpl.run(MultiJobBuild.java:134)
          	at hudson.model.Run.execute(Run.java:1759)
          	at com.tikal.jenkins.plugins.multijob.MultiJobBuild.run(MultiJobBuild.java:73)
          	at hudson.model.ResourceController.execute(ResourceController.java:89)
          	at hudson.model.Executor.run(Executor.java:240)
          

          Let me know if you need any other information from the dump – I saved it all just in case.

          Simon Weber added a comment - I finally got a thread dump! It looks like the problem is at this line: https://github.com/jenkinsci/tikal-multijob-plugin/blob/jenkins-multijob-plugin-1.16/src/main/java/com/tikal/jenkins/plugins/multijob/MultiJobBuilder.java#L216 . Here's the relevant thread dump: "Executor #13 for qa-jenkins-psuedorunner-016 : executing platform_pr #2036" Id=257890 Group=main TIMED_WAITING on java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@385f18de at sun.misc.Unsafe.park(Native Method) - waiting on java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@385f18de at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:226) at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2082) at java.util.concurrent.ArrayBlockingQueue.poll(ArrayBlockingQueue.java:389) at com.tikal.jenkins.plugins.multijob.MultiJobBuilder.perform(MultiJobBuilder.java:216) at org.jenkins_ci.plugins.run_condition.BuildStepRunner$2.run(BuildStepRunner.java:110) at org.jenkins_ci.plugins.run_condition.BuildStepRunner$DontRun.conditionalRun(BuildStepRunner.java:264) at org.jenkins_ci.plugins.run_condition.BuildStepRunner.perform(BuildStepRunner.java:105) at org.jenkinsci.plugins.conditionalbuildstep.singlestep.SingleConditionalBuilder.perform(SingleConditionalBuilder.java:108) at hudson.tasks.BuildStepMonitor$1.perform(BuildStepMonitor.java:20) at hudson.model.AbstractBuild$AbstractBuildExecution.perform(AbstractBuild.java:770) at hudson.model.Build$BuildExecution.build(Build.java:199) at hudson.model.Build$BuildExecution.doRun(Build.java:160) at hudson.model.AbstractBuild$AbstractBuildExecution.run(AbstractBuild.java:533) at com.tikal.jenkins.plugins.multijob.MultiJobBuild$MultiJobRunnerImpl.run(MultiJobBuild.java:134) at hudson.model.Run.execute(Run.java:1759) at com.tikal.jenkins.plugins.multijob.MultiJobBuild.run(MultiJobBuild.java:73) at hudson.model.ResourceController.execute(ResourceController.java:89) at hudson.model.Executor.run(Executor.java:240) Let me know if you need any other information from the dump – I saved it all just in case.

          Simon Weber added a comment -

          Here's a log message regarding the subjob that finished, but was still being waited on. My comments in [ square brackets ]:

          Apr 01, 2015 5:52:20 PM INFO hudson.model.Run execute
          platform_parallel #2620 main build action completed: SUCCESS  [ this was the job the platform_pr was waiting on ]
          
          Apr 01, 2015 5:52:20 PM INFO hudson.model.Run execute
          platform_parallel #2619 main build action completed: SUCCESS
          
          Apr 01, 2015 5:52:20 PM WARNING jenkins.model.lazy.LazyBuildMixIn removeRun
          com.tikal.jenkins.plugins.multijob.MultiJobProject@4ff2acc8[platform_parallel] did not contain platform_parallel #1620 to begin with  [ coincidence? was this supposed to be #2620? ]
          
          Apr 01, 2015 5:52:20 PM SEVERE hudson.model.Executor run
          Executor threw an exception
          java.util.NoSuchElementException
          	at jenkins.model.lazy.LazyLoadRunMapEntrySet$1.next(LazyLoadRunMapEntrySet.java:76)
          	at jenkins.model.lazy.LazyLoadRunMapEntrySet$1.next(LazyLoadRunMapEntrySet.java:63)
          	at java.util.AbstractMap$2$1.next(AbstractMap.java:385)
          	at hudson.util.RunList.subList(RunList.java:137)
          	at hudson.tasks.LogRotator.perform(LogRotator.java:124)
          	at hudson.model.Job.logRotate(Job.java:449)
          	at hudson.model.Run.execute(Run.java:1823)
          	at com.tikal.jenkins.plugins.multijob.MultiJobBuild.run(MultiJobBuild.java:73)
          	at hudson.model.ResourceController.execute(ResourceController.java:89)
          	at hudson.model.Executor.run(Executor.java:240)
          

          Simon Weber added a comment - Here's a log message regarding the subjob that finished, but was still being waited on. My comments in [ square brackets ]: Apr 01, 2015 5:52:20 PM INFO hudson.model.Run execute platform_parallel #2620 main build action completed: SUCCESS [ this was the job the platform_pr was waiting on ] Apr 01, 2015 5:52:20 PM INFO hudson.model.Run execute platform_parallel #2619 main build action completed: SUCCESS Apr 01, 2015 5:52:20 PM WARNING jenkins.model.lazy.LazyBuildMixIn removeRun com.tikal.jenkins.plugins.multijob.MultiJobProject@4ff2acc8[platform_parallel] did not contain platform_parallel #1620 to begin with [ coincidence? was this supposed to be #2620? ] Apr 01, 2015 5:52:20 PM SEVERE hudson.model.Executor run Executor threw an exception java.util.NoSuchElementException at jenkins.model.lazy.LazyLoadRunMapEntrySet$1.next(LazyLoadRunMapEntrySet.java:76) at jenkins.model.lazy.LazyLoadRunMapEntrySet$1.next(LazyLoadRunMapEntrySet.java:63) at java.util.AbstractMap$2$1.next(AbstractMap.java:385) at hudson.util.RunList.subList(RunList.java:137) at hudson.tasks.LogRotator.perform(LogRotator.java:124) at hudson.model.Job.logRotate(Job.java:449) at hudson.model.Run.execute(Run.java:1823) at com.tikal.jenkins.plugins.multijob.MultiJobBuild.run(MultiJobBuild.java:73) at hudson.model.ResourceController.execute(ResourceController.java:89) at hudson.model.Executor.run(Executor.java:240)

          I created this pull request, https://github.com/jenkinsci/tikal-multijob-plugin/pull/65, and I think this will fix that.

          Mathieu Cantin added a comment - I created this pull request, https://github.com/jenkinsci/tikal-multijob-plugin/pull/65 , and I think this will fix that.

          Simon Weber added a comment -

          Can we get some eyes on that pull request? We're still running into this issue daily.

          Simon Weber added a comment - Can we get some eyes on that pull request? We're still running into this issue daily.

          Simon Weber added a comment -

          @mcantin, I just installed your changes from source and unfortunately we're still seeing the problem. I've only seen hangs when one of the child builds failed, in case that's useful.

          Simon Weber added a comment - @mcantin, I just installed your changes from source and unfortunately we're still seeing the problem. I've only seen hangs when one of the child builds failed, in case that's useful.

          Michael Merritt added a comment - - edited

          We're also seeing this same issue that Simon is describing and it's causing deploy headaches for us. Frequently having to kill the job and rebuild it and roll the dice to see if it's going to work. Our logs look similar to those mentioned above. Let me know if you need more info.

          Jenkins 1.607
          Multijob 1.16

          website-deploy-production-download #10432 main build action completed: SUCCESS
          Jun 10, 2015 8:50:19 PM SEVERE hudson.model.Executor finish1
          Executor threw an exception
          java.util.NoSuchElementException
          	at jenkins.model.lazy.LazyLoadRunMapEntrySet$1.next(LazyLoadRunMapEntrySet.java:76)
          	at jenkins.model.lazy.LazyLoadRunMapEntrySet$1.next(LazyLoadRunMapEntrySet.java:63)
          	at java.util.AbstractMap$2$1.next(AbstractMap.java:385)
          	at hudson.util.RunList.subList(RunList.java:137)
          	at hudson.tasks.LogRotator.perform(LogRotator.java:124)
          	at hudson.model.Job.logRotate(Job.java:465)
          	at hudson.model.Run.execute(Run.java:1805)
          	at hudson.model.FreeStyleBuild.run(FreeStyleBuild.java:43)
          	at hudson.model.ResourceController.execute(ResourceController.java:98)
          	at hudson.model.Executor.run(Executor.java:374)
          

          Michael Merritt added a comment - - edited We're also seeing this same issue that Simon is describing and it's causing deploy headaches for us. Frequently having to kill the job and rebuild it and roll the dice to see if it's going to work. Our logs look similar to those mentioned above. Let me know if you need more info. Jenkins 1.607 Multijob 1.16 website-deploy-production-download #10432 main build action completed: SUCCESS Jun 10, 2015 8:50:19 PM SEVERE hudson.model.Executor finish1 Executor threw an exception java.util.NoSuchElementException at jenkins.model.lazy.LazyLoadRunMapEntrySet$1.next(LazyLoadRunMapEntrySet.java:76) at jenkins.model.lazy.LazyLoadRunMapEntrySet$1.next(LazyLoadRunMapEntrySet.java:63) at java.util.AbstractMap$2$1.next(AbstractMap.java:385) at hudson.util.RunList.subList(RunList.java:137) at hudson.tasks.LogRotator.perform(LogRotator.java:124) at hudson.model.Job.logRotate(Job.java:465) at hudson.model.Run.execute(Run.java:1805) at hudson.model.FreeStyleBuild.run(FreeStyleBuild.java:43) at hudson.model.ResourceController.execute(ResourceController.java:98) at hudson.model.Executor.run(Executor.java:374)

          Since ~ 1 month, I try a new fix and I don't have again this bug. Can you try it? The PR : https://github.com/jenkinsci/tikal-multijob-plugin/pull/65 and the build : https://jenkins.ci.cloudbees.com/job/plugins/job/tikal-multijob-plugin/159/.

          Mathieu Cantin added a comment - Since ~ 1 month, I try a new fix and I don't have again this bug. Can you try it? The PR : https://github.com/jenkinsci/tikal-multijob-plugin/pull/65 and the build : https://jenkins.ci.cloudbees.com/job/plugins/job/tikal-multijob-plugin/159/ .

          Thanks, we'll give it a try.

          Michael Merritt added a comment - Thanks, we'll give it a try.

          Simon Weber added a comment -

          mcantin, that fix didn't work for us: it made all builds hang.

          I found this error in the logs:

          Jun 19, 2015 9:34:03 PM WARNING hudson.triggers.Trigger checkTriggers
          org.jenkinsci.plugins.ghprb.GhprbTrigger.run() failed for hudson.model.FreeStyleProject@b347b18[venmo_platform_auto_pr]
          java.lang.NullPointerException
          	at org.jenkinsci.plugins.ghprb.GhprbTrigger.run(GhprbTrigger.java:155)
          	at hudson.triggers.Trigger.checkTriggers(Trigger.java:266)
          	at hudson.triggers.Trigger$Cron.doRun(Trigger.java:214)
          	at hudson.triggers.SafeTimerTask.run(SafeTimerTask.java:54)
          	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
          	at java.util.concurrent.FutureTask$Sync.innerRunAndReset(FutureTask.java:351)
          	at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:178)
          	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:178)
          	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
          	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
          	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
          	at java.lang.Thread.run(Thread.java:724)
          

          Simon Weber added a comment - mcantin , that fix didn't work for us: it made all builds hang. I found this error in the logs: Jun 19, 2015 9:34:03 PM WARNING hudson.triggers.Trigger checkTriggers org.jenkinsci.plugins.ghprb.GhprbTrigger.run() failed for hudson.model.FreeStyleProject@b347b18[venmo_platform_auto_pr] java.lang.NullPointerException at org.jenkinsci.plugins.ghprb.GhprbTrigger.run(GhprbTrigger.java:155) at hudson.triggers.Trigger.checkTriggers(Trigger.java:266) at hudson.triggers.Trigger$Cron.doRun(Trigger.java:214) at hudson.triggers.SafeTimerTask.run(SafeTimerTask.java:54) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) at java.util.concurrent.FutureTask$Sync.innerRunAndReset(FutureTask.java:351) at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:178) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:178) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang. Thread .run( Thread .java:724)

          Mathieu Cantin added a comment - - edited

          simonmweber
          I'm sorry. Did you update the "GitHub pull request builder plugin" (https://wiki.jenkins-ci.org/display/JENKINS/GitHub+pull+request+builder+plugin) recently? Because, the error was on this plugin with a "Free Style Project". Which version do you have?

          Mathieu Cantin added a comment - - edited simonmweber I'm sorry. Did you update the "GitHub pull request builder plugin" ( https://wiki.jenkins-ci.org/display/JENKINS/GitHub+pull+request+builder+plugin ) recently? Because, the error was on this plugin with a "Free Style Project". Which version do you have?

          Simon Weber added a comment -

          Whoops! I pasted the wrong thing. Let me do some debugging and get back to you.

          Simon Weber added a comment - Whoops! I pasted the wrong thing. Let me do some debugging and get back to you.

          Simon Weber added a comment -

          Ok, I had to fix a few unrelated problems, haha. The hanging I was seeing before was caused by Jenkins taking a long time to discard old builds. I'm not sure why that was triggered by updating the plugin and restarting – maybe I just got unlucky?

          Anyway, after fixing all that other stuff and upgrading, things look good! I'll know for sure next week once we have some real load on it, but this looks promising =)

          Simon Weber added a comment - Ok, I had to fix a few unrelated problems, haha. The hanging I was seeing before was caused by Jenkins taking a long time to discard old builds. I'm not sure why that was triggered by updating the plugin and restarting – maybe I just got unlucky? Anyway, after fixing all that other stuff and upgrading, things look good! I'll know for sure next week once we have some real load on it, but this looks promising =)

          Simon Weber added a comment -

          mcantin bummer; we just had a build hang.

          Here are the logs, my comments <"like this"> so jira highlights them.

          From the build itself:

          Started by upstream project "platform_deploy_listener" build number 219
          <"snip">
          [MultiJob] Starting job platform_parallel.
          [MultiJob] Starting job platform_parallel.
          [MultiJob] Starting job venmo_platform_external_integration.
          [MultiJob] Starting job venmo_platform_external_integration.
          [MultiJob] Starting job scope_frontend.
          [MultiJob] Finished Build : #96 - deploy of Job : scope_frontend with status : SUCCESS
          [MultiJob] Finished Build : #6837 of Job : platform_parallel with status : SUCCESS
          [MultiJob] Finished Build : #6836 of Job : platform_parallel with status : SUCCESS
          [MultiJob] Finished Build : #1248 - deploy of Job : venmo_platform_external_integration with status : SUCCESS
          Build timed out (after 20 minutes). Marking the build as aborted.
          [MultiJob] Aborting all subjobs.
          [MultiJob] Aborting platform_parallel.
          [MultiJob] Aborting platform_parallel.
          [MultiJob] Aborting venmo_platform_external_integration.
          [MultiJob] Aborting venmo_platform_external_integration.
          [MultiJob] Aborting scope_frontend.
          Build was aborted
          <"snip">
          [MultiJob] Finished Build : #1247 - deploy of Job : venmo_platform_external_integration with status : ABORTED <"here's the build that hung">
          

          From Jenkins:

          Jun 22, 2015 2:58:55 PM INFO
          venmo_platform_test_and_deploy - #863 Started by ...
          Jun 22, 2015 3:09:10 PM INFO hudson.model.Run execute
          venmo_platform_external_integration #1247 main build action completed: SUCCESS <"here we see it completed successfully">
          Jun 22, 2015 3:09:11 PM INFO hudson.model.Run execute
          venmo_platform_external_integration #1248 main build action completed: SUCCESS
          Jun 22, 2015 3:09:20 PM SEVERE hudson.model.Executor finish1
          Executor threw an exception
          java.util.NoSuchElementException <"perhaps this is related? We keep the most recent 1000 builds.">
          	at jenkins.model.lazy.LazyLoadRunMapEntrySet$1.next(LazyLoadRunMapEntrySet.java:76)
          	at jenkins.model.lazy.LazyLoadRunMapEntrySet$1.next(LazyLoadRunMapEntrySet.java:63)
          	at java.util.AbstractMap$2$1.next(AbstractMap.java:385)
          	at hudson.util.RunList.subList(RunList.java:139)
          	at hudson.tasks.LogRotator.perform(LogRotator.java:125)
          	at hudson.model.Job.logRotate(Job.java:467)
          	at hudson.model.Run.execute(Run.java:1808)
          	at hudson.model.FreeStyleBuild.run(FreeStyleBuild.java:43)
          	at hudson.model.ResourceController.execute(ResourceController.java:98)
          	at hudson.model.Executor.run(Executor.java:374)
          
          <"these look to be expected">
          Jun 22, 2015 3:18:48 PM INFO hudson.model.Run execute
          platform_deploy_listener #219 aborted
          java.lang.InterruptedException
          	at java.lang.Object.wait(Native Method)
          	at java.lang.Object.wait(Object.java:503)
          	at hudson.remoting.AsyncFutureImpl.get(AsyncFutureImpl.java:73)
          	at hudson.plugins.parameterizedtrigger.TriggerBuilder.perform(TriggerBuilder.java:135)
          	at hudson.tasks.BuildStepMonitor$1.perform(BuildStepMonitor.java:20)
          	at hudson.model.AbstractBuild$AbstractBuildExecution.perform(AbstractBuild.java:779)
          	at hudson.model.Build$BuildExecution.build(Build.java:205)
          	at hudson.model.Build$BuildExecution.doRun(Build.java:162)
          	at hudson.model.AbstractBuild$AbstractBuildExecution.run(AbstractBuild.java:537)
          	at hudson.model.Run.execute(Run.java:1744)
          	at hudson.model.FreeStyleBuild.run(FreeStyleBuild.java:43)
          	at hudson.model.ResourceController.execute(ResourceController.java:98)
          	at hudson.model.Executor.run(Executor.java:374)
          
          
          Jun 22, 2015 3:18:57 PM INFO hudson.model.Run execute
          venmo_platform_test_and_deploy #863 aborted
          java.lang.InterruptedException
          	at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2017)
          	at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2095)
          	at java.util.concurrent.ArrayBlockingQueue.poll(ArrayBlockingQueue.java:389)
          	at com.tikal.jenkins.plugins.multijob.MultiJobBuilder.perform(MultiJobBuilder.java:220)
          	at hudson.tasks.BuildStepMonitor$1.perform(BuildStepMonitor.java:20)
          	at hudson.model.AbstractBuild$AbstractBuildExecution.perform(AbstractBuild.java:779)
          	at hudson.model.Build$BuildExecution.build(Build.java:205)
          	at hudson.model.Build$BuildExecution.doRun(Build.java:162)
          	at hudson.model.AbstractBuild$AbstractBuildExecution.run(AbstractBuild.java:537)
          	at com.tikal.jenkins.plugins.multijob.MultiJobBuild$MultiJobRunnerImpl.run(MultiJobBuild.java:137)
          	at hudson.model.Run.execute(Run.java:1744)
          	at com.tikal.jenkins.plugins.multijob.MultiJobBuild.run(MultiJobBuild.java:76)
          	at hudson.model.ResourceController.execute(ResourceController.java:98)
          	at hudson.model.Executor.run(Executor.java:374)
          

          Simon Weber added a comment - mcantin bummer; we just had a build hang. Here are the logs, my comments <"like this"> so jira highlights them. From the build itself: Started by upstream project "platform_deploy_listener" build number 219 < "snip" > [MultiJob] Starting job platform_parallel. [MultiJob] Starting job platform_parallel. [MultiJob] Starting job venmo_platform_external_integration. [MultiJob] Starting job venmo_platform_external_integration. [MultiJob] Starting job scope_frontend. [MultiJob] Finished Build : #96 - deploy of Job : scope_frontend with status : SUCCESS [MultiJob] Finished Build : #6837 of Job : platform_parallel with status : SUCCESS [MultiJob] Finished Build : #6836 of Job : platform_parallel with status : SUCCESS [MultiJob] Finished Build : #1248 - deploy of Job : venmo_platform_external_integration with status : SUCCESS Build timed out (after 20 minutes). Marking the build as aborted. [MultiJob] Aborting all subjobs. [MultiJob] Aborting platform_parallel. [MultiJob] Aborting platform_parallel. [MultiJob] Aborting venmo_platform_external_integration. [MultiJob] Aborting venmo_platform_external_integration. [MultiJob] Aborting scope_frontend. Build was aborted < "snip" > [MultiJob] Finished Build : #1247 - deploy of Job : venmo_platform_external_integration with status : ABORTED < "here's the build that hung" > From Jenkins: Jun 22, 2015 2:58:55 PM INFO venmo_platform_test_and_deploy - #863 Started by ... Jun 22, 2015 3:09:10 PM INFO hudson.model.Run execute venmo_platform_external_integration #1247 main build action completed: SUCCESS < "here we see it completed successfully" > Jun 22, 2015 3:09:11 PM INFO hudson.model.Run execute venmo_platform_external_integration #1248 main build action completed: SUCCESS Jun 22, 2015 3:09:20 PM SEVERE hudson.model.Executor finish1 Executor threw an exception java.util.NoSuchElementException < "perhaps this is related? We keep the most recent 1000 builds." > at jenkins.model.lazy.LazyLoadRunMapEntrySet$1.next(LazyLoadRunMapEntrySet.java:76) at jenkins.model.lazy.LazyLoadRunMapEntrySet$1.next(LazyLoadRunMapEntrySet.java:63) at java.util.AbstractMap$2$1.next(AbstractMap.java:385) at hudson.util.RunList.subList(RunList.java:139) at hudson.tasks.LogRotator.perform(LogRotator.java:125) at hudson.model.Job.logRotate(Job.java:467) at hudson.model.Run.execute(Run.java:1808) at hudson.model.FreeStyleBuild.run(FreeStyleBuild.java:43) at hudson.model.ResourceController.execute(ResourceController.java:98) at hudson.model.Executor.run(Executor.java:374) < "these look to be expected" > Jun 22, 2015 3:18:48 PM INFO hudson.model.Run execute platform_deploy_listener #219 aborted java.lang.InterruptedException at java.lang. Object .wait(Native Method) at java.lang. Object .wait( Object .java:503) at hudson.remoting.AsyncFutureImpl.get(AsyncFutureImpl.java:73) at hudson.plugins.parameterizedtrigger.TriggerBuilder.perform(TriggerBuilder.java:135) at hudson.tasks.BuildStepMonitor$1.perform(BuildStepMonitor.java:20) at hudson.model.AbstractBuild$AbstractBuildExecution.perform(AbstractBuild.java:779) at hudson.model.Build$BuildExecution.build(Build.java:205) at hudson.model.Build$BuildExecution.doRun(Build.java:162) at hudson.model.AbstractBuild$AbstractBuildExecution.run(AbstractBuild.java:537) at hudson.model.Run.execute(Run.java:1744) at hudson.model.FreeStyleBuild.run(FreeStyleBuild.java:43) at hudson.model.ResourceController.execute(ResourceController.java:98) at hudson.model.Executor.run(Executor.java:374) Jun 22, 2015 3:18:57 PM INFO hudson.model.Run execute venmo_platform_test_and_deploy #863 aborted java.lang.InterruptedException at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2017) at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2095) at java.util.concurrent.ArrayBlockingQueue.poll(ArrayBlockingQueue.java:389) at com.tikal.jenkins.plugins.multijob.MultiJobBuilder.perform(MultiJobBuilder.java:220) at hudson.tasks.BuildStepMonitor$1.perform(BuildStepMonitor.java:20) at hudson.model.AbstractBuild$AbstractBuildExecution.perform(AbstractBuild.java:779) at hudson.model.Build$BuildExecution.build(Build.java:205) at hudson.model.Build$BuildExecution.doRun(Build.java:162) at hudson.model.AbstractBuild$AbstractBuildExecution.run(AbstractBuild.java:537) at com.tikal.jenkins.plugins.multijob.MultiJobBuild$MultiJobRunnerImpl.run(MultiJobBuild.java:137) at hudson.model.Run.execute(Run.java:1744) at com.tikal.jenkins.plugins.multijob.MultiJobBuild.run(MultiJobBuild.java:76) at hudson.model.ResourceController.execute(ResourceController.java:98) at hudson.model.Executor.run(Executor.java:374)

          It's difficult to debug. I added debug information in this build : https://jenkins.ci.cloudbees.com/job/plugins/job/tikal-multijob-plugin/166/ (lines started with [MultiJob][Debug]). This could help to debug.

          Mathieu Cantin added a comment - It's difficult to debug. I added debug information in this build : https://jenkins.ci.cloudbees.com/job/plugins/job/tikal-multijob-plugin/166/ (lines started with [MultiJob] [Debug] ). This could help to debug.

          Simon Weber added a comment -

          Sounds good. I'll update to that revision and hopefully have an example in the next few days.

          Simon Weber added a comment - Sounds good. I'll update to that revision and hopefully have an example in the next few days.

          Simon Weber added a comment -

          So I've got my first example, but it's actually not for a multijob build! In this case, the parent build used https://wiki.jenkins-ci.org/display/JENKINS/Parameterized+Trigger+Plugin to trigger and block on the downstream build (which itself was a multijob build). The parent build was timed out after 20 minutes while the child was successful.

          So, perhaps the problem is actually with that plugin instead? My understanding is that it's a dependency of the multijob plugin. My installed version was 2.25 (I've since upgraded to 2.27 in case that fixes it).

          Here are the logs, but they're not too interesting:

          parent build:

          Waiting for the completion of platform_parallel
          Build timed out (after 20 minutes). Marking the build as aborted.
          Build was aborted
          

          child (multijob) build:

          [MultiJob] Starting job venmo_platform.
          [MultiJob] Starting job venmo_platform.
          [MultiJob] Starting job venmo_platform.
          [MultiJob][Debug] Create executor service.
          [MultiJob][Debug] Create task venmo_platform
          [MultiJob][Debug] Create task venmo_platform
          [MultiJob][Debug] Create task venmo_platform
          [MultiJob][Debug] Shutdown.
          [MultiJob][Debug] Thy to start future for venmo_platform
          [MultiJob][Debug] Thy to start future for venmo_platform
          [MultiJob][Debug] Thy to start future for venmo_platform
          [MultiJob][Debug] Timeoutexception, continue
          [MultiJob][Debug] Thy to start future for venmo_platform
          [MultiJob][Debug] Timeoutexception, continue
          [MultiJob][Debug] Thy to start future for venmo_platform
          [MultiJob][Debug] Timeoutexception, continue
          [MultiJob][Debug] Thy to start future for venmo_platform
          [MultiJob][Debug] Timeoutexception, continue
          [MultiJob][Debug] Thy to start future for venmo_platform
          [MultiJob][Debug] Timeoutexception, continue
          [MultiJob][Debug] Thy to start future for venmo_platform
          [MultiJob][Debug] Timeoutexception, continue
          [MultiJob][Debug] Thy to start future for venmo_platform
          [MultiJob][Debug] future is started
          [MultiJob][Debug] future is started
          [MultiJob][Debug] future is started
          [MultiJob][Debug] Thy to start future for venmo_platform
          [MultiJob][Debug] future is started
          [MultiJob][Debug] Thy to start future for venmo_platform
          [MultiJob][Debug] future is started
          <"snip; this repeats for a long time">
          [MultiJob][Debug] Result value is SUCCESS
          [MultiJob] Finished Build : #29045 -  of Job : venmo_platform with status : SUCCESS
          [MultiJob][Debug]  Finish execution of SUCCESS
          [MultiJob][Debug] Check phase termination.
          <"one of these for each subbuild">
          [MultiJob][Debug] Continuation calculation.
          Finished: SUCCESS
          

          Simon Weber added a comment - So I've got my first example, but it's actually not for a multijob build! In this case, the parent build used https://wiki.jenkins-ci.org/display/JENKINS/Parameterized+Trigger+Plugin to trigger and block on the downstream build (which itself was a multijob build). The parent build was timed out after 20 minutes while the child was successful. So, perhaps the problem is actually with that plugin instead? My understanding is that it's a dependency of the multijob plugin. My installed version was 2.25 (I've since upgraded to 2.27 in case that fixes it). Here are the logs, but they're not too interesting: parent build: Waiting for the completion of platform_parallel Build timed out (after 20 minutes). Marking the build as aborted. Build was aborted child (multijob) build: [MultiJob] Starting job venmo_platform. [MultiJob] Starting job venmo_platform. [MultiJob] Starting job venmo_platform. [MultiJob][Debug] Create executor service. [MultiJob][Debug] Create task venmo_platform [MultiJob][Debug] Create task venmo_platform [MultiJob][Debug] Create task venmo_platform [MultiJob][Debug] Shutdown. [MultiJob][Debug] Thy to start future for venmo_platform [MultiJob][Debug] Thy to start future for venmo_platform [MultiJob][Debug] Thy to start future for venmo_platform [MultiJob][Debug] Timeoutexception, continue [MultiJob][Debug] Thy to start future for venmo_platform [MultiJob][Debug] Timeoutexception, continue [MultiJob][Debug] Thy to start future for venmo_platform [MultiJob][Debug] Timeoutexception, continue [MultiJob][Debug] Thy to start future for venmo_platform [MultiJob][Debug] Timeoutexception, continue [MultiJob][Debug] Thy to start future for venmo_platform [MultiJob][Debug] Timeoutexception, continue [MultiJob][Debug] Thy to start future for venmo_platform [MultiJob][Debug] Timeoutexception, continue [MultiJob][Debug] Thy to start future for venmo_platform [MultiJob][Debug] future is started [MultiJob][Debug] future is started [MultiJob][Debug] future is started [MultiJob][Debug] Thy to start future for venmo_platform [MultiJob][Debug] future is started [MultiJob][Debug] Thy to start future for venmo_platform [MultiJob][Debug] future is started < "snip; this repeats for a long time" > [MultiJob][Debug] Result value is SUCCESS [MultiJob] Finished Build : #29045 - of Job : venmo_platform with status : SUCCESS [MultiJob][Debug] Finish execution of SUCCESS [MultiJob][Debug] Check phase termination. < "one of these for each subbuild" > [MultiJob][Debug] Continuation calculation. Finished: SUCCESS

          Simon Weber added a comment -

          I just triggered an identical hang with the upgraded plugin, so it doesn't look like it's fixed.

          Simon Weber added a comment - I just triggered an identical hang with the upgraded plugin, so it doesn't look like it's fixed.

          Leo Gallucci added a comment -

          Leo Gallucci added a comment - I can recreate when using https://wiki.jenkins-ci.org/display/JENKINS/Parameterized+Trigger+Plugin

          Which version of Jenkins are you running?

          Mathieu Cantin added a comment - Which version of Jenkins are you running?

          Leo Gallucci added a comment -

          Jenkins ver. 1.609.1

          Leo Gallucci added a comment - Jenkins ver. 1.609.1

          Simon Weber added a comment -

          We haven't seen this issue in a few months. We're now running Jenkins 1.628, trigger plugin 2.27, and multijob from source at https://github.com/jenkinsci/tikal-multijob-plugin/pull/65.

          Simon Weber added a comment - We haven't seen this issue in a few months. We're now running Jenkins 1.628, trigger plugin 2.27, and multijob from source at https://github.com/jenkinsci/tikal-multijob-plugin/pull/65 .

          Tony Bridges added a comment -

          We've seen this a few times, most recently in a MultiJob run calling a maven job, but previously in a Freestyle job calling maven jobs.
          We are running 1.609.3.1, CBE.
          I've opened a ticket with CB Support and cited this defect.

          Tony Bridges added a comment - We've seen this a few times, most recently in a MultiJob run calling a maven job, but previously in a Freestyle job calling maven jobs. We are running 1.609.3.1, CBE. I've opened a ticket with CB Support and cited this defect.

          Tony Bridges added a comment -

          Does the pull request conclusively correct the issue ? If so, maybe we can get it rolled into a new plugin release ?

          Tony Bridges added a comment - Does the pull request conclusively correct the issue ? If so, maybe we can get it rolled into a new plugin release ?

          Simon Weber added a comment -

          tbridges I don't think this was solved for us until I upgraded Jenkins. I'm not sure what combination of the jenkins, trigger plugin, and multijob plugin upgrades is necessary to solve it, but the environment I described earlier has been rock solid for months now.

          Simon Weber added a comment - tbridges I don't think this was solved for us until I upgraded Jenkins. I'm not sure what combination of the jenkins, trigger plugin, and multijob plugin upgrades is necessary to solve it, but the environment I described earlier has been rock solid for months now.

          How to reproduce it:

          • Configure a FreeStyle Project - freeStyle01
          • Configure a Maven Project - maven01
          • Configure a Multijob Project with 30 subjobs (enough to have time to stop slave agent) - Multijob01
          • Configure one slave agent with 1 executor - slave01
          • Run Multijob01
          • Go to Manage Jenkins/Manage Nodes and put slave01 offline or kill "java -jar slave.jar" on slave agent
          • Multijob01 hang waiting for finish task

          Ivan Fernandez Calvo added a comment - How to reproduce it: Configure a FreeStyle Project - freeStyle01 Configure a Maven Project - maven01 Configure a Multijob Project with 30 subjobs (enough to have time to stop slave agent) - Multijob01 Configure one slave agent with 1 executor - slave01 Run Multijob01 Go to Manage Jenkins/Manage Nodes and put slave01 offline or kill "java -jar slave.jar" on slave agent Multijob01 hang waiting for finish task

          Alex Taylor added a comment -

          So I would think this could happen even with a slight disconnect of the slave per Ivan's last comment.

          Is there any more information that can be provided here?

          Alex Taylor added a comment - So I would think this could happen even with a slight disconnect of the slave per Ivan's last comment. Is there any more information that can be provided here?

          Closing issue as part of tikal-multijob-plugin issues cleanup.
          If still relevant, please open a matching issue in https://github.com/jenkinsci/tikal-multijob-plugin/issues (you can refer to this issue in its description)

          Yoram Michaeli added a comment - Closing issue as part of tikal-multijob-plugin issues cleanup. If still relevant, please open a matching issue in https://github.com/jenkinsci/tikal-multijob-plugin/issues (you can refer to this issue in its description)

            chenc Chen Cohen
            simonmweber Simon Weber
            Votes:
            3 Vote for this issue
            Watchers:
            11 Start watching this issue

              Created:
              Updated:
              Resolved: