-
Bug
-
Resolution: Postponed
-
Major
-
None
-
jenkins 1.593
multijob plugin 1.16
-
Powered by SuggestiMate
With a job configuration like:
parent_job:
child_job
child_job
occasionally we'll get output that looks like this:
Starting build job child_job.
Starting build job child_job.
Finished Build : #123 - Job : child_job with status : SUCCESS
<this hangs forever, so we abort the parent_job>
Aborting all subjobs.
Finished Build : #124 - Job : child_job with status : ABORTED
However, job #124 will always have finished successfully. We see this relatively rarely (~5% of parent_job builds?). My guess is that there's a race condition causing this, since we usually see this during times of high load (ie many child_job instances are being started).
- duplicates
-
JENKINS-26678 Deadlock when running as many multijobs as there are executors
-
- Closed
-
- is duplicated by
-
JENKINS-28662 abrupt close of down stream job and child job is missing from the build panel
-
- Closed
-
-
JENKINS-29981 MultiJob doesn't detect that the job has finished
-
- Closed
-
- links to
[JENKINS-27371] Parent builds sometimes hang on successful child builds of same type
I'm not sure that has anything to do with the problem? The problem is that multijob will sometimes lose track of a child job and wait forever for its result. That forces us to abort the parent build (or else the parent job would run indefinitely).
Sorry, I see the problem. Do you have the "Execute concurrent builds if necessary" checked on child_job ?
> Sorry, I see the problem
Oh, no worries; I could have done a better job describing the problem.
> Do you have the "Execute concurrent builds if necessary" checked on child_job?
Yes; on both child_job and parent_job. I had originally suspected this only happened when job numbers were interleaved due to another concurrent build, but that didn't happen in an example I just triggered now. I'll check the logs to see if there's anything interesting.
The logs aren't too informative. My comments are in [ square brackets ].
parent_job log
Starting build job child_job.
Starting build job child_job.
Finished Build : #6631 of Job : child_job with status : SUCCESS
Build timed out (after 20 minutes). Marking the build as aborted. [ this is from a plugin we use to timeout automatically ]
Aborting all subjobs.
Finished Build : #6630 of Job : child_job with status : ABORTED [ note that the parent job knows about the correct subjob since it knows which to abort ]
Build was aborted
Finished: ABORTED
Jenkins log:
Mar 12, 2015 2:34:40 PM INFO hudson.model.Run execute child_job #6631 main build action completed: SUCCESS Mar 12, 2015 2:34:45 PM INFO hudson.model.Run execute child_job #6630 main build action completed: SUCCESS [ this subjob did finish successfully and on time, but the parent job missed its result ] <snip> Mar 12, 2015 2:36:52 PM SEVERE hudson.model.Executor run [I don't think this is relevant - it happens later than the job finishes - but I figured I'd include it ] Executor threw an exception java.util.NoSuchElementException at jenkins.model.lazy.LazyLoadRunMapEntrySet$1.next(LazyLoadRunMapEntrySet.java:76) at jenkins.model.lazy.LazyLoadRunMapEntrySet$1.next(LazyLoadRunMapEntrySet.java:63) at java.util.AbstractMap$2$1.next(AbstractMap.java:385) at hudson.util.RunList.subList(RunList.java:137) at hudson.tasks.LogRotator.perform(LogRotator.java:124) at hudson.model.Job.logRotate(Job.java:449) at hudson.model.Run.execute(Run.java:1823) at hudson.model.FreeStyleBuild.run(FreeStyleBuild.java:43) at hudson.model.ResourceController.execute(ResourceController.java:89) at hudson.model.Executor.run(Executor.java:240)
Good new (or bad). This seen a change on the API of Jenkins 1.596.1 (we have updated last friday). I work on a fix.
Is the api change you're thinking of in that specific version? We're still on version 1.593, and I think we saw this behavior on a version < 1.565.1 as well.
If something hangs, thread dumps would be helpful.
https://wiki.jenkins-ci.org/display/JENKINS/Obtaining+a+thread+dump
I finally got a thread dump! It looks like the problem is at this line: https://github.com/jenkinsci/tikal-multijob-plugin/blob/jenkins-multijob-plugin-1.16/src/main/java/com/tikal/jenkins/plugins/multijob/MultiJobBuilder.java#L216.
Here's the relevant thread dump:
"Executor #13 for qa-jenkins-psuedorunner-016 : executing platform_pr #2036" Id=257890 Group=main TIMED_WAITING on java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@385f18de
at sun.misc.Unsafe.park(Native Method)
- waiting on java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@385f18de
at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:226)
at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2082)
at java.util.concurrent.ArrayBlockingQueue.poll(ArrayBlockingQueue.java:389)
at com.tikal.jenkins.plugins.multijob.MultiJobBuilder.perform(MultiJobBuilder.java:216)
at org.jenkins_ci.plugins.run_condition.BuildStepRunner$2.run(BuildStepRunner.java:110)
at org.jenkins_ci.plugins.run_condition.BuildStepRunner$DontRun.conditionalRun(BuildStepRunner.java:264)
at org.jenkins_ci.plugins.run_condition.BuildStepRunner.perform(BuildStepRunner.java:105)
at org.jenkinsci.plugins.conditionalbuildstep.singlestep.SingleConditionalBuilder.perform(SingleConditionalBuilder.java:108)
at hudson.tasks.BuildStepMonitor$1.perform(BuildStepMonitor.java:20)
at hudson.model.AbstractBuild$AbstractBuildExecution.perform(AbstractBuild.java:770)
at hudson.model.Build$BuildExecution.build(Build.java:199)
at hudson.model.Build$BuildExecution.doRun(Build.java:160)
at hudson.model.AbstractBuild$AbstractBuildExecution.run(AbstractBuild.java:533)
at com.tikal.jenkins.plugins.multijob.MultiJobBuild$MultiJobRunnerImpl.run(MultiJobBuild.java:134)
at hudson.model.Run.execute(Run.java:1759)
at com.tikal.jenkins.plugins.multijob.MultiJobBuild.run(MultiJobBuild.java:73)
at hudson.model.ResourceController.execute(ResourceController.java:89)
at hudson.model.Executor.run(Executor.java:240)
Let me know if you need any other information from the dump – I saved it all just in case.
Here's a log message regarding the subjob that finished, but was still being waited on. My comments in [ square brackets ]:
Apr 01, 2015 5:52:20 PM INFO hudson.model.Run execute platform_parallel #2620 main build action completed: SUCCESS [ this was the job the platform_pr was waiting on ] Apr 01, 2015 5:52:20 PM INFO hudson.model.Run execute platform_parallel #2619 main build action completed: SUCCESS Apr 01, 2015 5:52:20 PM WARNING jenkins.model.lazy.LazyBuildMixIn removeRun com.tikal.jenkins.plugins.multijob.MultiJobProject@4ff2acc8[platform_parallel] did not contain platform_parallel #1620 to begin with [ coincidence? was this supposed to be #2620? ] Apr 01, 2015 5:52:20 PM SEVERE hudson.model.Executor run Executor threw an exception java.util.NoSuchElementException at jenkins.model.lazy.LazyLoadRunMapEntrySet$1.next(LazyLoadRunMapEntrySet.java:76) at jenkins.model.lazy.LazyLoadRunMapEntrySet$1.next(LazyLoadRunMapEntrySet.java:63) at java.util.AbstractMap$2$1.next(AbstractMap.java:385) at hudson.util.RunList.subList(RunList.java:137) at hudson.tasks.LogRotator.perform(LogRotator.java:124) at hudson.model.Job.logRotate(Job.java:449) at hudson.model.Run.execute(Run.java:1823) at com.tikal.jenkins.plugins.multijob.MultiJobBuild.run(MultiJobBuild.java:73) at hudson.model.ResourceController.execute(ResourceController.java:89) at hudson.model.Executor.run(Executor.java:240)
I created this pull request, https://github.com/jenkinsci/tikal-multijob-plugin/pull/65, and I think this will fix that.
Can we get some eyes on that pull request? We're still running into this issue daily.
@mcantin, I just installed your changes from source and unfortunately we're still seeing the problem. I've only seen hangs when one of the child builds failed, in case that's useful.
We're also seeing this same issue that Simon is describing and it's causing deploy headaches for us. Frequently having to kill the job and rebuild it and roll the dice to see if it's going to work. Our logs look similar to those mentioned above. Let me know if you need more info.
Jenkins 1.607
Multijob 1.16
website-deploy-production-download #10432 main build action completed: SUCCESS Jun 10, 2015 8:50:19 PM SEVERE hudson.model.Executor finish1 Executor threw an exception java.util.NoSuchElementException at jenkins.model.lazy.LazyLoadRunMapEntrySet$1.next(LazyLoadRunMapEntrySet.java:76) at jenkins.model.lazy.LazyLoadRunMapEntrySet$1.next(LazyLoadRunMapEntrySet.java:63) at java.util.AbstractMap$2$1.next(AbstractMap.java:385) at hudson.util.RunList.subList(RunList.java:137) at hudson.tasks.LogRotator.perform(LogRotator.java:124) at hudson.model.Job.logRotate(Job.java:465) at hudson.model.Run.execute(Run.java:1805) at hudson.model.FreeStyleBuild.run(FreeStyleBuild.java:43) at hudson.model.ResourceController.execute(ResourceController.java:98) at hudson.model.Executor.run(Executor.java:374)
Since ~ 1 month, I try a new fix and I don't have again this bug. Can you try it? The PR : https://github.com/jenkinsci/tikal-multijob-plugin/pull/65 and the build : https://jenkins.ci.cloudbees.com/job/plugins/job/tikal-multijob-plugin/159/.
mcantin, that fix didn't work for us: it made all builds hang.
I found this error in the logs:
Jun 19, 2015 9:34:03 PM WARNING hudson.triggers.Trigger checkTriggers org.jenkinsci.plugins.ghprb.GhprbTrigger.run() failed for hudson.model.FreeStyleProject@b347b18[venmo_platform_auto_pr] java.lang.NullPointerException at org.jenkinsci.plugins.ghprb.GhprbTrigger.run(GhprbTrigger.java:155) at hudson.triggers.Trigger.checkTriggers(Trigger.java:266) at hudson.triggers.Trigger$Cron.doRun(Trigger.java:214) at hudson.triggers.SafeTimerTask.run(SafeTimerTask.java:54) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) at java.util.concurrent.FutureTask$Sync.innerRunAndReset(FutureTask.java:351) at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:178) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:178) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:724)
simonmweber
I'm sorry. Did you update the "GitHub pull request builder plugin" (https://wiki.jenkins-ci.org/display/JENKINS/GitHub+pull+request+builder+plugin) recently? Because, the error was on this plugin with a "Free Style Project". Which version do you have?
Whoops! I pasted the wrong thing. Let me do some debugging and get back to you.
Ok, I had to fix a few unrelated problems, haha. The hanging I was seeing before was caused by Jenkins taking a long time to discard old builds. I'm not sure why that was triggered by updating the plugin and restarting – maybe I just got unlucky?
Anyway, after fixing all that other stuff and upgrading, things look good! I'll know for sure next week once we have some real load on it, but this looks promising =)
mcantin bummer; we just had a build hang.
Here are the logs, my comments <"like this"> so jira highlights them.
From the build itself:
Started by upstream project "platform_deploy_listener" build number 219 <"snip"> [MultiJob] Starting job platform_parallel. [MultiJob] Starting job platform_parallel. [MultiJob] Starting job venmo_platform_external_integration. [MultiJob] Starting job venmo_platform_external_integration. [MultiJob] Starting job scope_frontend. [MultiJob] Finished Build : #96 - deploy of Job : scope_frontend with status : SUCCESS [MultiJob] Finished Build : #6837 of Job : platform_parallel with status : SUCCESS [MultiJob] Finished Build : #6836 of Job : platform_parallel with status : SUCCESS [MultiJob] Finished Build : #1248 - deploy of Job : venmo_platform_external_integration with status : SUCCESS Build timed out (after 20 minutes). Marking the build as aborted. [MultiJob] Aborting all subjobs. [MultiJob] Aborting platform_parallel. [MultiJob] Aborting platform_parallel. [MultiJob] Aborting venmo_platform_external_integration. [MultiJob] Aborting venmo_platform_external_integration. [MultiJob] Aborting scope_frontend. Build was aborted <"snip"> [MultiJob] Finished Build : #1247 - deploy of Job : venmo_platform_external_integration with status : ABORTED <"here's the build that hung">
From Jenkins:
Jun 22, 2015 2:58:55 PM INFO venmo_platform_test_and_deploy - #863 Started by ... Jun 22, 2015 3:09:10 PM INFO hudson.model.Run execute venmo_platform_external_integration #1247 main build action completed: SUCCESS <"here we see it completed successfully"> Jun 22, 2015 3:09:11 PM INFO hudson.model.Run execute venmo_platform_external_integration #1248 main build action completed: SUCCESS Jun 22, 2015 3:09:20 PM SEVERE hudson.model.Executor finish1 Executor threw an exception java.util.NoSuchElementException <"perhaps this is related? We keep the most recent 1000 builds."> at jenkins.model.lazy.LazyLoadRunMapEntrySet$1.next(LazyLoadRunMapEntrySet.java:76) at jenkins.model.lazy.LazyLoadRunMapEntrySet$1.next(LazyLoadRunMapEntrySet.java:63) at java.util.AbstractMap$2$1.next(AbstractMap.java:385) at hudson.util.RunList.subList(RunList.java:139) at hudson.tasks.LogRotator.perform(LogRotator.java:125) at hudson.model.Job.logRotate(Job.java:467) at hudson.model.Run.execute(Run.java:1808) at hudson.model.FreeStyleBuild.run(FreeStyleBuild.java:43) at hudson.model.ResourceController.execute(ResourceController.java:98) at hudson.model.Executor.run(Executor.java:374) <"these look to be expected"> Jun 22, 2015 3:18:48 PM INFO hudson.model.Run execute platform_deploy_listener #219 aborted java.lang.InterruptedException at java.lang.Object.wait(Native Method) at java.lang.Object.wait(Object.java:503) at hudson.remoting.AsyncFutureImpl.get(AsyncFutureImpl.java:73) at hudson.plugins.parameterizedtrigger.TriggerBuilder.perform(TriggerBuilder.java:135) at hudson.tasks.BuildStepMonitor$1.perform(BuildStepMonitor.java:20) at hudson.model.AbstractBuild$AbstractBuildExecution.perform(AbstractBuild.java:779) at hudson.model.Build$BuildExecution.build(Build.java:205) at hudson.model.Build$BuildExecution.doRun(Build.java:162) at hudson.model.AbstractBuild$AbstractBuildExecution.run(AbstractBuild.java:537) at hudson.model.Run.execute(Run.java:1744) at hudson.model.FreeStyleBuild.run(FreeStyleBuild.java:43) at hudson.model.ResourceController.execute(ResourceController.java:98) at hudson.model.Executor.run(Executor.java:374) Jun 22, 2015 3:18:57 PM INFO hudson.model.Run execute venmo_platform_test_and_deploy #863 aborted java.lang.InterruptedException at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2017) at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2095) at java.util.concurrent.ArrayBlockingQueue.poll(ArrayBlockingQueue.java:389) at com.tikal.jenkins.plugins.multijob.MultiJobBuilder.perform(MultiJobBuilder.java:220) at hudson.tasks.BuildStepMonitor$1.perform(BuildStepMonitor.java:20) at hudson.model.AbstractBuild$AbstractBuildExecution.perform(AbstractBuild.java:779) at hudson.model.Build$BuildExecution.build(Build.java:205) at hudson.model.Build$BuildExecution.doRun(Build.java:162) at hudson.model.AbstractBuild$AbstractBuildExecution.run(AbstractBuild.java:537) at com.tikal.jenkins.plugins.multijob.MultiJobBuild$MultiJobRunnerImpl.run(MultiJobBuild.java:137) at hudson.model.Run.execute(Run.java:1744) at com.tikal.jenkins.plugins.multijob.MultiJobBuild.run(MultiJobBuild.java:76) at hudson.model.ResourceController.execute(ResourceController.java:98) at hudson.model.Executor.run(Executor.java:374)
It's difficult to debug. I added debug information in this build : https://jenkins.ci.cloudbees.com/job/plugins/job/tikal-multijob-plugin/166/ (lines started with [MultiJob][Debug]). This could help to debug.
Sounds good. I'll update to that revision and hopefully have an example in the next few days.
So I've got my first example, but it's actually not for a multijob build! In this case, the parent build used https://wiki.jenkins-ci.org/display/JENKINS/Parameterized+Trigger+Plugin to trigger and block on the downstream build (which itself was a multijob build). The parent build was timed out after 20 minutes while the child was successful.
So, perhaps the problem is actually with that plugin instead? My understanding is that it's a dependency of the multijob plugin. My installed version was 2.25 (I've since upgraded to 2.27 in case that fixes it).
Here are the logs, but they're not too interesting:
parent build:
Waiting for the completion of platform_parallel
Build timed out (after 20 minutes). Marking the build as aborted.
Build was aborted
child (multijob) build:
[MultiJob] Starting job venmo_platform. [MultiJob] Starting job venmo_platform. [MultiJob] Starting job venmo_platform. [MultiJob][Debug] Create executor service. [MultiJob][Debug] Create task venmo_platform [MultiJob][Debug] Create task venmo_platform [MultiJob][Debug] Create task venmo_platform [MultiJob][Debug] Shutdown. [MultiJob][Debug] Thy to start future for venmo_platform [MultiJob][Debug] Thy to start future for venmo_platform [MultiJob][Debug] Thy to start future for venmo_platform [MultiJob][Debug] Timeoutexception, continue [MultiJob][Debug] Thy to start future for venmo_platform [MultiJob][Debug] Timeoutexception, continue [MultiJob][Debug] Thy to start future for venmo_platform [MultiJob][Debug] Timeoutexception, continue [MultiJob][Debug] Thy to start future for venmo_platform [MultiJob][Debug] Timeoutexception, continue [MultiJob][Debug] Thy to start future for venmo_platform [MultiJob][Debug] Timeoutexception, continue [MultiJob][Debug] Thy to start future for venmo_platform [MultiJob][Debug] Timeoutexception, continue [MultiJob][Debug] Thy to start future for venmo_platform [MultiJob][Debug] future is started [MultiJob][Debug] future is started [MultiJob][Debug] future is started [MultiJob][Debug] Thy to start future for venmo_platform [MultiJob][Debug] future is started [MultiJob][Debug] Thy to start future for venmo_platform [MultiJob][Debug] future is started <"snip; this repeats for a long time"> [MultiJob][Debug] Result value is SUCCESS [MultiJob] Finished Build : #29045 - of Job : venmo_platform with status : SUCCESS [MultiJob][Debug] Finish execution of SUCCESS [MultiJob][Debug] Check phase termination. <"one of these for each subbuild"> [MultiJob][Debug] Continuation calculation. Finished: SUCCESS
I just triggered an identical hang with the upgraded plugin, so it doesn't look like it's fixed.
I can recreate when using https://wiki.jenkins-ci.org/display/JENKINS/Parameterized+Trigger+Plugin
We haven't seen this issue in a few months. We're now running Jenkins 1.628, trigger plugin 2.27, and multijob from source at https://github.com/jenkinsci/tikal-multijob-plugin/pull/65.
We've seen this a few times, most recently in a MultiJob run calling a maven job, but previously in a Freestyle job calling maven jobs.
We are running 1.609.3.1, CBE.
I've opened a ticket with CB Support and cited this defect.
Does the pull request conclusively correct the issue ? If so, maybe we can get it rolled into a new plugin release ?
tbridges I don't think this was solved for us until I upgraded Jenkins. I'm not sure what combination of the jenkins, trigger plugin, and multijob plugin upgrades is necessary to solve it, but the environment I described earlier has been rock solid for months now.
How to reproduce it:
- Configure a FreeStyle Project - freeStyle01
- Configure a Maven Project - maven01
- Configure a Multijob Project with 30 subjobs (enough to have time to stop slave agent) - Multijob01
- Configure one slave agent with 1 executor - slave01
- Run Multijob01
- Go to Manage Jenkins/Manage Nodes and put slave01 offline or kill "java -jar slave.jar" on slave agent
- Multijob01 hang waiting for finish task
So I would think this could happen even with a slight disconnect of the slave per Ivan's last comment.
Is there any more information that can be provided here?
Closing issue as part of tikal-multijob-plugin issues cleanup.
If still relevant, please open a matching issue in https://github.com/jenkinsci/tikal-multijob-plugin/issues (you can refer to this issue in its description)
You have the option "Kill the phase on:" for each sub-jobs on a phase. When this is on "Failure", this kill all others builds when this sub-job has failed. Select "never" to never kill others sub-jobs.
The option "Abort all other job" is when the job is aborted manually by an user.