Bug
Resolution: Postponed
Major
None
We have recently been experiencing an issue in which a MultiJob runs, all of its subjobs complete (and pass), and then the MultiJob gets "stuck": it never completes, and it does not respond to manual halt requests. The only way we've found to unstick it is to restart Jenkins.
Reproduction
I've set up a simplified generic example to reproduce it:
- One multijob named "parent_job"
- parent_job has one build phase
- build phase has 10 identical subjobs: subjob_1 - subjob_10
- Each subjob merely runs this simple script:
echo "start"
sleep 3
echo "done"
- Set the parent job to run every minute (check "Build periodically" and set it to "* * * * *")
This job generally runs fine, but approximately one in every 500-1000 runs hangs.
When it hangs, this is the multijob console log:
Started by timer
[EnvInject] - Loading node environment variables.
Building in workspace /var/lib/jenkins/jobs/b_parentjob/workspace
Starting build job b_subjob4.
Starting build job b_subjob9.
Starting build job b_subjob8.
Starting build job b_subjob3.
Starting build job b_subjob10.
Starting build job b_subjob6.
Starting build job b_subjob2.
Starting build job b_subjob5.
Starting build job b_subjob1.
Finished Build : #5849 of Job : b_subjob4 with status : SUCCESS
Finished Build : #5849 of Job : b_subjob9 with status : SUCCESS
Finished Build : #5849 of Job : b_subjob8 with status : SUCCESS
Finished Build : #5849 of Job : b_subjob3 with status : SUCCESS
Finished Build : #5849 of Job : b_subjob10 with status : SUCCESS
Finished Build : #5849 of Job : b_subjob6 with status : SUCCESS
Finished Build : #5849 of Job : b_subjob2 with status : SUCCESS
Finished Build : #5849 of Job : b_subjob1 with status : SUCCESS
Finished Build : #5849 of Job : b_subjob5 with status : SUCCESS
This looks essentially identical to the log from a run that does not hang.
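To check a console log like this mechanically rather than by eye, the sketch below lists any subjob that has a "Starting build job" line but no matching "Finished Build" line. The function name `unfinished_subjobs` is hypothetical, and the field positions assume the exact line format shown in the log above. It is only a rough aid for comparing logs, not part of Jenkins or the MultiJob plugin.

```shell
# Hypothetical helper: print subjobs that were started but never reported finished.
# Assumes lines of the form:
#   Starting build job <name>.
#   Finished Build : #<n> of Job : <name> with status : <status>
unfinished_subjobs() {
  log_file="$1"
  awk '
    /^Starting build job / {
      name = $4
      sub(/\.$/, "", name)      # drop the trailing period after the job name
      started[name] = 1
    }
    /^Finished Build : / {
      finished[$8] = 1          # the job name is the 8th whitespace-separated field
    }
    END {
      for (n in started)
        if (!(n in finished))
          print n
    }
  ' "$log_file" | sort
}
```

Usage: save the console output to a file and run `unfinished_subjobs console.log`. Empty output means every subjob that was started was also reported as finished, which matches the hung log above.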
Other notes
The hangs seem to coincide with periods of high CPU and/or IO load on the EC2 instance we are using. However, although I've tried to induce CPU/IO load using the 'stress' tool, I haven't been able to find a reliable set of reproduction steps.
Versions
We've reproduced this using the latest Jenkins and plugins: Jenkins 1.626, conditional-build-step 1.3.3, multijob 1.16.
I've rolled back to Jenkins 1.613, where we had previously seen stable behavior; unfortunately, the problem returned.
Any ideas?
Thanks,
Rob