Type: Bug
Resolution: Fixed
Priority: Critical
Reproduction case:
Create a concurrent matrix job with a user-defined axis that runs 'sleep 120' as its build step. Launch several of these jobs, enough that all available executors are taken up and there are still builds in the queue. Some of these builds will abort, with a console message similar to:
[...]
9 completed with result SUCCESS
24 completed with result SUCCESS
23 completed with result SUCCESS
2 completed with result SUCCESS
10 appears to be cancelled
10 completed with result ABORTED
25 appears to be cancelled
25 completed with result ABORTED
18 appears to be cancelled
18 completed with result ABORTED
13 appears to be cancelled
[...]
For my test case, I have 26 slaves, 82 executors, 25 sub-jobs. I can reproduce reliably if I launch 5 or more top level jobs at once.
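As a rough sanity check on why this setup fills the queue, here is a back-of-the-envelope calculation in Python using the numbers above. It assumes each top-level job schedules one build per sub-job and that each of those builds occupies exactly one executor; the variable names are only illustrative and are not taken from Jenkins.

```python
# Numbers taken from the reproduction case above.
executors = 82
sub_jobs_per_matrix = 25
top_level_jobs = 5

# Assumption: every sub-job build occupies exactly one executor.
requested = top_level_jobs * sub_jobs_per_matrix
queued = max(0, requested - executors)

print(f"{requested} sub-job builds requested, {queued} left waiting in the queue")
```

Under these assumptions, five top-level jobs ask for more builds than there are executors, so some sub-job builds have to wait in the queue, which is the condition described in the reproduction case.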
[JENKINS-13972] Concurrent matrix builds abort
This seems to be a side effect of the fix of issue 6747.
The problem only appears when the matrix jobs are started in concurrent mode. Starting the job in serial mode does not abort the axis jobs.
Next problem: the workaround given in issue 6747 no longer works, so there is currently no way to patch the job so that its builds run concurrently.
Arg, I just wanted to start using concurrent matrix builds and ran into this. I am heartened to see that the bug is open and hope we can see a resolution soon!
Code changed in jenkins
User: Kohsuke Kawaguchi
Path:
changelog.html
core/src/main/java/hudson/matrix/MatrixBuild.java
core/src/main/java/hudson/matrix/MatrixConfiguration.java
http://jenkins-ci.org/commit/jenkins/9c7ef619cc96dc0111220412e841199de71d5b8d
Log:
[FIXED JENKINS-13972]
Fixed a problem in actually making concurrent builds work.
Compare: https://github.com/jenkinsci/jenkins/compare/c2c31e2b933a...9c7ef619cc96
Integrated in jenkins_main_trunk #1812
[FIXED JENKINS-13972] (Revision 9c7ef619cc96dc0111220412e841199de71d5b8d)
Result = UNSTABLE
Kohsuke Kawaguchi : 9c7ef619cc96dc0111220412e841199de71d5b8d
Files :
- changelog.html
- core/src/main/java/hudson/matrix/MatrixConfiguration.java
- core/src/main/java/hudson/matrix/MatrixBuild.java
Matrix build started on Debian Debian-5.0.9 downstream projects executed on amd64 and i386 debian 5.0 slaves.
no change for svn://************ since the previous build
Triggering buildnode_x86.deb50
Triggering buildnode_x86_64.deb50
Configuration buildnode_x86.deb50 is still in the queue: Waiting for next available executor on build-lnx32-2.deb50
buildnode_x86.deb50 completed with result SUCCESS
appears to be cancelled
buildnode_x86_64.deb50 completed with result ABORTED
Notifying upstream build ************* #426 of job completion
All downstream projects complete!
Minimum result threshold not met for join project
Notifying upstream projects of job completion
Notifying upstream of completion: ********** #426
Finished: ABORTED
buildnode_x86_64.deb50 task log shows:
Notifying upstream projects of job completion
Finished: SUCCESS
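For anyone who wants to check their own logs for the same disagreement, below is a minimal Python sketch that compares what the parent console reports for each configuration with the "Finished:" line in that configuration's own log. The function names and regular expressions are my own and only cover the line formats visible in the two logs above; this is not a Jenkins API.

```python
import re

# Matches lines such as "buildnode_x86.deb50 completed with result SUCCESS".
PARENT_RESULT = re.compile(r"^(\S+) completed with result (\w+)$")
# Matches the closing line of a build's own log, e.g. "Finished: SUCCESS".
FINISHED = re.compile(r"^Finished: (\w+)$")


def parent_results(console_text):
    """Map configuration name -> result as reported in the parent console."""
    results = {}
    for line in console_text.splitlines():
        m = PARENT_RESULT.match(line.strip())
        if m:
            results[m.group(1)] = m.group(2)
    return results


def own_result(task_log_text):
    """Return the result from the last "Finished: ..." line, or None."""
    result = None
    for line in task_log_text.splitlines():
        m = FINISHED.match(line.strip())
        if m:
            result = m.group(1)
    return result


def mismatches(parent_console, task_logs):
    """List (name, parent_result, own_result) wherever the two disagree."""
    reported = parent_results(parent_console)
    found = []
    for name, log in task_logs.items():
        own = own_result(log)
        if name in reported and own is not None and reported[name] != own:
            found.append((name, reported[name], own))
    return found


# Example input copied from the logs above.
parent_console = """\
buildnode_x86.deb50 completed with result SUCCESS
buildnode_x86_64.deb50 completed with result ABORTED
Finished: ABORTED
"""
task_logs = {
    "buildnode_x86_64.deb50": "Notifying upstream projects of job completion\nFinished: SUCCESS\n",
}

for name, parent_says, log_says in mismatches(parent_console, task_logs):
    print(f"{name}: parent reports {parent_says}, its own log says {log_says}")
```

With the excerpts above as input, the only mismatch listed is buildnode_x86_64.deb50, which the parent marks ABORTED although its own log ends in SUCCESS.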
Jenkins master runs on Windows Server 2008
Jenkins ver. 1.492
slave java.version 1.6.0_0
master java.version 1.7
The issue was reopened, possibly because JENKINS-15587 causes a similar symptom ("job appears to be cancelled") in matrix builds.
Judging only from the Jenkins system logs, a date parse exception (JENKINS-15587) preceded the build abort notification.
Code changed in jenkins
User: Aleksas
Path:
core/src/main/java/hudson/model/Run.java
http://jenkins-ci.org/commit/jenkins/3d850711bb1a31f11c4309bd798200fbc5410764
Log:
Update core/src/main/java/hudson/model/Run.java
Handling NTFS symlinks introduced via Util.resolveSymlink.
JENKINS-15587
Also probably culprit for JENKINS-13972
We are seeing this behavior come back as well. We aren't using NTFS symlinks, so we are not sure that the recent change will address the issue for us.
I am seeing this issue for the first time today after upgrading to Jenkins 1.509.1 from the previous LTS version. I have a matrix job which runs four different flavours of build on each of three platforms (Mac, Linux and Windows). My job configuration has not changed (and in fact no code has been checked in at all since the last good build – I just started this build manually today to test Jenkins after the upgrade).
Our master is on Windows, and there are two Windows executors on the same machine. There are two Mac executors on a Mac slave, and two Linux executors on a Linux slave. All of the builds in fact complete successfully, but the master reports that all four of the Windows builds "appear to be cancelled" and then that they "completed with result ABORTED". UPDATE: I have seen similar behaviour for other matrix jobs, including some that do not run on the slaves at all. I think it is a matrix job issue, not a master/slave issue.
Changing the job configuration to make the builds run serially rather than in parallel appears to work round the problem.
UPDATE: I believe this problem is Windows only. On my Windows installation, I had to configure all my matrix jobs to run serially, so as to work round this bug. I have now moved my Jenkins master to a Mac, and I have changed all my jobs again so that they do not run serially. So far, I have not seen the problem occur even once on the Mac. (On the Mac I have Jenkins 1.509.2 installed, but I don't think there is a fix for anything like this between 1.509.1 and 1.509.2, so it's more likely to be the change of platform that has caused the improvement.)
I see a machine that aborts a job on its second slave. Both slaves start via SSH, and the machine runs a Centrify SSH server.
Other 2 machines run a regular SSH server and do not exhibit aborts on their second slaves.
We have Jenkins 1.492.
I figured out that one node configuration of my matrix job had received a "job disabled" property. Sub-projects disabled via "Configuration Slicing/Job Disabled Build Slicer (bool)" in /slicing/jobdisabledbool/ will deny requests for new runs without indicating the reason.
[USER@MASTER ~]$ diff -u /usr/local/jenkins/data/jobs/MATRIXPROJ/configurations/axis-MATRIX/HOSTNAME{X,Y}/config.xml
--- /usr/local/jenkins/data/jobs/MATRIXPROJ/configurations/axis-MATRIX/HOSTNAMEX/config.xml 2013-06-06 21:43:25.823244000 -0400
+++ /usr/local/jenkins/data/jobs/MATRIXPROJ/configurations/axis-MATRIX/HOSTNAMEY/config.xml 2013-06-07 16:22:46.529940000 -0400
@@ -7,7 +7,7 @@
 </properties>
 <scm class="hudson.scm.NullSCM"/>
 <canRoam>true</canRoam>
-<disabled>true</disabled>
+<disabled>false</disabled>
 <blockBuildWhenDownstreamBuilding>false</blockBuildWhenDownstreamBuilding>
 <blockBuildWhenUpstreamBuilding>false</blockBuildWhenUpstreamBuilding>
 <triggers class="vector"/>
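If you suspect the same cause, a small script can list every configuration whose config.xml is marked disabled, instead of diffing files pair by pair. The Python sketch below assumes the layout shown in the diff (config.xml files somewhere under the job's configurations directory) and assumes that <disabled> is a direct child of the root element; the function name is made up for illustration.

```python
import xml.etree.ElementTree as ET
from pathlib import Path


def disabled_configurations(job_dir):
    """Return config.xml paths under job_dir/configurations marked disabled."""
    hits = []
    for config in sorted(Path(job_dir).glob("configurations/**/config.xml")):
        try:
            root = ET.parse(config).getroot()
        except ET.ParseError:
            continue  # skip files ElementTree cannot parse rather than guess
        # Assumption: <disabled> sits directly under the root element.
        if (root.findtext("disabled") or "").strip() == "true":
            hits.append(config)
    return hits


# Usage, with the job directory from the diff above:
# for path in disabled_configurations("/usr/local/jenkins/data/jobs/MATRIXPROJ"):
#     print(path)
```

Any path it prints is a sub-project that will refuse new runs until it is re-enabled. Files that cannot be parsed are skipped silently, so an empty result is not proof that nothing is disabled.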
Hi all,
I've been investigating abort issues in Jenkins and I've found at least one bug related to this.
Here is my report on the subject:
https://docs.google.com/presentation/d/1ybtB-Bhkb4c3dhb5ZMArr4prtEZ-pjLqH9Vk7yhdZTg/
There is also another issue I'm dealing with and in the process of investigating.
Core developers: I'll be happy to contribute to the sources if you can give me pointers on how to modify my proposed fix so that it is 'commit worthy'.
Hi,
Is the solution suggested above by Shay Weiss reasonable? Is it going to be included in an upcoming version?
Thanks
I created a pull request with a fix, can someone please review?
https://github.com/jenkinsci/matrix-project-plugin/pull/28
thanks,
Tidhar
Putting this back to Fixed, as it was confirmed in the PR that this is no longer a problem. If someone spots a similar problem, please file a new issue.
FYI someone did spot this again and raised JENKINS-46453
(and then one of my colleagues found that bug report after encountering the same symptoms, hence my interest in it)
I was able to reproduce this by hacking one of Jenkins' unit tests as well: