Reproduction case:

      Create a concurrent matrix job with a user-defined axis that runs 'sleep 120' as its build step. Launch several of these jobs, enough that all available executors are taken and builds are still waiting in the queue. Some of these builds will abort, with a console message similar to:

      [...]
      9 completed with result SUCCESS
      24 completed with result SUCCESS
      23 completed with result SUCCESS
      2 completed with result SUCCESS
      10 appears to be cancelled
      10 completed with result ABORTED
      25 appears to be cancelled
      25 completed with result ABORTED
      18 appears to be cancelled
      18 completed with result ABORTED
      13 appears to be cancelled
      [...]

      For my test case, I have 26 slaves, 82 executors, and 25 sub-jobs. I can reproduce this reliably if I launch 5 or more top-level jobs at once.
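
      To see which configurations were affected in a long console log like the one above, the "completed with result" lines can be filtered mechanically. The following is only a sketch: the class and method names (`AbortedConfigurations`, `aborted`) are hypothetical and not part of Jenkins, and the line format is assumed to be exactly the one shown in the log excerpt.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Jenkins: scans matrix console output
// for configurations reported as ABORTED.
public class AbortedConfigurations {
    // Assumed line format, taken from the excerpt: "<name> completed with result <RESULT>"
    private static final Pattern COMPLETED =
            Pattern.compile("^\\s*(\\S+) completed with result (\\w+)\\s*$");

    /** Returns the names of configurations whose result is ABORTED, in log order. */
    public static List<String> aborted(List<String> consoleLines) {
        List<String> names = new ArrayList<>();
        for (String line : consoleLines) {
            Matcher m = COMPLETED.matcher(line);
            if (m.matches() && m.group(2).equals("ABORTED")) {
                names.add(m.group(1));
            }
        }
        return names;
    }

    public static void main(String[] args) {
        List<String> sample = List.of(
                "9 completed with result SUCCESS",
                "10 appears to be cancelled",
                "10 completed with result ABORTED",
                "25 appears to be cancelled",
                "25 completed with result ABORTED");
        System.out.println(aborted(sample)); // prints [10, 25]
    }
}
```

      Lines such as "13 appears to be cancelled" with no matching completion line are ignored by this sketch, since the excerpt is truncated at that point.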

          [JENKINS-13972] Concurrent matrix builds abort

          John Koleszar added a comment -

          I was able to reproduce this by hacking one of Jenkins' unit tests as well:

          diff --git a/test/src/test/groovy/hudson/matrix/MatrixProjectCustomWorkspaceTest.groovy b/test/src/test/groovy/hudson/matrix/MatrixProjectCustomWorkspaceTest.groovy
          index 1ddb195..8b6f324 100644
          --- a/test/src/test/groovy/hudson/matrix/MatrixProjectCustomWorkspaceTest.groovy
          +++ b/test/src/test/groovy/hudson/matrix/MatrixProjectCustomWorkspaceTest.groovy
          @@ -116,7 +116,7 @@ class MatrixProjectCustomWorkspaceTest extends HudsonTestCase {
                */
               def configureCustomWorkspaceConcurrentBuild(MatrixProject p) {
                   // needs sufficient parallel execution capability
          -        jenkins.numExecutors = 10
          +        jenkins.numExecutors = 4
                   jenkins.updateComputerList()
           
                   p.axes = new AxisList(new TextAxis("foo", "1", "2"))
          @@ -140,8 +140,10 @@ class MatrixProjectCustomWorkspaceTest extends HudsonTestCase {
                   // get one going
                   Thread.sleep(1000)
                   def f2 = p.scheduleBuild2(0)
          +        Thread.sleep(1000)
          +        def f3 = p.scheduleBuild2(0)
           
          -        def bs = [f1, f2]*.get().each { assertBuildStatusSuccess(it) }
          +        def bs = [f1, f2, f3]*.get().each { assertBuildStatusSuccess(it) }
                   return bs
               }
           }
          
          


          Sven Appenrodt added a comment -

          This seems to be a side effect of the fix for issue 6747.
          The problem appears only when the matrix jobs are started in concurrent mode. Starting the job in serial mode does not abort the axis jobs.
          A further problem: the workaround given in issue 6747 no longer works, so there is no longer any way to patch the job to run the builds concurrently.

          Trevor Baker added a comment -

          Arg, I just wanted to start using concurrent matrix builds and ran into this. I am heartened to see that the bug is open and hope we can see a resolution soon!


          SCM/JIRA link daemon added a comment -

          Code changed in jenkins
          User: Kohsuke Kawaguchi
          Path:
          changelog.html
          core/src/main/java/hudson/matrix/MatrixBuild.java
          core/src/main/java/hudson/matrix/MatrixConfiguration.java
          http://jenkins-ci.org/commit/jenkins/9c7ef619cc96dc0111220412e841199de71d5b8d
          Log:
          [FIXED JENKINS-13972]

          Fixed a problem in actually making concurrent builds work.

          Compare: https://github.com/jenkinsci/jenkins/compare/c2c31e2b933a...9c7ef619cc96

          dogfood added a comment -

          Integrated in jenkins_main_trunk #1812
          [FIXED JENKINS-13972] (Revision 9c7ef619cc96dc0111220412e841199de71d5b8d)

          Result = UNSTABLE
          Kohsuke Kawaguchi : 9c7ef619cc96dc0111220412e841199de71d5b8d
          Files :

          • changelog.html
          • core/src/main/java/hudson/matrix/MatrixConfiguration.java
          • core/src/main/java/hudson/matrix/MatrixBuild.java


          aleksas added a comment -

          The matrix build was started on Debian (Debian-5.0.9); the downstream projects were executed on amd64 and i386 Debian 5.0 slaves.

          no change for svn://************ since the previous build
          Triggering buildnode_x86.deb50
          Triggering buildnode_x86_64.deb50
          Configuration buildnode_x86.deb50 is still in the queue: Waiting for next available executor on build-lnx32-2.deb50
          buildnode_x86.deb50 completed with result SUCCESS
          appears to be cancelled
          buildnode_x86_64.deb50 completed with result ABORTED
          Notifying upstream build ************* #426 of job completion
          All downstream projects complete!
          Minimum result threshold not met for join project
          Notifying upstream projects of job completion
          Notifying upstream of completion: ********** #426
          Finished: ABORTED

          buildnode_x86_64.deb50 task log shows:

          Notifying upstream projects of job completion
          Finished: SUCCESS

          Jenkins master runs on Windows Server 2008
          Jenkins ver. 1.492

          slave java.version 1.6.0_0
          master java.version 1.7


          aleksas added a comment - - edited

          This issue may have been reopened because JENKINS-15587 causes similar symptoms ("job appears to be cancelled") in matrix builds.
          Judging only from the Jenkins system logs, a date parse exception (JENKINS-15587) preceded the build abort notification.


          SCM/JIRA link daemon added a comment -

          Code changed in jenkins
          User: Aleksas
          Path:
          core/src/main/java/hudson/model/Run.java
          http://jenkins-ci.org/commit/jenkins/3d850711bb1a31f11c4309bd798200fbc5410764
          Log:
          Update core/src/main/java/hudson/model/Run.java

          Handling NTFS symlinks introduced via Util.resolveSymlink.
          JENKINS-15587
          Also probably culprit for JENKINS-13972

          Jeremy Van Haren added a comment -

          We are seeing this behavior again as well. We aren't using NTFS symlinks, so I'm not sure that the recent change will address the issue for us.

          Sarah Woodall added a comment - - edited

          I am seeing this issue for the first time today after upgrading to Jenkins 1.509.1 from the previous LTS version. I have a matrix job which runs four different flavours of build on each of three platforms (Mac, Linux and Windows). My job configuration has not changed (and in fact no code has been checked in at all since the last good build – I just started this build manually today to test Jenkins after the upgrade).
          Our master is on Windows, and there are two Windows executors on the same machine. There are two Mac executors on a Mac slave, and two Linux executors on a Linux slave. All of the builds in fact complete successfully, but the master reports that all four of the Windows builds "appear to be cancelled" and then that they "completed with result ABORTED". UPDATE: I have seen similar behaviour for other matrix jobs, including some that do not run on the slaves at all. I think it is a matrix job issue, not a master/slave issue.
          Changing the job configuration to make the builds run serially rather than in parallel appears to work round the problem.

          UPDATE: I believe this problem is Windows only. On my Windows installation, I had to configure all my matrix jobs to run serially, so as to work round this bug. I have now moved my Jenkins master to a Mac, and I have changed all my jobs again so that they do not run serially. So far, I have not seen the problem occur even once on the Mac. (On the Mac I have Jenkins 1.509.2 installed, but I don't think there is a fix for anything like this between 1.509.1 and 1.509.2, so it's more likely to be the change of platform that has caused the improvement.)


          Ilguiz Latypov added a comment - - edited

          I see a machine that aborts a job on its second slave. Both slaves start via SSH, and the machine runs a Centrify SSH server.

          The other two machines run a regular SSH server and do not exhibit aborts on their second slaves.

          We have Jenkins 1.492.


          Ilguiz Latypov added a comment - - edited

          I figured out that one node configuration of my matrix job had received a "job disabled" property. Sub-projects disabled via "Configuration Slicing/Job Disabled Build Slicer (bool)" in /slicing/jobdisabledbool/ will deny requests for new runs without giving a reason.

          [USER@MASTER ~]$ diff -u /usr/local/jenkins/data/jobs/MATRIXPROJ/configurations/axis-MATRIX/HOSTNAME{X,Y}/config.xml
          --- /usr/local/jenkins/data/jobs/MATRIXPROJ/configurations/axis-MATRIX/HOSTNAMEX/config.xml  2013-06-06 21:43:25.823244000 -0400
          +++ /usr/local/jenkins/data/jobs/MATRIXPROJ/configurations/axis-MATRIX/HOSTNAMEY/config.xml  2013-06-07 16:22:46.529940000 -0400
          @@ -7,7 +7,7 @@
             </properties>
             <scm class="hudson.scm.NullSCM"/>
             <canRoam>true</canRoam>
          -  <disabled>true</disabled>
          +  <disabled>false</disabled>
             <blockBuildWhenDownstreamBuilding>false</blockBuildWhenDownstreamBuilding>
             <blockBuildWhenUpstreamBuilding>false</blockBuildWhenUpstreamBuilding>
             <triggers class="vector"/>
          

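
          The check done by hand with `diff` above can be repeated across all configurations of a matrix job. The following is a minimal sketch, assuming the on-disk layout shown in the diff (a `configurations` directory containing one `config.xml` per axis value) and a plain-text match on `<disabled>true</disabled>`. The class and method names (`DisabledConfigurations`, `findDisabled`) are hypothetical and not part of Jenkins.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper, not part of Jenkins: finds per-configuration
// config.xml files that carry <disabled>true</disabled>.
public class DisabledConfigurations {

    /** Lists config.xml files under root that contain <disabled>true</disabled>. */
    public static List<Path> findDisabled(Path root) throws IOException {
        try (Stream<Path> paths = Files.walk(root)) {
            return paths
                    .filter(Files::isRegularFile)
                    .filter(p -> p.getFileName() != null
                            && p.getFileName().toString().equals("config.xml"))
                    .filter(DisabledConfigurations::isDisabled)
                    .sorted()
                    .collect(Collectors.toList());
        }
    }

    // Plain substring match; assumes the element is written without extra whitespace,
    // as in the diff above.
    private static boolean isDisabled(Path configXml) {
        try {
            return Files.readString(configXml).contains("<disabled>true</disabled>");
        } catch (IOException e) {
            return false; // unreadable files are skipped in this sketch
        }
    }

    public static void main(String[] args) throws IOException {
        // Usage (path is an example): pass the job's "configurations" directory.
        for (Path p : findDisabled(Path.of(args[0]))) {
            System.out.println(p);
        }
    }
}
```

          A substring match is used instead of an XML parser to keep the sketch short; it only reports files, it does not change them.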

          Shay Weiss added a comment -

          Hi all,

          I've been investigating abort issues in Jenkins and have found at least one bug related to this.
          Here is my report on the subject:
          https://docs.google.com/presentation/d/1ybtB-Bhkb4c3dhb5ZMArr4prtEZ-pjLqH9Vk7yhdZTg/

          There is also another issue I'm dealing with and in the process of investigating.

          Core developers - I'll be happy to make a contribution to the sources if you can give me pointers on how to modify my proposed fix so it will be 'commit worthy'


          Tidhar Klein Orbach added a comment - - edited

          Hi

          Is the solution suggested above by Shay Weiss reasonable? Is it going to be included in an upcoming version?

          thanks


          Tidhar Klein Orbach added a comment - - edited

          I created a pull request with a fix, can someone please review?
          https://github.com/jenkinsci/matrix-project-plugin/pull/28

          thanks,
          Tidhar


          Oliver Gondža added a comment -

          Putting this back to fixed, as it was confirmed in the PR that it is no longer a problem. If someone spots a similar problem, please file a new issue.

          pjdarton added a comment -

          FYI someone did spot this again and raised JENKINS-46453

          (and then one of my colleagues found that bug report after encountering the same symptoms, hence my interest in it)


            Assignee: Unassigned
            Reporter: John Koleszar (jkoleszar)
            Votes: 25
            Watchers: 30