
MatrixProject can run on a heavyweight executor, leading to deadlocks

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Component: core
    • Environment: Jenkins running on a Linux host,
      builds tied to a remote Mac host,
      several matrix builds

      Sometimes jobs get blocked by each other or "deadlock". We must manually cancel and restart the builds.

      IRC Transcript
      --------------
      amrox: Hello. I had a brief exchange with @jenkins on twitter yesterday http://dl.dropbox.com/u/45634/Screen%20Shot%202011-09-08%20at%205.18.05%20PM.png
      amrox: my jenkins fell into a "deadlock" state again
      amrox: it's still in that state now. what information can I provide?
      rtyler: *nudges kohsuke *
      rtyler: amrox: I'm @jenkinsci FWIW
      amrox: rtyler: hi, and thanks for the responses yesterday
      amrox: heres a screen recording... best way I could think to show the issue http://dl.dropbox.com/u/45634/Screen%20Recording%204.mov
      amrox: I think it can be solved by just adding another executor
      amrox: but I'd like to avoid that, and it seems like a bug?
      kohsuke: amrox: we need thread dump. see https://wiki.jenkins-ci.org/display/JENKINS/Build+is+hanging
      amrox: jenkins master dump: http://dl.dropbox.com/u/45634/threaddump.html
      amrox: slave: http://dl.dropbox.com/u/45634/slave-threaddump.html
      amrox: is that format acceptable? helpful at all?
      amrox: should I just file a bug?
      kohsuke: amrox: yes, that'd be great
      kohsuke: jenkins-admin: create ant-plugin on github for kohsuke
      kohsuke: jenkins-admin: create javadoc-plugin on github for kohsuke
      amrox: kohsuke: will do thanks
      kohsuke: Is "content_viewer_ios_develop_build" the hanging job?
      kohsuke: OK, so the issue is that the matrix parents are blocking the execution of their child builds
      kohsuke: but the parents are also waiting for the completion of the child builds, hence the deadlock
      kohsuke: amrox: ^^ did I get that right?
      kohsuke: the question is why content_asset_verify_auto build is occupying an executor
      amrox: yes that seems accurate
      kohsuke: content_viewer_ios_develop_build is correctly using a temporary flyweight executor
      kohsuke: ... as seen by the lack of number in the executor table
      kohsuke: amrox: I assume all those builds are tied to remote-macslave-1
      amrox: kohsuke: yes
      kohsuke: OK. We'll capture this in the ticket you'll create
      kohsuke: Thanks for bringing this to our attention, and sorry for the bug
      amrox: thanks for building and maintaining jenkins 
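
The deadlock described in the transcript can be illustrated with plain Java thread pools. This is only a sketch of the general pattern, not Jenkins code: the class and method names are hypothetical, a one-thread pool stands in for a node with a single numbered executor, and a cached pool stands in for a flyweight executor. When the parent occupies the only numbered executor and waits for its child, the child can never start; when the parent runs elsewhere, the child gets the executor and both finish.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical illustration only; none of these names exist in Jenkins.
public class MatrixDeadlockSketch {

    /**
     * Runs a "parent" on parentPool. The parent schedules a "child" on
     * childPool and blocks until the child completes. Returns true if the
     * parent finished within the timeout, false if it was still stuck.
     */
    static boolean runParent(ExecutorService parentPool, ExecutorService childPool)
            throws Exception {
        Future<String> parent = parentPool.submit(() -> {
            Future<String> child = childPool.submit(() -> "child done");
            return child.get(); // parent waits for the child here
        });
        try {
            parent.get(500, TimeUnit.MILLISECONDS);
            return true;
        } catch (TimeoutException e) {
            return false;
        }
    }

    /** Parent and child compete for the same single "numbered" executor. */
    static boolean parentOnHeavyweightExecutor() throws Exception {
        ExecutorService node = Executors.newFixedThreadPool(1);
        try {
            return runParent(node, node);
        } finally {
            node.shutdownNow(); // interrupt the stuck parent so the JVM can exit
        }
    }

    /** Parent runs outside the numbered executors, so the child can start. */
    static boolean parentOnFlyweightExecutor() throws Exception {
        ExecutorService node = Executors.newFixedThreadPool(1);
        ExecutorService flyweight = Executors.newCachedThreadPool();
        try {
            return runParent(flyweight, node);
        } finally {
            node.shutdownNow();
            flyweight.shutdownNow();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println("heavyweight parent finished: " + parentOnHeavyweightExecutor());
        System.out.println("flyweight parent finished: " + parentOnFlyweightExecutor());
    }
}
```

Adding another executor, as suggested in the transcript, would let this particular run finish, but the same starvation can recur whenever every numbered executor is held by a waiting parent.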

          [JENKINS-10944] MatrixProject can run on a heavyweight executor, leading to deadlocks

          Jesse Glick added a comment -

          The reproduction case for JENKINS-24519 does not involve quieting down; it involves a slave being temporarily offline. But you may be right that the faulty code path, and thus the fix, is the same: do not proceed to p.enter(this) when p.task instanceof FlyweightTask. I am not sure what should be done instead: judging from the callers of makeBuildable, merely returning without doing anything is not correct either; we would need to actually revert to the waiting state. I will have to examine this more closely, starting by reproducing at least one of the failure modes in a test.

          "it makes sense to not schedule Flyweight tasks while quieting down if they're not NonBlockingTask, however it makes no sense for them to be treated like normal ("heavyweight") tasks afterwards"

          Agreed on both points.


          Daniel Beck added a comment -

          Jesse: You're right. I've accessed that issue so often and stopped reading the description a long time ago when I clearly shouldn't have, instead interpreting it from the code I've linked to and the behavior most relevant for me. Sorry about that.

          It seems to me that the flyweight property of Queue.Item.task needs to be taken into account somewhere besides the point where the item enters the queue (possibly in Queue.maintain(), around here?).


          Code changed in jenkins
          User: Jesse Glick
          Path:
          core/src/main/java/hudson/model/Queue.java
          test/src/test/java/hudson/model/QueueTest.java
          http://jenkins-ci.org/commit/jenkins/3e344a94a9eed316d0c351becb08287b473b6521
          Log:
          [FIXED JENKINS-10944] [FIXED JENKINS-24519] If makeBuildable fails on a FlyweightTask, keep it in queue.
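
The behaviour named in this commit message ("if makeBuildable fails on a FlyweightTask, keep it in queue") can be sketched with a toy queue. This is an assumption-laden model, not the code in Queue.java: the class, its fields, the `nodeOnline` flag and the boolean return value of `makeBuildable` are all hypothetical and chosen only to show the idea that a flyweight item which cannot be placed stays in the waiting state instead of being handed to the heavyweight path.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model only; this is NOT the Jenkins Queue implementation.
public class FlyweightQueueSketch {

    /** Hypothetical stand-in for a queued item. */
    static final class Item {
        final String name;
        final boolean flyweight;
        Item(String name, boolean flyweight) {
            this.name = name;
            this.flyweight = flyweight;
        }
    }

    final List<Item> waiting = new ArrayList<>();
    final List<Item> buildable = new ArrayList<>();  // heavyweight items awaiting a numbered executor
    final List<String> started = new ArrayList<>();  // flyweight items started on a one-off executor
    boolean nodeOnline = true;                       // assumed reason a flyweight placement can fail

    /** Returns true if the item left the waiting state, false if it must stay there. */
    boolean makeBuildable(Item item) {
        if (item.flyweight) {
            if (!nodeOnline) {
                return false; // do not fall through to the heavyweight path
            }
            started.add(item.name);
            return true;
        }
        buildable.add(item);
        return true;
    }

    /** One maintenance pass: items that could not be placed remain waiting. */
    void maintain() {
        List<Item> stillWaiting = new ArrayList<>();
        for (Item item : waiting) {
            if (!makeBuildable(item)) {
                stillWaiting.add(item);
            }
        }
        waiting.clear();
        waiting.addAll(stillWaiting);
    }

    public static void main(String[] args) {
        FlyweightQueueSketch q = new FlyweightQueueSketch();
        q.nodeOnline = false;
        q.waiting.add(new Item("matrix-parent", true));
        q.waiting.add(new Item("matrix-child", false));

        q.maintain(); // node offline: parent stays waiting, child becomes buildable
        System.out.println("waiting=" + q.waiting.size() + " buildable=" + q.buildable.size());

        q.nodeOnline = true;
        q.maintain(); // node back: parent starts on a one-off executor
        System.out.println("waiting=" + q.waiting.size() + " started=" + q.started);
    }
}
```

The point of the sketch is the early `return false`: the flyweight item is retried on a later pass rather than being queued for a numbered executor, which is the situation that led to the deadlock.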


          Code changed in jenkins
          User: Jesse Glick
          Path:
          changelog.html
          core/src/main/java/hudson/model/LoadBalancer.java
          core/src/main/java/hudson/model/Queue.java
          core/src/main/java/jenkins/model/Jenkins.java
          test/src/test/java/hudson/model/QueueTest.java
          http://jenkins-ci.org/commit/jenkins/9e333bc1e60dd82b9983135276f9379d3eb4d392
          Log:
          JENKINS-10944 JENKINS-24519 Noting merge of #1513.

          Compare: https://github.com/jenkinsci/jenkins/compare/7ce51328d515...9e333bc1e60d


          dogfood added a comment -

          Integrated in jenkins_main_trunk #3924
          [FIXED JENKINS-10944] [FIXED JENKINS-24519] If makeBuildable fails on a FlyweightTask, keep it in queue. (Revision 3e344a94a9eed316d0c351becb08287b473b6521)

          Result = SUCCESS
          jesse glick : 3e344a94a9eed316d0c351becb08287b473b6521
          Files :

          • test/src/test/java/hudson/model/QueueTest.java
          • core/src/main/java/hudson/model/Queue.java


          Jesse Glick added a comment -

          I am not sure if lts-candidate is appropriate here. Would probably want code review from kohsuke before being comfortable with that.


          Daniel Beck added a comment -

          This has been in three Jenkins releases now, so it satisfies the soaking requirement.


          Tim Wood added a comment -

          I appreciate everyone's thoughtful attention to this, thanks.


          Code changed in jenkins
          User: Jesse Glick
          Path:
          core/src/main/java/hudson/model/LoadBalancer.java
          core/src/main/java/hudson/model/Queue.java
          core/src/main/java/jenkins/model/Jenkins.java
          test/src/test/java/hudson/model/QueueTest.java
          http://jenkins-ci.org/commit/jenkins/ff26bb8a18c62a4cd008cfff935166e38ee94f6a
          Log:
          JENKINS-10944 Merge branch 'master' into FlyweightTask-JENKINS-10944

          (cherry picked from commit 1a80973ab875d12c3eda61ebb25850795d9cb6d6)


          dogfood added a comment -

          Integrated in jenkins_main_trunk #4292
          JENKINS-10944 Merge branch 'master' into FlyweightTask-JENKINS-10944 (Revision ff26bb8a18c62a4cd008cfff935166e38ee94f6a)

          Result = UNSTABLE
          ogondza : ff26bb8a18c62a4cd008cfff935166e38ee94f6a
          Files :

          • test/src/test/java/hudson/model/QueueTest.java
          • core/src/main/java/hudson/model/Queue.java
          • core/src/main/java/jenkins/model/Jenkins.java
          • core/src/main/java/hudson/model/LoadBalancer.java


            jglick Jesse Glick
            amrox Andy M