
MatrixProject can run on a heavyweight executor, leading to deadlocks

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Component/s: core
    • Environment: jenkins running on linux host
      builds tied to remote mac host
      have several matrix builds

      Sometimes jobs get blocked by each other or "deadlock". We must manually cancel and restart the builds.

      IRC Transcript
      --------------
      amrox: Hello. I had a brief exchange with @jenkins on twitter yesterday http://dl.dropbox.com/u/45634/Screen%20Shot%202011-09-08%20at%205.18.05%20PM.png
      amrox: my jenkins fell into a "deadlock" state again
      amrox: it's still in that state now. what information can I provide?
      rtyler: *nudges kohsuke *
      rtyler: amrox: I'm @jenkinsci FWIW
      amrox: rtyler: hi, and thanks for the responses yesterday
      amrox: here's a screen recording... best way I could think to show the issue http://dl.dropbox.com/u/45634/Screen%20Recording%204.mov
      amrox: I think it can be solved by just adding another executor
      amrox: but I'd like to avoid that, and it seems like a bug?
      kohsuke: amrox: we need thread dump. see https://wiki.jenkins-ci.org/display/JENKINS/Build+is+hanging
      amrox: jenkins master dump: http://dl.dropbox.com/u/45634/threaddump.html
      amrox: slave: http://dl.dropbox.com/u/45634/slave-threaddump.html
      amrox: is that format acceptable? helpful at all?
      amrox: should I just file a bug?
      kohsuke: amrox: yes, that'd be great
      amrox: kohsuke: will do thanks
      kohsuke: Is "content_viewer_ios_develop_build" the hanging job?
      kohsuke: OK, so the issue is that the matrix parents are blocking the execution of their child builds
      kohsuke: but the parent is also waiting for the completion of the child builds, hence the deadlock
      kohsuke: amrox: ^^ did I get that right?
      kohsuke: the question is why content_asset_verify_auto build is occupying an executor
      amrox: yes that seems accurate
      kohsuke: content_viewer_ios_develop_build is correctly using a temporary flyweight executor
      kohsuke: ... as seen by the lack of number in the executor table
      kohsuke: amrox: I assume all those builds are tied to remote-macslave-1
      amrox: kohsuke: yes
      kohsuke: OK. We'll capture this in the ticket you'll create
      kohsuke: Thanks for bringing this to our attention, and sorry for the bug
      amrox: thanks for building and maintaining jenkins 
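
The deadlock kohsuke describes in the transcript can be reproduced in miniature outside Jenkins. The following is only an illustrative sketch, not Jenkins code: the class `MatrixDeadlockSketch`, the method `runMatrix`, and the two thread pools standing in for a node's single heavyweight executor and the temporary flyweight executor are all hypothetical names. It assumes a node with exactly one executor and uses a 500 ms timeout so the demo terminates instead of hanging.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class MatrixDeadlockSketch {

    /**
     * Runs a hypothetical matrix parent that waits for one child build.
     * Returns true if the child finished, false if the parent timed out waiting.
     */
    public static boolean runMatrix(boolean parentOnHeavyweight) throws Exception {
        // Assumption: the node has a single heavyweight executor.
        ExecutorService heavyweight = Executors.newSingleThreadExecutor();
        // Stand-in for the temporary flyweight executor that should host the parent.
        ExecutorService flyweight = Executors.newSingleThreadExecutor();
        try {
            ExecutorService parentSlot = parentOnHeavyweight ? heavyweight : flyweight;
            Future<Boolean> parent = parentSlot.submit(() -> {
                // The child build always needs a heavyweight executor.
                Future<?> child = heavyweight.submit(() -> { });
                try {
                    // The parent blocks until the child completes (bounded for the demo).
                    child.get(500, TimeUnit.MILLISECONDS);
                    return true;
                } catch (TimeoutException e) {
                    // The child never got an executor: the parent is holding it.
                    child.cancel(true);
                    return false;
                }
            });
            return parent.get();
        } finally {
            heavyweight.shutdownNow();
            flyweight.shutdownNow();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println("parent on flyweight, child completed: " + runMatrix(false));
        System.out.println("parent on heavyweight, child completed: " + runMatrix(true));
    }
}
```

When the parent sits on the flyweight stand-in, the child gets the only heavyweight slot and finishes. When the parent occupies that slot itself, the child stays queued behind it and the wait times out, which is the situation reported here, minus the timeout.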

          [JENKINS-10944] MatrixProject can run on a heavyweight executor, leading to deadlocks

          Andy M created issue -
          Jenkins IRC Bot made changes -
          Component/s New: matrix-project [ 18765 ]
          Component/s Original: matrix [ 15501 ]

          Tim Wood added a comment -

          Very much needing a solution for this issue. It basically makes matrix jobs worthless, because they may deadlock.

          I've made two observations:

          • This occurs in the absence of the Throttle Concurrent Builds plugin.
          • It does not seem to occur when the elements of the matrix do not all start at the same time. That is, if the executors on a node the matrix job may run on are currently busy, that node's build of the matrix job gets queued, and the flyweight master task of the matrix build runs along with that node's build.
          • In cases where an executor is available on each node the matrix job may run on, occasionally all builds of the matrix job will get scheduled properly, including the one that runs concurrently with the flyweight master task. But usually, the deadlock happens. So there's some race between threads here, not just a straight-line logical error.

          Tim Wood added a comment -

          OK, three observations.

          Tim Wood added a comment - - edited

          With my current plugin updates (msg me for list), the problem has gone away, except in this use case:
          1. Quiet Jenkins (announce "shut down")
          2. Wait for all running builds to complete on all nodes
          3. Queue a build of the matrix job
          4. Cancel shutdown
          5. Job begins building on all nodes except the one running the flyweight launcher task; build for that node remains queued in the deadlock condition.

          So far it works to exchange steps 3 & 4, and hope nothing else gets queued to run between them.

          Jesse Glick made changes -
          Component/s New: core [ 15593 ]
          Component/s Original: matrix-project-plugin [ 18765 ]
          Labels New: matrix
          Summary Original: Jobs "deadlocking" on remote slave New: MatrixProject can run on a heavyweight executor, leading to deadlocks

          Jesse Glick added a comment -

          I believe https://github.com/jenkinsci/jenkins/commit/6fa3acf8a9e96c385071eba594959c944150390a was not correct. If a matrix project is scheduled while Jenkins is shutting down, it should not be made buildable as a heavyweight task!
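
The rule stated in this comment can be written down as a small decision function. This is a toy model under stated assumptions, not the code from the linked commit or from Jenkins core: `QuietDownPlacementSketch`, `Placement`, `problematic`, and `suggested` are hypothetical names, and the "problematic" policy is only one reading of the behavior the comment objects to.

```java
public class QuietDownPlacementSketch {

    /** Where a queued task ends up in this toy model. */
    public enum Placement { FLYWEIGHT, HEAVYWEIGHT, WAIT }

    /**
     * Assumed problematic behavior: while Jenkins is quieting down, a flyweight
     * task (such as a matrix parent) is treated like an ordinary task and so
     * later claims a heavyweight executor.
     */
    public static Placement problematic(boolean isFlyweightTask, boolean quietingDown) {
        if (isFlyweightTask && !quietingDown) {
            return Placement.FLYWEIGHT;
        }
        return Placement.HEAVYWEIGHT;
    }

    /**
     * Behavior the comment asks for: a flyweight task is never made buildable
     * as a heavyweight task; during quiet-down it simply waits.
     */
    public static Placement suggested(boolean isFlyweightTask, boolean quietingDown) {
        if (!isFlyweightTask) {
            return Placement.HEAVYWEIGHT;
        }
        return quietingDown ? Placement.WAIT : Placement.FLYWEIGHT;
    }

    public static void main(String[] args) {
        // A matrix parent queued while shutdown has been announced.
        System.out.println("problematic: " + problematic(true, true));
        System.out.println("suggested:   " + suggested(true, true));
    }
}
```

Under the suggested policy, the matrix parent queued in step 3 of the reproduction above would wait until the shutdown is cancelled and then run as a flyweight task, leaving the node's executor free for its child build.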

          Jesse Glick made changes -
          Link New: This issue is blocking JENKINS-4873 [ JENKINS-4873 ]
          Jesse Glick made changes -
          Labels Original: matrix New: deadlock flyweight matrix queue shutdown

          Daniel Beck added a comment -

          jglick Isn't this just a duplicate of JENKINS-24519? The other has more watchers and a better network of related/duplicate issues, so I'd rather resolve this.

          (Also, ZD-21888.)

            Assignee: Jesse Glick (jglick)
            Reporter: Andy M (amrox)
            Votes: 1
            Watchers: 4