Build runs on master without executor and gets stuck

      In the ASF Jenkins, we do not allow builds to run on master. Thus, we have set the number of executors on master to zero. Even so, one build recently (https://builds.apache.org/hudson/job/Axis2/774/) got assigned to run on master. The build is configured with a label to run on one of the slaves, but somehow Jenkins assigned it incorrectly.
      Running on master also meant it got stuck, presumably due to the lack of executors, without any way of stopping it.

          [JENKINS-9688] Build runs on master without executor and gets stuck

          veithen added a comment - - edited

          Some additional observations for this issue:

          The Axis2 builds (there are 3 builds in total, one for each of 3 branches) use locks. What we see is that sometimes two (or more) builds for different branches were triggered at the same time by the completion of a common upstream build (e.g. Axiom trunk). Since the Axis2 builds use a common lock, one would expect only one of them to start execution while the others remain in the build queue. However, what sometimes happens is that two builds start execution in parallel, with one of them waiting for the lock (i.e. instead of waiting in the build queue, it is assigned to an executor and waits there).

          Here is a screenshot that shows the problem (The three Axis2 builds all use the same lock):

          http://people.apache.org/~veithen/axis2-builds.png

          While the build is waiting on the executor to acquire the lock, it is reported as running on the master. E.g. the axis2-1.6 #225 build showed:

          "Started 42 min ago
          Build is being executed for 42 min on master"

          In that particular case, the axis2-1.6 build eventually completed successfully. However, I think (I'm 90% sure) that I saw another Axis2 build that was triggered after axis2-1.6 #225 but actually started execution before it. That would mean that once blocked builds are assigned to executors, they are no longer executed in FIFO order.

          As a conclusion, I think that in order to make progress on this issue, one should first concentrate on the issues that occur when using locks:

          • The fact that builds waiting for a lock (on an executor) may be reported as running on master makes it hard to debug things if something gets stuck.
          • The fact that execution is not FIFO means that a build may appear to be stuck simply because there is a constant flow of other builds in the queue that use the same lock. (Unfortunately it is not possible to check if that is a valid explanation for the original issue reported in this JIRA)
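To illustrate the second point, here is a minimal sketch of how a build can look stuck when waiters on a shared lock are not served in FIFO order. It is not Jenkins or plugin code: the `simulate` function, the "newest first" policy, and the build names other than axis2-1.6 #225 are assumptions made up for the illustration, and the actual order in which waiting builds acquire the lock is not known from this report.

```python
from collections import deque

WATCHED = "axis2-1.6 #225"  # the build we are watching


def simulate(policy, rounds=5):
    """Return (order in which builds got the lock, builds still waiting).

    Hypothetical model: in every round one new build starts waiting for
    the shared lock, then the lock is handed to exactly one waiter.
    """
    waiting = deque([WATCHED])
    order = []
    for i in range(rounds):
        waiting.append(f"other #{i}")  # constant flow of new builds
        if policy == "fifo":
            order.append(waiting.popleft())  # oldest waiter first
        else:
            order.append(waiting.pop())      # newest waiter first (assumed non-FIFO policy)
    return order, list(waiting)


if __name__ == "__main__":
    for policy in ("fifo", "newest_first"):
        order, left = simulate(policy)
        print(policy, "ran:", order, "still waiting:", left)
```

With the FIFO policy the watched build gets the lock in the first round. With the assumed newest-first policy it is still waiting after every round, even though nothing is actually broken.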

          Note: these observations were made with the following build: Jenkins ver. 1.447-SNAPSHOT (private-01/01/2012 20:43 GMT-olamy)


          veithen added a comment -

          Still occurs (with identical symptoms) on Jenkins 1.454.


          wbauer added a comment -

          I see the same issue sporadically (Jenkins 1.454 currently).
          The build seems to be stuck in the assignment phase, and no thread or socket refers to it at all.
          The build is not visible in any queue or listing except on the job itself.
          The build shows:
          "Started 2 hr 1 min ago
          Build is being executed for null on master"
          This build is not killable, meaning that I have to restart the master to get rid of it.


          Jose Sa added a comment -

          The problem still occurs in 1.458:

          We have multiple jobs using the same lock, which represents a test environment, and none of them has a quiet period. These jobs are meant to run on a build slave, not on master, yet the build status shows:
          "Started 1 day 14 hr ago
          Build is being executed for null on master"

          Maybe activating the default quiet period (30 seconds) will help in this situation.


          Jose Sa added a comment -

          Even with delays it got stuck (maybe because I used the exact same quiet period for every job). I've now configured different quiet periods to avoid this.


          Jose Sa added a comment -

          After several similar situations in the past weeks, in which concurrent jobs use the same lock, I think the problem may be mostly between chair and keyboard, but is also due to a lack of informative messages.

          Let's assume 3 jobs A, B and C use the same lock, and the following timeline occurs:

          1. A is already running (so it has the lock) and we start B and C.
          2. B and C show in queue stating they are being blocked by A
          3. A finishes
          4. B and C already waited the 'delay' so they are ready to run, so they do run at the same time
          5. B gets the lock, so its log states that it is actively running
          6. C, on the other hand, didn't get the lock, but is running anyway and shows no output indicating that it is actively waiting for the lock
          7. Users panic when they see a job running for 2 days without any output in the log and cancel C
          8. C gets into an inconsistent state and starts showing "null" instead of real dates
          9. Jenkins needs to be restarted to get rid of C because there is no way to kill it

          We also observed that if we don't cancel C, it eventually starts executing normally when B finishes, and its log then shows that it has acquired the lock.
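The timeline above is consistent with a check-then-act race at dispatch time: B and C are both found "no longer blocked" once A releases the lock, both leave the queue, and only then do they compete for the lock. The following is a minimal sketch under that assumption. It is not the code of Jenkins or of the locks plugin; `Lock`, `dispatch` and the `atomic` flag are hypothetical names used only for this illustration.

```python
class Lock:
    """Hypothetical shared lock held by at most one build."""

    def __init__(self):
        self.holder = None

    def is_free(self):
        return self.holder is None

    def try_acquire(self, build):
        if self.holder is None:
            self.holder = build
            return True
        return False


def dispatch(queue, lock, atomic):
    """Return (state of each build on an executor, builds left in the queue)."""
    on_executor, still_queued = {}, []
    if atomic:
        # Check and acquire in one step: a build leaves the queue only
        # if it actually got the lock.
        for build in queue:
            if lock.try_acquire(build):
                on_executor[build] = "running"
            else:
                still_queued.append(build)
    else:
        # Check phase: the lock is free, so every queued build looks ready.
        ready = [b for b in queue if lock.is_free()]
        still_queued = [b for b in queue if b not in ready]
        # Act phase: all ready builds occupy an executor, but only one gets the lock.
        for build in ready:
            got_it = lock.try_acquire(build)
            on_executor[build] = "running" if got_it else "waiting for lock"
    return on_executor, still_queued


if __name__ == "__main__":
    print(dispatch(["B", "C"], Lock(), atomic=False))
    print(dispatch(["B", "C"], Lock(), atomic=True))
```

In the non-atomic case C ends up on an executor in the "waiting for lock" state, which matches step 6. In the atomic case C stays in the queue, where its blocked status would remain visible to users.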

          Bottom line: the problem is the lack of informative messages from the plugin, which leaves users without proper feedback and makes them panic.

          Workaround: Revoke users' permission to cancel builds that are using locks.


          Dimitar Sakarov added a comment -

          I still have the same issue, using Jenkins ver. 1.552 on 64-bit Linux (RHEL). Only a restart of Jenkins gets rid of these zombie jobs, which otherwise stay there forever and prevent new builds from starting.

          Daniel Beck added a comment -

          Too old and too little detail in the issue report to help further investigation, so resolving as incomplete.

          Please file a new issue if this happens with a recent version of Jenkins (no older than ten weeks or so), and provide relevant information: Installed plugins, job configuration, how this can be reproduced, etc. See also the wiki for what other information may be relevant:
          https://wiki.jenkins-ci.org/display/JENKINS/How+to+report+an+issue


          Pranav Gupta added a comment -

          Hi,

          What's the best way to run a couple of Jenkins jobs using a Groovy script? I can't find any good script for this. Any kind of help is greatly appreciated.


            stephenconnolly Stephen Connolly
            protocol7b protocol7b