Jenkins / JENKINS-14122

Build Executors on Nodes enter infinite hang


    Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Component/s: core
    • Labels:
    • Environment:
      Jenkins Server: Ubuntu 10.4.4
      Jenkins Nodes: Ubuntu 10.4.4

      Description

      Hello everyone.

      We experience a major and persistent issue with Build Executors hanging infinitely. This affects all recent versions of Jenkins.

      The bug manifests like this: after some time of building jobs successfully (sometimes days, sometimes weeks), the executors of first some nodes, and then progressively more nodes, simply start failing.

      They accept a new job, start it on one of their executors, begin calling the SCM and then just ... stop. Here's an example log output (full paths excised with angled brackets):

      -----------------------------
      13:57:38 Started by command line by sys_swmdev
      13:57:38 Building remotely on musxbird015 in workspace /local/<path>/<project>@3
      13:57:38 Checkout:<JobName>@3 / /local/<path>/<project>@3 - hudson.remoting.Channel@2aa89e44:musxbird015
      13:57:38 Using strategy: Default
      13:57:38 Last Built Revision: Revision 501d0dbbd090f3dd338ad107b4d84f0e35544a9c (<GIT TAG>)
      -----------------------------

      Even waiting for hours will not cause this to progress. Sometimes, other executors on the same node still work, and other nodes can execute the same job just fine ... until they too fail, one by one. Also, sometimes the job crashes and hangs in the middle of execution instead of during the Git checkout. The load on the hung node is next to zero during all of this; the same is true for the remote Git server.

      If you break the connection to the node and then reconnect (which, by the way, does not remove those jobs from the Jenkins UI; a manual cancel is necessary!), the node starts working again, at least for some time.
      Only a full restart of Jenkins solves the issue, until it recurs some days or weeks later.

      All jobs are affected, even the simplest ones that don't do anything. As soon as an executor has hung, it does not recover. Additionally, the problem is completely independent of load: it can happen with hundreds of jobs in the queue while only a single job executes at a time on the entire build cluster.

      It is as if the server can't read/send responses from/to the nodes anymore. The machines themselves are not hanging and can be accessed normally. Additionally, the script console for these nodes also still works.

      Overall, this bug is extremely strange and difficult to replicate: it strikes reliably, but only after a seemingly arbitrary amount of time.

      I have attached thread dumps of one particular node and of the entire server to this bug report. If you need further information to debug this, feel free to ask.

        Attachments

          Activity

          uniqueusername Martin Schröder created issue -
          uniqueusername Martin Schröder added a comment - edited

          UPDATE:

          While the error is not easily reproducible, its signature in the thread dump is always the same. As soon as an executor begins to hang, it is stuck at exactly the same function. For example, from the thread dump attached above:

          "Pipe writer thread: channel" Id=12 Group=main WAITING on java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@9dd7e73
            at sun.misc.Unsafe.park(Native Method)
            - waiting on java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@9dd7e73
            at java.util.concurrent.locks.LockSupport.park(LockSupport.java:158)
            at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1987)
            at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:399)
            at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:947)
            at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:907)
            at java.lang.Thread.run(Thread.java:662)

          It really does seem as if this is a classic, albeit rare, race condition. This is a big show-stopper for us, as it greatly impacts our overall reliability.
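          For comparison, the frames above are the generic shape of a ThreadPoolExecutor worker parked on its task queue in LinkedBlockingQueue.take(); whether the pipe-writer thread should be idle at that moment is the open question. This minimal sketch (the class name and thread-name prefix are illustrative, not Jenkins code) produces a thread in the same WAITING state with the same take()/getTask() frames:

          ```java
          import java.lang.management.ManagementFactory;
          import java.lang.management.ThreadInfo;
          import java.util.concurrent.ExecutorService;
          import java.util.concurrent.Executors;

          public class ParkedWorkerDemo {
              public static void main(String[] args) throws InterruptedException {
                  // One worker thread, no queued work: after its first task it parks
                  // inside LinkedBlockingQueue.take(), just like the dump above.
                  ExecutorService pool = Executors.newFixedThreadPool(1);
                  pool.submit(() -> { });      // ensure the worker thread exists
                  Thread.sleep(300);           // give it time to return to take()

                  for (ThreadInfo info : ManagementFactory.getThreadMXBean()
                                                          .dumpAllThreads(false, false)) {
                      if (info.getThreadName().startsWith("pool-")) {
                          // Expected state: WAITING, parked on the queue's ConditionObject
                          System.out.println(info.getThreadName() + ": " + info.getThreadState());
                          for (StackTraceElement frame : info.getStackTrace()) {
                              System.out.println("    at " + frame);
                          }
                      }
                  }
                  pool.shutdownNow();
              }
          }
          ```

          A dump like this is normal for an idle worker; it only indicates a hang if the thread is parked there while work is known to be outstanding, which is what makes this bug hard to diagnose from the stack alone.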

          evernat evernat added a comment -

          Is it reproduced with a recent Jenkins version?

          uniqueusername Martin Schröder added a comment -

          > Is it reproduced with a recent Jenkins version?

          We have not seen it recur since switching to the LTS release 1.509.1 some weeks ago.
          Before that, we followed the non-LTS releases relatively closely. The last versions we used before the switch to LTS were 1.497 and 1.501; one of them experienced the problem at least once.

          As said in the bug description, this is a very rarely occurring bug. Even before, we sometimes had half a year of stable runtime before it struck and necessitated a restart. Apart from that, the uptime since the last restart does not seem to matter; the issue can strike even hours after a restart.

          In short: We have not seen it in the LTS-release, but we can't say for certain that the bug is gone.

          mhschroe Martin Schröder added a comment - edited

          Update:

          We've now encountered this very same bug again. Our servers are still on 1.509.3, pending full internal testing and release of the new LTS version (1.532.2).

          Therefore, we can only report that the bug still exists, but is (as reported before) very rarely triggered. Incidence rate has slowed down to roughly once in 6 months. We will post an update if it appears in the new LTS version too, once we have deployed it.

          But, given the rarity of the bug, it might be another 6 months until we can post an update.

          rtyler R. Tyler Croy made changes -
          Workflow JNJira [ 144753 ] JNJira + In-Review [ 176186 ]

            People

            Assignee:
             Unassigned
            Reporter:
            uniqueusername Martin Schröder
             Votes:
             8
             Watchers:
             11

              Dates

              Created:
              Updated: