Details
- Type: Bug
- Status: Open
- Priority: Major
- Resolution: Unresolved
- Component/s: core
- Environment: Jenkins Server: Ubuntu 10.4.4; Jenkins Nodes: Ubuntu 10.4.4
Description
Hello everyone.
We are experiencing a major and persistent issue with Build Executors hanging indefinitely. It affects all recent versions of Jenkins.
The bug manifests like this: after some time of successfully building jobs (sometimes days, sometimes weeks), the executors of first a few nodes, and then progressively more nodes, simply start failing.
They accept a new job, start it on one of their executors, begin calling the SCM and then just ... stop. Here's an example log output (full paths excised with angled brackets):
-----------------------------
13:57:38 Started by command line by sys_swmdev
13:57:38 Building remotely on musxbird015 in workspace /local/<path>/<project>@3
13:57:38 Checkout:<JobName>@3 / /local/<path>/<project>@3 - hudson.remoting.Channel@2aa89e44:musxbird015
13:57:38 Using strategy: Default
13:57:38 Last Built Revision: Revision 501d0dbbd090f3dd338ad107b4d84f0e35544a9c (<GIT TAG>)
-----------------------------
Even waiting for hours will not cause this to progress. Sometimes, other executors on the same node still work and other nodes can execute the same job just fine ... until they too fail, one by one. Also, the job sometimes hangs in the middle of execution instead of during the Git checkout. The load on the hung node is next to zero during all of this; the same is true for the remote Git server.
If you break the connection to the node and re-establish it (which, by the way, does not remove those jobs from the Jenkins UI; a manual cancel is necessary!), the node starts working again, at least for some time.
Only a full restart of Jenkins can resolve the issue; until it recurs some days or weeks later.
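For reference, this manual recovery can also be done from the Jenkins script console. The following is only a rough sketch of that workaround (the node name is an example taken from the log above, and the exact calls are our best guess, not a verified fix):
-----------------------------
// Rough recovery sketch for the Jenkins script console (node name is an example).
import jenkins.model.Jenkins
import hudson.slaves.OfflineCause

def computer = Jenkins.instance.getComputer('musxbird015')

// Interrupt builds stuck on this node's executors so they no longer linger in the UI.
computer.executors.findAll { it.busy }.each { executor ->
    println "Interrupting ${executor.currentExecutable}"
    executor.interrupt()
}

// Drop the remoting channel and force a reconnect.
computer.disconnect(new OfflineCause.ByCLI('channel hung, forcing reconnect'))
computer.connect(true)
-----------------------------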
All jobs are affected, even the simplest ones that don't do anything. Once an executor has hung, it does not recover. Additionally, the problem is completely independent of load: it can happen with hundreds of jobs in the queue while only a single job is executing at a time on the entire build cluster.
It is as if the server can't read/send responses from/to the nodes anymore. The machines themselves are not hanging and can be accessed normally. Additionally, the script console for these nodes also still works.
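As an illustration, one can check from the script console whether the remoting channel itself still answers; a minimal probe could look roughly like this (sketch only; node name is an example):
-----------------------------
// Channel responsiveness probe for the script console (sketch only; node name is an example).
import jenkins.model.Jenkins
import hudson.util.RemotingDiagnostics

def channel = Jenkins.instance.getComputer('musxbird015')?.channel
if (channel == null) {
    println 'No channel - node is offline.'
} else {
    // Runs a tiny Groovy script on the node; if this call hangs, the channel itself is stuck.
    println RemotingDiagnostics.executeGroovy('return InetAddress.localHost.hostName', channel)
}
-----------------------------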
Overall, this bug is extremely strange and difficult to reproduce. It happens reliably, just after a seemingly arbitrary amount of time.
I have attached a thread dump of one particular machine, as well as of the entire server, to this bug report. If you need further information to debug this, feel free to ask.
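For anyone who wants to capture the same data themselves: a node-side thread dump can be pulled over the remoting channel from the script console, roughly like this (again only a sketch; node name is an example):
-----------------------------
// Capture a thread dump of a hung node over its remoting channel (sketch; node name is an example).
import jenkins.model.Jenkins
import hudson.util.RemotingDiagnostics

def channel = Jenkins.instance.getComputer('musxbird015')?.channel
if (channel != null) {
    RemotingDiagnostics.getThreadDump(channel).each { thread, trace ->
        println "=== ${thread} ===\n${trace}"
    }
}
-----------------------------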
Update:
We've now encountered this very same bug again. Our servers are still on 1.509.3, pending full internal testing and release of the new LTS version (1.532.2).
Therefore, we can only report that the bug still exists but is (as reported before) very rarely triggered; the incidence rate has slowed to roughly once every 6 months. We will post an update if it appears in the new LTS version too, once we have deployed it.
But, given the rarity of the bug, it might be another 6 months until we can post an update.