We are experiencing a major and persistent issue with build executors hanging indefinitely. This affects all recent versions of Jenkins.
The bug manifests like this: after some time of successfully building jobs (sometimes days, sometimes weeks), the executors on a few nodes, and then on progressively more nodes, start to fail.
They accept a new job, start it on one of their executors, begin calling the SCM and then just ... stop. Here's an example log output (full paths replaced with placeholders in angle brackets):
13:57:38 Started by command line by sys_swmdev
13:57:38 Building remotely on musxbird015 in workspace /local/<path>/<project>@3
13:57:38 Checkout:<JobName>@3 / /local/<path>/<project>@3 - hudson.remoting.Channel@2aa89e44:musxbird015
13:57:38 Using strategy: Default
13:57:38 Last Built Revision: Revision 501d0dbbd090f3dd338ad107b4d84f0e35544a9c (<GIT TAG>)
Even waiting for hours will not cause this to progress. Sometimes other executors on the same node still work, and other nodes can execute the same job just fine ... until they too fail one by one. Also, sometimes the job crashes and hangs in the middle of execution instead of during the Git checkout. The load on the hung node is next to zero during all of this; the same is true for the remote Git server.
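Since the console log simply stops, one way to spot such a build from the outside is to check how old its last timestamped line is. The following is only a rough Python sketch under a few assumptions: the log lines start with an HH:MM:SS timestamp as in the excerpt above, the function name `seconds_since_last_log_line` is made up for illustration, and at most one midnight rollover lies between the last line and the current time.

```python
from datetime import datetime, timedelta


def seconds_since_last_log_line(log_lines, now):
    """Return the seconds between the last timestamped log line and `now`.

    `log_lines` is an iterable of console lines such as
    "13:57:38 Using strategy: Default"; `now` is an "HH:MM:SS" string.
    Returns None if no line carries a timestamp.
    """
    last = None
    for line in log_lines:
        stamp = line.strip()[:8]
        try:
            last = datetime.strptime(stamp, "%H:%M:%S")
        except ValueError:
            continue  # line without a leading timestamp, skip it
    if last is None:
        return None
    delta = datetime.strptime(now, "%H:%M:%S") - last
    if delta < timedelta(0):
        # Assumption: the build crossed midnight exactly once.
        delta += timedelta(days=1)
    return int(delta.total_seconds())


if __name__ == "__main__":
    sample = [
        "13:57:38 Started by command line by sys_swmdev",
        "13:57:38 Using strategy: Default",
    ]
    idle = seconds_since_last_log_line(sample, "15:57:38")
    print(f"No console output for {idle} seconds")
```

A build whose console has been silent for far longer than its usual duration, on a node with next to no load, would match the symptom described here.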
If you disconnect the node and then reconnect it, the node starts working again, at least for some time. (By the way, this does not remove the hung jobs from the Jenkins UI. A manual cancel is necessary!)
Only a full restart of Jenkins can solve this issue, until it recurs some days or weeks later.
All jobs are affected, even the most simple ones that don't do anything. As soon as an executor has hung, it does not recover. Additionally, this problem is completely independent of the load. It can happen with hundreds of jobs in the queue or with only a single job executing at a time on the entire build cluster.
It is as if the server can no longer read responses from, or send requests to, the nodes. The machines themselves are not hanging and can be accessed normally. The script console for these nodes also still works.
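For anyone who wants to watch for this condition while the master is still responsive, here is a small Python sketch that filters the JSON from the Jenkins remote API for busy executors that Jenkins itself flags as likely stuck. It rests on assumptions that should be checked against your own instance: that `/computer/api/json?depth=2` returns a `computer` list whose entries carry `displayName` and an `executors` list, and that each executor exposes `idle`, `likelyStuck` and `number`. The function name `find_stuck_executors` is hypothetical, and fetching the JSON (URL, credentials) is left out on purpose.

```python
def find_stuck_executors(computer_api_json):
    """Return (node name, executor number) pairs flagged as likely stuck.

    `computer_api_json` is the already-parsed JSON document that is
    assumed to come from <jenkins-url>/computer/api/json?depth=2.
    """
    stuck = []
    for computer in computer_api_json.get("computer", []):
        for executor in computer.get("executors", []):
            busy = not executor.get("idle", True)
            # "likelyStuck" is Jenkins' own heuristic, not a hard fact.
            if busy and executor.get("likelyStuck", False):
                stuck.append((computer.get("displayName"), executor.get("number")))
    return stuck


if __name__ == "__main__":
    # Hand-written sample data, not output from a real server.
    sample = {
        "computer": [
            {
                "displayName": "musxbird015",
                "executors": [
                    {"number": 0, "idle": True, "likelyStuck": False},
                    {"number": 2, "idle": False, "likelyStuck": True},
                ],
            }
        ]
    }
    for node, number in find_stuck_executors(sample):
        print(f"{node}: executor {number} looks stuck")
```

This only reports what the master believes about its executors; it does not explain why the node stops responding.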
Overall, this bug is extremely strange and difficult to replicate. It happens reliably, just after a seemingly arbitrary amount of time.
I have attached a thread dump of one particular machine and of the entire server to this bug report. If you need further information to debug this, feel free to ask for it.