Type: Bug
Resolution: Unresolved
Priority: Critical
Jenkins 2.19.1 LTS
Jenkins 2.19.2 LTS
Jenkins 2.26
Durable Task Plugin 1.12
I hate to file a general "core" bug; I would rather direct this to the correct component. Unfortunately, I cannot identify which component is hanging or why, so I do not know where to direct it.
This problem started about two weeks ago, as we were adding new Pipeline builds to our build server, so it could be related to one of the Pipeline plugins.
The behavior is the following:
- Once or twice a day, all builds on all build slaves hang. The console log of each build stops advancing and stays stuck at the last line executed or returned.
- Once this occurs, attempting to stop a build fails: clicking stop changes neither the build status nor the console log output.
- New builds do not start. They sit in the queue, but the slaves are never started.
- The UI continues to function, so it is possible to view the configuration, get thread dumps, etc.
The only resolution is to restart the Jenkins server.
We are using the vCenter plugin to dynamically start all build slaves. However, we have been using this configuration for months, and the problem only just started.
We have reproduced this on both the latest Jenkins release (2.26) and Jenkins LTS 2.19.1.
I am attaching a thread dump of the server taken during one of these hangs.
I can provide any other information that might help in diagnosing this problem.
- is duplicated by
  - JENKINS-22824 Jenkins freezes at startup on ensureLoad call. (Resolved)
- is related to
  - JENKINS-38834 Freestyle jobs hang in 2.19.1 on Windows 10 Nodes (Resolved)
  - JENKINS-36088 Use NIO rather than JNR whenever possible (Resolved)
  - JENKINS-19445 Jobs randomly stuck with "building remotely on slave-name" message (Reopened)
  - JENKINS-16070 Deadlock using Windows native calls (Resolved)
We're also seeing deadlocks: many threads have stack traces whose deepest frame is a call to hudson.util.jna.Kernel32Utils.getWin32FileAttributes, and that call is deadlocking.
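To check whether many threads really are parked in the same call, one option is to group threads by the top frame of their stack traces. The following is only a sketch: the class name TopFrameHistogram and its method are hypothetical helpers and not part of Jenkins. It relies solely on the standard Thread.getAllStackTraces API, so it would have to run inside the affected JVM (for example from a script console) to say anything about the hang.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical diagnostic helper, not part of Jenkins.
final class TopFrameHistogram {

    /** Counts threads by the top (most recently entered) frame of each stack trace. */
    static Map<String, Integer> countByTopFrame(Map<Thread, StackTraceElement[]> traces) {
        Map<String, Integer> counts = new HashMap<>();
        for (StackTraceElement[] stack : traces.values()) {
            if (stack.length == 0) {
                continue; // no Java frames reported for this thread
            }
            // Element 0 is the top of the stack, i.e. the call the thread is currently in.
            String top = stack[0].getClassName() + "." + stack[0].getMethodName();
            counts.merge(top, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Prints one line per distinct top frame: thread count, then the frame name.
        countByTopFrame(Thread.getAllStackTraces())
            .forEach((frame, n) -> System.out.println(n + "\t" + frame));
    }
}
```

A large count against hudson.util.jna.Kernel32Utils.getWin32FileAttributes would match the pattern described here, although the histogram only shows where threads are sitting and not why they are blocked.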
As for updating the code to use java.nio.file.Files, I'm not convinced that this will affect the issue. The problem is that, on Windows, the code must detect whether a "directory" is a real directory, a symbolic link to a directory, or a Windows "Junction Point" (which is functionally identical to a symbolic link, but is not considered a symbolic link by java.nio's isSymbolicLink method).
i.e. no matter how we do this, it will require a JNA call out to Kernel32.DLL's GetFileAttributes function, so we need that call to work without deadlocking.
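For context, GetFileAttributes returns a bit mask, and the directory-or-link question comes down to testing bits in it. The sketch below shows only that interpretation step, with no native call. The class name Win32AttributeBits and the Kind enum are hypothetical and this is not the Jenkins implementation; the constant values are the ones defined in the Win32 headers.

```java
// Hypothetical helper that interprets a GetFileAttributes result; makes no native call.
final class Win32AttributeBits {
    // Constant values as defined in the Win32 headers.
    static final int INVALID_FILE_ATTRIBUTES = -1;        // 0xFFFFFFFF: the call failed
    static final int FILE_ATTRIBUTE_DIRECTORY = 0x10;
    static final int FILE_ATTRIBUTE_REPARSE_POINT = 0x400;

    enum Kind { ERROR, FILE, DIRECTORY, REPARSE_POINT }

    /** Classifies a raw attribute mask as returned by GetFileAttributes. */
    static Kind classify(int attributes) {
        if (attributes == INVALID_FILE_ATTRIBUTES) {
            return Kind.ERROR;
        }
        // Check the reparse bit first: a link to a directory has the directory bit set as well.
        if ((attributes & FILE_ATTRIBUTE_REPARSE_POINT) != 0) {
            return Kind.REPARSE_POINT;
        }
        if ((attributes & FILE_ATTRIBUTE_DIRECTORY) != 0) {
            return Kind.DIRECTORY;
        }
        return Kind.FILE;
    }
}
```

Symbolic links and junction points both set the reparse-point bit, so this check treats them alike; telling them apart would additionally need the reparse tag, which is not shown here.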
Also, I'd be quite surprised if this deadlock were unique to the GetFileAttributes function. My guess is that it affects all Kernel32 calls, and that file deletion simply hammers it the most and is therefore where most of the problems are seen.
FYI, a Windows "Junction Point" is not uncommon; on Windows they are more common than symbolic links. Creating a symbolic link on Windows is difficult because it is a privileged operation. Whilst one can downgrade it to user level, Windows ignores this for any user who is permitted to run things "as administrator", which is most people. In effect, admins have fewer rights than non-admins, which is crazy. Creating a "Junction Point", however, is trivial: any user can do it, as it is not a privileged operation.
TL;DR: people who need a symbolic link to a directory on Windows usually use a Junction Point instead of a symbolic link.