Jenkins / JENKINS-39179

All builds hang, JNA load deadlock on Windows slave

      I hate to create a general "core" bug, as I wish I could direct this to the correct component. Unfortunately, I cannot identify which component is hanging or why, so I do not know where to file this problem.

      This problem started about two weeks ago, as we have been adding new Pipeline builds to our build server, so it could be related to one of the Pipeline plugins.

      The behavior is the following:

      • Once or twice a day, all builds on all build slaves hang. The console log of each build simply stops advancing and stays stuck at the last line executed / last line returned.
      • Once this occurs, attempting to stop a build fails. Clicking stop produces no change in the build status or console log output.
      • New builds will not start. They sit in the queue, but the slaves will not be started.
      • The UI continues to function, so it is possible to view config, get thread dumps, etc. (a minimal sketch of grabbing /threadDump programmatically follows this list).
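
      For anyone hitting the same hang, here is a minimal sketch of snapshotting the master's /threadDump page while builds are stuck. The base URL is a placeholder and anonymous read access is assumed; a real setup would likely add an API token via basic auth.

      	import java.io.BufferedReader;
      	import java.io.InputStreamReader;
      	import java.net.HttpURLConnection;
      	import java.net.URL;

      	// Sketch only: fetch /threadDump from a (possibly hung) master.
      	public class ThreadDumpSnapshot {
      	    public static void main(String[] args) throws Exception {
      	        // Placeholder URL; substitute your own master.
      	        URL url = new URL("http://jenkins.example.com/threadDump");
      	        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
      	        conn.setConnectTimeout(5_000);   // fail fast if the UI is truly dead
      	        conn.setReadTimeout(10_000);
      	        try (BufferedReader in = new BufferedReader(
      	                new InputStreamReader(conn.getInputStream()))) {
      	            for (String line; (line = in.readLine()) != null; ) {
      	                System.out.println(line);
      	            }
      	        }
      	    }
      	}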

      The only resolution is to restart the Jenkins server.

      We are using the vCenter plugin to dynamically start all build slaves. However, we have been using this configuration for months, and the problem only just started.

      We have reproduced this on both the latest Jenkins release (2.26) and the Jenkins LTS release (2.19.1).

      I am attaching a thread dump of the server taken at the time of one of these hangs.

      I can provide any other information that might help in diagnosing this problem.

          [JENKINS-39179] All builds hang, JNA load deadlock on Windows slave

          Greg Smith created issue -
          Greg Smith made changes -
          Component/s New: durable-task-plugin [ 18622 ]
          Component/s Original: core [ 15593 ]
          Greg Smith made changes -
          Environment Original: Jenkins 2.19.1 LTS and Jenkins 2.26
          New: Jenkins 2.19.1 LTS and Jenkins 2.26, Durable Task Plugin 1.12
          Greg Smith made changes -
          Summary Original: All builds hang, Builds cannot be stopped, only restart solves New: All builds hang, Builds cannot be stopped, hung FileMonitoringCleanup

          Greg Smith added a comment -

          I have been tracking this down via /threadDump to figure out what was hung, and it seems that the problem lies in the attempt to deleteRecursive within the FileMonitoringController on Windows 10.

          Here is the relevant part of the thread dump that led me to look there:

          "jenkins.util.Timer [#1] / waiting for hudson.remoting.Channel@64ce159:Windows10Slave-1_84a5e84b-395e-4e76-af9a-083bda1c8258" Id=21 Group=main TIMED_WAITING on hudson.remoting.UserRequest@3d7106cf
          	at java.lang.Object.wait(Native Method)
          	-  waiting on hudson.remoting.UserRequest@3d7106cf
          	at hudson.remoting.Request.call(Request.java:147)
          	at hudson.remoting.Channel.call(Channel.java:796)
          	at hudson.FilePath.act(FilePath.java:985)
          	at hudson.FilePath.act(FilePath.java:974)
          	at hudson.FilePath.deleteRecursive(FilePath.java:1176)
          	at org.jenkinsci.plugins.durabletask.FileMonitoringTask$FileMonitoringController.cleanup(FileMonitoringTask.java:171)
          	at org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep$Execution.check(DurableTaskStep.java:288)
          	at org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep$Execution.run(DurableTaskStep.java:234)
          	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
          	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
          	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
          	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
          	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
          	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
          	at java.lang.Thread.run(Thread.java:745)
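
          The key detail in this dump: the DurableTaskStep checks for every running build share the jenkins.util.Timer pool, which has a small fixed number of threads, so a single check blocked inside a remoting call can starve the checks for every other build. Below is a minimal, self-contained illustration of that starvation pattern; the names are mine, and this is not Jenkins source.

          	import java.util.concurrent.CountDownLatch;
          	import java.util.concurrent.Executors;
          	import java.util.concurrent.ScheduledExecutorService;
          	import java.util.concurrent.TimeUnit;

          	// Stand-in for jenkins.util.Timer (similarly bounded in core).
          	public class TimerStarvationDemo {
          	    public static void main(String[] args) throws Exception {
          	        ScheduledExecutorService timer = Executors.newScheduledThreadPool(1);

          	        // The "stuck slave" check: blocks forever, like the
          	        // deleteRecursive call waiting on hudson.remoting.UserRequest.
          	        timer.scheduleWithFixedDelay(() -> {
          	            try {
          	                new CountDownLatch(1).await(); // never released
          	            } catch (InterruptedException ignored) { }
          	        }, 0, 1, TimeUnit.SECONDS);

          	        // The check for every other build: it never gets a turn.
          	        timer.scheduleWithFixedDelay(
          	                () -> System.out.println("healthy build check ran"),
          	                0, 1, TimeUnit.SECONDS);

          	        Thread.sleep(5_000); // nothing prints: the pool is starved
          	        timer.shutdownNow();
          	    }
          	}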
          

          To summarize the steps that I took:

          • All of my builds (about 8 total, 1 per slave; all dynamic slaves, 3 Linux and 5 Windows 10) were hung. The console logs for all were frozen, and none could be stopped via the UI.
          • I investigated /threadDump and found several Timer threads all waiting on a specific slave to complete a deleteRecursive call.
          • I logged into that slave and killed the running java.exe process.
          • All other slaves immediately came back, and their builds continued.

          If you look at the thread dump on the machine that was hung (captured before I logged in and killed it), it looks like this:

          "pool-1-thread-13 for channel" Id=72 Group=main RUNNABLE
          	at hudson.util.jna.Kernel32Utils.getWin32FileAttributes(Kernel32Utils.java:77)
          	at hudson.util.jna.Kernel32Utils.isJunctionOrSymlink(Kernel32Utils.java:98)
          	at hudson.Util.isSymlink(Util.java:507)
          	at hudson.FilePath.deleteRecursive(FilePath.java:1199)
          	at hudson.FilePath.access$1000(FilePath.java:195)
          	at hudson.FilePath$14.invoke(FilePath.java:1179)
          	at hudson.FilePath$14.invoke(FilePath.java:1176)
          	at hudson.FilePath$FileCallableWrapper.call(FilePath.java:2731)
          	at hudson.remoting.UserRequest.perform(UserRequest.java:153)
          	at hudson.remoting.UserRequest.perform(UserRequest.java:50)
          	at hudson.remoting.Request$2.run(Request.java:332)
          	at hudson.remoting.InterceptingExecutorService$1.call(InterceptingExecutorService.java:68)
          	at java.util.concurrent.FutureTask.run(Unknown Source)
          	at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
          	at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
          	at java.lang.Thread.run(Unknown Source)
          
          	Number of locked synchronizers = 1
          	- java.util.concurrent.ThreadPoolExecutor$Worker@e0a4e7f
          

          That is very interesting, as we have not had this problem before – but just last week I updated these Windows 10 machines to the Windows 10 Anniversary Update.

          That was right around the time that our "all builds hang" problem started.

          I think this could actually be two problems:

          • A hang in the cleanup() task of FileMonitoringController should not cause all other Pipeline builds to freeze (a sketch of a timeout guard follows this list)
          • getWin32FileAttributes() has a deadlock issue on Windows 10 with the latest updates
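
          To illustrate the first point: a rough sketch of the kind of timeout guard that would keep one wedged agent from pinning a shared thread indefinitely. All names here are hypothetical – this is not the actual durable-task code or fix.

          	import java.util.concurrent.ExecutorService;
          	import java.util.concurrent.Executors;
          	import java.util.concurrent.Future;
          	import java.util.concurrent.TimeUnit;
          	import java.util.concurrent.TimeoutException;

          	// Hypothetical guard: run the remote cleanup on a separate pool and
          	// give up after a bounded wait instead of blocking a shared thread.
          	public class BoundedCleanup {
          	    private static final ExecutorService CLEANUP_POOL =
          	            Executors.newCachedThreadPool();

          	    static void cleanupWithTimeout(Runnable remoteCleanup) {
          	        Future<?> f = CLEANUP_POOL.submit(remoteCleanup);
          	        try {
          	            f.get(5, TimeUnit.SECONDS);
          	        } catch (TimeoutException e) {
          	            f.cancel(true); // abandon this attempt; retry later
          	            System.err.println("cleanup timed out, will retry");
          	        } catch (Exception e) {
          	            System.err.println("cleanup failed: " + e);
          	        }
          	    }

          	    public static void main(String[] args) {
          	        cleanupWithTimeout(() -> {
          	            // stand-in for the wedged getWin32FileAttributes call
          	            try { Thread.sleep(Long.MAX_VALUE); } catch (InterruptedException ignored) { }
          	        });
          	        CLEANUP_POOL.shutdownNow(); // interrupt the stuck worker and exit
          	    }
          	}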

          Greg Smith made changes -
          Summary Original: All builds hang, Builds cannot be stopped, hung FileMonitoringCleanup New: All builds hang, hung FileMonitoringTask.cleanup / get attributes on Windows 10
          Greg Smith made changes -
          Component/s New: core [ 15593 ]
          Greg Smith made changes -
          Component/s Original: core [ 15593 ]

          Greg Smith added a comment -

          Another very interesting point of information:

          I looked at the code for getWin32FileAttributes() – and it takes different code paths based on the length of the path (see the sketch below).

          The reason we updated these slaves to the Windows 10 Anniversary Update was that this new build of Windows 10 includes support for long file paths.

          That is, Windows 10 is supposed to support long file paths natively now – if you enable it, as described here:
          https://mspoweruser.com/ntfs-260-character-windows-10/

          We have enabled this long-file-path ability in our Windows 10 slave image. We did this because of JENKINS-38706.
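
          For context, the branching in question looks roughly like the sketch below. This is a paraphrase of hudson.util.jna.Kernel32Utils, not a verbatim copy of core; details may differ by version.

          	import java.io.File;
          	import java.io.IOException;
          	import com.sun.jna.WString;
          	import hudson.util.jna.Kernel32;

          	public class Win32Attributes {
          	    // Paraphrased: pick a path form based on length, then call the
          	    // Win32 API through JNA.
          	    public static int getWin32FileAttributes(File file) throws IOException {
          	        String canonicalPath = file.getCanonicalPath();
          	        String path;
          	        if (canonicalPath.length() < 260) {
          	            path = canonicalPath;                                // short path, as-is
          	        } else if (canonicalPath.startsWith("\\\\")) {
          	            path = "\\\\?\\UNC\\" + canonicalPath.substring(2);  // long UNC path
          	        } else {
          	            path = "\\\\?\\" + canonicalPath;                    // long local path
          	        }
          	        return Kernel32.INSTANCE.GetFileAttributesW(new WString(path));
          	    }
          	}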

          So it's possible that:

          • We tried to work around issues caused by JENKINS-38706 on Windows by using this new feature
          • Enabling the feature causes the JNA call getWin32FileAttributes() to hang
          • That hang on the Windows slave then caused all current builds to hang


          Greg Smith added a comment -

          I can confirm that disabling long file name support on Windows 10 Anniversary Update does not fix the problem.

          I have since reset that registry entry – and we are still getting the lockups.

          Attaching a new thread dump. In this dump, here is the new point of failure / lockup:

          "pool-1-thread-6 for channel" Id=26 Group=main RUNNABLE
          	at hudson.util.jna.Kernel32Utils.getWin32FileAttributes(Kernel32Utils.java:77)
          	at hudson.util.jna.Kernel32Utils.isJunctionOrSymlink(Kernel32Utils.java:98)
          	at hudson.Util.isSymlink(Util.java:507)
          	at hudson.FilePath.deleteRecursive(FilePath.java:1199)
          	at hudson.FilePath.access$1000(FilePath.java:195)
          	at hudson.FilePath$14.invoke(FilePath.java:1179)
          	at hudson.FilePath$14.invoke(FilePath.java:1176)
          	at hudson.FilePath$FileCallableWrapper.call(FilePath.java:2731)
          	at hudson.remoting.UserRequest.perform(UserRequest.java:153)
          	at hudson.remoting.UserRequest.perform(UserRequest.java:50)
          	at hudson.remoting.Request$2.run(Request.java:332)
          	at hudson.remoting.InterceptingExecutorService$1.call(InterceptingExecutorService.java:68)
          	at java.util.concurrent.FutureTask.run(Unknown Source)
          	at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
          	at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
          	at java.lang.Thread.run(Unknown Source)
          
          	Number of locked synchronizers = 1
          	- java.util.concurrent.ThreadPoolExecutor$Worker@41f7efd6
          

          Again, this doesn't just lock up one single node – it prevents all builds from proceeding.

          This is pretty much driving my team crazy: I have to watch for these lockups and then go kill the node that is stuck at getWin32FileAttributes() so that any other builds can progress.


            Assignee: Unassigned
            Reporter: Greg Smith (gregcovertsmith)
            Votes: 4
            Watchers: 18