
All builds hang, JNA load deadlock on Windows slave

      I hate to create a general "core" bug, as I wish I could redirect this to the correct component. Unfortunately, I cannot identify which component is hanging or why, so I do not know where to direct this problem.

      This problem started about 2 weeks ago, as we have been adding new Pipeline builds to our build server. So it could be related to one of the pipeline plugins.

      The behavior is the following:

      • Once or twice a day, all builds on all build slaves hang. The console log of each build stops moving forward and stays stuck at the last line executed / last line returned.
      • Once this occurs, attempting to stop a build fails. Clicking stop results in no change in the build status or console log output.
      • New builds will not start. They sit in the queue, but the slaves are never started.
      • The UI continues to function, so it is possible to view config, get threaddumps, etc.

      The only resolution is to restart the Jenkins server.

      We are using the vCenter plugin to dynamically start all build slaves. However, we have been using this configuration for months, and the problem only just started.

      We have recreated this on both the latest Jenkins release (2.26) and Jenkins LTS version 2.19.1.

      I am attaching a threaddump of the server at the time of one of these hangs.

      I can provide any other information that might help in diagnosing this problem.

          [JENKINS-39179] All builds hang, JNA load deadlock on Windows slave

          Greg Smith added a comment -

          I have been tracking this down via the /threadDump, to figure out what was hung. And it seems that the problem lies in the attempt to deleteRecursive within the FileMonitoringController on Windows 10.

          Here is an example part of the thread dump that led me to look there:

          "jenkins.util.Timer [#1] / waiting for hudson.remoting.Channel@64ce159:Windows10Slave-1_84a5e84b-395e-4e76-af9a-083bda1c8258" Id=21 Group=main TIMED_WAITING on hudson.remoting.UserRequest@3d7106cf
          	at java.lang.Object.wait(Native Method)
          	-  waiting on hudson.remoting.UserRequest@3d7106cf
          	at hudson.remoting.Request.call(Request.java:147)
          	at hudson.remoting.Channel.call(Channel.java:796)
          	at hudson.FilePath.act(FilePath.java:985)
          	at hudson.FilePath.act(FilePath.java:974)
          	at hudson.FilePath.deleteRecursive(FilePath.java:1176)
          	at org.jenkinsci.plugins.durabletask.FileMonitoringTask$FileMonitoringController.cleanup(FileMonitoringTask.java:171)
          	at org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep$Execution.check(DurableTaskStep.java:288)
          	at org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep$Execution.run(DurableTaskStep.java:234)
          	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
          	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
          	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
          	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
          	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
          	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
          	at java.lang.Thread.run(Thread.java:745)
          

          To summarize the steps that I took:

          • All of my builds (about 8 total, 1 per slave, all dynamic slaves, 3 Linux, 5 Windows 10) were hung. The console logs for all were frozen, none could be stopped via UI.
          • I investigated the /threadDump, and found that there were several Timers all waiting on a specific slave to complete a "deleteRecursive" call.
          • I logged into that slave, and killed the running java.exe process.
          • All other slaves immediately came back, and their builds continued.
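          The second step above (finding which slave the Timer threads are waiting on) can be sketched in code. This is a minimal illustration, not part of Jenkins: the class name WaitingChannelCounter and the method countWaiters are hypothetical, and the regular expression assumes thread names follow the ".../ waiting for hudson.remoting.Channel@<id>:<slave name>" pattern seen in the dump above. It takes the text of a thread dump and counts, per slave, how many master threads are blocked waiting on that slave's channel.

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class WaitingChannelCounter {
    // Assumed format, taken from the thread name shown in the dump above:
    // "<thread> / waiting for hudson.remoting.Channel@<hex>:<slave name>"
    private static final Pattern WAITING = Pattern.compile(
            "^\"[^\"]* / waiting for hudson\\.remoting\\.Channel@[0-9a-f]+:([^\"]+)\"");

    /** Counts, per slave name, the threads whose name says they are waiting on that slave's channel. */
    public static Map<String, Integer> countWaiters(String threadDump) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : threadDump.split("\\R")) {
            Matcher m = WAITING.matcher(line.trim());
            if (m.find()) {
                counts.merge(m.group(1), 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Hypothetical two-thread excerpt in the same shape as the real dump.
        String dump =
                "\"jenkins.util.Timer [#1] / waiting for hudson.remoting.Channel@64ce159:Windows10Slave-1\" Id=21 Group=main TIMED_WAITING\n"
              + "\tat java.lang.Object.wait(Native Method)\n"
              + "\"jenkins.util.Timer [#2] / waiting for hudson.remoting.Channel@64ce159:Windows10Slave-1\" Id=22 Group=main TIMED_WAITING\n";
        countWaiters(dump).forEach((slave, n) -> System.out.println(slave + ": " + n + " waiting thread(s)"));
    }
}
```

          A slave with several waiting threads is the first candidate to inspect, as was done by hand here.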

          If you go look at the thread dump on the machine that was hung (before I logged in and killed it) – it looks like this:

          "pool-1-thread-13 for channel" Id=72 Group=main RUNNABLE
          	at hudson.util.jna.Kernel32Utils.getWin32FileAttributes(Kernel32Utils.java:77)
          	at hudson.util.jna.Kernel32Utils.isJunctionOrSymlink(Kernel32Utils.java:98)
          	at hudson.Util.isSymlink(Util.java:507)
          	at hudson.FilePath.deleteRecursive(FilePath.java:1199)
          	at hudson.FilePath.access$1000(FilePath.java:195)
          	at hudson.FilePath$14.invoke(FilePath.java:1179)
          	at hudson.FilePath$14.invoke(FilePath.java:1176)
          	at hudson.FilePath$FileCallableWrapper.call(FilePath.java:2731)
          	at hudson.remoting.UserRequest.perform(UserRequest.java:153)
          	at hudson.remoting.UserRequest.perform(UserRequest.java:50)
          	at hudson.remoting.Request$2.run(Request.java:332)
          	at hudson.remoting.InterceptingExecutorService$1.call(InterceptingExecutorService.java:68)
          	at java.util.concurrent.FutureTask.run(Unknown Source)
          	at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
          	at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
          	at java.lang.Thread.run(Unknown Source)
          
          	Number of locked synchronizers = 1
          	- java.util.concurrent.ThreadPoolExecutor$Worker@e0a4e7f
          

          That is very interesting – as we have not had this problem before, but just last week I updated these Windows 10 machines to the Windows 10 anniversary update.

          That was right around the time that our "all builds hang" problem started.

          I do think that this could be 2 problems:

          • A hang in the cleanup() task of FileMonitoringController should not cause all other pipeline builds to freeze
          • getWin32FileAttributes() has a deadlock issue on Windows 10 with latest updates.


          Greg Smith added a comment -

          Another very interesting point of information:

          I looked at the code for getWin32FileAttributes() – and it is taking different code paths based on the length of the path.

          The reason we updated these slaves to Windows 10 Anniversary Edition was that this new build of Windows 10 includes support for long file paths.

          I.e., Windows 10 is now supposed to support long file paths natively, if you enable it as described here:
          https://mspoweruser.com/ntfs-260-character-windows-10/

          We have enabled this long file path ability in our Windows 10 slave image. We did this because of JENKINS-38706

          So it's possible that:

          • We tried to work around issues caused by JENKINS-38706 on Windows by using this new feature
          • Enabling the feature causes the JNA Windows call getWin32FileAttributes to hang
          • That hang on the Windows slave then caused all current builds to hang


          Greg Smith added a comment -

          I can confirm that disabling long file name support on Windows 10 Anniversary Update does not fix the problem.

          I have since reset that registry entry – and we are still getting the lockups.

          Attaching a new threaddump. In this thread dump, here is the new point of failure / lockup:

          "pool-1-thread-6 for channel" Id=26 Group=main RUNNABLE
          	at hudson.util.jna.Kernel32Utils.getWin32FileAttributes(Kernel32Utils.java:77)
          	at hudson.util.jna.Kernel32Utils.isJunctionOrSymlink(Kernel32Utils.java:98)
          	at hudson.Util.isSymlink(Util.java:507)
          	at hudson.FilePath.deleteRecursive(FilePath.java:1199)
          	at hudson.FilePath.access$1000(FilePath.java:195)
          	at hudson.FilePath$14.invoke(FilePath.java:1179)
          	at hudson.FilePath$14.invoke(FilePath.java:1176)
          	at hudson.FilePath$FileCallableWrapper.call(FilePath.java:2731)
          	at hudson.remoting.UserRequest.perform(UserRequest.java:153)
          	at hudson.remoting.UserRequest.perform(UserRequest.java:50)
          	at hudson.remoting.Request$2.run(Request.java:332)
          	at hudson.remoting.InterceptingExecutorService$1.call(InterceptingExecutorService.java:68)
          	at java.util.concurrent.FutureTask.run(Unknown Source)
          	at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
          	at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
          	at java.lang.Thread.run(Unknown Source)
          
          	Number of locked synchronizers = 1
          	- java.util.concurrent.ThreadPoolExecutor$Worker@41f7efd6
          

          Again, this doesn't just lock up a single node – it blocks all builds from proceeding.

          This is pretty much driving my team crazy, as I have to watch for these lockups, then go kill the node that is stuck at "getWin32FileAttributes()" in order for any other builds to progress.


          Greg Smith added a comment -

          I can now confirm that this same lockup is happening on all versions of Windows 10, on both JRE 7 and 8.

          This is killing us. The problem seems to be that the JNA libraries are loaded by the getWin32FileAttributes call and by the swap space check at the same time, causing a deadlock on the slave, which in turn hangs all pipeline builds.


          Greg Smith added a comment -

          Recreated again on the just-released Jenkins 2.19.2 LTS version.

          Attached the thread dump from that slave.

          The locked slave is stuck, trying to load the SwapSpaceMonitor:

          "pool-1-thread-5 for channel" Id=23 Group=main RUNNABLE
          	at com.sun.jna.Pointer.<clinit>(Pointer.java:41)
          	at com.sun.jna.Structure.<clinit>(Structure.java:2078)
          	at org.jvnet.hudson.Windows.monitor(Windows.java:42)
          	at hudson.node_monitors.SwapSpaceMonitor$MonitorTask.call(SwapSpaceMonitor.java:124)
          	at hudson.node_monitors.SwapSpaceMonitor$MonitorTask.call(SwapSpaceMonitor.java:114)
          	at hudson.remoting.UserRequest.perform(UserRequest.java:153)
          	at hudson.remoting.UserRequest.perform(UserRequest.java:50)
          	at hudson.remoting.Request$2.run(Request.java:332)
          	at hudson.remoting.InterceptingExecutorService$1.call(InterceptingExecutorService.java:68)
          	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
          	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
          	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
          	at java.lang.Thread.run(Thread.java:745)
          
          	Number of locked synchronizers = 1
          	- java.util.concurrent.ThreadPoolExecutor$Worker@5edc0c11
          


          Greg Smith added a comment -

          Adding another full stack trace of all slaves and threads.

          If you look at Windows slave Windows10Slave-2_bb5c6f6e-b2c7-47bf-b390-be39091dcb21:

          You can see the JNA deadlock situation again.


          Greg Smith added a comment -

          If it were possible to disable the SwapSpaceMonitor – perhaps that might be a way to remove this deadlock. But there seems to be no way to do that.

          Even when the swap space monitoring of slaves is disabled, the UserRequest still runs, as it seems to be updating the values, even if that slave condition is set to ignore.


          Greg Smith added a comment -

          Linked JENKINS-38834 – I followed the advice there, to back level to LTS 2.7.4, and we have not seen a deadlock situation yet. (fingers crossed)

          So this problem was introduced sometime between 2.7 and 2.19, and at least through our testing persists through 2.26 (last release before the move to remoting 3.0)

          At some point, I want to validate if this problem occurs in >=2.27, where the remoting 3.0 was introduced. But as it occurs in the 2.19.2 LTS (latest available LTS as of this comment) I still believe this is critical.


          retronym added a comment - - edited

          As another datapoint, we (the Scala team) are experiencing the same deadlock on Windows in our Jenkins instance.

          Version details:

          https://gist.github.com/retronym/ad96dc3595d51f1f1d210f8e4eadcbdf

          Our ticket about the issue:

          https://github.com/scala/scala-jenkins-infra/issues/203

          Thread dump:

          https://gist.github.com/retronym/a206e211da392a3e55c604c26543a80b

          We have yet to try the workaround of using -Dhudson.remoting.RemoteClassLoader.force=com.sun.jna.Native, as I wasn't sure where to configure this. gregcovertsmith, you mentioned in JENKINS-19445 that you couldn't try this because of your use of virtual slaves. We are using the "launch slave agents on unix machines via SSH" option. Does anyone know if this can pick up JVM options from one of the configured environment variables?

          Having an option to disable use of JNA from either the cygpath plugin or the SwapSpaceMonitor would work around the issue we're seeing. Doing a controlled initialization of JNA at Jenkins startup would be ideal, however.
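          A rough sketch of what "controlled initialization" could look like, under stated assumptions: the helper below (EagerClassInit and initializeInOrder are hypothetical names, not Jenkins or JNA APIs) initializes a fixed list of classes one after another on a single thread, before any worker threads exist, so no two threads can race to initialize them. The JNA class names are the ones that appear in the stack traces in this issue; where such a call would belong in the slave startup sequence, and whether it fully avoids the deadlock, is not established here.

```java
public class EagerClassInit {
    /**
     * Initializes each named class, in order, on the calling thread.
     * Classes that cannot be found or linked are reported and skipped.
     * Returns the number of classes that were initialized successfully.
     */
    public static int initializeInOrder(ClassLoader loader, String... classNames) {
        int initialized = 0;
        for (String name : classNames) {
            try {
                // The 'true' argument asks the JVM to run static initializers now.
                Class.forName(name, true, loader);
                initialized++;
            } catch (ClassNotFoundException | LinkageError e) {
                System.err.println("Could not preload " + name + ": " + e);
            }
        }
        return initialized;
    }

    public static void main(String[] args) {
        ClassLoader loader = EagerClassInit.class.getClassLoader();
        // These classes are only found if JNA is on the classpath.
        int n = initializeInOrder(loader,
                "com.sun.jna.Native", "com.sun.jna.Pointer", "com.sun.jna.Structure");
        System.out.println("Initialized " + n + " class(es) before starting worker threads");
    }
}
```

          The point of the design is ordering rather than speed: because only one thread touches the classes, each static initializer runs to completion before the next class is requested.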


          retronym added a comment -

          I see the configuration option now in the "Configure Slave" UI. It was hidden by default behind the "advanced" button.


          Greg Smith added a comment -

          I am going to try the new LTS version of Jenkins this weekend (2.32.1).

          @retronym, have you had any success with later LTS versions, or did the workaround work for you?

          Unfortunately, it is correct that I cannot use that workaround – we use the vcenter plugin to autogenerate slaves, and there is no place for advanced configuration options in the slave configurations it generates.

          I would like to be able to get off of the very old 2.7.4 LTS version, but the 2.19.X series was a no-go for us.


          Greg Smith added a comment -

          I upgraded to 2.32.1 – and the problem still exists in that version.

          I had a build lockup within the first few hours.

          But I also upgraded to the latest vcenter plugin, and the latest version has the option to specify Advanced options, meaning I could actually apply the "-Dhudson.remoting.RemoteClassLoader.force=com.sun.jna.Native" workaround to my dynamically created slaves.

          I will report back if there is a lock up again with this system property applied.


          Greg Smith added a comment -

          I can confirm – if the above mentioned property is added to the advanced settings -> java options of the slave, then the lock up problem does not occur.

          When we first upgraded to 2.32.1, we got lockups within a single day. We applied that setting to the slaves being started by the vcenter plugin, and we have now been running for 3 days straight with no lockups.


          Jesse Glick added a comment -

          Offhand I would say that FilePath.deleteRecursive needs to be updated to use java.nio.file.Files deletion which is likely to be more portable and reliable.

          I have a dim memory of some kind of static initialization deadlock in JNA, but I do not recall what the resolution was—whether Jenkins updated to a fixed version, or attempted to work around it somehow, etc.


          pjdarton added a comment - - edited

          We're also seeing deadlocks: lots of threads, all with stack traces whose deepest frames are:

              at java.lang.Object.wait(Native Method)
              -  waiting on java.lang.J9VMInternals$ClassInitializationLock@7cd6c16
              at java.lang.Object.wait(Object.java:167)
              at java.lang.J9VMInternals.initialize(J9VMInternals.java:274)
              -  locked java.lang.J9VMInternals$ClassInitializationLock@7cd6c16
              at hudson.util.jna.Kernel32Utils.getWin32FileAttributes(Kernel32Utils.java:77)
              at hudson.util.jna.Kernel32Utils.isJunctionOrSymlink(Kernel32Utils.java:98)
              at hudson.Util.isSymlink(Util.java:507)

          i.e. They're all calling hudson.util.jna.Kernel32Utils.getWin32FileAttributes and this is deadlocking.

          As for updating code to use java.nio.file.Files I'm not convinced that this will affect the issue.  The problem is that, on Windows, the code is required to detect if a "directory" is either a real directory, a symbolic link to a directory, or a windows "Junction Point" (which is functionally identical to a symbolic link, but is not considered to be a symbolic link by java.nio's isSymbolicLink method).

          i.e. No matter how we do this, it'll require a jna call out to Kernel32.DLL's GetFileAttributes function, so we need that to work and not to deadlock.

           

          Also, I'd be quite surprised if this deadlock issue was unique to just the GetFileAttributes function - my guess is that it'll affect all Kernel32 calls, but it's just that file deletion hammers it the most and is, therefore, where most of the problems are seen.

           

          FYI a Windows "Junction Point" is not uncommon - they're more common than symbolic links are on Windows.  It's difficult to create a symbolic link on Windows (creating symbolic links is a privileged operation.  Whilst one can downgrade it to user-level, Windows ignores this for any user that is permitted to run things "as administrator", which is most people.  i.e. in effect, admins have fewer rights than non-admins - it's crazy).  However, it's trivial to create a "Junction Point" - any user can do that - this is not a privileged operation.

          TL;DR: people who need a symbolic link to a directory on Windows usually use a Junction Point instead of a symbolic link.


          pjdarton added a comment - - edited

          My previous comment was incorrect - they weren't all calling isSymlink (if they were all doing the same thing, there wouldn't have been any deadlock).

          I've been doing some digging and I've concluded that, while it is the same bug as JENKINS-16070, the underlying cause is actually a bug in the JNA library that Jenkins uses.  See https://github.com/java-native-access/jna/issues/652

          1. Jenkins class hudson.util.jna.Kernel32Utils depends on Jenkins class hudson.util.jna.Kernel32 depends on com.sun.jna.Native which depends on com.sun.jna.Pointer (which depends on com.sun.jna.Native)
          2. Jenkins class hudson.node_monitors.SwapSpaceMonitor depends on Jenkins class org.jvnet.hudson.Windows depends on com.sun.jna.Structure which depends on com.sun.jna.Pointer (which depends on com.sun.jna.Native)

          I believe that the mistake in the JNA code is that com.sun.jna.Native depends on com.sun.jna.Pointer which depends on com.sun.jna.Native, i.e. a circular dependency.
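To illustrate why a circular dependency between static initialisers is dangerous, here is a minimal sketch. `FakeNative` and `FakePointer` are hypothetical stand-ins for the two JNA classes (they are not the real JNA code), and the latch is only there to force the unlucky timing on every run rather than once or twice a day.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

public class ClassInitDeadlockDemo {
    // Ensures both initialisers have started before either touches the other class.
    private static final CountDownLatch BOTH_STARTED = new CountDownLatch(2);

    static void rendezvous() {
        BOTH_STARTED.countDown();
        try {
            BOTH_STARTED.await(2, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    // Hypothetical stand-in for com.sun.jna.Native.
    static class FakeNative {
        static final int POINTER_SIZE;
        static {
            rendezvous();
            POINTER_SIZE = Math.max(8, FakePointer.SIZE); // needs FakePointer initialised
        }
    }

    // Hypothetical stand-in for com.sun.jna.Pointer.
    static class FakePointer {
        static final int SIZE;
        static {
            rendezvous();
            SIZE = FakeNative.POINTER_SIZE; // needs FakeNative initialised
        }
    }

    /** Returns true if both threads are still stuck in class initialisation after the timeout. */
    static boolean deadlocks() throws InterruptedException {
        Thread a = new Thread(() -> System.out.println(FakeNative.POINTER_SIZE), "init-native");
        Thread b = new Thread(() -> System.out.println(FakePointer.SIZE), "init-pointer");
        a.setDaemon(true); // daemon threads let the JVM exit despite the hang
        b.setDaemon(true);
        a.start();
        b.start();
        a.join(1500);
        b.join(1500);
        return a.isAlive() && b.isAlive();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("deadlocked: " + deadlocks());
    }
}
```

Each thread holds the initialisation lock of the class it started on and waits for the other class, which is the same shape as the hang described in this comment. Neither thread ever gets to print its value.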

          What I'm seeing is two separate threads triggering class initialisation of these two classes independently (see stacktraces below). The first one ("pool-1-thread-3", where a build is trying to "deleteRecursive" an old workspace folder on the slave) has started initialising Native and not got as far as Pointer. The second one ("pool-1-thread-9", where the slave monitor subsystem is trying to query the swapspace available) has started initialising Pointer and not got as far as Native. Each then waits for the other thread to finish classloading, so they deadlock.

          "pool-1-thread-3 for Channel to jenkins.mydomain.com/1.2.3.4 id=3616289" Id=17 Group=main WAITING on java.lang.J9VMInternals$ClassInitializationLock@1ae67f08 (in native)
              at java.lang.Object.wait(Native Method)
              -  waiting on java.lang.J9VMInternals$ClassInitializationLock@1ae67f08
              at java.lang.Object.wait(Object.java:167)
              at java.lang.J9VMInternals.initialize(J9VMInternals.java:274)
              -  locked java.lang.J9VMInternals$ClassInitializationLock@1ae67f08
              at com.sun.jna.Native.initIDs(Native Method)
              at com.sun.jna.Native.<clinit>(Native.java:148)
              at java.lang.J9VMInternals.initializeImpl(Native Method)
              at java.lang.J9VMInternals.initialize(J9VMInternals.java:237)
              at hudson.util.jna.Kernel32Utils.load(Kernel32Utils.java:112)
              at hudson.util.jna.Kernel32.<clinit>(Kernel32.java:37)
              at java.lang.J9VMInternals.initializeImpl(Native Method)
              at java.lang.J9VMInternals.initialize(J9VMInternals.java:237)
              at hudson.util.jna.Kernel32Utils.getWin32FileAttributes(Kernel32Utils.java:77)
              at hudson.util.jna.Kernel32Utils.isJunctionOrSymlink(Kernel32Utils.java:98)
              at hudson.Util.isSymlink(Util.java:507)
              at hudson.FilePath.deleteRecursive(FilePath.java:1199)
              at hudson.FilePath.access$1000(FilePath.java:195)
              at hudson.FilePath$14.invoke(FilePath.java:1179)
              at hudson.FilePath$14.invoke(FilePath.java:1176)
              at hudson.FilePath$FileCallableWrapper.call(FilePath.java:2731)
              at hudson.remoting.UserRequest.perform(UserRequest.java:153)
              at hudson.remoting.UserRequest.perform(UserRequest.java:50)
              at hudson.remoting.Request$2.run(Request.java:336)
              at hudson.remoting.InterceptingExecutorService$1.call(InterceptingExecutorService.java:68)
              at java.util.concurrent.FutureTask.run(FutureTask.java:273)
              at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1156)
              at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:626)
              at hudson.remoting.Engine$1$1.run(Engine.java:94)
              at java.lang.Thread.run(Thread.java:804)
          
              Number of locked synchronizers = 1
              - java.util.concurrent.ThreadPoolExecutor$Worker@819d87b4
          "pool-1-thread-9 for Channel to jenkins.mydomain.com/1.2.3.4 id=3616789" Id=24 Group=main WAITING on java.lang.J9VMInternals$ClassInitializationLock@fe8f4030 (in native)
              at java.lang.Object.wait(Native Method)
              -  waiting on java.lang.J9VMInternals$ClassInitializationLock@fe8f4030
              at java.lang.Object.wait(Object.java:167)
              at java.lang.J9VMInternals.initialize(J9VMInternals.java:274)
              -  locked java.lang.J9VMInternals$ClassInitializationLock@fe8f4030
              at com.sun.jna.Pointer.<clinit>(Pointer.java:41)
              at java.lang.J9VMInternals.initializeImpl(Native Method)
              at java.lang.J9VMInternals.initialize(J9VMInternals.java:237)
              at java.lang.J9VMInternals.initialize(J9VMInternals.java:204)
              at com.sun.jna.Structure.<clinit>(Structure.java:2078)
              at java.lang.J9VMInternals.initializeImpl(Native Method)
              at java.lang.J9VMInternals.initialize(J9VMInternals.java:237)
              at java.lang.J9VMInternals.initialize(J9VMInternals.java:204)
              at org.jvnet.hudson.Windows.monitor(Windows.java:42)
              at hudson.node_monitors.SwapSpaceMonitor$MonitorTask.call(SwapSpaceMonitor.java:124)
              at hudson.node_monitors.SwapSpaceMonitor$MonitorTask.call(SwapSpaceMonitor.java:114)
              at hudson.remoting.UserRequest.perform(UserRequest.java:153)
              at hudson.remoting.UserRequest.perform(UserRequest.java:50)
              at hudson.remoting.Request$2.run(Request.java:336)
              at hudson.remoting.InterceptingExecutorService$1.call(InterceptingExecutorService.java:68)
              at java.util.concurrent.FutureTask.run(FutureTask.java:273)
              at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1156)
              at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:626)
              at hudson.remoting.Engine$1$1.run(Engine.java:94)
              at java.lang.Thread.run(Thread.java:804)
          
              Number of locked synchronizers = 1
              - java.util.concurrent.ThreadPoolExecutor$Worker@bf193c54
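Both dumps above share one tell-tale: a `<clinit>` frame on the stack of a waiting thread. As a rough diagnostic sketch, the same check can be done programmatically from inside the affected JVM. `StaticInitScan` and `threadsInStaticInit` are hypothetical names, not part of Jenkins or Remoting.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class StaticInitScan {
    /** Lists threads that have a static initialiser (<clinit>) frame on their stack. */
    static List<String> threadsInStaticInit(Map<Thread, StackTraceElement[]> dump) {
        List<String> result = new ArrayList<>();
        for (Map.Entry<Thread, StackTraceElement[]> entry : dump.entrySet()) {
            // Frames are ordered from the top of the stack down,
            // so the first match is the innermost initialiser.
            for (StackTraceElement frame : entry.getValue()) {
                if ("<clinit>".equals(frame.getMethodName())) {
                    result.add(entry.getKey().getName() + " -> " + frame.getClassName());
                    break;
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Scan the live threads of the current JVM.
        threadsInStaticInit(Thread.getAllStackTraces()).forEach(System.out::println);
    }
}
```

A thread that shows up here on every scan, always against the same class, is a likely participant in an initialisation deadlock. A single hit on its own proves nothing, since classes are initialised all the time.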

          As Jesse said in https://issues.jenkins-ci.org/browse/JENKINS-16070?focusedCommentId=170842&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-170842 the proper solution would be to fix JNA.  Doing a workaround in Jenkins is, at best, just going to be papering over the cracks.

          I would suggest that all further efforts be directed at https://github.com/java-native-access/jna/issues/652 and, once that's fixed, the fix to be back-ported to Jenkins' JNA (or fixed in Jenkins and then pushed to the public JNA - either works).


          pjdarton added a comment -

          It looks like the underlying deadlock-prone classloading (Native depended on Pointer which depended on Native) within the JNA library has been fixed in the main stream.  See jna issue 652 for details.

          TL;DR: Pointer no longer depends on Native at classloading time. Static field Pointer.SIZE has been removed. Code should use Native.POINTER_SIZE instead.

          As jglick said in JENKINS-16070, this is the "proper" fix to this issue, so what we need now is for Jenkins to use this new version (or to merge these changes into the version that Jenkins uses) and ensure it uses Native.POINTER_SIZE in any place it previously used Pointer.SIZE.

           

          Note: As gregcovertsmith noted above, adding -Dhudson.remoting.RemoteClassLoader.force=com.sun.jna.Native (to the command-line used to launch Jenkins slaves) is an effective workaround - I put this on all my slaves (both static and dynamic) and I've not encountered this issue since.
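The general idea behind this kind of workaround can be sketched as follows, on the assumption that its effect is to get one of the mutually dependent classes fully initialised on a single thread before anything else can race with it. `FakeNative` and `FakePointer` are hypothetical stand-ins for the JNA classes, and the real Remoting mechanism behind the system property is not shown here.

```java
public class PreloadDemo {
    // Hypothetical stand-in for com.sun.jna.Native.
    static class FakeNative {
        static final int POINTER_SIZE;
        static {
            POINTER_SIZE = Math.max(8, FakePointer.SIZE);
        }
    }

    // Hypothetical stand-in for com.sun.jna.Pointer.
    static class FakePointer {
        static final int SIZE;
        static {
            SIZE = FakeNative.POINTER_SIZE;
        }
    }

    /** Returns true if both worker threads finish once the class has been preloaded. */
    static boolean runWithPreload() throws Exception {
        // Initialise FakeNative (and, through it, FakePointer) on one thread first.
        // A circular request from the same thread returns immediately instead of blocking.
        Class.forName(FakeNative.class.getName(), true, PreloadDemo.class.getClassLoader());

        Thread a = new Thread(() -> System.out.println(FakeNative.POINTER_SIZE), "use-native");
        Thread b = new Thread(() -> System.out.println(FakePointer.SIZE), "use-pointer");
        a.setDaemon(true);
        b.setDaemon(true);
        a.start();
        b.start();
        a.join(1500);
        b.join(1500);
        return !a.isAlive() && !b.isAlive();
    }

    public static void main(String[] args) throws Exception {
        System.out.println("completed: " + runWithPreload());
    }
}
```

Once initialisation has completed on one thread, later users of either class never take the initialisation lock path that deadlocks. It hides the circular dependency rather than removing it, which is why the JNA fix is still the proper one.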


          Oleg Nenashev added a comment -

          As jglick mentioned elsewhere, JENKINS-36088 is probably a solution for that


          Devin Nusbaum added a comment - - edited

          I've submitted a PR to address the symlink handling here. It doesn't fix the root cause addressed in JNA upstream, but I suspect that `isSymlink` is one of the main callers of native code on Windows so hopefully the issue will be less common.


          pjdarton added a comment -

          I agree.
          In my experience, "isSymlink" is called a lot on Windows, especially when deleting things from disk.
          I'd also guess that "isSymlink" usage drowns-out all other JNA usage.


          SCM/JIRA link daemon added a comment -

          Code changed in jenkins
          User: Devin Nusbaum
          Path:
          core/src/main/java/hudson/Util.java
          core/src/main/java/hudson/util/jna/Kernel32Utils.java
          core/src/test/java/hudson/FilePathTest.java
          core/src/test/java/hudson/UtilTest.java
          http://jenkins-ci.org/commit/jenkins/52fa4d90b938243ccc273955caa7262154b9f688
          Log:
          JENKINS-39179 JENKINS-36088 Always use NIO to create and detect symbolic links and Windows junctions (#3133)

          • Always use NIO to detect symlinks
          • Make assertion failure message consistent
          • Catch NoSuchFileException to keep tests passing
          • Make method name more specific and simlify assumption
          • Remove obsolete comment and reword the main comment in isSymlink
          • Deprecate Kernel32Util#isJunctionOrSymlink
          • Use assumptions for junction creation and add messages to assumptions
          • Replace deprecated code with recommended alternative
          • Add comment explaining call to DosFileAttributes#isOther
          • Do not fall back to native code when creating symlinks
          • Log FileSystemExceptions when creating symbolic links
          • Catch InvalidPathException and rethrow as IOException
          • Deprecate Kernel32Utils#createSymbolicLink and #getWin32FileAttributes
          • Preserve original logging behavior on Windows and remove useless call to Util#displayIOException
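The commit log above mentions detecting symlinks and junctions through NIO and `DosFileAttributes#isOther`. A minimal sketch of that kind of check, with no JNA involved, might look like the following. `LinkCheck` and `isSymlinkOrJunction` are hypothetical names, and the sketch assumes that on Windows the default file system hands back `DosFileAttributes` and reports a junction as "other". See the linked commit for what Jenkins actually does.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.LinkOption;
import java.nio.file.NoSuchFileException;
import java.nio.file.Path;
import java.nio.file.attribute.BasicFileAttributes;
import java.nio.file.attribute.DosFileAttributes;

public class LinkCheck {
    /** Returns true for a symbolic link or, on Windows (assumption), a junction. */
    static boolean isSymlinkOrJunction(Path path) throws IOException {
        try {
            // NOFOLLOW_LINKS: inspect the link itself rather than its target.
            BasicFileAttributes attrs =
                    Files.readAttributes(path, BasicFileAttributes.class, LinkOption.NOFOLLOW_LINKS);
            // Assumption: a junction is not a symbolic link to NIO,
            // but shows up as "other" in the DOS attributes.
            return attrs.isSymbolicLink()
                    || (attrs instanceof DosFileAttributes && attrs.isOther());
        } catch (NoSuchFileException e) {
            // A path that does not exist is not a link.
            return false;
        }
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("linkcheck", ".txt");
        System.out.println(tmp + " is a link: " + isSymlinkOrJunction(tmp));
        Files.delete(tmp);
    }
}
```

Because this path never loads `com.sun.jna.Native`, a recursive delete built on it cannot take part in the class-initialisation deadlock, although other JNA callers still can.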

          Jesse Glick added a comment -

          I attached a build of an experimental plugin to this page; sources on GitHub: avoid-agent-jna-deadlock-plugin. It may work around the problem, and more easily than the previous workaround of configuring -Dhudson.remoting.RemoteClassLoader.force=com.sun.jna.Native on every agent (since you need merely install the plugin for the workaround to take effect). Without knowing how to reproduce the problem from scratch, I cannot confirm that it helps.

          The JNA fix is as yet unreleased—scheduled for JNA 5.0.0 (due to its introducing an incompatible API change). Jenkins still uses 4.2.1. Updating to the current release 4.5.0 would not help in this regard, and I am loath to begin using an unreleased custom build or fork.

          The direction we would like to take is to simply avoid using JNA at all from core, unless there is no plausible alternative. That has already been done in the case mentioned here, that of FilePath.deleteRecursive. See also workflow-support PR 48 which may help.


          Helder Magalhães added a comment -

          @jglick: I've verified that the plug-in works properly for Windows slaves. Unfortunately we have a mixed installation base that includes Linux slaves as well, and these break when the "Launch slave agents via SSH" option is used:

          <===[JENKINS REMOTING CAPACITY]===>channel started
          Slave.jar version: 2.53.2
          This is a Unix slave
          Preloading JNA to avoid JENKINS-39179
          Slave JVM has not reported exit code. Is it still running?
          [04/23/18 08:29:08] Launch failed - cleaning up connection
          [04/23/18 08:29:08] [SSH] Connection closed.
          ERROR: Connection terminated

          I'm attaching MyLinuxSlave-SystemInformation.txt. Could the problem be related to using a somewhat old (1.7) Java version?

          Although it doesn't work (yet), thanks for the effort! I would really prefer this approach (instead of changing the configuration on all Windows nodes) until an official fix is provided.

          pjdarton added a comment -

          heldermagalhaes You should be using Java 8 (aka 1.8) on both the master and slaves.  Support for 1.7 ceased last year.  See https://jenkins.io/blog/2017/04/10/jenkins-has-upgraded-to-java-8/

          If you're using (very) different Javas on the masters and slaves then you can get weird errors.


            Unassigned Unassigned
            gregcovertsmith Greg Smith
            Votes: 4
            Watchers: 18