All builds hang, JNA load deadlock on Windows slave

      I hate to file a general "core" bug; I would rather direct this to the correct component. Unfortunately, I cannot identify which component is hanging or why, so I do not know where to direct it.

      This problem started about 2 weeks ago, as we have been adding new Pipeline builds to our build server. So it could be related to one of the pipeline plugins.

      The behavior is the following:

      • 1 to 2 times a day, all builds on all build slaves will hang. The console log of the build just stops moving forward, and stays stuck at the last line executed / last line returned.
      • Once this occurs, attempting to stop a build fails. Clicking stop changes neither the build status nor the console log output.
      • New builds will not start. They sit in the queue, but the slaves will not be started.
      • The UI continues to function, so it is possible to view config, get threaddumps, etc.

      The only resolution is to restart the Jenkins server.

      We are using the vCenter plugin to dynamically start all build slaves. However, we have been using this configuration for months, and the problem only started recently.

      We have reproduced this on both the latest Jenkins release (2.26) and Jenkins LTS 2.19.1.

      I am attaching a threaddump of the server at the time of one of these hangs.

      I can provide any other information that might help in diagnosing this problem.

          [JENKINS-39179] All builds hang, JNA load deadlock on Windows slave

          pjdarton added a comment - - edited

          We're also seeing deadlocks:  Lots of threads all with stacktraces whose deepest point is:

              at java.lang.Object.wait(Native Method)
              -  waiting on java.lang.J9VMInternals$ClassInitializationLock@7cd6c16
              at java.lang.Object.wait(Object.java:167)
              at java.lang.J9VMInternals.initialize(J9VMInternals.java:274)
              -  locked java.lang.J9VMInternals$ClassInitializationLock@7cd6c16
              at hudson.util.jna.Kernel32Utils.getWin32FileAttributes(Kernel32Utils.java:77)
              at hudson.util.jna.Kernel32Utils.isJunctionOrSymlink(Kernel32Utils.java:98)
              at hudson.Util.isSymlink(Util.java:507)

          i.e. They're all calling hudson.util.jna.Kernel32Utils.getWin32FileAttributes and this is deadlocking.

          As for updating the code to use java.nio.file.Files, I'm not convinced that this will affect the issue.  The problem is that, on Windows, the code is required to detect whether a "directory" is a real directory, a symbolic link to a directory, or a Windows "Junction Point" (which is functionally identical to a symbolic link, but is not considered to be a symbolic link by java.nio's isSymbolicLink method).

          i.e. No matter how we do this, it'll require a JNA call out to Kernel32.DLL's GetFileAttributes function, so we need that to work and not to deadlock.

           

          Also, I'd be quite surprised if this deadlock issue was unique to just the GetFileAttributes function - my guess is that it'll affect all Kernel32 calls, but it's just that file deletion hammers it the most and is, therefore, where most of the problems are seen.

           

          FYI a Windows "Junction Point" is not uncommon - they're more common than symbolic links are on Windows.  It's difficult to create a symbolic link on Windows because, crazy as it sounds, creating symbolic links is a privileged operation there.  Whilst one can downgrade it to user-level, Windows ignores this for any user that is permitted to run things "as administrator", which is most people, i.e. in effect, admins have fewer rights than non-admins.  However, it's trivial to create a "Junction Point" - any user can do that, as it is not a privileged operation.

          TL;DR: people who need a symbolic link to a directory on Windows usually use a Junction Point instead of a symbolic link.

          pjdarton added a comment - - edited

          My previous comment was incorrect - they weren't all calling isSymlink (if they were all doing the same thing, there wouldn't have been any deadlock).

          I've been doing some digging and I've concluded that while it is the same bug as JENKINS-16070, the underlying cause is actually a bug in the JNA library that Jenkins uses.  See https://github.com/java-native-access/jna/issues/652

          1. Jenkins class hudson.util.jna.Kernel32Utils depends on Jenkins class hudson.util.jna.Kernel32, which depends on com.sun.jna.Native, which depends on com.sun.jna.Pointer (which in turn depends on com.sun.jna.Native).
          2. Jenkins class hudson.node_monitors.SwapSpaceMonitor depends on Jenkins class org.jvnet.hudson.Windows, which depends on com.sun.jna.Structure, which depends on com.sun.jna.Pointer (which in turn depends on com.sun.jna.Native).

          I believe that the mistake in the JNA code is that com.sun.jna.Native depends on com.sun.jna.Pointer which depends on com.sun.jna.Native, i.e. a circular dependency.
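The mechanism can be reproduced without JNA. The following is a minimal sketch, not JNA code: NativeLike and PointerLike are hypothetical stand-ins whose static initializers each read a field of the other, and a latch forces the unlucky interleaving in which two threads each begin initializing one class before touching the other.

```java
import java.util.concurrent.CountDownLatch;

final class ClassInitDeadlockDemo {
    // Each static initializer waits here until the other one has also started,
    // so the problematic interleaving happens on every run.
    private static final CountDownLatch BOTH_STARTED = new CountDownLatch(2);

    /** Hypothetical stand-in for a class whose initializer needs PointerLike. */
    static final class NativeLike {
        static final int VALUE;
        static {
            awaitOtherInitializer();
            // Blocks: PointerLike is being initialized by the other thread.
            VALUE = PointerLike.VALUE + 1;
        }
    }

    /** Hypothetical stand-in for a class whose initializer needs NativeLike. */
    static final class PointerLike {
        static final int VALUE;
        static {
            awaitOtherInitializer();
            // Blocks: NativeLike is being initialized by the other thread.
            VALUE = NativeLike.VALUE + 1;
        }
    }

    private static void awaitOtherInitializer() {
        BOTH_STARTED.countDown();
        try {
            BOTH_STARTED.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    /** Returns true if both threads are still stuck after a short wait. */
    static boolean deadlocks() {
        Thread first = new Thread(() -> { int unused = NativeLike.VALUE; }, "init-native-like");
        Thread second = new Thread(() -> { int unused = PointerLike.VALUE; }, "init-pointer-like");
        // Daemon threads, so the stuck threads do not keep the JVM alive.
        first.setDaemon(true);
        second.setDaemon(true);
        first.start();
        second.start();
        try {
            first.join(1000);
            second.join(1000);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return first.isAlive() && second.isAlive();
    }

    public static void main(String[] args) {
        System.out.println("deadlocked: " + deadlocks());
    }
}
```

Neither thread holds an ordinary monitor that a deadlock detector would report; each is simply waiting for the other thread's class initialization to complete, which is why such hangs are easy to miss.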

          What I'm seeing is two separate threads independently triggering classloading of these two classes (see stacktraces below).  The first thread ("pool-1-thread-3", where a build is trying to "deleteRecursive" an old workspace folder on the slave) has started initialising Native but has not got as far as Pointer.  The second thread ("pool-1-thread-9", where the slave monitor subsystem is trying to query the available swap space) has started initialising Pointer but has not got as far as Native.  Each then waits for the other to finish classloading, so they deadlock.

          "pool-1-thread-3 for Channel to jenkins.mydomain.com/1.2.3.4 id=3616289" Id=17 Group=main WAITING on java.lang.J9VMInternals$ClassInitializationLock@1ae67f08 (in native)
              at java.lang.Object.wait(Native Method)
              -  waiting on java.lang.J9VMInternals$ClassInitializationLock@1ae67f08
              at java.lang.Object.wait(Object.java:167)
              at java.lang.J9VMInternals.initialize(J9VMInternals.java:274)
              -  locked java.lang.J9VMInternals$ClassInitializationLock@1ae67f08
              at com.sun.jna.Native.initIDs(Native Method)
              at com.sun.jna.Native.<clinit>(Native.java:148)
              at java.lang.J9VMInternals.initializeImpl(Native Method)
              at java.lang.J9VMInternals.initialize(J9VMInternals.java:237)
              at hudson.util.jna.Kernel32Utils.load(Kernel32Utils.java:112)
              at hudson.util.jna.Kernel32.<clinit>(Kernel32.java:37)
              at java.lang.J9VMInternals.initializeImpl(Native Method)
              at java.lang.J9VMInternals.initialize(J9VMInternals.java:237)
              at hudson.util.jna.Kernel32Utils.getWin32FileAttributes(Kernel32Utils.java:77)
              at hudson.util.jna.Kernel32Utils.isJunctionOrSymlink(Kernel32Utils.java:98)
              at hudson.Util.isSymlink(Util.java:507)
              at hudson.FilePath.deleteRecursive(FilePath.java:1199)
              at hudson.FilePath.access$1000(FilePath.java:195)
              at hudson.FilePath$14.invoke(FilePath.java:1179)
              at hudson.FilePath$14.invoke(FilePath.java:1176)
              at hudson.FilePath$FileCallableWrapper.call(FilePath.java:2731)
              at hudson.remoting.UserRequest.perform(UserRequest.java:153)
              at hudson.remoting.UserRequest.perform(UserRequest.java:50)
              at hudson.remoting.Request$2.run(Request.java:336)
              at hudson.remoting.InterceptingExecutorService$1.call(InterceptingExecutorService.java:68)
              at java.util.concurrent.FutureTask.run(FutureTask.java:273)
              at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1156)
              at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:626)
              at hudson.remoting.Engine$1$1.run(Engine.java:94)
              at java.lang.Thread.run(Thread.java:804)
          
              Number of locked synchronizers = 1
              - java.util.concurrent.ThreadPoolExecutor$Worker@819d87b4
          "pool-1-thread-9 for Channel to jenkins.mydomain.com/1.2.3.4 id=3616789" Id=24 Group=main WAITING on java.lang.J9VMInternals$ClassInitializationLock@fe8f4030 (in native)
              at java.lang.Object.wait(Native Method)
              -  waiting on java.lang.J9VMInternals$ClassInitializationLock@fe8f4030
              at java.lang.Object.wait(Object.java:167)
              at java.lang.J9VMInternals.initialize(J9VMInternals.java:274)
              -  locked java.lang.J9VMInternals$ClassInitializationLock@fe8f4030
              at com.sun.jna.Pointer.<clinit>(Pointer.java:41)
              at java.lang.J9VMInternals.initializeImpl(Native Method)
              at java.lang.J9VMInternals.initialize(J9VMInternals.java:237)
              at java.lang.J9VMInternals.initialize(J9VMInternals.java:204)
              at com.sun.jna.Structure.<clinit>(Structure.java:2078)
              at java.lang.J9VMInternals.initializeImpl(Native Method)
              at java.lang.J9VMInternals.initialize(J9VMInternals.java:237)
              at java.lang.J9VMInternals.initialize(J9VMInternals.java:204)
              at org.jvnet.hudson.Windows.monitor(Windows.java:42)
              at hudson.node_monitors.SwapSpaceMonitor$MonitorTask.call(SwapSpaceMonitor.java:124)
              at hudson.node_monitors.SwapSpaceMonitor$MonitorTask.call(SwapSpaceMonitor.java:114)
              at hudson.remoting.UserRequest.perform(UserRequest.java:153)
              at hudson.remoting.UserRequest.perform(UserRequest.java:50)
              at hudson.remoting.Request$2.run(Request.java:336)
              at hudson.remoting.InterceptingExecutorService$1.call(InterceptingExecutorService.java:68)
              at java.util.concurrent.FutureTask.run(FutureTask.java:273)
              at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1156)
              at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:626)
              at hudson.remoting.Engine$1$1.run(Engine.java:94)
              at java.lang.Thread.run(Thread.java:804)
          
              Number of locked synchronizers = 1
              - java.util.concurrent.ThreadPoolExecutor$Worker@bf193c54
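Threads like the two above can also be picked out programmatically. The sketch below is an illustration only (the class and method names are made up for this example): it uses the standard ThreadMXBean to list threads in the current JVM that have a static initializer (a `<clinit>` frame) somewhere on their stack, which is the signature both traces share.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.util.ArrayList;
import java.util.List;

final class ClinitScanner {
    /** Returns true if any frame of the given stack is a static initializer. */
    static boolean inStaticInitializer(StackTraceElement[] stack) {
        for (StackTraceElement frame : stack) {
            if ("<clinit>".equals(frame.getMethodName())) {
                return true;
            }
        }
        return false;
    }

    /** Names of live threads in this JVM that are inside a static initializer. */
    static List<String> threadsInStaticInitializers() {
        List<String> names = new ArrayList<>();
        // false, false: monitor and synchronizer details are not needed here.
        for (ThreadInfo info : ManagementFactory.getThreadMXBean().dumpAllThreads(false, false)) {
            if (inStaticInitializer(info.getStackTrace())) {
                names.add(info.getThreadName());
            }
        }
        return names;
    }

    public static void main(String[] args) {
        System.out.println(threadsInStaticInitializers());
    }
}
```

A thread that stays in this list across several samples is a candidate for a stuck class initialization. This only inspects the JVM it runs in, so for an agent it would have to run on the agent side.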

          As Jesse said in https://issues.jenkins-ci.org/browse/JENKINS-16070?focusedCommentId=170842&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-170842 the proper solution would be to fix JNA.  Doing a workaround in Jenkins is, at best, just going to be papering over the cracks.

          I would suggest that all further efforts be directed at https://github.com/java-native-access/jna/issues/652 and, once that's fixed, the fix to be back-ported to Jenkins' JNA (or fixed in Jenkins and then pushed to the public JNA - either works).

          pjdarton added a comment -

          It looks like the underlying deadlock-prone classloading (Native depended on Pointer which depended on Native) within the JNA library has been fixed upstream.  See JNA issue 652 for details.

          TL;DR: Pointer no longer depends on Native at classloading time. Static field Pointer.SIZE has been removed. Code should use Native.POINTER_SIZE instead.

          As jglick said in JENKINS-16070, this is the "proper" fix to this issue, so what we need now is for Jenkins to use this new version (or to merge these changes into the version that Jenkins uses) and ensure it uses Native.POINTER_SIZE in any place it previously used Pointer.SIZE.

           

          Note: As gregcovertsmith noted above, adding -Dhudson.remoting.RemoteClassLoader.force=com.sun.jna.Native (to the command-line used to launch Jenkins slaves) is an effective workaround - I put this on all my slaves (both static and dynamic) and I've not encountered this issue since.

          Oleg Nenashev added a comment -

          As jglick mentioned elsewhere, JENKINS-36088 is probably a solution for this.

          Devin Nusbaum added a comment - - edited

          I've submitted a PR to address the symlink handling here. It doesn't fix the root cause addressed in JNA upstream, but I suspect that `isSymlink` is one of the main callers of native code on Windows so hopefully the issue will be less common.
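For readers unfamiliar with the NIO route, here is a minimal sketch of detecting symbolic links and junctions without native calls. It is not the code from the PR; LinkProbe is a hypothetical name, and the sketch assumes that on Windows the JDK reports a junction as "other" (a reparse point that is not a symbolic link) through DosFileAttributes, which is worth verifying on the JDK in use.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.LinkOption;
import java.nio.file.Path;
import java.nio.file.attribute.BasicFileAttributes;
import java.nio.file.attribute.DosFileAttributes;

final class LinkProbe {
    /**
     * Returns true if the path itself is a symbolic link or, on Windows,
     * another kind of reparse point such as a junction.
     */
    static boolean isSymlinkOrJunction(Path path) {
        try {
            // NOFOLLOW_LINKS: inspect the link itself, not its target.
            BasicFileAttributes attrs =
                    Files.readAttributes(path, BasicFileAttributes.class, LinkOption.NOFOLLOW_LINKS);
            if (attrs.isSymbolicLink()) {
                return true;
            }
            // Assumption: only Windows returns DosFileAttributes here, and a
            // junction shows up as "other" rather than as a symbolic link.
            return attrs instanceof DosFileAttributes && attrs.isOther();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

On non-Windows platforms the DosFileAttributes branch is never taken, so the method reduces to a plain symbolic link check.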

          pjdarton added a comment -

          I agree.
          In my experience, "isSymlink" is called a lot on Windows, especially when deleting things from disk.
          I'd also guess that "isSymlink" usage drowns-out all other JNA usage.

          SCM/JIRA link daemon added a comment -

          Code changed in jenkins
          User: Devin Nusbaum
          Path:
          core/src/main/java/hudson/Util.java
          core/src/main/java/hudson/util/jna/Kernel32Utils.java
          core/src/test/java/hudson/FilePathTest.java
          core/src/test/java/hudson/UtilTest.java
          http://jenkins-ci.org/commit/jenkins/52fa4d90b938243ccc273955caa7262154b9f688
          Log:
          JENKINS-39179 JENKINS-36088 Always use NIO to create and detect symbolic links and Windows junctions (#3133)

          • Always use NIO to detect symlinks
          • Make assertion failure message consistent
          • Catch NoSuchFileException to keep tests passing
          • Make method name more specific and simplify assumption
          • Remove obsolete comment and reword the main comment in isSymlink
          • Deprecate Kernel32Utils#isJunctionOrSymlink
          • Use assumptions for junction creation and add messages to assumptions
          • Replace deprecated code with recommended alternative
          • Add comment explaining call to DosFileAttributes#isOther
          • Do not fall back to native code when creating symlinks
          • Log FileSystemExceptions when creating symbolic links
          • Catch InvalidPathException and rethrow as IOException
          • Deprecate Kernel32Utils#createSymbolicLink and #getWin32FileAttributes
          • Preserve original logging behavior on Windows and remove useless call to Util#displayIOException

          Jesse Glick added a comment -

          I attached a build of an experimental plugin to this page; sources on GitHub: avoid-agent-jna-deadlock-plugin. It may work around the problem, and more easily than the previous workaround of configuring -Dhudson.remoting.RemoteClassLoader.force=com.sun.jna.Native on every agent (since you need merely install the plugin for the workaround to take effect). Without knowing how to reproduce the problem from scratch, I cannot confirm that it helps.

          The JNA fix is as yet unreleased—scheduled for JNA 5.0.0 (due to its introducing an incompatible API change). Jenkins still uses 4.2.1. Updating to the current release 4.5.0 would not help in this regard, and I am loath to begin using an unreleased custom build or fork.

          The direction we would like to take is to simply avoid using JNA at all from core, unless there is no plausible alternative. That has already been done in the case mentioned here, that of FilePath.deleteRecursive. See also workflow-support PR 48 which may help.
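As a generic illustration of the preloading idea (this is not the plugin's source, and Preloader is a hypothetical name), a class can be forced through its static initialization from a single thread before any concurrent work starts, so that two threads never race to initialize interdependent classes:

```java
final class Preloader {
    /**
     * Loads and initializes the named class on the calling thread.
     * Returns false if the class is missing or fails to link or initialize.
     */
    static boolean preload(String className) {
        try {
            // "true" asks for static initialization, not just loading.
            Class.forName(className, true, Preloader.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // java.util.ArrayList is used only so the example runs without JNA
        // on the classpath; the class of interest here would be com.sun.jna.Native.
        System.out.println(preload("java.util.ArrayList"));
    }
}
```

The call only helps if it completes before other threads can touch the same classes, so it belongs as early as possible in startup.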

          Helder Magalhães added a comment -

          @jglick: I've verified that the plug-in works properly for Windows slaves. Unfortunately, our mixed installation base also includes Linux slaves, and these break when the "Launch slave agents via SSH" option is used:

          <===[JENKINS REMOTING CAPACITY]===>channel started
          Slave.jar version: 2.53.2
          This is a Unix slave
          Preloading JNA to avoid JENKINS-39179
          Slave JVM has not reported exit code. Is it still running?
          [04/23/18 08:29:08] Launch failed - cleaning up connection
          [04/23/18 08:29:08] [SSH] Connection closed.
          ERROR: Connection terminated

          I'm attaching MyLinuxSlave-SystemInformation.txt. Could the problem be related to using a somewhat old (1.7) Java version?

          Although it doesn't work (yet), thanks for the effort! I would really prefer this approach (instead of changing the configuration on all Windows nodes) until an official fix is provided.

          pjdarton added a comment -

          heldermagalhaes You should be using Java 8 (aka 1.8) on both the master and slaves.  Support for 1.7 ceased last year.  See https://jenkins.io/blog/2017/04/10/jenkins-has-upgraded-to-java-8/

          If you're using (very) different Javas on the masters and slaves then you can get weird errors.

            Assignee: Unassigned
            Reporter: Greg Smith (gregcovertsmith)
            Votes: 4
            Watchers: 18