Jenkins / JENKINS-68122

Agent connection broken (randomly) with error java.util.concurrent.TimeoutException (regression in 2.325)

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Component: core
    • Environment: Jenkins 2.332.1 on Ubuntu 18.04 with OpenJDK 11.0.14
      Amazon EC2 plugin 1.68
    • Released as: 2.343, 2.332.3

      After upgrading Jenkins from 2.319.2 to 2.332.1, we started experiencing broken EC2 agent connections with a ping thread timeout error:

      java.util.concurrent.TimeoutException: Ping started at 1648107727099 hasn't completed by 1648107967100
      	at hudson.remoting.PingThread.ping(PingThread.java:132)
      	at hudson.remoting.PingThread.run(PingThread.java:88)

      This happens randomly, and the build job hangs at the pipeline Git checkout stage. When the agent connection breaks, we can re-launch the agent and it reconnects, but the build job seems unable to access the agent anymore and just stalls until cancelled. While this happens, other EC2 agents keep running, and an OS-level ping from the master to the agent in question still gets a response. We tried disabling "Response Time" in preventive node monitoring (Manage Nodes and Clouds). This only delays the broken connection from 2 missed pings to 5 or 6, as the master continues to monitor disk space, swap, and so on. Killing the job and rebuilding succeeds most of the time (sometimes it gets stuck on the same broken connection).
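
      For context, the ping mechanism behind that error works roughly like the simplified sketch below. This is only an illustration of the idea, not the actual hudson.remoting.PingThread code; the interval and timeout values and the HypotheticalChannel type are assumptions. The controller periodically pings the agent over the channel, and if a ping does not complete within the timeout it gives up with a TimeoutException like the one above and the connection is torn down.

      import java.util.concurrent.TimeUnit;
      import java.util.concurrent.TimeoutException;

      // Simplified ping-loop sketch (not the real hudson.remoting.PingThread):
      // ping the remote side every interval, and if a ping has not completed
      // by start + timeout, report a TimeoutException and stop.
      class PingLoop extends Thread {
          interface HypotheticalChannel {
              boolean pingAndWait(long timeoutMs) throws InterruptedException;
          }

          private final HypotheticalChannel channel;
          private final long intervalMs = TimeUnit.MINUTES.toMillis(5); // assumed values
          private final long timeoutMs = TimeUnit.MINUTES.toMillis(4);

          PingLoop(HypotheticalChannel channel) { this.channel = channel; }

          @Override public void run() {
              try {
                  while (true) {
                      long start = System.currentTimeMillis();
                      if (!channel.pingAndWait(timeoutMs)) {
                          // Mirrors the message seen in the log above.
                          onDead(new TimeoutException(
                                  "Ping started at " + start + " hasn't completed by " + (start + timeoutMs)));
                          return;
                      }
                      Thread.sleep(intervalMs);
                  }
              } catch (InterruptedException ignored) {
                  // channel is shutting down
              }
          }

          void onDead(Throwable cause) {
              // in Jenkins, this is where the connection would be marked as dead
          }
      }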

          [JENKINS-68122] Agent connection broken (randomly) with error java.util.concurrent.TimeoutException (regression in 2.325)

          Kapa Wo added a comment - - edited

          jstack on a failed agent found one deadlock:

          Found one Java-level deadlock:
          =============================
          "RemoteInvocationHandler [#1]":
            waiting to lock monitor 0x00007fcf28002980 (object 0x0000000624c21510, a hudson.util.RingBufferLogHandler),
            which is held by "Channel reader thread: channel"
          "Channel reader thread: channel":
            waiting to lock monitor 0x00007fcf28006580 (object 0x0000000624c00ce0, a hudson.remoting.RemoteClassLoader),
            which is held by "pool-1-thread-1 for channel id=10591761"
          "pool-1-thread-1 for channel id=10591761":
            waiting to lock monitor 0x00007fcf28002980 (object 0x0000000624c21510, a hudson.util.RingBufferLogHandler),
            which is held by "Channel reader thread: channel"
          Java stack information for the threads listed above:
          ===================================================
          "RemoteInvocationHandler [#1]":
                  at hudson.util.RingBufferLogHandler.publish(RingBufferLogHandler.java:78)
                  - waiting to lock <0x0000000624c21510> (a hudson.util.RingBufferLogHandler)
                  at java.util.logging.Logger.log(java.logging@11.0.14/Logger.java:979)
                  at java.util.logging.Logger.doLog(java.logging@11.0.14/Logger.java:1006)
                  at java.util.logging.Logger.log(java.logging@11.0.14/Logger.java:1051)
                  at hudson.remoting.RemoteInvocationHandler$Unexporter.reportStats(RemoteInvocationHandler.java:702)
                  at hudson.remoting.RemoteInvocationHandler$Unexporter.run(RemoteInvocationHandler.java:594)
                  at java.util.concurrent.Executors$RunnableAdapter.call(java.base@11.0.14/Executors.java:515)
                  at java.util.concurrent.FutureTask.run(java.base@11.0.14/FutureTask.java:264)
                  at hudson.remoting.AtmostOneThreadExecutor$Worker.run(AtmostOneThreadExecutor.java:121)
                  at java.lang.Thread.run(java.base@11.0.14/Thread.java:829)
          "Channel reader thread: channel":
                  at hudson.util.RingBufferLogHandler.publish(RingBufferLogHandler.java:78)
                  - locked <0x0000000624c21510> (a hudson.util.RingBufferLogHandler)
                  at java.util.logging.Logger.log(java.logging@11.0.14/Logger.java:979)
                  at java.util.logging.Logger.doLog(java.logging@11.0.14/Logger.java:1006)
                  at java.util.logging.Logger.log(java.logging@11.0.14/Logger.java:1092)
                  at hudson.remoting.Channel$1.handle(Channel.java:608)
                  at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:81)
          "pool-1-thread-1 for channel id=10591761":
                  at hudson.util.RingBufferLogHandler.publish(RingBufferLogHandler.java:78)
                  - waiting to lock <0x0000000624c21510> (a hudson.util.RingBufferLogHandler)
                  at java.util.logging.Logger.log(java.logging@11.0.14/Logger.java:979)
                  at java.util.logging.Logger.doLog(java.logging@11.0.14/Logger.java:1006)
                  at java.util.logging.Logger.log(java.logging@11.0.14/Logger.java:1092)
                  at hudson.remoting.RemoteClassLoader.prefetchClassReference(RemoteClassLoader.java:387)
                  - locked <0x0000000624ad56f0> (a java.util.Collections$SynchronizedMap)
                  at hudson.remoting.RemoteClassLoader.loadWithMultiClassLoader(RemoteClassLoader.java:253)
                  at hudson.remoting.RemoteClassLoader.findClass(RemoteClassLoader.java:223)
                  at java.lang.ClassLoader.loadClass(java.base@11.0.14/ClassLoader.java:589)
                  - locked <0x0000000624c00ce0> (a hudson.remoting.RemoteClassLoader)
                  at java.lang.ClassLoader.loadClass(java.base@11.0.14/ClassLoader.java:522)
                  at hudson.node_monitors.SwapSpaceMonitor$MonitorTask.call(SwapSpaceMonitor.java:123)
                  at hudson.node_monitors.SwapSpaceMonitor$MonitorTask.call(SwapSpaceMonitor.java:118)
                  at hudson.remoting.UserRequest.perform(UserRequest.java:211)
                  at hudson.remoting.UserRequest.perform(UserRequest.java:54)
                  at hudson.remoting.Request$2.run(Request.java:376)
                  at hudson.remoting.InterceptingExecutorService.lambda$wrap$0(InterceptingExecutorService.java:78)
                  at hudson.remoting.InterceptingExecutorService$$Lambda$44/0x0000000840098840.call(Unknown Source)
                  at java.util.concurrent.FutureTask.run(java.base@11.0.14/FutureTask.java:264)
                  at java.util.concurrent.ThreadPoolExecutor.runWorker(java.base@11.0.14/ThreadPoolExecutor.java:1128)
                  at java.util.concurrent.ThreadPoolExecutor$Worker.run(java.base@11.0.14/ThreadPoolExecutor.java:628)
                  at java.lang.Thread.run(java.base@11.0.14/Thread.java:829)
          Found 1 deadlock.
          
          


          Kapa Wo added a comment -

          Disabled the ping thread on both server and agent per the documentation, and disabled node monitors in the global settings. Still got the error randomly.

          This is a Unix agent
          WARNING: An illegal reflective access operation has occurred
          WARNING: Illegal reflective access by jenkins.slaves.StandardOutputSwapper$ChannelSwapper to constructor java.io.FileDescriptor(int)
          WARNING: Please consider reporting this to the maintainers of jenkins.slaves.StandardOutputSwapper$ChannelSwapper
          WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
          WARNING: All illegal access operations will be denied in a future release
          Evacuated stdout
          Agent successfully connected and online
          ERROR: Failed to monitor for Architecture
          java.util.concurrent.TimeoutException
          	at hudson.remoting.Request$1.get(Request.java:321)
          	at hudson.remoting.Request$1.get(Request.java:240)
          	at hudson.remoting.FutureAdapter.get(FutureAdapter.java:66)
          	at hudson.node_monitors.AbstractAsyncNodeMonitorDescriptor.monitorDetailed(AbstractAsyncNodeMonitorDescriptor.java:112)
          	at hudson.node_monitors.AbstractAsyncNodeMonitorDescriptor.monitor(AbstractAsyncNodeMonitorDescriptor.java:76)
          	at hudson.node_monitors.AbstractNodeMonitorDescriptor$Record.run(AbstractNodeMonitorDescriptor.java:305)
          ERROR: Failed to monitor for Free Swap Space
          ERROR: Failed to monitor for Clock Difference
          java.util.concurrent.TimeoutException
          	at hudson.remoting.Request$1.get(Request.java:321)
          	at hudson.remoting.Request$1.get(Request.java:240)
          	at hudson.remoting.FutureAdapter.get(FutureAdapter.java:66)
          	at hudson.node_monitors.AbstractAsyncNodeMonitorDescriptor.monitorDetailed(AbstractAsyncNodeMonitorDescriptor.java:112)
          	at hudson.node_monitors.AbstractAsyncNodeMonitorDescriptor.monitor(AbstractAsyncNodeMonitorDescriptor.java:76)
          	at hudson.node_monitors.AbstractNodeMonitorDescriptor$Record.run(AbstractNodeMonitorDescriptor.java:305)
          java.util.concurrent.TimeoutException
          	at hudson.remoting.Request$1.get(Request.java:321)
          	at hudson.remoting.Request$1.get(Request.java:240)
          	at hudson.remoting.FutureAdapter.get(FutureAdapter.java:66)
          	at hudson.node_monitors.AbstractAsyncNodeMonitorDescriptor.monitorDetailed(AbstractAsyncNodeMonitorDescriptor.java:112)
          	at hudson.node_monitors.AbstractAsyncNodeMonitorDescriptor.monitor(AbstractAsyncNodeMonitorDescriptor.java:76)
          	at hudson.node_monitors.AbstractNodeMonitorDescriptor$Record.run(AbstractNodeMonitorDescriptor.java:305)
          ERROR: Failed to monitor for Response Time
          ERROR: Failed to monitor for Free Disk Space
          ERROR: Failed to monitor for Free Temp Space
          java.util.concurrent.TimeoutException
          	at hudson.remoting.Request$1.get(Request.java:321)
          	at hudson.remoting.Request$1.get(Request.java:240)
          	at hudson.remoting.FutureAdapter.get(FutureAdapter.java:66)
          	at hudson.node_monitors.AbstractAsyncNodeMonitorDescriptor.monitorDetailed(AbstractAsyncNodeMonitorDescriptor.java:112)
          	at hudson.node_monitors.ResponseTimeMonitor$1.monitor(ResponseTimeMonitor.java:56)
          	at hudson.node_monitors.AbstractNodeMonitorDescriptor$Record.run(AbstractNodeMonitorDescriptor.java:305)
          java.util.concurrent.TimeoutException
          	at hudson.remoting.Request$1.get(Request.java:321)
          	at hudson.remoting.Request$1.get(Request.java:240)
          	at hudson.remoting.FutureAdapter.get(FutureAdapter.java:66)
          	at hudson.node_monitors.AbstractAsyncNodeMonitorDescriptor.monitorDetailed(AbstractAsyncNodeMonitorDescriptor.java:112)
          	at hudson.node_monitors.AbstractAsyncNodeMonitorDescriptor.monitor(AbstractAsyncNodeMonitorDescriptor.java:76)
          	at hudson.node_monitors.AbstractNodeMonitorDescriptor$Record.run(AbstractNodeMonitorDescriptor.java:305)
          java.util.concurrent.TimeoutException
          	at hudson.remoting.Request$1.get(Request.java:321)
          	at hudson.remoting.Request$1.get(Request.java:240)
          	at hudson.remoting.FutureAdapter.get(FutureAdapter.java:66)
          	at hudson.node_monitors.AbstractAsyncNodeMonitorDescriptor.monitorDetailed(AbstractAsyncNodeMonitorDescriptor.java:112)
          	at hudson.node_monitors.AbstractAsyncNodeMonitorDescriptor.monitor(AbstractAsyncNodeMonitorDescriptor.java:76)
          	at hudson.node_monitors.AbstractNodeMonitorDescriptor$Record.run(AbstractNodeMonitorDescriptor.java:305)
          ERROR: Failed to monitor for Architecture
          java.util.concurrent.TimeoutException
          	at hudson.remoting.Request$1.get(Request.java:321)
          	at hudson.remoting.Request$1.get(Request.java:240)
          	at hudson.remoting.FutureAdapter.get(FutureAdapter.java:66)
          	at hudson.node_monitors.AbstractAsyncNodeMonitorDescriptor.monitorDetailed(AbstractAsyncNodeMonitorDescriptor.java:112)
          	at hudson.node_monitors.AbstractAsyncNodeMonitorDescriptor.monitor(AbstractAsyncNodeMonitorDescriptor.java:76)
          	at hudson.node_monitors.AbstractNodeMonitorDescriptor$Record.run(AbstractNodeMonitorDescriptor.java:305)
          ERROR: Failed to monitor for Clock Difference
          java.util.concurrent.TimeoutException
          	at hudson.remoting.Request$1.get(Request.java:321)
          	at hudson.remoting.Request$1.get(Request.java:240)
          	at hudson.remoting.FutureAdapter.get(FutureAdapter.java:66)
          	at hudson.node_monitors.AbstractAsyncNodeMonitorDescriptor.monitorDetailed(AbstractAsyncNodeMonitorDescriptor.java:112)
          	at hudson.node_monitors.AbstractAsyncNodeMonitorDescriptor.monitor(AbstractAsyncNodeMonitorDescriptor.java:76)
          	at hudson.node_monitors.AbstractNodeMonitorDescriptor$Record.run(AbstractNodeMonitorDescriptor.java:305)
          ERROR: Failed to monitor for Free Temp Space
          ERROR: Failed to monitor for Response Time
          java.util.concurrent.TimeoutException
          	at hudson.remoting.Request$1.get(Request.java:321)
          	at hudson.remoting.Request$1.get(Request.java:240)
          	at hudson.remoting.FutureAdapter.get(FutureAdapter.java:66)
          	at hudson.node_monitors.AbstractAsyncNodeMonitorDescriptor.monitorDetailed(AbstractAsyncNodeMonitorDescriptor.java:112)
          	at hudson.node_monitors.AbstractAsyncNodeMonitorDescriptor.monitor(AbstractAsyncNodeMonitorDescriptor.java:76)
          	at hudson.node_monitors.AbstractNodeMonitorDescriptor$Record.run(AbstractNodeMonitorDescriptor.java:305)
          ERROR: ERROR: Failed to monitor for Free Disk Space
          Failed to monitor for Free Swap Space
          java.util.concurrent.TimeoutException
          	at hudson.remoting.Request$1.get(Request.java:321)
          	at hudson.remoting.Request$1.get(Request.java:240)
          	at hudson.remoting.FutureAdapter.get(FutureAdapter.java:66)
          	at hudson.node_monitors.AbstractAsyncNodeMonitorDescriptor.monitorDetailed(AbstractAsyncNodeMonitorDescriptor.java:112)
          	at hudson.node_monitors.ResponseTimeMonitor$1.monitor(ResponseTimeMonitor.java:56)
          	at hudson.node_monitors.AbstractNodeMonitorDescriptor$Record.run(AbstractNodeMonitorDescriptor.java:305)
          java.util.concurrent.TimeoutException
          	at hudson.remoting.Request$1.get(Request.java:321)
          	at hudson.remoting.Request$1.get(Request.java:240)
          	at hudson.remoting.FutureAdapter.get(FutureAdapter.java:66)
          	at hudson.node_monitors.AbstractAsyncNodeMonitorDescriptor.monitorDetailed(AbstractAsyncNodeMonitorDescriptor.java:112)
          	at hudson.node_monitors.AbstractAsyncNodeMonitorDescriptor.monitor(AbstractAsyncNodeMonitorDescriptor.java:76)
          	at hudson.node_monitors.AbstractNodeMonitorDescriptor$Record.run(AbstractNodeMonitorDescriptor.java:305)
          java.util.concurrent.TimeoutException
          	at hudson.remoting.Request$1.get(Request.java:321)
          	at hudson.remoting.Request$1.get(Request.java:240)
          	at hudson.remoting.FutureAdapter.get(FutureAdapter.java:66)
          	at hudson.node_monitors.AbstractAsyncNodeMonitorDescriptor.monitorDetailed(AbstractAsyncNodeMonitorDescriptor.java:112)
          	at hudson.node_monitors.AbstractAsyncNodeMonitorDescriptor.monitor(AbstractAsyncNodeMonitorDescriptor.java:76)
          	at hudson.node_monitors.AbstractNodeMonitorDescriptor$Record.run(AbstractNodeMonitorDescriptor.java:305)
          
          


          Robert Andersson added a comment -

          We're seeing what I believe is the same bug. It started after we upgraded to 2.332.1 last week.

          19 out of 20 times the agent goes into a locked state sometime after it has connected to the master but before it receives a job to execute.

          Pretty much a total showstopper for us, Jenkins is unusable at the moment.  Really hoping there is some workaround for this, like downgrading to the previous agent.jar perhaps?

          I've taken a number of thread dumps of hanging agent.jar processes. They're all slightly different. What they all have in common is multiple threads stuck waiting:
          at hudson.util.RingBufferLogHandler.publish(RingBufferLogHandler.java:78)

          I'll upload a couple of the dumps, but they're really all much the same as what kapawo has uploaded.


          Jorge Torres Martinez added a comment - - edited

          We are also hitting this while testing an upgrade from 2.319.3 to 2.332.1, when trying to add agents via the swarm-plugin. Our pipeline hangs in the node step and this deadlock is found on the agent's side:

          Found one Java-level deadlock:
          =============================
          "pool-1-thread-16":
            waiting to lock monitor 0x00007f2f04007ab8 (object 0x000000008d73ebe8, a hudson.remoting.RemoteClassLoader),
            which is held by "pool-1-thread-8 / waiting for JNLP4-connect connection to <Jenkins Controller> id=58"
          "pool-1-thread-8 / waiting for JNLP4-connect connection to <Jenkins Controller> id=58":
            waiting to lock monitor 0x00007f2f040035f8 (object 0x00000000f1d1ba30, a hudson.util.RingBufferLogHandler),
            which is held by "pool-1-thread-16"
          
          
          Java stack information for the threads listed above:
          ===================================================
          "pool-1-thread-16":
          	at hudson.util.RingBufferLogHandler.publish(RingBufferLogHandler.java:78)
          	- locked <0x00000000f1d1ba30> (a hudson.util.RingBufferLogHandler)
          	at java.util.logging.Logger.log(Logger.java:738)
          	at java.util.logging.Logger.doLog(Logger.java:765)
          	at java.util.logging.Logger.log(Logger.java:831)
          	at org.jenkinsci.remoting.protocol.impl.BIONetworkLayer$Writer.run(BIONetworkLayer.java:184)
          	- locked <0x000000008d6cfea8> (a org.jenkinsci.remoting.protocol.impl.BIONetworkLayer$Writer)
          	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
          	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
          	at hudson.remoting.Engine$1.lambda$newThread$0(Engine.java:122)
          	at hudson.remoting.Engine$1$$Lambda$12/910059620.run(Unknown Source)
          	at java.lang.Thread.run(Thread.java:748)
          "pool-1-thread-8 / waiting for JNLP4-connect connection to <Jenkins Controller> id=58":
          	at hudson.util.RingBufferLogHandler.publish(RingBufferLogHandler.java:78)
          	- waiting to lock <0x00000000f1d1ba30> (a hudson.util.RingBufferLogHandler)
          	at java.util.logging.Logger.log(Logger.java:738)
          	at java.util.logging.Logger.doLog(Logger.java:765)
          	at java.util.logging.Logger.log(Logger.java:851)
          	at org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer.processWrite(SSLEngineFilterLayer.java:486)
          	at org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer.processQueuedWrites(SSLEngineFilterLayer.java:246)
          	- locked <0x000000008d6cfc08> (a java.lang.Object)
          	at org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer.doSend(SSLEngineFilterLayer.java:198)
          	at org.jenkinsci.remoting.protocol.ProtocolStack$Ptr.doSend(ProtocolStack.java:700)
          	at org.jenkinsci.remoting.protocol.ApplicationLayer.write(ApplicationLayer.java:156)
          	at org.jenkinsci.remoting.protocol.impl.ChannelApplicationLayer$ByteBufferCommandTransport.write(ChannelApplicationLayer.java:325)
          	at hudson.remoting.AbstractByteBufferCommandTransport.write(AbstractByteBufferCommandTransport.java:301)
          	at hudson.remoting.Channel.send(Channel.java:766)
          	- locked <0x000000008d6a2ef8> (a hudson.remoting.Channel)
          	at hudson.remoting.Request.call(Request.java:167)
          	- locked <0x00000000dd8c3f08> (a hudson.remoting.RemoteInvocationHandler$RPCRequest)
          	- locked <0x000000008d6a2ef8> (a hudson.remoting.Channel)
          	at hudson.remoting.RemoteInvocationHandler.invoke(RemoteInvocationHandler.java:289)
          	at com.sun.proxy.$Proxy6.fetch3(Unknown Source)
          	at hudson.remoting.RemoteClassLoader.prefetchClassReference(RemoteClassLoader.java:348)
          	at hudson.remoting.RemoteClassLoader.loadWithMultiClassLoader(RemoteClassLoader.java:253)
          	at hudson.remoting.RemoteClassLoader.findClass(RemoteClassLoader.java:223)
          	at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
          	- locked <0x000000008d73ebe8> (a hudson.remoting.RemoteClassLoader)
          	at java.lang.ClassLoader.loadClass(ClassLoader.java:351)
          	at java.lang.Class.getDeclaringClass0(Native Method)
          	at java.lang.Class.getDeclaringClass(Class.java:1235)
          	at java.lang.Class.getEnclosingClass(Class.java:1277)
          	at java.lang.Class.getSimpleBinaryName(Class.java:1443)
          	at java.lang.Class.getSimpleName(Class.java:1309)
          	at java.lang.Class.isAnonymousClass(Class.java:1411)
          	at org.jenkinsci.remoting.util.AnonymousClassWarnings.doCheck(AnonymousClassWarnings.java:76)
          	at org.jenkinsci.remoting.util.AnonymousClassWarnings.lambda$check$0(AnonymousClassWarnings.java:66)
          	at org.jenkinsci.remoting.util.AnonymousClassWarnings$$Lambda$26/541145666.run(Unknown Source)
          	at hudson.remoting.InterceptingExecutorService.lambda$wrap$0(InterceptingExecutorService.java:78)
          	at hudson.remoting.InterceptingExecutorService$$Lambda$27/1987554325.call(Unknown Source)
          	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
          	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
          	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
          	at hudson.remoting.Engine$1.lambda$newThread$0(Engine.java:122)
          	at hudson.remoting.Engine$1$$Lambda$12/910059620.run(Unknown Source)
          	at java.lang.Thread.run(Thread.java:748)
          
          
          Found 1 deadlock. 

          Looking at the code (SSLEngineFilterLayer#L486 and BIONetworkLayer.java#L184), it seems to me that this problem presents itself when the log level is set to FINEST. I have been able to work around it by modifying my logging.properties.


          Robert Andersson added a comment -

          jorgetm

          Our usage that causes the hang is also adding agents via the swarm plugin.

          About your workaround, we haven't done anything with logging.properties, so we should be running at the JDK default INFO level, and we're still getting the deadlocks.

          I'm curious exactly what you did to work around this bug.


          Jorge Torres Martinez added a comment -

          roband7 In our case, we pass in `-Djava.util.logging.config.file` when we launch the swarm-client.jar. The logging.properties file we pass in has historically set the log level to ALL, since we have hit issues in the past with the client and wanted to have logs for debugging our swarm agents. I modified the level to CONFIG and haven't hit the issue again (I have spun up a few affected builds successfully and will keep trying to reproduce, but before, the agents would hang on every build). At first I tried FINE but experienced some hangs with that, so I downgraded even further.
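
          As a rough sketch of what that flag does (the file path and logger name below are only examples), -Djava.util.logging.config.file makes the JVM's LogManager read the given properties file at startup; the ".level" it defines is what decides whether FINE/FINEST records can ever reach a handler. Loading the same file by hand and checking the effective level is a quick way to verify the configuration actually applies:

          import java.io.FileInputStream;
          import java.util.logging.Level;
          import java.util.logging.LogManager;
          import java.util.logging.Logger;

          // Hedged sketch: the equivalent of -Djava.util.logging.config.file, done by hand.
          public class CheckLoggingConfig {
              public static void main(String[] args) throws Exception {
                  try (FileInputStream in = new FileInputStream("/usr/local/etc/logging.properties")) { // example path
                      LogManager.getLogManager().readConfiguration(in);
                  }
                  Logger remoting = Logger.getLogger("org.jenkinsci.remoting");
                  // With ".level = CONFIG" this prints false, i.e. FINEST records are dropped
                  // before any handler sees them.
                  System.out.println("FINEST loggable? " + remoting.isLoggable(Level.FINEST));
              }
          }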


          Robert Andersson added a comment -

          My guess is that something is just iffy with logging and thread synchronization in the agent, and that different log levels and different usage patterns just happen to carry different risks of hitting the bug.

          I'll try changing the log level from INFO to WARNING just to lower the risk even more.


          Robert Andersson added a comment -

          The common pattern in all our thread dumps is that:

          Thread A
          locks a hudson.remoting.RemoteClassLoader
          tries to lock a hudson.util.RingBufferLogHandler (or waits on another thread that does)

          Thread B
          locks a hudson.util.RingBufferLogHandler
          tries to lock a hudson.remoting.RemoteClassLoader (or waits on another thread that does)

          So a classic ABBA deadlock.
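
          To make the pattern concrete, here is a minimal standalone sketch of the same ABBA deadlock; lockA and lockB are hypothetical stand-ins for the RemoteClassLoader and RingBufferLogHandler monitors:

          // Minimal ABBA deadlock sketch: two threads take the same two monitors in
          // opposite order, mirroring the pattern in the dumps above.
          public class AbbaDeadlock {
              private static final Object lockA = new Object(); // plays the RemoteClassLoader monitor
              private static final Object lockB = new Object(); // plays the RingBufferLogHandler monitor

              public static void main(String[] args) {
                  new Thread(() -> {
                      synchronized (lockA) {        // "class loading in progress"
                          sleep(100);
                          synchronized (lockB) { }  // "now try to log" -> blocks forever
                      }
                  }, "threadA").start();
                  new Thread(() -> {
                      synchronized (lockB) {        // "publishing a log record"
                          sleep(100);
                          synchronized (lockA) { }  // "logging triggers class loading" -> blocks forever
                      }
                  }, "threadB").start();
                  // With the sleeps in place, both threads end up blocked on each other's
                  // monitor, and jstack reports "Found one Java-level deadlock".
              }

              private static void sleep(long ms) {
                  try { Thread.sleep(ms); } catch (InterruptedException ignored) { }
              }
          }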


          Kapa Wo added a comment - - edited

          We also did not use a custom logging properties file, but thanks to Jorge's comment I noticed that in the Jenkins system properties, jenkins.model.Jenkins.initLogLevel is set to FINE by default. We pass -Djenkins.model.Jenkins.initLogLevel=INFO in the JVM options of the client configuration, and we have not had any broken agents since then.

          I guess that setting causes the thread in hudson.util.RingBufferLogHandler.publish to hang and block all other Java threads, including the ping thread, which makes the master terminate the agent connection. However, we are not sure why this did not happen when we were on 2.319.


          Robert Andersson added a comment - - edited

          Hmm, neither of the suggested workarounds seems to help here. I now do:

          exec java \
           -cp agent.jar \
           -Djenkins.model.Jenkins.initLogLevel=INFO \
           -Djava.util.logging.config.file=/usr/local/etc/logging.properties \
           hudson.remoting.jnlp.Main \
           -headless \
           -direct $JENKINS_URL \
           -protocols JNLP4-connect \
           -instanceIdentity $INSTANCE_IDENTITY \
           -noreconnect \
           -workDir /tmp \
           $DOCKER_SWARM_PLUGIN_JENKINS_AGENT_SECRET \
           $DOCKER_SWARM_PLUGIN_JENKINS_AGENT_NAME

          And in logging.properties:

          .level = INFO
          
          handlers = java.util.logging.ConsoleHandler
          
          java.util.logging.ConsoleHandler.level = INFO
          java.util.logging.ConsoleHandler.formatter = java.util.logging.SimpleFormatter

          Still the same hang; the FINE/FINEST log calls still get down to the RingBufferLogHandler.

          Looking at the Jenkins source code, the RingBufferLogHandler is programmatically added in hudson.slaves.SlaveComputer.SlaveInitializer.call.
          But how do I get the RingBufferLogHandler down to the INFO log level?

          I guess the changes above don't apply since this is in another ClassLoader?
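
          For what it's worth, standard java.util.logging behavior (sketch below, not Jenkins-specific code, and the logger name is just an example) is that a handler only sees records the logger's effective level lets through, and a programmatic setLevel() call wins over whatever the config file said. That would explain FINE records still reaching the handler despite ".level = INFO" in the file, if something on the agent side sets the logger level programmatically after the config is read:

          import java.util.logging.ConsoleHandler;
          import java.util.logging.Handler;
          import java.util.logging.Level;
          import java.util.logging.Logger;

          public class LevelGatingDemo {
              public static void main(String[] args) {
                  Logger logger = Logger.getLogger("hudson.demo"); // example logger name
                  logger.setUseParentHandlers(false);

                  // Stand-in for the agent's in-memory handler: it accepts every level itself.
                  Handler ringLike = new ConsoleHandler();
                  ringLike.setLevel(Level.ALL);
                  logger.addHandler(ringLike);

                  logger.setLevel(Level.INFO);
                  logger.fine("dropped: the logger filters this before any handler is called");

                  logger.setLevel(Level.FINE); // e.g. set programmatically at startup
                  logger.fine("delivered: handler.publish() is now invoked");
              }
          }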


          Kapa Wo added a comment -

          Hi Robert, I think you need to pass the parameters before the agent.jar argument. You can check the format and all the system property details at this link: https://www.jenkins.io/doc/book/managing/system-properties/

          I hope this helps. By the way, we have had no broken agent connections since we started using that parameter.

          Thank you again, Jorge Torres Martinez, for this discovery. I think the default FINE level explains why the connection broke intermittently.
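
          To make the ordering point concrete, a hedged sketch (the agent name, URL, and paths are made up): JVM options such as the -D property have to come before "-jar agent.jar", because everything after the jar is handed to the program as an ordinary argument rather than to the JVM.

          import java.util.Arrays;
          import java.util.List;

          // Hypothetical launcher sketch: the -D system property sits before "-jar agent.jar".
          // Placed after the jar, it would reach the agent as a plain argument and the JVM
          // would never define the property.
          public class LaunchAgent {
              public static void main(String[] args) throws Exception {
                  List<String> cmd = Arrays.asList(
                          "java",
                          "-Djenkins.model.Jenkins.initLogLevel=INFO",
                          "-jar", "agent.jar",
                          "-jnlpUrl", "https://jenkins.example.com/computer/my-agent/jenkins-agent.jnlp", // made-up URL
                          "-secret", "<redacted>",
                          "-workDir", "/tmp");
                  new ProcessBuilder(cmd).inheritIO().start().waitFor();
              }
          }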


          Basil Crow added a comment -

          jglick Shall we tag team this? This looks serious, and I can trace back the proximate cause to your commits to RingBufferLogHandler in 2.325 (jenkinsci/jenkins#6018 and jenkinsci/jenkins#6044).

          I can reproduce a hang with a Java 11 controller on AWS (us-west-2d) and a Java 11 inbound agent (TCP port, not WebSocket) on my desktop in northern California and this logging.properties file that turns up logging to the maximum. I cannot reproduce a hang in an integration test locally. I cannot reproduce a hang across networks without the aforementioned logging.properties.

          To reproduce, I start a sample Pipeline build while the agent is down, wait for the "Still waiting to schedule task" and "'agent' is offline" messages, and then launch the agent with

          $ java -jar agent.jar -jnlpUrl ${JENKINS_URL}/computer/<redacted>/jenkins-agent.jnlp -secret <redacted> -workDir "<redacted>" -loggingConfig src/jenkinsci/swarm-plugin/client/logging.properties
          

          With that, the build hangs during Git cloning. I have gotten three hangs in a row now. The relevant portions of the thread dump are provided below. With your commits to RingBufferLogHandler from 2.325 reverted as in https://github.com/basil/jenkins/commit/4fe10a9cca0153ed55dcdb6be1c2fa3b48fdf309 I have run the same scenario successfully three times in a row.

          My preference would be for you to file a PR either reverting as in https://github.com/basil/jenkins/commit/4fe10a9cca0153ed55dcdb6be1c2fa3b48fdf309 or fixing the problem. I can then test the PR in the environment described above and merge it toward the next weekly. Presumably someone else can do a backport toward 2.332.3 LTS.

          Selected portions of thread dump during hang below:

          "IOHub#1: Selector[keys:0, gen:0] / pool-1-thread-1" #16 daemon prio=5 os_prio=0 cpu=2.87ms elapsed=124.19s tid=0x00007f04e400a800 nid=0x89c83 waiting for monitor entry  [0x00007f0532928000]
             java.lang.Thread.State: BLOCKED (on object monitor)
                  at hudson.util.RingBufferLogHandler.publish(RingBufferLogHandler.java:78)
                  - waiting to lock <0x0000000629f591f0> (a hudson.util.RingBufferLogHandler)
                  at java.util.logging.Logger.log(java.logging@11.0.14.1/Logger.java:979)
                  at java.util.logging.Logger.doLog(java.logging@11.0.14.1/Logger.java:1006)
                  at java.util.logging.Logger.log(java.logging@11.0.14.1/Logger.java:1072)
                  at org.jenkinsci.remoting.protocol.IOHub.processScheduledTasks(IOHub.java:571)
                  at org.jenkinsci.remoting.protocol.IOHub.run(IOHub.java:447)
                  at java.util.concurrent.ThreadPoolExecutor.runWorker(java.base@11.0.14.1/ThreadPoolExecutor.java:1128)
                  at java.util.concurrent.ThreadPoolExecutor$Worker.run(java.base@11.0.14.1/ThreadPoolExecutor.java:628)
                  at hudson.remoting.Engine$1.lambda$newThread$0(Engine.java:121)
                  at hudson.remoting.Engine$1$$Lambda$85/0x0000000840125c40.run(Unknown Source)
                  at java.lang.Thread.run(java.base@11.0.14.1/Thread.java:829)
          
          "pool-1-thread-3" #19 daemon prio=5 os_prio=0 cpu=438.11ms elapsed=123.89s tid=0x00007f04e4037000 nid=0x89c87 waiting for monitor entry  [0x00007f0532726000]
             java.lang.Thread.State: BLOCKED (on object monitor)
                  at hudson.util.RingBufferLogHandler.publish(RingBufferLogHandler.java:78)
                  - waiting to lock <0x0000000629f591f0> (a hudson.util.RingBufferLogHandler)
                  at java.util.logging.Logger.log(java.logging@11.0.14.1/Logger.java:979)
                  at java.util.logging.Logger.doLog(java.logging@11.0.14.1/Logger.java:1006)
                  at java.util.logging.Logger.log(java.logging@11.0.14.1/Logger.java:1092)
                  at org.jenkinsci.remoting.protocol.impl.BIONetworkLayer$Reader.run(BIONetworkLayer.java:288)
                  at java.util.concurrent.ThreadPoolExecutor.runWorker(java.base@11.0.14.1/ThreadPoolExecutor.java:1128)
                  at java.util.concurrent.ThreadPoolExecutor$Worker.run(java.base@11.0.14.1/ThreadPoolExecutor.java:628)
                  at hudson.remoting.Engine$1.lambda$newThread$0(Engine.java:121)
                  at hudson.remoting.Engine$1$$Lambda$85/0x0000000840125c40.run(Unknown Source)
                  at java.lang.Thread.run(java.base@11.0.14.1/Thread.java:829)
          
          "RemoteInvocationHandler [#1]" #21 daemon prio=5 os_prio=0 cpu=281.00ms elapsed=123.58s tid=0x00007f04e0006800 nid=0x89c89 waiting for monitor entry  [0x00007f05338f6000]
             java.lang.Thread.State: BLOCKED (on object monitor)
                  at hudson.util.RingBufferLogHandler.publish(RingBufferLogHandler.java:78)
                  - waiting to lock <0x0000000629f591f0> (a hudson.util.RingBufferLogHandler)
                  at java.util.logging.Logger.log(java.logging@11.0.14.1/Logger.java:979)
                  at java.util.logging.Logger.doLog(java.logging@11.0.14.1/Logger.java:1006)
                  at java.util.logging.Logger.log(java.logging@11.0.14.1/Logger.java:1051)
                  at hudson.remoting.RemoteInvocationHandler$Unexporter.reportStats(RemoteInvocationHandler.java:702)
                  at hudson.remoting.RemoteInvocationHandler$Unexporter.run(RemoteInvocationHandler.java:594)
                  at java.util.concurrent.Executors$RunnableAdapter.call(java.base@11.0.14.1/Executors.java:515)
                  at java.util.concurrent.FutureTask.run(java.base@11.0.14.1/FutureTask.java:264)
                  at hudson.remoting.AtmostOneThreadExecutor$Worker.run(AtmostOneThreadExecutor.java:121)
                  at java.lang.Thread.run(java.base@11.0.14.1/Thread.java:829)
          
          "pool-1-thread-10 for JNLP4-connect connection to <redacted> id=180" #28 daemon prio=5 os_prio=0 cpu=244.14ms elapsed=122.56s tid=0x00007f04c4005800 nid=0x89d3e waiting for monitor entry  [0x00007f0531d19000]
             java.lang.Thread.State: BLOCKED (on object monitor)
                  at hudson.util.RingBufferLogHandler.publish(RingBufferLogHandler.java:78)
                  - locked <0x0000000629f591f0> (a hudson.util.RingBufferLogHandler)
                  at java.util.logging.Logger.log(java.logging@11.0.14.1/Logger.java:979)
                  at java.util.logging.Logger.doLog(java.logging@11.0.14.1/Logger.java:1006)
                  at java.util.logging.Logger.log(java.logging@11.0.14.1/Logger.java:1072)
                  at hudson.remoting.RemoteClassLoader.prefetchClassReference(RemoteClassLoader.java:417)
                  at hudson.remoting.RemoteClassLoader.loadWithMultiClassLoader(RemoteClassLoader.java:258)
                  at hudson.remoting.RemoteClassLoader.findClass(RemoteClassLoader.java:228)
                  at java.lang.ClassLoader.loadClass(java.base@11.0.14.1/ClassLoader.java:589)
                  - locked <0x00000006251c6060> (a hudson.remoting.RemoteClassLoader)
                  at java.lang.ClassLoader.loadClass(java.base@11.0.14.1/ClassLoader.java:522)
                  at java.lang.ClassLoader.defineClass1(java.base@11.0.14.1/Native Method)
                  at java.lang.ClassLoader.defineClass(java.base@11.0.14.1/ClassLoader.java:1017)
                  at java.lang.ClassLoader.defineClass(java.base@11.0.14.1/ClassLoader.java:878)
                  at hudson.remoting.RemoteClassLoader.loadClassFile(RemoteClassLoader.java:458)
                  at hudson.remoting.RemoteClassLoader.loadRemoteClass(RemoteClassLoader.java:292)
                  - locked <0x00000006251c6060> (a hudson.remoting.RemoteClassLoader)
                  at hudson.remoting.RemoteClassLoader.loadWithMultiClassLoader(RemoteClassLoader.java:269)
                  at hudson.remoting.RemoteClassLoader.findClass(RemoteClassLoader.java:228)
                  at java.lang.ClassLoader.loadClass(java.base@11.0.14.1/ClassLoader.java:589)
                  - locked <0x00000006251c6060> (a hudson.remoting.RemoteClassLoader)
                  at java.lang.ClassLoader.loadClass(java.base@11.0.14.1/ClassLoader.java:522)
                  at java.lang.ClassLoader.defineClass1(java.base@11.0.14.1/Native Method)
                  at java.lang.ClassLoader.defineClass(java.base@11.0.14.1/ClassLoader.java:1017)
                  at java.lang.ClassLoader.defineClass(java.base@11.0.14.1/ClassLoader.java:878)
                  at hudson.remoting.RemoteClassLoader.loadClassFile(RemoteClassLoader.java:458)
                  at hudson.remoting.RemoteClassLoader.loadRemoteClass(RemoteClassLoader.java:292)
                  - locked <0x00000006251c6060> (a hudson.remoting.RemoteClassLoader)
                  at hudson.remoting.RemoteClassLoader.loadWithMultiClassLoader(RemoteClassLoader.java:269)
                  at hudson.remoting.RemoteClassLoader.findClass(RemoteClassLoader.java:228)
                  at java.lang.ClassLoader.loadClass(java.base@11.0.14.1/ClassLoader.java:589)
                  - locked <0x000000062824d4f0> (a hudson.remoting.RemoteClassLoader)
                  at java.lang.ClassLoader.loadClass(java.base@11.0.14.1/ClassLoader.java:522)
                  at org.jenkinsci.plugins.gitclient.GitClient.<clinit>(GitClient.java:50)
                  at org.jenkinsci.plugins.gitclient.CliGitAPIImpl.<init>(CliGitAPIImpl.java:333)
                  at hudson.plugins.git.GitAPI.<init>(GitAPI.java:78)
                  at org.jenkinsci.plugins.gitclient.Git$GitAPIMasterToSlaveFileCallable.invoke(Git.java:173)
                  at org.jenkinsci.plugins.gitclient.Git$GitAPIMasterToSlaveFileCallable.invoke(Git.java:154)
                  at hudson.FilePath$FileCallableWrapper.call(FilePath.java:3487)
                  at hudson.remoting.UserRequest.perform(UserRequest.java:211)
                  at hudson.remoting.UserRequest.perform(UserRequest.java:54)
                  at hudson.remoting.Request$2.run(Request.java:376)
                  at hudson.remoting.InterceptingExecutorService.lambda$wrap$0(InterceptingExecutorService.java:78)
                  at hudson.remoting.InterceptingExecutorService$$Lambda$109/0x000000084017e840.call(Unknown Source)
                  at java.util.concurrent.FutureTask.run(java.base@11.0.14.1/FutureTask.java:264)
                  at java.util.concurrent.ThreadPoolExecutor.runWorker(java.base@11.0.14.1/ThreadPoolExecutor.java:1128)
                  at java.util.concurrent.ThreadPoolExecutor$Worker.run(java.base@11.0.14.1/ThreadPoolExecutor.java:628)
                  at hudson.remoting.Engine$1.lambda$newThread$0(Engine.java:121)
                  at hudson.remoting.Engine$1$$Lambda$85/0x0000000840125c40.run(Unknown Source)
                  at java.lang.Thread.run(java.base@11.0.14.1/Thread.java:829)
          

          Jesse Glick added a comment -

          I think I know what is going on here but it is hard to confirm due to limitations of the thread dump format. Would you mind checking whether https://github.com/jenkinsci/jenkins/pull/6444 fixes the hang?

          Basil Crow added a comment -

          jglick Close, but no cigar. At commit 1fd6a5bb67 I ran three experiments and got three hangs. The results are different, though: I no longer get any stack traces from IOHub#processScheduledTasks, BIONetworkLayer$Reader#run, RemoteInvocationHandler$Unexporter#reportStats, or RemoteClassLoader#prefetchClassReference, but I do still get one stack trace from SSLEngineFilterLayer#processResult in each hang, and in one hang I also got a stack trace from SSLEngineFilterLayer#processWrite. Results pasted below.

          Building on your previous commit with

          diff --git a/core/src/main/java/hudson/util/RingBufferLogHandler.java b/core/src/main/java/hudson/util/RingBufferLogHandler.java
          index 68136f50c6..d7fe1a877a 100644
          --- a/core/src/main/java/hudson/util/RingBufferLogHandler.java
          +++ b/core/src/main/java/hudson/util/RingBufferLogHandler.java
          @@ -30,6 +30,7 @@ import java.util.List;
           import java.util.logging.Handler;
           import java.util.logging.Level;
           import java.util.logging.LogRecord;
          +import net.jcip.annotations.GuardedBy;
           
           /**
            * Log {@link Handler} that stores the log records into a ring buffer.
          @@ -46,8 +47,15 @@ public class RingBufferLogHandler extends Handler {
                   }
               }
           
          +    private final Object lock = new Object();
          +
          +    @GuardedBy("lock")
               private int start = 0;
          +
          +    @GuardedBy("lock")
               private final LogRecordRef[] records;
          +
          +    @GuardedBy("lock")
               private int size;
           
               /**
          @@ -76,7 +84,7 @@ public class RingBufferLogHandler extends Handler {
               @Override
               public void publish(LogRecord record) {
                   LogRecordRef logRecordRef = new LogRecordRef(record);
          -        synchronized (this) {
          +        synchronized (lock) {
                       int len = records.length;
                       records[(start + size) % len] = logRecordRef;
                       if (size == len) {
          @@ -87,9 +95,11 @@ public class RingBufferLogHandler extends Handler {
                   }
               }
           
          -    public synchronized void clear() {
          -        size = 0;
          -        start = 0;
          +    public void clear() {
          +        synchronized (lock) {
          +            size = 0;
          +            start = 0;
          +        }
               }
           
               /**
          @@ -104,7 +114,7 @@ public class RingBufferLogHandler extends Handler {
                       @Override
                       public LogRecord get(int index) {
                           // flip the order
          -                synchronized (RingBufferLogHandler.this) {
          +                synchronized (lock) {
                               LogRecord r = records[(start + (size - (index + 1))) % records.length].get();
                               // We cannot just omit collected entries due to the List interface.
                               return r != null ? r : new LogRecord(Level.OFF, "<discarded>");
          @@ -113,7 +123,7 @@ public class RingBufferLogHandler extends Handler {
           
                       @Override
                       public int size() {
          -                synchronized (RingBufferLogHandler.this) {
          +                synchronized (lock) {
                               // Not actually correct if a log record is added
                               // after this is called but before the list is iterated.
                               // However the size should only ever grow, up to the ring buffer max,
          

          I can chase away the problem completely. With the above changes on top of your original PR, I have run the experiment three times in a row without any hangs. If you think this patch makes sense, please update your PR and then I will prepare a backport PR to get an incremental build to provide relief to affected LTS users.


          Stack traces from commit 1fd6a5bb67:

          Run 1

          "pool-1-thread-3" #19 daemon prio=5 os_prio=0 cpu=2385.81ms elapsed=30.05s tid=0x00007f99cc02f000 nid=0xce45f waiting for monitor entry  [0x00007f9a2920a000]
             java.lang.Thread.State: BLOCKED (on object monitor)
                  at hudson.util.RingBufferLogHandler.publish(RingBufferLogHandler.java:78)
                  at java.util.logging.Logger.log(java.logging@11.0.14.1/Logger.java:979)
                  at java.util.logging.Logger.doLog(java.logging@11.0.14.1/Logger.java:1006)
                  at java.util.logging.Logger.log(java.logging@11.0.14.1/Logger.java:1092)
                  at org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer.processResult(SSLEngineFilterLayer.java:436)
                  at org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer.processRead(SSLEngineFilterLayer.java:344)
                  at org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer.onRecv(SSLEngineFilterLayer.java:119)
                  at org.jenkinsci.remoting.protocol.ProtocolStack$Ptr.onRecv(ProtocolStack.java:677)
                  at org.jenkinsci.remoting.protocol.NetworkLayer.onRead(NetworkLayer.java:137)
                  at org.jenkinsci.remoting.protocol.impl.BIONetworkLayer.access$2200(BIONetworkLayer.java:51)
                  at org.jenkinsci.remoting.protocol.impl.BIONetworkLayer$Reader.run(BIONetworkLayer.java:293)
                  at java.util.concurrent.ThreadPoolExecutor.runWorker(java.base@11.0.14.1/ThreadPoolExecutor.java:1128)
                  at java.util.concurrent.ThreadPoolExecutor$Worker.run(java.base@11.0.14.1/ThreadPoolExecutor.java:628)
                  at hudson.remoting.Engine$1.lambda$newThread$0(Engine.java:121)
                  at hudson.remoting.Engine$1$$Lambda$85/0x0000000840125c40.run(Unknown Source)
                  at java.lang.Thread.run(java.base@11.0.14.1/Thread.java:829)
          
          "pool-1-thread-8 : IO ID=1075 : seq#=1074" #26 daemon prio=5 os_prio=0 cpu=414.01ms elapsed=29.48s tid=0x00007f99a4005000 nid=0xce467 waiting for monitor entry  [0x00007f99bf8f8000]
             java.lang.Thread.State: BLOCKED (on object monitor)
                  at hudson.util.RingBufferLogHandler.publish(RingBufferLogHandler.java:78)
                  at java.util.logging.Logger.log(java.logging@11.0.14.1/Logger.java:979)
                  at java.util.logging.Logger.doLog(java.logging@11.0.14.1/Logger.java:1006)
                  at java.util.logging.Logger.log(java.logging@11.0.14.1/Logger.java:1092)
                  at org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer.processWrite(SSLEngineFilterLayer.java:512)
                  at org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer.processQueuedWrites(SSLEngineFilterLayer.java:244)
                  - locked <0x00000006190863b8> (a java.lang.Object)
                  at org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer.doSend(SSLEngineFilterLayer.java:196)
                  at org.jenkinsci.remoting.protocol.ProtocolStack$Ptr.doSend(ProtocolStack.java:700)
                  at org.jenkinsci.remoting.protocol.ApplicationLayer.write(ApplicationLayer.java:156)
                  at org.jenkinsci.remoting.protocol.impl.ChannelApplicationLayer$ByteBufferCommandTransport.write(ChannelApplicationLayer.java:327)
                  at hudson.remoting.AbstractByteBufferCommandTransport.write(AbstractByteBufferCommandTransport.java:303)
                  at hudson.remoting.Channel.send(Channel.java:765)
                  - locked <0x000000061902cf00> (a hudson.remoting.Channel)
                  at hudson.remoting.ProxyOutputStream$Chunk.lambda$execute$0(ProxyOutputStream.java:275)
                  at hudson.remoting.ProxyOutputStream$Chunk$$Lambda$113/0x0000000840193c40.run(Unknown Source)
                  at hudson.remoting.PipeWriter$1.run(PipeWriter.java:159)
                  at java.util.concurrent.Executors$RunnableAdapter.call(java.base@11.0.14.1/Executors.java:515)
                  at java.util.concurrent.FutureTask.run(java.base@11.0.14.1/FutureTask.java:264)
                  at hudson.remoting.SingleLaneExecutorService$1.run(SingleLaneExecutorService.java:139)
                  at hudson.remoting.InterceptingExecutorService.lambda$wrap$0(InterceptingExecutorService.java:78)
                  at hudson.remoting.InterceptingExecutorService$$Lambda$109/0x000000084017e840.call(Unknown Source)
                  at java.util.concurrent.FutureTask.run(java.base@11.0.14.1/FutureTask.java:264)
                  at java.util.concurrent.ThreadPoolExecutor.runWorker(java.base@11.0.14.1/ThreadPoolExecutor.java:1128)
                  at java.util.concurrent.ThreadPoolExecutor$Worker.run(java.base@11.0.14.1/ThreadPoolExecutor.java:628)
                  at hudson.remoting.Engine$1.lambda$newThread$0(Engine.java:121)
                  at hudson.remoting.Engine$1$$Lambda$85/0x0000000840125c40.run(Unknown Source)
                  at java.lang.Thread.run(java.base@11.0.14.1/Thread.java:829)
          

          Run 2

          "pool-1-thread-3" #19 daemon prio=5 os_prio=0 cpu=2357.43ms elapsed=15.86s tid=0x00007f9ca8037000 nid=0xcf4c0 waiting for monitor entry  [0x00007f9d1483e000]
             java.lang.Thread.State: BLOCKED (on object monitor)
                  at hudson.util.RingBufferLogHandler.publish(RingBufferLogHandler.java:78)
                  at java.util.logging.Logger.log(java.logging@11.0.14.1/Logger.java:979)
                  at java.util.logging.Logger.doLog(java.logging@11.0.14.1/Logger.java:1006)
                  at java.util.logging.Logger.log(java.logging@11.0.14.1/Logger.java:1092)
                  at org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer.processResult(SSLEngineFilterLayer.java:436)
                  at org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer.processRead(SSLEngineFilterLayer.java:344)
                  at org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer.onRecv(SSLEngineFilterLayer.java:119)
                  at org.jenkinsci.remoting.protocol.ProtocolStack$Ptr.onRecv(ProtocolStack.java:677)
                  at org.jenkinsci.remoting.protocol.NetworkLayer.onRead(NetworkLayer.java:137)
                  at org.jenkinsci.remoting.protocol.impl.BIONetworkLayer.access$2200(BIONetworkLayer.java:51)
                  at org.jenkinsci.remoting.protocol.impl.BIONetworkLayer$Reader.run(BIONetworkLayer.java:293)
                  at java.util.concurrent.ThreadPoolExecutor.runWorker(java.base@11.0.14.1/ThreadPoolExecutor.java:1128)
                  at java.util.concurrent.ThreadPoolExecutor$Worker.run(java.base@11.0.14.1/ThreadPoolExecutor.java:628)
                  at hudson.remoting.Engine$1.lambda$newThread$0(Engine.java:121)
                  at hudson.remoting.Engine$1$$Lambda$85/0x0000000840125c40.run(Unknown Source)
                  at java.lang.Thread.run(java.base@11.0.14.1/Thread.java:829)
          

          Run 3

          "pool-1-thread-3" #19 daemon prio=5 os_prio=0 cpu=2835.32ms elapsed=19.86s tid=0x00007f3818036800 nid=0xd71d0 waiting for monitor entry  [0x00007f386f0f5000]
             java.lang.Thread.State: BLOCKED (on object monitor)
                  at hudson.util.RingBufferLogHandler.publish(RingBufferLogHandler.java:78)
                  at java.util.logging.Logger.log(java.logging@11.0.14.1/Logger.java:979)
                  at java.util.logging.Logger.doLog(java.logging@11.0.14.1/Logger.java:1006)
                  at java.util.logging.Logger.log(java.logging@11.0.14.1/Logger.java:1092)
                  at org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer.processResult(SSLEngineFilterLayer.java:436)
                  at org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer.processRead(SSLEngineFilterLayer.java:344)
                  at org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer.onRecv(SSLEngineFilterLayer.java:119)
                  at org.jenkinsci.remoting.protocol.ProtocolStack$Ptr.onRecv(ProtocolStack.java:677)
                  at org.jenkinsci.remoting.protocol.NetworkLayer.onRead(NetworkLayer.java:137)
                  at org.jenkinsci.remoting.protocol.impl.BIONetworkLayer.access$2200(BIONetworkLayer.java:51)
                  at org.jenkinsci.remoting.protocol.impl.BIONetworkLayer$Reader.run(BIONetworkLayer.java:293)
                  at java.util.concurrent.ThreadPoolExecutor.runWorker(java.base@11.0.14.1/ThreadPoolExecutor.java:1128)
                  at java.util.concurrent.ThreadPoolExecutor$Worker.run(java.base@11.0.14.1/ThreadPoolExecutor.java:628)
                  at hudson.remoting.Engine$1.lambda$newThread$0(Engine.java:121)
                  at hudson.remoting.Engine$1$$Lambda$85/0x0000000840125c40.run(Unknown Source)
                  at java.lang.Thread.run(java.base@11.0.14.1/Thread.java:829)
          

          Jesse Glick added a comment -

          Whereas I thought I knew what was going on in the original thread dumps, that hypothesis is apparently contradicted by the thread dump excerpts you show (is there anything else at all related in the full thread dumps, such as a Found one Java-level deadlock report?), and it does not seem to make any sense for the new line 78

          LogRecordRef logRecordRef = new LogRecordRef(record);
          

          to be blocked on any object monitor. Why introducing a separate lock object would help in this context, I cannot guess, so it is unclear whether moving the LogRecordRef constructor (which I hypothesized triggered class loading) out of the synchronized block is even necessary. In other words, this might work just as well:

              public void publish(LogRecord record) {
                  synchronized (lock) {
                      int len = records.length;
                      records[(start + size) % len] = new LogRecordRef(record);
                      if (size == len) {
                          start = (start + 1) % len;
                      } else {
                          size++;
                      }
                  }
              }
          

          Robert Andersson added a comment -

          I have done some testing with all of your patches. I'm still getting deadlocks.

          Looking at this some more, and experimenting with my own patches.

          My conclusion is that it doesn't matter at all how much we change the synchronization here.

          The only thing that matters is that instantiating a LogRecordRef inside the publish method triggers class loading.

          So if a thread that holds the class loader lock waits on another thread, and that other thread does some logging and hits publish, which instantiates LogRecordRef and therefore tries to lock the class loader: boom, deadlock.
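
          To make the lock ordering concrete, here is a minimal stand-alone model of that ABBA cycle (classLoaderMonitor and handlerMonitor are hypothetical stand-ins for the RemoteClassLoader and RingBufferLogHandler monitors; this is an illustration, not the actual Jenkins code, and it is expected to hang):

          // Simplified two-lock model of the deadlock described above. Expected to hang forever.
          public class DeadlockModel {
              private static final Object classLoaderMonitor = new Object(); // stands in for RemoteClassLoader
              private static final Object handlerMonitor = new Object();     // stands in for RingBufferLogHandler

              public static void main(String[] args) {
                  // Thread A: holds the class loader monitor (e.g. resolving a class over the channel),
                  // then logs, which needs the handler monitor.
                  Thread a = new Thread(() -> {
                      synchronized (classLoaderMonitor) {
                          sleep(100);
                          synchronized (handlerMonitor) {
                              System.out.println("A finished");
                          }
                      }
                  });
                  // Thread B: logging takes the handler monitor in publish(), then instantiating
                  // LogRecordRef triggers class loading, which needs the class loader monitor.
                  Thread b = new Thread(() -> {
                      synchronized (handlerMonitor) {
                          sleep(100);
                          synchronized (classLoaderMonitor) {
                              System.out.println("B finished");
                          }
                      }
                  });
                  a.start();
                  b.start();
              }

              private static void sleep(long millis) {
                  try { Thread.sleep(millis); } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
              }
          }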

          Uploading a thread dump showing this; it was taken with the combined patches from Jesse and Basil.

          One hackish fix for this issue is to just instantiate a dummy LogRecordRef in the RingBufferLogHandler constructor, as sketched below. I'm sure there are better ways.
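
          A minimal sketch of that constructor workaround (the field layout follows the diff earlier in this thread, and LogRecordRef is modeled as a SoftReference, as the null check in that patch implies; this is an approximation, not the exact core source):

          import java.lang.ref.SoftReference;
          import java.util.logging.Handler;
          import java.util.logging.Level;
          import java.util.logging.LogRecord;

          // Trimmed approximation of hudson.util.RingBufferLogHandler showing only the preload idea.
          public class RingBufferLogHandler extends Handler {

              // Soft reference wrapper so collected records can be dropped, as in the patch above.
              private static final class LogRecordRef extends SoftReference<LogRecord> {
                  LogRecordRef(LogRecord referent) {
                      super(referent);
                  }
              }

              private final LogRecordRef[] records;
              private int start;
              private int size;

              public RingBufferLogHandler(int ringSize) {
                  records = new LogRecordRef[ringSize];
                  // The "hackish fix": touch LogRecordRef while no RemoteClassLoader monitor is held,
                  // so a later publish() call can never trigger class loading under the handler lock.
                  new LogRecordRef(new LogRecord(Level.OFF, "<preload>"));
              }

              @Override
              public synchronized void publish(LogRecord record) {
                  int len = records.length;
                  records[(start + size) % len] = new LogRecordRef(record);
                  if (size == len) {
                      start = (start + 1) % len;
                  } else {
                      size++;
                  }
              }

              @Override public void flush() {}

              @Override public void close() {}
          }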

          Jesse Glick added a comment -

          If https://github.com/jenkinsci/jenkins/pull/6446 does the trick, then great.

          What I did not fully process before is that this is an agent-side hang. That explains why LogRecordRef has still not been loaded, long after controller startup. Forcing class loading in the constructor inside the agent JVM, before the logger has been registered, could help.

          Robert Andersson added a comment -

          That constructor hack is in fact the exact same workaround I ended up with.

          We've been running it for a few hours now on our production Jenkins server, and so far not a single hang. Looking good.

          Basil Crow added a comment -

          I agree that we should preload LogRecordRef on the agent side to avoid an ABBA deadlock. But rather than doing this in the constructor inside the agent JVM, I wonder if it might be cleaner to do this inside a static initializer as in commit 38efb0b99c. That way it only needs to be done once when RingBufferLogHandler is loaded on the agent rather than every time a new instance of RingBufferLogHandler is created. roband7 Would you be interested in testing this commit?
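
          For illustration, relative to the constructor sketch a few comments up, the static-initializer variant would simply move the preload so it runs once per class load (again an approximation; commit 38efb0b99c is the authoritative change):

              static {
                  // Resolve LogRecordRef as soon as RingBufferLogHandler itself is loaded,
                  // once per JVM rather than once per handler instance.
                  new LogRecordRef(new LogRecord(Level.OFF, "<preload>"));
              }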

          Robert Andersson added a comment -

          Sure, if Jesse agrees with that fix I'll compile and install it on our production server later today. I cannot do it right away; there have been enough disruptions since the 2.332.1 upgrade.

          Our setup hits the bug fast due to heavy agent usage, so we will know soon whether the new fix works too.

          Jesse Glick added a comment -

          Well I hope the fix works this time and it can go into 2.332.3. Since this is pretty severe for those who encounter it, I took a stab at an agent-side workaround. Completely untested, just based on code inspection, but if it works then it might allow you to use a stock core for the next few weeks: https://github.com/jenkinsci/remoting/pull/527

          Jesse Glick added a comment -

          every time a new instance of RingBufferLogHandler is created

          Well, that would be once in the controller JVM plus any occasional plugin usages, and once in the agent JVM. Not really important. But sure, either way works (or ought to work).

          Basil Crow added a comment -

          Fixed in jenkinsci/jenkins#6449 toward 2.343.

          Mark Chester added a comment -

          I read through the comments here and didn't see any workaround to use until the new LTS is released.  We are running on 2.332.3 and have this problem BAD.  We are failing jobs several times per day.  Is there any workaround that does not involve leaving the LTS line?

          Mark Waite added a comment -

          koyaanisqatsi the earlier comments indicate that reducing the logging level of the agent seems to reduce the issue. Have you tried that?
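
          For anyone who did raise agent-side logging, a minimal logging.properties that keeps java.util.logging at its defaults, passed to agent.jar with -loggingConfig as in the reproduction earlier in this issue, might look like the following sketch (adapt to however the agents are actually launched):

          # agent-logging.properties (sketch): keep remoting and everything else at INFO
          handlers = java.util.logging.ConsoleHandler
          .level = INFO
          java.util.logging.ConsoleHandler.level = INFO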

          Mark Chester added a comment -

          markewaite I'm not sure how to do that on our installation.  We are not launching agents from a CLI.  They are launched automatically in AWS by "Clouds and Nodes" configs.  I'm trying to identify the plugin that provides this, which I believe is this one: https://plugins.jenkins.io/ssh-slaves/.  I have not found any way to inject a logging config into our agent startup process.

          Mark Waite added a comment -

          If you're not increasing the logging level, then there is likely no benefit to reducing the logging level.

          You could assist with testing the Jenkins 2.346.1 release candidate and confirm that it is resolved in your environment with that release.

          Basil Crow added a comment -

          markewaite, you appear to be confused about which LTS release this fix was backported to. The fix was backported to 2.332.3, so testing 2.346.1 will not help. koyaanisqatsi I suggest filing a new ticket, as your problem may be unrelated.

          Kapa Wo added a comment -

          We no longer have this issue after upgrading to 2.332.3. Mark Chester, your issue may be different. However, you can try our workaround (earlier comment), which we used before the 2.332.3 LTS was released.

          Mark Chester added a comment -

          Well, the issue has cleared up after I reverted my agent configs back to an EC2 instance type that did not use NVMe storage (to m6i.2xlarge from c5d.4xlarge) and the AMI back to one older than current (to ami-04dd4500af104442f, from ami-0c1bc246476a5572b in eu-west-1).  I also disabled "Stop/Disconnect on Idle Timeout", which I had enabled to save the initialization time of a new instance.  We had updated to NVMe storage due to needing the low-latency storage for Docker-based builds, but that seems to have affected Jenkins in a negative way.  It didn't help the builds anyway, so reverting wasn't a big deal.  Normally I would have made only one change at a time, so that I could see the impact.  But this was blocking a lot of developers and a hotfix we needed to get out.

          I agree my case sounds like something different than this issue.  I don't have the liberty of upgrading outside the LTS versions.  If we can find some time and approval to test things further, I'll open a new issue and include diagnostic info.

            Assignee: Basil Crow
            Reporter: Kapa Wo
            Votes: 5
            Watchers: 11

              Created:
              Updated:
              Resolved: