JENKINS-65873: java.lang.OutOfMemoryError: unable to create new native thread

    • 2.362

      We regularly see issues with the jenkins/inbound-agent in our Jenkins logs on Kubernetes. It seems to occur in around 1% of all jobs.

      The error message is below.

      Although the error message is java.lang.OutOfMemoryError: unable to create new native thread, we have checked the pods and nodes in the cluster, and there is always sufficient memory and thread headroom available at the time of the error.

      The versions in use when this specific error message was captured were:

      jenkins/inbound-agent:4.3-4

      Jenkins 2.263.4

      However we have also seen this error occur with different versions of both the inbound-agent and Jenkins.

      Also: hudson.remoting.Channel$CallSiteStackTrace: Remote call to JNLP4-connect connection from ip-100-64-244-120.eu-west-1.compute.internal/100.64.244.120:39138
      	at hudson.remoting.Channel.attachCallSiteStackTrace(Channel.java:1800)
      	at hudson.remoting.UserRequest$ExceptionResponse.retrieve(UserRequest.java:357)
      	at hudson.remoting.Channel.call(Channel.java:1001)
      	at hudson.FilePath.act(FilePath.java:1157)
      	at hudson.FilePath.act(FilePath.java:1146)
      	at org.jenkinsci.plugins.gitclient.Git.getClient(Git.java:121)
      	at hudson.plugins.git.GitSCM.createClient(GitSCM.java:904)
      	at hudson.plugins.git.GitSCM.createClient(GitSCM.java:835)
      	at hudson.plugins.git.GitSCM.checkout(GitSCM.java:1288)
      	at org.jenkinsci.plugins.workflow.steps.scm.SCMStep.checkout(SCMStep.java:125)
      	at org.jenkinsci.plugins.workflow.steps.scm.SCMStep$StepExecutionImpl.run(SCMStep.java:93)
      	at org.jenkinsci.plugins.workflow.steps.scm.SCMStep$StepExecutionImpl.run(SCMStep.java:80)
      	at org.jenkinsci.plugins.workflow.steps.SynchronousNonBlockingStepExecution.lambda$start$0(SynchronousNonBlockingStepExecution.java:47)
      	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
      	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
      	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
      	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
      java.lang.OutOfMemoryError: unable to create new native thread
      	at java.lang.Thread.start0(Native Method)
      	at java.lang.Thread.start(Thread.java:717)
      	at java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:957)
      	at java.util.concurrent.ThreadPoolExecutor.ensurePrestart(ThreadPoolExecutor.java:1603)
      	at java.util.concurrent.ScheduledThreadPoolExecutor.delayedExecute(ScheduledThreadPoolExecutor.java:334)
      	at java.util.concurrent.ScheduledThreadPoolExecutor.schedule(ScheduledThreadPoolExecutor.java:533)
      	at jenkins.util.InterceptingScheduledExecutorService.schedule(InterceptingScheduledExecutorService.java:49)
      	at org.jenkinsci.plugins.workflow.log.DelayBufferedOutputStream.reschedule(DelayBufferedOutputStream.java:72)
      	at org.jenkinsci.plugins.workflow.log.DelayBufferedOutputStream.<init>(DelayBufferedOutputStream.java:68)
      	at org.jenkinsci.plugins.workflow.log.BufferedBuildListener$Replacement.readResolve(BufferedBuildListener.java:77)
      	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
      	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
      	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
      	at java.lang.reflect.Method.invoke(Method.java:498)
      	at java.io.ObjectStreamClass.invokeReadResolve(ObjectStreamClass.java:1260)
      	at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2133)
      	at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1625)
      	at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2342)
      	at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2266)
      	at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2124)
      	at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1625)
      	at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2342)
      	at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2266)
      	at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2124)
      	at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1625)
      	at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2342)
      	at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2266)
      	at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2124)
      	at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1625)
      	at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2342)
      	at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2266)
      	at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2124)
      	at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1625)
      	at java.io.ObjectInputStream.readObject(ObjectInputStream.java:465)
      	at java.io.ObjectInputStream.readObject(ObjectInputStream.java:423)
      	at hudson.remoting.UserRequest.deserialize(UserRequest.java:290)
      	at hudson.remoting.UserRequest.perform(UserRequest.java:189)
      	at hudson.remoting.UserRequest.perform(UserRequest.java:54)
      	at hudson.remoting.Request$2.run(Request.java:369)
      	at hudson.remoting.InterceptingExecutorService$1.call(InterceptingExecutorService.java:72)
      	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
      	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
      	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
      	at hudson.remoting.Engine$1.lambda$newThread$0(Engine.java:117)
      Caused: java.io.IOException: Remote call on JNLP4-connect connection from ip-100-64-244-120.eu-west-1.compute.internal/100.64.244.120:39138 failed
      	at hudson.remoting.Channel.call(Channel.java:1007)
      	at hudson.FilePath.act(FilePath.java:1157)
      	at hudson.FilePath.act(FilePath.java:1146)
      	at org.jenkinsci.plugins.gitclient.Git.getClient(Git.java:121)
      	at hudson.plugins.git.GitSCM.createClient(GitSCM.java:904)
      	at hudson.plugins.git.GitSCM.createClient(GitSCM.java:835)
      	at hudson.plugins.git.GitSCM.checkout(GitSCM.java:1288)
      	at org.jenkinsci.plugins.workflow.steps.scm.SCMStep.checkout(SCMStep.java:125)
      	at org.jenkinsci.plugins.workflow.steps.scm.SCMStep$StepExecutionImpl.run(SCMStep.java:93)
      	at org.jenkinsci.plugins.workflow.steps.scm.SCMStep$StepExecutionImpl.run(SCMStep.java:80)
      	at org.jenkinsci.plugins.workflow.steps.SynchronousNonBlockingStepExecution.lambda$start$0(SynchronousNonBlockingStepExecution.java:47)
      	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
      	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
      	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
      	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
      	at java.lang.Thread.run(Thread.java:748)
      

          [JENKINS-65873] java.lang.OutOfMemoryError: unable to create new native thread

          Wasim added a comment -

          This error was seen again today with 
          inbound-agent:4.7-1


          Kevin added a comment -

          I have tested up through inbound-agent:4.9-1 w/ controller versions 2.289.2 and 2.304 – the issue is still happening at the same rate.

          I created a job on a dev controller deployment that only checks out scm. I have it set to run once per minute and usually get ~1 failure per hour.


          Christopher added a comment - - edited

          I have this problem frequently.  It is very frustrating.  I have attached a jstack thread capture of the jenkins agent that shows thousands of threads similar to this:

           

          "pool-1-thread-11368" #11392 daemon prio=5 os_prio=0 tid=0x6408f800 nid=0x271c waiting on condition [0x7d9bf000]"pool-1-thread-11368" #11392 daemon prio=5 os_prio=0 tid=0x6408f800 nid=0x271c waiting on condition [0x7d9bf000]   java.lang.Thread.State: TIMED_WAITING (parking) at sun.misc.Unsafe.park(Native Method) - parking to wait for  <0x14d9c070> (a java.util.concurrent.SynchronousQueue$TransferStack) at
          java.util.concurrent.locks.LockSupport.parkNanos(Unknown Source) at
          java.util.concurrent.SynchronousQueue$TransferStack.awaitFulfill(Unknown Source) at
          java.util.concurrent.SynchronousQueue$TransferStack.transfer(Unknown Source) at
          java.util.concurrent.SynchronousQueue.poll(Unknown Source) at
          java.util.concurrent.ThreadPoolExecutor.getTask(Unknown Source) at
          java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) at
          java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) at
          hudson.remoting.Engine$1.lambda$newThread$0(Engine.java:122) at
          hudson.remoting.Engine$1$$Lambda$10/10833120.run(Unknown Source) at
          java.lang.Thread.run(Unknown Source)
          

           

           

          My OutOfMemory exception usually looks like this:

           

           

          java.lang.OutOfMemoryError: unable to create new native thread
           at java.lang.Thread.start0(Native Method)
           at java.lang.Thread.start(Unknown Source)
           at java.util.concurrent.ThreadPoolExecutor.addWorker(Unknown Source)
           at java.util.concurrent.ThreadPoolExecutor.execute(Unknown Source)
           at java.util.concurrent.AbstractExecutorService.submit(Unknown Source)
           at hudson.remoting.DelegatingExecutorService.submit(DelegatingExecutorService.java:51)
           at hudson.remoting.InterceptingExecutorService.submit(InterceptingExecutorService.java:50)
           at hudson.remoting.InterceptingExecutorService.submit(InterceptingExecutorService.java:44)
           at org.jenkinsci.remoting.util.AnonymousClassWarnings.check(AnonymousClassWarnings.java:66)
           at org.jenkinsci.remoting.util.AnonymousClassWarnings$1.annotateClass(AnonymousClassWarnings.java:122)
           at java.io.ObjectOutputStream.writeNonProxyDesc(Unknown Source)
           at java.io.ObjectOutputStream.writeClassDesc(Unknown Source)
           at java.io.ObjectOutputStream.writeOrdinaryObject(Unknown Source)
           at java.io.ObjectOutputStream.writeObject0(Unknown Source)
           at java.io.ObjectOutputStream.writeObject(Unknown Source)
           at hudson.remoting.Command.writeTo(Command.java:111)
           at hudson.remoting.AbstractByteBufferCommandTransport.write(AbstractByteBufferCommandTransport.java:286)
           at hudson.remoting.Channel.send(Channel.java:766)
           at hudson.remoting.ProxyOutputStream.flush(ProxyOutputStream.java:158)
           at hudson.remoting.RemoteOutputStream.flush(RemoteOutputStream.java:117)
           at hudson.util.StreamCopyThread.run(StreamCopyThread.java:71)
           
          

           

          After reading the code for 20 minutes, my hunch is that whenever .flush() is called on an output stream somewhere, it ends up creating a new worker thread to perform the flush.  For some reason those threads accumulate and don't recycle fast enough under some conditions.

           

          I created a script to capture a stack dump once per minute.  The number of pool-1-thread-nnnnn threads would surge up and down over the course of the 90 minutes that my job takes to run.  Most of the time the pool threads seem to recycle fast enough that it works; however, sometimes they can't, and thus the OOME.
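
          To illustrate the effect outside Jenkins, here is a small standalone Java sketch (a toy example of the cached-pool behaviour, not Remoting code): it submits a burst of tasks that each take a few milliseconds, and because a cached thread pool starts a new thread whenever no worker is idle, the peak pool size climbs with the size of the burst.

          import java.util.concurrent.Executors;
          import java.util.concurrent.ThreadPoolExecutor;
          import java.util.concurrent.TimeUnit;

          // Toy sketch: a burst of slightly slow tasks against a cached thread pool.
          // Every submission that finds no idle worker starts a brand-new thread.
          public class CachedPoolBurst {
              public static void main(String[] args) throws InterruptedException {
                  ThreadPoolExecutor pool = (ThreadPoolExecutor) Executors.newCachedThreadPool();
                  for (int i = 0; i < 5_000; i++) {
                      pool.submit(() -> {
                          try {
                              Thread.sleep(5); // stand-in for a flush that takes a few ms over the network
                          } catch (InterruptedException e) {
                              Thread.currentThread().interrupt();
                          }
                      });
                  }
                  pool.shutdown();
                  pool.awaitTermination(1, TimeUnit.MINUTES);
                  System.out.println("peak pool size: " + pool.getLargestPoolSize()); // climbs into the thousands
              }
          }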

           

          It is probably worth noting that my job generates a lot of output: 250,000-350,000 lines, for 35M-50M of output.  Also, the Jenkins agent is running on an AWS instance and the Jenkins master is in a separate datacenter, so perhaps the flushing threads take just long enough over the distant network connection that it contributes to the problem.

           

          I don't remember having this issue (as much) before the agent and the master were farther apart, but I could be wrong.

           


          Christopher added a comment -

          Upon further investigation, the number of pool-1-thread-nnnnnn threads varies directly with the pace of log output. 

           

          In a quiet period (several minutes with no output) there are only 2 of these threads.  At the end of that quiet period, when a lot of logs are produced suddenly (10,000+ lines), the number of these threads jumps as high as 460.  After the log volume slows down, the number of threads slowly drops back down to 50 or so.  Then, as logs are produced at a slower pace, the number of threads stays around there or drops to as low as 10-15.  As log volume spikes back up during noisier tests, the number of threads spikes back up as well.

           

          The highest number I managed to capture was around 2500 of these threads.  Obviously having 2500 threads to flush log streams points to a problem somewhere. 
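
          For anyone who wants to watch the same numbers without parsing jstack output, a rough in-process alternative (just a sketch; it would have to run inside the agent JVM, e.g. from a debugging hook) is to count the executor workers by name via JMX once per minute:

          import java.lang.management.ManagementFactory;
          import java.lang.management.ThreadInfo;
          import java.lang.management.ThreadMXBean;

          // Sketch of an in-process alternative to sampling with jstack: count the
          // executor worker threads (names starting with "pool-") once per minute.
          public class PoolThreadSampler {
              public static void main(String[] args) throws InterruptedException {
                  ThreadMXBean mx = ManagementFactory.getThreadMXBean();
                  while (true) {
                      long poolThreads = 0;
                      for (ThreadInfo info : mx.dumpAllThreads(false, false)) {
                          if (info != null && info.getThreadName().startsWith("pool-")) {
                              poolThreads++;
                          }
                      }
                      System.out.println(System.currentTimeMillis() + " pool-* threads: " + poolThreads);
                      Thread.sleep(60_000);
                  }
              }
          }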

           

          My Speculation:  Perhaps the idea of a worker thread to flush the log was to eliminate some performance bottleneck somewhere in the past?  Maybe flushes were slow in some scenario and having them be async was a win.  But it seems as though in this case the complexity is causing the problem.  Perhaps there needs to be a limit to the number of threads allowed in a worker pool somewhere?

           

           


          Christopher added a comment -

          After reading the code a bit more and playing with some of the unit tests, I think I have a solution:

          After running git clone https://github.com/jenkinsci/remoting.git, I made this change:

           

          --- a/src/main/java/hudson/remoting/Engine.java
          +++ b/src/main/java/hudson/remoting/Engine.java
          @@ -113,7 +113,7 @@
               /**
                * Thread pool that sets {@link #CURRENT}.
                */
          -    private final ExecutorService executor = Executors.newCachedThreadPool(new ThreadFactory() {
          +    private final ExecutorService executor = Executors.newFixedThreadPool(100, new ThreadFactory() {
                   private final ThreadFactory defaultFactory = Executors.defaultThreadFactory();
                   @Override
                   public Thread newThread(@Nonnull final Runnable r) {

           

          I'm running this now and (so far) the problem has not happened again.  git blame showed that this Executors.newCachedThreadPool call has been there for a long time (13 years or so, probably longer).  I don't know this code well enough to understand what other implications this might have, but it does seem to address the thread creation overflow.

           

          After reading the code in Executors.newCachedThreadPool, it basically creates a thread pool where any new request immediately creates a new thread if there are no currently idle threads to service that request.  The number of threads is effectively unbounded.  Executors.newFixedThreadPool uses a different queue for work requests: it simply queues them up and has no more than X (100 here) worker threads processing that queue.  After discovering how those two Executors work internally, Executors.newCachedThreadPool seems like a bomb waiting to go off.  Its unbounded approach to thread creation (the actual limit is around 2^29) sets up a scenario where even a tiny delay in the time spent on each task can quickly blow up the number of threads.  As an experiment, I added a Thread.sleep(1), for a 1 ms sleep, to doCheck in AnonymousClassWarnings.  That caused ProxyWriterTest to create 1000 threads instead of its usual 20 to 30.
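
          For reference, the two factory methods boil down to ThreadPoolExecutor configurations roughly like the sketch below (paraphrasing the JDK javadoc, not Remoting code), which is where the difference in behaviour comes from:

          import java.util.concurrent.ExecutorService;
          import java.util.concurrent.LinkedBlockingQueue;
          import java.util.concurrent.SynchronousQueue;
          import java.util.concurrent.ThreadPoolExecutor;
          import java.util.concurrent.TimeUnit;

          // What the two Executors factory methods construct, per the JDK javadoc.
          public class PoolShapes {

              // newCachedThreadPool(): no core threads, a practically unbounded maximum,
              // a 60-second keep-alive, and a SynchronousQueue that holds no tasks, so a
              // submission with no idle worker always starts a new thread.
              static ExecutorService cachedEquivalent() {
                  return new ThreadPoolExecutor(0, Integer.MAX_VALUE,
                          60L, TimeUnit.SECONDS,
                          new SynchronousQueue<>());
              }

              // newFixedThreadPool(n): exactly n workers and an unbounded LinkedBlockingQueue,
              // so excess submissions wait in the queue instead of spawning more threads.
              static ExecutorService fixedEquivalent(int n) {
                  return new ThreadPoolExecutor(n, n,
                          0L, TimeUnit.MILLISECONDS,
                          new LinkedBlockingQueue<>());
              }
          }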

           


          Basil Crow added a comment -

          Wow, thanks for providing this analysis, cpholt! To give some background, I'm not a maintainer of Remoting (nor do I really understand how it works), but I am a user who has been frustrated with this bug for years, and I maintain the Swarm plugin (a thin wrapper on top of Remoting). I've posted my stack trace earlier in the comments: while I don't have a call to flush() as you do, some other aspects of my setup are similar to yours. For example, I also have my controller in an on-prem datacenter separate from the agents running in AWS.

          I think the key insight you've provided is that this error may be a symptom of thread exhaustion. I didn't consider this as a possibility, but with an unbounded number of threads I can see how they could become exhausted. It's not clear to me what causes these threads to remain active for a long period of time in some cases and to be recycled quickly in other cases. Most systems software that I've seen (e.g., the Linux NFS server) has a configurable thread limit defaulting to some sensible number (e.g. 64 threads) and allowing the user to tune the option if necessary. Perhaps the same should be done here - defaulting to something like 64 and allowing the user to customize the value with a system property. I would encourage you to open a pull request with such a change and see what the maintainers think. I bet they would be receptive to it, though I can't know for sure. Also note that there's another instance of Executors.newCachedThreadPool in the Launcher class that might be susceptible to the same problem.
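
          To make that suggestion concrete, the change might look something like the sketch below; the property name and the default of 64 are made up here purely to illustrate the shape of the idea, not an actual Remoting API.

          import java.util.concurrent.ExecutorService;
          import java.util.concurrent.Executors;

          // Rough sketch of a bounded worker pool whose size is tunable via a system
          // property. Both the property name and the default of 64 are hypothetical.
          class BoundedWorkerPool {
              private static final int LIMIT =
                      Integer.getInteger("hudson.remoting.workerThreads", 64); // hypothetical property

              static ExecutorService create() {
                  // Submissions beyond LIMIT queue up instead of creating more threads.
                  return Executors.newFixedThreadPool(LIMIT);
              }
          }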

          Now if you'll allow me to speculate, and with the caveat that I really don't understand how Remoting works, one of the things I've noticed is that the open source Remoting supports both a BIONetworkLayer (blocking I/O) and a NIONetworkLayer (non-blocking I/O), but as far as I can tell, only the BIONetworkLayer is ever exposed in the open-source version of Jenkins. The NIONetworkLayer only seems to be exposed to users in the UI in the commercial version of CloudBees CI (which I've never used) and is documented as follows:

          The non-blocking I/O connector limits the number of threads that are used to maintain the SSH channel: when there are a large number of channels (that is, many SSH agents) the non-blocking connector uses fewer threads. This permits the Jenkins UI to remain more responsive than with the standard SSH agent connector.

          I find it interesting that they talk about the non-blocking connector using fewer threads. I'm not sure if it's relevant at all to the problem we're dealing with, but it's certainly interesting background information. If you're willing to hack around with Remoting, you can get the non-blocking I/O behavior by commenting out .withPreferNonBlockingIO(false) in Engine. I'm a bit curious if this would make a difference in your use case - with the caveat that I don't really have a solid theory here, since I am too unfamiliar with all of this to be able to begin to reason about things clearly.


          Christopher added a comment -

          basil:  I saw the other spot in Launcher and it was my first attempt at a fix, but Engine proved to be the code path I needed.  I'm guessing that someone who understands this code would fix both.  

           

          The threads remain active for 60 seconds, controlled by a parameter in Executors.newCachedThreadPool.  They get reused if another task arrives within that time.  It's when a massive flood of tasks hits all at once that the number of threads can spike.  I think in this case that happens when a line of output is written (or flushed?).  My Jenkins jobs that trip this bug tend to accumulate large amounts of logs in internal buffers and dump them on certain events.  So the sudden spike in demand leads to thread exhaustion, whereas a thread-limited approach could easily handle the spike in writes.
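
          A quick way to see the 60-second keep-alive in action is the toy example below (not Jenkins code): two tasks submitted a second apart run on the same worker thread, while a task submitted after the keep-alive window runs on a freshly created one.

          import java.util.concurrent.ExecutorService;
          import java.util.concurrent.Executors;
          import java.util.concurrent.TimeUnit;

          // Toy demo of the 60-second keep-alive: watch the thread names printed.
          public class KeepAliveDemo {
              public static void main(String[] args) throws InterruptedException {
                  ExecutorService pool = Executors.newCachedThreadPool();
                  pool.submit(() -> System.out.println("task 1 on " + Thread.currentThread().getName()));
                  Thread.sleep(1_000);  // well within the keep-alive: the idle worker is reused
                  pool.submit(() -> System.out.println("task 2 on " + Thread.currentThread().getName()));
                  Thread.sleep(61_000); // past the keep-alive: the idle worker has exited
                  pool.submit(() -> System.out.println("task 3 on " + Thread.currentThread().getName()));
                  pool.shutdown();
                  pool.awaitTermination(1, TimeUnit.MINUTES);
              }
          }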

           

          The documentation you reference is probably related but exploring that avenue is beyond the amount of time I can invest in this, unless my simple fix fails.

           

          Hopefully one of the Remoting maintainers can chime in and shed some light on this...


          Basil Crow added a comment -

          I put together a local reproducer for this bug. First, I created a Python script to create a large burst of output:

          #!/usr/bin/python3
          
          lipsum = "Lorem ipsum dolor sit amet, consectetur adipiscing elit. Curabitur mollis, sem in aliquet consectetur, diam lacus faucibus leo, ut tincidunt diam elit id justo. Nulla ac libero ut felis iaculis suscipit in in massa. Etiam consectetur suscipit ornare. Pellentesque eu diam tempus, lobortis est non, vulputate nulla. Fusce sagittis sodales turpis, sit amet imperdiet lorem lobortis quis. Cras ac ex nisi. Sed in nisl cursus, consectetur enim non, ultrices libero. In egestas malesuada erat, sit amet consectetur sapien. Nulla massa augue, cursus vitae malesuada ac, tincidunt eu ex. Aliquam vitae mi euismod, placerat sapien a, luctus sapien. Vestibulum at libero pulvinar, vestibulum purus ac, cursus erat. Phasellus vitae orci id ante maximus fermentum. Fusce posuere tincidunt leo, eget placerat sapien fringilla quis. Sed cursus mauris odio, ac interdum felis auctor vel. In ut aliquam massa. Praesent porttitor euismod urna. Suspendisse potenti. In porta libero vel interdum iaculis. Fusce non volutpat lacus. Proin arcu tortor, placerat a sagittis eget, commodo vel ante. Lorem ipsum dolor sit amet, consectetur adipiscing elit. Curabitur sem sem, aliquet at varius vel, dapibus eu lacus. Mauris vel ipsum neque. Etiam elit erat, auctor non sagittis a, volutpat sed nisi. Vivamus tellus dui, tincidunt a imperdiet et, mollis sed orci. Sed euismod, mauris at finibus luctus, diam nisl scelerisque orci, ut efficitur tellus augue nec ante. Quisque commodo ipsum quis nunc dapibus vehicula. Pellentesque dignissim ultricies tortor et euismod. Proin feugiat iaculis nunc sed aliquet. Suspendisse fringilla turpis egestas neque fringilla, at malesuada lacus rutrum. Nam eu venenatis orci. Praesent bibendum dictum dictum. Quisque rhoncus turpis a neque sollicitudin blandit. Donec eget magna ultricies nisi tempor aliquam nec eget neque. Aenean sagittis nunc nec est vehicula suscipit. Sed vitae bibendum quam. Fusce at mi arcu. Ut eget diam quis enim commodo consequat. Aliquam pulvinar erat sit amet mi sollicitudin, eget mollis dui blandit. Nam varius, mi eget interdum consectetur, nibh nulla venenatis orci, sit amet vulputate leo odio at nulla. Donec nunc elit, auctor eget molestie vitae, fermentum a lacus. Orci varius natoque penatibus et magnis dis parturient montes, nascetur ridiculus mus. Curabitur massa enim, vulputate in ipsum nec, blandit lacinia mi. Nulla dignissim est eget congue suscipit. Phasellus sit amet porttitor urna. In ut lacinia sapien Vivamus dapibus consectetur massa, et vehicula ex molestie vitae. Duis efficitur ut sapien eu euismod. Donec id lorem dignissim, aliquam odio id, suscipit lorem. Pellentesque sit amet vulputate sem, ac blandit nunc. Pellentesque faucibus augue sed cursus molestie. Mauris quis nulla erat. In et nulla vel ex fringilla lacinia quis sit amet risus. Nunc in erat quis nisi laoreet iaculis. Pellentesque lobortis pulvinar justo, imperdiet gravida justo ultricies eget. Pellentesque vehicula purus et metus hendrerit, sed placerat metus tincidunt. Proin lacinia hendrerit quam, eu pharetra urna ullamcorper id. Cras mattis eu sem sed facilisis. Vestibulum sit amet libero sit amet eros condimentum congue. Suspendisse et ultricies ante, in rutrum magna. Etiam et fringilla mi, non eleifend arcu. Vestibulum sit amet tristique felis, at congue odio. Ut posuere interdum justo at.\n"
          
          mystr = ""
          
          for x in range(0, 999999):
              mystr += lipsum
          
          print(mystr)
          

          I put this script in /tmp/lipsum.py.

          Then I built Remoting with a 100 millisecond sleep:

          diff --git a/src/main/java/org/jenkinsci/remoting/util/AnonymousClassWarnings.java b/src/main/java/org/jenkinsci/remoting/util/AnonymousClassWarnings.java
          index a4912aa8..10928c25 100644
          --- a/src/main/java/org/jenkinsci/remoting/util/AnonymousClassWarnings.java
          +++ b/src/main/java/org/jenkinsci/remoting/util/AnonymousClassWarnings.java
          @@ -71,6 +71,11 @@ public class AnonymousClassWarnings {
               }
           
               private static void doCheck(@Nonnull Class<?> c) {
          +        try {
          +            Thread.sleep(100);
          +        } catch (Throwable t) {
          +            // do nothing
          +        }
                   if (Enum.class.isAssignableFrom(c)) { // e.g., com.cloudbees.plugins.credentials.CredentialsScope$1 ~ CredentialsScope.SYSTEM
                       // ignore, enums serialize specially
                   } else if (c.isAnonymousClass()) { // e.g., pkg.Outer$1
          

          I installed this with mvn clean install -DskipTests.

          In Jenkins core I used this patch:

          diff --git a/pom.xml b/pom.xml
          index 234651e2bb..44ec2d9423 100644
          --- a/pom.xml
          +++ b/pom.xml
          @@ -91,7 +91,7 @@ THE SOFTWARE.
               <changelog.url>https://www.jenkins.io/changelog</changelog.url>
           
               <!-- Bundled Remoting version -->
          -    <remoting.version>4.10</remoting.version>
          +    <remoting.version>4.11-SNAPSHOT</remoting.version>
               <!-- Minimum Remoting version, which is tested for API compatibility -->
               <remoting.minimum.supported.version>3.14</remoting.minimum.supported.version>
           
          diff --git a/test/src/test/java/hudson/model/ProjectTest.java b/test/src/test/java/hudson/model/ProjectTest.java
          index a63bda9445..c6cd2890ca 100644
          --- a/test/src/test/java/hudson/model/ProjectTest.java
          +++ b/test/src/test/java/hudson/model/ProjectTest.java
          @@ -33,6 +33,8 @@ import hudson.Launcher;
           import hudson.Util;
           import hudson.model.queue.QueueTaskFuture;
           import hudson.security.AccessDeniedException3;
          +import hudson.slaves.RetentionStrategy;
          +import hudson.slaves.SlaveComputer;
           import hudson.tasks.ArtifactArchiver;
           import hudson.tasks.BatchFile;
           import hudson.tasks.BuildTrigger;
          @@ -78,6 +80,7 @@ import hudson.security.ACLContext;
           import hudson.slaves.Cloud;
           import hudson.slaves.DumbSlave;
           import hudson.slaves.NodeProvisioner;
          +import org.jvnet.hudson.test.SimpleCommandLauncher;
           import org.jvnet.hudson.test.TestExtension;
           import java.util.List;
           import java.util.ArrayList;
          @@ -154,7 +157,29 @@ public class ProjectTest {
                   assertNotNull("Project should have Transient Action TransientAction.", p.getAction(TransientAction.class));
                   createAction = false;
               }
          -    
          +
          +    @Test
          +    public void testRemoting() throws Exception {
          +        FreeStyleProject p = j.createFreeStyleProject("project");
          +        int sz = j.jenkins.getNodes().size();
          +        SimpleCommandLauncher launcher = new SimpleCommandLauncher(
          +                String.format("\"%s/bin/java\" -Djava.awt.headless=true -Xmx1g -Xms1g -jar \"%s\"",
          +                        System.getProperty("java.home"),
          +                        new File(j.jenkins.getJnlpJars("agent.jar").getURL().toURI()).getAbsolutePath()));
          +        Slave agent = new DumbSlave("agent" + sz, "description", j.createTmpDir().getPath(), "1", Node.Mode.NORMAL, "", launcher, RetentionStrategy.NOOP, Collections.emptyList());
          +        j.jenkins.addNode(agent);
          +        j.waitOnline(agent);
          +        SlaveComputer computer = (SlaveComputer) agent.toComputer();
          +        System.err.println(computer.getLog());
          +        p.setAssignedNode(agent);
          +        p.getBuildersList().add(new Shell("python3 /tmp/lipsum.py"));
          +        try {
          +            j.buildAndAssertSuccess(p);
          +        } finally {
          +            System.err.println(computer.getLog());
          +        }
          +    }
          +
               @Test
               public void testGetEnvironment() throws Exception{
                   FreeStyleProject p = j.createFreeStyleProject("project");
          
          

          Running the above with MAVEN_OPTS=-Xmx4g mvn clean verify -Dspotbugs.skip=true -Dcheckstyle.skip=true -Dtest=hudson.model.ProjectTest#testRemoting, the test passes. Watching the thread count, I get up to 15,500 threads for the agent process. This is a lot of threads, but not enough to trigger an out of memory error on my system.

          Next I needed a way to trigger the error. I'm on a Linux desktop with about 1,500 threads running at idle, so I tried putting various numbers in /proc/sys/kernel/threads-max to limit the maximum number of threads on my system. By default the limit was over 250,000 threads, which didn't result in an OOM. A limit of 23,000 threads still wasn't enough to trigger an OOM. But a limit of 22,000 threads was enough to consistently trigger this:

          SEVERE: Unexpected error in channel channel
          java.lang.OutOfMemoryError: unable to create new native thread
                  at java.lang.Thread.start0(Native Method)
                  at java.lang.Thread.start(Thread.java:717)
                  at java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:957)
                  at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1378)
                  at java.util.concurrent.AbstractExecutorService.submit(AbstractExecutorService.java:134)
                  at hudson.remoting.DelegatingExecutorService.submit(DelegatingExecutorService.java:51)
                  at hudson.remoting.InterceptingExecutorService.submit(InterceptingExecutorService.java:50)
                  at hudson.remoting.InterceptingExecutorService.submit(InterceptingExecutorService.java:44)
                  at org.jenkinsci.remoting.util.AnonymousClassWarnings.check(AnonymousClassWarnings.java:66)
                  at hudson.remoting.ClassFilter$RegExpClassFilter.isBlacklisted(ClassFilter.java:304)
                  at hudson.remoting.ClassFilter$1.isBlacklisted(ClassFilter.java:123)
                  at hudson.remoting.ClassFilter.check(ClassFilter.java:78)
                  at hudson.remoting.ObjectInputStreamEx.resolveClass(ObjectInputStreamEx.java:61)
                  at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1986)
                  at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1850)
                  at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2160)
                  at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1667)
                  at java.io.ObjectInputStream.readObject(ObjectInputStream.java:503)
                  at java.io.ObjectInputStream.readObject(ObjectInputStream.java:461)
                  at hudson.remoting.Command.readFromObjectStream(Command.java:155)
                  at hudson.remoting.Command.readFrom(Command.java:142)
                  at hudson.remoting.Command.readFrom(Command.java:128)
                  at hudson.remoting.AbstractSynchronousByteArrayCommandTransport.read(AbstractSynchronousByteArrayCommandTransport.java:35)
                  at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:61)
          

          Finding a baseline was important: in the passing scenario, my regular desktop applications were using about 1,500 threads, the agent was using about 15,500 threads, and the other test machinery (e.g. the Jenkins controller process and JUnit) must have been using about 5,000 threads. Based on this, I added this patch to Remoting:

          diff --git a/src/main/java/hudson/remoting/Engine.java b/src/main/java/hudson/remoting/Engine.java
          index f62d556b..59404346 100644
          --- a/src/main/java/hudson/remoting/Engine.java
          +++ b/src/main/java/hudson/remoting/Engine.java
          @@ -113,7 +113,7 @@ public class Engine extends Thread {
               /**
                * Thread pool that sets {@link #CURRENT}.
                */
          -    private final ExecutorService executor = Executors.newCachedThreadPool(new ThreadFactory() {
          +    private final ExecutorService executor = Executors.newFixedThreadPool(5000, new ThreadFactory() {
                   private final ThreadFactory defaultFactory = Executors.defaultThreadFactory();
                   @Override
                   public Thread newThread(@Nonnull final Runnable r) {
          diff --git a/src/main/java/hudson/remoting/Launcher.java b/src/main/java/hudson/remoting/Launcher.java
          index 15742223..8823a1bf 100644
          --- a/src/main/java/hudson/remoting/Launcher.java
          +++ b/src/main/java/hudson/remoting/Launcher.java
          @@ -748,7 +748,7 @@ public class Launcher {
                * @since 2.24
                */
               public static void main(InputStream is, OutputStream os, Mode mode, boolean performPing, @CheckForNull JarCache cache) throws IOException, InterruptedException {
          -        ExecutorService executor = Executors.newCachedThreadPool();
          +        ExecutorService executor = Executors.newFixedThreadPool(5000);
                   ChannelBuilder cb = new ChannelBuilder("channel", executor)
                           .withMode(mode)
                           .withJarCacheOrDefault(cache);
          

          I wanted these numbers to be as high as possible in order for the test to finish in a reasonable amount of time with the 100 millisecond sleep in Remoting while still being low enough to demonstrate that bounding the thread pools works to get the test to pass in a 22,000-thread-limited environment (which was failing with newCachedThreadPool). My theory was that this would cap the agent process at 10,000 threads (rather than the old 15,500), which (along with the 1,500 threads for my desktop applications and the test machinery's 5,000 - 6,000 threads) should still be well under the system's 22,000 thread limit. Sure enough, the fix worked! The test passed again. No more OOM.

          I think this demonstrates that putting an upper bound on the number of threads in Remoting will solve this problem.

          Basil Crow added a comment - I put together a local reproducer for this bug. First, I created a Python script to create a large burst of output: #!/usr/bin/python3 lipsum = "Lorem ipsum dolor sit amet, consectetur adipiscing elit. Curabitur mollis, sem in aliquet consectetur, diam lacus faucibus leo, ut tincidunt diam elit id justo. Nulla ac libero ut felis iaculis suscipit in in massa. Etiam consectetur suscipit ornare. Pellentesque eu diam tempus, lobortis est non, vulputate nulla. Fusce sagittis sodales turpis, sit amet imperdiet lorem lobortis quis. Cras ac ex nisi. Sed in nisl cursus, consectetur enim non, ultrices libero. In egestas malesuada erat, sit amet consectetur sapien. Nulla massa augue, cursus vitae malesuada ac, tincidunt eu ex. Aliquam vitae mi euismod, placerat sapien a, luctus sapien. Vestibulum at libero pulvinar, vestibulum purus ac, cursus erat. Phasellus vitae orci id ante maximus fermentum. Fusce posuere tincidunt leo, eget placerat sapien fringilla quis. Sed cursus mauris odio, ac interdum felis auctor vel. In ut aliquam massa. Praesent porttitor euismod urna. Suspendisse potenti. In porta libero vel interdum iaculis. Fusce non volutpat lacus. Proin arcu tortor, placerat a sagittis eget, commodo vel ante. Lorem ipsum dolor sit amet, consectetur adipiscing elit. Curabitur sem sem, aliquet at varius vel, dapibus eu lacus. Mauris vel ipsum neque. Etiam elit erat, auctor non sagittis a, volutpat sed nisi. Vivamus tellus dui, tincidunt a imperdiet et, mollis sed orci. Sed euismod, mauris at finibus luctus, diam nisl scelerisque orci, ut efficitur tellus augue nec ante. Quisque commodo ipsum quis nunc dapibus vehicula. Pellentesque dignissim ultricies tortor et euismod. Proin feugiat iaculis nunc sed aliquet. Suspendisse fringilla turpis egestas neque fringilla, at malesuada lacus rutrum. Nam eu venenatis orci. Praesent bibendum dictum dictum. Quisque rhoncus turpis a neque sollicitudin blandit. Donec eget magna ultricies nisi tempor aliquam nec eget neque. Aenean sagittis nunc nec est vehicula suscipit. Sed vitae bibendum quam. Fusce at mi arcu. Ut eget diam quis enim commodo consequat. Aliquam pulvinar erat sit amet mi sollicitudin, eget mollis dui blandit. Nam varius, mi eget interdum consectetur, nibh nulla venenatis orci, sit amet vulputate leo odio at nulla. Donec nunc elit, auctor eget molestie vitae, fermentum a lacus. Orci varius natoque penatibus et magnis dis parturient montes, nascetur ridiculus mus. Curabitur massa enim, vulputate in ipsum nec, blandit lacinia mi. Nulla dignissim est eget congue suscipit. Phasellus sit amet porttitor urna. In ut lacinia sapien Vivamus dapibus consectetur massa, et vehicula ex molestie vitae. Duis efficitur ut sapien eu euismod. Donec id lorem dignissim, aliquam odio id, suscipit lorem. Pellentesque sit amet vulputate sem, ac blandit nunc. Pellentesque faucibus augue sed cursus molestie. Mauris quis nulla erat. In et nulla vel ex fringilla lacinia quis sit amet risus. Nunc in erat quis nisi laoreet iaculis. Pellentesque lobortis pulvinar justo, imperdiet gravida justo ultricies eget. Pellentesque vehicula purus et metus hendrerit, sed placerat metus tincidunt. Proin lacinia hendrerit quam, eu pharetra urna ullamcorper id. Cras mattis eu sem sed facilisis. Vestibulum sit amet libero sit amet eros condimentum congue. Suspendisse et ultricies ante, in rutrum magna. Etiam et fringilla mi, non eleifend arcu. Vestibulum sit amet tristique felis, at congue odio. 
Ut posuere interdum justo at.\n"

mystr = ""
for x in range(0, 999999):
    mystr += lipsum
print(mystr)

I put this script in /tmp/lipsum.py. Then I built Remoting with a 100 millisecond sleep:

diff --git a/src/main/java/org/jenkinsci/remoting/util/AnonymousClassWarnings.java b/src/main/java/org/jenkinsci/remoting/util/AnonymousClassWarnings.java
index a4912aa8..10928c25 100644
--- a/src/main/java/org/jenkinsci/remoting/util/AnonymousClassWarnings.java
+++ b/src/main/java/org/jenkinsci/remoting/util/AnonymousClassWarnings.java
@@ -71,6 +71,11 @@ public class AnonymousClassWarnings {
     }
 
     private static void doCheck(@Nonnull Class<?> c) {
+        try {
+            Thread.sleep(100);
+        } catch (Throwable t) {
+            // do nothing
+        }
         if (Enum.class.isAssignableFrom(c)) { // e.g., com.cloudbees.plugins.credentials.CredentialsScope$1 ~ CredentialsScope.SYSTEM
             // ignore, enums serialize specially
         } else if (c.isAnonymousClass()) { // e.g., pkg.Outer$1

I installed this with mvn clean install -DskipTests. In Jenkins core I used this patch:

diff --git a/pom.xml b/pom.xml
index 234651e2bb..44ec2d9423 100644
--- a/pom.xml
+++ b/pom.xml
@@ -91,7 +91,7 @@ THE SOFTWARE.
     <changelog.url>https://www.jenkins.io/changelog</changelog.url>
 
     <!-- Bundled Remoting version -->
-    <remoting.version>4.10</remoting.version>
+    <remoting.version>4.11-SNAPSHOT</remoting.version>
 
     <!-- Minimum Remoting version, which is tested for API compatibility -->
     <remoting.minimum.supported.version>3.14</remoting.minimum.supported.version>
diff --git a/test/src/test/java/hudson/model/ProjectTest.java b/test/src/test/java/hudson/model/ProjectTest.java
index a63bda9445..c6cd2890ca 100644
--- a/test/src/test/java/hudson/model/ProjectTest.java
+++ b/test/src/test/java/hudson/model/ProjectTest.java
@@ -33,6 +33,8 @@ import hudson.Launcher;
 import hudson.Util;
 import hudson.model.queue.QueueTaskFuture;
 import hudson.security.AccessDeniedException3;
+import hudson.slaves.RetentionStrategy;
+import hudson.slaves.SlaveComputer;
 import hudson.tasks.ArtifactArchiver;
 import hudson.tasks.BatchFile;
 import hudson.tasks.BuildTrigger;
@@ -78,6 +80,7 @@ import hudson.security.ACLContext;
 import hudson.slaves.Cloud;
 import hudson.slaves.DumbSlave;
 import hudson.slaves.NodeProvisioner;
+import org.jvnet.hudson.test.SimpleCommandLauncher;
 import org.jvnet.hudson.test.TestExtension;
 import java.util.List;
 import java.util.ArrayList;
@@ -154,7 +157,29 @@ public class ProjectTest {
         assertNotNull("Project should have Transient Action TransientAction.", p.getAction(TransientAction.class));
         createAction = false;
     }
-    
+
+    @Test
+    public void testRemoting() throws Exception {
+        FreeStyleProject p = j.createFreeStyleProject("project");
+        int sz = j.jenkins.getNodes().size();
+        SimpleCommandLauncher launcher = new SimpleCommandLauncher(
+                String.format("\"%s/bin/java\" -Djava.awt.headless=true -Xmx1g -Xms1g -jar \"%s\"",
+                        System.getProperty("java.home"),
+                        new File(j.jenkins.getJnlpJars("agent.jar").getURL().toURI()).getAbsolutePath()));
+        Slave agent = new DumbSlave("agent" + sz, "description", j.createTmpDir().getPath(), "1", Node.Mode.NORMAL, "", launcher, RetentionStrategy.NOOP, Collections.emptyList());
+        j.jenkins.addNode(agent);
+        j.waitOnline(agent);
+        SlaveComputer computer = (SlaveComputer) agent.toComputer();
+        System.err.println(computer.getLog());
+        p.setAssignedNode(agent);
+        p.getBuildersList().add(new Shell("python3 /tmp/lipsum.py"));
+        try {
+            j.buildAndAssertSuccess(p);
+        } finally {
+            System.err.println(computer.getLog());
+        }
+    }
+
     @Test
     public void testGetEnvironment() throws Exception{
         FreeStyleProject p = j.createFreeStyleProject("project");

Running the above with MAVEN_OPTS=-Xmx4g mvn clean verify -Dspotbugs.skip=true -Dcheckstyle.skip=true -Dtest=hudson.model.ProjectTest#testRemoting, the test passes. Watching the thread count, I get up to 15,500 threads for the agent process. This is a lot of threads, but not enough to trigger an out of memory error on my system.

Next I needed a way to trigger the error. I'm on a Linux desktop with about 1,500 threads running at idle, so I tried putting various numbers in /proc/sys/kernel/threads-max to limit the maximum number of threads on my system. By default the limit was over 250,000 threads, which didn't result in an OOM. A limit of 23,000 threads still wasn't enough to trigger an OOM. But a limit of 22,000 threads was enough to consistently trigger this:

SEVERE: Unexpected error in channel channel
java.lang.OutOfMemoryError: unable to create new native thread
	at java.lang.Thread.start0(Native Method)
	at java.lang.Thread.start(Thread.java:717)
	at java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:957)
	at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1378)
	at java.util.concurrent.AbstractExecutorService.submit(AbstractExecutorService.java:134)
	at hudson.remoting.DelegatingExecutorService.submit(DelegatingExecutorService.java:51)
	at hudson.remoting.InterceptingExecutorService.submit(InterceptingExecutorService.java:50)
	at hudson.remoting.InterceptingExecutorService.submit(InterceptingExecutorService.java:44)
	at org.jenkinsci.remoting.util.AnonymousClassWarnings.check(AnonymousClassWarnings.java:66)
	at hudson.remoting.ClassFilter$RegExpClassFilter.isBlacklisted(ClassFilter.java:304)
	at hudson.remoting.ClassFilter$1.isBlacklisted(ClassFilter.java:123)
	at hudson.remoting.ClassFilter.check(ClassFilter.java:78)
	at hudson.remoting.ObjectInputStreamEx.resolveClass(ObjectInputStreamEx.java:61)
	at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1986)
	at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1850)
	at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2160)
	at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1667)
	at java.io.ObjectInputStream.readObject(ObjectInputStream.java:503)
	at java.io.ObjectInputStream.readObject(ObjectInputStream.java:461)
	at hudson.remoting.Command.readFromObjectStream(Command.java:155)
	at hudson.remoting.Command.readFrom(Command.java:142)
	at hudson.remoting.Command.readFrom(Command.java:128)
	at hudson.remoting.AbstractSynchronousByteArrayCommandTransport.read(AbstractSynchronousByteArrayCommandTransport.java:35)
	at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:61)

Finding a baseline was important: in the passing scenario, my regular desktop applications were using about 1,500 threads, the agent was using about 15,500 threads, and the other test machinery (e.g. the Jenkins controller process and JUnit) must have been using about 5,000 threads. Based on this, I added this patch to Remoting:

diff --git a/src/main/java/hudson/remoting/Engine.java b/src/main/java/hudson/remoting/Engine.java
index f62d556b..59404346 100644
--- a/src/main/java/hudson/remoting/Engine.java
+++ b/src/main/java/hudson/remoting/Engine.java
@@ -113,7 +113,7 @@ public class Engine extends Thread {
     /**
      * Thread pool that sets {@link #CURRENT}.
      */
-    private final ExecutorService executor = Executors.newCachedThreadPool(new ThreadFactory() {
+    private final ExecutorService executor = Executors.newFixedThreadPool(5000, new ThreadFactory() {
         private final ThreadFactory defaultFactory = Executors.defaultThreadFactory();
         @Override
         public Thread newThread(@Nonnull final Runnable r) {
diff --git a/src/main/java/hudson/remoting/Launcher.java b/src/main/java/hudson/remoting/Launcher.java
index 15742223..8823a1bf 100644
--- a/src/main/java/hudson/remoting/Launcher.java
+++ b/src/main/java/hudson/remoting/Launcher.java
@@ -748,7 +748,7 @@ public class Launcher {
      * @since 2.24
      */
     public static void main(InputStream is, OutputStream os, Mode mode, boolean performPing, @CheckForNull JarCache cache) throws IOException, InterruptedException {
-        ExecutorService executor = Executors.newCachedThreadPool();
+        ExecutorService executor = Executors.newFixedThreadPool(5000);
         ChannelBuilder cb = new ChannelBuilder("channel", executor)
                 .withMode(mode)
                 .withJarCacheOrDefault(cache);

I wanted these numbers to be as high as possible so that the test would finish in a reasonable amount of time with the 100 millisecond sleep in Remoting, while still being low enough to demonstrate that bounding the thread pools gets the test to pass in a 22,000-thread-limited environment (which was failing with newCachedThreadPool). My theory was that this would cap the agent process at 10,000 threads (rather than the old 15,500), which (along with the 1,500 threads for my desktop applications and the test machinery's 5,000-6,000 threads) should still be well under the system's 22,000 thread limit.

Sure enough, the fix worked! The test passed again. No more OOM. I think this demonstrates that putting an upper bound on the number of threads in Remoting will solve this problem.
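
(For illustration only: the sketch below shows one way a bounded, cached-style pool could replace newCachedThreadPool while still letting idle workers exit. This is not the actual Remoting change; the bound of 5000 mirrors the experiment above, and the choice to queue excess tasks rather than reject them is an assumption a real fix would need to settle.)

import java.util.concurrent.ExecutorService;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class BoundedCachedPoolSketch {
    /**
     * Behaves like Executors.newCachedThreadPool() in that worker threads are
     * created on demand and exit after 60 seconds of idleness, but never grows
     * beyond maxThreads; once all workers are busy, further tasks wait in the
     * queue instead of forcing creation of another native thread.
     */
    static ExecutorService boundedCachedThreadPool(int maxThreads) {
        ThreadPoolExecutor executor = new ThreadPoolExecutor(
                maxThreads, maxThreads,
                60L, TimeUnit.SECONDS,
                new LinkedBlockingQueue<>());
        executor.allowCoreThreadTimeOut(true); // let idle workers exit, as a cached pool does
        return executor;
    }

    public static void main(String[] args) {
        ExecutorService executor = boundedCachedThreadPool(5000); // bound mirrors the experiment above
        executor.submit(() -> System.out.println("task ran on " + Thread.currentThread().getName()));
        executor.shutdown();
    }
}

Whether Engine.java and Launcher.java should use newFixedThreadPool, a construct like the above, or something else entirely is part of the design discussion in the comments below.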

          Christopher added a comment -

          The limit you chose of 5000 would still have failed for me. My Jenkins agent runs on a modestly powered Windows instance. Ideally this limit would come from a parameter or system property. A system property via -D in the agent launcher script? I'm not sure what the usual project standard is for such settings. Is it part of the Node config in the Jenkins UI?

           

           


          Basil Crow added a comment -

          The limit you chose of 5000 would still have failed for me.

          Right - this wasn't intended to be the actual production limit, but just a way to get the problem to reproduce locally.

          System property via -D in the agent launcher script? I'm not sure what the usual project standard is for such settings. Is it part of the Node config in the jenkins UI?

          I'm not too sure either. This is where we'd really need some design guidance from maintainers. jthompson are you still maintaining Remoting?
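
          (As a purely illustrative sketch of the -D idea: one common pattern in Jenkins/Remoting code is a system property named after the owning class, read with Integer.getInteger. The property name and default below are invented for this sketch; no such setting exists today.)

          public class ExecutorLimitSketch {
              // Hypothetical property name, invented for this sketch; not an existing Remoting setting.
              static final String PROPERTY = "hudson.remoting.Engine.maxExecutorThreads";

              // Read the cap from a -D system property, falling back to an assumed size-based default.
              static int maxExecutorThreads() {
                  int fallback = Math.max(64, Runtime.getRuntime().availableProcessors() * 32);
                  return Integer.getInteger(PROPERTY, fallback);
              }

              public static void main(String[] args) {
                  // e.g. java -Dhudson.remoting.Engine.maxExecutorThreads=2000 ExecutorLimitSketch
                  System.out.println("executor bound: " + maxExecutorThreads());
              }
          }

          An agent launcher script would then pass the property with -D ahead of -jar agent.jar; whether a system property, a CLI option, or per-node configuration in the UI is the right home for such a knob is exactly the open design question here.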


          Jeff Thompson added a comment -

          Yes, basil, I am still maintaining Remoting, but on a completely volunteer basis these days. This month and next my free time is very limited, so I'm not going to have much time for prepping a change and testing. I'd be happy to look over a PR, though, especially if it came with sufficient explanation and testing. I'm not sure how much of this area I understand already.


          Kalle Niemitalo added a comment -

          Reformatted the stack trace in the description to make it not display so many pairs of braces.

          Gangadhar Rayudu added a comment -

          We have been seeing the same issue in our environment. basil / cpholt Do you have a Docker image with the fix to try?

          Enrico Walther added a comment - - edited

          We use a self-prepared Docker image based on Ubuntu Focal with remoting-4.10.jar. We observe the same issue running as a pod in our K8s cluster.

           

          apiVersion: "v1"
          kind: "Pod"
          metadata:
            labels:
              jenkins: "slave"
              jenkins/label-digest: "168c12f11d09a233175f435329c242e1f2f941f9"
              jenkins/label: "jenkins-slave-simple"
            name: "jenkins-slave-simple-w4z4f"
          spec:
            containers:
            - env:
              - name: "JENKINS_SECRET"
                value: "********"
              - name: "JENKINS_AGENT_NAME"
                value: "jenkins-slave-simple-w4z4f"
              - name: "JENKINS_NAME"
                value: "jenkins-slave-simple-w4z4f"
              - name: "JENKINS_AGENT_WORKDIR"
                value: "/home/jenkins"
              - name: "JENKINS_URL"
                value: "https://<xxx>"
              image: "registry<xxx>/jenkins-slave-simple:4.10"
              imagePullPolicy: "Always"
              name: "jnlp"
              resources:
                limits:
                  memory: "1024Mi"
                  cpu: "500m"
                requests:
                  memory: "512Mi"
                  cpu: "100m"
              tty: true
              volumeMounts:
              - mountPath: "/home/jenkins"
                name: "workspace-volume"
                readOnly: false
              workingDir: "/home/jenkins"
            hostNetwork: false
            imagePullSecrets:
            - name: "registry-gitlab"
            nodeSelector:
              kubernetes.io/os: "linux"
            restartPolicy: "Never"
            volumes:
            - emptyDir:
                medium: ""
              name: "workspace-volume"
          
          Running on jenkins-slave-simple-w4z4f in /home/jenkins/workspace/<xxx>
          [Pipeline] {
          [Pipeline] stage
          [Pipeline] { (Checkout)
          [Pipeline] deleteDir
          [Pipeline] withCredentials
          Masking supported pattern matches of $BBUser
          [Pipeline] {
          [Pipeline] sh
          [Pipeline] }
          [Pipeline] // withCredentials
          [Pipeline] }
          [Pipeline] // stage
          [Pipeline] emailext
          Request made to compress build log
          #648811 is still in progress; ignoring for purposes of comparison
          Sending email to: <xxx>
          [Pipeline] }
          [Pipeline] // node
          [Pipeline] End of Pipeline
          Also:   hudson.remoting.Channel$CallSiteStackTrace: Remote call to JNLP4-connect connection from 10.42.2.0/10.42.2.0:35994
          		at hudson.remoting.Channel.attachCallSiteStackTrace(Channel.java:1795)
          		at hudson.remoting.UserRequest$ExceptionResponse.retrieve(UserRequest.java:356)
          		at hudson.remoting.Channel.call(Channel.java:1001)
          		at hudson.Launcher$RemoteLauncher.launch(Launcher.java:1123)
          		at hudson.Launcher$ProcStarter.start(Launcher.java:508)
          		at org.jenkinsci.plugins.durabletask.BourneShellScript.launchWithCookie(BourneShellScript.java:176)
          		at org.jenkinsci.plugins.durabletask.FileMonitoringTask.launch(FileMonitoringTask.java:136)
          		at org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep$Execution.start(DurableTaskStep.java:320)
          		at org.jenkinsci.plugins.workflow.cps.DSL.invokeStep(DSL.java:319)
          		at org.jenkinsci.plugins.workflow.cps.DSL.invokeMethod(DSL.java:193)
          		at org.jenkinsci.plugins.workflow.cps.CpsScript.invokeMethod(CpsScript.java:122)
          		at jdk.internal.reflect.GeneratedMethodAccessor42730.invoke(Unknown Source)
          		at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
          		at java.base/java.lang.reflect.Method.invoke(Unknown Source)
          		at org.codehaus.groovy.reflection.CachedMethod.invoke(CachedMethod.java:93)
          		at groovy.lang.MetaMethod.doMethodInvoke(MetaMethod.java:325)
          		at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1213)
          		at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1022)
          		at org.codehaus.groovy.runtime.callsite.PogoMetaClassSite.call(PogoMetaClassSite.java:42)
          		at org.codehaus.groovy.runtime.callsite.CallSiteArray.defaultCall(CallSiteArray.java:48)
          		at org.codehaus.groovy.runtime.callsite.AbstractCallSite.call(AbstractCallSite.java:113)
          		at org.kohsuke.groovy.sandbox.impl.Checker$1.call(Checker.java:163)
          		at org.kohsuke.groovy.sandbox.GroovyInterceptor.onMethodCall(GroovyInterceptor.java:23)
          		at org.jenkinsci.plugins.scriptsecurity.sandbox.groovy.SandboxInterceptor.onMethodCall(SandboxInterceptor.java:158)
          		at org.kohsuke.groovy.sandbox.impl.Checker$1.call(Checker.java:161)
          		at org.kohsuke.groovy.sandbox.impl.Checker.checkedCall(Checker.java:165)
          		at org.kohsuke.groovy.sandbox.impl.Checker.checkedCall(Checker.java:135)
          		at org.kohsuke.groovy.sandbox.impl.Checker.checkedCall(Checker.java:135)
          		at org.kohsuke.groovy.sandbox.impl.Checker.checkedCall(Checker.java:135)
          		at com.cloudbees.groovy.cps.sandbox.SandboxInvoker.methodCall(SandboxInvoker.java:17)
          		at com.cloudbees.groovy.cps.impl.ContinuationGroup.methodCall(ContinuationGroup.java:86)
          		at com.cloudbees.groovy.cps.impl.FunctionCallBlock$ContinuationImpl.dispatchOrArg(FunctionCallBlock.java:113)
          		at com.cloudbees.groovy.cps.impl.FunctionCallBlock$ContinuationImpl.fixArg(FunctionCallBlock.java:83)
          		at jdk.internal.reflect.GeneratedMethodAccessor518.invoke(Unknown Source)
          		at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
          		at java.base/java.lang.reflect.Method.invoke(Unknown Source)
          		at com.cloudbees.groovy.cps.impl.ContinuationPtr$ContinuationImpl.receive(ContinuationPtr.java:72)
          		at com.cloudbees.groovy.cps.impl.ContinuationGroup.methodCall(ContinuationGroup.java:89)
          		at com.cloudbees.groovy.cps.impl.FunctionCallBlock$ContinuationImpl.dispatchOrArg(FunctionCallBlock.java:113)
          		at com.cloudbees.groovy.cps.impl.FunctionCallBlock$ContinuationImpl.fixArg(FunctionCallBlock.java:83)
          		at jdk.internal.reflect.GeneratedMethodAccessor518.invoke(Unknown Source)
          		at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
          		at java.base/java.lang.reflect.Method.invoke(Unknown Source)
          		at com.cloudbees.groovy.cps.impl.ContinuationPtr$ContinuationImpl.receive(ContinuationPtr.java:72)
          		at com.cloudbees.groovy.cps.impl.PropertyishBlock$ContinuationImpl.get(PropertyishBlock.java:76)
          		at com.cloudbees.groovy.cps.LValueBlock$GetAdapter.receive(LValueBlock.java:30)
          		at com.cloudbees.groovy.cps.impl.PropertyishBlock$ContinuationImpl.fixName(PropertyishBlock.java:66)
          		at jdk.internal.reflect.GeneratedMethodAccessor609.invoke(Unknown Source)
          		at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
          		at java.base/java.lang.reflect.Method.invoke(Unknown Source)
          		at com.cloudbees.groovy.cps.impl.ContinuationPtr$ContinuationImpl.receive(ContinuationPtr.java:72)
          		at com.cloudbees.groovy.cps.impl.ConstantBlock.eval(ConstantBlock.java:21)
          		at com.cloudbees.groovy.cps.Next.step(Next.java:83)
          		at com.cloudbees.groovy.cps.Continuable$1.call(Continuable.java:174)
          		at com.cloudbees.groovy.cps.Continuable$1.call(Continuable.java:163)
          		at org.codehaus.groovy.runtime.GroovyCategorySupport$ThreadCategoryInfo.use(GroovyCategorySupport.java:129)
          		at org.codehaus.groovy.runtime.GroovyCategorySupport.use(GroovyCategorySupport.java:268)
          		at com.cloudbees.groovy.cps.Continuable.run0(Continuable.java:163)
          		at org.jenkinsci.plugins.workflow.cps.SandboxContinuable.access$001(SandboxContinuable.java:18)
          		at org.jenkinsci.plugins.workflow.cps.SandboxContinuable.run0(SandboxContinuable.java:51)
          		at org.jenkinsci.plugins.workflow.cps.CpsThread.runNextChunk(CpsThread.java:185)
          		at org.jenkinsci.plugins.workflow.cps.CpsThreadGroup.run(CpsThreadGroup.java:400)
          		at org.jenkinsci.plugins.workflow.cps.CpsThreadGroup.access$400(CpsThreadGroup.java:96)
          		at org.jenkinsci.plugins.workflow.cps.CpsThreadGroup$2.call(CpsThreadGroup.java:312)
          		at org.jenkinsci.plugins.workflow.cps.CpsThreadGroup$2.call(CpsThreadGroup.java:276)
          		at org.jenkinsci.plugins.workflow.cps.CpsVmExecutorService$2.call(CpsVmExecutorService.java:67)
          		at java.base/java.util.concurrent.FutureTask.run(Unknown Source)
          		at hudson.remoting.SingleLaneExecutorService$1.run(SingleLaneExecutorService.java:139)
          		at jenkins.util.ContextResettingExecutorService$1.run(ContextResettingExecutorService.java:28)
          		at jenkins.security.ImpersonatingExecutorService$1.run(ImpersonatingExecutorService.java:68)
          		at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
          		at java.base/java.util.concurrent.FutureTask.run(Unknown Source)
          		at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
          		at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
          		at java.base/java.lang.Thread.run(Unknown Source)
          java.lang.OutOfMemoryError: unable to create new native thread
          	at java.lang.Thread.start0(Native Method)
          	at java.lang.Thread.start(Thread.java:717)
          	at java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:957)
          	at java.util.concurrent.ThreadPoolExecutor.ensurePrestart(ThreadPoolExecutor.java:1603)
          	at java.util.concurrent.ScheduledThreadPoolExecutor.delayedExecute(ScheduledThreadPoolExecutor.java:334)
          	at java.util.concurrent.ScheduledThreadPoolExecutor.schedule(ScheduledThreadPoolExecutor.java:533)
          	at jenkins.util.InterceptingScheduledExecutorService.schedule(InterceptingScheduledExecutorService.java:49)
          	at org.jenkinsci.plugins.workflow.log.DelayBufferedOutputStream.reschedule(DelayBufferedOutputStream.java:72)
          	at org.jenkinsci.plugins.workflow.log.DelayBufferedOutputStream.<init>(DelayBufferedOutputStream.java:68)
          	at org.jenkinsci.plugins.workflow.log.BufferedBuildListener$Replacement.readResolve(BufferedBuildListener.java:77)
          	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
          	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
          	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
          	at java.lang.reflect.Method.invoke(Method.java:498)
          	at java.io.ObjectStreamClass.invokeReadResolve(ObjectStreamClass.java:1274)
          	at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2196)
          	at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1667)
          	at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2405)
          	at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2329)
          	at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2187)
          	at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1667)
          	at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2405)
          	at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2329)
          	at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2187)
          	at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1667)
          	at java.io.ObjectInputStream.readObject(ObjectInputStream.java:503)
          	at java.io.ObjectInputStream.readObject(ObjectInputStream.java:461)
          	at hudson.remoting.UserRequest.deserialize(UserRequest.java:289)
          	at hudson.remoting.UserRequest.perform(UserRequest.java:189)
          	at hudson.remoting.UserRequest.perform(UserRequest.java:54)
          	at hudson.remoting.Request$2.run(Request.java:376)
          	at hudson.remoting.InterceptingExecutorService.lambda$wrap$0(InterceptingExecutorService.java:78)
          	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
          	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
          	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
          	at hudson.remoting.Engine$1.lambda$newThread$0(Engine.java:122)
          	at java.lang.Thread.run(Thread.java:748)
          Caused: java.io.IOException: Remote call on JNLP4-connect connection from 10.42.2.0/10.42.2.0:35994 failed
          	at hudson.remoting.Channel.call(Channel.java:1005)
          	at hudson.Launcher$RemoteLauncher.launch(Launcher.java:1123)
          	at hudson.Launcher$ProcStarter.start(Launcher.java:508)
          	at org.jenkinsci.plugins.durabletask.BourneShellScript.launchWithCookie(BourneShellScript.java:176)
          	at org.jenkinsci.plugins.durabletask.FileMonitoringTask.launch(FileMonitoringTask.java:136)
          	at org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep$Execution.start(DurableTaskStep.java:320)
          	at org.jenkinsci.plugins.workflow.cps.DSL.invokeStep(DSL.java:319)
          	at org.jenkinsci.plugins.workflow.cps.DSL.invokeMethod(DSL.java:193)
          	at org.jenkinsci.plugins.workflow.cps.CpsScript.invokeMethod(CpsScript.java:122)
          	at jdk.internal.reflect.GeneratedMethodAccessor42730.invoke(Unknown Source)
          	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
          	at java.base/java.lang.reflect.Method.invoke(Unknown Source)
          	at org.codehaus.groovy.reflection.CachedMethod.invoke(CachedMethod.java:93)
          	at groovy.lang.MetaMethod.doMethodInvoke(MetaMethod.java:325)
          	at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1213)
          	at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1022)
          	at org.codehaus.groovy.runtime.callsite.PogoMetaClassSite.call(PogoMetaClassSite.java:42)
          	at org.codehaus.groovy.runtime.callsite.CallSiteArray.defaultCall(CallSiteArray.java:48)
          	at org.codehaus.groovy.runtime.callsite.AbstractCallSite.call(AbstractCallSite.java:113)
          	at org.kohsuke.groovy.sandbox.impl.Checker$1.call(Checker.java:163)
          	at org.kohsuke.groovy.sandbox.GroovyInterceptor.onMethodCall(GroovyInterceptor.java:23)
          	at org.jenkinsci.plugins.scriptsecurity.sandbox.groovy.SandboxInterceptor.onMethodCall(SandboxInterceptor.java:158)
          	at org.kohsuke.groovy.sandbox.impl.Checker$1.call(Checker.java:161)
          	at org.kohsuke.groovy.sandbox.impl.Checker.checkedCall(Checker.java:165)
          	at org.kohsuke.groovy.sandbox.impl.Checker.checkedCall(Checker.java:135)
          	at org.kohsuke.groovy.sandbox.impl.Checker.checkedCall(Checker.java:135)
          	at org.kohsuke.groovy.sandbox.impl.Checker.checkedCall(Checker.java:135)
          	at com.cloudbees.groovy.cps.sandbox.SandboxInvoker.methodCall(SandboxInvoker.java:17)
          	at WorkflowScript.run(WorkflowScript:155)
          	at ___cps.transform___(Native Method)
          	at com.cloudbees.groovy.cps.impl.ContinuationGroup.methodCall(ContinuationGroup.java:86)
          	at com.cloudbees.groovy.cps.impl.FunctionCallBlock$ContinuationImpl.dispatchOrArg(FunctionCallBlock.java:113)
          	at com.cloudbees.groovy.cps.impl.FunctionCallBlock$ContinuationImpl.fixArg(FunctionCallBlock.java:83)
          	at jdk.internal.reflect.GeneratedMethodAccessor518.invoke(Unknown Source)
          	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
          	at java.base/java.lang.reflect.Method.invoke(Unknown Source)
          	at com.cloudbees.groovy.cps.impl.ContinuationPtr$ContinuationImpl.receive(ContinuationPtr.java:72)
          	at com.cloudbees.groovy.cps.impl.ContinuationGroup.methodCall(ContinuationGroup.java:89)
          	at com.cloudbees.groovy.cps.impl.FunctionCallBlock$ContinuationImpl.dispatchOrArg(FunctionCallBlock.java:113)
          	at com.cloudbees.groovy.cps.impl.FunctionCallBlock$ContinuationImpl.fixArg(FunctionCallBlock.java:83)
          	at jdk.internal.reflect.GeneratedMethodAccessor518.invoke(Unknown Source)
          	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
          	at java.base/java.lang.reflect.Method.invoke(Unknown Source)
          	at com.cloudbees.groovy.cps.impl.ContinuationPtr$ContinuationImpl.receive(ContinuationPtr.java:72)
          	at com.cloudbees.groovy.cps.impl.PropertyishBlock$ContinuationImpl.get(PropertyishBlock.java:76)
          	at com.cloudbees.groovy.cps.LValueBlock$GetAdapter.receive(LValueBlock.java:30)
          	at com.cloudbees.groovy.cps.impl.PropertyishBlock$ContinuationImpl.fixName(PropertyishBlock.java:66)
          	at jdk.internal.reflect.GeneratedMethodAccessor609.invoke(Unknown Source)
          	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
          	at java.base/java.lang.reflect.Method.invoke(Unknown Source)
          	at com.cloudbees.groovy.cps.impl.ContinuationPtr$ContinuationImpl.receive(ContinuationPtr.java:72)
          	at com.cloudbees.groovy.cps.impl.ConstantBlock.eval(ConstantBlock.java:21)
          	at com.cloudbees.groovy.cps.Next.step(Next.java:83)
          	at com.cloudbees.groovy.cps.Continuable$1.call(Continuable.java:174)
          	at com.cloudbees.groovy.cps.Continuable$1.call(Continuable.java:163)
          	at org.codehaus.groovy.runtime.GroovyCategorySupport$ThreadCategoryInfo.use(GroovyCategorySupport.java:129)
          	at org.codehaus.groovy.runtime.GroovyCategorySupport.use(GroovyCategorySupport.java:268)
          	at com.cloudbees.groovy.cps.Continuable.run0(Continuable.java:163)
          	at org.jenkinsci.plugins.workflow.cps.SandboxContinuable.access$001(SandboxContinuable.java:18)
          	at org.jenkinsci.plugins.workflow.cps.SandboxContinuable.run0(SandboxContinuable.java:51)
          	at org.jenkinsci.plugins.workflow.cps.CpsThread.runNextChunk(CpsThread.java:185)
          	at org.jenkinsci.plugins.workflow.cps.CpsThreadGroup.run(CpsThreadGroup.java:400)
          	at org.jenkinsci.plugins.workflow.cps.CpsThreadGroup.access$400(CpsThreadGroup.java:96)
          	at org.jenkinsci.plugins.workflow.cps.CpsThreadGroup$2.call(CpsThreadGroup.java:312)
          	at org.jenkinsci.plugins.workflow.cps.CpsThreadGroup$2.call(CpsThreadGroup.java:276)
          	at org.jenkinsci.plugins.workflow.cps.CpsVmExecutorService$2.call(CpsVmExecutorService.java:67)
          	at java.base/java.util.concurrent.FutureTask.run(Unknown Source)
          	at hudson.remoting.SingleLaneExecutorService$1.run(SingleLaneExecutorService.java:139)
          	at jenkins.util.ContextResettingExecutorService$1.run(ContextResettingExecutorService.java:28)
          	at jenkins.security.ImpersonatingExecutorService$1.run(ImpersonatingExecutorService.java:68)
          	at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
          	at java.base/java.util.concurrent.FutureTask.run(Unknown Source)
          	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
          	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
          	at java.base/java.lang.Thread.run(Unknown Source)
          Finished: FAILURE
          


          Donald Gobin added a comment -

          The problem seems to be with the Jenkins core itself and the way it spawns threads to log messages, as can be seen from the stack trace:

           

          java.lang.OutOfMemoryError: unable to create new native thread
          ***********************************************************************************************************
          	at java.lang.Thread.start0(Native Method)
          	at java.lang.Thread.start(Thread.java:717)
          	at java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:957)
          	at java.util.concurrent.ThreadPoolExecutor.ensurePrestart(ThreadPoolExecutor.java:1603)
          	at java.util.concurrent.ScheduledThreadPoolExecutor.delayedExecute(ScheduledThreadPoolExecutor.java:334)
          	at java.util.concurrent.ScheduledThreadPoolExecutor.schedule(ScheduledThreadPoolExecutor.java:533)
          	at jenkins.util.InterceptingScheduledExecutorService.schedule(InterceptingScheduledExecutorService.java:49)
          	at org.jenkinsci.plugins.workflow.log.DelayBufferedOutputStream.reschedule(DelayBufferedOutputStream.java:72)
          	at org.jenkinsci.plugins.workflow.log.DelayBufferedOutputStream.<init>(DelayBufferedOutputStream.java:68)
          	at org.jenkinsci.plugins.workflow.log.BufferedBuildListener$Replacement.readResolve(BufferedBuildListener.java:77)
          *********************************************************************************************************** 
          	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
          	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
          	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
          	at java.lang.reflect.Method.invoke(Method.java:498)
          	at java.io.ObjectStreamClass.invokeReadResolve(ObjectStreamClass.java:1260)
          	at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2133)
          	at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1625)
          	at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2342)
          	at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2266)
          	at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2124)
          	at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1625)
          	at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2342)
          	at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2266)
          	at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2124)
          	at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1625)
          	at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2342)
          	at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2266)
          	at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2124)
          	at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1625)
          	at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2342)
          	at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2266)
          	at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2124)
          	at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1625)
          	at java.io.ObjectInputStream.readObject(ObjectInputStream.java:465)
          	at java.io.ObjectInputStream.readObject(ObjectInputStream.java:423)
          	at hudson.remoting.UserRequest.deserialize(UserRequest.java:290)
          	at hudson.remoting.UserRequest.perform(UserRequest.java:189)
          	at hudson.remoting.UserRequest.perform(UserRequest.java:54)
          	at hudson.remoting.Request$2.run(Request.java:369)
          	at hudson.remoting.InterceptingExecutorService$1.call(InterceptingExecutorService.java:72)
          	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
          	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
          	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
          	at hudson.remoting.Engine$1.lambda$newThread$0(Engine.java:117)
          

          This is from the server side as the workflow plugin does not exist on the agent side.
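
          (To make the pattern in the trace concrete: the sketch below is not the workflow-api code, just a minimal stand-in showing how constructing a log-buffering object during deserialization can itself demand a new pool thread, which is where Thread.start() fails once the process is at its native-thread limit.)

          import java.util.concurrent.Executors;
          import java.util.concurrent.ScheduledExecutorService;
          import java.util.concurrent.TimeUnit;

          /** Minimal stand-in for the DelayBufferedOutputStream pattern seen in the trace (not the real class). */
          class DelayedFlushSketch {
              private final ScheduledExecutorService scheduler;

              DelayedFlushSketch(ScheduledExecutorService scheduler) {
                  this.scheduler = scheduler;
                  reschedule(); // scheduling from the constructor, as readResolve() does in the trace above
              }

              private void reschedule() {
                  // schedule() may need to start a new worker thread; under a native-thread
                  // limit this is the call that ends in OutOfMemoryError.
                  scheduler.schedule(this::flushAndReschedule, 1, TimeUnit.SECONDS);
              }

              private void flushAndReschedule() {
                  // flush buffered log output here, then re-arm the timer
                  reschedule();
              }

              public static void main(String[] args) throws InterruptedException {
                  ScheduledExecutorService scheduler = Executors.newScheduledThreadPool(1);
                  new DelayedFlushSketch(scheduler);
                  Thread.sleep(2500);      // let a couple of flush cycles run
                  scheduler.shutdownNow(); // stop the demo
              }
          }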

           

           

           


          Vincent Latombe added a comment -

          I submitted PR #505, which should make it possible to test whether the current class check is part of the problem without impacting normal operations.

          Donald Gobin added a comment -

          Hi vlatombe

          Is this fix for the server side or the agent side?

           


          Vincent Latombe added a comment -

          dg424 this is the agent side.

          Donald Gobin added a comment -

          Hi vlatombe

          But I see the remoting stack is on both sides (remoting.jar is in the jenkins.war file as well), and the stack trace in my comment above shows classes (org.jenkinsci.plugins.workflow.log, jenkins.util.InterceptingScheduledExecutorService) that I cannot find in remoting.jar on the agent side. So I'm actually not sure where the OOM is happening; if your PR addresses only the agent side, does that mean the root cause of the exception is on the agent side but the error shows up on the server side? I'm confused.


          Kalle Niemitalo added a comment -

          I see org.jenkinsci.plugins.workflow.log classes in these files on an agent:

          • remoting/jarCache/06/D303140AA1A4E2367F9A63F58D3127.jar (workflow-api 1136.v7f5f1759dc16)
          • remoting/jarCache/AA/E8875DDC0E79929E944D30636208F6.jar (workflow-api 1108.v57edf648f5d4)
          • remoting/jarCache/EC/7A1A038FDCBC2456010A181E58E35B.jar (workflow-api 1122.v7a_916f363c86)

          I don't know whether those file names are hashes or just random. Anyway, it's conceivable that the agent could load org.jenkinsci.plugins.workflow.log.DelayBufferedOutputStream etc. from these files.


          Kalle Niemitalo added a comment - - edited

          Oh, Checksum.java first computes a SHA-256 hash, but then splits it into two 128-bit halves and XORs them together. It's not just a truncated hash as in section 5.1 of NIST Special Publication 800-107 Revision 1.

          $ sha256sum remoting/jarCache/06/D303140AA1A4E2367F9A63F58D3127.jar
          38165eeaa9e20f4a5bdced3d142660b13ec55dfea343aba86da3775ee1ab5196 *remoting/jarCache/06/D303140AA1A4E2367F9A63F58D3127.jar
          

          38165eeaa9e20f4a5bdced3d142660b1 xor 3ec55dfea343aba86da3775ee1ab5196 = 06D303140AA1A4E2367F9A63F58D3127
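
          (A small sketch of the naming scheme described above; this illustrates the fold-by-XOR idea and is not the actual Checksum.java source.)

          import java.nio.file.Files;
          import java.nio.file.Paths;
          import java.security.MessageDigest;

          public class JarCacheNameSketch {
              public static void main(String[] args) throws Exception {
                  byte[] jar = Files.readAllBytes(Paths.get(args[0])); // path to a cached jar
                  byte[] sha256 = MessageDigest.getInstance("SHA-256").digest(jar);

                  // Fold the 32-byte digest into 16 bytes by XORing the two halves.
                  byte[] folded = new byte[16];
                  for (int i = 0; i < 16; i++) {
                      folded[i] = (byte) (sha256[i] ^ sha256[i + 16]);
                  }

                  StringBuilder name = new StringBuilder();
                  for (byte b : folded) {
                      name.append(String.format("%02X", b));
                  }
                  // For the jar in the sha256sum example above this prints
                  // 06D303140AA1A4E2367F9A63F58D3127, matching the XOR worked out by hand.
                  System.out.println(name);
              }
          }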


          Donald Gobin added a comment -

          Hi kon,

          Thanks. I see it now. So, these classes are "shipped" to the agent at runtime. If I fire up the agent and do not start a job, the classes do not exist. Just trying to understand how the process works...


          Donald Gobin added a comment - - edited

          Tried the PR on my stress test job and still get OOM with 
          -Dorg.jenkinsci.remoting.util.AnonymousClassWarnings.useSeparateThreadPool=true

          java.lang.OutOfMemoryError: unable to create new native thread
          	at java.lang.Thread.start0(Native Method)
          	at java.lang.Thread.start(Thread.java:717)
          	at hudson.remoting.AtmostOneThreadExecutor.execute(AtmostOneThreadExecutor.java:104)
          	at java.util.concurrent.AbstractExecutorService.submit(AbstractExecutorService.java:112)
          	at org.jenkinsci.remoting.util.AnonymousClassWarnings.check(AnonymousClassWarnings.java:73)
          	at org.jenkinsci.remoting.util.AnonymousClassWarnings$1.annotateClass(AnonymousClassWarnings.java:130)
          	at java.io.ObjectOutputStream.writeNonProxyDesc(ObjectOutputStream.java:1290)
          	at java.io.ObjectOutputStream.writeClassDesc(ObjectOutputStream.java:1231)
          	at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1427)
          	at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
          	at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348)
          	at hudson.remoting.Command.writeTo(Command.java:111)
          	at hudson.remoting.AbstractByteBufferCommandTransport.write(AbstractByteBufferCommandTransport.java:287)
          	at hudson.remoting.Channel.send(Channel.java:766)
          	at hudson.remoting.Request.callAsync(Request.java:238)
          	at hudson.remoting.Channel.callAsync(Channel.java:1030)
          	at hudson.remoting.RemoteInvocationHandler.invoke(RemoteInvocationHandler.java:285)
          	at com.sun.proxy.$Proxy3.notifyJarPresence(Unknown Source)
          	at hudson.remoting.FileSystemJarCache.lookInCache(FileSystemJarCache.java:80)
          	at hudson.remoting.JarCacheSupport.resolve(JarCacheSupport.java:49)
          	at hudson.remoting.ResourceImageInJar._resolveJarURL(ResourceImageInJar.java:93)
          	at hudson.remoting.ResourceImageInJar.resolve(ResourceImageInJar.java:45)
          	at hudson.remoting.RemoteClassLoader.loadRemoteClass(RemoteClassLoader.java:284)
          	at hudson.remoting.RemoteClassLoader.loadWithMultiClassLoader(RemoteClassLoader.java:264)
          	at hudson.remoting.RemoteClassLoader.findClass(RemoteClassLoader.java:223)
          	at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
          	at java.lang.ClassLoader.loadClass(ClassLoader.java:351)
          	at org.jenkinsci.plugins.gitclient.Git$GitAPIMasterToSlaveFileCallable.invoke(Git.java:173)
          	at org.jenkinsci.plugins.gitclient.Git$GitAPIMasterToSlaveFileCallable.invoke(Git.java:154)
          	at hudson.FilePath$FileCallableWrapper.call(FilePath.java:3317)
          	at hudson.remoting.UserRequest.perform(UserRequest.java:211)
          	at hudson.remoting.UserRequest.perform(UserRequest.java:54)
          	at hudson.remoting.Request$2.run(Request.java:376)
          	at hudson.remoting.InterceptingExecutorService.lambda$wrap$0(InterceptingExecutorService.java:78)
          	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
          	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
          	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
          	at hudson.remoting.Engine$1.lambda$newThread$0(Engine.java:121) 

          One important thing I forgot to mention: for us, this failure only occurs when the pipeline has a git checkout stage.


          Allan BURDAJEWICZ added a comment - - edited

          I did some testing recently to troubleshoot this problem.

          I was able to capture a thread dump / heap dump by catching the OOM in Remoting and generating dumps.

          The JVM heap usage is very low (< 10 MB), and thread dumps reveal that there are only ~50 threads when the issue happens. With a default stack size of 256KB, I doubt that has much of an effect. So I am not convinced this is due to those pools spawning too many threads.

          Note that ulimits are very high within the jnlp container:

          $ ulimit -a
          core file size          (blocks, -c) unlimited
          data seg size           (kbytes, -d) unlimited
          scheduling priority             (-e) 0
          file size               (blocks, -f) unlimited
          pending signals                 (-i) 128448
          max locked memory       (kbytes, -l) 16384
          max memory size         (kbytes, -m) unlimited
          open files                      (-n) 1048576
          pipe size            (512 bytes, -p) 8
          POSIX message queues     (bytes, -q) 819200
          real-time priority              (-r) 0
          stack size              (kbytes, -s) 8192
          cpu time               (seconds, -t) unlimited
          max user processes              (-u) unlimited
          virtual memory          (kbytes, -v) unlimited
          file locks                      (-x) unlimited
          $ cat /proc/sys/kernel/threads-max
          256897
          $ cat /sys/fs/cgroup/pids/pids.max
          max
          

          Now maybe I am looking at this wrong. Given such high limits, maybe the jnlp container of another pod that is not failing is consuming those limits and impacting other containers. That being said, the external systems that I have in place (GKE and Datadog) do not show a spike in PIDs or anything else explicit.

          So I thought that maybe this is an off-heap memory issue, or some isolation behavior that happens during Remoting class loading.

          Now I have enabled Native Memory Tracking (NMT) and things get interesting, though I am not too familiar with memory management at that level. What I see is that despite giving the jnlp container some limits - for example 500Mi - the reserved memory is higher than I expected:

          Total: reserved=1538742KB, committed=133862KB
          

          Most of it coming from the Metaspace:

          -                     Class (reserved=1070101KB, committed=22805KB)
                                      (classes #3545)
                                      (malloc=1045KB #4970) 
                                      (mmap: reserved=1069056KB, committed=21760KB)
          

          When I don't set a container limit, the reserved memory is quite a bit higher (for a k8s node with a capacity of 16G):

          Total: reserved=4600892KB, committed=310052KB
          

          I am not sure if this is part of the problem or not. That being said, that Class/Metaspace size seems to be a constant ~1GB. When raising the container limit to 2Gi or more, the reserved memory for Class/Metaspace is similar, and the total amount of reserved memory is lower than the container limit!
          I don't have enough knowledge in this area to know if that is related, but maybe someone here does. That area of off-heap memory can be controlled with -XX:MaxMetaspaceSize and -XX:CompressedClassSpaceSize. Maybe setting those helps mitigate the problem, though I would not know the impact or the right values, e.g. -XX:MaxMetaspaceSize=100m -XX:CompressedClassSpaceSize=100m.
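
          For reference, Native Memory Tracking can be enabled on the agent JVM and queried with jcmd, and the metaspace / compressed class space can be capped, along these lines (the flag values here are purely illustrative, not a recommendation):

          $ java -XX:NativeMemoryTracking=summary \
                 -XX:MaxMetaspaceSize=256m -XX:CompressedClassSpaceSize=128m \
                 -jar agent.jar ...
          $ jcmd <agent-pid> VM.native_memory summary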

          Still investigating...


          Vincent Latombe added a comment - - edited

          Default JVM ergonomics are (https://docs.oracle.com/en/java/javase/11/gctuning/ergonomics.html):

          - Maximum heap size of 1/4 of physical memory (accounts for container limits)
          - Metaspace size defaults to 1g (https://docs.oracle.com/en/java/javase/11/gctuning/other-considerations.html#GUID-B29C9153-3530-4C15-9154-E74F44E3DAD9)

          I would tend to believe that the container would get OOMKilled if it went over the defined container limit, which isn't the case here.
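
          For reference, the effective values of these defaults for a particular JDK build (and container memory limit) can be inspected with, for example:

          $ java -XX:+PrintFlagsFinal -version | grep -E 'MaxHeapSize|MaxMetaspaceSize|CompressedClassSpaceSize'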

          Donald Gobin added a comment -

          Yes, I've already tried tweaking the pod spec to set the memory limit, stack size (reducing it), more cpu, and none of these worked - still got the OOM eventually. ulimits are fine, so it's not that either. Given the "random nature" of the error, it looks like some race condition where the code just spins out of control. Also note the related PR by Vincent here - https://github.com/jenkinsci/remoting/pull/505 - which I've been putting through my stress test setup.

           


          rhinoceros.xn added a comment - - edited

          Hello dg424, did this PR solve the problem? https://github.com/jenkinsci/remoting/pull/505#issuecomment-1046913986

          Donald Gobin added a comment -

          Hi rhinoceros,

          No, it did not. Still looking for a solution...


          Vincent Latombe added a comment -

          dg424 would you be able to share the Jenkinsfile you use to reproduce the problem? Or is it too specific?

          Donald Gobin added a comment -

          I can share the same stress test job layout and you can put your own settings to reproduce.

          pipeline {
              agent {
                  kubernetes {
                      inheritFrom 'k8s-default'
                      containerTemplate {
                          name 'mycontainer'
                          image "someimage:latest"
                          privileged false
                          alwaysPullImage false
                          workingDir '/home/jenkins'
                          ttyEnabled true
                          command 'cat'
                          args ''
                      }
                      defaultContainer 'mycontainer'
                  }
              }
              // run until we get 65873 error
              triggers {
                  cron "* * * * *"
              }    
              options {
                  disableConcurrentBuilds()
              }     
              stages {
                  stage('Checkout SCM') {
                      steps {
                          checkout([$class: 'GitSCM', branches: [[name: "FETCH_HEAD"]], doGenerateSubmoduleConfigurations: false, extensions: [[$class: 'SubmoduleOption', disableSubmodules: false, parentCredentials: true, recursiveSubmodules: true, reference: '', trackingSubmodules: false]], submoduleCfg: [], userRemoteConfigs: [[credentialsId: 'mygit-cred', url: 'ssh://git@mycompany.net/test.git']]])
                      }
                  }
                  stage('First stage') {
                      steps {
                          script {
                              echo "Inside first stage"
                          }
                      }
                  }
              }
              post {
                  failure {
                      // we ALWAYS get here eventually as a result of 65873 issue
                      echo "Failure!"
                      script {
                          // we got the error, disable job and send email
                          Jenkins.instance.getItemByFullName(env.JOB_NAME).doDisable()
                          emailext body: "${BUILD_URL}",
                              subject: "[Jenkins]: ${JOB_NAME} build failed",
                              to: 'foo@bar.com'
                      }
                  }
              }
          } 

           


          Vincent Latombe added a comment -

          dg424 Do you have any limit defined at infrastructure level that would apply cpu/memory limits to the pod and containers?

          Donald Gobin added a comment -

          Yeah, we do, but I'm not sure this is causing the issue, as the pipeline runs for a very long time before getting the error (i.e. using the same resources that k8s assigns to it on each run). Also, as you can see, the pipeline really doesn't do much: it checks out an empty git project, so in terms of resources it uses very little directly. But the key stage for reproduction is the one that does the checkout; without it, the problem is not reproducible. I think this is why the other comment on GitHub suggested a git downgrade before this ticket was opened.


          Vincent Latombe added a comment -

          dg424 Yes, I think something in the class loading that is done in preparation for the git checkout triggers the problem. But it could be more apparent in low-memory environments, so ideally I'd like to set up a reproducer that is as close as possible to something that can trigger the problem.

          Donald Gobin added a comment -

          Since that test pipeline uses almost no resources, you should be able to quickly set up one directly on the Jenkins master to see if the problem exists in the area you suspect, as I don't think it matters in that case whether you're using a k8s agent or running directly on the master? If there is some kind of leak in the class loading area, it should eventually hit the problem?


          Donald Gobin added a comment -

          I created a related ticket here - https://issues.jenkins.io/browse/JENKINS-68199 - for the git client side since vlatombe is thinking that the Jenkins side might be ok.


          Vincent Latombe added a comment - - edited

          On the agent side on jdk11, the following is printed when the problem occurs:

          [14.901s][warning][os,thread] Failed to start thread - pthread_create failed (EAGAIN) for attributes: stacksize: 1024k, guardsize: 0k, detached. 

          This led me to https://stackoverflow.com/a/47082934, then this kernel bug from 2016, still not fixed.

          I also found an interesting issue on the JDK issue tracker that led to https://bugs.openjdk.java.net/browse/JDK-8268773, and a fix in jdk18 that does retries, so it could possibly work around the issue.

          Anyone care to attempt to backport this to, say, jdk11?
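
          For illustration, the retry idea looks roughly like the following at the Java level (the actual JDK change retries inside the VM's native thread-creation path; this sketch, with class and method names of my own choosing, only shows the same retry-on-transient-EAGAIN concept for code that creates its own threads):

          import java.util.concurrent.ThreadFactory;

          // Hypothetical, application-level sketch: create a fresh Thread per attempt and
          // retry if the JVM reports "unable to create new native thread", which on Linux
          // can be a transient EAGAIN from pthread_create rather than real memory exhaustion.
          final class RetryingThreadStarter {
              static Thread startWithRetry(ThreadFactory factory, Runnable task,
                                           int maxAttempts, long backoffMillis) throws InterruptedException {
                  for (int attempt = 1; ; attempt++) {
                      Thread t = factory.newThread(task);
                      try {
                          t.start();
                          return t;
                      } catch (OutOfMemoryError e) {
                          if (attempt >= maxAttempts) {
                              throw e; // give up; this really does look like resource exhaustion
                          }
                          Thread.sleep(backoffMillis * attempt); // simple linear backoff before retrying
                      }
                  }
              }

              public static void main(String[] args) throws InterruptedException {
                  Thread worker = startWithRetry(Thread::new, () -> System.out.println("worker running"), 5, 100);
                  worker.join();
              }
          }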


          Donald Gobin added a comment -

          So, this is saying that this "can" happen at any time in "any" Java program that uses multiple threads? The fact that we only see this when our pipeline contains a git checkout stage is purely coincidental?

          I would request the backport, but I don't have an openjdk account.

           


          Vincent Latombe added a comment -

          basil maybe?

          Basil Crow added a comment -

          Sorry Vincent, I have not read this thread. What is the reason you are pinging me specifically?


          Vincent Latombe added a comment -

          basil I think you submitted a backport to the JDK lately. I think the issue described here could already be fixed on the jdk-18 and jdk-19 trees (this comment sums up my findings). Would you be able to build a backport on the jdk-11 branch so that I could test it?

          Basil Crow added a comment - - edited

          I fail to follow the reasoning in this comment. I do not see why you would need me to build a backport - just cherry-pick the change onto https://github.com/openjdk/jdk17u-dev or https://github.com/openjdk/jdk11u-dev and see if it fixes your problem. If you have a clean backport patch along with steps to reproduce the problem that fail before the patch and succeed after the patch, along with clear written reasoning about why the patch fixes the original problem, and are simply looking for someone who has signed the Oracle Contributor Agreement (OCA) to propose it, I might be willing to help, but for now I am unwatching this issue so that I do not receive further notifications.


          Tim Jacomb added a comment -

          vlatombe why not just try on Java 18?


          Vincent Latombe added a comment -

          I could have, however since Jenkins is only starting to support Java 17 in preview, I don't know what kind of surprises could arise from trying out a version that is not battle-tested in our context. In any case, I have built a jdk11 with the referenced commit cherry-picked and am currently running my reproduction harness to check whether the problem is gone.

          Vincent Latombe added a comment -

          Cherry-picking https://github.com/openjdk/jdk/commit/e35005d5ce383ddd108096a3079b17cb0bcf76f1 on jdk11 and running the harness overnight shows a very significant reduction in the number of occurrences of the problem (0.04% instead of 0.2% over 5000-6000 builds).

          Basil Crow added a comment -

          jenkinsci/remoting#523 has been released in Remoting 4.14 and Jenkins 2.348. This helps alleviate the problem to some degree, but it does not eliminate it.

          Backporting JDK-8268773 / openjdk/jdk@e35005d5ce3 showed a significant reduction of the number of occurrences of the problem (0.04% instead of 0.2% over 5000-6000 builds). JDK-8268773 / openjdk/jdk@e35005d5ce3 has been backported to jdk11u-dev in JDK-8286753 / openjdk/jdk11u-dev#1074 and to jdk17u-dev in JDK-8286629 / openjdk/jdk17u-dev#390.


          Vincent Latombe added a comment -

          Thank you basil!

          rhinoceros.xn added a comment - - edited

          After adding sleep(10) before the git checkout, this problem no longer occurs.

          Maybe sleep(10) before the git or checkout step is a workaround.

          wasimj  dg424 

           

          sleep(10)
          checkout changelog: false, poll: false, scm: ........
          
          OR
          
          sleep(10)
          git branch: 'master', credentialsId: '******', url: 'git@git.yourcomampy.com:xx/zz.git'


          Matthew Gomes added a comment -

          rhinoceros Where did you set the git sleep? Is this the git plugin?


          rhinoceros.xn added a comment -

          matthewrgomes 

           

          before git OR checkout step:

          // code placeholder
          node(label) {
              stage('xxzzxasd') {
                  container('xxxx') {
                      stage('git clone') {
                          sleep(10) // ** HERE: adding sleep(10) before git checkout **
                          git branch: 'master', credentialsId: '', url: 'git@xxx.com:xxx/xxxxxxx.git'
                      }
                  }
              }
          }


          Donald Gobin added a comment -

          rhinoceros So if this is true, then it indicates the issue is some kind of race condition within the Jenkins pipeline flow. Your sleep essentially looks like it waits for all Jenkins threads for that pipeline to complete before continuing?


          rhinoceros.xn added a comment -

          dg424 yes. I think so.

           

          I updated agent.jar two weeks ago with https://repo.jenkins-ci.org/incrementals/org/jenkins-ci/main/remoting/4.14-rc3000.5949ea_7370a_f/remoting-4.14-rc3000.5949ea_7370a_f.jar and ran it for hours; the problem had not diminished.

           

          I found that the problem only occurs at git checkout, so I tried adding a sleep before the git checkout.


          Matthew Gomes added a comment -

          The issue is only reproducible on agent/ECS jobs that use the Git plugin: https://plugins.jenkins.io/git/

           


          Kevin added a comment - - edited

          I upgraded to 2.363-jdk11 with agents on 4.10-3-jdk11 and am still getting the same error sporadically:

          Also: hudson.remoting.Channel$CallSiteStackTrace: Remote call to JNLP4-connect connection from 10.32.11.76/10.32.11.76:34066
          at hudson.remoting.Channel.attachCallSiteStackTrace(Channel.java:1784)
          at hudson.remoting.UserRequest$ExceptionResponse.retrieve(UserRequest.java:356)
          at hudson.remoting.Channel.call(Channel.java:1000)
          at hudson.FilePath.act(FilePath.java:1186)
          at hudson.FilePath.act(FilePath.java:1175)
          at org.jenkinsci.plugins.gitclient.Git.getClient(Git.java:140)
          at hudson.plugins.git.GitSCM.createClient(GitSCM.java:916)
          at hudson.plugins.git.GitSCM.createClient(GitSCM.java:847)
          at hudson.plugins.git.GitSCM.checkout(GitSCM.java:1297)
          at org.jenkinsci.plugins.workflow.steps.scm.SCMStep.checkout(SCMStep.java:129)
          at org.jenkinsci.plugins.workflow.steps.scm.SCMStep$StepExecutionImpl.run(SCMStep.java:97)
          at org.jenkinsci.plugins.workflow.steps.scm.SCMStep$StepExecutionImpl.run(SCMStep.java:84)
          at org.jenkinsci.plugins.workflow.steps.SynchronousNonBlockingStepExecution.lambda$start$0(SynchronousNonBlockingStepExecution.java:47)
          at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
          at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
          at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
          at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
          at java.base/java.lang.Thread.run(Thread.java:829)
          java.lang.OutOfMemoryError: unable to create native thread: possibly out of memory or process/resource limits reached
          at java.base/java.lang.Thread.start0(Native Method)
          at java.base/java.lang.Thread.start(Unknown Source)
          at hudson.remoting.AtmostOneThreadExecutor.execute(AtmostOneThreadExecutor.java:104)
          at java.base/java.util.concurrent.AbstractExecutorService.submit(Unknown Source)
          at org.jenkinsci.remoting.util.ExecutorServiceUtils.submitAsync(ExecutorServiceUtils.java:58)
          at hudson.remoting.JarCacheSupport.resolve(JarCacheSupport.java:66)
          at hudson.remoting.ResourceImageInJar._resolveJarURL(ResourceImageInJar.java:93)
          at hudson.remoting.ResourceImageInJar.resolve(ResourceImageInJar.java:45)
          at hudson.remoting.RemoteClassLoader.loadRemoteClass(RemoteClassLoader.java:284)
          at hudson.remoting.RemoteClassLoader.loadWithMultiClassLoader(RemoteClassLoader.java:264)
          at hudson.remoting.RemoteClassLoader.findClass(RemoteClassLoader.java:223)
          at java.base/java.lang.ClassLoader.loadClass(Unknown Source)
          at java.base/java.lang.ClassLoader.loadClass(Unknown Source)
          at jenkins.util.Timer.get(Timer.java:47)
          at org.jenkinsci.plugins.workflow.log.DelayBufferedOutputStream.reschedule(DelayBufferedOutputStream.java:74)
          at org.jenkinsci.plugins.workflow.log.DelayBufferedOutputStream.<init>(DelayBufferedOutputStream.java:70)
          at org.jenkinsci.plugins.workflow.log.BufferedBuildListener$Replacement.readResolve(BufferedBuildListener.java:79)
          at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
          at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
          at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
          at java.base/java.lang.reflect.Method.invoke(Unknown Source)
          at java.base/java.io.ObjectStreamClass.invokeReadResolve(Unknown Source)
          at java.base/java.io.ObjectInputStream.readOrdinaryObject(Unknown Source)
          at java.base/java.io.ObjectInputStream.readObject0(Unknown Source)
          at java.base/java.io.ObjectInputStream.defaultReadFields(Unknown Source)
          at java.base/java.io.ObjectInputStream.readSerialData(Unknown Source)
          at java.base/java.io.ObjectInputStream.readOrdinaryObject(Unknown Source)
          at java.base/java.io.ObjectInputStream.readObject0(Unknown Source)
          at java.base/java.io.ObjectInputStream.defaultReadFields(Unknown Source)
          at java.base/java.io.ObjectInputStream.readSerialData(Unknown Source)
          at java.base/java.io.ObjectInputStream.readOrdinaryObject(Unknown Source)
          at java.base/java.io.ObjectInputStream.readObject0(Unknown Source)
          at java.base/java.io.ObjectInputStream.defaultReadFields(Unknown Source)
          at java.base/java.io.ObjectInputStream.readSerialData(Unknown Source)
          at java.base/java.io.ObjectInputStream.readOrdinaryObject(Unknown Source)
          at java.base/java.io.ObjectInputStream.readObject0(Unknown Source)
          at java.base/java.io.ObjectInputStream.defaultReadFields(Unknown Source)
          at java.base/java.io.ObjectInputStream.readSerialData(Unknown Source)
          at java.base/java.io.ObjectInputStream.readOrdinaryObject(Unknown Source)
          at java.base/java.io.ObjectInputStream.readObject0(Unknown Source)
          at java.base/java.io.ObjectInputStream.readObject(Unknown Source)
          at java.base/java.io.ObjectInputStream.readObject(Unknown Source)
          at hudson.remoting.UserRequest.deserialize(UserRequest.java:289)
          at hudson.remoting.UserRequest.perform(UserRequest.java:189)
          at hudson.remoting.UserRequest.perform(UserRequest.java:54)
          at hudson.remoting.Request$2.run(Request.java:376)
          at hudson.remoting.InterceptingExecutorService.lambda$wrap$0(InterceptingExecutorService.java:78)
          at java.base/java.util.concurrent.FutureTask.run(Unknown Source)
          at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
          at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
          at hudson.remoting.Engine$1.lambda$newThread$0(Engine.java:122)
          at java.base/java.lang.Thread.run(Unknown Source)
          Caused: java.io.IOException: Remote call on JNLP4-connect connection from 10.32.11.76/10.32.11.76:34066 failed
          at hudson.remoting.Channel.call(Channel.java:1004)
          at hudson.FilePath.act(FilePath.java:1186)
          at hudson.FilePath.act(FilePath.java:1175)
          at org.jenkinsci.plugins.gitclient.Git.getClient(Git.java:140)
          at hudson.plugins.git.GitSCM.createClient(GitSCM.java:916)
          at hudson.plugins.git.GitSCM.createClient(GitSCM.java:847)
          at hudson.plugins.git.GitSCM.checkout(GitSCM.java:1297)
          at org.jenkinsci.plugins.workflow.steps.scm.SCMStep.checkout(SCMStep.java:129)
          at org.jenkinsci.plugins.workflow.steps.scm.SCMStep$StepExecutionImpl.run(SCMStep.java:97)
          at org.jenkinsci.plugins.workflow.steps.scm.SCMStep$StepExecutionImpl.run(SCMStep.java:84)
          at org.jenkinsci.plugins.workflow.steps.SynchronousNonBlockingStepExecution.lambda$start$0(SynchronousNonBlockingStepExecution.java:47)
          at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
          at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
          at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
          at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
          at java.base/java.lang.Thread.run(Thread.java:829)
          Finished: FAILURE
          


          Basil Crow added a comment -

          I upgraded to […] 4.10-3-jdk11

          kryan90 You aren't running with the fix. The fix is in Remoting 4.13.3 (LTS line) and Remoting 3044.vb_940a_a_e4f72e (weekly line). Java 11.0.16 is also recommended.


          Donald Gobin added a comment -

          basil Definitely interested in this fix. We run the Docker containers for the controller and agent side; can you share which tags of these include this fix? For instance, I don't see a 4.13 release for the agent side here - https://hub.docker.com/r/jenkins/inbound-agent/tags?page=1&name=jdk11. Is an update required on both the controller and agent side? Kevin tried above, assuming the fix is on the controller side only, and used the latest available tagged image for the agent (4.10). So, just need some clarity on which Docker tags we need to use here. Thanks.


          Basil Crow added a comment -

          I don't know anything about how the Docker images for agents are built or what versions of Remoting they include in them.


          Donald Gobin added a comment - - edited

          basil Another question - do both the controller and agent have to have this fix/version? Also, which component do I raise a ticket for to address the inbound-agent Docker image side of this?


          Basil Crow added a comment -

          The fix is in remoting.jar, which is the main JAR for running the agent process. That said, that JAR gets shipped over from the controller in some scenarios (e.g. SSH Build Agents, where the controller uses SSH to connect to the agent, ships over its copy of remoting.jar, and then starts the agent process), so the controller version is relevant in that it bundles the remoting.jar used in certain agent scenarios. However, I think your Docker image use case bundles its own copy of remoting.jar, completely separate from the controller's version. I believe the maintainers of the Docker images use GitHub issues on the corresponding GitHub repositories to track issues.

