Jenkins / JENKINS-65873

java.lang.OutOfMemoryError: unable to create new native thread


Details

    Description

      We regularly see issues with the jenkins/inbound-agent in our Jenkins logs on Kubernetes. It seems to occur in around 1% of all jobs.

      The error message is below.

      Whilst the error message refers to java.lang.OutOfMemoryError: unable to create new native thread, we have checked the pods and nodes in the cluster, and there are always sufficient memory and threads available at the time of the error.

      The specific versions for this error message are:

      jenkins/inbound-agent:4.3-4

      Jenkins 2.263.4

      However we have also seen this error occur with different versions of both the inbound-agent and Jenkins.
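      The limit checks described above can be scripted; the following is a minimal sketch for a Linux agent container ("unable to create new native thread" usually indicates a thread/PID limit, not heap exhaustion). The cgroup path is an assumption and differs between cgroup v1 and v2.

      ```shell
      # Sketch: inspect the limits relevant to native thread creation.
      echo "kernel threads-max : $(cat /proc/sys/kernel/threads-max)"
      echo "ulimit -u          : $(ulimit -u)"
      # cgroup v1 PID limit, if present (assumed path; cgroup v2 uses /sys/fs/cgroup/pids.max)
      cat /sys/fs/cgroup/pids/pids.max 2>/dev/null || true
      # threads currently in use across processes visible in this namespace
      echo "threads in use    : $(ls -d /proc/[0-9]*/task/* 2>/dev/null | wc -l)"
      ```

      Comparing "threads in use" against the smallest of the three limits at the time of failure would show whether a limit is actually being hit.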

      Also: hudson.remoting.Channel$CallSiteStackTrace: Remote call to JNLP4-connect connection from ip-100-64-244-120.eu-west-1.compute.internal/100.64.244.120:39138
      	at hudson.remoting.Channel.attachCallSiteStackTrace(Channel.java:1800)
      	at hudson.remoting.UserRequest$ExceptionResponse.retrieve(UserRequest.java:357)
      	at hudson.remoting.Channel.call(Channel.java:1001)
      	at hudson.FilePath.act(FilePath.java:1157)
      	at hudson.FilePath.act(FilePath.java:1146)
      	at org.jenkinsci.plugins.gitclient.Git.getClient(Git.java:121)
      	at hudson.plugins.git.GitSCM.createClient(GitSCM.java:904)
      	at hudson.plugins.git.GitSCM.createClient(GitSCM.java:835)
      	at hudson.plugins.git.GitSCM.checkout(GitSCM.java:1288)
      	at org.jenkinsci.plugins.workflow.steps.scm.SCMStep.checkout(SCMStep.java:125)
      	at org.jenkinsci.plugins.workflow.steps.scm.SCMStep$StepExecutionImpl.run(SCMStep.java:93)
      	at org.jenkinsci.plugins.workflow.steps.scm.SCMStep$StepExecutionImpl.run(SCMStep.java:80)
      	at org.jenkinsci.plugins.workflow.steps.SynchronousNonBlockingStepExecution.lambda$start$0(SynchronousNonBlockingStepExecution.java:47)
      	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
      	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
      	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
      	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
      java.lang.OutOfMemoryError: unable to create new native thread
      	at java.lang.Thread.start0(Native Method)
      	at java.lang.Thread.start(Thread.java:717)
      	at java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:957)
      	at java.util.concurrent.ThreadPoolExecutor.ensurePrestart(ThreadPoolExecutor.java:1603)
      	at java.util.concurrent.ScheduledThreadPoolExecutor.delayedExecute(ScheduledThreadPoolExecutor.java:334)
      	at java.util.concurrent.ScheduledThreadPoolExecutor.schedule(ScheduledThreadPoolExecutor.java:533)
      	at jenkins.util.InterceptingScheduledExecutorService.schedule(InterceptingScheduledExecutorService.java:49)
      	at org.jenkinsci.plugins.workflow.log.DelayBufferedOutputStream.reschedule(DelayBufferedOutputStream.java:72)
      	at org.jenkinsci.plugins.workflow.log.DelayBufferedOutputStream.<init>(DelayBufferedOutputStream.java:68)
      	at org.jenkinsci.plugins.workflow.log.BufferedBuildListener$Replacement.readResolve(BufferedBuildListener.java:77)
      	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
      	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
      	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
      	at java.lang.reflect.Method.invoke(Method.java:498)
      	at java.io.ObjectStreamClass.invokeReadResolve(ObjectStreamClass.java:1260)
      	at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2133)
      	at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1625)
      	at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2342)
      	at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2266)
      	at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2124)
      	at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1625)
      	at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2342)
      	at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2266)
      	at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2124)
      	at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1625)
      	at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2342)
      	at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2266)
      	at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2124)
      	at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1625)
      	at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2342)
      	at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2266)
      	at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2124)
      	at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1625)
      	at java.io.ObjectInputStream.readObject(ObjectInputStream.java:465)
      	at java.io.ObjectInputStream.readObject(ObjectInputStream.java:423)
      	at hudson.remoting.UserRequest.deserialize(UserRequest.java:290)
      	at hudson.remoting.UserRequest.perform(UserRequest.java:189)
      	at hudson.remoting.UserRequest.perform(UserRequest.java:54)
      	at hudson.remoting.Request$2.run(Request.java:369)
      	at hudson.remoting.InterceptingExecutorService$1.call(InterceptingExecutorService.java:72)
      	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
      	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
      	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
      	at hudson.remoting.Engine$1.lambda$newThread$0(Engine.java:117)
      Caused: java.io.IOException: Remote call on JNLP4-connect connection from ip-100-64-244-120.eu-west-1.compute.internal/100.64.244.120:39138 failed
      	at hudson.remoting.Channel.call(Channel.java:1007)
      	at hudson.FilePath.act(FilePath.java:1157)
      	at hudson.FilePath.act(FilePath.java:1146)
      	at org.jenkinsci.plugins.gitclient.Git.getClient(Git.java:121)
      	at hudson.plugins.git.GitSCM.createClient(GitSCM.java:904)
      	at hudson.plugins.git.GitSCM.createClient(GitSCM.java:835)
      	at hudson.plugins.git.GitSCM.checkout(GitSCM.java:1288)
      	at org.jenkinsci.plugins.workflow.steps.scm.SCMStep.checkout(SCMStep.java:125)
      	at org.jenkinsci.plugins.workflow.steps.scm.SCMStep$StepExecutionImpl.run(SCMStep.java:93)
      	at org.jenkinsci.plugins.workflow.steps.scm.SCMStep$StepExecutionImpl.run(SCMStep.java:80)
      	at org.jenkinsci.plugins.workflow.steps.SynchronousNonBlockingStepExecution.lambda$start$0(SynchronousNonBlockingStepExecution.java:47)
      	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
      	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
      	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
      	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
      	at java.lang.Thread.run(Thread.java:748)
      

      Attachments

        Issue Links

          Activity

            basil Basil Crow added a comment -

            FWIW I've seen the same error sporadically since 2019, but with Swarm on the client side. It seems to occur once every few months in my company, so less than 1% of all builds (we do thousands of builds a month). In case it's helpful I've attached the stack trace from our internal bug tracker. Needless to say this is not reproducible. I do wonder if setting -Xmx or -Xms might help but I've never tried. I also wonder if this might be helped by JENKINS-61103.

            ERROR: Execution failed
            Also:   hudson.remoting.Channel$CallSiteStackTrace: Remote call to JNLP4-connect connection from 1.2.3.4/1.2.3.4:34136
            		at hudson.remoting.Channel.attachCallSiteStackTrace(Channel.java:1743)
            		at hudson.remoting.UserRequest$ExceptionResponse.retrieve(UserRequest.java:357)
            		at hudson.remoting.Channel.call(Channel.java:957)
            		at hudson.FilePath.act(FilePath.java:1072)
            		at hudson.FilePath.act(FilePath.java:1061)
            		at org.jenkinsci.plugins.gitclient.Git.getClient(Git.java:144)
            		at hudson.plugins.git.GitSCM.createClient(GitSCM.java:822)
            		at hudson.plugins.git.GitSCM.createClient(GitSCM.java:813)
            		at hudson.plugins.git.GitSCM.checkout(GitSCM.java:1186)
            		at org.jenkinsci.plugins.workflow.steps.scm.SCMStep.checkout(SCMStep.java:124)
            		at org.jenkinsci.plugins.workflow.steps.scm.SCMStep$StepExecutionImpl.run(SCMStep.java:93)
            		at org.jenkinsci.plugins.workflow.steps.scm.SCMStep$StepExecutionImpl.run(SCMStep.java:80)
            		at org.jenkinsci.plugins.workflow.steps.SynchronousNonBlockingStepExecution.lambda$start$0(SynchronousNonBlockingStepExecution.java:47)
            		at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
            		at java.util.concurrent.FutureTask.run(FutureTask.java:266)
            		at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
            		at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
            java.lang.OutOfMemoryError: unable to create new native thread
            	at java.lang.Thread.start0(Native Method)
            	at java.lang.Thread.start(Thread.java:717)
            	at java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:957)
            	at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1378)
            	at java.util.concurrent.AbstractExecutorService.submit(AbstractExecutorService.java:134)
            	at hudson.remoting.DelegatingExecutorService.submit(DelegatingExecutorService.java:42)
            	at hudson.remoting.InterceptingExecutorService.submit(InterceptingExecutorService.java:46)
            	at hudson.remoting.InterceptingExecutorService.submit(InterceptingExecutorService.java:41)
            	at org.jenkinsci.remoting.util.AnonymousClassWarnings.check(AnonymousClassWarnings.java:65)
            	at org.jenkinsci.remoting.util.AnonymousClassWarnings$1.annotateClass(AnonymousClassWarnings.java:121)
            	at java.io.ObjectOutputStream.writeNonProxyDesc(ObjectOutputStream.java:1290)
            	at java.io.ObjectOutputStream.writeClassDesc(ObjectOutputStream.java:1231)
            	at java.io.ObjectOutputStream.writeNonProxyDesc(ObjectOutputStream.java:1294)
            	at java.io.ObjectOutputStream.writeClassDesc(ObjectOutputStream.java:1231)
            	at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1427)
            	at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
            	at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
            	at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
            	at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
            	at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
            	at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348)
            	at hudson.remoting.Command.writeTo(Command.java:109)
            	at hudson.remoting.AbstractByteBufferCommandTransport.write(AbstractByteBufferCommandTransport.java:287)
            	at hudson.remoting.Channel.send(Channel.java:723)
            	at hudson.remoting.Request.callAsync(Request.java:238)
            	at hudson.remoting.Channel.callAsync(Channel.java:987)
            	at hudson.remoting.RemoteInvocationHandler.invoke(RemoteInvocationHandler.java:282)
            	at com.sun.proxy.$Proxy3.notifyJarPresence(Unknown Source)
            	at hudson.remoting.FileSystemJarCache.lookInCache(FileSystemJarCache.java:80)
            	at hudson.remoting.JarCacheSupport.resolve(JarCacheSupport.java:46)
            	at hudson.remoting.ResourceImageInJar._resolveJarURL(ResourceImageInJar.java:90)
            	at hudson.remoting.ResourceImageInJar.resolve(ResourceImageInJar.java:43)
            	at hudson.remoting.RemoteClassLoader.findClass(RemoteClassLoader.java:304)
            	at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
            	at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
            	at java.lang.Class.getDeclaredFields0(Native Method)
            	at java.lang.Class.privateGetDeclaredFields(Class.java:2583)
            	at java.lang.Class.getDeclaredField(Class.java:2068)
            	at java.io.ObjectStreamClass.getDeclaredSUID(ObjectStreamClass.java:1857)
            	at java.io.ObjectStreamClass.access$700(ObjectStreamClass.java:79)
            	at java.io.ObjectStreamClass$3.run(ObjectStreamClass.java:506)
            	at java.io.ObjectStreamClass$3.run(ObjectStreamClass.java:494)
            	at java.security.AccessController.doPrivileged(Native Method)
            	at java.io.ObjectStreamClass.<init>(ObjectStreamClass.java:494)
            	at java.io.ObjectStreamClass.lookup(ObjectStreamClass.java:391)
            	at java.io.ObjectStreamClass.initNonProxy(ObjectStreamClass.java:681)
            	at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1885)
            	at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1751)
            	at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2042)
            	at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1573)
            	at java.io.ObjectInputStream.readObject(ObjectInputStream.java:431)
            	at java.util.ArrayList.readObject(ArrayList.java:797)
            	at sun.reflect.GeneratedMethodAccessor2.invoke(Unknown Source)
            	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
            	at java.lang.reflect.Method.invoke(Method.java:498)
            	at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1170)
            	at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2178)
            	at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2069)
            	at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1573)
            	at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2287)
            	at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2211)
            	at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2069)
            	at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1573)
            	at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2287)
            	at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2211)
            	at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2069)
            	at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1573)
            	at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2287)
            	at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2211)
            	at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2069)
            	at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1573)
            	at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2287)
            	at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2211)
            	at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2069)
            	at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1573)
            	at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2287)
            	at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2211)
            	at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2069)
            	at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1573)
            	at java.io.ObjectInputStream.readObject(ObjectInputStream.java:431)
            	at hudson.remoting.UserRequest.deserialize(UserRequest.java:291)
            	at hudson.remoting.UserRequest.perform(UserRequest.java:190)
            	at hudson.remoting.UserRequest.perform(UserRequest.java:54)
            	at hudson.remoting.Request$2.run(Request.java:369)
            	at hudson.remoting.InterceptingExecutorService$1.call(InterceptingExecutorService.java:72)
            	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
            	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
            	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
            	at hudson.remoting.Engine$1.lambda$newThread$0(Engine.java:93)
            Caused: java.io.IOException: Remote call on JNLP4-connect connection from 1.2.3.4/1.2.3.4:34136 failed
            	at hudson.remoting.Channel.call(Channel.java:963)
            	at hudson.FilePath.act(FilePath.java:1072)
            	at hudson.FilePath.act(FilePath.java:1061)
            	at org.jenkinsci.plugins.gitclient.Git.getClient(Git.java:144)
            	at hudson.plugins.git.GitSCM.createClient(GitSCM.java:822)
            	at hudson.plugins.git.GitSCM.createClient(GitSCM.java:813)
            	at hudson.plugins.git.GitSCM.checkout(GitSCM.java:1186)
            	at org.jenkinsci.plugins.workflow.steps.scm.SCMStep.checkout(SCMStep.java:124)
            	at org.jenkinsci.plugins.workflow.steps.scm.SCMStep$StepExecutionImpl.run(SCMStep.java:93)
            	at org.jenkinsci.plugins.workflow.steps.scm.SCMStep$StepExecutionImpl.run(SCMStep.java:80)
            	at org.jenkinsci.plugins.workflow.steps.SynchronousNonBlockingStepExecution.lambda$start$0(SynchronousNonBlockingStepExecution.java:47)
            	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
            	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
            	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
            	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
            	at java.lang.Thread.run(Thread.java:748)
            
            kryan90 Kevin added a comment -

            We have been seeing this as well, at about the same rate. It is extremely difficult to test fixes since only a very small portion of jobs fail. It appears to always happen during the checkout scm stage.

            Here is generally how our pod templates / jobs are built using a shared library

            public void myTemplate(body) {
    def name = "my-node"
    def uuid = UUID.randomUUID().toString()  // uuid was undefined in the snippet as posted
    def label = "${name}-${uuid}"
                podTemplate(
                    cloud: 'my-cloud',
                    label: label,
                    name: name,
                    containers: [
                        containerTemplate(
                            name: name,
                            image: "myimage",
                            command: 'cat',
                            ttyEnabled: true,
                            workingDir: '/home/jenkins/agent',
                            alwaysPullImage: false,
                            resourceRequestCpu: '500m',
                            resourceRequestMemory: '512Mi',
                        )
                    ],
                    workspaceVolume: emptyDirWorkspaceVolume(false),
                ) {
                    node(label) {
                        container(name) {
                            body.call()
                        }
                    }
                }
            }
            

            and calling it like this:

            import org.myorg.PodTemplates
            
            agentTemplates = new PodTemplates()
            agentTemplates.myTemplate{
                checkout scm
                stage('Do some stuff') {
                  sh 'echo test'
                }
            }
            

            In this case the checkout scm command is being run in the context of the container block. One thing I plan on trying is to move that stage to run at the node level vs the container level.
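            For reference, the rearrangement described above might look like this (a sketch only: `checkout scm` moves outside the `container` block so it runs in the default jnlp container, while the user steps still run in the build container):

            ```groovy
            node(label) {
                // checkout at the node level: executes in the default jnlp container
                checkout scm
                container(name) {
                    // remaining user steps still run in the build container
                    body.call()
                }
            }
            ```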

            jthompson Jeff Thompson added a comment -

            I have no ideas or suggestions to help anyone out here. We would need more reproducibility or troubleshooting info to do anything.

            I will note that sometimes things like this are because of the behavior of some plugin, or interactions between multiple ones.

            simepo Simon added a comment -

            Updated Jenkins to LTS 2.289.1, which includes:

            Upgrade from Remoting 4.6 to Remoting 4.7 with bugfixes and dependency updates. (pull 5292, Remoting 4.7 changelog)

            We have also updated the version of the inbound-agent image to 4.7, and are no longer seeing this issue.

             

            johnny25 John added a comment -

            Simon - Good to know that the latest version resolves the issue.
            I have been seeing a similar issue for a long time. Can you please share the following parameters?

            • OS ulimit (any other tuning parameters to consider)
            • Jenkins JVM parameters
            • Jenkins slave (any new changes)
            simepo Simon added a comment -
            • OS ulimit - we are running the Docker image from Docker Hub in AWS EKS service, and have not made any changes from the defaults provided by those services.
            • JVM parameters - -XX:MaxRAMPercentage=50.0
            • The only changes to the slave in K8s has been the adoption of the new inbound-agent version 4.7-1
            wasimj Wasim added a comment -

            This error was seen again today with 
            inbound-agent:4.7-1

            kryan90 Kevin added a comment -

            I have tested up through inbound-agent:4.9-1 w/ controller versions 2.289.2 and 2.304 – the issue is still happening at the same rate.

            I created a job on a dev controller deployment that only checks out scm. I have it set to run once per minute and usually get ~1 failure per hour.

            cpholt Christopher added a comment - - edited

            I have this problem frequently.  It is very frustrating.  I have attached a jstack thread capture of the jenkins agent that shows thousands of threads similar to this:

             

            "pool-1-thread-11368" #11392 daemon prio=5 os_prio=0 tid=0x6408f800 nid=0x271c waiting on condition [0x7d9bf000]
               java.lang.Thread.State: TIMED_WAITING (parking)
                at sun.misc.Unsafe.park(Native Method)
                - parking to wait for <0x14d9c070> (a java.util.concurrent.SynchronousQueue$TransferStack)
                at java.util.concurrent.locks.LockSupport.parkNanos(Unknown Source)
                at java.util.concurrent.SynchronousQueue$TransferStack.awaitFulfill(Unknown Source)
                at java.util.concurrent.SynchronousQueue$TransferStack.transfer(Unknown Source)
                at java.util.concurrent.SynchronousQueue.poll(Unknown Source)
                at java.util.concurrent.ThreadPoolExecutor.getTask(Unknown Source)
                at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
                at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
                at hudson.remoting.Engine$1.lambda$newThread$0(Engine.java:122)
                at hudson.remoting.Engine$1$$Lambda$10/10833120.run(Unknown Source)
                at java.lang.Thread.run(Unknown Source)
            

             

             

            My OutOfMemory exception usually looks like this:

             

             

            java.lang.OutOfMemoryError: unable to create new native thread
             at java.lang.Thread.start0(Native Method)
             at java.lang.Thread.start(Unknown Source)
             at java.util.concurrent.ThreadPoolExecutor.addWorker(Unknown Source)
             at java.util.concurrent.ThreadPoolExecutor.execute(Unknown Source)
             at java.util.concurrent.AbstractExecutorService.submit(Unknown Source)
             at hudson.remoting.DelegatingExecutorService.submit(DelegatingExecutorService.java:51)
             at hudson.remoting.InterceptingExecutorService.submit(InterceptingExecutorService.java:50)
             at hudson.remoting.InterceptingExecutorService.submit(InterceptingExecutorService.java:44)
             at org.jenkinsci.remoting.util.AnonymousClassWarnings.check(AnonymousClassWarnings.java:66)
             at org.jenkinsci.remoting.util.AnonymousClassWarnings$1.annotateClass(AnonymousClassWarnings.java:122)
             at java.io.ObjectOutputStream.writeNonProxyDesc(Unknown Source)
             at java.io.ObjectOutputStream.writeClassDesc(Unknown Source)
             at java.io.ObjectOutputStream.writeOrdinaryObject(Unknown Source)
             at java.io.ObjectOutputStream.writeObject0(Unknown Source)
             at java.io.ObjectOutputStream.writeObject(Unknown Source)
             at hudson.remoting.Command.writeTo(Command.java:111)
             at hudson.remoting.AbstractByteBufferCommandTransport.write(AbstractByteBufferCommandTransport.java:286)
             at hudson.remoting.Channel.send(Channel.java:766)
             at hudson.remoting.ProxyOutputStream.flush(ProxyOutputStream.java:158)
             at hudson.remoting.RemoteOutputStream.flush(RemoteOutputStream.java:117)
             at hudson.util.StreamCopyThread.run(StreamCopyThread.java:71)
             
            

             

            After reading the code for 20 minutes, my hunch is that whenever .flush() is called on an output stream somewhere, that ends up creating a new worker thread to perform the flush.  For some reason those threads accumulate and don't recycle fast enough under some condition.

             

            I created a script to capture a stack dump once per minute.  The number of pool-1-thread-nnnnn threads would surge up and down over the course of the 90 minutes that my job takes to run.  Most of the time the pool threads seem to recycle fast enough that it works; however, sometimes they can't, and thus the OOME.
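            For anyone who wants to watch the same counter without polling jstack from a script, a rough in-process equivalent is sketched below. This is my own illustration, not part of Remoting; it just counts live threads whose names match the default factory's "pool-N-thread-M" pattern seen in the jstack captures:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class PoolThreadCount {
    // Count live threads whose names start with the given prefix — the same
    // information visible in a jstack dump, but gathered in-process.
    static long countThreads(String prefix) {
        return Thread.getAllStackTraces().keySet().stream()
                .filter(t -> t.getName().startsWith(prefix))
                .count();
    }

    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newCachedThreadPool();
        pool.submit(() -> {}).get(); // force at least one worker thread to exist
        System.out.println("pool threads: " + countThreads("pool-"));
        pool.shutdown();
    }
}
```

            Logged once a minute from a monitoring thread, this would show the same surges the external stack dumps did.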

             

            It is probably worth noting that my job generates a lot of output.  250000-350000 lines for 35M-50M in output.  Also, the jenkins agent is running on an AWS instance and the jenkins master is in a separate datacenter.  So perhaps the flushing threads take just long enough over the distant network connection that it contributes to the problem.

             

            I don't remember having this issue (as much) before the agent and the master were farther apart, but I could be wrong.

             

            cpholt Christopher added a comment -

            Upon further investigation, the number of pool-1-thread-nnnnnn threads varies directly with the pace of log output. 

             

            In a quiet period (several minutes with no output), there are only 2 of these threads.  At the end of that quiet period, when a lot of logs (10000+ lines) are produced suddenly, the number of these threads jumps as high as 460.  After the log volume slows down, the number of threads slowly drops back down to 50 or so.  Then, as logs are produced at a slower pace, the number of threads remains around there or drops to as low as 10-15.  As log volume spikes back up during noisier tests, the number of threads spikes back up as well.
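            The surge dynamic described above is easy to reproduce in isolation. The standalone sketch below (my own illustration, not Remoting code; the 200-task burst and 200 ms task duration are numbers chosen for demonstration) submits a burst of slow tasks to a cached pool and reports the peak thread count:

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class CachedPoolBurst {
    // Submit a burst of 200 tasks that each block for ~200 ms, simulating
    // log flushes that wait on a slow network round trip.
    static int runBurst() throws InterruptedException {
        ThreadPoolExecutor pool = (ThreadPoolExecutor) Executors.newCachedThreadPool();
        for (int i = 0; i < 200; i++) {
            pool.execute(() -> {
                try {
                    Thread.sleep(200);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(30, TimeUnit.SECONDS);
        // Each task arrived while every existing worker was still busy, so
        // the cached pool created roughly one thread per task.
        return pool.getLargestPoolSize();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("peak threads: " + runBurst());
    }
}
```

            Scale the burst up to the size of a real log dump and the peak thread count scales with it, which matches the jumps to 460 and 2500 threads observed here.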

             

            The highest number I managed to capture was around 2500 of these threads.  Obviously having 2500 threads to flush log streams points to a problem somewhere. 

             

            My Speculation:  Perhaps the idea of a worker thread to flush the log was to eliminate some performance bottleneck somewhere in the past?  Maybe flushes were slow in some scenario and having them be async was a win.  But it seems as though in this case the complexity is causing the problem.  Perhaps there needs to be a limit to the number of threads allowed in a worker pool somewhere?

             

             

            cpholt Christopher added a comment -

            After reading the code a bit more and playing with some of the unit test, I think I have a solution:

            After: "git clone https://github.com/jenkinsci/remoting.git", I made this change:

             

            --- a/src/main/java/hudson/remoting/Engine.java
            +++ b/src/main/java/hudson/remoting/Engine.java
            @@ -113,7 +113,7 @@
                 /**
                  * Thread pool that sets {@link #CURRENT}.
                  */
            -    private final ExecutorService executor = Executors.newCachedThreadPool(new ThreadFactory() {
            +    private final ExecutorService executor = Executors.newFixedThreadPool(100, new ThreadFactory() {
                     private final ThreadFactory defaultFactory = Executors.defaultThreadFactory();
                     @Override
                     public Thread newThread(@Nonnull final Runnable r) {

             

            I'm running this now and (so far) the problem has not happened again.  git blame showed that this Executors.newCachedThreadPool call has been there for a long time (13 years or so, probably longer).  I don't know this code well enough to understand what other implications this might have, but it does seem to address the thread creation overflow.

             

            After reading the code in Executors.newCachedThreadPool, it basically creates a thread pool where any new request immediately creates a new thread if there are no currently idle threads to service that request.  The number of threads is effectively unbounded.  Executors.newFixedThreadPool uses a different queue for work requests and will simply queue them up, with no more than X (100 here) worker threads processing that queue.  After discovering how those two Executors work internally, Executors.newCachedThreadPool seems like a bomb waiting to go off.  Its unbounded approach to thread creation (2^29 or so is its actual limit) sets up a scenario where even a tiny delay in the time spent in each task can quickly blow up the number of threads.  As an experiment, I added a Thread.sleep(1), for a 1 ms sleep, to doCheck in AnonymousClassWarnings.  That caused the ProxyWriterTest to create 1000 threads instead of its usual 20 to 30.
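            The contrast with a fixed pool can be demonstrated directly. This sketch (my own illustration; the 10-thread cap and 200-task burst are arbitrary demo numbers, not the values used in the patch above) pushes the same kind of burst through Executors.newFixedThreadPool and shows the pool never growing past its bound:

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executors;
import java.util.concurrent.ThreadPoolExecutor;

public class FixedPoolCap {
    // Push a 200-task burst through a 10-thread fixed pool: excess tasks
    // wait in the pool's unbounded LinkedBlockingQueue instead of each
    // forcing the creation of a new native thread.
    static int runBurst() throws InterruptedException {
        ThreadPoolExecutor pool = (ThreadPoolExecutor) Executors.newFixedThreadPool(10);
        CountDownLatch done = new CountDownLatch(200);
        for (int i = 0; i < 200; i++) {
            pool.execute(() -> {
                try {
                    Thread.sleep(10);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
                done.countDown();
            });
        }
        done.await();        // all 200 tasks still complete…
        pool.shutdown();
        return pool.getLargestPoolSize(); // …but the pool never exceeded 10 threads
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("peak threads: " + runBurst());
    }
}
```

            The trade-off is latency rather than thread count: under a burst, tasks queue behind the bounded worker set instead of all running at once.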

             

            basil Basil Crow added a comment -

            Wow, thanks for providing this analysis, cpholt! To give some background, I'm not a maintainer of Remoting (nor do I really understand how it works), but I am a user who has been frustrated with this bug for years, and I maintain the Swarm plugin (a thin wrapper on top of Remoting). I've posted my stack trace earlier in the comments: while I don't have a call to flush() as you do, some other aspects of my setup are similar to yours. For example, I also have my controller in an on-prem datacenter separate from the agents running in AWS.

            I think the key insight you've provided is that this error may be a symptom of thread exhaustion. I didn't consider this as a possibility, but with an unbounded number of threads I can see how they could become exhausted. It's not clear to me what causes these threads to remain active for a long period of time in some cases and to be recycled quickly in other cases. Most systems software that I've seen (e.g., the Linux NFS server) has a configurable thread limit defaulting to some sensible number (e.g. 64 threads) and allowing the user to tune the option if necessary. Perhaps the same should be done here - defaulting to something like 64 and allowing the user to customize the value with a system property. I would encourage you to open a pull request with such a change and see what the maintainers think. I bet they would be receptive to it, though I can't know for sure. Also note that there's another instance of Executors.newCachedThreadPool in the Launcher class that might be susceptible to the same problem.
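            The tunable-limit idea could look something like the sketch below. To be clear, the property name hudson.remoting.workerPoolSize is hypothetical, chosen only for illustration; Remoting defines no such property today:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.ThreadPoolExecutor;

public class ConfigurablePool {
    // Sensible default, overridable via a system property
    // (-Dhudson.remoting.workerPoolSize=N on the agent command line).
    static final int DEFAULT_POOL_SIZE = 64;

    static ExecutorService newWorkerPool() {
        int size = Integer.getInteger("hudson.remoting.workerPoolSize", DEFAULT_POOL_SIZE);
        return Executors.newFixedThreadPool(size);
    }

    public static void main(String[] args) {
        ThreadPoolExecutor pool = (ThreadPoolExecutor) newWorkerPool();
        System.out.println("max pool size: " + pool.getMaximumPoolSize());
        pool.shutdown();
    }
}
```

            That mirrors the pattern used elsewhere in Jenkins, where internal tuning knobs are exposed as system properties rather than UI options.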

            Now if you'll allow me to speculate, and with the caveat that I really don't understand how Remoting works, one of the things I've noticed is that the open source Remoting supports both a BIONetworkLayer (blocking I/O) and a NIONetworkLayer (non-blocking I/O), but as far as I can tell, only the BIONetworkLayer is ever exposed in the open-source version of Jenkins. The NIONetworkLayer only seems to be exposed to users in the UI in the commercial version of CloudBees CI (which I've never used) and is documented as follows:

            The non-blocking I/O connector limits the number of threads that are used to maintain the SSH channel: when there are a large number of channels (that is, many SSH agents) the non-blocking connector uses fewer threads. This permits the Jenkins UI to remain more responsive than with the standard SSH agent connector.

            I find it interesting that they talk about the non-blocking connector using fewer threads. I'm not sure if it's relevant at all to the problem we're dealing with, but it's certainly interesting background information. If you're willing to hack around with Remoting, you can get the non-blocking I/O behavior by commenting out .withPreferNonBlockingIO(false) in Engine. I'm a bit curious if this would make a difference in your use case - with the caveat that I don't really have a solid theory here, since I am too unfamiliar with all of this to be able to begin to reason about things clearly.

            cpholt Christopher added a comment -

            basil:  I saw the other spot in Launcher and it was my first attempt at a fix, but Engine proved to be the code path I needed.  I'm guessing that someone who understands this code would fix both.  

             

            The threads remain active for 60 seconds, controlled by a parameter in Executors.newCachedThreadPool, and get reused if another task arrives within that time.  It's when a massive flood of tasks hits all at once that the number of threads can spike.  I think in this case it's when a line of output is written (or flushed?).  My Jenkins jobs that trip this bug tend to accumulate large amounts of logs in internal buffers and dump them on certain events.  So the sudden spike in demand leads to thread exhaustion, where a thread-limited approach could easily handle the spike in writes.
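            For what it's worth, ThreadPoolExecutor can be configured by hand to keep both properties at once: a hard cap on threads plus the 60-second idle recycling that newCachedThreadPool provides. A sketch, again my own illustration rather than anything in Remoting:

```java
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class BoundedKeepAlivePool {
    // A bounded pool that still recycles idle threads after 60 seconds.
    static ThreadPoolExecutor create(int maxThreads) {
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                maxThreads, maxThreads,        // fixed upper bound on workers
                60L, TimeUnit.SECONDS,         // idle threads die after 60 s
                new LinkedBlockingQueue<>());  // excess tasks queue up instead
        // Allow even "core" threads to time out, so the pool shrinks toward
        // zero during quiet periods instead of pinning maxThreads forever.
        pool.allowCoreThreadTimeOut(true);
        return pool;
    }

    public static void main(String[] args) {
        ThreadPoolExecutor pool = create(100);
        System.out.println(pool.getMaximumPoolSize() + " threads max, "
                + pool.getKeepAliveTime(TimeUnit.SECONDS) + " s keep-alive");
        pool.shutdown();
    }
}
```

            Compared with the plain newFixedThreadPool(100) patch, this avoids keeping 100 idle threads alive between bursts.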

             

            The documentation you reference is probably related but exploring that avenue is beyond the amount of time I can invest in this, unless my simple fix fails.

             

            Hopefully one of the Remoting maintainers can chime in and shed some light on this...

            basil Basil Crow added a comment -

            I put together a local reproducer for this bug. First, I created a Python script to create a large burst of output:

            #!/usr/bin/python3
            
            lipsum = "Lorem ipsum dolor sit amet, consectetur adipiscing elit. Curabitur mollis, sem in aliquet consectetur, diam lacus faucibus leo, ut tincidunt diam elit id justo. Nulla ac libero ut felis iaculis suscipit in in massa. Etiam consectetur suscipit ornare. Pellentesque eu diam tempus, lobortis est non, vulputate nulla. Fusce sagittis sodales turpis, sit amet imperdiet lorem lobortis quis. Cras ac ex nisi. Sed in nisl cursus, consectetur enim non, ultrices libero. In egestas malesuada erat, sit amet consectetur sapien. Nulla massa augue, cursus vitae malesuada ac, tincidunt eu ex. Aliquam vitae mi euismod, placerat sapien a, luctus sapien. Vestibulum at libero pulvinar, vestibulum purus ac, cursus erat. Phasellus vitae orci id ante maximus fermentum. Fusce posuere tincidunt leo, eget placerat sapien fringilla quis. Sed cursus mauris odio, ac interdum felis auctor vel. In ut aliquam massa. Praesent porttitor euismod urna. Suspendisse potenti. In porta libero vel interdum iaculis. Fusce non volutpat lacus. Proin arcu tortor, placerat a sagittis eget, commodo vel ante. Lorem ipsum dolor sit amet, consectetur adipiscing elit. Curabitur sem sem, aliquet at varius vel, dapibus eu lacus. Mauris vel ipsum neque. Etiam elit erat, auctor non sagittis a, volutpat sed nisi. Vivamus tellus dui, tincidunt a imperdiet et, mollis sed orci. Sed euismod, mauris at finibus luctus, diam nisl scelerisque orci, ut efficitur tellus augue nec ante. Quisque commodo ipsum quis nunc dapibus vehicula. Pellentesque dignissim ultricies tortor et euismod. Proin feugiat iaculis nunc sed aliquet. Suspendisse fringilla turpis egestas neque fringilla, at malesuada lacus rutrum. Nam eu venenatis orci. Praesent bibendum dictum dictum. Quisque rhoncus turpis a neque sollicitudin blandit. Donec eget magna ultricies nisi tempor aliquam nec eget neque. Aenean sagittis nunc nec est vehicula suscipit. Sed vitae bibendum quam. Fusce at mi arcu. Ut eget diam quis enim commodo consequat. Aliquam pulvinar erat sit amet mi sollicitudin, eget mollis dui blandit. Nam varius, mi eget interdum consectetur, nibh nulla venenatis orci, sit amet vulputate leo odio at nulla. Donec nunc elit, auctor eget molestie vitae, fermentum a lacus. Orci varius natoque penatibus et magnis dis parturient montes, nascetur ridiculus mus. Curabitur massa enim, vulputate in ipsum nec, blandit lacinia mi. Nulla dignissim est eget congue suscipit. Phasellus sit amet porttitor urna. In ut lacinia sapien Vivamus dapibus consectetur massa, et vehicula ex molestie vitae. Duis efficitur ut sapien eu euismod. Donec id lorem dignissim, aliquam odio id, suscipit lorem. Pellentesque sit amet vulputate sem, ac blandit nunc. Pellentesque faucibus augue sed cursus molestie. Mauris quis nulla erat. In et nulla vel ex fringilla lacinia quis sit amet risus. Nunc in erat quis nisi laoreet iaculis. Pellentesque lobortis pulvinar justo, imperdiet gravida justo ultricies eget. Pellentesque vehicula purus et metus hendrerit, sed placerat metus tincidunt. Proin lacinia hendrerit quam, eu pharetra urna ullamcorper id. Cras mattis eu sem sed facilisis. Vestibulum sit amet libero sit amet eros condimentum congue. Suspendisse et ultricies ante, in rutrum magna. Etiam et fringilla mi, non eleifend arcu. Vestibulum sit amet tristique felis, at congue odio. Ut posuere interdum justo at.\n"
            
            mystr = ""
            
            for x in range(0, 999999):
                mystr += lipsum
            
            print(mystr)
            

            I put this script in /tmp/lipsum.py.

            Then I built Remoting with a 100 millisecond sleep:

            diff --git a/src/main/java/org/jenkinsci/remoting/util/AnonymousClassWarnings.java b/src/main/java/org/jenkinsci/remoting/util/AnonymousClassWarnings.java
            index a4912aa8..10928c25 100644
            --- a/src/main/java/org/jenkinsci/remoting/util/AnonymousClassWarnings.java
            +++ b/src/main/java/org/jenkinsci/remoting/util/AnonymousClassWarnings.java
            @@ -71,6 +71,11 @@ public class AnonymousClassWarnings {
                 }
             
                 private static void doCheck(@Nonnull Class<?> c) {
            +        try {
            +            Thread.sleep(100);
            +        } catch (Throwable t) {
            +            // do nothing
            +        }
                     if (Enum.class.isAssignableFrom(c)) { // e.g., com.cloudbees.plugins.credentials.CredentialsScope$1 ~ CredentialsScope.SYSTEM
                         // ignore, enums serialize specially
                     } else if (c.isAnonymousClass()) { // e.g., pkg.Outer$1
            

            I installed this with mvn clean install -DskipTests.

            In Jenkins core I used this patch:

            diff --git a/pom.xml b/pom.xml
            index 234651e2bb..44ec2d9423 100644
            --- a/pom.xml
            +++ b/pom.xml
            @@ -91,7 +91,7 @@ THE SOFTWARE.
                 <changelog.url>https://www.jenkins.io/changelog</changelog.url>
             
                 <!-- Bundled Remoting version -->
            -    <remoting.version>4.10</remoting.version>
            +    <remoting.version>4.11-SNAPSHOT</remoting.version>
                 <!-- Minimum Remoting version, which is tested for API compatibility -->
                 <remoting.minimum.supported.version>3.14</remoting.minimum.supported.version>
             
            diff --git a/test/src/test/java/hudson/model/ProjectTest.java b/test/src/test/java/hudson/model/ProjectTest.java
            index a63bda9445..c6cd2890ca 100644
            --- a/test/src/test/java/hudson/model/ProjectTest.java
            +++ b/test/src/test/java/hudson/model/ProjectTest.java
            @@ -33,6 +33,8 @@ import hudson.Launcher;
             import hudson.Util;
             import hudson.model.queue.QueueTaskFuture;
             import hudson.security.AccessDeniedException3;
            +import hudson.slaves.RetentionStrategy;
            +import hudson.slaves.SlaveComputer;
             import hudson.tasks.ArtifactArchiver;
             import hudson.tasks.BatchFile;
             import hudson.tasks.BuildTrigger;
            @@ -78,6 +80,7 @@ import hudson.security.ACLContext;
             import hudson.slaves.Cloud;
             import hudson.slaves.DumbSlave;
             import hudson.slaves.NodeProvisioner;
            +import org.jvnet.hudson.test.SimpleCommandLauncher;
             import org.jvnet.hudson.test.TestExtension;
             import java.util.List;
             import java.util.ArrayList;
            @@ -154,7 +157,29 @@ public class ProjectTest {
                     assertNotNull("Project should have Transient Action TransientAction.", p.getAction(TransientAction.class));
                     createAction = false;
                 }
            -    
            +
            +    @Test
            +    public void testRemoting() throws Exception {
            +        FreeStyleProject p = j.createFreeStyleProject("project");
            +        int sz = j.jenkins.getNodes().size();
            +        SimpleCommandLauncher launcher = new SimpleCommandLauncher(
            +                String.format("\"%s/bin/java\" -Djava.awt.headless=true -Xmx1g -Xms1g -jar \"%s\"",
            +                        System.getProperty("java.home"),
            +                        new File(j.jenkins.getJnlpJars("agent.jar").getURL().toURI()).getAbsolutePath()));
            +        Slave agent = new DumbSlave("agent" + sz, "description", j.createTmpDir().getPath(), "1", Node.Mode.NORMAL, "", launcher, RetentionStrategy.NOOP, Collections.emptyList());
            +        j.jenkins.addNode(agent);
            +        j.waitOnline(agent);
            +        SlaveComputer computer = (SlaveComputer) agent.toComputer();
            +        System.err.println(computer.getLog());
            +        p.setAssignedNode(agent);
            +        p.getBuildersList().add(new Shell("python3 /tmp/lipsum.py"));
            +        try {
            +            j.buildAndAssertSuccess(p);
            +        } finally {
            +            System.err.println(computer.getLog());
            +        }
            +    }
            +
                 @Test
                 public void testGetEnvironment() throws Exception{
                     FreeStyleProject p = j.createFreeStyleProject("project");
            
            

            Running the above with MAVEN_OPTS=-Xmx4g mvn clean verify -Dspotbugs.skip=true -Dcheckstyle.skip=true -Dtest=hudson.model.ProjectTest#testRemoting, the test passes. Watching the thread count, I get up to 15,500 threads for the agent process. This is a lot of threads, but not enough to trigger an out of memory error on my system.
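            For anyone reproducing this, the live and peak thread counts can also be sampled from inside a JVM with the standard java.lang.management API. This is a minimal standalone sketch, not part of the reproducer above (which watched the agent process externally):

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;

public class ThreadCountSample {
    public static void main(String[] args) throws InterruptedException {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        System.out.println("live threads: " + mx.getThreadCount()
                + ", peak threads: " + mx.getPeakThreadCount());

        // Starting a thread raises the live count while it is running.
        Thread t = new Thread(() -> {
            try {
                Thread.sleep(200);
            } catch (InterruptedException ignored) {
            }
        });
        t.start();
        System.out.println("after start: " + mx.getThreadCount());
        t.join();
    }
}
```

            Periodically logging these two numbers from the agent JVM would give the same kind of evidence as watching the process from outside.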

            Next I needed a way to trigger the error. I'm on a Linux desktop with about 1,500 threads running at idle, so I tried putting various numbers in /proc/sys/kernel/threads-max to limit the maximum number of threads on my system. By default the limit was over 250,000 threads, which didn't result in an OOM. A limit of 23,000 threads still wasn't enough to trigger an OOM. But a limit of 22,000 threads was enough to consistently trigger this:

            SEVERE: Unexpected error in channel channel
            java.lang.OutOfMemoryError: unable to create new native thread
                    at java.lang.Thread.start0(Native Method)
                    at java.lang.Thread.start(Thread.java:717)
                    at java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:957)
                    at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1378)
                    at java.util.concurrent.AbstractExecutorService.submit(AbstractExecutorService.java:134)
                    at hudson.remoting.DelegatingExecutorService.submit(DelegatingExecutorService.java:51)
                    at hudson.remoting.InterceptingExecutorService.submit(InterceptingExecutorService.java:50)
                    at hudson.remoting.InterceptingExecutorService.submit(InterceptingExecutorService.java:44)
                    at org.jenkinsci.remoting.util.AnonymousClassWarnings.check(AnonymousClassWarnings.java:66)
                    at hudson.remoting.ClassFilter$RegExpClassFilter.isBlacklisted(ClassFilter.java:304)
                    at hudson.remoting.ClassFilter$1.isBlacklisted(ClassFilter.java:123)
                    at hudson.remoting.ClassFilter.check(ClassFilter.java:78)
                    at hudson.remoting.ObjectInputStreamEx.resolveClass(ObjectInputStreamEx.java:61)
                    at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1986)
                    at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1850)
                    at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2160)
                    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1667)
                    at java.io.ObjectInputStream.readObject(ObjectInputStream.java:503)
                    at java.io.ObjectInputStream.readObject(ObjectInputStream.java:461)
                    at hudson.remoting.Command.readFromObjectStream(Command.java:155)
                    at hudson.remoting.Command.readFrom(Command.java:142)
                    at hudson.remoting.Command.readFrom(Command.java:128)
                    at hudson.remoting.AbstractSynchronousByteArrayCommandTransport.read(AbstractSynchronousByteArrayCommandTransport.java:35)
                    at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:61)
            

            Finding a baseline was important: in the passing scenario, my regular desktop applications were using about 1,500 threads, the agent was using about 15,500 threads, and the other test machinery (e.g. the Jenkins controller process and JUnit) must have been using about 5,000 threads. Based on this, I added this patch to Remoting:

            diff --git a/src/main/java/hudson/remoting/Engine.java b/src/main/java/hudson/remoting/Engine.java
            index f62d556b..59404346 100644
            --- a/src/main/java/hudson/remoting/Engine.java
            +++ b/src/main/java/hudson/remoting/Engine.java
            @@ -113,7 +113,7 @@ public class Engine extends Thread {
                 /**
                  * Thread pool that sets {@link #CURRENT}.
                  */
            -    private final ExecutorService executor = Executors.newCachedThreadPool(new ThreadFactory() {
            +    private final ExecutorService executor = Executors.newFixedThreadPool(5000, new ThreadFactory() {
                     private final ThreadFactory defaultFactory = Executors.defaultThreadFactory();
                     @Override
                     public Thread newThread(@Nonnull final Runnable r) {
            diff --git a/src/main/java/hudson/remoting/Launcher.java b/src/main/java/hudson/remoting/Launcher.java
            index 15742223..8823a1bf 100644
            --- a/src/main/java/hudson/remoting/Launcher.java
            +++ b/src/main/java/hudson/remoting/Launcher.java
            @@ -748,7 +748,7 @@ public class Launcher {
                  * @since 2.24
                  */
                 public static void main(InputStream is, OutputStream os, Mode mode, boolean performPing, @CheckForNull JarCache cache) throws IOException, InterruptedException {
            -        ExecutorService executor = Executors.newCachedThreadPool();
            +        ExecutorService executor = Executors.newFixedThreadPool(5000);
                     ChannelBuilder cb = new ChannelBuilder("channel", executor)
                             .withMode(mode)
                             .withJarCacheOrDefault(cache);
            

            I wanted these numbers to be as high as possible so that the test would finish in a reasonable amount of time with the 100 millisecond sleep in Remoting, while still being low enough to demonstrate that bounding the thread pools gets the test to pass in a 22,000-thread-limited environment (which was failing with newCachedThreadPool). My theory was that this would cap the agent process at 10,000 threads (two pools of 5,000 each, rather than the old 15,500), which, along with the 1,500 threads for my desktop applications and the test machinery's 5,000 - 6,000 threads, should still be well under the system's 22,000 thread limit. Sure enough, the fix worked! The test passed again. No more OOM.

            I think this demonstrates that putting an upper bound on the number of threads in Remoting will solve this problem.
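            As a standalone illustration of that claim (a sketch, not Remoting code): newCachedThreadPool creates a new worker thread for every task submitted while no worker is idle, so a burst of overlapping slow tasks creates roughly one thread per task, whereas newFixedThreadPool queues the excess and never exceeds its bound:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class PoolBoundDemo {
    // Submit a burst of slow tasks and report the peak worker-thread count.
    static int peakThreads(ExecutorService es, int tasks) throws InterruptedException {
        for (int i = 0; i < tasks; i++) {
            es.submit(() -> {
                try {
                    Thread.sleep(100); // stand-in for a slow per-class check
                } catch (InterruptedException ignored) {
                }
            });
        }
        es.shutdown();
        es.awaitTermination(1, TimeUnit.MINUTES);
        return ((ThreadPoolExecutor) es).getLargestPoolSize();
    }

    public static void main(String[] args) throws InterruptedException {
        int tasks = 200;
        // Unbounded: roughly one worker thread per overlapping task.
        System.out.println("cached pool peak: " + peakThreads(Executors.newCachedThreadPool(), tasks));
        // Bounded: at most 16 workers; the other tasks wait in the queue.
        System.out.println("fixed pool peak:  " + peakThreads(Executors.newFixedThreadPool(16), tasks));
    }
}
```

            The bounded pool trades latency (tasks wait in the queue) for a hard cap on native threads, which is exactly the trade-off the patch above makes.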

            basil Basil Crow added a comment - I put together a local reproducer for this bug.
            cpholt Christopher added a comment -

            The limit you chose of 5000 would still have failed for me. My Jenkins agent runs on a modestly powered Windows instance. Ideally this limit would come from a parameter or system property. System property via -D in the agent launcher script? I'm not sure what the usual project standard is for such settings. Is it part of the Node config in the Jenkins UI?
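            If the bound were exposed as a system property, the Remoting side might read it roughly like this. The property name below is purely hypothetical, chosen for illustration; Remoting defines no such setting:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class BoundedPoolFromProperty {
    public static void main(String[] args) {
        // Hypothetical property name; one would pass e.g.
        // -Dhudson.remoting.executorPoolBound=2000 in the agent launcher
        // script. Falls back to a default when unset.
        int bound = Integer.getInteger("hudson.remoting.executorPoolBound", 5000);
        ExecutorService executor = Executors.newFixedThreadPool(bound);
        System.out.println("executor bounded at " + bound + " threads");
        executor.shutdown();
    }
}
```

            This keeps the current behavior as the default while letting smaller machines dial the cap down, which seems to be what is being asked for here.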
            basil Basil Crow added a comment -

            The limit you chose of 5000 would still have failed for me.

            Right - this wasn't intended to be the actual production limit, but just a way to get the problem to reproduce locally.

            System property via -D in the agent launcher script? I'm not sure what the usual project standard is for such settings. Is it part of the Node config in the Jenkins UI?

            I'm not too sure either. This is where we'd really need some design guidance from maintainers. jthompson are you still maintaining Remoting?

            jeffret Jeff Thompson added a comment -

            Yes, basil , I am still maintaining Remoting, but on a completely volunteer basis these days. This month and next my free time is very limited so I'm not going to have much time for prepping a change and testing. I'd be happy to look over a PR, though, especially if it came with sufficient explanation and testing. I'm not sure how much of this area I understand already.


            kon Kalle Niemitalo added a comment -

            Reformatted the stack trace in the description to make it not display so many pairs of braces.

            grayudu Gangadhar Rayudu added a comment -

            We have been seeing the same issue in our environment. basil / cpholt Do you have a Docker image with the fix to try?
            ewalther Enrico Walther added a comment - - edited

            We use a self-prepared Docker image based on Ubuntu Focal and remoting-4.10.jar. We observe the same issue running as a pod in our K8s cluster.

             

            apiVersion: "v1"
            kind: "Pod"
            metadata:
              labels:
                jenkins: "slave"
                jenkins/label-digest: "168c12f11d09a233175f435329c242e1f2f941f9"
                jenkins/label: "jenkins-slave-simple"
              name: "jenkins-slave-simple-w4z4f"
            spec:
              containers:
              - env:
                - name: "JENKINS_SECRET"
                  value: "********"
                - name: "JENKINS_AGENT_NAME"
                  value: "jenkins-slave-simple-w4z4f"
                - name: "JENKINS_NAME"
                  value: "jenkins-slave-simple-w4z4f"
                - name: "JENKINS_AGENT_WORKDIR"
                  value: "/home/jenkins"
                - name: "JENKINS_URL"
                  value: "https://<xxx>"
                image: "registry<xxx>/jenkins-slave-simple:4.10"
                imagePullPolicy: "Always"
                name: "jnlp"
                resources:
                  limits:
                    memory: "1024Mi"
                    cpu: "500m"
                  requests:
                    memory: "512Mi"
                    cpu: "100m"
                tty: true
                volumeMounts:
                - mountPath: "/home/jenkins"
                  name: "workspace-volume"
                  readOnly: false
                workingDir: "/home/jenkins"
              hostNetwork: false
              imagePullSecrets:
              - name: "registry-gitlab"
              nodeSelector:
                kubernetes.io/os: "linux"
              restartPolicy: "Never"
              volumes:
              - emptyDir:
                  medium: ""
                name: "workspace-volume"
            
            Running on jenkins-slave-simple-w4z4f in /home/jenkins/workspace/<xxx>
            [Pipeline] {
            [Pipeline] stage
            [Pipeline] { (Checkout)
            [Pipeline] deleteDir
            [Pipeline] withCredentials
            Masking supported pattern matches of $BBUser
            [Pipeline] {
            [Pipeline] sh
            [Pipeline] }
            [Pipeline] // withCredentials
            [Pipeline] }
            [Pipeline] // stage
            [Pipeline] emailext
            Request made to compress build log
            #648811 is still in progress; ignoring for purposes of comparison
            Sending email to: <xxx>
            [Pipeline] }
            [Pipeline] // node
            [Pipeline] End of Pipeline
            Also:   hudson.remoting.Channel$CallSiteStackTrace: Remote call to JNLP4-connect connection from 10.42.2.0/10.42.2.0:35994
            		at hudson.remoting.Channel.attachCallSiteStackTrace(Channel.java:1795)
            		at hudson.remoting.UserRequest$ExceptionResponse.retrieve(UserRequest.java:356)
            		at hudson.remoting.Channel.call(Channel.java:1001)
            		at hudson.Launcher$RemoteLauncher.launch(Launcher.java:1123)
            		at hudson.Launcher$ProcStarter.start(Launcher.java:508)
            		at org.jenkinsci.plugins.durabletask.BourneShellScript.launchWithCookie(BourneShellScript.java:176)
            		at org.jenkinsci.plugins.durabletask.FileMonitoringTask.launch(FileMonitoringTask.java:136)
            		at org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep$Execution.start(DurableTaskStep.java:320)
            		at org.jenkinsci.plugins.workflow.cps.DSL.invokeStep(DSL.java:319)
            		at org.jenkinsci.plugins.workflow.cps.DSL.invokeMethod(DSL.java:193)
            		at org.jenkinsci.plugins.workflow.cps.CpsScript.invokeMethod(CpsScript.java:122)
            		at jdk.internal.reflect.GeneratedMethodAccessor42730.invoke(Unknown Source)
            		at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
            		at java.base/java.lang.reflect.Method.invoke(Unknown Source)
            		at org.codehaus.groovy.reflection.CachedMethod.invoke(CachedMethod.java:93)
            		at groovy.lang.MetaMethod.doMethodInvoke(MetaMethod.java:325)
            		at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1213)
            		at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1022)
            		at org.codehaus.groovy.runtime.callsite.PogoMetaClassSite.call(PogoMetaClassSite.java:42)
            		at org.codehaus.groovy.runtime.callsite.CallSiteArray.defaultCall(CallSiteArray.java:48)
            		at org.codehaus.groovy.runtime.callsite.AbstractCallSite.call(AbstractCallSite.java:113)
            		at org.kohsuke.groovy.sandbox.impl.Checker$1.call(Checker.java:163)
            		at org.kohsuke.groovy.sandbox.GroovyInterceptor.onMethodCall(GroovyInterceptor.java:23)
            		at org.jenkinsci.plugins.scriptsecurity.sandbox.groovy.SandboxInterceptor.onMethodCall(SandboxInterceptor.java:158)
            		at org.kohsuke.groovy.sandbox.impl.Checker$1.call(Checker.java:161)
            		at org.kohsuke.groovy.sandbox.impl.Checker.checkedCall(Checker.java:165)
            		at org.kohsuke.groovy.sandbox.impl.Checker.checkedCall(Checker.java:135)
            		at org.kohsuke.groovy.sandbox.impl.Checker.checkedCall(Checker.java:135)
            		at org.kohsuke.groovy.sandbox.impl.Checker.checkedCall(Checker.java:135)
            		at com.cloudbees.groovy.cps.sandbox.SandboxInvoker.methodCall(SandboxInvoker.java:17)
            		at com.cloudbees.groovy.cps.impl.ContinuationGroup.methodCall(ContinuationGroup.java:86)
            		at com.cloudbees.groovy.cps.impl.FunctionCallBlock$ContinuationImpl.dispatchOrArg(FunctionCallBlock.java:113)
            		at com.cloudbees.groovy.cps.impl.FunctionCallBlock$ContinuationImpl.fixArg(FunctionCallBlock.java:83)
            		at jdk.internal.reflect.GeneratedMethodAccessor518.invoke(Unknown Source)
            		at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
            		at java.base/java.lang.reflect.Method.invoke(Unknown Source)
            		at com.cloudbees.groovy.cps.impl.ContinuationPtr$ContinuationImpl.receive(ContinuationPtr.java:72)
            		at com.cloudbees.groovy.cps.impl.ContinuationGroup.methodCall(ContinuationGroup.java:89)
            		at com.cloudbees.groovy.cps.impl.FunctionCallBlock$ContinuationImpl.dispatchOrArg(FunctionCallBlock.java:113)
            		at com.cloudbees.groovy.cps.impl.FunctionCallBlock$ContinuationImpl.fixArg(FunctionCallBlock.java:83)
            		at jdk.internal.reflect.GeneratedMethodAccessor518.invoke(Unknown Source)
            		at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
            		at java.base/java.lang.reflect.Method.invoke(Unknown Source)
            		at com.cloudbees.groovy.cps.impl.ContinuationPtr$ContinuationImpl.receive(ContinuationPtr.java:72)
            		at com.cloudbees.groovy.cps.impl.PropertyishBlock$ContinuationImpl.get(PropertyishBlock.java:76)
            		at com.cloudbees.groovy.cps.LValueBlock$GetAdapter.receive(LValueBlock.java:30)
            		at com.cloudbees.groovy.cps.impl.PropertyishBlock$ContinuationImpl.fixName(PropertyishBlock.java:66)
            		at jdk.internal.reflect.GeneratedMethodAccessor609.invoke(Unknown Source)
            		at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
            		at java.base/java.lang.reflect.Method.invoke(Unknown Source)
            		at com.cloudbees.groovy.cps.impl.ContinuationPtr$ContinuationImpl.receive(ContinuationPtr.java:72)
            		at com.cloudbees.groovy.cps.impl.ConstantBlock.eval(ConstantBlock.java:21)
            		at com.cloudbees.groovy.cps.Next.step(Next.java:83)
            		at com.cloudbees.groovy.cps.Continuable$1.call(Continuable.java:174)
            		at com.cloudbees.groovy.cps.Continuable$1.call(Continuable.java:163)
            		at org.codehaus.groovy.runtime.GroovyCategorySupport$ThreadCategoryInfo.use(GroovyCategorySupport.java:129)
            		at org.codehaus.groovy.runtime.GroovyCategorySupport.use(GroovyCategorySupport.java:268)
            		at com.cloudbees.groovy.cps.Continuable.run0(Continuable.java:163)
            		at org.jenkinsci.plugins.workflow.cps.SandboxContinuable.access$001(SandboxContinuable.java:18)
            		at org.jenkinsci.plugins.workflow.cps.SandboxContinuable.run0(SandboxContinuable.java:51)
            		at org.jenkinsci.plugins.workflow.cps.CpsThread.runNextChunk(CpsThread.java:185)
            		at org.jenkinsci.plugins.workflow.cps.CpsThreadGroup.run(CpsThreadGroup.java:400)
            		at org.jenkinsci.plugins.workflow.cps.CpsThreadGroup.access$400(CpsThreadGroup.java:96)
            		at org.jenkinsci.plugins.workflow.cps.CpsThreadGroup$2.call(CpsThreadGroup.java:312)
            		at org.jenkinsci.plugins.workflow.cps.CpsThreadGroup$2.call(CpsThreadGroup.java:276)
            		at org.jenkinsci.plugins.workflow.cps.CpsVmExecutorService$2.call(CpsVmExecutorService.java:67)
            		at java.base/java.util.concurrent.FutureTask.run(Unknown Source)
            		at hudson.remoting.SingleLaneExecutorService$1.run(SingleLaneExecutorService.java:139)
            		at jenkins.util.ContextResettingExecutorService$1.run(ContextResettingExecutorService.java:28)
            		at jenkins.security.ImpersonatingExecutorService$1.run(ImpersonatingExecutorService.java:68)
            		at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
            		at java.base/java.util.concurrent.FutureTask.run(Unknown Source)
            		at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
            		at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
            		at java.base/java.lang.Thread.run(Unknown Source)
            java.lang.OutOfMemoryError: unable to create new native thread
            	at java.lang.Thread.start0(Native Method)
            	at java.lang.Thread.start(Thread.java:717)
            	at java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:957)
            	at java.util.concurrent.ThreadPoolExecutor.ensurePrestart(ThreadPoolExecutor.java:1603)
            	at java.util.concurrent.ScheduledThreadPoolExecutor.delayedExecute(ScheduledThreadPoolExecutor.java:334)
            	at java.util.concurrent.ScheduledThreadPoolExecutor.schedule(ScheduledThreadPoolExecutor.java:533)
            	at jenkins.util.InterceptingScheduledExecutorService.schedule(InterceptingScheduledExecutorService.java:49)
            	at org.jenkinsci.plugins.workflow.log.DelayBufferedOutputStream.reschedule(DelayBufferedOutputStream.java:72)
            	at org.jenkinsci.plugins.workflow.log.DelayBufferedOutputStream.<init>(DelayBufferedOutputStream.java:68)
            	at org.jenkinsci.plugins.workflow.log.BufferedBuildListener$Replacement.readResolve(BufferedBuildListener.java:77)
            	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
            	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
            	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
            	at java.lang.reflect.Method.invoke(Method.java:498)
            	at java.io.ObjectStreamClass.invokeReadResolve(ObjectStreamClass.java:1274)
            	at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2196)
            	at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1667)
            	at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2405)
            	at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2329)
            	at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2187)
            	at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1667)
            	at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2405)
            	at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2329)
            	at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2187)
            	at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1667)
            	at java.io.ObjectInputStream.readObject(ObjectInputStream.java:503)
            	at java.io.ObjectInputStream.readObject(ObjectInputStream.java:461)
            	at hudson.remoting.UserRequest.deserialize(UserRequest.java:289)
            	at hudson.remoting.UserRequest.perform(UserRequest.java:189)
            	at hudson.remoting.UserRequest.perform(UserRequest.java:54)
            	at hudson.remoting.Request$2.run(Request.java:376)
            	at hudson.remoting.InterceptingExecutorService.lambda$wrap$0(InterceptingExecutorService.java:78)
            	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
            	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
            	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
            	at hudson.remoting.Engine$1.lambda$newThread$0(Engine.java:122)
            	at java.lang.Thread.run(Thread.java:748)
            Caused: java.io.IOException: Remote call on JNLP4-connect connection from 10.42.2.0/10.42.2.0:35994 failed
            	at hudson.remoting.Channel.call(Channel.java:1005)
            	at hudson.Launcher$RemoteLauncher.launch(Launcher.java:1123)
            	at hudson.Launcher$ProcStarter.start(Launcher.java:508)
            	at org.jenkinsci.plugins.durabletask.BourneShellScript.launchWithCookie(BourneShellScript.java:176)
            	at org.jenkinsci.plugins.durabletask.FileMonitoringTask.launch(FileMonitoringTask.java:136)
            	at org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep$Execution.start(DurableTaskStep.java:320)
            	at org.jenkinsci.plugins.workflow.cps.DSL.invokeStep(DSL.java:319)
            	at org.jenkinsci.plugins.workflow.cps.DSL.invokeMethod(DSL.java:193)
            	at org.jenkinsci.plugins.workflow.cps.CpsScript.invokeMethod(CpsScript.java:122)
            	at jdk.internal.reflect.GeneratedMethodAccessor42730.invoke(Unknown Source)
            	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
            	at java.base/java.lang.reflect.Method.invoke(Unknown Source)
            	at org.codehaus.groovy.reflection.CachedMethod.invoke(CachedMethod.java:93)
            	at groovy.lang.MetaMethod.doMethodInvoke(MetaMethod.java:325)
            	at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1213)
            	at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1022)
            	at org.codehaus.groovy.runtime.callsite.PogoMetaClassSite.call(PogoMetaClassSite.java:42)
            	at org.codehaus.groovy.runtime.callsite.CallSiteArray.defaultCall(CallSiteArray.java:48)
            	at org.codehaus.groovy.runtime.callsite.AbstractCallSite.call(AbstractCallSite.java:113)
            	at org.kohsuke.groovy.sandbox.impl.Checker$1.call(Checker.java:163)
            	at org.kohsuke.groovy.sandbox.GroovyInterceptor.onMethodCall(GroovyInterceptor.java:23)
            	at org.jenkinsci.plugins.scriptsecurity.sandbox.groovy.SandboxInterceptor.onMethodCall(SandboxInterceptor.java:158)
            	at org.kohsuke.groovy.sandbox.impl.Checker$1.call(Checker.java:161)
            	at org.kohsuke.groovy.sandbox.impl.Checker.checkedCall(Checker.java:165)
            	at org.kohsuke.groovy.sandbox.impl.Checker.checkedCall(Checker.java:135)
            	at org.kohsuke.groovy.sandbox.impl.Checker.checkedCall(Checker.java:135)
            	at org.kohsuke.groovy.sandbox.impl.Checker.checkedCall(Checker.java:135)
            	at com.cloudbees.groovy.cps.sandbox.SandboxInvoker.methodCall(SandboxInvoker.java:17)
            	at WorkflowScript.run(WorkflowScript:155)
            	at ___cps.transform___(Native Method)
            	at com.cloudbees.groovy.cps.impl.ContinuationGroup.methodCall(ContinuationGroup.java:86)
            	at com.cloudbees.groovy.cps.impl.FunctionCallBlock$ContinuationImpl.dispatchOrArg(FunctionCallBlock.java:113)
            	at com.cloudbees.groovy.cps.impl.FunctionCallBlock$ContinuationImpl.fixArg(FunctionCallBlock.java:83)
            	at jdk.internal.reflect.GeneratedMethodAccessor518.invoke(Unknown Source)
            	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
            	at java.base/java.lang.reflect.Method.invoke(Unknown Source)
            	at com.cloudbees.groovy.cps.impl.ContinuationPtr$ContinuationImpl.receive(ContinuationPtr.java:72)
            	at com.cloudbees.groovy.cps.impl.ContinuationGroup.methodCall(ContinuationGroup.java:89)
            	at com.cloudbees.groovy.cps.impl.FunctionCallBlock$ContinuationImpl.dispatchOrArg(FunctionCallBlock.java:113)
            	at com.cloudbees.groovy.cps.impl.FunctionCallBlock$ContinuationImpl.fixArg(FunctionCallBlock.java:83)
            	at jdk.internal.reflect.GeneratedMethodAccessor518.invoke(Unknown Source)
            	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
            	at java.base/java.lang.reflect.Method.invoke(Unknown Source)
            	at com.cloudbees.groovy.cps.impl.ContinuationPtr$ContinuationImpl.receive(ContinuationPtr.java:72)
            	at com.cloudbees.groovy.cps.impl.PropertyishBlock$ContinuationImpl.get(PropertyishBlock.java:76)
            	at com.cloudbees.groovy.cps.LValueBlock$GetAdapter.receive(LValueBlock.java:30)
            	at com.cloudbees.groovy.cps.impl.PropertyishBlock$ContinuationImpl.fixName(PropertyishBlock.java:66)
            	at jdk.internal.reflect.GeneratedMethodAccessor609.invoke(Unknown Source)
            	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
            	at java.base/java.lang.reflect.Method.invoke(Unknown Source)
            	at com.cloudbees.groovy.cps.impl.ContinuationPtr$ContinuationImpl.receive(ContinuationPtr.java:72)
            	at com.cloudbees.groovy.cps.impl.ConstantBlock.eval(ConstantBlock.java:21)
            	at com.cloudbees.groovy.cps.Next.step(Next.java:83)
            	at com.cloudbees.groovy.cps.Continuable$1.call(Continuable.java:174)
            	at com.cloudbees.groovy.cps.Continuable$1.call(Continuable.java:163)
            	at org.codehaus.groovy.runtime.GroovyCategorySupport$ThreadCategoryInfo.use(GroovyCategorySupport.java:129)
            	at org.codehaus.groovy.runtime.GroovyCategorySupport.use(GroovyCategorySupport.java:268)
            	at com.cloudbees.groovy.cps.Continuable.run0(Continuable.java:163)
            	at org.jenkinsci.plugins.workflow.cps.SandboxContinuable.access$001(SandboxContinuable.java:18)
            	at org.jenkinsci.plugins.workflow.cps.SandboxContinuable.run0(SandboxContinuable.java:51)
            	at org.jenkinsci.plugins.workflow.cps.CpsThread.runNextChunk(CpsThread.java:185)
            	at org.jenkinsci.plugins.workflow.cps.CpsThreadGroup.run(CpsThreadGroup.java:400)
            	at org.jenkinsci.plugins.workflow.cps.CpsThreadGroup.access$400(CpsThreadGroup.java:96)
            	at org.jenkinsci.plugins.workflow.cps.CpsThreadGroup$2.call(CpsThreadGroup.java:312)
            	at org.jenkinsci.plugins.workflow.cps.CpsThreadGroup$2.call(CpsThreadGroup.java:276)
            	at org.jenkinsci.plugins.workflow.cps.CpsVmExecutorService$2.call(CpsVmExecutorService.java:67)
            	at java.base/java.util.concurrent.FutureTask.run(Unknown Source)
            	at hudson.remoting.SingleLaneExecutorService$1.run(SingleLaneExecutorService.java:139)
            	at jenkins.util.ContextResettingExecutorService$1.run(ContextResettingExecutorService.java:28)
            	at jenkins.security.ImpersonatingExecutorService$1.run(ImpersonatingExecutorService.java:68)
            	at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
            	at java.base/java.util.concurrent.FutureTask.run(Unknown Source)
            	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
            	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
            	at java.base/java.lang.Thread.run(Unknown Source)
            Finished: FAILURE
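One point worth noting before the comments below: this flavor of OutOfMemoryError is thrown when the operating system refuses to create a thread, not when the JVM heap is exhausted, and in Kubernetes pods the limiting factor is often a process/thread cap (ulimit -u, kernel.threads-max, or the pids cgroup controller) rather than free memory. The following is a rough diagnostic sketch, not part of the original report, intended to be run inside the affected agent container (for example via kubectl exec); the cgroup paths are illustrative and differ between cgroup v1 and v2:

```shell
#!/bin/sh
# Diagnostics for "unable to create new native thread" in a container.
# The JVM raises this OutOfMemoryError when the OS refuses a new thread,
# which usually indicates a thread/process limit, not heap exhaustion.

# Per-process limit on user processes/threads as seen by the agent JVM:
echo "ulimit -u: $(ulimit -u)"

# Kernel-wide ceiling on threads:
echo "kernel.threads-max: $(cat /proc/sys/kernel/threads-max)"

# pids cgroup limit for this container (path differs between cgroup v1/v2):
cat /sys/fs/cgroup/pids/pids.max 2>/dev/null \
  || cat /sys/fs/cgroup/pids.max 2>/dev/null \
  || echo "no pids cgroup limit file found"

# Threads currently in use, to compare against the limits above
# (ps may be absent in minimal agent images):
echo "threads in use: $(ps -eLf 2>/dev/null | tail -n +2 | wc -l)"
```

If the in-use count is close to any of these limits at the moment a build fails, raising that limit or reducing concurrent builds per node is a more promising fix than adding memory.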
            
ewalther Enrico Walther added a comment - edited

We use a self-prepared Docker image based on Ubuntu Focal and remoting-4.10.jar. We observe the same issue running as a pod in our K8s cluster.

apiVersion: "v1"
kind: "Pod"
metadata:
  labels:
    jenkins: "slave"
    jenkins/label-digest: "168c12f11d09a233175f435329c242e1f2f941f9"
    jenkins/label: "jenkins-slave-simple"
  name: "jenkins-slave-simple-w4z4f"
spec:
  containers:
  - env:
    - name: "JENKINS_SECRET"
      value: "********"
    - name: "JENKINS_AGENT_NAME"
      value: "jenkins-slave-simple-w4z4f"
    - name: "JENKINS_NAME"
      value: "jenkins-slave-simple-w4z4f"
    - name: "JENKINS_AGENT_WORKDIR"
      value: "/home/jenkins"
    - name: "JENKINS_URL"
      value: "https://<xxx>"
    image: "registry<xxx>/jenkins-slave-simple:4.10"
    imagePullPolicy: "Always"
    name: "jnlp"
    resources:
      limits:
        memory: "1024Mi"
        cpu: "500m"
      requests:
        memory: "512Mi"
        cpu: "100m"
    tty: true
    volumeMounts:
    - mountPath: "/home/jenkins"
      name: "workspace-volume"
      readOnly: false
    workingDir: "/home/jenkins"
  hostNetwork: false
  imagePullSecrets:
  - name: "registry-gitlab"
  nodeSelector:
    kubernetes.io/os: "linux"
  restartPolicy: "Never"
  volumes:
  - emptyDir:
      medium: ""
    name: "workspace-volume"

Running on jenkins-slave-simple-w4z4f in /home/jenkins/workspace/<xxx>
[Pipeline] {
[Pipeline] stage
[Pipeline] { (Checkout)
[Pipeline] deleteDir
[Pipeline] withCredentials
Masking supported pattern matches of $BBUser
[Pipeline] {
[Pipeline] sh
[Pipeline] }
[Pipeline] // withCredentials
[Pipeline] }
[Pipeline] // stage
[Pipeline] emailext
Request made to compress build log #648811 is still in progress; ignoring for purposes of comparison
Sending email to: <xxx>
[Pipeline] }
[Pipeline] // node
[Pipeline] End of Pipeline
Finished: FAILURE
            dg424 Donald Gobin added a comment -

            The problem seems to lie in the Jenkins core itself and the way it spawns threads to log messages, as can be seen from the stack trace:

            java.lang.OutOfMemoryError: unable to create new native thread
            ***********************************************************************************************************
            	at java.lang.Thread.start0(Native Method)
            	at java.lang.Thread.start(Thread.java:717)
            	at java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:957)
            	at java.util.concurrent.ThreadPoolExecutor.ensurePrestart(ThreadPoolExecutor.java:1603)
            	at java.util.concurrent.ScheduledThreadPoolExecutor.delayedExecute(ScheduledThreadPoolExecutor.java:334)
            	at java.util.concurrent.ScheduledThreadPoolExecutor.schedule(ScheduledThreadPoolExecutor.java:533)
            	at jenkins.util.InterceptingScheduledExecutorService.schedule(InterceptingScheduledExecutorService.java:49)
            	at org.jenkinsci.plugins.workflow.log.DelayBufferedOutputStream.reschedule(DelayBufferedOutputStream.java:72)
            	at org.jenkinsci.plugins.workflow.log.DelayBufferedOutputStream.<init>(DelayBufferedOutputStream.java:68)
            	at org.jenkinsci.plugins.workflow.log.BufferedBuildListener$Replacement.readResolve(BufferedBuildListener.java:77)
            *********************************************************************************************************** 
            	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
            	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
            	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
            	at java.lang.reflect.Method.invoke(Method.java:498)
            	at java.io.ObjectStreamClass.invokeReadResolve(ObjectStreamClass.java:1260)
            	at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2133)
            	at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1625)
            	at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2342)
            	at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2266)
            	at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2124)
            	at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1625)
            	at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2342)
            	at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2266)
            	at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2124)
            	at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1625)
            	at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2342)
            	at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2266)
            	at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2124)
            	at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1625)
            	at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2342)
            	at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2266)
            	at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2124)
            	at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1625)
            	at java.io.ObjectInputStream.readObject(ObjectInputStream.java:465)
            	at java.io.ObjectInputStream.readObject(ObjectInputStream.java:423)
            	at hudson.remoting.UserRequest.deserialize(UserRequest.java:290)
            	at hudson.remoting.UserRequest.perform(UserRequest.java:189)
            	at hudson.remoting.UserRequest.perform(UserRequest.java:54)
            	at hudson.remoting.Request$2.run(Request.java:369)
            	at hudson.remoting.InterceptingExecutorService$1.call(InterceptingExecutorService.java:72)
            	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
            	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
            	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
            	at hudson.remoting.Engine$1.lambda$newThread$0(Engine.java:117)
            

            This is from the server side as the workflow plugin does not exist on the agent side.
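            The scheduling pattern visible in the trace can be sketched like this. This is a minimal, hypothetical illustration (not the actual workflow-api source): constructing a delay-buffered stream immediately hands a flush task to a ScheduledThreadPoolExecutor, and schedule() may need to start a new pool thread — the Thread.start0 call that fails above.

```java
import java.io.ByteArrayOutputStream;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class DelaySketch {
    // Hand a buffered log line to a scheduler, roughly as
    // DelayBufferedOutputStream's constructor does via reschedule().
    // schedule() -> ensurePrestart() -> addWorker() -> Thread.start()
    // is the path on which the OOME above is raised.
    static String scheduledFlush(String line) throws InterruptedException {
        ScheduledExecutorService scheduler = Executors.newScheduledThreadPool(1);
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        CountDownLatch flushed = new CountDownLatch(1);
        scheduler.schedule(() -> {
            sink.writeBytes(line.getBytes()); // flush buffered output to the sink
            flushed.countDown();
        }, 0, TimeUnit.MILLISECONDS);
        flushed.await(5, TimeUnit.SECONDS);
        scheduler.shutdown();
        return sink.toString();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(scheduledFlush("log line"));
    }
}
```

            The point is that thread creation here happens as a side effect of deserializing the build listener, which is why the failure surfaces inside ObjectInputStream rather than in any user code.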

             

             

             


            vlatombe Vincent Latombe added a comment -

            I submitted PR #505, which should make it possible to test whether the current class check is part of the problem without impacting normal operations.
            dg424 Donald Gobin added a comment -

            Hi vlatombe

            Is this fix for the server or agent side ?

             


            vlatombe Vincent Latombe added a comment -

            dg424 this is the agent side.
            dg424 Donald Gobin added a comment -

            Hi vlatombe

            But I see the remoting stack on both sides (remoting.jar is in the jenkins.war file as well), and the stack trace in my comment above shows classes (org.jenkinsci.plugins.workflow.log, jenkins.util.InterceptingScheduledExecutorService) that I cannot find in remoting.jar on the agent side. So I'm actually not sure where the OOM is happening; if your PR addresses only the agent side, does that mean the root cause of the exception is on the agent side but the error shows up on the server side? I'm confused.


            kon Kalle Niemitalo added a comment -

            I see org.jenkinsci.plugins.workflow.log classes in these files on an agent:

            • remoting/jarCache/06/D303140AA1A4E2367F9A63F58D3127.jar (workflow-api 1136.v7f5f1759dc16)
            • remoting/jarCache/AA/E8875DDC0E79929E944D30636208F6.jar (workflow-api 1108.v57edf648f5d4)
            • remoting/jarCache/EC/7A1A038FDCBC2456010A181E58E35B.jar (workflow-api 1122.v7a_916f363c86)

            I don't know whether those file names are hashes or just random. Anyway, it's conceivable that the agent could load org.jenkinsci.plugins.workflow.log.DelayBufferedOutputStream etc. from these files.

            kon Kalle Niemitalo added a comment - - edited

            Oh, Checksum.java first computes a SHA-256 hash, but then splits it into two 128-bit halves and XORs them together. It's not just a truncated hash as in section 5.1 of NIST Special Publication 800-107 Revision 1.

            $ sha256sum remoting/jarCache/06/D303140AA1A4E2367F9A63F58D3127.jar
            38165eeaa9e20f4a5bdced3d142660b13ec55dfea343aba86da3775ee1ab5196 *remoting/jarCache/06/D303140AA1A4E2367F9A63F58D3127.jar
            

            38165eeaa9e20f4a5bdced3d142660b1 xor 3ec55dfea343aba86da3775ee1ab5196 = 06D303140AA1A4E2367F9A63F58D3127
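            The folding can be reproduced in a few lines of Java (a sketch of the computation described above, not the actual Checksum.java code): split the 256-bit digest into two 128-bit halves and XOR them to get the jar-cache name.

```java
import java.math.BigInteger;

public class JarCacheName {
    // XOR-fold a SHA-256 digest (64 hex chars) into the 128-bit name
    // used for the jar cache entry, per the observation above.
    static String xorFold(String sha256Hex) {
        BigInteger hi = new BigInteger(sha256Hex.substring(0, 32), 16);
        BigInteger lo = new BigInteger(sha256Hex.substring(32), 16);
        return String.format("%032x", hi.xor(lo));
    }

    public static void main(String[] args) {
        // sha256sum output for remoting/jarCache/06/D303140AA1A4E2367F9A63F58D3127.jar
        String digest = "38165eeaa9e20f4a5bdced3d142660b13ec55dfea343aba86da3775ee1ab5196";
        System.out.println(xorFold(digest)); // prints 06d303140aa1a4e2367f9a63f58d3127
    }
}
```

            The first byte of the folded name then becomes the cache subdirectory (06/) and the rest the file name.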

            dg424 Donald Gobin added a comment -

            Hi kon,

            Thanks. I see it now. So, these classes are "shipped" to the agent at runtime. If I fire up the agent and do not start a job, the classes do not exist. Just trying to understand how the process works...

            dg424 Donald Gobin added a comment - - edited

            Tried the PR on my stress test job and still get the OOM with
            -Dorg.jenkinsci.remoting.util.AnonymousClassWarnings.useSeparateThreadPool=true:

            java.lang.OutOfMemoryError: unable to create new native thread
            	at java.lang.Thread.start0(Native Method)
            	at java.lang.Thread.start(Thread.java:717)
            	at hudson.remoting.AtmostOneThreadExecutor.execute(AtmostOneThreadExecutor.java:104)
            	at java.util.concurrent.AbstractExecutorService.submit(AbstractExecutorService.java:112)
            	at org.jenkinsci.remoting.util.AnonymousClassWarnings.check(AnonymousClassWarnings.java:73)
            	at org.jenkinsci.remoting.util.AnonymousClassWarnings$1.annotateClass(AnonymousClassWarnings.java:130)
            	at java.io.ObjectOutputStream.writeNonProxyDesc(ObjectOutputStream.java:1290)
            	at java.io.ObjectOutputStream.writeClassDesc(ObjectOutputStream.java:1231)
            	at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1427)
            	at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
            	at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348)
            	at hudson.remoting.Command.writeTo(Command.java:111)
            	at hudson.remoting.AbstractByteBufferCommandTransport.write(AbstractByteBufferCommandTransport.java:287)
            	at hudson.remoting.Channel.send(Channel.java:766)
            	at hudson.remoting.Request.callAsync(Request.java:238)
            	at hudson.remoting.Channel.callAsync(Channel.java:1030)
            	at hudson.remoting.RemoteInvocationHandler.invoke(RemoteInvocationHandler.java:285)
            	at com.sun.proxy.$Proxy3.notifyJarPresence(Unknown Source)
            	at hudson.remoting.FileSystemJarCache.lookInCache(FileSystemJarCache.java:80)
            	at hudson.remoting.JarCacheSupport.resolve(JarCacheSupport.java:49)
            	at hudson.remoting.ResourceImageInJar._resolveJarURL(ResourceImageInJar.java:93)
            	at hudson.remoting.ResourceImageInJar.resolve(ResourceImageInJar.java:45)
            	at hudson.remoting.RemoteClassLoader.loadRemoteClass(RemoteClassLoader.java:284)
            	at hudson.remoting.RemoteClassLoader.loadWithMultiClassLoader(RemoteClassLoader.java:264)
            	at hudson.remoting.RemoteClassLoader.findClass(RemoteClassLoader.java:223)
            	at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
            	at java.lang.ClassLoader.loadClass(ClassLoader.java:351)
            	at org.jenkinsci.plugins.gitclient.Git$GitAPIMasterToSlaveFileCallable.invoke(Git.java:173)
            	at org.jenkinsci.plugins.gitclient.Git$GitAPIMasterToSlaveFileCallable.invoke(Git.java:154)
            	at hudson.FilePath$FileCallableWrapper.call(FilePath.java:3317)
            	at hudson.remoting.UserRequest.perform(UserRequest.java:211)
            	at hudson.remoting.UserRequest.perform(UserRequest.java:54)
            	at hudson.remoting.Request$2.run(Request.java:376)
            	at hudson.remoting.InterceptingExecutorService.lambda$wrap$0(InterceptingExecutorService.java:78)
            	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
            	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
            	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
            	at hudson.remoting.Engine$1.lambda$newThread$0(Engine.java:121) 

            One important thing I forgot to mention: for us, this failure only occurs when the pipeline has a git checkout stage.

            allan_burdajewicz Allan BURDAJEWICZ added a comment - - edited

            I did some testing recently to troubleshoot this problem.

            I was able to capture a thread dump / heap dump by catching the OOM in remoting and generating dumps.

            The JVM heap is very low (< 10 MB). Thread dumps reveal that there are only ~50 threads when the issue happens. With a default stack size of 256KB, I doubt that this has much of an effect. So I am not convinced this is due to those pools spawning too many threads.

            Note that ulimits are very high within the jnlp container:

            $ ulimit -a
            core file size          (blocks, -c) unlimited
            data seg size           (kbytes, -d) unlimited
            scheduling priority             (-e) 0
            file size               (blocks, -f) unlimited
            pending signals                 (-i) 128448
            max locked memory       (kbytes, -l) 16384
            max memory size         (kbytes, -m) unlimited
            open files                      (-n) 1048576
            pipe size            (512 bytes, -p) 8
            POSIX message queues     (bytes, -q) 819200
            real-time priority              (-r) 0
            stack size              (kbytes, -s) 8192
            cpu time               (seconds, -t) unlimited
            max user processes              (-u) unlimited
            virtual memory          (kbytes, -v) unlimited
            file locks                      (-x) unlimited
            $ cat /proc/sys/kernel/threads-max
            256897
            $ cat /sys/fs/cgroup/pids/pids.max
            max
            

            Now maybe I am looking at this wrong. Given such high limits, maybe the jnlp container of another, non-failing pod is consuming those limits and impacting other containers. That being said, the external systems that I have in place (GKE and Datadog) do not show a spike in PIDs or anything explicit.

            So I thought that maybe this is an off-heap memory issue, or just an isolation behavior that happens during remoting class loading.

            Now I have enabled NMT tracking and things get interesting, though I am not too familiar with memory management at that level. What I see is that despite giving the jnlp container some limits - for example 500Mi - the reserved memory is higher than I expected:

            Total: reserved=1538742KB, committed=133862KB
            

            Most of it coming from the Metaspace:

            -                     Class (reserved=1070101KB, committed=22805KB)
                                        (classes #3545)
                                        (malloc=1045KB #4970) 
                                        (mmap: reserved=1069056KB, committed=21760KB)
            

            When I don't set a container limit, the reserved memory is quite a bit higher (for a k8s node with a capacity of 16G):

            Total: reserved=4600892KB, committed=310052KB
            

            I am not sure if this is part of the problem or not. That being said, the Class/Metaspace reserved size seems to be a constant ~1GB. When raising the container limit to 2Gi or more, the reserved memory for Class/Metaspace is similar and the total amount of reserved memory is lower than the container limit!
            I don't have enough knowledge in this area to know if that is related, but maybe someone here does. That area of off-heap memory can be controlled with -XX:MaxMetaspaceSize and -XX:CompressedClassSpaceSize. Maybe setting those helps mitigate the problem, though I would not know the impact or the right values, e.g. -XX:MaxMetaspaceSize=100m -XX:CompressedClassSpaceSize=100m.

            Still investigating...

            vlatombe Vincent Latombe added a comment - - edited

            Default JVM ergonomics (https://docs.oracle.com/en/java/javase/11/gctuning/ergonomics.html):

            • Maximum heap size of 1/4 of physical memory (accounts for container limits)
            • Metaspace size defaults to 1g (https://docs.oracle.com/en/java/javase/11/gctuning/other-considerations.html#GUID-B29C9153-3530-4C15-9154-E74F44E3DAD9)

            I would tend to believe that the container would get OOMKilled if it went over the defined container limit which isn't the case here.

            dg424 Donald Gobin added a comment -

            Yes, I've already tried tweaking the pod spec to set the memory limit, reduce the stack size, and add more CPU, and none of these worked - still got the OOM eventually. ulimits are fine, so it's not that either. Given the "random nature" of the error, it looks like some race condition where the code just spins out of control. Also note the related PR by Vincent here - https://github.com/jenkinsci/remoting/pull/505 - that I've been putting under my stress test setup.

             

            rhinoceros rhinoceros.xn added a comment - - edited

            Hello dg424, did this PR solve the problem? https://github.com/jenkinsci/remoting/pull/505#issuecomment-1046913986
            dg424 Donald Gobin added a comment -

            Hi rhinoceros

            No, it did not. Still looking for a solution...


            vlatombe Vincent Latombe added a comment -

            dg424 would you be able to share the Jenkinsfile you use to reproduce the problem? Or is it too specific?
            dg424 Donald Gobin added a comment -

            I can share the same stress test job layout and you can put your own settings to reproduce.

            pipeline {
                agent {
                    kubernetes {
                        inheritFrom 'k8s-default'
                        containerTemplate {
                            name 'mycontainer'
                            image "someimage:latest"
                            privileged false
                            alwaysPullImage false
                            workingDir '/home/jenkins'
                            ttyEnabled true
                            command 'cat'
                            args ''
                        }
                        defaultContainer 'mycontainer'
                    }
                }
                // run until we get 65873 error
                triggers {
                    cron "* * * * *"
                }    
                options {
                    disableConcurrentBuilds()
                }     
                stages {
                    stage('Checkout SCM') {
                        steps {
                            checkout([$class: 'GitSCM', branches: [[name: "FETCH_HEAD"]], doGenerateSubmoduleConfigurations: false, extensions: [[$class: 'SubmoduleOption', disableSubmodules: false, parentCredentials: true, recursiveSubmodules: true, reference: '', trackingSubmodules: false]], submoduleCfg: [], userRemoteConfigs: [[credentialsId: 'mygit-cred', url: 'ssh://git@mycompany.net/test.git']]])
                        }
                    }
                    stage('First stage') {
                        steps {
                            script {
                                echo "Inside first stage"
                            }
                        }
                    }
                }
                post {
                    failure {
                        // we ALWAYS get here eventually as a result of 65873 issue
                        echo "Failure!"
                        script {
                            // we got the error, disable job and send email
                            Jenkins.instance.getItemByFullName(env.JOB_NAME).doDisable()
                            emailext body: "${BUILD_URL}",
                                     subject: "[Jenkins]: ${JOB_NAME} build failed",
                                     to: 'foo@bar.com'
                        }
                    }
                }
            } 

             


            vlatombe Vincent Latombe added a comment -

            dg424 Do you have any limit defined at the infrastructure level that would apply CPU/memory limits to the pod and containers?
            dg424 Donald Gobin added a comment -

            Yeah, we do, but I'm not sure this is causing the issue, as the pipeline runs for a very long time before hitting the error (i.e. using the same resources that k8s has assigned to it on each run). Also, as you can see, the pipeline really doesn't do much, and it checks out an empty git project, so in terms of resources it uses very little directly. But the key stage for reproduction is the one that does the checkout; without it, the problem is not reproducible. I think this is why the other comment on GitHub suggested a downgrade of git prior to when this ticket was opened.

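            As an aside on checking those limits: "sufficient memory and threads available" can be deceptive under cgroup limits, because the kernel returns EAGAIN from pthread_create when the pids-controller cap or kernel.threads-max is reached regardless of free memory. A minimal diagnostic sketch (my illustration, not from this thread; the paths are the usual Linux locations and may differ by distro and cgroup version) that can be run inside the agent container when the error hits:

            ```java
            import java.nio.file.Files;
            import java.nio.file.Path;

            // Dumps the limits that can make pthread_create fail with EAGAIN
            // (surfacing in Java as "unable to create new native thread")
            // even when free memory looks fine.
            public class ThreadLimitCheck {
                public static String read(String path) {
                    try {
                        return Files.readString(Path.of(path)).trim();
                    } catch (Exception e) {
                        return "(unavailable)"; // file absent, e.g. other cgroup version
                    }
                }

                public static void main(String[] args) {
                    System.out.println("kernel.threads-max   = " + read("/proc/sys/kernel/threads-max"));
                    System.out.println("kernel.pid_max       = " + read("/proc/sys/kernel/pid_max"));
                    // pids-controller limit: cgroup v2 location first, then cgroup v1
                    System.out.println("cgroup pids.max (v2) = " + read("/sys/fs/cgroup/pids.max"));
                    System.out.println("cgroup pids.max (v1) = " + read("/sys/fs/cgroup/pids/pids.max"));
                    System.out.println("live threads in JVM  = " + Thread.activeCount());
                }
            }
            ```

            Comparing these numbers against the thread count of all processes in the pod at failure time would show whether a cgroup cap, rather than memory, is being exhausted.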

            vlatombe Vincent Latombe added a comment -

            dg424 Yes, I think something in the class loading that is done in preparation for the git checkout triggers the problem. It could be more apparent in low-memory environments, so ideally I'd like to set up a reproducer that is as close as possible to something that can trigger the problem.
            dg424 Donald Gobin added a comment -

            Since that test pipeline uses almost no resources, you should be able to quickly set one up directly on the Jenkins master to see if the problem exists in the area you suspect; I don't think it matters in that case whether you're using a k8s agent or running directly on the master. If there is some kind of leak in the class-loading area, it should eventually hit the problem?

            dg424 Donald Gobin added a comment -

            I created a related ticket here - https://issues.jenkins.io/browse/JENKINS-68199 - for the git client side, since vlatombe thinks that the Jenkins side might be OK.

            vlatombe Vincent Latombe added a comment - - edited

            On the agent side on jdk11, the following is printed when the problem occurs

            [14.901s][warning][os,thread] Failed to start thread - pthread_create failed (EAGAIN) for attributes: stacksize: 1024k, guardsize: 0k, detached. 

            This led me to https://stackoverflow.com/a/47082934, then this kernel bug from 2016, still not fixed.

            I also found an interesting issue on the JDK issue tracker that led to https://bugs.openjdk.java.net/browse/JDK-8268773 and a fix in jdk18 that does retries, so it could possibly work around the issue.

            Does anyone care to attempt to backport this to, say, jdk11?

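            For reference, the idea behind that fix can also be approximated at the application level: catch the OutOfMemoryError that surfaces when pthread_create fails with a transient EAGAIN and retry with a short backoff. A hedged sketch of the idea (my illustration only; the actual JDK-8268773 change retries inside HotSpot's native thread creation, not in Java code):

            ```java
            // Sketch of retrying thread creation on transient
            // "unable to create new native thread" failures.
            public class RetryingThreadStarter {

                /** Start a thread, retrying with a growing backoff.
                 *  Assumes maxAttempts >= 1. */
                public static Thread startWithRetry(Runnable task, int maxAttempts)
                        throws InterruptedException {
                    OutOfMemoryError last = null;
                    for (int attempt = 1; attempt <= maxAttempts; attempt++) {
                        Thread t = new Thread(task);
                        try {
                            t.start();
                            return t; // started successfully
                        } catch (OutOfMemoryError e) {
                            last = e;                     // pthread_create likely hit EAGAIN
                            Thread.sleep(10L * attempt);  // back off before the next attempt
                        }
                    }
                    throw last; // still failing after maxAttempts: rethrow the last error
                }

                public static void main(String[] args) throws Exception {
                    Thread t = startWithRetry(() -> System.out.println("worker ran"), 5);
                    t.join();
                }
            }
            ```

            This only helps when the EAGAIN is transient (the kernel race described above); it does nothing if a hard pid/thread limit has genuinely been reached.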
            dg424 Donald Gobin added a comment -

            So this is saying that this "can" happen at any time in "any" Java program that uses multiple threads? And the fact that we only see this when our pipeline contains a git checkout stage is purely coincidental?

            I would request the backport, but I don't have an openjdk account.

             


            vlatombe Vincent Latombe added a comment -

            basil maybe?
            basil Basil Crow added a comment -

            Sorry Vincent, I have not read this thread. What is the reason you are pinging me specifically?


            vlatombe Vincent Latombe added a comment -

            basil I think you submitted a backport to the JDK lately. I think the issue described here may already be fixed on the jdk-18 and jdk-19 trees (this comment sums up my findings). Would you be able to build a backport on the jdk-11 branch so that I could test it?
            basil Basil Crow added a comment - - edited

            I fail to follow the reasoning in this comment. I do not see why you would need me to build a backport - just cherry-pick the change onto https://github.com/openjdk/jdk17u-dev or https://github.com/openjdk/jdk11u-dev and see if it fixes your problem. If you have a clean backport patch along with steps to reproduce the problem that fail before the patch and succeed after the patch, along with clear written reasoning about why the patch fixes the original problem, and are simply looking for someone who has signed the Oracle Contributor Agreement (OCA) to propose it, I might be willing to help, but for now I am unwatching this issue so that I do not receive further notifications.

            timja Tim Jacomb added a comment -

            vlatombe why not just try on Java 18?


            vlatombe Vincent Latombe added a comment -

            I could have; however, since Jenkins is only starting to support Java 17 in preview, I don't know what kind of surprises could arise from trying out a version that is not battle-tested in our context. In any case, I have built a jdk11 with the referenced commit cherry-picked and am currently running my reproduction harness to check whether the problem is gone.

            vlatombe Vincent Latombe added a comment -

            Cherry-picking https://github.com/openjdk/jdk/commit/e35005d5ce383ddd108096a3079b17cb0bcf76f1 onto jdk11 and running the harness overnight shows a very significant reduction in the number of occurrences of the problem (0.04% instead of 0.2% over 5000-6000 builds).
            basil Basil Crow added a comment -

            jenkinsci/remoting#523 has been released in Remoting 4.14 and Jenkins 2.348. This helps alleviate the problem to some degree, but it does not eliminate the problem.

            Backporting JDK-8268773 / openjdk/jdk@e35005d5ce3 showed a significant reduction of the number of occurrences of the problem (0.04% instead of 0.2% over 5000-6000 builds). JDK-8268773 / openjdk/jdk@e35005d5ce3 has been backported to jdk11u-dev in JDK-8286753 / openjdk/jdk11u-dev#1074 and to jdk17u-dev in JDK-8286629 / openjdk/jdk17u-dev#390.


            vlatombe Vincent Latombe added a comment -

            Thank you basil!
            rhinoceros rhinoceros.xn added a comment - - edited

            After adding sleep(10) before the git checkout, this problem no longer occurs.

            Maybe adding sleep(10) before the git or checkout step is a workaround.

            wasimj  dg424 

             

            sleep(10)
            checkout changelog: false, poll: false, scm: ........
            
            OR
            
            sleep(10)
            git branch: 'master', credentialsId: '******', url: 'git@git.yourcompany.com:xx/zz.git'
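            A possible variant of this workaround (my untested sketch, not from this thread): rather than a fixed sleep, wrap the checkout in Pipeline's built-in retry step so a failed attempt is simply re-run. The URL is a placeholder:

            ```groovy
            // Sketch: re-run a checkout that fails with the 65873 error,
            // instead of sleeping and hoping the race has passed.
            retry(3) {
                git branch: 'master', credentialsId: '******', url: 'git@git.example.com:xx/zz.git'
            }
            ```

            This assumes the error is transient, which matches the EAGAIN behaviour discussed above; it will not help if the failure is persistent.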
            matthewrgomes Matthew Gomes added a comment -

            rhinoceros Where did you set the git sleep? Is this the git plugin?

            rhinoceros rhinoceros.xn added a comment -

            matthewrgomes 

             

            Before the git OR checkout step:

            // code placeholder
            node(label) {
                stage('xxzzxasd') {
                    container('xxxx') {
                        stage('git clone') {

                            sleep(10) // ** HERE: adding sleep(10) before git checkout **

                            git branch: 'master', credentialsId: '', url: 'git@xxx.com:xxx/xxxxxxx.git'
                        }
                    }
                }
            }
            dg424 Donald Gobin added a comment -

            rhinoceros So if this is true, it indicates that the issue is some kind of race condition within the Jenkins pipeline flow. Your sleep essentially looks like it waits for all Jenkins threads for that pipeline to complete before continuing?

            rhinoceros rhinoceros.xn added a comment -

            dg424 Yes, I think so.

            I updated agent.jar two weeks ago with https://repo.jenkins-ci.org/incrementals/org/jenkins-ci/main/remoting/4.14-rc3000.5949ea_7370a_f/remoting-4.14-rc3000.5949ea_7370a_f.jar and ran it for hours; the problem had not diminished.

            I found that the problem occurs only at git checkout, so I tried adding sleep before the git checkout.

            matthewrgomes Matthew Gomes added a comment -

            The issue is only reproducible on Agent/ECS jobs that use the Git plugin https://plugins.jenkins.io/git/

             


            People

              vlatombe Vincent Latombe
              wasimj Wasim
              Votes: 9
              Watchers: 25
