Type: Bug
Resolution: Fixed
Priority: Major
Released as: 2.362
We regularly see issues with the jenkins/inbound-agent in our Jenkins logs on Kubernetes. The error seems to occur in around 1% of all jobs.
The error message is below.
While the error message refers to java.lang.OutOfMemoryError: unable to create new native thread, we have checked the pods and nodes in the cluster, and there is always sufficient memory and enough threads available at the time of the error.
The specific versions for this error message are:
jenkins/inbound-agent:4.3-4
Jenkins 2.263.4
However we have also seen this error occur with different versions of both the inbound-agent and Jenkins.
Also:   hudson.remoting.Channel$CallSiteStackTrace: Remote call to JNLP4-connect connection from ip-100-64-244-120.eu-west-1.compute.internal/100.64.244.120:39138
    at hudson.remoting.Channel.attachCallSiteStackTrace(Channel.java:1800)
    at hudson.remoting.UserRequest$ExceptionResponse.retrieve(UserRequest.java:357)
    at hudson.remoting.Channel.call(Channel.java:1001)
    at hudson.FilePath.act(FilePath.java:1157)
    at hudson.FilePath.act(FilePath.java:1146)
    at org.jenkinsci.plugins.gitclient.Git.getClient(Git.java:121)
    at hudson.plugins.git.GitSCM.createClient(GitSCM.java:904)
    at hudson.plugins.git.GitSCM.createClient(GitSCM.java:835)
    at hudson.plugins.git.GitSCM.checkout(GitSCM.java:1288)
    at org.jenkinsci.plugins.workflow.steps.scm.SCMStep.checkout(SCMStep.java:125)
    at org.jenkinsci.plugins.workflow.steps.scm.SCMStep$StepExecutionImpl.run(SCMStep.java:93)
    at org.jenkinsci.plugins.workflow.steps.scm.SCMStep$StepExecutionImpl.run(SCMStep.java:80)
    at org.jenkinsci.plugins.workflow.steps.SynchronousNonBlockingStepExecution.lambda$start$0(SynchronousNonBlockingStepExecution.java:47)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
java.lang.OutOfMemoryError: unable to create new native thread
    at java.lang.Thread.start0(Native Method)
    at java.lang.Thread.start(Thread.java:717)
    at java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:957)
    at java.util.concurrent.ThreadPoolExecutor.ensurePrestart(ThreadPoolExecutor.java:1603)
    at java.util.concurrent.ScheduledThreadPoolExecutor.delayedExecute(ScheduledThreadPoolExecutor.java:334)
    at java.util.concurrent.ScheduledThreadPoolExecutor.schedule(ScheduledThreadPoolExecutor.java:533)
    at jenkins.util.InterceptingScheduledExecutorService.schedule(InterceptingScheduledExecutorService.java:49)
    at org.jenkinsci.plugins.workflow.log.DelayBufferedOutputStream.reschedule(DelayBufferedOutputStream.java:72)
    at org.jenkinsci.plugins.workflow.log.DelayBufferedOutputStream.<init>(DelayBufferedOutputStream.java:68)
    at org.jenkinsci.plugins.workflow.log.BufferedBuildListener$Replacement.readResolve(BufferedBuildListener.java:77)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at java.io.ObjectStreamClass.invokeReadResolve(ObjectStreamClass.java:1260)
    at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2133)
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1625)
    at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2342)
    at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2266)
    at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2124)
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1625)
    at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2342)
    at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2266)
    at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2124)
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1625)
    at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2342)
    at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2266)
    at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2124)
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1625)
    at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2342)
    at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2266)
    at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2124)
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1625)
    at java.io.ObjectInputStream.readObject(ObjectInputStream.java:465)
    at java.io.ObjectInputStream.readObject(ObjectInputStream.java:423)
    at hudson.remoting.UserRequest.deserialize(UserRequest.java:290)
    at hudson.remoting.UserRequest.perform(UserRequest.java:189)
    at hudson.remoting.UserRequest.perform(UserRequest.java:54)
    at hudson.remoting.Request$2.run(Request.java:369)
    at hudson.remoting.InterceptingExecutorService$1.call(InterceptingExecutorService.java:72)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at hudson.remoting.Engine$1.lambda$newThread$0(Engine.java:117)
Caused: java.io.IOException: Remote call on JNLP4-connect connection from ip-100-64-244-120.eu-west-1.compute.internal/100.64.244.120:39138 failed
    at hudson.remoting.Channel.call(Channel.java:1007)
    at hudson.FilePath.act(FilePath.java:1157)
    at hudson.FilePath.act(FilePath.java:1146)
    at org.jenkinsci.plugins.gitclient.Git.getClient(Git.java:121)
    at hudson.plugins.git.GitSCM.createClient(GitSCM.java:904)
    at hudson.plugins.git.GitSCM.createClient(GitSCM.java:835)
    at hudson.plugins.git.GitSCM.checkout(GitSCM.java:1288)
    at org.jenkinsci.plugins.workflow.steps.scm.SCMStep.checkout(SCMStep.java:125)
    at org.jenkinsci.plugins.workflow.steps.scm.SCMStep$StepExecutionImpl.run(SCMStep.java:93)
    at org.jenkinsci.plugins.workflow.steps.scm.SCMStep$StepExecutionImpl.run(SCMStep.java:80)
    at org.jenkinsci.plugins.workflow.steps.SynchronousNonBlockingStepExecution.lambda$start$0(SynchronousNonBlockingStepExecution.java:47)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
is duplicated by:
- JENKINS-70047 k8s cloud slave pod unable to create native thread: possibly out of memory or process/resource limits reached (Closed)

is related to:
- JENKINS-64904 kubernetes agent: sometimes, java.lang.OutOfMemoryError: unable to create new native thread (Open)

relates to:
- JENKINS-68199 java.lang.OutOfMemoryError: unable to create new native thread (Closed)
- JENKINS-68954 Agent with remoting 4.14+ hangs up or show response time >5s (Fixed but Unreleased)
[JENKINS-65873] java.lang.OutOfMemoryError: unable to create new native thread
The limit you chose of 5000 would still have failed for me. My Jenkins agent runs on a modestly powered Windows instance. Ideally this limit would come from a parameter or system property. A system property via -D in the agent launcher script? I'm not sure what the usual project standard is for such settings. Is it part of the Node config in the Jenkins UI?
The limit you chose of 5000 would still have failed for me.
Right - this wasn't intended to be the actual production limit, but just a way to get the problem to reproduce locally.
A system property via -D in the agent launcher script? I'm not sure what the usual project standard is for such settings. Is it part of the Node config in the Jenkins UI?
I'm not too sure either. This is where we'd really need some design guidance from maintainers. jthompson, are you still maintaining Remoting?
Yes, basil, I am still maintaining Remoting, but on a completely volunteer basis these days. This month and next, my free time is very limited, so I'm not going to have much time for prepping a change and testing it. I'd be happy to look over a PR, though, especially if it came with sufficient explanation and testing. I'm not sure how much of this area I already understand.
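For illustration, reading such a limit from a system property (the conventional pattern for Jenkins/Remoting tunables) could look like the sketch below. The property name hudson.remoting.maxThreads and the default of 5000 are hypothetical, not an actual Remoting setting:

public class ThreadLimit {
    // Hypothetical property name and default; illustrative only.
    private static final String PROP = "hudson.remoting.maxThreads";
    private static final int DEFAULT_LIMIT = 5000;

    public static int get() {
        // Integer.getInteger falls back to the default when the property
        // is unset or not parseable as an integer.
        return Integer.getInteger(PROP, DEFAULT_LIMIT);
    }

    public static void main(String[] args) {
        // Launch example: java -Dhudson.remoting.maxThreads=2000 ThreadLimit
        System.out.println("Effective thread limit: " + get());
    }
}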
Reformatted the stack trace in the description to make it not display so many pairs of braces.
We have been seeing the same issue in our environment. basil / cpholt, do you have a Docker image with the fix to try?
We use a self-prepared Docker image based on Ubuntu Focal and remoting-4.10.jar. We observe the same issue running as a pod in our K8s cluster.
apiVersion: "v1" kind: "Pod" metadata: labels: jenkins: "slave" jenkins/label-digest: "168c12f11d09a233175f435329c242e1f2f941f9" jenkins/label: "jenkins-slave-simple" name: "jenkins-slave-simple-w4z4f" spec: containers: - env: - name: "JENKINS_SECRET" value: "********" - name: "JENKINS_AGENT_NAME" value: "jenkins-slave-simple-w4z4f" - name: "JENKINS_NAME" value: "jenkins-slave-simple-w4z4f" - name: "JENKINS_AGENT_WORKDIR" value: "/home/jenkins" - name: "JENKINS_URL" value: "https://<xxx>" image: "registry<xxx>/jenkins-slave-simple:4.10" imagePullPolicy: "Always" name: "jnlp" resources: limits: memory: "1024Mi" cpu: "500m" requests: memory: "512Mi" cpu: "100m" tty: true volumeMounts: - mountPath: "/home/jenkins" name: "workspace-volume" readOnly: false workingDir: "/home/jenkins" hostNetwork: false imagePullSecrets: - name: "registry-gitlab" nodeSelector: kubernetes.io/os: "linux" restartPolicy: "Never" volumes: - emptyDir: medium: "" name: "workspace-volume" Running on jenkins-slave-simple-w4z4f in /home/jenkins/workspace/<xxx> [Pipeline] { [Pipeline] stage [Pipeline] { (Checkout) [Pipeline] deleteDir [Pipeline] withCredentials Masking supported pattern matches of $BBUser [Pipeline] { [Pipeline] sh [Pipeline] } [Pipeline] // withCredentials [Pipeline] } [Pipeline] // stage [Pipeline] emailext Request made to compress build log #648811 is still in progress; ignoring for purposes of comparison Sending email to: <xxx> [Pipeline] } [Pipeline] // node [Pipeline] End of Pipeline Also: hudson.remoting.Channel$CallSiteStackTrace: Remote call to JNLP4-connect connection from 10.42.2.0/10.42.2.0:35994 at hudson.remoting.Channel.attachCallSiteStackTrace(Channel.java:1795) at hudson.remoting.UserRequest$ExceptionResponse.retrieve(UserRequest.java:356) at hudson.remoting.Channel.call(Channel.java:1001) at hudson.Launcher$RemoteLauncher.launch(Launcher.java:1123) at hudson.Launcher$ProcStarter.start(Launcher.java:508) at org.jenkinsci.plugins.durabletask.BourneShellScript.launchWithCookie(BourneShellScript.java:176) at org.jenkinsci.plugins.durabletask.FileMonitoringTask.launch(FileMonitoringTask.java:136) at org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep$Execution.start(DurableTaskStep.java:320) at org.jenkinsci.plugins.workflow.cps.DSL.invokeStep(DSL.java:319) at org.jenkinsci.plugins.workflow.cps.DSL.invokeMethod(DSL.java:193) at org.jenkinsci.plugins.workflow.cps.CpsScript.invokeMethod(CpsScript.java:122) at jdk.internal.reflect.GeneratedMethodAccessor42730.invoke(Unknown Source) at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source) at java.base/java.lang.reflect.Method.invoke(Unknown Source) at org.codehaus.groovy.reflection.CachedMethod.invoke(CachedMethod.java:93) at groovy.lang.MetaMethod.doMethodInvoke(MetaMethod.java:325) at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1213) at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1022) at org.codehaus.groovy.runtime.callsite.PogoMetaClassSite.call(PogoMetaClassSite.java:42) at org.codehaus.groovy.runtime.callsite.CallSiteArray.defaultCall(CallSiteArray.java:48) at org.codehaus.groovy.runtime.callsite.AbstractCallSite.call(AbstractCallSite.java:113) at org.kohsuke.groovy.sandbox.impl.Checker$1.call(Checker.java:163) at org.kohsuke.groovy.sandbox.GroovyInterceptor.onMethodCall(GroovyInterceptor.java:23) at org.jenkinsci.plugins.scriptsecurity.sandbox.groovy.SandboxInterceptor.onMethodCall(SandboxInterceptor.java:158) at 
org.kohsuke.groovy.sandbox.impl.Checker$1.call(Checker.java:161) at org.kohsuke.groovy.sandbox.impl.Checker.checkedCall(Checker.java:165) at org.kohsuke.groovy.sandbox.impl.Checker.checkedCall(Checker.java:135) at org.kohsuke.groovy.sandbox.impl.Checker.checkedCall(Checker.java:135) at org.kohsuke.groovy.sandbox.impl.Checker.checkedCall(Checker.java:135) at com.cloudbees.groovy.cps.sandbox.SandboxInvoker.methodCall(SandboxInvoker.java:17) at com.cloudbees.groovy.cps.impl.ContinuationGroup.methodCall(ContinuationGroup.java:86) at com.cloudbees.groovy.cps.impl.FunctionCallBlock$ContinuationImpl.dispatchOrArg(FunctionCallBlock.java:113) at com.cloudbees.groovy.cps.impl.FunctionCallBlock$ContinuationImpl.fixArg(FunctionCallBlock.java:83) at jdk.internal.reflect.GeneratedMethodAccessor518.invoke(Unknown Source) at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source) at java.base/java.lang.reflect.Method.invoke(Unknown Source) at com.cloudbees.groovy.cps.impl.ContinuationPtr$ContinuationImpl.receive(ContinuationPtr.java:72) at com.cloudbees.groovy.cps.impl.ContinuationGroup.methodCall(ContinuationGroup.java:89) at com.cloudbees.groovy.cps.impl.FunctionCallBlock$ContinuationImpl.dispatchOrArg(FunctionCallBlock.java:113) at com.cloudbees.groovy.cps.impl.FunctionCallBlock$ContinuationImpl.fixArg(FunctionCallBlock.java:83) at jdk.internal.reflect.GeneratedMethodAccessor518.invoke(Unknown Source) at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source) at java.base/java.lang.reflect.Method.invoke(Unknown Source) at com.cloudbees.groovy.cps.impl.ContinuationPtr$ContinuationImpl.receive(ContinuationPtr.java:72) at com.cloudbees.groovy.cps.impl.PropertyishBlock$ContinuationImpl.get(PropertyishBlock.java:76) at com.cloudbees.groovy.cps.LValueBlock$GetAdapter.receive(LValueBlock.java:30) at com.cloudbees.groovy.cps.impl.PropertyishBlock$ContinuationImpl.fixName(PropertyishBlock.java:66) at jdk.internal.reflect.GeneratedMethodAccessor609.invoke(Unknown Source) at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source) at java.base/java.lang.reflect.Method.invoke(Unknown Source) at com.cloudbees.groovy.cps.impl.ContinuationPtr$ContinuationImpl.receive(ContinuationPtr.java:72) at com.cloudbees.groovy.cps.impl.ConstantBlock.eval(ConstantBlock.java:21) at com.cloudbees.groovy.cps.Next.step(Next.java:83) at com.cloudbees.groovy.cps.Continuable$1.call(Continuable.java:174) at com.cloudbees.groovy.cps.Continuable$1.call(Continuable.java:163) at org.codehaus.groovy.runtime.GroovyCategorySupport$ThreadCategoryInfo.use(GroovyCategorySupport.java:129) at org.codehaus.groovy.runtime.GroovyCategorySupport.use(GroovyCategorySupport.java:268) at com.cloudbees.groovy.cps.Continuable.run0(Continuable.java:163) at org.jenkinsci.plugins.workflow.cps.SandboxContinuable.access$001(SandboxContinuable.java:18) at org.jenkinsci.plugins.workflow.cps.SandboxContinuable.run0(SandboxContinuable.java:51) at org.jenkinsci.plugins.workflow.cps.CpsThread.runNextChunk(CpsThread.java:185) at org.jenkinsci.plugins.workflow.cps.CpsThreadGroup.run(CpsThreadGroup.java:400) at org.jenkinsci.plugins.workflow.cps.CpsThreadGroup.access$400(CpsThreadGroup.java:96) at org.jenkinsci.plugins.workflow.cps.CpsThreadGroup$2.call(CpsThreadGroup.java:312) at org.jenkinsci.plugins.workflow.cps.CpsThreadGroup$2.call(CpsThreadGroup.java:276) at org.jenkinsci.plugins.workflow.cps.CpsVmExecutorService$2.call(CpsVmExecutorService.java:67) at 
java.base/java.util.concurrent.FutureTask.run(Unknown Source) at hudson.remoting.SingleLaneExecutorService$1.run(SingleLaneExecutorService.java:139) at jenkins.util.ContextResettingExecutorService$1.run(ContextResettingExecutorService.java:28) at jenkins.security.ImpersonatingExecutorService$1.run(ImpersonatingExecutorService.java:68) at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source) at java.base/java.util.concurrent.FutureTask.run(Unknown Source) at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) at java.base/java.lang.Thread.run(Unknown Source) java.lang.OutOfMemoryError: unable to create new native thread at java.lang.Thread.start0(Native Method) at java.lang.Thread.start(Thread.java:717) at java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:957) at java.util.concurrent.ThreadPoolExecutor.ensurePrestart(ThreadPoolExecutor.java:1603) at java.util.concurrent.ScheduledThreadPoolExecutor.delayedExecute(ScheduledThreadPoolExecutor.java:334) at java.util.concurrent.ScheduledThreadPoolExecutor.schedule(ScheduledThreadPoolExecutor.java:533) at jenkins.util.InterceptingScheduledExecutorService.schedule(InterceptingScheduledExecutorService.java:49) at org.jenkinsci.plugins.workflow.log.DelayBufferedOutputStream.reschedule(DelayBufferedOutputStream.java:72) at org.jenkinsci.plugins.workflow.log.DelayBufferedOutputStream.<init>(DelayBufferedOutputStream.java:68) at org.jenkinsci.plugins.workflow.log.BufferedBuildListener$Replacement.readResolve(BufferedBuildListener.java:77) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at java.io.ObjectStreamClass.invokeReadResolve(ObjectStreamClass.java:1274) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2196) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1667) at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2405) at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2329) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2187) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1667) at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2405) at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2329) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2187) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1667) at java.io.ObjectInputStream.readObject(ObjectInputStream.java:503) at java.io.ObjectInputStream.readObject(ObjectInputStream.java:461) at hudson.remoting.UserRequest.deserialize(UserRequest.java:289) at hudson.remoting.UserRequest.perform(UserRequest.java:189) at hudson.remoting.UserRequest.perform(UserRequest.java:54) at hudson.remoting.Request$2.run(Request.java:376) at hudson.remoting.InterceptingExecutorService.lambda$wrap$0(InterceptingExecutorService.java:78) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at hudson.remoting.Engine$1.lambda$newThread$0(Engine.java:122) at java.lang.Thread.run(Thread.java:748) 
Caused: java.io.IOException: Remote call on JNLP4-connect connection from 10.42.2.0/10.42.2.0:35994 failed at hudson.remoting.Channel.call(Channel.java:1005) at hudson.Launcher$RemoteLauncher.launch(Launcher.java:1123) at hudson.Launcher$ProcStarter.start(Launcher.java:508) at org.jenkinsci.plugins.durabletask.BourneShellScript.launchWithCookie(BourneShellScript.java:176) at org.jenkinsci.plugins.durabletask.FileMonitoringTask.launch(FileMonitoringTask.java:136) at org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep$Execution.start(DurableTaskStep.java:320) at org.jenkinsci.plugins.workflow.cps.DSL.invokeStep(DSL.java:319) at org.jenkinsci.plugins.workflow.cps.DSL.invokeMethod(DSL.java:193) at org.jenkinsci.plugins.workflow.cps.CpsScript.invokeMethod(CpsScript.java:122) at jdk.internal.reflect.GeneratedMethodAccessor42730.invoke(Unknown Source) at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source) at java.base/java.lang.reflect.Method.invoke(Unknown Source) at org.codehaus.groovy.reflection.CachedMethod.invoke(CachedMethod.java:93) at groovy.lang.MetaMethod.doMethodInvoke(MetaMethod.java:325) at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1213) at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1022) at org.codehaus.groovy.runtime.callsite.PogoMetaClassSite.call(PogoMetaClassSite.java:42) at org.codehaus.groovy.runtime.callsite.CallSiteArray.defaultCall(CallSiteArray.java:48) at org.codehaus.groovy.runtime.callsite.AbstractCallSite.call(AbstractCallSite.java:113) at org.kohsuke.groovy.sandbox.impl.Checker$1.call(Checker.java:163) at org.kohsuke.groovy.sandbox.GroovyInterceptor.onMethodCall(GroovyInterceptor.java:23) at org.jenkinsci.plugins.scriptsecurity.sandbox.groovy.SandboxInterceptor.onMethodCall(SandboxInterceptor.java:158) at org.kohsuke.groovy.sandbox.impl.Checker$1.call(Checker.java:161) at org.kohsuke.groovy.sandbox.impl.Checker.checkedCall(Checker.java:165) at org.kohsuke.groovy.sandbox.impl.Checker.checkedCall(Checker.java:135) at org.kohsuke.groovy.sandbox.impl.Checker.checkedCall(Checker.java:135) at org.kohsuke.groovy.sandbox.impl.Checker.checkedCall(Checker.java:135) at com.cloudbees.groovy.cps.sandbox.SandboxInvoker.methodCall(SandboxInvoker.java:17) at WorkflowScript.run(WorkflowScript:155) at ___cps.transform___(Native Method) at com.cloudbees.groovy.cps.impl.ContinuationGroup.methodCall(ContinuationGroup.java:86) at com.cloudbees.groovy.cps.impl.FunctionCallBlock$ContinuationImpl.dispatchOrArg(FunctionCallBlock.java:113) at com.cloudbees.groovy.cps.impl.FunctionCallBlock$ContinuationImpl.fixArg(FunctionCallBlock.java:83) at jdk.internal.reflect.GeneratedMethodAccessor518.invoke(Unknown Source) at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source) at java.base/java.lang.reflect.Method.invoke(Unknown Source) at com.cloudbees.groovy.cps.impl.ContinuationPtr$ContinuationImpl.receive(ContinuationPtr.java:72) at com.cloudbees.groovy.cps.impl.ContinuationGroup.methodCall(ContinuationGroup.java:89) at com.cloudbees.groovy.cps.impl.FunctionCallBlock$ContinuationImpl.dispatchOrArg(FunctionCallBlock.java:113) at com.cloudbees.groovy.cps.impl.FunctionCallBlock$ContinuationImpl.fixArg(FunctionCallBlock.java:83) at jdk.internal.reflect.GeneratedMethodAccessor518.invoke(Unknown Source) at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source) at java.base/java.lang.reflect.Method.invoke(Unknown Source) at 
com.cloudbees.groovy.cps.impl.ContinuationPtr$ContinuationImpl.receive(ContinuationPtr.java:72) at com.cloudbees.groovy.cps.impl.PropertyishBlock$ContinuationImpl.get(PropertyishBlock.java:76) at com.cloudbees.groovy.cps.LValueBlock$GetAdapter.receive(LValueBlock.java:30) at com.cloudbees.groovy.cps.impl.PropertyishBlock$ContinuationImpl.fixName(PropertyishBlock.java:66) at jdk.internal.reflect.GeneratedMethodAccessor609.invoke(Unknown Source) at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source) at java.base/java.lang.reflect.Method.invoke(Unknown Source) at com.cloudbees.groovy.cps.impl.ContinuationPtr$ContinuationImpl.receive(ContinuationPtr.java:72) at com.cloudbees.groovy.cps.impl.ConstantBlock.eval(ConstantBlock.java:21) at com.cloudbees.groovy.cps.Next.step(Next.java:83) at com.cloudbees.groovy.cps.Continuable$1.call(Continuable.java:174) at com.cloudbees.groovy.cps.Continuable$1.call(Continuable.java:163) at org.codehaus.groovy.runtime.GroovyCategorySupport$ThreadCategoryInfo.use(GroovyCategorySupport.java:129) at org.codehaus.groovy.runtime.GroovyCategorySupport.use(GroovyCategorySupport.java:268) at com.cloudbees.groovy.cps.Continuable.run0(Continuable.java:163) at org.jenkinsci.plugins.workflow.cps.SandboxContinuable.access$001(SandboxContinuable.java:18) at org.jenkinsci.plugins.workflow.cps.SandboxContinuable.run0(SandboxContinuable.java:51) at org.jenkinsci.plugins.workflow.cps.CpsThread.runNextChunk(CpsThread.java:185) at org.jenkinsci.plugins.workflow.cps.CpsThreadGroup.run(CpsThreadGroup.java:400) at org.jenkinsci.plugins.workflow.cps.CpsThreadGroup.access$400(CpsThreadGroup.java:96) at org.jenkinsci.plugins.workflow.cps.CpsThreadGroup$2.call(CpsThreadGroup.java:312) at org.jenkinsci.plugins.workflow.cps.CpsThreadGroup$2.call(CpsThreadGroup.java:276) at org.jenkinsci.plugins.workflow.cps.CpsVmExecutorService$2.call(CpsVmExecutorService.java:67) at java.base/java.util.concurrent.FutureTask.run(Unknown Source) at hudson.remoting.SingleLaneExecutorService$1.run(SingleLaneExecutorService.java:139) at jenkins.util.ContextResettingExecutorService$1.run(ContextResettingExecutorService.java:28) at jenkins.security.ImpersonatingExecutorService$1.run(ImpersonatingExecutorService.java:68) at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source) at java.base/java.util.concurrent.FutureTask.run(Unknown Source) at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) at java.base/java.lang.Thread.run(Unknown Source) Finished: FAILURE
The problem seems to be with the Jenkins core itself and the way it spawns threads to log messages, as can be seen from the stack trace:
java.lang.OutOfMemoryError: unable to create new native thread
***********************************************************************************************************
    at java.lang.Thread.start0(Native Method)
    at java.lang.Thread.start(Thread.java:717)
    at java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:957)
    at java.util.concurrent.ThreadPoolExecutor.ensurePrestart(ThreadPoolExecutor.java:1603)
    at java.util.concurrent.ScheduledThreadPoolExecutor.delayedExecute(ScheduledThreadPoolExecutor.java:334)
    at java.util.concurrent.ScheduledThreadPoolExecutor.schedule(ScheduledThreadPoolExecutor.java:533)
    at jenkins.util.InterceptingScheduledExecutorService.schedule(InterceptingScheduledExecutorService.java:49)
    at org.jenkinsci.plugins.workflow.log.DelayBufferedOutputStream.reschedule(DelayBufferedOutputStream.java:72)
    at org.jenkinsci.plugins.workflow.log.DelayBufferedOutputStream.<init>(DelayBufferedOutputStream.java:68)
    at org.jenkinsci.plugins.workflow.log.BufferedBuildListener$Replacement.readResolve(BufferedBuildListener.java:77)
***********************************************************************************************************
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at java.io.ObjectStreamClass.invokeReadResolve(ObjectStreamClass.java:1260)
    at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2133)
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1625)
    at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2342)
    at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2266)
    at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2124)
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1625)
    at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2342)
    at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2266)
    at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2124)
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1625)
    at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2342)
    at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2266)
    at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2124)
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1625)
    at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2342)
    at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2266)
    at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2124)
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1625)
    at java.io.ObjectInputStream.readObject(ObjectInputStream.java:465)
    at java.io.ObjectInputStream.readObject(ObjectInputStream.java:423)
    at hudson.remoting.UserRequest.deserialize(UserRequest.java:290)
    at hudson.remoting.UserRequest.perform(UserRequest.java:189)
    at hudson.remoting.UserRequest.perform(UserRequest.java:54)
    at hudson.remoting.Request$2.run(Request.java:369)
    at hudson.remoting.InterceptingExecutorService$1.call(InterceptingExecutorService.java:72)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at hudson.remoting.Engine$1.lambda$newThread$0(Engine.java:117)
This is from the server side as the workflow plugin does not exist on the agent side.
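For context, the pattern visible in the trace — a buffered stream whose constructor immediately schedules its own flush task — looks roughly like the following simplified Java sketch (not the actual workflow-api source):

import java.io.FilterOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Simplified sketch of the pattern in the stack trace: the constructor
// immediately schedules a flush task, so merely deserializing the listener
// (readResolve -> new DelayBufferedOutputStream) can require the scheduler
// to start a new native thread -- which is where the OOM surfaces.
class DelayBufferedStreamSketch extends FilterOutputStream {
    private final ScheduledExecutorService timer;
    private final long delayMillis;

    DelayBufferedStreamSketch(OutputStream out, ScheduledExecutorService timer, long delayMillis) {
        super(out);
        this.timer = timer;
        this.delayMillis = delayMillis;
        reschedule(); // scheduling starts in the constructor, as in the trace
    }

    private void reschedule() {
        timer.schedule(this::flushSafely, delayMillis, TimeUnit.MILLISECONDS);
    }

    private void flushSafely() {
        try {
            flush();
        } catch (IOException ignored) {
            // a real implementation would log this
        }
        reschedule();
    }
}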
I submitted PR #505, which should allow testing whether the current class check is part of the problem without impacting normal operations.
Hi vlatombe
But I see the remoting stack is on both sides (remoting.jar is in the jenkins.war file as well), and the stack trace in my comment above shows classes (org.jenkinsci.plugins.workflow.log, jenkins.util.InterceptingScheduledExecutorService) that I cannot find in remoting.jar on the agent side. So I'm actually not sure where the OOM is happening; if your PR addresses only the agent side, does that mean the root cause of the exception is on the agent side but the error shows up on the server side? I'm confused.
I see org.jenkinsci.plugins.workflow.log classes in these files on an agent:
- remoting/jarCache/06/D303140AA1A4E2367F9A63F58D3127.jar (workflow-api 1136.v7f5f1759dc16)
- remoting/jarCache/AA/E8875DDC0E79929E944D30636208F6.jar (workflow-api 1108.v57edf648f5d4)
- remoting/jarCache/EC/7A1A038FDCBC2456010A181E58E35B.jar (workflow-api 1122.v7a_916f363c86)
I don't know whether those file names are hashes or just random. Anyway, it's conceivable that the agent could load org.jenkinsci.plugins.workflow.log.DelayBufferedOutputStream etc. from these files.
Oh, Checksum.java first computes an SHA-256 hash, but then splits that to two 128-bit parts and xors them together. It's not just a truncated hash as in section 5.1 of NIST Special Publication 800-107 Revision 1.
$ sha256sum remoting/jarCache/06/D303140AA1A4E2367F9A63F58D3127.jar
38165eeaa9e20f4a5bdced3d142660b13ec55dfea343aba86da3775ee1ab5196 *remoting/jarCache/06/D303140AA1A4E2367F9A63F58D3127.jar
38165eeaa9e20f4a5bdced3d142660b1 xor 3ec55dfea343aba86da3775ee1ab5196 = 06D303140AA1A4E2367F9A63F58D3127
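That folding is easy to verify in a few lines of Java (a sketch; the jar path is just the example above):

import java.nio.file.Files;
import java.nio.file.Paths;
import java.security.MessageDigest;

// Sketch of the hash folding described above: SHA-256 the jar, split the
// 256-bit digest into two 128-bit halves, and XOR them to get the cache key.
public class JarCacheKey {
    public static void main(String[] args) throws Exception {
        byte[] jar = Files.readAllBytes(
                Paths.get("remoting/jarCache/06/D303140AA1A4E2367F9A63F58D3127.jar"));
        byte[] digest = MessageDigest.getInstance("SHA-256").digest(jar); // 32 bytes
        StringBuilder key = new StringBuilder();
        for (int i = 0; i < 16; i++) {
            // XOR byte i of the first half with byte i of the second half.
            key.append(String.format("%02X", (digest[i] ^ digest[i + 16]) & 0xFF));
        }
        System.out.println(key); // expect 06D303140AA1A4E2367F9A63F58D3127
    }
}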
Hi kon,
Thanks. I see it now. So, these classes are "shipped" to the agent at runtime. If I fire up the agent and do not start a job, the classes do not exist. Just trying to understand how the process works...
Tried the PR on my stress test job and still get OOM with
-Dorg.jenkinsci.remoting.util.AnonymousClassWarnings.useSeparateThreadPool=true
java.lang.OutOfMemoryError: unable to create new native thread at java.lang.Thread.start0(Native Method) at java.lang.Thread.start(Thread.java:717) at hudson.remoting.AtmostOneThreadExecutor.execute(AtmostOneThreadExecutor.java:104) at java.util.concurrent.AbstractExecutorService.submit(AbstractExecutorService.java:112) at org.jenkinsci.remoting.util.AnonymousClassWarnings.check(AnonymousClassWarnings.java:73) at org.jenkinsci.remoting.util.AnonymousClassWarnings$1.annotateClass(AnonymousClassWarnings.java:130) at java.io.ObjectOutputStream.writeNonProxyDesc(ObjectOutputStream.java:1290) at java.io.ObjectOutputStream.writeClassDesc(ObjectOutputStream.java:1231) at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1427) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178) at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348) at hudson.remoting.Command.writeTo(Command.java:111) at hudson.remoting.AbstractByteBufferCommandTransport.write(AbstractByteBufferCommandTransport.java:287) at hudson.remoting.Channel.send(Channel.java:766) at hudson.remoting.Request.callAsync(Request.java:238) at hudson.remoting.Channel.callAsync(Channel.java:1030) at hudson.remoting.RemoteInvocationHandler.invoke(RemoteInvocationHandler.java:285) at com.sun.proxy.$Proxy3.notifyJarPresence(Unknown Source) at hudson.remoting.FileSystemJarCache.lookInCache(FileSystemJarCache.java:80) at hudson.remoting.JarCacheSupport.resolve(JarCacheSupport.java:49) at hudson.remoting.ResourceImageInJar._resolveJarURL(ResourceImageInJar.java:93) at hudson.remoting.ResourceImageInJar.resolve(ResourceImageInJar.java:45) at hudson.remoting.RemoteClassLoader.loadRemoteClass(RemoteClassLoader.java:284) at hudson.remoting.RemoteClassLoader.loadWithMultiClassLoader(RemoteClassLoader.java:264) at hudson.remoting.RemoteClassLoader.findClass(RemoteClassLoader.java:223) at java.lang.ClassLoader.loadClass(ClassLoader.java:418) at java.lang.ClassLoader.loadClass(ClassLoader.java:351) at org.jenkinsci.plugins.gitclient.Git$GitAPIMasterToSlaveFileCallable.invoke(Git.java:173) at org.jenkinsci.plugins.gitclient.Git$GitAPIMasterToSlaveFileCallable.invoke(Git.java:154) at hudson.FilePath$FileCallableWrapper.call(FilePath.java:3317) at hudson.remoting.UserRequest.perform(UserRequest.java:211) at hudson.remoting.UserRequest.perform(UserRequest.java:54) at hudson.remoting.Request$2.run(Request.java:376) at hudson.remoting.InterceptingExecutorService.lambda$wrap$0(InterceptingExecutorService.java:78) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at hudson.remoting.Engine$1.lambda$newThread$0(Engine.java:121)
One thing I forgot to mention that is important - for us, this failure only occurs WHEN the pipeline has a git checkout stage.
I did some testing recently to troubleshoot this problem.
I was able to capture a thread dump and a heap dump by catching the OOM in Remoting and generating the dumps.
The JVM heap is very low (< 10 MB). Thread dumps reveal that there are only ~50 threads when the issue happens. With a default stack size of 256 KB, that is only about 13 MB of stack, so I doubt it has much of an effect. So I am not convinced this is due to those pools spawning too many threads.
Note that ulimits are very high within the jnlp container:
$ ulimit -a
core file size          (blocks, -c) unlimited
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 128448
max locked memory       (kbytes, -l) 16384
max memory size         (kbytes, -m) unlimited
open files                      (-n) 1048576
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 8192
cpu time               (seconds, -t) unlimited
max user processes              (-u) unlimited
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited
$ cat /proc/sys/kernel/threads-max
256897
$ cat /sys/fs/cgroup/pids/pids.max
max
Now maybe I am looking at this wrong. Given such high limits, maybe the jnlp container of another pod that is not failing is consuming those limits and impacting other containers. That being said, the external systems that I have in place (GKE and Datadog) do not show a spike in PIDs or anything explicit.
So I thought that maybe this is an off-heap memory issue, or just some isolation behavior that happens during Remoting class loading.
Now I have enabled NMT tracking, and things get interesting, though I am not too familiar with memory management at that level. What I see is that despite giving the jnlp container a limit of, for example, 500Mi, the reserved memory is higher than I expected:
Total: reserved=1538742KB, committed=133862KB
Most of it comes from the Metaspace:

- Class (reserved=1070101KB, committed=22805KB)
    (classes #3545)
    (malloc=1045KB #4970)
    (mmap: reserved=1069056KB, committed=21760KB)
When I don't set a container limit, the reserved memory is considerably higher (for a k8s node with a capacity of 16G):
Total: reserved=4600892KB, committed=310052KB
I am not sure if this is part of the problem or not. That being said, the Class/Metaspace size seems to be a constant ~1GB. When raising the container limit to 2Gi or more, the reserved memory for Class/Metaspace is similar, and the total amount of reserved memory is lower than the container limit!
I don't have enough knowledge in this area to know if that is related, but maybe someone here does. That area of off-heap memory can be controlled with -XX:MaxMetaspaceSize and -XX:CompressedClassSpaceSize. Maybe setting those helps mitigate the problem, though I would not know the impact or the right values, e.g. -XX:MaxMetaspaceSize=100m -XX:CompressedClassSpaceSize=100m.
Still investigating...
Default JVM ergonomics are (https://docs.oracle.com/en/java/javase/11/gctuning/ergonomics.html):
- Maximum heap size of 1/4 of physical memory (accounts for container limits)
- Metaspace size defaults to 1g (https://docs.oracle.com/en/java/javase/11/gctuning/other-considerations.html#GUID-B29C9153-3530-4C15-9154-E74F44E3DAD9)
I would tend to believe that the container would get OOMKilled if it went over the defined container limit which isn't the case here.
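As a quick sanity check of what those ergonomics actually selected inside the jnlp container, a tiny probe like this can be run in the container (a hedged sketch; the class name is arbitrary):

// Print the heap ceiling the JVM ergonomics selected, to compare against the
// container limit. Run with the same JVM and flags as the agent process.
public class ErgoCheck {
    public static void main(String[] args) {
        long maxHeap = Runtime.getRuntime().maxMemory();
        System.out.printf("Max heap: %d MiB%n", maxHeap / (1024 * 1024));
        System.out.printf("Available processors: %d%n",
                Runtime.getRuntime().availableProcessors());
    }
}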
Yes, I've already tried tweaking the pod spec to set the memory limit, reduce the stack size, and add more CPU, and none of these worked; I still got the OOM eventually. The ulimits are fine, so it's not that either. Given the "random" nature of the error, it looks like some race condition where the code just spins out of control. Also note the related PR by Vincent here (https://github.com/jenkinsci/remoting/pull/505), which I've been putting under my stress test setup.
Hello dg424, did this PR solve the problem?
https://github.com/jenkinsci/remoting/pull/505#issuecomment-1046913986
dg424 would you be able to share the Jenkinsfile you use to reproduce the problem? Or is it too specific?
I can share the same stress test job layout, and you can plug in your own settings to reproduce.
pipeline {
  agent {
    kubernetes {
      inheritFrom 'k8s-default'
      containerTemplate {
        name 'mycontainer'
        image "someimage:latest"
        privileged false
        alwaysPullImage false
        workingDir '/home/jenkins'
        ttyEnabled true
        command 'cat'
        args ''
      }
      defaultContainer 'mycontainer'
    }
  }
  // run until we get 65873 error
  triggers {
    cron "* * * * *"
  }
  options {
    disableConcurrentBuilds()
  }
  stages {
    stage('Checkout SCM') {
      steps {
        checkout([$class: 'GitSCM',
                  branches: [[name: "FETCH_HEAD"]],
                  doGenerateSubmoduleConfigurations: false,
                  extensions: [[$class: 'SubmoduleOption',
                                disableSubmodules: false,
                                parentCredentials: true,
                                recursiveSubmodules: true,
                                reference: '',
                                trackingSubmodules: false]],
                  submoduleCfg: [],
                  userRemoteConfigs: [[credentialsId: 'mygit-cred',
                                       url: 'ssh://git@mycompany.net/test.git']]])
      }
    }
    stage('First stage') {
      steps {
        script {
          echo "Inside first stage"
        }
      }
    }
  }
  post {
    failure {
      // we ALWAYS get here eventually as a result of 65873 issue
      echo "Failure!"
      script {
        // we got the error, disable job and send email
        Jenkins.instance.getItemByFullName(env.JOB_NAME).doDisable()
        emailext body: "${BUILD_URL}",
                 subject: "[Jenkins]: ${JOB_NAME} build failed",
                 to: 'foo@bar.com'
      }
    }
  }
}
dg424 Do you have any limit defined at the infrastructure level that would apply cpu/memory limits to the pod and containers?
Yeah, we do, but I'm not sure this is causing the issue, as the pipeline runs for a very long time before getting the error (i.e. using the same resources that k8s has assigned to it on each run). Also, as you can see, the pipeline really doesn't do much; it checks out an empty git project, so in terms of resources it uses very little directly. But the key stage for reproduction is the one that does the checkout. Without this, the problem is not reproducible. I think this is why the other comment on GitHub suggested a downgrade of Git prior to when this ticket was opened.
dg424 Yes, I think something in the class loading that is done in preparation for the git checkout triggers the problem. But it could be more apparent in low-memory environments, so ideally I'd like to set up a reproducer that is as close as possible to something that can trigger the problem.
Since that test pipeline uses almost no resources, you should be able to quickly set one up directly on the Jenkins master to see if the problem exists in the area you suspect, as I don't think it matters in that case whether you're using a k8s agent or running directly on the master? If there is some kind of leak in the class loading area, it should eventually hit the problem?
I created a related ticket here - https://issues.jenkins.io/browse/JENKINS-68199 - for the git client side since vlatombe is thinking that the Jenkins side might be ok.
On the agent side on JDK 11, the following is printed when the problem occurs:
[14.901s][warning][os,thread] Failed to start thread - pthread_create failed (EAGAIN) for attributes: stacksize: 1024k, guardsize: 0k, detached.
This led me to https://stackoverflow.com/a/47082934, then this kernel bug from 2016, still not fixed.
I also found an interesting issue on the JDK issue tracker that led to https://bugs.openjdk.java.net/browse/JDK-8268773, and a fix in JDK 18 that does retries, so it could possibly work around the issue.
Does anyone care to attempt to backport this to, say, JDK 11?
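For illustration only: the JDK 18 change retries the native thread creation inside the VM itself, but the same retry-with-backoff idea can be sketched in user code like this (not the actual JDK patch; names are hypothetical):

import java.util.concurrent.TimeUnit;

public final class RetryingThreadStarter {
    // Create and start a thread for the task, retrying a few times with a
    // short backoff if thread creation fails. A pthread_create EAGAIN shows
    // up in Java as OutOfMemoryError, and per the discussion above it can be
    // transient. A fresh Thread object is used per attempt because a Thread
    // whose start() failed should not be reused.
    public static Thread startWithRetry(Runnable task, int maxAttempts)
            throws InterruptedException {
        for (int attempt = 1; ; attempt++) {
            Thread t = new Thread(task);
            try {
                t.start();
                return t;
            } catch (OutOfMemoryError e) { // "unable to create new native thread"
                if (attempt >= maxAttempts) {
                    throw e;
                }
                // Linear backoff: 50 ms, 100 ms, 150 ms, ...
                TimeUnit.MILLISECONDS.sleep(50L * attempt);
            }
        }
    }
}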
So this is saying that this "can" happen at any time in "any" Java program that uses multiple threads? The fact that we only see this when our pipeline contains a git checkout stage is purely coincidental?
I would request the backport, but I don't have an openjdk account.
Sorry Vincent, I have not read this thread. What is the reason you are pinging me specifically?
basil I think you submitted a backport to the JDK recently. I think the issue described here could already be fixed on the jdk-18 and jdk-19 trees (this comment sums up my findings). Would you be able to build a backport on the jdk-11 branch so that I could test it?
I fail to follow the reasoning in this comment. I do not see why you would need me to build a backport - just cherry-pick the change onto https://github.com/openjdk/jdk17u-dev or https://github.com/openjdk/jdk11u-dev and see if it fixes your problem. If you have a clean backport patch along with steps to reproduce the problem that fail before the patch and succeed after the patch, along with clear written reasoning about why the patch fixes the original problem, and are simply looking for someone who has signed the Oracle Contributor Agreement (OCA) to propose it, I might be willing to help, but for now I am unwatching this issue so that I do not receive further notifications.
I could have; however, since Jenkins is only starting to support Java 17 in preview, I don't know what kind of surprises could arise from trying out a version that is not battle-tested in our context. In any case, I have built a JDK 11 with the referenced commit cherry-picked, and I'm currently running my reproduction harness to check whether the problem is gone.
Cherry-picking https://github.com/openjdk/jdk/commit/e35005d5ce383ddd108096a3079b17cb0bcf76f1 onto JDK 11 and running the harness overnight shows a very significant reduction in the number of occurrences of the problem (0.04% instead of 0.2% over 5000-6000 builds).
jenkinsci/remoting#523 has been released in Remoting 4.14 and Jenkins 2.348. This helps alleviate the problem to some degree, but it does not eliminate the problem.
Backporting JDK-8268773 / openjdk/jdk@e35005d5ce3 showed a significant reduction of the number of occurrences of the problem (0.04% instead of 0.2% over 5000-6000 builds). JDK-8268773 / openjdk/jdk@e35005d5ce3 has been backported to jdk11u-dev in JDK-8286753 / openjdk/jdk11u-dev#1074 and to jdk17u-dev in JDK-8286629 / openjdk/jdk17u-dev#390.
After adding sleep(10) before the git checkout, this problem no longer occurs.
Maybe sleep(10) before the git or checkout step is a workaround.
sleep(10)
checkout changelog: false, poll: false, scm: ........

OR

sleep(10)
git branch: 'master', credentialsId: '******', url: 'git@git.yourcomampy.com:xx/zz.git'
before the git OR checkout step:

// code placeholder
node(label) {
  stage('xxzzxasd') {
    container('xxxx') {
      stage('git clone') {
        sleep(10) // ** HERE: adding sleep(10) before git checkout **
        git branch: 'master', credentialsId: '', url: 'git@xxx.com:xxx/xxxxxxx.git'
      }
    }
  }
}
rhinoceros so if this is true, then it indicates that the issue is some kind of race condition within the Jenkins pipeline flow. Your sleep essentially looks like it waits for all Jenkins threads for that pipeline to complete before continuing?
dg424 yes. I think so.
I had updated agent.jar two weeks ago with https://repo.jenkins-ci.org/incrementals/org/jenkins-ci/main/remoting/4.14-rc3000.5949ea_7370a_f/remoting-4.14-rc3000.5949ea_7370a_f.jar and ran it for hours; the problem had not diminished.
I found that the problem occurs only at the git checkout, so I tried adding sleep before the git checkout.
The issue is only reproducible on agent/ECS jobs that use the Git plugin: https://plugins.jenkins.io/git/
I upgraded to 2.363-jdk11 with agents on 4.10-3-jdk11 and am still getting the same error sporadically:
Also:   hudson.remoting.Channel$CallSiteStackTrace: Remote call to JNLP4-connect connection from 10.32.11.76/10.32.11.76:34066
    at hudson.remoting.Channel.attachCallSiteStackTrace(Channel.java:1784)
    at hudson.remoting.UserRequest$ExceptionResponse.retrieve(UserRequest.java:356)
    at hudson.remoting.Channel.call(Channel.java:1000)
    at hudson.FilePath.act(FilePath.java:1186)
    at hudson.FilePath.act(FilePath.java:1175)
    at org.jenkinsci.plugins.gitclient.Git.getClient(Git.java:140)
    at hudson.plugins.git.GitSCM.createClient(GitSCM.java:916)
    at hudson.plugins.git.GitSCM.createClient(GitSCM.java:847)
    at hudson.plugins.git.GitSCM.checkout(GitSCM.java:1297)
    at org.jenkinsci.plugins.workflow.steps.scm.SCMStep.checkout(SCMStep.java:129)
    at org.jenkinsci.plugins.workflow.steps.scm.SCMStep$StepExecutionImpl.run(SCMStep.java:97)
    at org.jenkinsci.plugins.workflow.steps.scm.SCMStep$StepExecutionImpl.run(SCMStep.java:84)
    at org.jenkinsci.plugins.workflow.steps.SynchronousNonBlockingStepExecution.lambda$start$0(SynchronousNonBlockingStepExecution.java:47)
    at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
    at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
    at java.base/java.lang.Thread.run(Thread.java:829)
java.lang.OutOfMemoryError: unable to create native thread: possibly out of memory or process/resource limits reached
    at java.base/java.lang.Thread.start0(Native Method)
    at java.base/java.lang.Thread.start(Unknown Source)
    at hudson.remoting.AtmostOneThreadExecutor.execute(AtmostOneThreadExecutor.java:104)
    at java.base/java.util.concurrent.AbstractExecutorService.submit(Unknown Source)
    at org.jenkinsci.remoting.util.ExecutorServiceUtils.submitAsync(ExecutorServiceUtils.java:58)
    at hudson.remoting.JarCacheSupport.resolve(JarCacheSupport.java:66)
    at hudson.remoting.ResourceImageInJar._resolveJarURL(ResourceImageInJar.java:93)
    at hudson.remoting.ResourceImageInJar.resolve(ResourceImageInJar.java:45)
    at hudson.remoting.RemoteClassLoader.loadRemoteClass(RemoteClassLoader.java:284)
    at hudson.remoting.RemoteClassLoader.loadWithMultiClassLoader(RemoteClassLoader.java:264)
    at hudson.remoting.RemoteClassLoader.findClass(RemoteClassLoader.java:223)
    at java.base/java.lang.ClassLoader.loadClass(Unknown Source)
    at java.base/java.lang.ClassLoader.loadClass(Unknown Source)
    at jenkins.util.Timer.get(Timer.java:47)
    at org.jenkinsci.plugins.workflow.log.DelayBufferedOutputStream.reschedule(DelayBufferedOutputStream.java:74)
    at org.jenkinsci.plugins.workflow.log.DelayBufferedOutputStream.<init>(DelayBufferedOutputStream.java:70)
    at org.jenkinsci.plugins.workflow.log.BufferedBuildListener$Replacement.readResolve(BufferedBuildListener.java:79)
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
    at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
    at java.base/java.lang.reflect.Method.invoke(Unknown Source)
    at java.base/java.io.ObjectStreamClass.invokeReadResolve(Unknown Source)
    at java.base/java.io.ObjectInputStream.readOrdinaryObject(Unknown Source)
    at java.base/java.io.ObjectInputStream.readObject0(Unknown Source)
    at java.base/java.io.ObjectInputStream.defaultReadFields(Unknown Source)
    at java.base/java.io.ObjectInputStream.readSerialData(Unknown Source)
    at java.base/java.io.ObjectInputStream.readOrdinaryObject(Unknown Source)
    at java.base/java.io.ObjectInputStream.readObject0(Unknown Source)
    at java.base/java.io.ObjectInputStream.defaultReadFields(Unknown Source)
    at java.base/java.io.ObjectInputStream.readSerialData(Unknown Source)
    at java.base/java.io.ObjectInputStream.readOrdinaryObject(Unknown Source)
    at java.base/java.io.ObjectInputStream.readObject0(Unknown Source)
    at java.base/java.io.ObjectInputStream.defaultReadFields(Unknown Source)
    at java.base/java.io.ObjectInputStream.readSerialData(Unknown Source)
    at java.base/java.io.ObjectInputStream.readOrdinaryObject(Unknown Source)
    at java.base/java.io.ObjectInputStream.readObject0(Unknown Source)
    at java.base/java.io.ObjectInputStream.defaultReadFields(Unknown Source)
    at java.base/java.io.ObjectInputStream.readSerialData(Unknown Source)
    at java.base/java.io.ObjectInputStream.readOrdinaryObject(Unknown Source)
    at java.base/java.io.ObjectInputStream.readObject0(Unknown Source)
    at java.base/java.io.ObjectInputStream.readObject(Unknown Source)
    at java.base/java.io.ObjectInputStream.readObject(Unknown Source)
    at hudson.remoting.UserRequest.deserialize(UserRequest.java:289)
    at hudson.remoting.UserRequest.perform(UserRequest.java:189)
    at hudson.remoting.UserRequest.perform(UserRequest.java:54)
    at hudson.remoting.Request$2.run(Request.java:376)
    at hudson.remoting.InterceptingExecutorService.lambda$wrap$0(InterceptingExecutorService.java:78)
    at java.base/java.util.concurrent.FutureTask.run(Unknown Source)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
    at hudson.remoting.Engine$1.lambda$newThread$0(Engine.java:122)
    at java.base/java.lang.Thread.run(Unknown Source)
Caused: java.io.IOException: Remote call on JNLP4-connect connection from 10.32.11.76/10.32.11.76:34066 failed
    at hudson.remoting.Channel.call(Channel.java:1004)
    at hudson.FilePath.act(FilePath.java:1186)
    at hudson.FilePath.act(FilePath.java:1175)
    at org.jenkinsci.plugins.gitclient.Git.getClient(Git.java:140)
    at hudson.plugins.git.GitSCM.createClient(GitSCM.java:916)
    at hudson.plugins.git.GitSCM.createClient(GitSCM.java:847)
    at hudson.plugins.git.GitSCM.checkout(GitSCM.java:1297)
    at org.jenkinsci.plugins.workflow.steps.scm.SCMStep.checkout(SCMStep.java:129)
    at org.jenkinsci.plugins.workflow.steps.scm.SCMStep$StepExecutionImpl.run(SCMStep.java:97)
    at org.jenkinsci.plugins.workflow.steps.scm.SCMStep$StepExecutionImpl.run(SCMStep.java:84)
    at org.jenkinsci.plugins.workflow.steps.SynchronousNonBlockingStepExecution.lambda$start$0(SynchronousNonBlockingStepExecution.java:47)
    at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
    at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
    at java.base/java.lang.Thread.run(Thread.java:829)
Finished: FAILURE
I upgraded to […] 4.10-3-jdk11
kryan90 You aren't running with the fix. The fix is in Remoting 4.13.3 (LTS line) and Remoting 3044.vb_940a_a_e4f72e (weekly line). Java 11.0.16 is also recommended.
basil Definitely interested in this fix. We run the Docker containers for the controller and agent side; can you share which tags of these include this fix? For instance, I don't see a 4.13 release for the agent side here: https://hub.docker.com/r/jenkins/inbound-agent/tags?page=1&name=jdk11. Is an update required on both the controller and agent side? Kevin tried the above assuming the fix is on the controller side only and used the latest available tagged image for the agent (4.10). So we just need some clarity on which Docker tags we need to use here. Thanks.
I don't know anything about how the Docker images for agents are built or what versions of Remoting they include in them.
basil Another question: do both the controller and the agent have to have this fix/version? Also, which component do I raise a ticket against to address the inbound-agent Docker image side of this?
The fix is in remoting.jar, which is the main JAR for running the agent process. That having been said, that JAR file gets shipped over from the controller in some scenarios (e.g. SSH Build Agents, where the controller uses SSH to connect to the agent, ship over its copy of remoting.jar, and then start the agent process), so the controller version is relevant in that it bundles the remoting.jar used in certain agent scenarios. However, I think your Docker image use case bundles its own copy of remoting.jar, completely separate from the controller's version. I believe the maintainers of the Docker images use GitHub issues on the corresponding GitHub repositories to track issues.
I put together a local reproducer for this bug. First, I created a Python script to generate a large burst of output:
I put this script in /tmp/lipsum.py.
Then I built Remoting with a 100 millisecond sleep:
I installed this with mvn clean install -DskipTests.
In Jenkins core I used this patch:
Running the above with MAVEN_OPTS=-Xmx4g mvn clean verify -Dspotbugs.skip=true -Dcheckstyle.skip=true -Dtest=hudson.model.ProjectTest#testRemoting, the test passes. Watching the thread count, I get up to 15,500 threads for the agent process. This is a lot of threads, but not enough to trigger an out of memory error on my system.
Next I needed a way to trigger the error. I'm on a Linux desktop with about 1,500 threads running at idle, so I tried putting various numbers in /proc/sys/kernel/threads-max to limit the maximum number of threads on my system. By default the limit was over 250,000 threads, which didn't result in an OOM. A limit of 23,000 threads still wasn't enough to trigger an OOM. But a limit of 22,000 threads was enough to consistently trigger this:
Finding a baseline was important: in the passing scenario, my regular desktop applications were using about 1,500 threads, the agent was using about 15,500 threads, and the other test machinery (e.g. the Jenkins controller process and JUnit) must have been using roughly 5,000 threads. Based on this, I added this patch to Remoting:
I wanted these numbers to be as high as possible in order for the test to finish in a reasonable amount of time with the 100 millisecond sleep in Remoting while still being low enough to demonstrate that bounding the thread pools works to get the test to pass in a 22,000-thread-limited environment (which was failing with newCachedThreadPool). My theory was that this would cap the agent process at 10,000 threads (rather than the old 15,500), which (along with the 1,500 threads for my desktop applications and the test machinery's 5,000 - 6,000 threads) should still be well under the system's 22,000 thread limit. Sure enough, the fix worked! The test passed again. No more OOM.
I think this demonstrates that putting an upper bound on the number of threads in Remoting will solve this problem.
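The essence of that change — replacing an unbounded cached pool with a bounded one — can be sketched as follows; the cap of 10 threads is illustrative, not the value used in Remoting:

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class BoundedPools {
    // Unbounded: a cached pool creates a new thread whenever all existing
    // threads are busy, so a burst of tasks becomes a burst of native
    // threads -- the failure mode described in this issue.
    static final ExecutorService unbounded = Executors.newCachedThreadPool();

    // Bounded: a fixed pool never exceeds its cap; excess tasks wait in the
    // pool's work queue instead of forcing new native threads.
    static final ExecutorService bounded = Executors.newFixedThreadPool(10);
}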