Bug
Resolution: Unresolved
Major
None
CentOS 7.6.1810 (Core)
Linux 3.10.0-957.27.2.el7.x86_64
OpenJDK Runtime Environment (build 1.8.0_222-b10)
Jenkins agent launched using this command: /etc/alternatives/jre/bin/java -Dfile.encoding=UTF-8 -Djava.io.tmpdir=/nlu/projects/nlu_jenkins_here/jnlp_slaves/nlu_jenkins_here-slurm-jnlp-agent-0007rh038xtnw -cp /nlu/projects/nlu_jenkins_here/agent.jar hudson.remoting.jnlp.Main -headless -tunnel :50003 -url https://resources-ci-master-jenkins.grid.hosting.cerence.net -workDir /nlu/projects/here-work/resci-jenkins ef232d034300c5efad11577d284bec05ccc6079e4e4aba7c9cbf7b5adb0b9197 nlu_jenkins_here-slurm-jnlp-agent-0007rh038xtnw
...it's an agent connected using JNLP to the master, spawned through a Docker "cloud", but the Java process eventually does _not_ run in a container.
Jenkins master does run in a container (the jenkinsci/blueocean image, on a version from 2020-09-09) behind a reverse proxy.
Jenkins version: 2.235.5
Remoting version: 4.3
Our JNLP agents often get stuck – the process is still running, Jenkins thinks they are still executing the build that was assigned to them, but the build is not progressing anywhere for days.
When we started investigating the state of such stuck agents, the first thing we noticed was the following message in the agent's log:
Sep 17, 2020 2:17:11 PM hudson.remoting.jnlp.Main$CuiListener status
INFO: Connected
Sep 17, 2020 2:17:30 PM hudson.remoting.RemoteInvocationHandler$Unexporter run
SEVERE: Couldn't clean up oid=2 from null
java.lang.OutOfMemoryError: unable to create new native thread
    at java.lang.Thread.start0(Native Method)
    at java.lang.Thread.start(Thread.java:717)
    at java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:957)
    at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1378)
    at java.util.concurrent.AbstractExecutorService.submit(AbstractExecutorService.java:134)
    at hudson.remoting.DelegatingExecutorService.submit(DelegatingExecutorService.java:42)
    at hudson.remoting.InterceptingExecutorService.submit(InterceptingExecutorService.java:46)
    at hudson.remoting.InterceptingExecutorService.submit(InterceptingExecutorService.java:41)
    at org.jenkinsci.remoting.util.AnonymousClassWarnings.check(AnonymousClassWarnings.java:66)
    at org.jenkinsci.remoting.util.AnonymousClassWarnings$1.annotateClass(AnonymousClassWarnings.java:122)
    at java.io.ObjectOutputStream.writeNonProxyDesc(ObjectOutputStream.java:1290)
    at java.io.ObjectOutputStream.writeClassDesc(ObjectOutputStream.java:1231)
    at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1427)
    at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
    at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
    at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
    at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
    at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
    at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
    at java.io.ObjectOutputStream.defaultWriteObject(ObjectOutputStream.java:441)
    at java.lang.Throwable.writeObject(Throwable.java:1014)
    at sun.reflect.GeneratedMethodAccessor7.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:1140)
    at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1496)
    at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
    at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
    at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
    at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
    at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
    at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
    at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348)
    at hudson.remoting.Command.writeTo(Command.java:111)
    at hudson.remoting.AbstractByteBufferCommandTransport.write(AbstractByteBufferCommandTransport.java:287)
    at hudson.remoting.Channel.send(Channel.java:764)
    at hudson.remoting.RemoteInvocationHandler$PhantomReferenceImpl.cleanup(RemoteInvocationHandler.java:395)
    at hudson.remoting.RemoteInvocationHandler$PhantomReferenceImpl.access$1000(RemoteInvocationHandler.java:354)
    at hudson.remoting.RemoteInvocationHandler$Unexporter.run(RemoteInvocationHandler.java:612)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at hudson.remoting.AtmostOneThreadExecutor$Worker.run(AtmostOneThreadExecutor.java:111)
    at java.lang.Thread.run(Thread.java:748)
Because the process is still running, we can obtain the thread dump, which contains 3653 lines for threads like this:
"pool-1-thread-157 for JNLP4-connect connection to resources-ci-master-jenkins.grid.hosting.cerence.net/10.179.225.4:50003 id=1138934" #172 daemon prio=5 os_prio=0 tid=0x00007f2a1c002800 nid=0x3d3a waiting for monitor entry [0x00007f2a49c81000]
"pool-1-thread-155 for JNLP4-connect connection to resources-ci-master-jenkins.grid.hosting.cerence.net/10.179.225.4:50003 id=1138866" #170 daemon prio=5 os_prio=0 tid=0x00007f29d8002000 nid=0x3d38 waiting for monitor entry [0x00007f2a4ad92000]
"pool-1-thread-148 for JNLP4-connect connection to resources-ci-master-jenkins.grid.hosting.cerence.net/10.179.225.4:50003 id=1138261" #163 daemon prio=5 os_prio=0 tid=0x00007f2a8c043800 nid=0x131b8 waiting for monitor entry [0x00007f2a4a388000]
From a quick look through the thread dump, each of those threads seems to be stuck in the same call:
"pool-1-thread-13682 for JNLP4-connect connection to resources-ci-master-jenkins.grid.hosting.cerence.net/10.179.225.4:50003 id=2766416" #13698 daemon prio=5 os_prio=0 tid=0x00007f2a8cba6800 nid=0x12b54 waiting for monitor entry [0x00007f1fdd8a0000]
   java.lang.Thread.State: BLOCKED (on object monitor)
    at sun.misc.Unsafe.defineClass(Native Method)
    at sun.reflect.ClassDefiner.defineClass(ClassDefiner.java:63)
    at sun.reflect.MethodAccessorGenerator$1.run(MethodAccessorGenerator.java:399)
    at sun.reflect.MethodAccessorGenerator$1.run(MethodAccessorGenerator.java:394)
    at java.security.AccessController.doPrivileged(Native Method)
    at sun.reflect.MethodAccessorGenerator.generate(MethodAccessorGenerator.java:393)
    at sun.reflect.MethodAccessorGenerator.generateMethod(MethodAccessorGenerator.java:75)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:53)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1170)
    at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2177)
    at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2068)
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1572)
    at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2286)
    at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2210)
    at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2068)
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1572)
    at java.io.ObjectInputStream.readObject(ObjectInputStream.java:430)
    at hudson.remoting.UserRequest.deserialize(UserRequest.java:290)
    at hudson.remoting.UserRequest.perform(UserRequest.java:189)
    at hudson.remoting.UserRequest.perform(UserRequest.java:54)
    at hudson.remoting.Request$2.run(Request.java:369)
    at hudson.remoting.InterceptingExecutorService$1.call(InterceptingExecutorService.java:72)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at hudson.remoting.Engine$1.lambda$newThread$0(Engine.java:117)
    at hudson.remoting.Engine$1$$Lambda$3/231593000.run(Unknown Source)
    at java.lang.Thread.run(Thread.java:748)
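For anyone who wants to check this more rigorously than by eyeballing the dump, here is a rough Python sketch that groups the threads of a jstack-style dump by state and top stack frame. It is not what I ran; the function name `summarize_thread_dump` and the regular expressions are my own assumptions about the HotSpot dump layout shown above (quoted thread name followed by `#<n>`, then a `java.lang.Thread.State:` line, then `at ...` frames).

```python
import re
from collections import Counter

# Assumed layout of a HotSpot thread dump, as in the excerpt above:
#   "<thread name>" #<n> daemon prio=... 
#      java.lang.Thread.State: BLOCKED (on object monitor)
#       at some.Class.method(File.java:123)
HEADER = re.compile(r'^"(?P<name>.*)" #\d+')
STATE = re.compile(r'^\s*java\.lang\.Thread\.State: (?P<state>\w+)')
FRAME = re.compile(r'^\s*at (?P<frame>\S+)\(')


def summarize_thread_dump(text):
    """Count threads per (state, top stack frame) in a jstack-style dump."""
    counts = Counter()
    state = None
    in_thread = False
    for line in text.splitlines():
        if HEADER.match(line):
            # A new thread starts; forget the previous thread's state.
            in_thread, state = True, None
            continue
        if not in_thread:
            continue
        m = STATE.match(line)
        if m:
            state = m.group("state")
            continue
        m = FRAME.match(line)
        if m:
            # Only the top frame is recorded; skip the rest of this thread.
            counts[(state, m.group("frame"))] += 1
            in_thread = False
    return counts


if __name__ == "__main__":
    import sys

    # Usage: python summarize_dump.py < threaddump.txt
    for (state, frame), n in summarize_thread_dump(sys.stdin.read()).most_common(10):
        print(f"{n:6d}  {state}  {frame}")
```

If nearly all threads land in a single `("BLOCKED", "sun.misc.Unsafe.defineClass")` bucket, that would confirm the impression from the quick look; the dump's lock lines (`- waiting to lock <...>`) would still be needed to tell which thread holds the monitor.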
This massive deadlock grinds the agent to a halt. I suppose that's because, in this state, the user has reached its limit of 4096 (lightweight) processes (ulimit -u), which would also explain the "unable to create new native thread" error above.
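To test the ulimit theory on the agent host, something like the following Python sketch could compare the number of threads owned by the agent's user against the soft `RLIMIT_NPROC`. This is only a sketch under assumptions: it relies on the Linux `/proc/<pid>/status` fields `Uid:` and `Threads:`, the helper names are made up for illustration, and it must run as the same user as the agent. The limit it prints is the one inherited by the script's own shell, which may differ from the agent's; `/proc/<agent pid>/limits` is the authoritative source for the running process.

```python
import os
import resource


def thread_count_from_status(status_text):
    """Return the 'Threads:' value from the text of /proc/<pid>/status."""
    for line in status_text.splitlines():
        if line.startswith("Threads:"):
            return int(line.split()[1])
    return 0


def uid_from_status(status_text):
    """Return the real UID (first value of the 'Uid:' line), or None."""
    for line in status_text.splitlines():
        if line.startswith("Uid:"):
            return int(line.split()[1])
    return None


def threads_for_uid(uid, proc_root="/proc"):
    """Sum the thread counts of all processes whose real UID is `uid`."""
    total = 0
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a PID directory
        try:
            with open(os.path.join(proc_root, entry, "status")) as f:
                status = f.read()
        except OSError:
            continue  # process exited meanwhile, or is not readable
        if uid_from_status(status) == uid:
            total += thread_count_from_status(status)
    return total


if __name__ == "__main__":
    soft, _hard = resource.getrlimit(resource.RLIMIT_NPROC)
    limit = "unlimited" if soft == resource.RLIM_INFINITY else soft
    used = threads_for_uid(os.getuid())
    print(f"uid {os.getuid()}: {used} threads, soft RLIMIT_NPROC = {limit}")
```

A thread count close to the limit while the agent is stuck would support the explanation above; a count well below it would point to some other cause for the failed thread creation.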
It could be that the trigger was some error in transmission over the network between the Jenkins master and the agent; I don't trust the network much. We also have occasional issues with the reliability of the file systems. Anyhow, I am filing this ticket now, because the problem may be at least partially caused by an issue in Jenkins or Remoting, and I expect it to come back.