Description
When we upgrade and reboot our Jenkins agents, they sometimes hang on startup. We have about 50 agents and upgrade/reboot them twice a day. Roughly 1 in 100 reboots leaves an agent stuck on startup.
On the Jenkins master, we see this error message from the hung agent's logs:
ERROR: Connection terminated
java.nio.channels.ClosedChannelException
    at org.jenkinsci.remoting.protocol.NetworkLayer.onRecvClosed(NetworkLayer.java:154)
    at org.jenkinsci.remoting.protocol.impl.NIONetworkLayer.ready(NIONetworkLayer.java:142)
    at org.jenkinsci.remoting.protocol.IOHub$OnReady.run(IOHub.java:795)
    at jenkins.util.ContextResettingExecutorService$1.run(ContextResettingExecutorService.java:28)
    at jenkins.security.ImpersonatingExecutorService$1.run(ImpersonatingExecutorService.java:59)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
On the hung agent, the attached jstack thread dump reports a deadlock: two threads appear to be waiting on each other. Once the agent hits this deadlock, it never finishes connecting to the master, and the master cannot use the agent as a node while it is in this hung state.
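For anyone unfamiliar with this failure mode, here is a minimal sketch of the general pattern jstack reports as a deadlock: two threads that each hold one lock and wait for the other's. It is not the remoting code and does not reproduce the threads in the attached dump. The class name `DeadlockSketch`, the method names, and the thread names are all hypothetical. It uses the standard `ThreadMXBean.findDeadlockedThreads()` call, which could also serve as an in-process check alongside a jstack dump.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;
import java.util.concurrent.CountDownLatch;

// Hypothetical illustration only: a generic lock-ordering deadlock,
// not the actual threads or locks from the attached jstack dump.
public class DeadlockSketch {

    // Starts a daemon thread that takes `first`, waits until both threads
    // hold their first lock, then tries to take `second`.
    static Thread lockInOrder(String name, Object first, Object second,
                              CountDownLatch bothHoldFirst) {
        Thread t = new Thread(() -> {
            synchronized (first) {
                bothHoldFirst.countDown();
                try {
                    bothHoldFirst.await();
                } catch (InterruptedException e) {
                    return;
                }
                synchronized (second) {
                    // Never reached: the other thread holds `second`.
                }
            }
        }, name);
        t.setDaemon(true); // daemon, so the JVM can still exit
        t.start();
        return t;
    }

    // Creates two threads that acquire the same two locks in opposite order
    // and returns how many of them the JVM reports as deadlocked.
    static int countDeadlockedThreads() {
        Object lockA = new Object();
        Object lockB = new Object();
        CountDownLatch latch = new CountDownLatch(2);
        Thread t1 = lockInOrder("hypothetical-thread-1", lockA, lockB, latch);
        Thread t2 = lockInOrder("hypothetical-thread-2", lockB, lockA, latch);

        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        // Poll for up to about 5 seconds; both threads need a moment to block.
        for (int i = 0; i < 100; i++) {
            long[] ids = mx.findDeadlockedThreads(); // null if none found
            int matched = 0;
            if (ids != null) {
                for (long id : ids) {
                    if (id == t1.getId() || id == t2.getId()) {
                        matched++;
                    }
                }
            }
            if (matched == 2) {
                return matched;
            }
            try {
                Thread.sleep(50);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return matched;
            }
        }
        return 0;
    }

    public static void main(String[] args) {
        System.out.println("deadlocked threads: " + countDeadlockedThreads());
    }
}
```

Running `jstack` against a JVM in this state should show a "Found one Java-level deadlock" section naming both threads and the monitors they are waiting on, which is the kind of section to look for in the attached dump.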
Could the difference in Java versions contribute to this problem? The master runs Java 1.8.0_252-8u252-b09-1~18.04-b09, whereas the agents run Java 1.8.0_265-8u265-b01-0.