Details
- Type: Bug
- Status: Open
- Priority: Major
- Resolution: Unresolved
- Component/s: core, kubernetes-plugin, remoting
- Environment (jenkins instance):
jenkins core 2.263.1
CentOS Linux 7 (Core)
kubernetes plugin 1.28.4
jenkins agent remoting VERSION=4.6
-websocket flag passed to jenkins agent
Description
I get intermittent agent disconnects while builds are running. I'll try to provide as much info as I can; let me know what else I should check.
- The Jenkins master runs Java 11 (java-11-openjdk-11.0.5.10) and is started with hudson.slaves.ChannelPinger.pingIntervalSeconds=30 in order to avoid disconnects (see the sketch after this list).
- An nginx reverse proxy is in use with an SSL timeout of 5 minutes, which was too close to the default hudson.slaves.ChannelPinger.pingIntervalSeconds (also 5 minutes), so the ping interval was reduced to 30 seconds. That gave good results and reduced the number of disconnects per day (the stack trace at the time was different and did not show a SIGHUP).
- The Jenkins masters are on premises.
- The Jenkins agents run in GKE (GCP), Kubernetes version 1.16.5.
- The Jenkins agent container image has the default java -version:
  openjdk version "1.8.0_232"
  OpenJDK Runtime Environment (build 1.8.0_232-8u232-b09-1~deb9u1-b09)
  OpenJDK 64-Bit Server VM (build 25.232-b09, mixed mode)
- remoting VERSION=4.6
- The -websocket flag is passed to the Jenkins agent via the kubernetes plugin's extra CLI arguments. I noticed afterwards that there is a WebSocket checkbox in the kubernetes plugin configuration, but I couldn't find docs to go with it; should I switch to using that? (See the configuration sketch below the description.)
- In terms of sizing, we peak at about 400 Jenkins agents/pods connected at a time; the limit is set to 500 in the Jenkins kubernetes plugin configuration.
- The issue happens even when load is low
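For reference, a minimal sketch of the master-side setting described above (the jenkins.war invocation is simplified; our instance is actually started through a service wrapper):

    # Ping agents every 30s instead of the default 300s, so the channel
    # never sits idle long enough to hit the nginx 5-minute SSL timeout
    java -Dhudson.slaves.ChannelPinger.pingIntervalSeconds=30 -jar jenkins.war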
The connection is established fine, but intermittently gets disconnected. Let me know what else I can look at.
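Regarding the WebSocket checkbox mentioned above: it appears to map to a webSocket option on the kubernetes cloud. A configuration-as-code sketch of what I believe the equivalent would be (the property names are my assumption from the plugin's JCasC export, not from docs):

    jenkins:
      clouds:
        - kubernetes:
            name: "gke"             # hypothetical cloud name
            webSocket: true         # would replace the -websocket CLI flag
            containerCapStr: "500"  # the pod limit mentioned above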
Stack trace:
SignalException: SIGHUP
FATAL: command execution failed
java.nio.channels.ClosedChannelException
    at jenkins.agents.WebSocketAgents$Session.closed(WebSocketAgents.java:141)
    at jenkins.websocket.WebSocketSession.onWebSocketSomething(WebSocketSession.java:91)
    at com.sun.proxy.$Proxy105.onWebSocketClose(Unknown Source)
    at org.eclipse.jetty.websocket.common.events.JettyListenerEventDriver.onClose(JettyListenerEventDriver.java:149)
    at org.eclipse.jetty.websocket.common.WebSocketSession.callApplicationOnClose(WebSocketSession.java:394)
    at org.eclipse.jetty.websocket.common.io.AbstractWebSocketConnection.disconnect(AbstractWebSocketConnection.java:316)
    at org.eclipse.jetty.websocket.common.io.DisconnectCallback.succeeded(DisconnectCallback.java:42)
    at org.eclipse.jetty.websocket.common.io.AbstractWebSocketConnection$CallbackBridge.writeSuccess(AbstractWebSocketConnection.java:86)
    at org.eclipse.jetty.websocket.common.io.FrameFlusher.notifyCallbackSuccess(FrameFlusher.java:359)
    at org.eclipse.jetty.websocket.common.io.FrameFlusher.succeedEntries(FrameFlusher.java:288)
    at org.eclipse.jetty.websocket.common.io.FrameFlusher.succeeded(FrameFlusher.java:280)
    at org.eclipse.jetty.io.WriteFlusher.write(WriteFlusher.java:293)
    at org.eclipse.jetty.io.AbstractEndPoint.write(AbstractEndPoint.java:381)
    at org.eclipse.jetty.websocket.common.io.FrameFlusher.flush(FrameFlusher.java:264)
    at org.eclipse.jetty.websocket.common.io.FrameFlusher.process(FrameFlusher.java:193)
    at org.eclipse.jetty.util.IteratingCallback.processing(IteratingCallback.java:241)
    at org.eclipse.jetty.util.IteratingCallback.iterate(IteratingCallback.java:223)
    at org.eclipse.jetty.websocket.common.io.AbstractWebSocketConnection.outgoingFrame(AbstractWebSocketConnection.java:581)
    at org.eclipse.jetty.websocket.common.io.AbstractWebSocketConnection.close(AbstractWebSocketConnection.java:181)
    at org.eclipse.jetty.websocket.common.io.AbstractWebSocketConnection.onFillable(AbstractWebSocketConnection.java:510)
    at org.eclipse.jetty.websocket.common.io.AbstractWebSocketConnection.onFillable(AbstractWebSocketConnection.java:440)
    at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:311)
    at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:105)
    at org.eclipse.jetty.io.ChannelEndPoint$1.run(ChannelEndPoint.java:104)
    at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.runTask(EatWhatYouKill.java:336)
    at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.doProduce(EatWhatYouKill.java:313)
    at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.tryProduce(EatWhatYouKill.java:171)
    at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.run(EatWhatYouKill.java:129)
    at org.eclipse.jetty.util.thread.ReservedThreadExecutor$ReservedThread.run(ReservedThreadExecutor.java:375)
    at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:773)
    at org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.run(QueuedThreadPool.java:905)
    at java.base/java.lang.Thread.run(Thread.java:834)
Thank you, it was easy enough to disable. Initially I was not expecting it to initiate a disconnect, as opposed to marking the agent offline. I have since removed all the monitoring actions just to make sure, but I have not seen any improvement in the original disconnect issue.
Since then I have added the Support Core plugin in order to save loggers to disk, and I am tracking hudson.slaves.ChannelPinger and hudson.remoting.PingThread, thinking that maybe the PingThread initiates the disconnect (see the sketch below).
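For anyone following along, the same log levels can also be bumped from the script console; a minimal sketch, roughly equivalent to what I configured through the log recorder UI:

    import java.util.logging.Level;
    import java.util.logging.Logger;

    // FINEST shows each ping sent and each ping failure detected.
    // Note: keep strong references if running outside the UI recorder,
    // since java.util.logging loggers can otherwise be garbage-collected.
    Logger pinger = Logger.getLogger("hudson.slaves.ChannelPinger");
    Logger pingThread = Logger.getLogger("hudson.remoting.PingThread");
    pinger.setLevel(Level.FINEST);
    pingThread.setLevel(Level.FINEST);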
I am also able to reproduce the issue on a dedicated test Jenkins server that I have moved back to using the JNLP port (5006) for the channel instead of WebSocket.
For now, even under high load, all pings respond in less than one second, with the ping interval set to 30s. The PingThread then finds that the channel has been closed at some point between two checks.
I have added tcpdump on both sides of the connection. From the agent's perspective the capture is incomplete: Wireshark warns that the capture was cut short in the middle of packets. From the server I can clearly see a FIN,ACK coming from the agent, to which we reply FIN,ACK, and we then get the final ACK back. That is a normal, orderly TCP connection termination. I still don't know what causes it, but the smoking gun does not appear to be on the network side. Perhaps something at a higher level is terminating the process, which in turn closes the connection, but I have not found it.
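In case someone wants to reproduce the captures: Wireshark's "cut short" warning usually means the capture ran with a limited snap length, so I am retaking the agent-side capture with full packets, along these lines (interface, host, and port are placeholders for our setup):

    # -s 0 = full snap length, so packets are not truncated mid-capture
    tcpdump -i any -s 0 -w agent-side.pcap host <jenkins-master> and port 443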
Is there any other library I should add to the loggers for the investigation? The whole hudson.remoting package?