
Jenkins agent disconnects on k8s with SIGHUP / ClosedChannelException

    • Type: Bug
    • Resolution: Not A Defect
    • Priority: Major
    • jenkins instance:
      jenkins core 2.263.1
      CentOS Linux 7 (Core)
      kubernetes plugin 1.28.4

      jenkins agent remoting VERSION=4.6
      -websocket flag passed to jenkins agent

      I get intermittent agent disconnects while builds are running. I'll try to provide as much info as I can; let me know what else I can check.

       

      • Jenkins master runs Java 11 (java-11-openjdk-11.0.5.10) and is started with hudson.slaves.ChannelPinger.pingIntervalSeconds=30 in order to avoid disconnects (see the sketch after this list)
      • Nginx reverse proxy in use; its SSL timeout is 5 minutes, which was too close to the default hudson.slaves.ChannelPinger.pingIntervalSeconds (5 minutes), so the ping interval was reduced to 30 seconds with good results, which reduced the number of disconnects per day (those stack traces were different and did not show a SIGHUP)
      • jenkins masters are on premises
      • jenkins agents are in GKE (GCP), kubernetes version 1.16.5
      • jenkins agent container image uses the default java (java -version):
        openjdk version "1.8.0_232"
        OpenJDK Runtime Environment (build 1.8.0_232-8u232-b09-1~deb9u1-b09)
        OpenJDK 64-Bit Server VM (build 25.232-b09, mixed mode)
      • remoting VERSION=4.6
      • -websocket flag passed to the jenkins agent via the k8s plugin extra CLI arguments. I noticed afterwards that there is a WebSocket checkbox in the kubernetes plugin config, but couldn't find docs to go with it; should I switch to using that?
      • In terms of sizing, we peak at about 400 jenkins agents / pods connected at a time; the limit is set to 500 in the jenkins kubernetes plugin configuration
      • The issue happens even when load is low
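
      For reference, a minimal sketch of how the ping setting above is wired into the controller launch (the JAVA_OPTS plumbing and war path are assumptions; adjust to however the controller is actually started):

      # controller launch options (illustrative only)
      # hudson.slaves.ChannelPinger.pingIntervalSeconds drives the remoting-level ping;
      # 30s keeps it well below the 5-minute proxy timeout mentioned above
      JAVA_OPTS="$JAVA_OPTS -Dhudson.slaves.ChannelPinger.pingIntervalSeconds=30"
      java $JAVA_OPTS -jar /usr/lib/jenkins/jenkins.war --httpPort=8080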

      The connection is established fine, but intermittently gets disconnected. Let me know what else I can look at.

       

      Stack trace:

       

      SignalException: SIGHUP
      FATAL: command execution failed
      java.nio.channels.ClosedChannelException
      	at jenkins.agents.WebSocketAgents$Session.closed(WebSocketAgents.java:141)
      	at jenkins.websocket.WebSocketSession.onWebSocketSomething(WebSocketSession.java:91)
      	at com.sun.proxy.$Proxy105.onWebSocketClose(Unknown Source)
      	at org.eclipse.jetty.websocket.common.events.JettyListenerEventDriver.onClose(JettyListenerEventDriver.java:149)
      	at org.eclipse.jetty.websocket.common.WebSocketSession.callApplicationOnClose(WebSocketSession.java:394)
      	at org.eclipse.jetty.websocket.common.io.AbstractWebSocketConnection.disconnect(AbstractWebSocketConnection.java:316)
      	at org.eclipse.jetty.websocket.common.io.DisconnectCallback.succeeded(DisconnectCallback.java:42)
      	at org.eclipse.jetty.websocket.common.io.AbstractWebSocketConnection$CallbackBridge.writeSuccess(AbstractWebSocketConnection.java:86)
      	at org.eclipse.jetty.websocket.common.io.FrameFlusher.notifyCallbackSuccess(FrameFlusher.java:359)
      	at org.eclipse.jetty.websocket.common.io.FrameFlusher.succeedEntries(FrameFlusher.java:288)
      	at org.eclipse.jetty.websocket.common.io.FrameFlusher.succeeded(FrameFlusher.java:280)
      	at org.eclipse.jetty.io.WriteFlusher.write(WriteFlusher.java:293)
      	at org.eclipse.jetty.io.AbstractEndPoint.write(AbstractEndPoint.java:381)
      	at org.eclipse.jetty.websocket.common.io.FrameFlusher.flush(FrameFlusher.java:264)
      	at org.eclipse.jetty.websocket.common.io.FrameFlusher.process(FrameFlusher.java:193)
      	at org.eclipse.jetty.util.IteratingCallback.processing(IteratingCallback.java:241)
      	at org.eclipse.jetty.util.IteratingCallback.iterate(IteratingCallback.java:223)
      	at org.eclipse.jetty.websocket.common.io.AbstractWebSocketConnection.outgoingFrame(AbstractWebSocketConnection.java:581)
      	at org.eclipse.jetty.websocket.common.io.AbstractWebSocketConnection.close(AbstractWebSocketConnection.java:181)
      	at org.eclipse.jetty.websocket.common.io.AbstractWebSocketConnection.onFillable(AbstractWebSocketConnection.java:510)
      	at org.eclipse.jetty.websocket.common.io.AbstractWebSocketConnection.onFillable(AbstractWebSocketConnection.java:440)
      	at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:311)
      	at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:105)
      	at org.eclipse.jetty.io.ChannelEndPoint$1.run(ChannelEndPoint.java:104)
      	at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.runTask(EatWhatYouKill.java:336)
      	at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.doProduce(EatWhatYouKill.java:313)
      	at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.tryProduce(EatWhatYouKill.java:171)
      	at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.run(EatWhatYouKill.java:129)
      	at org.eclipse.jetty.util.thread.ReservedThreadExecutor$ReservedThread.run(ReservedThreadExecutor.java:375)
      	at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:773)
      	at org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.run(QueuedThreadPool.java:905)
      	at java.base/java.lang.Thread.run(Thread.java:834)


          Tim Jacomb added a comment -

          cc jglick


          Samuel Beaulieu added a comment -

          My next step is to set the default Java version for the jenkins agent container to Java 11. I've also noticed higher CPU load and CPU spikes since we moved more traffic to k8s, and I am trying to track that down. Is the pingIntervalSeconds ping a heavy operation for the jenkins server in general? Would k8s agents increase CPU usage?

          Jesse Glick added a comment -

          As far as I know there is no straightforward way to track down why the connection gets broken. Possibly related to your nginx configuration.


          Samuel Beaulieu added a comment - edited

          I have moved the agents to using java 11, but it did not help with the issue.

          Would the nginx logs show something?

           

          Some metrics: it happens about 30-60 times a day with a few hundred builds. It can happen after an undetermined time, for example after 5 minutes, 60 minutes or 90+ minutes of running successfully. No other pattern is noticed.

           

          Now, from the nginx docs on websockets:

          Alternatively, the proxied server can be configured to periodically send WebSocket ping frames to reset the timeout and check if the connection is still alive.

          I am assuming that's what hudson.slaves.ChannelPinger.pingIntervalSeconds is for? All the timeouts I have set are above the pingIntervalSeconds of 30 seconds. I'll try setting http://nginx.org/en/docs/http/ngx_http_core_module.html#lingering_close to 'always', but I'm not convinced this is the root cause.
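
          For context, a minimal sketch of the nginx directives that usually matter for a long-lived WebSocket tunnel through the reverse proxy (the location block, upstream address and timeout values are assumptions for illustration, not the actual config):

          # nginx: inside the server { } block that fronts the Jenkins controller
          location / {
              proxy_pass http://127.0.0.1:8080;
              # required for the WebSocket upgrade handshake
              proxy_http_version 1.1;
              proxy_set_header Upgrade $http_upgrade;
              proxy_set_header Connection "upgrade";
              # idle timeouts must stay above the ping interval or nginx tears the tunnel down
              proxy_read_timeout 300s;
              proxy_send_timeout 300s;
              # the directive referenced above
              lingering_close always;
          }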

           

          We also get a similar disconnect (although with a different stack trace, as expected) without websockets, going to the defined JNLP port. Moving to websocket was an attempt to get rid of those disconnects; we hoped that the standard web spec would be more resilient than the jenkins agent JNLP connection. Any preference at this point from cloudbees engineers?


          Jesse Glick added a comment -

          The Jenkins layer for WS sends WS ping packets automatically, whether or not ChannelPinger is active at a higher level.

          Without knowing the cause of the disconnection I cannot offer any further advice I am afraid.


          Samuel Beaulieu added a comment -

          jglick thank you. Is SIGHUP expected in that context, and should it be caught and just trigger a retry instead of a FATAL error? Like when you SIGHUP other Linux processes to reload config and reconnect?

          Jesse Glick added a comment -

          It is usually left up to a higher layer to try to reconnect. For example, if the agents are run as a K8s Deployment then the error should cause a pod exit and recreation. I am not going to get into detailed recommendations for management; more of a users’ list question.


          Samuel Beaulieu added a comment -

          AFAIK there is no Deployment for the jenkins kubernetes plugin; jenkins agents are run as plain pods. Shouldn't the reconnect first be dealt with at the application level, in the remoting library, instead of falling back to a cluster-level restart of the application? If the pod is still available and running, it would be less costly than trying to use a Deployment for exit and recreation. Even if we had a Deployment for that, that kind of reconnect is not possible until the jenkins-agent remoting application can support it.

          I'm still trying to troubleshoot the disconnect and find its root cause. I've been running dummy jenkins jobs/pods with tcpdump in an attempt to learn where the disconnect is initiated.

          Samuel Beaulieu added a comment -

          The error log in jenkins.log at the same time as the disconnect:

          2021-01-26 15:25:23.121+0000 [id=557104]	INFO	j.s.DefaultJnlpSlaveReceiver#channelClosed: Jetty (winstone)-557104 for jnlp-0ttwg terminated: java.nio.channels.ClosedChannelException 

          I've also set the k8s plugin rule to not delete pods that have failed, but it still deleted the pod:

          2021-01-26 15:25:23.493+0000 [id=552662]	INFO	o.c.j.p.k.KubernetesSlave#_terminate: Terminating Kubernetes instance for agent jnlp-0ttwg
          2021-01-26 15:25:23.598+0000 [id=552662]	INFO	o.c.j.p.k.KubernetesSlave#deleteSlavePod: Terminated Kubernetes instance for agent ci-jenkins-setup-enterprise/jnlp-0ttwg
          Terminated Kubernetes instance for agent ci-jenkins-setup-enterprise/jnlp-0ttwg
          2021-01-26 15:25:23.598+0000 [id=552662]	INFO	o.c.j.p.k.KubernetesSlave#_terminate: Disconnected computer jnlp-0ttwg
          Disconnected computer jnlp-0ttwg 


          Samuel Beaulieu added a comment -

          OK, I've figured something out: the jenkins-agent context is definitely receiving a signal of some sort, most likely SIGHUP as shown in the stack trace. My tcpdump command was interrupted and printed its packet counters, which only happens when tcpdump receives a signal.

          build console logs, disconnected after ~8 minutes:

          09:17:18 tcpdump: listening on eth0, link-type EN10MB (Ethernet), capture size 262144 bytes
          09:25:23 4739 packets captured
          09:25:23 4742 packets received by filter
          09:25:23 0 packets dropped by kernel 

           

          Is there a way to know the originator of the SIGHUP? The signal info usually carries the originating PID.

          I'm looking at the network traces and I don't see anything particularly suspicious.


          Samuel Beaulieu added a comment -

          jglick I may have found something, and I think you can help me with that.

           

          If you go to node monitoring (/computer/configure) under the help text for "response time" it says

          This monitors the round trip network response time from the master to the agent, and if it goes above a threshold repeatedly, it marks the agent offline.
          This is useful for detecting unresponsive agents, or other network problems that clog the communication channel. More specifically, the master sends a no-op command to the agent, and checks the time it takes to get back the result of this no-op command.

          I checked the jenkins core code for that ResponseTimeMonitor and it sends a disconnect, contrary to the description saying it puts the agent in offline mode!

          https://github.com/jenkinsci/jenkins/blame/master/core/src/main/java/hudson/node_monitors/ResponseTimeMonitor.java#L72-L78

          My understanding of that code is that if the timeout (hardcoded to 5000ms) is hit 5 times in a row, it disconnects the agent, thinking the node is hung. Is that an old assumption from back in the day when there were only a few static agents connected to the server?

          Looking at the monitoring page (/computer/) during load, I saw a few agents with 8000ms+ response time. I'd sure like to investigate whether that's our real response time, since my tcpdump captures only show 50-200 ms RTT for TCP packets. I understand that the response time monitor is checking the actual time it takes the agent to complete a no-op command.

          My question: does this scale when you have hundreds of agents connected at the same time? Would a higher load make the response time return a skewed number? I'm asking because, from a workload perspective, the agents are performing as expected and do not seem to be lagging. We moved our workload from a different cloud and can compare the running time of each build; they are similar or better.
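
          For reference, the same monitor readings can be pulled from the remote API instead of the UI (a sketch; the URL, credentials and jq filter are assumptions, and the monitorData key names should be verified against the actual JSON):

          # depth=1 makes the computer API include monitorData for every agent
          curl -s -u user:apitoken 'https://jenkins.example.com/computer/api/json?depth=1' \
            | jq '.computer[] | {name: .displayName, offline: .offline, monitors: .monitorData}'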

           

           


          Jesse Glick added a comment -

          Is that an old assumption from back in the day

          Probably.

          Would a higher load make the response time return a skewed number?

          Potentially yes—either because the actual network throughput cannot handle the traffic, or the controller JVM is not able to keep up with all the work, or the controller is generally OK but for some reason is not processing this node monitor in a timely fashion, or the fix of JENKINS-18671 was incorrect, etc.

          I'm asking because for a workload perspective, the agent is performing as expected and does not seem to be lagging.

          Then try disabling this monitor in https://jenkins/computer/configure and see if the problem clears up. This and other monitors should arguably just be suppressed on Cloud agents: the whole system of node monitors only makes sense in the context of a modest-sized pool of static agents that an admin is directly managing.


          Samuel Beaulieu added a comment -

          Thank you, it was easy enough to disable. Initially I was not expecting it to initiate a disconnect as opposed to marking the agent offline. I have since removed all the monitoring actions, just to make sure, but I have not seen any improvement in the original disconnect issue.

          Since then I have added the support plugin in order to save Loggers to disk, and I am tracking hudson.slaves.ChannelPinger and hudson.remoting.PingThread, thinking that maybe the PingThread initiates the disconnect.
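
          On the agent side, equivalent FINE logging can be enabled with a standard java.util.logging configuration passed to the remoting JVM via -Djava.util.logging.config.file=... (a sketch; the file location and logger selection are assumptions):

          # logging.properties for the agent container (illustrative)
          handlers=java.util.logging.ConsoleHandler
          java.util.logging.ConsoleHandler.level=FINE
          hudson.remoting.level=FINE
          org.jenkinsci.remoting.level=FINE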

          I am also able to reproduce the issue on a dedicated test jenkins server that I have moved back to using JNLP port 5006 for the channel instead of websocket.

          For now, even under high load, all pings seem to respond in less than 1 second, with the ping interval set to 30s. The PingThread sees the channel being closed at some point between two checks:

          2021-02-15 18:26:58.049+0000 [id=33477]	FINE	hudson.remoting.PingThread#ping: ping succeeded on JNLP4-connect connection from 10.236.113.8/10.236.113.8:50314
          2021-02-15 18:27:28.035+0000 [id=33477]	FINE	hudson.remoting.PingThread#ping: pinging JNLP4-connect connection from 10.236.113.8/10.236.113.8:50314
          2021-02-15 18:27:28.035+0000 [id=33477]	FINE	hudson.remoting.PingThread#ping: waiting 239s on JNLP4-connect connection from 10.236.113.8/10.236.113.8:50314
          2021-02-15 18:27:28.048+0000 [id=33477]	FINE	hudson.remoting.PingThread#ping: ping succeeded on JNLP4-connect connection from 10.236.113.8/10.236.113.8:50314
          2021-02-15 18:27:40.946+0000 [id=33623]	FINE	hudson.slaves.ChannelPinger$2#onClosed: Terminating ping thread for JNLP4-connect connection from 10.236.113.8/10.236.113.8:50314
          2021-02-15 18:27:40.946+0000 [id=33477]	FINE	hudson.remoting.PingThread#run: Ping thread for channel hudson.remoting.Channel@24534fbc:JNLP4-connect connection from 10.236.113.8/10.236.113.8:50314 is interrupted. Terminating
           

          I have added tcpdump on both sides of the connection. From the agent perspective the capture is incomplete, as Wireshark warns that the capture was cut short in the middle of a packet. From the server I can clearly see a FIN/ACK coming from the agent, to which we reply with a FIN/ACK and get the final ACK back. This is a normal, orderly TCP connection termination. I still don't know what causes it, but it appears the smoking gun is not on the network side. Something at a higher level is perhaps terminating the process, which in turn closes the connection, but I have not found it.
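
          For reference, the teardown packets can be pulled straight out of the saved capture (a sketch; the capture filename is the one used in the build above):

          # show only FIN/RST segments to see which side closed the connection first
          tcpdump -nn -r capture_jnlp-f64z8.cap 'tcp[tcpflags] & (tcp-fin|tcp-rst) != 0'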

          Any other loggers I should add for the investigation? The whole of remoting?


          Jesse Glick added a comment -

          That, and jenkins.agents.WebSocketAgents I suppose.


          Jesse Glick added a comment -

          Can you check whether adding -Djenkins.websocket.pingInterval=5 (or some other value less than the default of 30) to controller launch options helps?
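
          A minimal sketch of what that looks like on the controller side (the JAVA_OPTS wiring is an assumption; the property name is the one suggested above):

          # value in seconds; the default mentioned above is 30
          JAVA_OPTS="$JAVA_OPTS -Djenkins.websocket.pingInterval=5"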


          Samuel Beaulieu added a comment - edited

          Thank you. I have turned it on in our test instance and am currently monitoring. I will put it in production as soon as I can get a maintenance window.

          I notice there is nothing around triggering disconnects in the code for jenkins.websocket.pingInterval.

          So from this perspective it's more of a keep-alive to make sure nothing at the network layer tears the connection down.

           

          I have more information from the strace logs I have been running in our test instance:

          This is the process space before we sleep for an hour and wait for the disconnect to happen (it does not always happen, but we capture it when it does):

          *13:05:03* UID          PID    PPID  C STIME TTY          TIME CMD
          *13:05:03* root           1       0 31 19:04 pts/0    00:00:04 strace -ff -tt -o /home/jenkins/agent/strace.txt entrypoint
          *13:05:03* root           7       1 66 19:04 pts/0    00:00:09 java -javaagent:/var/lib/jenkins/jSSLKeyLog.jar=/home/jenkins/agent/FOO.txt -Dorg.jenkinsci.remoting.engine.JnlpProtocol3.disabled=true -cp /usr/share/jenkins/slave.jar hudson.remoting.jnlp.Main -headless -url 
          FOO -workDir /home/jenkins/agent 4d54c5f996f9da7c3ebad35ad60617c1e489e49f0d735ddc2f27189a0ed7623f jnlp-f64z8
          *13:05:03* root         118       7  0 19:04 pts/0    00:00:00 tcpdump -s0 -i eth0 -w capture_jnlp-f64z8.cap
          *13:05:03* root         153       7  0 19:05 pts/0    00:00:00 /bin/bash -ex /tmp/jenkins11227328797915352496.sh
          *13:05:03* root         165     153  0 19:05 pts/0    00:00:00 ps -ef
           

          The strace -ff option follows the forks and saves each trace to a file with the PID as a suffix, e.g. strace.txt.7 for the java agent process. The entrypoint is based on https://github.com/jenkinsci/docker-inbound-agent/blob/ad12874fe5567e9c9f197144a767515b44683d9c/11/debian/Dockerfile and runs source /usr/local/bin/jenkins-agent.
          Other home-made custom images also suffer from the disconnect, so I do not think at this point that it is related to the image itself.
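
          As a quick way to see which traced processes received the signal and how they ended (a sketch, assuming the strace.txt.<pid> naming above):

          # list the per-PID trace files that recorded a SIGHUP, then show how each process ended
          grep -l 'SIGHUP' /home/jenkins/agent/strace.txt.* \
            | xargs grep -H 'SIGHUP\|exited with\|killed by'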

          Here are the relevant last lines of each strace:

          strace.txt.7

          19:40:07.317915 --- SIGHUP {si_signo=SIGHUP, si_code=SI_USER, si_pid=0, si_uid=0} ---
          19:40:07.318007 futex(0x7fc7b4008b00, FUTEX_WAKE_PRIVATE, 1) = 0
          19:40:07.318110 rt_sigreturn({mask=[]}) = 202
          19:40:07.318201 futex(0x7fc7bd8399d0, FUTEX_WAIT, 119, NULL) = ? ERESTARTSYS (To be restarted if SA_RESTART is set)
          19:40:07.364432 futex(0x7fc7bd8399d0, FUTEX_WAIT, 119, NULL) = ? ERESTARTSYS (To be restarted if SA_RESTART is set)
          19:40:07.497333 --- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=118, si_uid=0, si_status=0, si_utime=0, si_stime=2} ---
          19:40:07.497381 futex(0x7fc7bd8399d0, FUTEX_WAIT, 119, NULL) = ?
          19:40:07.812948 +++ exited with 129 +++
          

          strace.txt.118

          19:40:07.462836 --- SIGHUP {si_signo=SIGHUP, si_code=SI_USER, si_pid=0, si_uid=0} ---
          19:40:07.462897 --- SIGTERM {si_signo=SIGTERM, si_code=SI_USER, si_pid=0, si_uid=0} ---
          19:40:07.462966 --- SIGCONT {si_signo=SIGCONT, si_code=SI_USER, si_pid=0, si_uid=0} ---
          19:40:07.463013 alarm(0)                = 0
          19:40:07.463072 rt_sigreturn({mask=[HUP]}) = 0
          19:40:07.463134 alarm(0)                = 0
          19:40:07.463370 rt_sigreturn({mask=[]}) = -1 EINTR (Interrupted system call)
          19:40:07.464347 open("/proc/net/dev", O_RDONLY) = 5
          19:40:07.466061 fstat(5, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
          19:40:07.466147 read(5, "Inter-|   Receive               "..., 1024) = 446
          19:40:07.466432 close(5)                = 0
          19:40:07.467197 getsockopt(3, SOL_PACKET, PACKET_STATISTICS, {packets=5167, drops=0}, [8]) = 0
          19:40:07.467306 write(2, "5166 packets captured", 21) = -1 EIO (Input/output error)
          19:40:07.467435 write(2, "\n", 1)       = -1 EIO (Input/output error)
          19:40:07.467661 write(2, "5167 packets received by filter", 31) = -1 EIO (Input/output error)
          19:40:07.467723 write(2, "\n", 1)       = -1 EIO (Input/output error)
          19:40:07.467790 write(2, "0 packets dropped by kernel", 27) = -1 EIO (Input/output error)
          19:40:07.467857 write(2, "\n", 1)       = -1 EIO (Input/output error)
          19:40:07.467920 setsockopt(3, SOL_PACKET, PACKET_RX_RING, {block_size=0, block_nr=0, frame_size=0, frame_nr=0}, 16) = -1 EINVAL (Invalid argument)
          19:40:07.467985 munmap(0x7f50a7ec8000, 2097152) = 0
          19:40:07.468287 munmap(0x7f50a8ee5000, 266240) = 0
          19:40:07.468739 close(3)                = 0
          19:40:07.496796 write(4, "\314\205@\4#\222\237\257\366\344\370hX\357$A\35\350\242\311\232-\35\27\334\20Q\377\314\346\233\1"..., 2442) = 2442
          19:40:07.496951 exit_group(0)           = ?
          19:40:07.497192 +++ exited with 0 +++
          

          strace.txt.153

          19:40:07.361553 --- SIGHUP {si_signo=SIGHUP, si_code=SI_USER, si_pid=0, si_uid=0} ---
          19:40:07.364313 +++ killed by SIGHUP +++
          

          The weird part is that the SIGHUP is coming from pid 0, which is the scheduler; I don't think I've seen that before. Looking at the docker documentation, it seems that a docker stop first sends a SIGTERM, then a SIGKILL after a grace period. I have not seen any documented process that sends a SIGHUP (apart from sending a custom signal via the docker CLI, where you can force it to be SIGHUP).

          a) This aligns with the behavior we have seen. Java exits with 129, which is 128 + the signal number; signal 1 is SIGHUP (see the quick check after this list).
          b) tcpdump also received a SIGHUP from pid 0, printed some of its counters, e.g. "5166 packets captured", to the console, and then exited successfully.
          c) The jenkins "Execute Shell" forked process got a SIGHUP too and died.
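
          A quick shell check of the 128 + signal arithmetic (purely illustrative):

          # a child killed by SIGHUP (signal 1) reports exit status 128 + 1 = 129 to its parent
          bash -c 'kill -HUP $$'; echo $?   # prints 129
          kill -l 1                         # prints HUP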

          I have another 30+ strace files from that session. A lot of them seem to be forked java processes that also exit with 129. Let me know if any of them would be useful data for this ticket. I have bundled them all in a tar file, but I don't want to post the bundle on a public ticket.


          Jesse Glick added a comment -

          I cannot think of any reason offhand why you would get SIGHUP.

          Note that if you suspect problems with WebSocket, you can check the server behavior using variants of

          websocat -vv -t wss://user:apitoken@jenkins/wsecho/
          


          Samuel Beaulieu added a comment -

          Thanks for the tip. I do not think at this point that this is a websocket-specific issue; it just happens to be our setup, and the issue presents itself when using the JNLP port too.
          Our next step in the investigation is to run the jenkins servers in kubernetes so that they are close to the agents and see if the issue is still present.


          Samuel Beaulieu added a comment -

          We found out that the k8s nodes were being removed from the cluster because they were preemptible instances.
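
          For anyone hitting the same thing on GKE, a quick way to confirm that agent pods are landing on preemptible nodes (a sketch; the label is the one GKE puts on preemptible node pools):

          # preemptible GKE nodes carry the label cloud.google.com/gke-preemptible=true
          kubectl get nodes -L cloud.google.com/gke-preemptible
          # show which node each agent pod was scheduled on (namespace is a placeholder)
          kubectl get pods -o wide -n <agent-namespace>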


          Jesse Glick added a comment -

          Hmm. An issue for kubernetes-plugin perhaps, to add appropriate labels or something?
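
          In the meantime, agent pods can be kept off preemptible nodes with a node affinity in the pod template's raw pod YAML (a sketch, assuming the standard GKE preemptible label; merge into the kubernetes plugin pod template):

          # raw pod YAML fragment for the agent pod template (illustrative)
          spec:
            affinity:
              nodeAffinity:
                requiredDuringSchedulingIgnoredDuringExecution:
                  nodeSelectorTerms:
                  - matchExpressions:
                    - key: cloud.google.com/gke-preemptible
                      operator: DoesNotExist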


          Jesse Glick added a comment -

          And then there is the diagnosis aspect. I wonder if https://www.jenkins.io/projects/gsoc/2021/project-ideas/remoting-monitoring/ would help make it more apparent what is going on.


            Assignee: Jeff Thompson (jthompson)
            Reporter: Samuel Beaulieu (sbeaulie)
            Votes: 0
            Watchers: 4
