
Jenkins agent disconnects on k8s with SIGHUP / ClosedChannelException

    • Type: Bug
    • Resolution: Not A Defect
    • Priority: Major
    • jenkins instance:
      jenkins core 2.263.1
      CentOS Linux 7 (Core)
      kubernetes plugin 1.28.4

      jenkins agent remoting VERSION=4.6
      -websocket flag passed to jenkins agent

      I get intermittent agent disconnects while builds are running. I'll try to provide as much info as I can; let me know what else I can check.

       

      • Jenkins master runs Java 11 (java-11-openjdk-11.0.5.10) and is started with hudson.slaves.ChannelPinger.pingIntervalSeconds=30 in order to avoid disconnects (see the launch sketch after this list)
      • An Nginx reverse proxy is in use with an SSL timeout of 5 minutes, which was too close to the default hudson.slaves.ChannelPinger.pingIntervalSeconds; lowering the ping interval to 30 seconds gave good results and reduced the number of disconnects per day (that stack trace was different and did not show a SIGHUP)
      • jenkins masters are on premise
      • jenkins agents are in GKE GCP kubernetes version 1.16.5
      • The jenkins agent container image has the default java -version:
        openjdk version "1.8.0_232"
        OpenJDK Runtime Environment (build 1.8.0_232-8u232-b09-1~deb9u1-b09)
        OpenJDK 64-Bit Server VM (build 25.232-b09, mixed mode)
      • remoting VERSION=4.6
      • The -websocket flag is passed to the jenkins agent via the k8s plugin extra CLI arguments. I noticed afterwards that there is a WebSocket checkbox in the kubernetes plugin config, but I couldn't find docs to go with it; should I switch to using that?
      • In terms of sizing, we peak at about 400 jenkins agents / pods connected at a time; the limit is set to 500 in the jenkins kubernetes plugin configuration
      • The issue happens even when load is low
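
      For reference, a minimal sketch of how these two pieces are wired up on our side (the jenkins.war path, agent URL, secret and agent name below are placeholders, not the real values):

        # controller JVM: shorten the remoting ping interval (system property, set before -jar)
        java -Dhudson.slaves.ChannelPinger.pingIntervalSeconds=30 -jar /usr/share/jenkins/jenkins.war

        # agent container: inbound agent, websocket flag passed via the k8s plugin extra CLI args
        java -cp /usr/share/jenkins/slave.jar hudson.remoting.jnlp.Main \
          -headless -url https://jenkins.example.com/ -websocket <secret> <agent-name>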

      The connection is established fine, but intermittently gets disconnected. Let me know what else I can look at.

       

      Stack trace:

       

      SignalException: SIGHUP
      FATAL: command execution failed
      java.nio.channels.ClosedChannelException
      	at jenkins.agents.WebSocketAgents$Session.closed(WebSocketAgents.java:141)
      	at jenkins.websocket.WebSocketSession.onWebSocketSomething(WebSocketSession.java:91)
      	at com.sun.proxy.$Proxy105.onWebSocketClose(Unknown Source)
      	at org.eclipse.jetty.websocket.common.events.JettyListenerEventDriver.onClose(JettyListenerEventDriver.java:149)
      	at org.eclipse.jetty.websocket.common.WebSocketSession.callApplicationOnClose(WebSocketSession.java:394)
      	at org.eclipse.jetty.websocket.common.io.AbstractWebSocketConnection.disconnect(AbstractWebSocketConnection.java:316)
      	at org.eclipse.jetty.websocket.common.io.DisconnectCallback.succeeded(DisconnectCallback.java:42)
      	at org.eclipse.jetty.websocket.common.io.AbstractWebSocketConnection$CallbackBridge.writeSuccess(AbstractWebSocketConnection.java:86)
      	at org.eclipse.jetty.websocket.common.io.FrameFlusher.notifyCallbackSuccess(FrameFlusher.java:359)
      	at org.eclipse.jetty.websocket.common.io.FrameFlusher.succeedEntries(FrameFlusher.java:288)
      	at org.eclipse.jetty.websocket.common.io.FrameFlusher.succeeded(FrameFlusher.java:280)
      	at org.eclipse.jetty.io.WriteFlusher.write(WriteFlusher.java:293)
      	at org.eclipse.jetty.io.AbstractEndPoint.write(AbstractEndPoint.java:381)
      	at org.eclipse.jetty.websocket.common.io.FrameFlusher.flush(FrameFlusher.java:264)
      	at org.eclipse.jetty.websocket.common.io.FrameFlusher.process(FrameFlusher.java:193)
      	at org.eclipse.jetty.util.IteratingCallback.processing(IteratingCallback.java:241)
      	at org.eclipse.jetty.util.IteratingCallback.iterate(IteratingCallback.java:223)
      	at org.eclipse.jetty.websocket.common.io.AbstractWebSocketConnection.outgoingFrame(AbstractWebSocketConnection.java:581)
      	at org.eclipse.jetty.websocket.common.io.AbstractWebSocketConnection.close(AbstractWebSocketConnection.java:181)
      	at org.eclipse.jetty.websocket.common.io.AbstractWebSocketConnection.onFillable(AbstractWebSocketConnection.java:510)
      	at org.eclipse.jetty.websocket.common.io.AbstractWebSocketConnection.onFillable(AbstractWebSocketConnection.java:440)
      	at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:311)
      	at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:105)
      	at org.eclipse.jetty.io.ChannelEndPoint$1.run(ChannelEndPoint.java:104)
      	at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.runTask(EatWhatYouKill.java:336)
      	at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.doProduce(EatWhatYouKill.java:313)
      	at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.tryProduce(EatWhatYouKill.java:171)
      	at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.run(EatWhatYouKill.java:129)
      	at org.eclipse.jetty.util.thread.ReservedThreadExecutor$ReservedThread.run(ReservedThreadExecutor.java:375)
      	at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:773)
      	at org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.run(QueuedThreadPool.java:905)
      	at java.base/java.lang.Thread.run(Thread.java:834)


          Jesse Glick added a comment -

          Is that an old assumption from back in the day

          Probably.

          Would a higher load make the response time return a skewed number?

          Potentially yes—either because the actual network throughput cannot handle the traffic, or the controller JVM is not able to keep up with all the work, or the controller is generally OK but for some reason is not processing this node monitor in a timely fashion, or the fix of JENKINS-18671 was incorrect, etc.

          I'm asking because, from a workload perspective, the agent is performing as expected and does not seem to be lagging.

          Then try disabling this monitor in https://jenkins/computer/configure and see if the problem clears up. This and other monitors should arguably just be suppressed on Cloud agents: the whole system of node monitors only makes sense in the context of a modest-sized pool of static agents that an admin is directly managing.


          Samuel Beaulieu added a comment -

          Thank you, it was easy enough to disable. Initially I was not expecting it to initiate a disconnect as opposed to just marking the agent offline. I have since removed all the monitoring actions, just to make sure, but I have not seen any improvement in the original disconnect issue.

          Since then I have added the support plugin in order to save Loggers to disk, and I am tracking hudson.slaves.ChannelPinger and hudson.remoting.PingThread, thinking that maybe the PingThread would initiate the disconnect.

          I am also able to reproduce the issue on a dedicated test jenkins server that I have moved back to using JNLP port 5006 for the channel instead of websocket.

          For now, even under high load, all pings seem to respond in less than 1 second, with the ping interval set to check every 30s. The PingThread sees that the channel gets closed at some point between two checks.

          2021-02-15 18:26:58.049+0000 [id=33477]	FINE	hudson.remoting.PingThread#ping: ping succeeded on JNLP4-connect connection from 10.236.113.8/10.236.113.8:50314
          2021-02-15 18:27:28.035+0000 [id=33477]	FINE	hudson.remoting.PingThread#ping: pinging JNLP4-connect connection from 10.236.113.8/10.236.113.8:50314
          2021-02-15 18:27:28.035+0000 [id=33477]	FINE	hudson.remoting.PingThread#ping: waiting 239s on JNLP4-connect connection from 10.236.113.8/10.236.113.8:50314
          2021-02-15 18:27:28.048+0000 [id=33477]	FINE	hudson.remoting.PingThread#ping: ping succeeded on JNLP4-connect connection from 10.236.113.8/10.236.113.8:50314
          2021-02-15 18:27:40.946+0000 [id=33623]	FINE	hudson.slaves.ChannelPinger$2#onClosed: Terminating ping thread for JNLP4-connect connection from 10.236.113.8/10.236.113.8:50314
          2021-02-15 18:27:40.946+0000 [id=33477]	FINE	hudson.remoting.PingThread#run: Ping thread for channel hudson.remoting.Channel@24534fbc:JNLP4-connect connection from 10.236.113.8/10.236.113.8:50314 is interrupted. Terminating
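
          For reference, a sketch of how the same FINE logging could be made persistent with a plain java.util.logging config file on the controller, instead of the support plugin UI (the file path is a placeholder, and this may need adjusting depending on how the controller is launched):

          # /var/lib/jenkins/remoting-logging.properties (hypothetical path)
          handlers=java.util.logging.ConsoleHandler
          java.util.logging.ConsoleHandler.level=FINE
          hudson.slaves.ChannelPinger.level=FINE
          hudson.remoting.PingThread.level=FINE

          # pass it to the controller JVM at startup
          java -Djava.util.logging.config.file=/var/lib/jenkins/remoting-logging.properties -jar jenkins.war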
           

          I have added tcpdump on both sides of the connection. From the agent's perspective the capture is incomplete, as Wireshark warns that the capture was cut short in the middle of a packet. From the server I can clearly see a FIN ACK coming from the agent, to which we reply FIN ACK, and we get the final ACK back. This is a normal, orderly TCP connection termination. I still don't know what causes it, but it appears the smoking gun is not on the network side. Something at a higher level is perhaps terminating the process, which in turn closes the connection, but I have not found it.
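
          As a sanity check on which side closes first, the FIN/RST segments can be pulled straight out of the existing captures (capture.cap is a placeholder for the actual capture file):

          # show only FIN or RST segments from the capture
          tcpdump -nn -r capture.cap 'tcp[tcpflags] & (tcp-fin|tcp-rst) != 0'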

          Any other library I should try to add in the Loggers for investigation? The whole remoting?


          Jesse Glick added a comment -

          That, and jenkins.agents.WebSocketAgents I suppose.


          Jesse Glick added a comment -

          Can you check whether adding -Djenkins.websocket.pingInterval=5 (or some other value less than the default of 30) to controller launch options helps?


          Samuel Beaulieu added a comment - edited

          Thank you. I have turned it on in our test instance, currently monitoring. Will put that in production as soon as I can get a maintenance window in.

          I notice there is nothing that triggers disconnects in the code behind jenkins.websocket.pingInterval, so from this perspective it's more of a keep-alive to make sure nothing on the network layer tears the connection down.
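
          For reference, a minimal sketch of where the flag goes (the jenkins.war path is a placeholder for how the controller is actually launched):

          # controller launch options, value in seconds (default is 30)
          java -Djenkins.websocket.pingInterval=5 -jar /usr/share/jenkins/jenkins.war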

           

          I have more information from the strace logs I have been running in our test instance:

          This is the process space before we sleep for an hour and wait for the disconnect to happen (it does not always happen, but we capture when it does)

          *13:05:03* UID          PID    PPID  C STIME TTY          TIME CMD
          *13:05:03* root           1       0 31 19:04 pts/0    00:00:04 strace -ff -tt -o /home/jenkins/agent/strace.txt entrypoint
          *13:05:03* root           7       1 66 19:04 pts/0    00:00:09 java -javaagent:/var/lib/jenkins/jSSLKeyLog.jar=/home/jenkins/agent/FOO.txt -Dorg.jenkinsci.remoting.engine.JnlpProtocol3.disabled=true -cp /usr/share/jenkins/slave.jar hudson.remoting.jnlp.Main -headless -url FOO -workDir /home/jenkins/agent 4d54c5f996f9da7c3ebad35ad60617c1e489e49f0d735ddc2f27189a0ed7623f jnlp-f64z8
          *13:05:03* root         118       7  0 19:04 pts/0    00:00:00 tcpdump -s0 -i eth0 -w capture_jnlp-f64z8.cap
          *13:05:03* root         153       7  0 19:05 pts/0    00:00:00 /bin/bash -ex /tmp/jenkins11227328797915352496.sh
          *13:05:03* root         165     153  0 19:05 pts/0    00:00:00 ps -ef
           

          strace -ff follows the forks and saves each trace to a file with the pid as a suffix, e.g. strace.txt.7 for the java agent process. The entrypoint is based on https://github.com/jenkinsci/docker-inbound-agent/blob/ad12874fe5567e9c9f197144a767515b44683d9c/11/debian/Dockerfile and sources /usr/local/bin/jenkins-agent.
          Other home-made custom images also suffer from the disconnect, so at this point I do not think it is related to the image itself.
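
          A quick way to pick out which of the per-pid strace files actually recorded the signal or the 129 exit status (paths as in the ps output above):

          grep -l 'SIGHUP' /home/jenkins/agent/strace.txt.*
          grep -l 'exited with 129' /home/jenkins/agent/strace.txt.*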

          Here are the relevant last lines of each strace:

          strace.txt.7

          19:40:07.317915 --- SIGHUP {si_signo=SIGHUP, si_code=SI_USER, si_pid=0, si_uid=0} ---
          19:40:07.318007 futex(0x7fc7b4008b00, FUTEX_WAKE_PRIVATE, 1) = 0
          19:40:07.318110 rt_sigreturn({mask=[]}) = 202
          19:40:07.318201 futex(0x7fc7bd8399d0, FUTEX_WAIT, 119, NULL) = ? ERESTARTSYS (To be restarted if SA_RESTART is set)
          19:40:07.364432 futex(0x7fc7bd8399d0, FUTEX_WAIT, 119, NULL) = ? ERESTARTSYS (To be restarted if SA_RESTART is set)
          19:40:07.497333 --- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=118, si_uid=0, si_status=0, si_utime=0, si_stime=2} ---
          19:40:07.497381 futex(0x7fc7bd8399d0, FUTEX_WAIT, 119, NULL) = ?
          19:40:07.812948 +++ exited with 129 +++
          

          strace.txt.118

          19:40:07.462836 --- SIGHUP {si_signo=SIGHUP, si_code=SI_USER, si_pid=0, si_uid=0} ---
          19:40:07.462897 --- SIGTERM {si_signo=SIGTERM, si_code=SI_USER, si_pid=0, si_uid=0} ---
          19:40:07.462966 --- SIGCONT {si_signo=SIGCONT, si_code=SI_USER, si_pid=0, si_uid=0} ---
          19:40:07.463013 alarm(0)                = 0
          19:40:07.463072 rt_sigreturn({mask=[HUP]}) = 0
          19:40:07.463134 alarm(0)                = 0
          19:40:07.463370 rt_sigreturn({mask=[]}) = -1 EINTR (Interrupted system call)
          19:40:07.464347 open("/proc/net/dev", O_RDONLY) = 5
          19:40:07.466061 fstat(5, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
          19:40:07.466147 read(5, "Inter-|   Receive               "..., 1024) = 446
          19:40:07.466432 close(5)                = 0
          19:40:07.467197 getsockopt(3, SOL_PACKET, PACKET_STATISTICS, {packets=5167, drops=0}, [8]) = 0
          19:40:07.467306 write(2, "5166 packets captured", 21) = -1 EIO (Input/output error)
          19:40:07.467435 write(2, "\n", 1)       = -1 EIO (Input/output error)
          19:40:07.467661 write(2, "5167 packets received by filter", 31) = -1 EIO (Input/output error)
          19:40:07.467723 write(2, "\n", 1)       = -1 EIO (Input/output error)
          19:40:07.467790 write(2, "0 packets dropped by kernel", 27) = -1 EIO (Input/output error)
          19:40:07.467857 write(2, "\n", 1)       = -1 EIO (Input/output error)
          19:40:07.467920 setsockopt(3, SOL_PACKET, PACKET_RX_RING, {block_size=0, block_nr=0, frame_size=0, frame_nr=0}, 16) = -1 EINVAL (Invalid argument)
          19:40:07.467985 munmap(0x7f50a7ec8000, 2097152) = 0
          19:40:07.468287 munmap(0x7f50a8ee5000, 266240) = 0
          19:40:07.468739 close(3)                = 0
          19:40:07.496796 write(4, "\314\205@\4#\222\237\257\366\344\370hX\357$A\35\350\242\311\232-\35\27\334\20Q\377\314\346\233\1"..., 2442) = 2442
          19:40:07.496951 exit_group(0)           = ?
          19:40:07.497192 +++ exited with 0 +++
          

          strace.txt.153

          19:40:07.361553 --- SIGHUP {si_signo=SIGHUP, si_code=SI_USER, si_pid=0, si_uid=0} ---
          19:40:07.364313 +++ killed by SIGHUP +++
          

          The weird part is that the SIGHUP is coming from pid 0, which is the scheduler; I don't think I've seen that before. Looking at the Docker documentation, it seems a docker stop first sends a SIGTERM, then a SIGKILL after a grace period. I have not found any documented process that sends a SIGHUP (apart from sending a custom signal via the docker CLI, where you can force it to be SIGHUP).

          a) This aligns with the behavior we have seen. Java exits with 129, which is 128 + the signal number, and signal 1 is SIGHUP (quick check after this list).
          b) The tcpdump process also receives a SIGHUP from pid 0, prints some of its statistics, e.g. "5166 packets captured", to the console and then exits successfully.
          c) The jenkins "Execute Shell" forked process gets a SIGHUP too and dies.

          I have another 30+ strace files from that session. A lot of them seem to be forked java processes that also exit with 129. Let me know if any of them would be useful data for this ticket. I have bundled them all in a tar file, but I don't want to post that bundle on a public ticket.


          Jesse Glick added a comment -

          I cannot think of any reason offhand why you would get SIGHUP.

          Note that if you suspect problems with WebSocket, you can check the server behavior using variants of

          websocat -vv -t wss://user:apitoken@jenkins/wsecho/
          


          Samuel Beaulieu added a comment -

          Thanks for the tip. I do not think at this point that the issue is specific to WebSocket; it just happens to be our setup, and the issue presents itself when using the JNLP port too.
          Our next step is to run the jenkins servers in kubernetes so that they are close to each other and see if the issue is still present.


          Samuel Beaulieu added a comment -

          We found out that the k8s nodes were being removed from the cluster because they were preemptible instances.
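
          For anyone else chasing this, a couple of commands that can confirm it (assuming the standard GKE preemptible node label; adjust to your cluster):

          # list nodes and show the GKE preemptible label column
          kubectl get nodes -L cloud.google.com/gke-preemptible

          # watch for node removal events around the time of a disconnect
          kubectl get events --field-selector involvedObject.kind=Node --watch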


          Jesse Glick added a comment -

          Hmm. An issue for kubernetes-plugin perhaps, to add appropriate labels or something?


          Jesse Glick added a comment -

          And then there is the diagnosis aspect. I wonder if https://www.jenkins.io/projects/gsoc/2021/project-ideas/remoting-monitoring/ would help make it more apparent what is going on.


            Assignee: Jeff Thompson (jthompson)
            Reporter: Samuel Beaulieu (sbeaulie)
            Votes: 0
            Watchers: 4