JENKINS-64598

Jenkins agent disconnects on k8s with SIGHUP / ClosedChannelException


    Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Labels:
    • Environment:
      jenkins instance:
      jenkins core 2.263.1
      CentOS Linux 7 (Core)
      kubernetes plugin 1.28.4

      jenkins agent remoting VERSION=4.6
      -websocket flag passed to jenkins agent

      Description

      I get intermittent agent disconnects while a build is running. I'll try to provide as much info as I can; let me know what else I can check.

       

      • Jenkins master runs Java 11 (java-11-openjdk-11.0.5.10) and is started with hudson.slaves.ChannelPinger.pingIntervalSeconds=30 in order to avoid disconnects
      • An nginx reverse proxy is in use with an SSL timeout of 5 minutes, which was too close to the default hudson.slaves.ChannelPinger.pingIntervalSeconds; reducing the ping interval to 30 seconds gave good results and reduced the number of disconnects per day (those stack traces were different and did not show a SIGHUP). A sketch of both settings is shown below, after this list.
      • jenkins masters are on premise
      • jenkins agents are in GKE GCP kubernetes version 1.16.5
      • jenkins agent container image has default java -version
        openjdk version "1.8.0_232"
        OpenJDK Runtime Environment (build 1.8.0_232-8u232-b09-1~deb9u1-b09)
        OpenJDK 64-Bit Server VM (build 25.232-b09, mixed mode)
      • remoting VERSION=4.6
      • The -websocket flag is passed to the jenkins agent via the k8s plugin's extra CLI arguments. I noticed afterwards that there is a checkbox for WebSocket in the kubernetes plugin config, but I couldn't find docs to go with it; should I switch to using that?
      • In terms of sizing, we peak at about 400 jenkins agents / pods connected at a time; the limit is set to 500 in the jenkins kubernetes plugin configuration
      • The issue happens even when load is low
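
      For reference, a minimal sketch of the two tunings described above (the values, paths, and launch style here are illustrative assumptions, not copied from this instance):

      # Controller JVM: shorten the remoting ping interval via the system property named above
      java -Dhudson.slaves.ChannelPinger.pingIntervalSeconds=30 -jar jenkins.war

      # Nginx reverse proxy in front of the controller: keep proxied connections open
      # longer than the ping interval so idle agent channels are not torn down
      # (proxy_read_timeout / proxy_send_timeout are standard nginx directives;
      # the 300s value is only an example):
      #   proxy_read_timeout 300s;
      #   proxy_send_timeout 300s;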

      The connection is established fine, but intermittently gets disconnected. Let me know what else I can look at.

       

      Stack trace:

       

      SignalException: SIGHUP
      FATAL: command execution failed
      java.nio.channels.ClosedChannelException
      	at jenkins.agents.WebSocketAgents$Session.closed(WebSocketAgents.java:141)
      	at jenkins.websocket.WebSocketSession.onWebSocketSomething(WebSocketSession.java:91)
      	at com.sun.proxy.$Proxy105.onWebSocketClose(Unknown Source)
      	at org.eclipse.jetty.websocket.common.events.JettyListenerEventDriver.onClose(JettyListenerEventDriver.java:149)
      	at org.eclipse.jetty.websocket.common.WebSocketSession.callApplicationOnClose(WebSocketSession.java:394)
      	at org.eclipse.jetty.websocket.common.io.AbstractWebSocketConnection.disconnect(AbstractWebSocketConnection.java:316)
      	at org.eclipse.jetty.websocket.common.io.DisconnectCallback.succeeded(DisconnectCallback.java:42)
      	at org.eclipse.jetty.websocket.common.io.AbstractWebSocketConnection$CallbackBridge.writeSuccess(AbstractWebSocketConnection.java:86)
      	at org.eclipse.jetty.websocket.common.io.FrameFlusher.notifyCallbackSuccess(FrameFlusher.java:359)
      	at org.eclipse.jetty.websocket.common.io.FrameFlusher.succeedEntries(FrameFlusher.java:288)
      	at org.eclipse.jetty.websocket.common.io.FrameFlusher.succeeded(FrameFlusher.java:280)
      	at org.eclipse.jetty.io.WriteFlusher.write(WriteFlusher.java:293)
      	at org.eclipse.jetty.io.AbstractEndPoint.write(AbstractEndPoint.java:381)
      	at org.eclipse.jetty.websocket.common.io.FrameFlusher.flush(FrameFlusher.java:264)
      	at org.eclipse.jetty.websocket.common.io.FrameFlusher.process(FrameFlusher.java:193)
      	at org.eclipse.jetty.util.IteratingCallback.processing(IteratingCallback.java:241)
      	at org.eclipse.jetty.util.IteratingCallback.iterate(IteratingCallback.java:223)
      	at org.eclipse.jetty.websocket.common.io.AbstractWebSocketConnection.outgoingFrame(AbstractWebSocketConnection.java:581)
      	at org.eclipse.jetty.websocket.common.io.AbstractWebSocketConnection.close(AbstractWebSocketConnection.java:181)
      	at org.eclipse.jetty.websocket.common.io.AbstractWebSocketConnection.onFillable(AbstractWebSocketConnection.java:510)
      	at org.eclipse.jetty.websocket.common.io.AbstractWebSocketConnection.onFillable(AbstractWebSocketConnection.java:440)
      	at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:311)
      	at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:105)
      	at org.eclipse.jetty.io.ChannelEndPoint$1.run(ChannelEndPoint.java:104)
      	at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.runTask(EatWhatYouKill.java:336)
      	at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.doProduce(EatWhatYouKill.java:313)
      	at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.tryProduce(EatWhatYouKill.java:171)
      	at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.run(EatWhatYouKill.java:129)
      	at org.eclipse.jetty.util.thread.ReservedThreadExecutor$ReservedThread.run(ReservedThreadExecutor.java:375)
      	at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:773)
      	at org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.run(QueuedThreadPool.java:905)
      	at java.base/java.lang.Thread.run(Thread.java:834)

        Attachments

          Activity

          sbeaulie Samuel Beaulieu added a comment -

          Thanks for the tip. I do not think at this point that it is an issue specific to WebSockets; it just happens to be our setup, and the issue presents itself when using the JNLP port too.
          Our next investigation step is to run the jenkins servers in kubernetes so that they are close to each other, and see whether the issue is still present.

          jglick Jesse Glick added a comment -

          I cannot think of any reason offhand why you would get SIGHUP.

          Note that if you suspect problems with WebSocket, you can check the server behavior using variants of

          websocat -vv -t wss://user:apitoken@jenkins/wsecho/
          
          sbeaulie Samuel Beaulieu added a comment - edited

          Thank you. I have turned it on in our test instance, currently monitoring. Will put that in production as soon as I can get a maintenance window in.

          I notice there is nothing in the code around jenkins.websocket.pingInterval that would trigger a disconnect.

          So from this perspective it's more of a keep-alive to make sure nothing on the network layer tears the connection down.

           

          I have more information from the strace logs I have been running in our test instance:

          This is the process table before we sleep for an hour and wait for the disconnect to happen (it does not always happen, but we capture it when it does):

          *13:05:03* UID          PID    PPID  C STIME TTY          TIME CMD
          *13:05:03* root           1       0 31 19:04 pts/0    00:00:04 strace -ff -tt -o /home/jenkins/agent/strace.txt entrypoint
          *13:05:03* root           7       1 66 19:04 pts/0    00:00:09 java -javaagent:/var/lib/jenkins/jSSLKeyLog.jar=/home/jenkins/agent/FOO.txt -Dorg.jenkinsci.remoting.engine.JnlpProtocol3.disabled=true -cp /usr/share/jenkins/slave.jar hudson.remoting.jnlp.Main -headless -url 
          FOO -workDir /home/jenkins/agent 4d54c5f996f9da7c3ebad35ad60617c1e489e49f0d735ddc2f27189a0ed7623f jnlp-f64z8
          *13:05:03* root         118       7  0 19:04 pts/0    00:00:00 tcpdump -s0 -i eth0 -w capture_jnlp-f64z8.cap
          *13:05:03* root         153       7  0 19:05 pts/0    00:00:00 /bin/bash -ex /tmp/jenkins11227328797915352496.sh
          *13:05:03* root         165     153  0 19:05 pts/0    00:00:00 ps -ef
           

          strace -ff follows the forks and saves each trace to a file with the PID as suffix, e.g. strace.txt.7 for the java agent process. The entrypoint is based on https://github.com/jenkinsci/docker-inbound-agent/blob/ad12874fe5567e9c9f197144a767515b44683d9c/11/debian/Dockerfile and runs source /usr/local/bin/jenkins-agent.
          Other home-made custom images also suffer from the disconnect, so I do not think at this point that it is related to the image itself.
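
          Roughly how the capture is wired up, as a sketch (the entrypoint path and the kubectl step are assumptions for illustration, not the exact setup of this pod):

          # Wrap the agent entrypoint in strace so every fork gets its own
          # per-PID trace file (strace.txt.<pid>):
          strace -ff -tt -o /home/jenkins/agent/strace.txt /usr/local/bin/jenkins-agent

          # After a disconnect, pull the per-PID files out of the pod (if it is still around), e.g.:
          kubectl cp jnlp-f64z8:/home/jenkins/agent/ ./strace-dump/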

          Here are the relevant last lines of each strace:

          strace.txt.7

          19:40:07.317915 --- SIGHUP {si_signo=SIGHUP, si_code=SI_USER, si_pid=0, si_uid=0} ---
          19:40:07.318007 futex(0x7fc7b4008b00, FUTEX_WAKE_PRIVATE, 1) = 0
          19:40:07.318110 rt_sigreturn({mask=[]}) = 202
          19:40:07.318201 futex(0x7fc7bd8399d0, FUTEX_WAIT, 119, NULL) = ? ERESTARTSYS (To be restarted if SA_RESTART is set)
          19:40:07.364432 futex(0x7fc7bd8399d0, FUTEX_WAIT, 119, NULL) = ? ERESTARTSYS (To be restarted if SA_RESTART is set)
          19:40:07.497333 --- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=118, si_uid=0, si_status=0, si_utime=0, si_stime=2} ---
          19:40:07.497381 futex(0x7fc7bd8399d0, FUTEX_WAIT, 119, NULL) = ?
          19:40:07.812948 +++ exited with 129 +++
          

          strace.txt.118

          19:40:07.462836 --- SIGHUP {si_signo=SIGHUP, si_code=SI_USER, si_pid=0, si_uid=0} ---
          19:40:07.462897 --- SIGTERM {si_signo=SIGTERM, si_code=SI_USER, si_pid=0, si_uid=0} ---
          19:40:07.462966 --- SIGCONT {si_signo=SIGCONT, si_code=SI_USER, si_pid=0, si_uid=0} ---
          19:40:07.463013 alarm(0)                = 0
          19:40:07.463072 rt_sigreturn({mask=[HUP]}) = 0
          19:40:07.463134 alarm(0)                = 0
          19:40:07.463370 rt_sigreturn({mask=[]}) = -1 EINTR (Interrupted system call)
          19:40:07.464347 open("/proc/net/dev", O_RDONLY) = 5
          19:40:07.466061 fstat(5, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
          19:40:07.466147 read(5, "Inter-|   Receive               "..., 1024) = 446
          19:40:07.466432 close(5)                = 0
          19:40:07.467197 getsockopt(3, SOL_PACKET, PACKET_STATISTICS, {packets=5167, drops=0}, [8]) = 0
          19:40:07.467306 write(2, "5166 packets captured", 21) = -1 EIO (Input/output error)
          19:40:07.467435 write(2, "\n", 1)       = -1 EIO (Input/output error)
          19:40:07.467661 write(2, "5167 packets received by filter", 31) = -1 EIO (Input/output error)
          19:40:07.467723 write(2, "\n", 1)       = -1 EIO (Input/output error)
          19:40:07.467790 write(2, "0 packets dropped by kernel", 27) = -1 EIO (Input/output error)
          19:40:07.467857 write(2, "\n", 1)       = -1 EIO (Input/output error)
          19:40:07.467920 setsockopt(3, SOL_PACKET, PACKET_RX_RING, {block_size=0, block_nr=0, frame_size=0, frame_nr=0}, 16) = -1 EINVAL (Invalid argument)
          19:40:07.467985 munmap(0x7f50a7ec8000, 2097152) = 0
          19:40:07.468287 munmap(0x7f50a8ee5000, 266240) = 0
          19:40:07.468739 close(3)                = 0
          19:40:07.496796 write(4, "\314\205@\4#\222\237\257\366\344\370hX\357$A\35\350\242\311\232-\35\27\334\20Q\377\314\346\233\1"..., 2442) = 2442
          19:40:07.496951 exit_group(0)           = ?
          19:40:07.497192 +++ exited with 0 +++
          

          strace.txt.153

          19:40:07.361553 --- SIGHUP {si_signo=SIGHUP, si_code=SI_USER, si_pid=0, si_uid=0} ---
          19:40:07.364313 +++ killed by SIGHUP +++
          

          The weird part is that the SIGHUP is coming from pid 0, which is the scheduler; I don't think I've seen that before. Looking at the docker documentation, it seems that a docker stop would first send a SIGTERM, then a SIGKILL after a time period. I have not seen any documented process that sends a SIGHUP (apart from sending a custom signal via the docker CLI, where you can force it to be SIGHUP).

          a) This aligns with the behavior we have seen. Java exits with 129, which is 128 + the signal number, and signal 1 is SIGHUP (a quick shell check of this mapping follows this list).
          b) The tcpdump process also received SIGHUP from pid 0, printed some of its summary information, e.g. "5166 packets captured", to the console, and then exited successfully.
          c) The process forked by the Jenkins "Execute Shell" step gets a SIGHUP too and dies.
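
          As a quick sanity check of the 128 + signal-number convention (illustrative shell, not taken from the traces above): a process killed by SIGHUP (signal 1) reports exit status 129.

          sleep 60 &
          kill -HUP $!
          wait $!
          echo $?   # prints 129 = 128 + 1 (SIGHUP)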

          I have another 30+ strace files from that session. A lot of them seem to be forked java processes that also exit with 129. Let me know if any of them would be useful data for this ticket. I have bundled them all in a tar file, but I don't want to post that bundle on a public ticket.

          jglick Jesse Glick added a comment -

          Can you check whether adding -Djenkins.websocket.pingInterval=5 (or some other value less than the default of 30) to controller launch options helps?
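
          One way to pass that property on the controller, as an illustration (the /etc/sysconfig/jenkins path and the JENKINS_JAVA_OPTIONS variable assume the CentOS/RHEL package layout mentioned in the Environment field; they are not taken from this ticket):

          # RPM/sysconfig-style install: add the property to /etc/sysconfig/jenkins
          # (keep any options already present in this variable)
          JENKINS_JAVA_OPTIONS="-Djenkins.websocket.pingInterval=5"

          # or, when launching the war directly:
          java -Djenkins.websocket.pingInterval=5 -jar jenkins.war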

          jglick Jesse Glick added a comment -

          That, and jenkins.agents.WebSocketAgents I suppose.


            People

            Assignee:
            jthompson Jeff Thompson
            Reporter:
            sbeaulie Samuel Beaulieu
            Votes:
            0
            Watchers:
            4

              Dates

              Created:
              Updated: