[JENKINS-69955] WebSocketTimeoutException: Connection Idle Timeout

    • Fixed in: 2.395

      We first experienced websocket connections being closed unexpectedly on Jenkins 2.361.1 LTS. The problem was reported in JENKINS-69509, and Jenkins 2.375 was subsequently released to address it. We tried Jenkins 2.375 and found the websocket problem still present: the websocket was closed less than 2 hours after the build started. All the necessary logs are attached.

      Reverting to Jenkins 2.346.3 LTS is a workaround that works for us.

      How to Reproduce

      • Start Jenkins 2.361.x or later with -Djenkins.websocket.pingInterval=120
      • Connect a Websocket agent
        --> Notice that the websocket agent disconnects/reconnects at every ping

      An interval of 120 is a way to consistently see the error, though it should happen with any value > 30. It may also happen with the default of 30, but with a lower likelihood.
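      A minimal command-line sketch of these steps (an editor's illustration: it assumes a locally downloaded jenkins.war and agent.jar, and the agent name, secret, and port are placeholders):

          # Controller: ping every 120 s, well beyond Jetty's default 30 s idle timeout
          java -Djenkins.websocket.pingInterval=120 -jar jenkins.war --httpPort=8080
          # Agent: connect over WebSocket instead of the inbound TCP port
          java -jar agent.jar -url http://localhost:8080/ -webSocket -name test-agent -secret <secret>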


          Basil Crow added a comment -

          Could you try running the controller with --httpKeepAliveTimeout=120000 (or --httpsKeepAliveTimeout=120000)? I also notice your agents are running an old version of Remoting (4.13.2); that is not likely to be related to this problem, but still worth upgrading to a more recent version.
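          For reference, a hedged sketch of what such a launch could look like, assuming the controller is started directly from jenkins.war (service wrappers pass these Winstone options through their own configuration instead):

              java -jar jenkins.war --httpPort=8080 --httpKeepAliveTimeout=120000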


          George Yu added a comment -

          I cannot experiment with it now, as our Jenkins server is in production. However, before I reverted our Jenkins version, I did try "--httpKeepAliveTimeout=3600000 --sessionEviction=-1 --sessionTimeout=90000". Strangely, with those settings the disconnect happened within a few minutes; without them, it happened within hours.


          Basil Crow added a comment -

          Without logs from when you were running with those settings, I cannot be of much help, I am afraid.


          George Yu added a comment - - edited

          Borrowed a new Jenkins server and an agent with Jenkins 2.375 installed, ran the job with the controller started with --httpKeepAliveTimeout=120000, and the build terminated after 12 minutes. Collected all logs, attached. The names of the five logs/system info files all contain "httpKeepAliveTimeout=120000".


          Dan Wang added a comment -

          Facing the same issue; hope this can be fixed.


          Basil Crow added a comment -

          sbc8112 Have you tried increasing the timeout to two (2) minutes?


          Dan Wang added a comment -

          Hi basil,

          I have tried updating to v2.377 and adding --httpKeepAliveTimeout=120000, but the abnormal disconnect still happened within hours. I will try downgrading to v2.346 and check whether that helps.


          Basil Crow added a comment -

          Downgrading to 2.346 is a dead end, as there is very little chance we would revert to Jetty 9 at this point. The question is how we can get users onto a stable deployment pattern on Jetty 10. A very small number of users are affected by this problem, and I suspect that if pings aren't making it through within 2 minutes, those users have other problems with networking and/or CPU saturation impacting networking. But ultimately this timeout is configurable, so there should be a guaranteed workaround for anyone affected: just set the timeout to an extremely high value (e.g. 86,400,000 milliseconds, which is 24 hours).
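          For reference, 24 hours is 24 × 60 × 60 × 1000 = 86,400,000 ms. A hedged launch example, again assuming a direct jenkins.war start:

              java -jar jenkins.war --httpKeepAliveTimeout=86400000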


          jenkinsivo added a comment -

          We have the same problem; the connection stops at some point and returns a "channel is already closed" message.

          We have already tried to solve this on our own, in vain. It would be nice if this got a little more attention.


          Basil Crow added a comment -

          "It would be nice if this got a little more attention."

          jenkinsivo It would be nice if you could read JENKINS-69955 (comment).


          Dan Wang added a comment -

          Earlier I wrote: "I have tried updating to v2.377 and adding --httpKeepAliveTimeout=120000, but the abnormal disconnect still happened within hours. I will try downgrading to v2.346 and check whether that helps."

          Same as gyu's test: there is no connection issue on v2.346.


          Basil Crow added a comment -

          As in my reply to George Yu, I have to reiterate that downgrading to 2.346 is an exercise in futility compared to applying the workaround I described previously.


          jenkinsivo added a comment -

          We set the keepalive to 30 seconds in the config yesterday, but the error still occurs, so 86400000 is the next step. To be continued.

          I'm not going to downgrade immediately; that is a version from May this year, which would take us very far back in time.


          jenkinsivo added a comment -

          The keepalive is set to 86400000, but it looks like the issue still occurs, unfortunately. Any other suggestions? We prefer not to downgrade, basil.


          George Yu added a comment - - edited

          I also tried a big keepalive number and the disconnects still occur, as stated in my comment on 2022-10-26. Note that when the disconnects occurred, there were no network performance issues and no network delay.


          jenkinsivo added a comment -

          Indeed, on our side there are no network issues either; this occurs in the software. It would be nice if you could take a look at this, basil.


          Basil Crow added a comment -

          I changed the keepalive value to 86400000 and confirmed in a debugger that the new value was being set and enforced on my local machine, so I think some other problem must be going on if setting the keepalive to 86400000 isn't working for you.

          jenkinsivo Please stop pinging me if you are unwilling to provide logs or do any analysis.

          gyu Sorry, but I am out of ideas about how to help you here, as everything seems to be working as expected for me locally. If you can get the issue to reproduce, I would suggest that you attach a Java debugger to the controller and debug it yourself, or else provide instructions in this ticket about how to reproduce the problem from scratch. With that said, I am now unsubscribing from notifications to this thread.


          Nik Reiman added a comment -

          We have also been experiencing this issue ever since the Jetty 10 update in 2.361.1. I hadn't had time to really debug the issue until now, and as a result we've been pinned to 2.346.3. However, I have now set up a build cluster that mirrors our production environment, and I can easily reproduce the error there as well. Some observations:

          1. Setting `httpKeepAliveTimeout` didn't resolve the issue.
          2. We do have some builds that take multiple hours to run, but we also have many jobs that finish in just a few minutes. We observe disconnections on all types of nodes, regardless of the job duration.
          3. We have a variety of Linux, Mac, and Windows nodes, and we observe disconnections on all platforms.
          4. We observe many jobs that fail with this error:
             ERROR: Cannot resume build because FlowNode 32 for FlowHead 1 could not be loaded. This is expected to happen when using the PERFORMANCE_OPTIMIZED durability setting and Jenkins is not shut down cleanly. Consider investigating to understand if Jenkins was not shut down cleanly or switching to the MAX_SURVIVABILITY durability setting which should prevent this issue in most cases.
             However, I have not yet tried to change the pipeline durability setting. I'll try that and report back.
          5. Node disconnections do seem to be correlated with some type of build activity, though it is hard to determine exactly what. Below, I've pasted a graph of 24 hours of activity from my test environment. Note that the blue line (number of executors) shows disconnections when there are active builds running. When the cluster was idle, all nodes remained connected.

          As Basil is no longer watching this thread, I will avoid reaching out to him directly until I have more diagnostic information that I can provide.


          Nik Reiman added a comment - - edited

          Sorry, upon further testing, it seems that nodes disconnect even when idle.


          George Yu added a comment -

          I also observed that idle nodes got disconnected periodically. The other interesting thing is that if the node is running a program with a very long sleep (e.g. 60 minutes) in it, the node sometimes gets disconnected.


          Jim Sears added a comment -

          Hi nre_ableton, I have the exact same error and line numbers as you. I also see disconnects both while running jobs and while idle, so it is a relief to know I'm not alone looking for answers!

          Can you tell me how you graphed your executors and the queue in your post? 


          Nik Reiman added a comment -

          jimsears7 we use Prometheus to scrape various metrics from sources for hosts on our network. There is a Prometheus Jenkins plugin that provides metrics about queue length, executors, etc., which we install on our Jenkins controllers. Finally, we use Grafana to graph it all.

          It's a lot of stuff to set up just to generate a graph or two, but since we already had all of it in our infrastructure, it was relatively easy for me.


          Nik Reiman added a comment - - edited

          FWIW, the problem is still present in 2.375.2. I have more graphing data but it's similar to the above pictures, so I won't paste it here.


          Enrico Walther added a comment - - edited

          Hi, we are facing exactly the same issue (same log output on jobs etc.) on Windows 10 agent nodes with a Jenkins 2.375.1 controller.


          Daniel Beland added a comment -

          Hi,

          I don't want to add noise to the thread, but I just want to say that I've updated from 2.346.3 to 2.375.2 and I don't have any issues with my websocket agent.

          All our agents use TCP except one that we host externally in Azure, which uses websockets; it runs a single monitoring job every 10 minutes that takes 20 seconds to complete.

          Jenkins and the agent run in Docker (Linux host); the agent uses the image jenkins/inbound-agent:3077.vd69cf116da_6f-4.

          We have an nginx reverse proxy in front of Jenkins (a Docker container on the same network) and also the corporate reverse proxy, so it's not a direct websocket connection either.

          Looking at the agent container logs, it has only reconnected when I installed plugins and did a soft restart of Jenkins, which is now over 3 days ago.

          Hopefully you can use that info to help narrow down and find the problem, which seems to be specific to certain scenarios (I see some run it on Windows or have very long jobs).


          Nik Reiman added a comment -

          I just noticed that we are specifying `webSocket: true` in our Swarm Client config; I wonder if this has something to do with it... 🤔

          I'll run some more tests over the weekend to see.


          Nik Reiman added a comment -

          I can now confirm that the `webSocket: true` option in the Swarm Client plugin seems to have been the culprit! We just ran a test cluster for 4 days with no node disconnections. 🎉


          Allan BURDAJEWICZ added a comment - - edited

          Websocket agents seem to be intermittently disconnecting. This problem is reproducible in current weekly 2.391, even just locally:

          • Spin up a new Jenkins controller
          • Create an inbound Websocket agent
          • Start the websocket agent

          Wait until you see the agent disconnecting:

          Feb. 21, 2023 3:39:31 PM hudson.remoting.jnlp.Main$CuiListener status
          INFO: Connected
          Feb. 21, 2023 3:46:16 PM hudson.remoting.jnlp.Main$CuiListener status
          INFO: Read side closed
          Feb. 21, 2023 3:46:16 PM hudson.remoting.jnlp.Main$CuiListener status
          INFO: Terminated
          Feb. 21, 2023 3:46:26 PM hudson.remoting.jnlp.Main$CuiListener status
          INFO: Performing onReconnect operation.
          

          The controller shows the WebSocketTimeoutException:

          Feb. 21, 2023 3:46:16 PM jenkins.agents.WebSocketAgents$Session error
          WARNING: null
          org.eclipse.jetty.websocket.api.exceptions.WebSocketTimeoutException: Connection Idle Timeout
          	at org.eclipse.jetty.websocket.common.JettyWebSocketFrameHandler.convertCause(JettyWebSocketFrameHandler.java:524)
          	at org.eclipse.jetty.websocket.common.JettyWebSocketFrameHandler.onError(JettyWebSocketFrameHandler.java:258)
          	at org.eclipse.jetty.websocket.core.internal.WebSocketCoreSession.lambda$closeConnection$2(WebSocketCoreSession.java:284)
          	at org.eclipse.jetty.server.handler.ContextHandler.handle(ContextHandler.java:1468)
          	at org.eclipse.jetty.server.handler.ContextHandler.handle(ContextHandler.java:1487)
          	at org.eclipse.jetty.websocket.core.server.internal.AbstractHandshaker$1.handle(AbstractHandshaker.java:212)
          	at org.eclipse.jetty.websocket.core.internal.WebSocketCoreSession.closeConnection(WebSocketCoreSession.java:284)
          	at org.eclipse.jetty.websocket.core.internal.WebSocketCoreSession.lambda$sendFrame$7(WebSocketCoreSession.java:519)
          	at org.eclipse.jetty.util.Callback$3.succeeded(Callback.java:155)
          	at org.eclipse.jetty.websocket.core.internal.TransformingFlusher.notifyCallbackSuccess(TransformingFlusher.java:197)
          	at org.eclipse.jetty.websocket.core.internal.TransformingFlusher$Flusher.process(TransformingFlusher.java:154)
          	at org.eclipse.jetty.util.IteratingCallback.processing(IteratingCallback.java:232)
          	at org.eclipse.jetty.util.IteratingCallback.iterate(IteratingCallback.java:214)
          	at org.eclipse.jetty.websocket.core.internal.TransformingFlusher.sendFrame(TransformingFlusher.java:77)
          	at org.eclipse.jetty.websocket.core.internal.WebSocketCoreSession.sendFrame(WebSocketCoreSession.java:522)
          	at org.eclipse.jetty.websocket.core.internal.WebSocketCoreSession.close(WebSocketCoreSession.java:239)
          	at org.eclipse.jetty.websocket.core.internal.WebSocketCoreSession.processHandlerError(WebSocketCoreSession.java:371)
          	at org.eclipse.jetty.websocket.core.internal.WebSocketConnection.onIdleExpired(WebSocketConnection.java:233)
          	at org.eclipse.jetty.io.AbstractEndPoint.onIdleExpired(AbstractEndPoint.java:407)
          	at org.eclipse.jetty.io.IdleTimeout.checkIdleTimeout(IdleTimeout.java:170)
          	at org.eclipse.jetty.io.IdleTimeout.idleCheck(IdleTimeout.java:112)
          	at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
          	at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
          	at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304)
          	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
          	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
          	at java.base/java.lang.Thread.run(Thread.java:829)
          Caused by: org.eclipse.jetty.websocket.core.exception.WebSocketTimeoutException: Connection Idle Timeout
          	... 10 more
          

          ----

          I am not 100% sure this is Remoting. It looks like people have been hitting this since the move to Jetty 10 (jenkins.websocket.Jetty10Provider). I collected debug Jetty logs from the controller (agent name JENKINS-69955, disconnection detected at 3:46:16 PM), hopefully they can help: hudson.remoting-0.log.0, jenkins.agents.WebSocketAgents-0.log.0, org.jetty-0.log.0.

          I can only confirm that the default websocket connection timeout is 30 s, and per the Jetty logs, we exceed it:

          Feb. 21, 2023 3:46:16 PM org.eclipse.jetty.io.IdleTimeout checkIdleTimeout
          FINE: SocketChannelEndPoint@45b2acfd[{l=/127.0.0.1:8081,r=/127.0.0.1:63856,OPEN,fill=FI,flush=W,to=30003/30000}{io=1/1,kio=1,kro=1}]->[WebSocketConnection@47fa53ab[SERVER,p=Parser@d1a2f85[s=START,c=0,o=0x0,m=-,l=-1],f=Flusher@7e9adc28[PROCESSING][queueSize=0,aggregate=null],g=org.eclipse.jetty.websocket.core.internal.Generator@6ef93c39]] idle timeout check, elapsed: 30003 ms, remaining: -3 ms
          


          Dan Wang added a comment -

          Hi nre_ableton,

          Could you share some tips on where to add the "webSocket: true" option?


          Nik Reiman added a comment -

          sbc8112 it's an argument to the Swarm Client, in this case in a YAML configuration file. See https://github.com/jenkinsci/swarm-plugin#available-options. If you aren't using the Swarm Client, then you should check whatever protocol your agents use to connect. Also note that the solution (for me, anyway) was not to specify this option: we were using web sockets before, and now we are not.
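          For illustration, a hedged sketch of the relevant part of such a Swarm Client YAML config (the file name and values are placeholders; the keys mirror the options in the list linked above):

              # swarm-client.yml
              url: http://jenkins.example.com/
              name: my-agent
              webSocket: true   # omit this line to connect over the inbound TCP port instead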


          Olivier Lamy added a comment -
          idle timeout check, elapsed: 30003 ms, remaining: -3 ms

          This is a very plausible cause: Jetty has a default idle timeout of 30 s, and the Jenkins websocket layer sends a ping every 30 s by default (see https://github.com/jenkinsci/jenkins/blob/a3f31145e621ab0072bb872ecac93a2c6cbcbaae/core/src/main/java/jenkins/websocket/WebSocketSession.java#L58).

          So yup, the ping can arrive in time or not by a matter of a few milliseconds (in these logs it's 3 ms); it depends on the network and on whether you are lucky or not.

          A possible workaround is to start the Jenkins master with

           -Djenkins.websocket.pingInterval=15

          so that the ping interval is shorter than the Jetty idle timeout.

          The proper fix is to change the configuration of the Jetty websocket container to have a larger default idle timeout. That can be done around https://github.com/jenkinsci/jenkins/blob/a3f31145e621ab0072bb872ecac93a2c6cbcbaae/websocket/jetty10/src/main/java/jenkins/websocket/Jetty10Provider.java#L55 with something such as:

              JettyWebSocketServerContainer.getContainer(req.getServletContext()).setIdleTimeout(some duration);

          Per the Javadoc at https://github.com/eclipse/jetty.project/blob/b7075161d015ddce23fbf3db873d5f6b539f6a6b/jetty-io/src/main/java/org/eclipse/jetty/io/IdleTimeout.java#L29, "a check is then made to see when the last operation took place", so if nothing happens for 30 s on the established websocket connection, it times out.
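          Expanding that one-liner into a self-contained sketch (an editor's illustration, not the actual Jenkins patch: the class and method names are hypothetical, and the 5-minute value simply mirrors Jetty 9's old WebSocketPolicy default noted in a later comment):

              import java.time.Duration;

              import javax.servlet.ServletContext;

              import org.eclipse.jetty.websocket.server.JettyWebSocketServerContainer;

              class WebSocketIdleTimeoutConfig {
                  /**
                   * Raise the Jetty 10 websocket idle timeout so that it comfortably
                   * exceeds the Jenkins ping interval (30 s by default).
                   */
                  static void raiseIdleTimeout(ServletContext servletContext) {
                      JettyWebSocketServerContainer container =
                              JettyWebSocketServerContainer.getContainer(servletContext);
                      if (container != null) {
                          // 5 minutes mirrors Jetty 9's old WebSocketPolicy default.
                          container.setIdleTimeout(Duration.ofMinutes(5));
                      }
                  }
              }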


          Allan BURDAJEWICZ added a comment - - edited

          I can definitely reproduce with 2.361.1 by adjusting the websocket ping interval, and I cannot reproduce with 2.346.4.
          I updated the description with a reproduction scenario.

          IIUC the previous websocket timeout was 5 minutes, set by the WebSocketPolicy at https://github.com/eclipse/jetty.project/blob/jetty-9.4.48.v20220622/jetty-websocket/websocket-api/src/main/java/org/eclipse/jetty/websocket/api/WebSocketPolicy.java#L81-L86


          Hung added a comment -

          allan_burdajewicz Hi Allan, do you have any updates or a workaround for this issue?

          Currently I'm using Jenkins 2.375.1, and for various reasons I cannot roll back to Jenkins 2.346.3 LTS as suggested above.


          Olivier Lamy added a comment -

          leminhhung0110 There is a PR ready. In the meantime you can use the workaround
          -Djenkins.websocket.pingInterval=15
          or even less.


          Hung added a comment -

          olamy do you mean I should use this command when starting the Jenkins master: "jenkins restart -Djenkins.websocket.pingInterval=15"?


          Olivier Lamy added a comment -

          leminhhung0110 I have no idea what your script called jenkins is doing, but Jenkins needs to be started with the system property.
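          For illustration, a hedged example of passing the system property at startup, assuming a direct jenkins.war launch (packaged installations typically expose a Java-options setting for this instead); note that the -D flag must come before -jar so that it reaches the JVM rather than Jenkins:

              java -Djenkins.websocket.pingInterval=15 -jar jenkins.war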


            Assignee: allan_burdajewicz Allan BURDAJEWICZ
            Reporter: gyu George Yu
            Votes: 13
            Watchers: 25