Type: Bug
Resolution: Fixed
Priority: Blocker
Affected Version: Jenkins 2.361.1
Released As: 2.395
We first experienced the problem of websocket connections being closed unexpectedly in Jenkins 2.361.1 LTS. The problem was reported in JENKINS-69509, and Jenkins 2.375 was subsequently released to address the issue. We tried Jenkins 2.375 and found the websocket problem was still there: the websocket was closed less than 2 hours after the build started. All the necessary logs are attached.
Reverting to Jenkins 2.346.3 LTS is a workaround that works for us.
How to Reproduce
- Start Jenkins 2.361.x or later with -Djenkins.websocket.pingInterval=120
- Connect a Websocket agent
--> Notice that the websocket agent disconnects/reconnects at every ping
An interval of 120 is a way to consistently see the error, though it should happen with any value > 30. It may also happen with the default of 30, but with a lower likelihood.
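For reference, a minimal way to set this when launching a controller from the WAR (the port and paths here are illustrative):
java -Djenkins.websocket.pingInterval=120 -jar jenkins.war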
Attachments:
- Agent logs.txt (4 kB)
- hudson.remoting-0.log.0 (1.27 MB)
- org.jetty-0.log.0 (7.05 MB)
Is related to: JENKINS-70697 Improve Jetty10Provider initialization with ServletContainerInitializer (Open)
[JENKINS-69955] WebSocketTimeoutException: Connection Idle Timeout
Cannot experiment with it now as our Jenkins server is in production mode. However, before I reverted our Jenkins version, I did try "--httpKeepAliveTimeout=3600000 --sessionEviction=-1 --sessionTimeout=90000". Strangely, with those settings the disconnect happened within a few minutes; without them the disconnect happened within hours.
Without logs from when you were running with those settings I cannot be of much help, I am afraid.
Borrowed a new Jenkins server and an agent with Jenkins 2.375 installed, ran the job with the controller started with --httpKeepAliveTimeout=120000, and the build terminated after 12 min. Collected all logs, attached; the names of the five logs/system info files all contain "httpKeepAliveTimeout=120000".
Hi basil,
I have tried updating to v2.377 and adding --httpKeepAliveTimeout=120000, but the abnormal disconnect still happened within hours. I will try downgrading to v2.346 and check if that helps.
Downgrading to 2.346 is a dead end, as there is very little chance we would revert back to Jetty 9 at this point. The question is how can we get users onto a stable deployment pattern on Jetty 10. A very small number of users are affected by this problem, and I suspect if pings aren't making it through within 2 minutes that those users have other problems with networking and/or CPU saturation impacting networking. But ultimately this timeout is configurable, so there should be a guaranteed workaround for anyone affected: just set the timeout to an extremely high value (e.g. 86,400,000 milliseconds which is 24 hours).
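Concretely, assuming a controller launched from the WAR, that workaround would look something like:
java -jar jenkins.war --httpKeepAliveTimeout=86400000
(or --httpsKeepAliveTimeout=86400000 if Jenkins terminates HTTPS itself).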
We have the same problem: the connection stops at some point and returns a "channel is already closed" message.
We have already tried to solve this on our own, in vain. It would be nice if this got a little more attention.
jenkinsivo It would be nice if you could read JENKINS-69955 (comment).
> I have tried to update to v2.377 and add the --httpKeepAliveTimeout=120000 but the abnormal disconnect happened in hours. I will try to downgrade to v2.346 and check if this can help.
Same as gyu's test, there is no connection issue on v2.346.
And as in my reply to George Yu, I have to reiterate that downgrading to 2.346 is an exercise in futility compared to applying the workaround I described previously.
We set the keepalive in the config to 30 seconds yesterday, but the error still occurs, so 86400000 is the next step. To be continued.
I'm not going to downgrade immediately; that's a version from May this year, so we would be going very far back in time.
The keepalive is now set to 86400000, but it looks like the issue still occurs, unfortunately. Any other suggestions? We would prefer not to downgrade, basil.
I also tried a big keepalive number and the disconnects still occurred, as stated in my comment on 2022-10-26. Note that when the disconnects occurred there were no network performance issues and no network delay.
Indeed, on our side there are no network issues either; this occurs in the software.
basil, would you be able to take a look at this?
I changed the keepalive value to 86400000 and confirmed in a debugger that the new value was being set and enforced on my local machine, so I think some other problem must be going on if setting the keepalive to 86400000 isn't working for you.
jenkinsivo Please stop pinging me if you are unwilling to provide logs or do any analysis.
gyu Sorry, but I am out of ideas about how to help you here, as everything seems to be working as expected for me locally. If you can get the issue to reproduce, I would suggest that you attach a Java debugger to the controller and debug it yourself, or else provide instructions in this ticket for how to reproduce the problem from scratch. With that said, I am now unsubscribing from notifications to this thread.
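For anyone attempting the debugger suggestion, a sketch under standard JVM assumptions: start the controller with the stock JDWP agent and attach an IDE to the chosen port (5005 here is arbitrary):
java -agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=*:5005 -jar jenkins.war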
We are also experiencing this issue ever since the Jetty 10 update in 2.361.1. I haven't had time to really debug the issue until now, and as a result we've been pinned to 2.346.3. However, I have now set up a build cluster that mirrors our production environment, and I can easily reproduce the error there as well. Some observations:
- Setting `httpKeepAliveTimeout` didn't resolve the issue.
- We do have some builds that take multiple hours to run, but we also have many jobs that finish in just a few minutes. We observe disconnections on all types of nodes, regardless of the job duration.
- We have a variety of Linux, Mac, and Windows nodes, and we observe disconnections on all platforms.
- We observe many jobs that fail with this error:
ERROR: Cannot resume build because FlowNode 32 for FlowHead 1 could not be loaded. This is expected to happen when using the PERFORMANCE_OPTIMIZED durability setting and Jenkins is not shut down cleanly. Consider investigating to understand if Jenkins was not shut down cleanly or switching to the MAX_SURVIVABILITY durability setting which should prevent this issue in most cases.
However, I have not yet tried to change the pipeline durability setting. I'll try that and report back (a sketch of the setting follows this list).
- Node disconnections do seem to be correlated with some type of build activity, though it is hard to determine exactly what. Below, I've pasted a graph of 24 hours of activity from my test environment. Note that the blue line (number of executors) shows disconnections when there are active builds running. When the cluster was idle, all nodes remained connected.
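For reference, a hedged sketch of the durability change mentioned above: if I read the Pipeline docs correctly, the hint can be set per job at the top of a scripted Pipeline (it can also be set globally in the Jenkins configuration):
properties([durabilityHint('MAX_SURVIVABILITY')])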
As Basil is no longer watching this thread, I will avoid reaching out to him directly until I have more diagnostic information that I can provide.
Sorry, upon further testing, it seems that nodes disconnect even when idle.
I also observed idle nodes getting disconnected periodically. The other interesting thing is that if the node is running a program with a very long sleep (e.g. 60 minutes) in it, the node sometimes gets disconnected.
Hi nre_ableton, I have the exact same error and line numbers as you. I also have disconnects both while running jobs and while idle, so it is a relief to know I'm not alone in looking for answers!
Can you tell me how you graphed your executors and the queue in your post?
jimsears7 we use Prometheus to scrape various metrics from sources for hosts on our network. There is a Prometheus Jenkins plugin that provides metrics about queue length, executors, etc., which we install on our Jenkins controllers. Finally, we use Grafana to graph it all.
It's a lot of stuff to set up just to generate a graph or two, but since we already had all of this in our infrastructure, it was relatively easy for me.
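For reference, a minimal Prometheus scrape config for this kind of setup might look like the following; the host, port, and job name are placeholders, and the metrics path assumes the plugin's default:
scrape_configs:
  - job_name: 'jenkins'
    metrics_path: '/prometheus/'
    static_configs:
      - targets: ['jenkins.example.com:8080']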
FWIW, the problem is still present in 2.375.2. I have more graphing data but it's similar to the above pictures, so I won't paste it here.
Hi, we are facing exactly the same issue (same log output on jobs, etc.) on Windows 10 agent nodes with a Jenkins 2.375.1 controller.
Hi,
I don't want to add noise to the thread, but I just want to say that I've updated from 2.346.3 to 2.375.2 and I don't have any issues with my websocket agent.
All our agents use TCP except one that we host externally in Azure, which uses websockets; it runs a single monitoring job every 10 minutes that takes 20 seconds to complete.
Jenkins and the agent run in docker (Linux host), agent uses the image jenkins/inbound-agent:3077.vd69cf116da_6f-4.
We have an nginx reverse proxy in front of Jenkins (a docker container on the same network) and also the corporate reverse proxy, so it's not a direct websocket connection either.
Looking at the agent container logs, it has only reconnected when I installed plugins and did a soft restart of Jenkins, which was now over 3 days ago.
Hopefully you can use that info to help narrow it down and find the problem, which seems to be specific to certain scenarios (I see some people run it on Windows or have very long jobs).
I just noticed that we are specifying `webSocket: true` in our Swarm Client config, I wonder if this has something to do with it... 🤔
I'll run some more tests over the weekend to see.
I can now confirm that the `webSocket: true` option in the Swarm Client plugin seems to have been the culprit! We just ran a test cluster for 4 days with no node disconnections. 🎉
Websocket agents seem to be intermittently disconnecting. This problem is reproducible in current weekly 2.391, even just locally:
- Spin up a new Jenkins controller
- Create an inbound Websocket agent
- Start the websocket agent
Wait until you see the agent disconnecting:
Feb. 21, 2023 3:39:31 PM hudson.remoting.jnlp.Main$CuiListener status
INFO: Connected
Feb. 21, 2023 3:46:16 PM hudson.remoting.jnlp.Main$CuiListener status
INFO: Read side closed
Feb. 21, 2023 3:46:16 PM hudson.remoting.jnlp.Main$CuiListener status
INFO: Terminated
Feb. 21, 2023 3:46:26 PM hudson.remoting.jnlp.Main$CuiListener status
INFO: Performing onReconnect operation.
The controller shows the timeout exception:
Feb. 21, 2023 3:46:16 PM jenkins.agents.WebSocketAgents$Session error
WARNING: null
org.eclipse.jetty.websocket.api.exceptions.WebSocketTimeoutException: Connection Idle Timeout
	at org.eclipse.jetty.websocket.common.JettyWebSocketFrameHandler.convertCause(JettyWebSocketFrameHandler.java:524)
	at org.eclipse.jetty.websocket.common.JettyWebSocketFrameHandler.onError(JettyWebSocketFrameHandler.java:258)
	at org.eclipse.jetty.websocket.core.internal.WebSocketCoreSession.lambda$closeConnection$2(WebSocketCoreSession.java:284)
	at org.eclipse.jetty.server.handler.ContextHandler.handle(ContextHandler.java:1468)
	at org.eclipse.jetty.server.handler.ContextHandler.handle(ContextHandler.java:1487)
	at org.eclipse.jetty.websocket.core.server.internal.AbstractHandshaker$1.handle(AbstractHandshaker.java:212)
	at org.eclipse.jetty.websocket.core.internal.WebSocketCoreSession.closeConnection(WebSocketCoreSession.java:284)
	at org.eclipse.jetty.websocket.core.internal.WebSocketCoreSession.lambda$sendFrame$7(WebSocketCoreSession.java:519)
	at org.eclipse.jetty.util.Callback$3.succeeded(Callback.java:155)
	at org.eclipse.jetty.websocket.core.internal.TransformingFlusher.notifyCallbackSuccess(TransformingFlusher.java:197)
	at org.eclipse.jetty.websocket.core.internal.TransformingFlusher$Flusher.process(TransformingFlusher.java:154)
	at org.eclipse.jetty.util.IteratingCallback.processing(IteratingCallback.java:232)
	at org.eclipse.jetty.util.IteratingCallback.iterate(IteratingCallback.java:214)
	at org.eclipse.jetty.websocket.core.internal.TransformingFlusher.sendFrame(TransformingFlusher.java:77)
	at org.eclipse.jetty.websocket.core.internal.WebSocketCoreSession.sendFrame(WebSocketCoreSession.java:522)
	at org.eclipse.jetty.websocket.core.internal.WebSocketCoreSession.close(WebSocketCoreSession.java:239)
	at org.eclipse.jetty.websocket.core.internal.WebSocketCoreSession.processHandlerError(WebSocketCoreSession.java:371)
	at org.eclipse.jetty.websocket.core.internal.WebSocketConnection.onIdleExpired(WebSocketConnection.java:233)
	at org.eclipse.jetty.io.AbstractEndPoint.onIdleExpired(AbstractEndPoint.java:407)
	at org.eclipse.jetty.io.IdleTimeout.checkIdleTimeout(IdleTimeout.java:170)
	at org.eclipse.jetty.io.IdleTimeout.idleCheck(IdleTimeout.java:112)
	at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
	at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
	at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
	at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: org.eclipse.jetty.websocket.core.exception.WebSocketTimeoutException: Connection Idle Timeout
	... 10 more
I am not 100% sure this is Remoting. It looks like people have been hitting this since the move to Jetty 10 (jenkins.websocket.Jetty10Provider). I collected a debug Jetty log from the controller; hopefully that can help:
- Agent name: JENKINS-69955
- Disconnection detected at 3:46:16 PM
- hudson.remoting-0.log.0
- jenkins.agents.WebSocketAgents-0.log.0
- org.jetty-0.log.0
I can only confirm that the default websocket connection idle timeout is 30s, and per the Jetty logs we exceed it:
Feb. 21, 2023 3:46:16 PM org.eclipse.jetty.io.IdleTimeout checkIdleTimeout
FINE: SocketChannelEndPoint@45b2acfd[{l=/127.0.0.1:8081,r=/127.0.0.1:63856,OPEN,fill=FI,flush=W,to=30003/30000}{io=1/1,kio=1,kro=1}]->[WebSocketConnection@47fa53ab[SERVER,p=Parser@d1a2f85[s=START,c=0,o=0x0,m=-,l=-1],f=Flusher@7e9adc28[PROCESSING][queueSize=0,aggregate=null],g=org.eclipse.jetty.websocket.core.internal.Generator@6ef93c39]] idle timeout check, elapsed: 30003 ms, remaining: -3 ms
Hi nre_ableton,
Could you share some tips on where to add the "webSocket: true" option?
sbc8112 it's an argument to the Swarm Client, in this case in a YAML configuration file. See https://github.com/jenkinsci/swarm-plugin#available-options. If you aren't using the Swarm Client, then you should check whatever protocol your agents use to connect. Also note that the solution (for me, anyway) was not to specify this option: we were using web sockets before, and now we are not.
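For anyone checking their own Swarm Client YAML, the relevant entry looks roughly like this; in our case the fix was removing it (or setting it to false) so the agent falls back to an inbound TCP connection:
webSocket: false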
idle timeout check, elapsed: 30003 ms, remaining: -3 ms
This is very likely the reason: Jetty has a default idle timeout of 30s.
The websocket layer sends a ping every 30s by default (see https://github.com/jenkinsci/jenkins/blob/a3f31145e621ab0072bb872ecac93a2c6cbcbaae/core/src/main/java/jenkins/websocket/WebSocketSession.java#L58).
So this ping can arrive in time or miss by a matter of a few milliseconds (in these logs it's 3 ms), depending on the network and whether you are lucky or not.
A possible workaround is to start the Jenkins controller with
-Djenkins.websocket.pingInterval=15
so that the ping interval is shorter than the Jetty idle timeout.
The proper fix is to change the configuration of the Jetty websocket container to have a larger default idle timeout.
This can be done around here https://github.com/jenkinsci/jenkins/blob/a3f31145e621ab0072bb872ecac93a2c6cbcbaae/websocket/jetty10/src/main/java/jenkins/websocket/Jetty10Provider.java#L55
with something such as
JettyWebSocketServerContainer.getContainer(req.getServletContext()).setIdleTimeout(/* some duration */);
Per the Javadoc at https://github.com/eclipse/jetty.project/blob/b7075161d015ddce23fbf3db873d5f6b539f6a6b/jetty-io/src/main/java/org/eclipse/jetty/io/IdleTimeout.java#L29, "a check is then made to see when the last operation took place".
So if nothing happens for 30s on the established websocket connection, it is closed as idle.
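To make the suggestion above concrete, here is a minimal sketch, assuming it runs during Jetty10Provider's handshake handling with access to the incoming HttpServletRequest; the class name and the 5-minute value are illustrative, not the actual patch:

import java.time.Duration;
import javax.servlet.http.HttpServletRequest;
import org.eclipse.jetty.websocket.server.JettyWebSocketServerContainer;

class IdleTimeoutSketch {
    // Raise the websocket container's idle timeout so it comfortably
    // exceeds the 30s ping interval used by jenkins.websocket.WebSocketSession.
    static void raiseIdleTimeout(HttpServletRequest req) {
        JettyWebSocketServerContainer container =
                JettyWebSocketServerContainer.getContainer(req.getServletContext());
        container.setIdleTimeout(Duration.ofMinutes(5)); // illustrative value
    }
}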
I can definitely reproduce with 2.361.1 by adjusting the websocket ping interval. And I can't reproduce with 2.346.4.
Updated the description with a reproduction scenario.
IIUC the previous websocket idle timeout was 5 minutes, set by WebSocketPolicy at https://github.com/eclipse/jetty.project/blob/jetty-9.4.48.v20220622/jetty-websocket/websocket-api/src/main/java/org/eclipse/jetty/websocket/api/WebSocketPolicy.java#L81-L86
allan_burdajewicz Hi Allan, do you have any updates or a workaround for this issue?
Currently I'm using Jenkins 2.375.1, and for some reason I cannot roll back to Jenkins 2.346.3 LTS as suggested above.
leminhhung0110 There is a PR ready. Currently you can use the workaround
-Djenkins.websocket.pingInterval=15
or even less.
olamy do you mean I should use this command when starting the Jenkins controller: "jenkins restart -Djenkins.websocket.pingInterval=15"?
leminhhung0110 I have no idea what your script called jenkins is doing, but Jenkins needs to be started with the system property.
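For a typical package-based install on a systemd distro, one way to do this might be the following (the unit name and override mechanism vary by installation method):
sudo systemctl edit jenkins
# then add to the override file:
[Service]
Environment="JAVA_OPTS=-Djenkins.websocket.pingInterval=15"
# and restart:
sudo systemctl restart jenkins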
Could you try running the controller with --httpKeepAliveTimeout=120000 (or --httpsKeepAliveTimeout=120000)? I also notice your agents are running an old version of Remoting (4.13.2); that's not likely to be related to this problem, but still worth upgrading to a more recent version.