Type: Bug
Resolution: Fixed
Priority: Blocker
Affected Version: Jenkins 2.361.1
Released As: 2.395
We first experienced the problem of websocket connections being closed unexpectedly in Jenkins 2.361.1 LTS. The problem was reported in JENKINS-69509, and Jenkins 2.375 was subsequently released to address the issue. We tried Jenkins 2.375 and found the websocket problem was still there: the websocket was closed less than 2 hours after the build started. All the necessary logs are attached.
Reverting to Jenkins 2.346.3 LTS is a workaround that works for us.
How to Reproduce
- Start Jenkins 2.361.x or later with -Djenkins.websocket.pingInterval=120
- Connect a Websocket agent
--> Notice that the websocket agent disconnects/reconnects at every ping
An interval of 120 is a way to consistently see the error, though it should happen with any value > 30. It may also happen with the default of 30, but with a lower likelihood.
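For reference, a minimal way to set this when launching a controller from the WAR (the port and paths here are illustrative):
java -Djenkins.websocket.pingInterval=120 -jar jenkins.war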
Attachments:
- Agent logs.txt (4 kB)
- hudson.remoting-0.log.0 (1.27 MB)
- org.jetty-0.log.0 (7.05 MB)
Is related to: JENKINS-70697 Improve Jetty10Provider initialization with ServletContainerInitializer (Open)
[JENKINS-69955] WebSocketTimeoutException: Connection Idle Timeout
Cannot experiment with it now as our Jenkins server is in production mode. However, before I reverted our Jenkins version, I did try "--httpKeepAliveTimeout=3600000 --sessionEviction=-1 --sessionTimeout=90000". Strangely, with those settings the disconnect happened within a few minutes; without them the disconnect happened within hours.
Without logs from when you were running with those settings I cannot be of much help, I am afraid.
Borrowed a new Jenkins server and an agent with Jenkins 2.375 installed, ran the job with the controller started with --httpKeepAliveTimeout=120000, and the build terminated after 12 min. Collected all logs, attached; the names of the five logs/system info files all contain "httpKeepAliveTimeout=120000".
Hi basil,
I have tried updating to v2.377 and adding --httpKeepAliveTimeout=120000, but the abnormal disconnect still happened within hours. I will try downgrading to v2.346 and check if that helps.
Downgrading to 2.346 is a dead end, as there is very little chance we would revert back to Jetty 9 at this point. The question is how can we get users onto a stable deployment pattern on Jetty 10. A very small number of users are affected by this problem, and I suspect if pings aren't making it through within 2 minutes that those users have other problems with networking and/or CPU saturation impacting networking. But ultimately this timeout is configurable, so there should be a guaranteed workaround for anyone affected: just set the timeout to an extremely high value (e.g. 86,400,000 milliseconds which is 24 hours).
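Concretely, assuming a controller launched from the WAR, that workaround would look something like:
java -jar jenkins.war --httpKeepAliveTimeout=86400000
(or --httpsKeepAliveTimeout=86400000 if Jenkins terminates HTTPS itself).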
We have the same problem: the connection stops at some point and returns a "channel is already closed" message.
We have already tried to solve this on our own, in vain. It would be nice if this got a little more attention.
jenkinsivo It would be nice if you could read JENKINS-69955 (comment).
> I have tried to update to v2.377 and add the --httpKeepAliveTimeout=120000 but the abnormal disconnect happened in hours. I will try to downgrade to v2.346 and check if this can help.
Same as gyu's test, there is no connection issue on v2.346.
And as in my reply to George Yu, I have to reiterate that downgrading to 2.346 is an exercise in futility compared to applying the workaround I described previously.
We set the keepalive in the config to 30 seconds yesterday, but the error still occurs, so 86400000 is the next step. To be continued.
I'm not going to downgrade immediately; that's a version from May this year, so we would be going very far back in time.
The keepalive is now set to 86400000, but it looks like the issue still occurs, unfortunately. Any other suggestions? We would prefer not to downgrade, basil.
I also tried a big keepalive number and the disconnects still occurred, as stated in my comment on 2022-10-26. Note that when the disconnects occurred there were no network performance issues and no network delay.
Indeed, on our side there are no network issues either; this occurs in the software.
basil, would you be able to take a look at this?
I changed the keepalive value to 86400000 and confirmed in a debugger that the new value was being set and enforced on my local machine, so I think some other problem must be going on if setting the keepalive to 86400000 isn't working for you.
jenkinsivo Please stop pinging me if you are unwilling to provide logs or do any analysis.
gyu Sorry, but I am out of ideas about how to help you here, as everything seems to be working as expected for me locally. If you can get the issue to reproduce, I would suggest that you attach a Java debugger to the controller and debug it yourself, or else provide instructions in this ticket for how to reproduce the problem from scratch. With that said, I am now unsubscribing from notifications to this thread.
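For anyone attempting the debugger suggestion, a sketch under standard JVM assumptions: start the controller with the stock JDWP agent and attach an IDE to the chosen port (5005 here is arbitrary):
java -agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=*:5005 -jar jenkins.war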
We are also experiencing this issue ever since the Jetty 10 update in 2.361.1. I haven't had time to really debug the issue until now, and as a result we've been pinned to 2.346.3. However, I have now set up a build cluster that mirrors our production environment, and I can easily reproduce the error there as well. Some observations:
- Setting `httpKeepAliveTimeout` didn't resolve the issue.
- We do have some builds that take multiple hours to run, but we also have many jobs that finish in just a few minutes. We observe disconnections on all types of nodes, regardless of the job duration.
- We have a variety of Linux, Mac, and Windows nodes, and we observe disconnections on all platforms.
- We observe many jobs that fail with this error:
ERROR: Cannot resume build because FlowNode 32 for FlowHead 1 could not be loaded. This is expected to happen when using the PERFORMANCE_OPTIMIZED durability setting and Jenkins is not shut down cleanly. Consider investigating to understand if Jenkins was not shut down cleanly or switching to the MAX_SURVIVABILITY durability setting which should prevent this issue in most cases.
However, I have not yet tried to change the pipeline durability setting. I'll try that and report back (a sketch of the setting follows this list).
- Node disconnections do seem to be correlated with some type of build activity, though it is hard to determine exactly what. Below, I've pasted a graph of 24 hours of activity from my test environment. Note that the blue line (number of executors) shows disconnections when there are active builds running. When the cluster was idle, all nodes remained connected.
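For reference, a hedged sketch of the durability change mentioned above: if I read the Pipeline docs correctly, the hint can be set per job at the top of a scripted Pipeline (it can also be set globally in the Jenkins configuration):
properties([durabilityHint('MAX_SURVIVABILITY')])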
As Basil is no longer watching this thread, I will avoid reaching out to him directly until I have more diagnostic information that I can provide.
Sorry, upon further testing, it seems that nodes disconnect even when idle.
I also observed idle nodes getting disconnected periodically. The other interesting thing is that if the node is running a program with a very long sleep (e.g. 60 minutes) in it, the node sometimes gets disconnected.
Hi nre_ableton, I have the exact same error and line numbers as you. I also have disconnects both while running jobs and while idle, so it is a relief to know I'm not alone in looking for answers!
Can you tell me how you graphed your executors and the queue in your post?
jimsears7 we use Prometheus to scrape various metrics from sources for hosts on our network. There is a Prometheus Jenkins plugin that provides metrics about queue length, executors, etc., which we install on our Jenkins controllers. Finally, we use Grafana to graph it all.
It's a lot of stuff to set up just to generate a graph or two, but since we already had all of this in our infrastructure, it was relatively easy for me.
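For reference, a minimal Prometheus scrape config for this kind of setup might look like the following; the host, port, and job name are placeholders, and the metrics path assumes the plugin's default:
scrape_configs:
  - job_name: 'jenkins'
    metrics_path: '/prometheus/'
    static_configs:
      - targets: ['jenkins.example.com:8080']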
FWIW, the problem is still present in 2.375.2. I have more graphing data but it's similar to the above pictures, so I won't paste it here.
Hi, we are facing exactly the same issue (same log output on jobs, etc.) on Windows 10 agent nodes with a Jenkins 2.375.1 controller.
Hi,
I don't want to add noise to the thread, but I just want to say that I've updated from 2.346.3 to 2.375.2 and I don't have any issues with my websocket agent.
All our agents use TCP except one that we host externally in Azure, which uses websockets; it runs a single monitoring job every 10 minutes that takes 20 seconds to complete.
Jenkins and the agent run in docker (Linux host), agent uses the image jenkins/inbound-agent:3077.vd69cf116da_6f-4.
We have an nginx reverse proxy in front of Jenkins (a docker container on the same network) and also the corporate reverse proxy, so it's not a direct websocket connection either.
Looking at the agent container logs, it has only reconnected when I installed plugins and did a soft restart of Jenkins, which was now over 3 days ago.
Hopefully you can use that info to help narrow it down and find the problem, which seems to be specific to certain scenarios (I see some people run it on Windows or have very long jobs).
I just noticed that we are specifying `webSocket: true` in our Swarm Client config, I wonder if this has something to do with it... 🤔
I'll run some more tests over the weekend to see.
I can now confirm that the `webSocket: true` option in the Swarm Client plugin seems to have been the culprit! We just ran a test cluster for 4 days with no node disconnections. 🎉
Websocket agents seem to be intermittently disconnecting. This problem is reproducible in current weekly 2.391, even just locally:
- Spin up a new Jenkins controller
- Create an inbound Websocket agent
- Start the websocket agent
Wait until you see the agent disconnecting:
Feb. 21, 2023 3:39:31 PM hudson.remoting.jnlp.Main$CuiListener status
INFO: Connected
Feb. 21, 2023 3:46:16 PM hudson.remoting.jnlp.Main$CuiListener status
INFO: Read side closed
Feb. 21, 2023 3:46:16 PM hudson.remoting.jnlp.Main$CuiListener status
INFO: Terminated
Feb. 21, 2023 3:46:26 PM hudson.remoting.jnlp.Main$CuiListener status
INFO: Performing onReconnect operation.
The controller shows the timeout exception:
Feb. 21, 2023 3:46:16 PM jenkins.agents.WebSocketAgents$Session error
WARNING: null
org.eclipse.jetty.websocket.api.exceptions.WebSocketTimeoutException: Connection Idle Timeout
	at org.eclipse.jetty.websocket.common.JettyWebSocketFrameHandler.convertCause(JettyWebSocketFrameHandler.java:524)
	at org.eclipse.jetty.websocket.common.JettyWebSocketFrameHandler.onError(JettyWebSocketFrameHandler.java:258)
	at org.eclipse.jetty.websocket.core.internal.WebSocketCoreSession.lambda$closeConnection$2(WebSocketCoreSession.java:284)
	at org.eclipse.jetty.server.handler.ContextHandler.handle(ContextHandler.java:1468)
	at org.eclipse.jetty.server.handler.ContextHandler.handle(ContextHandler.java:1487)
	at org.eclipse.jetty.websocket.core.server.internal.AbstractHandshaker$1.handle(AbstractHandshaker.java:212)
	at org.eclipse.jetty.websocket.core.internal.WebSocketCoreSession.closeConnection(WebSocketCoreSession.java:284)
	at org.eclipse.jetty.websocket.core.internal.WebSocketCoreSession.lambda$sendFrame$7(WebSocketCoreSession.java:519)
	at org.eclipse.jetty.util.Callback$3.succeeded(Callback.java:155)
	at org.eclipse.jetty.websocket.core.internal.TransformingFlusher.notifyCallbackSuccess(TransformingFlusher.java:197)
	at org.eclipse.jetty.websocket.core.internal.TransformingFlusher$Flusher.process(TransformingFlusher.java:154)
	at org.eclipse.jetty.util.IteratingCallback.processing(IteratingCallback.java:232)
	at org.eclipse.jetty.util.IteratingCallback.iterate(IteratingCallback.java:214)
	at org.eclipse.jetty.websocket.core.internal.TransformingFlusher.sendFrame(TransformingFlusher.java:77)
	at org.eclipse.jetty.websocket.core.internal.WebSocketCoreSession.sendFrame(WebSocketCoreSession.java:522)
	at org.eclipse.jetty.websocket.core.internal.WebSocketCoreSession.close(WebSocketCoreSession.java:239)
	at org.eclipse.jetty.websocket.core.internal.WebSocketCoreSession.processHandlerError(WebSocketCoreSession.java:371)
	at org.eclipse.jetty.websocket.core.internal.WebSocketConnection.onIdleExpired(WebSocketConnection.java:233)
	at org.eclipse.jetty.io.AbstractEndPoint.onIdleExpired(AbstractEndPoint.java:407)
	at org.eclipse.jetty.io.IdleTimeout.checkIdleTimeout(IdleTimeout.java:170)
	at org.eclipse.jetty.io.IdleTimeout.idleCheck(IdleTimeout.java:112)
	at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
	at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
	at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
	at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: org.eclipse.jetty.websocket.core.exception.WebSocketTimeoutException: Connection Idle Timeout
	... 10 more
I am not 100% sure this is Remoting. It looks like people have been hitting this since the move to Jetty 10 (jenkins.websocket.Jetty10Provider). I collected a debug Jetty log from the controller; hopefully that can help:
- Agent name: JENKINS-69955
- Disconnection detected at 3:46:16 PM
- hudson.remoting-0.log.0
- jenkins.agents.WebSocketAgents-0.log.0
- org.jetty-0.log.0
I can only confirm that the default websocket connection idle timeout is 30s, and per the Jetty logs we exceed it:
Feb. 21, 2023 3:46:16 PM org.eclipse.jetty.io.IdleTimeout checkIdleTimeout
FINE: SocketChannelEndPoint@45b2acfd[{l=/127.0.0.1:8081,r=/127.0.0.1:63856,OPEN,fill=FI,flush=W,to=30003/30000}{io=1/1,kio=1,kro=1}]->[WebSocketConnection@47fa53ab[SERVER,p=Parser@d1a2f85[s=START,c=0,o=0x0,m=-,l=-1],f=Flusher@7e9adc28[PROCESSING][queueSize=0,aggregate=null],g=org.eclipse.jetty.websocket.core.internal.Generator@6ef93c39]] idle timeout check, elapsed: 30003 ms, remaining: -3 ms
Hi nre_ableton,
Could you share some tips on where to add the "webSocket: true" option?
sbc8112 it's an argument to the Swarm Client, in this case in a YAML configuration file. See https://github.com/jenkinsci/swarm-plugin#available-options. If you aren't using the Swarm Client, then you should check whatever protocol your agents use to connect. Also note that the solution (for me, anyway) was not to specify this option: we were using web sockets before, and now we are not.
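For anyone checking their own Swarm Client YAML, the relevant entry looks roughly like this; in our case the fix was removing it (or setting it to false) so the agent falls back to an inbound TCP connection:
webSocket: false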
idle timeout check, elapsed: 30003 ms, remaining: -3 ms
This is very likely the reason: Jetty has a default idle timeout of 30s.
The websocket layer sends a ping every 30s by default (see https://github.com/jenkinsci/jenkins/blob/a3f31145e621ab0072bb872ecac93a2c6cbcbaae/core/src/main/java/jenkins/websocket/WebSocketSession.java#L58).
So this ping can arrive in time or miss by a matter of a few milliseconds (in these logs it's 3 ms), depending on the network and whether you are lucky or not.
A possible workaround is to start the Jenkins controller with
-Djenkins.websocket.pingInterval=15
so that the ping interval is shorter than the Jetty idle timeout.
The proper fix is to change the configuration of the Jetty websocket container to have a larger default idle timeout.
This can be done around here https://github.com/jenkinsci/jenkins/blob/a3f31145e621ab0072bb872ecac93a2c6cbcbaae/websocket/jetty10/src/main/java/jenkins/websocket/Jetty10Provider.java#L55
with something such as
JettyWebSocketServerContainer.getContainer(req.getServletContext()).setIdleTimeout(/* some duration */);
Per the Javadoc at https://github.com/eclipse/jetty.project/blob/b7075161d015ddce23fbf3db873d5f6b539f6a6b/jetty-io/src/main/java/org/eclipse/jetty/io/IdleTimeout.java#L29, "a check is then made to see when the last operation took place".
So if nothing happens for 30s on the established websocket connection, it is closed as idle.
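To make the suggestion above concrete, here is a minimal sketch, assuming it runs during Jetty10Provider's handshake handling with access to the incoming HttpServletRequest; the class name and the 5-minute value are illustrative, not the actual patch:

import java.time.Duration;
import javax.servlet.http.HttpServletRequest;
import org.eclipse.jetty.websocket.server.JettyWebSocketServerContainer;

class IdleTimeoutSketch {
    // Raise the websocket container's idle timeout so it comfortably
    // exceeds the 30s ping interval used by jenkins.websocket.WebSocketSession.
    static void raiseIdleTimeout(HttpServletRequest req) {
        JettyWebSocketServerContainer container =
                JettyWebSocketServerContainer.getContainer(req.getServletContext());
        container.setIdleTimeout(Duration.ofMinutes(5)); // illustrative value
    }
}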
I can definitely reproduce with 2.361.1 by adjusting the websocket ping interval. And I can't reproduce with 2.346.4.
Updated the description with a reproduction scenario.
IIUC the previous websocket idle timeout was 5 minutes, set by WebSocketPolicy at https://github.com/eclipse/jetty.project/blob/jetty-9.4.48.v20220622/jetty-websocket/websocket-api/src/main/java/org/eclipse/jetty/websocket/api/WebSocketPolicy.java#L81-L86
allan_burdajewicz Hi Allan, do you have any updates or a workaround for this issue?
Currently I'm using Jenkins 2.375.1, and for some reason I cannot roll back to Jenkins 2.346.3 LTS as suggested above.
leminhhung0110 There is a PR ready. Currently you can use the workaround
-Djenkins.websocket.pingInterval=15
or even less.
olamy do you mean I should use this command when starting the Jenkins controller: "jenkins restart -Djenkins.websocket.pingInterval=15"?
leminhhung0110 I have no idea what your script called jenkins is doing, but Jenkins needs to be started with the system property.
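For a typical package-based install on a systemd distro, one way to do this might be the following (the unit name and override mechanism vary by installation method):
sudo systemctl edit jenkins
# then add to the override file:
[Service]
Environment="JAVA_OPTS=-Djenkins.websocket.pingInterval=15"
# and restart:
sudo systemctl restart jenkins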
Could you try running the controller with --httpKeepAliveTimeout=120000 (or --httpsKeepAliveTimeout=120000)? I also notice your agents are running an old version of Remoting (4.13.2); that's not likely to be related to this problem, but still worth upgrading to a more recent version.