We have ~20 on-prem Windows 10 agents that connect over WebSockets to Jenkins controllers running on AWS ECS. Unfortunately, these agents have to run on-prem because embedded development boards are attached to them for running regression test suites, which run for 1-2 hours.
We can trace some of the disconnects to networking blips, which are expected on a connection from on-prem into the AWS cloud.
But we also have a small set of disconnects that only occur while a job is running on the node. I set up another Windows 10 agent in our dev environment that is just connected, with no jobs running. It stays connected for multiple weeks, while the agent running the builds seems to disconnect 1-2 times per week.
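I configured some WebSocket system logging, and the log shows the connection being closed due to "Idle timeout expired". The message reports 2463/1000 ms, so it looks like a 1-second idle timeout on something, which seems pretty short. The stack trace goes through Jetty's LowResourceMonitor.setLowResources, which may be where the 1000 ms value comes from.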
Jul 06, 2021 9:07:17 AM WARNING jenkins.agents.WebSocketAgents$Session error
null
java.util.concurrent.TimeoutException: Idle timeout expired: 2463/1000 ms
Caused: org.eclipse.jetty.websocket.api.CloseException
	at org.eclipse.jetty.websocket.common.io.AbstractWebSocketConnection.onReadTimeout(AbstractWebSocketConnection.java:564)
	at org.eclipse.jetty.io.AbstractConnection.onFillInterestedFailed(AbstractConnection.java:172)
	at org.eclipse.jetty.websocket.common.io.AbstractWebSocketConnection.onFillInterestedFailed(AbstractWebSocketConnection.java:539)
	at org.eclipse.jetty.io.AbstractConnection$ReadCallback.failed(AbstractConnection.java:317)
	at org.eclipse.jetty.io.FillInterest.onFail(FillInterest.java:140)
	at org.eclipse.jetty.io.AbstractEndPoint.onIdleExpired(AbstractEndPoint.java:407)
	at org.eclipse.jetty.io.IdleTimeout.checkIdleTimeout(IdleTimeout.java:171)
	at org.eclipse.jetty.io.IdleTimeout.idleCheck(IdleTimeout.java:113)
	at org.eclipse.jetty.io.IdleTimeout.activate(IdleTimeout.java:136)
	at org.eclipse.jetty.io.IdleTimeout.setIdleTimeout(IdleTimeout.java:100)
	at org.eclipse.jetty.server.LowResourceMonitor.setLowResources(LowResourceMonitor.java:412)
	at org.eclipse.jetty.server.LowResourceMonitor.monitor(LowResourceMonitor.java:352)
	at org.eclipse.jetty.server.LowResourceMonitor$1.run(LowResourceMonitor.java:84)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
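To line these disconnects up against build times, I could pull every "Idle timeout expired" record out of the controller log. Below is a rough Python sketch of that, not a tested tool. It assumes the log keeps the same timestamp header and message format as the record above, and the function name `find_idle_timeouts` is just something I made up for illustration.

```python
import re
from datetime import datetime

# Header of a log record, e.g. "Jul 06, 2021 9:07:17 AM WARNING ..."
# (assumed to match the format of the record pasted above).
TIMESTAMP = re.compile(r"^(\w{3} \d{2}, \d{4} \d{1,2}:\d{2}:\d{2} [AP]M)")
# Idle-timeout message, e.g. "Idle timeout expired: 2463/1000 ms"
IDLE = re.compile(r"Idle timeout expired: (\d+)/(\d+) ms")


def find_idle_timeouts(lines):
    """Return (timestamp, elapsed_ms, limit_ms) for each idle-timeout message.

    The timestamp is taken from the most recent record header seen, so this
    works whether the message is on the header line or on a following line.
    """
    results = []
    last_seen = None
    for line in lines:
        header = TIMESTAMP.match(line)
        if header:
            last_seen = datetime.strptime(header.group(1), "%b %d, %Y %I:%M:%S %p")
        hit = IDLE.search(line)
        if hit:
            results.append((last_seen, int(hit.group(1)), int(hit.group(2))))
    return results


# Example using the record from this post, split across lines.
sample = [
    "Jul 06, 2021 9:07:17 AM WARNING jenkins.agents.WebSocketAgents$Session error",
    "null",
    "java.util.concurrent.TimeoutException: Idle timeout expired: 2463/1000 ms",
]
events = find_idle_timeouts(sample)
```

Each tuple gives when the disconnect was logged, how long the connection had been idle, and the timeout in effect, which can then be compared with the job start and end times on the affected agent.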
Any ideas? I was going to start running a test job on my dev agent to see whether it remains stable while it is running a job...