Unexplained websocket idle timeout disconnects from Windows 10 agents and Jenkins controllers in AWS ECS


    • Type: Bug
    • Resolution: Unresolved
    • Priority: Minor
    • Component/s: remoting
    • Environment:
      Jenkins 2.263.3, On-prem Windows 10 agents connecting via websockets to Jenkins controller in AWS ECS.

      We have ~20 on-prem Windows 10 agents that connect via websockets to Jenkins controllers running on AWS ECS. These agents have to run on-prem because embedded development boards are attached to them for regression test suites, which run for 1-2 hours.

      We can trace some of the disconnects to network blips, which are expected on a connection from on-prem into the AWS cloud.

      But we also see a small set of disconnects that seem to occur only while a job is running on the node. I set up another Windows 10 agent in our dev environment that is connected but runs no jobs. It stays connected for multiple weeks, while the agent running the builds seems to disconnect 1-2 times per week.

      I configured some websocket system logs, and they show the connection being closed due to "Idle timeout expired". It looks like a 1 second timeout on something, which seems pretty short.

      Jul 06, 2021 9:07:17 AM WARNING jenkins.agents.WebSocketAgents$Session error null
      java.util.concurrent.TimeoutException: Idle timeout expired: 2463/1000 ms
      Caused: org.eclipse.jetty.websocket.api.CloseException
      	at org.eclipse.jetty.websocket.common.io.AbstractWebSocketConnection.onReadTimeout(AbstractWebSocketConnection.java:564)
      	at org.eclipse.jetty.io.AbstractConnection.onFillInterestedFailed(AbstractConnection.java:172)
      	at org.eclipse.jetty.websocket.common.io.AbstractWebSocketConnection.onFillInterestedFailed(AbstractWebSocketConnection.java:539)
      	at org.eclipse.jetty.io.AbstractConnection$ReadCallback.failed(AbstractConnection.java:317)
      	at org.eclipse.jetty.io.FillInterest.onFail(FillInterest.java:140)
      	at org.eclipse.jetty.io.AbstractEndPoint.onIdleExpired(AbstractEndPoint.java:407)
      	at org.eclipse.jetty.io.IdleTimeout.checkIdleTimeout(IdleTimeout.java:171)
      	at org.eclipse.jetty.io.IdleTimeout.idleCheck(IdleTimeout.java:113)
      	at org.eclipse.jetty.io.IdleTimeout.activate(IdleTimeout.java:136)
      	at org.eclipse.jetty.io.IdleTimeout.setIdleTimeout(IdleTimeout.java:100)
      	at org.eclipse.jetty.server.LowResourceMonitor.setLowResources(LowResourceMonitor.java:412)
      	at org.eclipse.jetty.server.LowResourceMonitor.monitor(LowResourceMonitor.java:352)
      	at org.eclipse.jetty.server.LowResourceMonitor$1.run(LowResourceMonitor.java:84)
      	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
      	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
      	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
      	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
      	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
      	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
      	at java.lang.Thread.run(Thread.java:748)
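
One possible reading of the trace above: in the message "Idle timeout expired: 2463/1000 ms", the first number appears to be the elapsed idle time and the second the idle timeout in effect, and the LowResourceMonitor.setLowResources frame suggests the 1000 ms value was applied while Jetty's low-resource monitor was changing the endpoint's idle timeout. This is an interpretation of the log, not a confirmed diagnosis. A minimal Python sketch for pulling those values out of a controller log follows; the function name summarize_idle_timeout is hypothetical and the sample text is abbreviated from the trace above.

```python
import re

# Pattern for the idle-timeout message seen in the log,
# e.g. "Idle timeout expired: 2463/1000 ms".
IDLE_RE = re.compile(r"Idle timeout expired: (\d+)/(\d+) ms")

# Frame name copied from the stack trace in this report.
MONITOR_FRAME = "org.eclipse.jetty.server.LowResourceMonitor.setLowResources"


def summarize_idle_timeout(log_text):
    """Return (idle_ms, timeout_ms, via_low_resource_monitor), or None if no match.

    Hypothetical helper: it only inspects the text, it does not talk to Jenkins.
    """
    match = IDLE_RE.search(log_text)
    if match is None:
        return None
    idle_ms = int(match.group(1))
    timeout_ms = int(match.group(2))
    # True when the trace contains the low-resource monitor frame, which hints
    # that the short timeout came from that monitor (an assumption, see above).
    via_monitor = MONITOR_FRAME in log_text
    return idle_ms, timeout_ms, via_monitor


# Abbreviated copy of the trace from this report.
sample = """java.util.concurrent.TimeoutException: Idle timeout expired: 2463/1000 ms
	at org.eclipse.jetty.io.IdleTimeout.setIdleTimeout(IdleTimeout.java:100)
	at org.eclipse.jetty.server.LowResourceMonitor.setLowResources(LowResourceMonitor.java:412)
"""

print(summarize_idle_timeout(sample))
```

If the third value is true for the job-related disconnects but not for the network-blip ones, that would point toward resource pressure on the controller rather than the agent or the network.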
      

       

      Any ideas? I was going to start running a test job on my dev agent to see whether it remains stable while it is running a job.

    • Assignee: Unassigned
    • Reporter: John Lengeling
    • Archiver: Jenkins Service Account
