JENKINS-66172

Unexplained websocket idle timeout disconnects from Windows 10 agents and Jenkins controllers in AWS ECS

    Details

    • Type: Bug
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Component/s: remoting
    • Labels:
    • Environment:
      Jenkins 2.263.3, On-prem Windows 10 agents connecting via websockets to Jenkins controller in AWS ECS.

      Description

      We have ~20 on-prem Windows 10 agents that connect over websockets to Jenkins controllers running on AWS ECS. These agents have to run on-prem because embedded development boards are attached to them for regression test suites, which run for 1-2 hours.

      We can trace some of the disconnects to network blips, which are expected on a connection from on-prem into the AWS cloud.

      But a small set of disconnects occurs only while a job is running on the node. I set up another Windows 10 agent in our dev environment that is connected but runs no jobs. It stays connected for multiple weeks, while the agent running the builds disconnects 1-2 times per week.

      I configured some websocket system logs, and the log shows the connection closed due to "Idle timeout expired". It looks like a 1 second timeout on something, which seems pretty short.

      Jul 06, 2021 9:07:17 AM WARNING jenkins.agents.WebSocketAgents$Session error
      null
      java.util.concurrent.TimeoutException: Idle timeout expired: 2463/1000 ms
      Caused: org.eclipse.jetty.websocket.api.CloseException
          at org.eclipse.jetty.websocket.common.io.AbstractWebSocketConnection.onReadTimeout(AbstractWebSocketConnection.java:564)
          at org.eclipse.jetty.io.AbstractConnection.onFillInterestedFailed(AbstractConnection.java:172)
          at org.eclipse.jetty.websocket.common.io.AbstractWebSocketConnection.onFillInterestedFailed(AbstractWebSocketConnection.java:539)
          at org.eclipse.jetty.io.AbstractConnection$ReadCallback.failed(AbstractConnection.java:317)
          at org.eclipse.jetty.io.FillInterest.onFail(FillInterest.java:140)
          at org.eclipse.jetty.io.AbstractEndPoint.onIdleExpired(AbstractEndPoint.java:407)
          at org.eclipse.jetty.io.IdleTimeout.checkIdleTimeout(IdleTimeout.java:171)
          at org.eclipse.jetty.io.IdleTimeout.idleCheck(IdleTimeout.java:113)
          at org.eclipse.jetty.io.IdleTimeout.activate(IdleTimeout.java:136)
          at org.eclipse.jetty.io.IdleTimeout.setIdleTimeout(IdleTimeout.java:100)
          at org.eclipse.jetty.server.LowResourceMonitor.setLowResources(LowResourceMonitor.java:412)
          at org.eclipse.jetty.server.LowResourceMonitor.monitor(LowResourceMonitor.java:352)
          at org.eclipse.jetty.server.LowResourceMonitor$1.run(LowResourceMonitor.java:84)
          at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
          at java.util.concurrent.FutureTask.run(FutureTask.java:266)
          at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
          at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
          at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
          at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
          at java.lang.Thread.run(Thread.java:748)
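
For anyone triaging similar controller logs, here is a rough Python sketch that extracts the two numbers from the "Idle timeout expired" message and checks whether the trace passes through LowResourceMonitor.setLowResources. Reading 2463/1000 as elapsed idle time over the timeout in effect, and treating the LowResourceMonitor frame as the place where the 1000 ms value was applied, is an interpretation of the trace above, not a confirmed diagnosis. The helper name summarize_idle_timeout is hypothetical.

```python
import re

# Message shape taken from the log excerpt above: "<elapsed>/<timeout> ms".
IDLE_RE = re.compile(r"Idle timeout expired: (\d+)/(\d+) ms")


def summarize_idle_timeout(log_text):
    """Return (elapsed_ms, timeout_ms, via_low_resource_monitor), or None
    if the text contains no idle-timeout message."""
    match = IDLE_RE.search(log_text)
    if match is None:
        return None
    elapsed_ms = int(match.group(1))
    timeout_ms = int(match.group(2))
    # Assumption: this frame in the same trace means the timeout was
    # changed by Jetty's low-resource handling right before it expired.
    via_low_resource_monitor = "LowResourceMonitor.setLowResources" in log_text
    return elapsed_ms, timeout_ms, via_low_resource_monitor


# Abbreviated copy of the trace from this issue.
sample = """\
java.util.concurrent.TimeoutException: Idle timeout expired: 2463/1000 ms
Caused: org.eclipse.jetty.websocket.api.CloseException
    at org.eclipse.jetty.io.IdleTimeout.setIdleTimeout(IdleTimeout.java:100)
    at org.eclipse.jetty.server.LowResourceMonitor.setLowResources(LowResourceMonitor.java:412)
"""

print(summarize_idle_timeout(sample))  # (2463, 1000, True)
```

If the third value is True for most of the job-time disconnects, controller resource pressure (threads, memory) during builds would be worth checking alongside the network path.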

       

      Any ideas? I was going to start running a test job on my dev agent to see whether it remains stable while running a job.

          Activity

          jthompson Jeff Thompson added a comment -

          Disconnects are almost always due to some aspect of local configuration. It is very rare that they are caused by Jenkins, Remoting, or the agents. Unfortunately, they can prove very difficult to diagnose. They can be caused by networking, system, or environmental issues. A less common class of cause is bad interactions between plugins, though it's difficult to predict or determine which ones might be responsible. Disconnects frequently occur under heavier load, such as when running a job.

          I recommend general troubleshooting and investigation including what you're considering. Good luck tracking it down.


            People

            Assignee:
            jthompson Jeff Thompson
            Reporter:
            johnlengeling John Lengeling
            Votes:
            0
            Watchers:
            2
