JENKINS-62576

Websockets connection unstable since remoting 4.2.1 (LTS 2.222.4)

    Details

    • Type: Bug
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Component/s: remoting
    • Labels:
      None

      Description

      Hi,

      Since we upgraded to Jenkins core 2.222.4 (to pick up the fix for JENKINS-61409) and Remoting 4.2.1, we have been seeing many more stability issues on the WebSocket connection. This was not the case with Remoting 4.2, where the only issue we faced was with large payloads.

      We now observe disconnections in the middle of builds.

      For example, the connection breaks after a simple git checkout:

      [Pipeline] { (Git Checkout)
      [Pipeline] dir
      10:08:31  Running in /home/jenkins/agent/workspace/workspace/*****
      [Pipeline] {
      [Pipeline] checkout
      [Pipeline] }
      [Pipeline] // dir
      [Pipeline] }
      [Pipeline] // stage
      [Pipeline] }
      10:08:45  ********* was marked offline: Connection was broken: java.nio.channels.ClosedChannelException
      10:08:45  	at jenkins.agents.WebSocketAgents$Session.closed(WebSocketAgents.java:141)
      10:08:45  	at jenkins.websocket.WebSocketSession.onWebSocketSomething(WebSocketSession.java:91)
      10:08:45  	at com.sun.proxy.$Proxy91.onWebSocketClose(Unknown Source)
      10:08:45  	at 
      

      On the agent (multiple exceptions):

      Jun 05, 2020 8:07:10 AM org.jenkinsci.plugins.workflow.log.GCFlushedOutputStream$FlushRef lambda$static$0
      WARNING: null
      hudson.remoting.ChannelClosedException: Channel "hudson.remoting.Channel@6cca5774:*****": channel is already closed
              at hudson.remoting.Channel.send(Channel.java:760)
              at hudson.remoting.ProxyOutputStream.flush(ProxyOutputStream.java:155)
              at hudson.remoting.RemoteOutputStream.flush(RemoteOutputStream.java:112)
              at java.io.FilterOutputStream.flush(FilterOutputStream.java:140)
              at org.jenkinsci.plugins.workflow.log.DelayBufferedOutputStream$FlushControlledOutputStream.flush(DelayBufferedOutputStream.java:131)
              at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:141)
              at org.jenkinsci.plugins.workflow.log.GCFlushedOutputStream$FlushRef.lambda$static$0(GCFlushedOutputStream.java:77)
              at jenkins.security.ImpersonatingScheduledExecutorService$1.run(ImpersonatingScheduledExecutorService.java:58)
              at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
              at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
              at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
              at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
              at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
              at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
              at java.lang.Thread.run(Thread.java:748)
      Caused by: hudson.remoting.ChannelClosedException: Channel "hudson.remoting.Channel@6cca5774:******": channel is already closed
              at hudson.remoting.Engine$1AgentEndpoint.onClose(Engine.java:590)
              at io.jenkins.remoting.shaded.org.glassfish.tyrus.core.TyrusEndpointWrapper.onClose(TyrusEndpointWrapper.java:1251)
              at io.jenkins.remoting.shaded.org.glassfish.tyrus.core.TyrusWebSocket.onClose(TyrusWebSocket.java:130)
              at io.jenkins.remoting.shaded.org.glassfish.tyrus.core.ProtocolHandler.close(ProtocolHandler.java:469)
              at io.jenkins.remoting.shaded.org.glassfish.tyrus.core.TyrusWebSocket.close(TyrusWebSocket.java:260)
              at io.jenkins.remoting.shaded.org.glassfish.tyrus.client.TyrusClientEngine$2$1.close(TyrusClientEngine.java:635)
              at io.jenkins.remoting.shaded.org.glassfish.tyrus.container.jdk.client.ClientFilter.processError(ClientFilter.java:254)
              at io.jenkins.remoting.shaded.org.glassfish.tyrus.container.jdk.client.Filter.onError(Filter.java:180)
              at io.jenkins.remoting.shaded.org.glassfish.tyrus.container.jdk.client.Filter.onError(Filter.java:183)
              at io.jenkins.remoting.shaded.org.glassfish.tyrus.container.jdk.client.Filter.onError(Filter.java:183)
              at io.jenkins.remoting.shaded.org.glassfish.tyrus.container.jdk.client.Filter.onError(Filter.java:183)
              at io.jenkins.remoting.shaded.org.glassfish.tyrus.container.jdk.client.TransportFilter$4.failed(TransportFilter.java:314)
              at io.jenkins.remoting.shaded.org.glassfish.tyrus.container.jdk.client.TransportFilter$4.failed(TransportFilter.java:283)
              at sun.nio.ch.Invoker.invokeUnchecked(Invoker.java:128)
              at sun.nio.ch.Invoker$2.run(Invoker.java:218)
              at sun.nio.ch.AsynchronousChannelGroupImpl$1.run(AsynchronousChannelGroupImpl.java:112)
              ... 3 more
      
      
      Jun 05, 2020 8:07:28 AM io.jenkins.remoting.shaded.org.glassfish.tyrus.container.jdk.client.ClientFilter processError
      SEVERE: Connection error has occurred
      java.io.IOException: Connection reset by peer
              at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
              at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
              at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
              at sun.nio.ch.IOUtil.read(IOUtil.java:197)
              at sun.nio.ch.UnixAsynchronousSocketChannelImpl.finishRead(UnixAsynchronousSocketChannelImpl.java:388)
              at sun.nio.ch.UnixAsynchronousSocketChannelImpl.finish(UnixAsynchronousSocketChannelImpl.java:191)
              at sun.nio.ch.UnixAsynchronousSocketChannelImpl.onEvent(UnixAsynchronousSocketChannelImpl.java:213)
              at sun.nio.ch.EPollPort$EventHandlerTask.run(EPollPort.java:293)
              at java.lang.Thread.run(Thread.java:748)
      
      WARNING: LinkageError while performing UserRequest:hudson.FilePath$IsDirectory@4d6ab49
      java.lang.NoClassDefFoundError: hudson/util/io/Archiver
              at java.lang.Class.getDeclaredFields0(Native Method)
              at java.lang.Class.privateGetDeclaredFields(Class.java:2583)
              at java.lang.Class.getDeclaredFields(Class.java:1916)
              at java.io.ObjectStreamClass.getDefaultSerialFields(ObjectStreamClass.java:1851)
              at java.io.ObjectStreamClass.getSerialFields(ObjectStreamClass.java:1773)
              at java.io.ObjectStreamClass.access$800(ObjectStreamClass.java:79)
              at java.io.ObjectStreamClass$3.run(ObjectStreamClass.java:508)
              at java.io.ObjectStreamClass$3.run(ObjectStreamClass.java:494)
              at java.security.AccessController.doPrivileged(Native Method)
              at java.io.ObjectStreamClass.<init>(ObjectStreamClass.java:494)
              at java.io.ObjectStreamClass.lookup(ObjectStreamClass.java:391)
              at java.io.ObjectStreamClass.initNonProxy(ObjectStreamClass.java:681)
              at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1942)
              at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1808)
              at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2099)
              at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1625)
              at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2344)
              at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2268)
              at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2126)
              at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1625)
              at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2344)
              at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2268)
              at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2126)
              at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1625)
              at java.io.ObjectInputStream.readObject(ObjectInputStream.java:465)
              at java.io.ObjectInputStream.readObject(ObjectInputStream.java:423)
              at hudson.remoting.UserRequest.deserialize(UserRequest.java:290)
              at hudson.remoting.UserRequest.perform(UserRequest.java:189)
              at hudson.remoting.UserRequest.perform(UserRequest.java:54)
              at hudson.remoting.Request$2.run(Request.java:369)
              at hudson.remoting.InterceptingExecutorService$1.call(InterceptingExecutorService.java:72)
              at java.util.concurrent.FutureTask.run(FutureTask.java:266)
              at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
              at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
              at hudson.remoting.Engine$1.lambda$newThread$0(Engine.java:117)
              at java.lang.Thread.run(Thread.java:748)
      Caused by: java.lang.ClassNotFoundException: hudson.util.io.Archiver
              at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
              at hudson.remoting.RemoteClassLoader.findClass(RemoteClassLoader.java:173)
              at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
              at java.lang.ClassLoader.loadClass(ClassLoader.java:351)
              ... 36 more
      

      Both the master and the agent are running on JDK 8.

      Agent (a VM):

      openjdk version "1.8.0_252"
      OpenJDK Runtime Environment (build 1.8.0_252-b09)
      OpenJDK 64-Bit Server VM (build 25.252-b09, mixed mode)
      

      Master (official Docker container):

      openjdk version "1.8.0_242"
      OpenJDK Runtime Environment (build 1.8.0_242-b08)
      OpenJDK 64-Bit Server VM (build 25.242-b08, mixed mode)
      

      Do you have any idea what could be causing this issue?

      PS: I have not had time to test with the latest Jenkins version and Remoting 4.3, so I don't know whether that would change anything.
      PS2: I am fully aware that WebSocket support is still in beta.

      Thanks!

            Activity

            jglick Jesse Glick added a comment -

            Unfortunately Remoting disconnections are not generally diagnosable from logs alone, and there could be any number of root causes. It would be very helpful if there were a known way to reproduce the problem from scratch in a clean environment.

            jonesbusy Valentin Delaye added a comment -

            OK, thanks. We will try to reproduce the issue on a clean infrastructure.

            jonesbusy Valentin Delaye added a comment -

            We were able to compare WebSocket against TCP through the same network component (Traefik 2 on Kubernetes, which supports both HTTP and TCP routes).

            Without any activity (no job running):

            1) The TCP connection is completely stable, even after many days

            INFO: Using Remoting version: 4.3 
            Aug 07, 2020 5:31:51 PM org.jenkinsci.remoting.engine.WorkDirManager initializeWorkDir 
            INFO: Using /home/jenkins/agent/workspace2/remoting as a remoting work directory 
            Aug 07, 2020 5:31:51 PM hudson.remoting.jnlp.Main$CuiListener status 
            INFO: Locating server among [***************] 
            Aug 07, 2020 5:31:51 PM org.jenkinsci.remoting.engine.JnlpAgentEndpointResolver resolve 
            INFO: Remoting server accepts the following protocols: [JNLP4-connect, Ping] 
            Aug 07, 2020 5:31:51 PM org.jenkinsci.remoting.engine.JnlpAgentEndpointResolver resolve 
            INFO: Remoting TCP connection tunneling is enabled. Skipping the TCP Agent Listener Port availability check 
            Aug 07, 2020 5:31:51 PM hudson.remoting.jnlp.Main$CuiListener status 
            INFO: Agent discovery successful 
             Agent address: ***********
             Agent port:    **** 
             Identity:      **************
            Aug 07, 2020 5:31:51 PM hudson.remoting.jnlp.Main$CuiListener status 
            INFO: Handshaking 
            Aug 07, 2020 5:31:51 PM hudson.remoting.jnlp.Main$CuiListener status 
            INFO: Connecting to ****************
            Aug 07, 2020 5:31:51 PM hudson.remoting.jnlp.Main$CuiListener status 
            INFO: Trying protocol: JNLP4-connect 
            Aug 07, 2020 5:31:51 PM hudson.remoting.jnlp.Main$CuiListener status 
            INFO: Remote identity confirmed: *************
            Aug 07, 2020 5:31:52 PM hudson.remoting.jnlp.Main$CuiListener status 
            INFO: Connected
            

            2) The WebSocket connection fails after a few minutes (ping timeout)

            Aug 07, 2020 5:31:58 PM org.jenkinsci.remoting.engine.WorkDirManager initializeWorkDir
            INFO: Using /home/jenkins/agent/workspace1/remoting as a remoting work directory
            Aug 07, 2020 5:31:58 PM org.jenkinsci.remoting.engine.WorkDirManager setupLogging
            INFO: Both error and output logs will be printed to /home/jenkins/agent/workspace1/remoting
            Aug 07, 2020 5:31:58 PM hudson.remoting.jnlp.Main createEngine
            INFO: Setting up agent: ws-agent-agent-01
            Aug 07, 2020 5:31:58 PM hudson.remoting.jnlp.Main$CuiListener <init>
            INFO: Jenkins agent is running in headless mode.
            Aug 07, 2020 5:31:58 PM hudson.remoting.Engine startEngine
            INFO: Using Remoting version: 4.3
            Aug 07, 2020 5:31:58 PM org.jenkinsci.remoting.engine.WorkDirManager initializeWorkDir
            INFO: Using /home/jenkins/agent/workspace1/remoting as a remoting work directory
            Aug 07, 2020 5:31:59 PM hudson.remoting.jnlp.Main$CuiListener status
            INFO: WebSocket connection open
            Aug 07, 2020 5:31:59 PM hudson.remoting.jnlp.Main$CuiListener status
            INFO: Connected
            Aug 07, 2020 5:51:02 PM hudson.slaves.ChannelPinger$1 onDead
            INFO: Ping failed. Terminating the channel ws-agent-agent-01.
            java.util.concurrent.TimeoutException: Ping started at 1596815222809 hasn't completed by 1596815462809
                    at hudson.remoting.PingThread.ping(PingThread.java:133)
                    at hudson.remoting.PingThread.run(PingThread.java:89)
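
The two numbers in the TimeoutException above are epoch timestamps in milliseconds. As a reading aid, here is a minimal Java sketch that decodes them. The class and method names (`PingWindow`, `window`, `startedAt`) are made up for illustration and are not part of Remoting.

```java
import java.time.Duration;
import java.time.Instant;

// Illustrative helper, not Remoting code: decodes the epoch-millisecond
// values printed in "Ping started at X hasn't completed by Y".
public class PingWindow {

    // How long the ping was allowed to run before it was declared dead.
    static Duration window(long startedMillis, long deadlineMillis) {
        return Duration.ofMillis(deadlineMillis - startedMillis);
    }

    // The moment the ping was sent, as a UTC instant.
    static Instant startedAt(long startedMillis) {
        return Instant.ofEpochMilli(startedMillis);
    }

    public static void main(String[] args) {
        long started = 1596815222809L;   // value copied from the log above
        long deadline = 1596815462809L;  // value copied from the log above
        System.out.println("Ping sent at (UTC): " + startedAt(started));
        System.out.println("Allowed window (s): " + window(started, deadline).getSeconds());
    }
}
```

With the values from this log the window is 240 seconds, so the last ping was sent four minutes before the "Ping failed" entry at 5:51:02 PM. That suggests the channel stopped answering pings some time before it was terminated, although the log alone does not show when traffic actually stopped.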
            

            Sadly, JNLP4-connect doesn't support SNI, which prevents us from using a single port to connect our external agent.

            jglick Jesse Glick added a comment -

            Cannot guess why JENKINS-61409 would have caused issues with Traefik. That fix changed the details of how Remoting commands are encoded in WS—from one command = one WS frame to a more complex chunked framing implementation shared with TCP agents—but nothing essential about how the connection is started, or the outbound WS ping every 30s, etc. If you manage to find the root cause here it would be great.
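
To make the difference between "one command = one WS frame" and chunked framing concrete, here is a rough Java sketch. It is an illustration only: the class name, the method names, and the 2-byte header layout (a "more chunks follow" bit plus a 15-bit length) are assumptions chosen for the example and should not be read as Remoting's actual wire format.

```java
import java.io.ByteArrayOutputStream;
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch, not Remoting code: splits one serialized command into
// several length-prefixed chunks, each of which could travel in its own frame.
public class ChunkedFramingSketch {

    static final int MAX_CHUNK = 0x7FFF; // largest length that fits in 15 bits

    // Split a command into chunks of at most chunkSize payload bytes.
    // Assumed header: high bit set means more chunks follow, low 15 bits are the length.
    static List<byte[]> split(byte[] command, int chunkSize) {
        if (chunkSize < 1 || chunkSize > MAX_CHUNK) {
            throw new IllegalArgumentException("chunkSize must be between 1 and " + MAX_CHUNK);
        }
        List<byte[]> frames = new ArrayList<>();
        int offset = 0;
        do {
            int len = Math.min(chunkSize, command.length - offset);
            boolean more = offset + len < command.length;
            int header = (more ? 0x8000 : 0) | len;
            byte[] frame = new byte[2 + len];
            frame[0] = (byte) (header >> 8);
            frame[1] = (byte) header;
            System.arraycopy(command, offset, frame, 2, len);
            frames.add(frame);
            offset += len;
        } while (offset < command.length);
        return frames;
    }

    // Reassemble the original command, stopping at the first chunk without the "more" bit.
    static byte[] join(List<byte[]> frames) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (byte[] frame : frames) {
            int header = ((frame[0] & 0xFF) << 8) | (frame[1] & 0xFF);
            out.write(frame, 2, header & 0x7FFF);
            if ((header & 0x8000) == 0) {
                break;
            }
        }
        return out.toByteArray();
    }
}
```

Under the earlier scheme described in the comment, a large command would be a single large frame, which fits the "large payload" problem mentioned in the description. With chunking, the same command becomes a series of small frames, so an intermediary such as a proxy sees a different traffic pattern even though connection setup and pings are unchanged.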

            jglick Jesse Glick added a comment -

            Maybe like JENKINS-64598?


              People

              Assignee:
              jthompson Jeff Thompson
              Reporter:
              jonesbusy Valentin Delaye
              Votes:
              0
              Watchers:
              5
