  1. Jenkins
  2. JENKINS-61851

Repeated build failures after agent disconnection

    Details

    • Type: Bug
    • Status: Closed (View Workflow)
    • Priority: Major
    • Resolution: Not A Defect
    • Component/s: remoting
    • Labels:
      None
    • Environment:
      master = Jenkins 2.229, in docker image (linux) proxy passed by nginx, openjdk 1.8.0_242
      agent = remoting 4.3 (with -webSocket connection), windows10, openjdk 1.9.0_202
      Description

      We've been facing a couple of build failures caused by agent disconnections for no apparent reason. The failures all follow the same pattern:

      The build logs say (time of error):

      9:45:50 FATAL: command execution failed
      java.nio.channels.ClosedChannelException
      	at jenkins.agents.WebSocketAgents$Session.closed(WebSocketAgents.java:141)
      	at jenkins.websocket.WebSocketSession.onWebSocketSomething(WebSocketSession.java:91)
      ...
      Caused: java.io.IOException: Backing channel 'our-windows10-agent' is disconnected.
      	at hudson.remoting.RemoteInvocationHandler.channelOrFail(RemoteInvocationHandler.java:216)
      

      On the agent, the logs say:

      Apr 09, 2020 9:45:50 AM hudson.Launcher$RemoteLaunchCallable$1 join
      INFO: Failed to synchronize IO streams on the channel hudson.remoting.Channel@11010e5e:our-windows10-agent
      hudson.remoting.ChannelClosedException: Channel "hudson.remoting.Channel@11010e5e:our-windows10-agent": Remote call on our-windows10-agent failed. The channel is closing down or has closed down
              at hudson.remoting.Channel.call(Channel.java:991)
              at hudson.remoting.Channel.syncIO(Channel.java:1730)
      ...
      Caused by: hudson.remoting.ChannelClosedException: Channel "hudson.remoting.Channel@11010e5e:our-windows10-agent": channel is already closed
              at hudson.remoting.Engine$1AgentEndpoint.onClose(Engine.java:590)
              at io.jenkins.remoting.shaded.org.glassfish.tyrus.core.TyrusEndpointWrapper.onClose(TyrusEndpointWrapper.java:1251)
      
      Apr 09, 2020 9:45:50 AM hudson.remoting.UserRequest perform
      WARNING: LinkageError while performing UserRequest:UserRPCRequest:hudson.Launcher$RemoteProcess.join[](4)
      java.lang.LinkageError: Failed to load hudson.util.ProcessTree
              at hudson.remoting.RemoteClassLoader.loadClassFile(RemoteClassLoader.java:391)
      ...
      Caused by: java.lang.NoClassDefFoundError: hudson/util/ProcessTreeRemoting$IProcessTree
              at java.lang.ClassLoader.defineClass1(Native Method)
      ...
      Caused by: java.lang.ClassNotFoundException: hudson.util.ProcessTreeRemoting$IProcessTree
              at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
      

      On the master, the agent logs say (file slave.log.1, mtime = Apr 9 09:45):

      Inbound agent connected from x.x.x.x
      Remoting version: 4.3
      This is a Windows agent
      onOnline: class org.jenkinsci.modules.slave_installer.impl.ComputerListenerImpl reported an exception: java.net.MalformedURLException: no protocol: jnlpJars/slave.jar
      Agent successfully connected and online
      ERROR: Connection terminated
      java.nio.channels.ClosedChannelException
      	at jenkins.agents.WebSocketAgents$Session.closed(WebSocketAgents.java:141)
      	at jenkins.websocket.WebSocketSession.onWebSocketSomething(WebSocketSession.java:91)
      	at jenkins.websocket.WebSockets$$Lambda$282/00000000940934C0.invoke(Unknown Source)
      	at com.sun.proxy.$Proxy74.onWebSocketClose(Unknown Source)
      	at org.eclipse.jetty.websocket.common.events.JettyListenerEventDriver.onClose(JettyListenerEventDriver.java:119)
      	at org.eclipse.jetty.websocket.common.WebSocketSession.callApplicationOnClose(WebSocketSession.java:389)
      	at org.eclipse.jetty.websocket.common.io.AbstractWebSocketConnection.disconnect(AbstractWebSocketConnection.java:317)
      	at org.eclipse.jetty.websocket.common.io.DisconnectCallback.succeeded(DisconnectCallback.java:42)
      	at org.eclipse.jetty.websocket.common.io.AbstractWebSocketConnection$CallbackBridge.writeSuccess(AbstractWebSocketConnection.java:86)
      	at org.eclipse.jetty.websocket.common.io.FrameFlusher.notifyCallbackSuccess(FrameFlusher.java:359)
      	at org.eclipse.jetty.websocket.common.io.FrameFlusher.succeedEntries(FrameFlusher.java:288)
      	at org.eclipse.jetty.websocket.common.io.FrameFlusher.succeeded(FrameFlusher.java:280)
      	at org.eclipse.jetty.io.WriteFlusher.write(WriteFlusher.java:293)
      	at org.eclipse.jetty.io.AbstractEndPoint.write(AbstractEndPoint.java:381)
      	at org.eclipse.jetty.websocket.common.io.FrameFlusher.flush(FrameFlusher.java:264)
      	at org.eclipse.jetty.websocket.common.io.FrameFlusher.process(FrameFlusher.java:193)
      	at org.eclipse.jetty.util.IteratingCallback.processing(IteratingCallback.java:241)
      	at org.eclipse.jetty.util.IteratingCallback.iterate(IteratingCallback.java:223)
      	at org.eclipse.jetty.websocket.common.io.AbstractWebSocketConnection.outgoingFrame(AbstractWebSocketConnection.java:584)
      	at org.eclipse.jetty.websocket.common.io.AbstractWebSocketConnection.close(AbstractWebSocketConnection.java:181)
      	at org.eclipse.jetty.websocket.common.io.AbstractWebSocketConnection.onFillable(AbstractWebSocketConnection.java:511)
      	at org.eclipse.jetty.websocket.common.io.AbstractWebSocketConnection.onFillable(AbstractWebSocketConnection.java:441)
      	at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:311)
      	at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:103)
      	at org.eclipse.jetty.io.ChannelEndPoint$2.run(ChannelEndPoint.java:117)
      	at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.runTask(EatWhatYouKill.java:336)
      	at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.doProduce(EatWhatYouKill.java:313)
      	at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.tryProduce(EatWhatYouKill.java:171)
      	at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.run(EatWhatYouKill.java:129)
      	at org.eclipse.jetty.util.thread.ReservedThreadExecutor$ReservedThread.run(ReservedThreadExecutor.java:375)
      	at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:806)
      	at org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.run(QueuedThreadPool.java:938)
      	at java.lang.Thread.run(Unknown Source)
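
To compare the three excerpts above at a glance, the exception chain of each log can be reduced to its class names. The following is only a minimal sketch: the function name `exception_chain` and the regular expression are illustrative assumptions, not part of Jenkins or remoting, and the pattern only recognises fully qualified names ending in `Exception` or `Error`.

```python
import re

# Matches a fully qualified exception class at the start of a line,
# optionally preceded by "Caused by:" (JVM format) or "Caused:" (Jenkins build log format).
CAUSE_RE = re.compile(
    r'^\s*(?:Caused(?: by)?:\s*)?((?:[\w$]+\.)+[\w$]*(?:Exception|Error))\b'
)

def exception_chain(log_text):
    """Return the exception class names found in a log excerpt, in order of appearance."""
    chain = []
    for line in log_text.splitlines():
        match = CAUSE_RE.match(line)
        if match:
            chain.append(match.group(1))
    return chain

if __name__ == "__main__":
    build_log = """9:45:50 FATAL: command execution failed
java.nio.channels.ClosedChannelException
\tat jenkins.agents.WebSocketAgents$Session.closed(WebSocketAgents.java:141)
...
Caused: java.io.IOException: Backing channel 'our-windows10-agent' is disconnected.
\tat hudson.remoting.RemoteInvocationHandler.channelOrFail(RemoteInvocationHandler.java:216)
"""
    for name in exception_chain(build_log):
        print(name)
```

Running it over the build log and the agent log separately makes it easier to see that both sides report a closed channel at the same second, while the `LinkageError` on the agent appears only after the channel is already gone.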
      

      I see no error in the nginx (reverse proxy) logs. I only see the GET /wsagent/ in the access logs.
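
One way to check the access logs more systematically is to list only the `/wsagent/` requests logged near the failure time. The sketch below assumes nginx's default "combined" log format and a timezone-aware failure timestamp; the function name `wsagent_requests` and the five-minute window are arbitrary choices for illustration. It is also worth checking when nginx actually writes the entry: access log lines are written when a request completes, so an upgraded WebSocket connection may only show up once it has closed.

```python
import re
from datetime import datetime, timedelta, timezone

# Timestamp, request line and status code as they appear in the "combined" log format.
LINE_RE = re.compile(
    r'\[(?P<ts>[^\]]+)\] "(?P<method>\S+) (?P<path>\S+)[^"]*" (?P<status>\d{3})'
)

def wsagent_requests(lines, failure, window=timedelta(minutes=5)):
    """Return (timestamp, status) for /wsagent requests logged within `window` of `failure`."""
    hits = []
    for line in lines:
        match = LINE_RE.search(line)
        if not match or not match.group("path").startswith("/wsagent"):
            continue
        ts = datetime.strptime(match.group("ts"), "%d/%b/%Y:%H:%M:%S %z")
        if abs(ts - failure) <= window:
            hits.append((ts, match.group("status")))
    return hits

if __name__ == "__main__":
    # Hypothetical sample line; real entries come from the nginx access log.
    sample = [
        '10.0.0.1 - - [09/Apr/2020:09:46:10 +0000] "GET /wsagent/ HTTP/1.1" 101 0 "-" "Java"',
    ]
    failure = datetime(2020, 4, 9, 9, 45, 50, tzinfo=timezone.utc)
    for ts, status in wsagent_requests(sample, failure):
        print(ts.isoformat(), status)
```

The timezone of the failure timestamp has to match what the build log actually uses; the UTC value in the usage example is a placeholder.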

      Weirdly, the last GET I see in the access logs on the /wsagent/ endpoint does not match the time of the error. It's as if the agent did not reconnect. The agent logs on the master say "Connection terminated" (see above), yet the agent is still connected (the mtime of the latest agent log file, slave.log, on the master is Apr 9 09:46).
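
The timing comparison described here can be made repeatable with a small script that parses the timestamps in the agent log headers and lists the modification times of the rotated agent log files on the master. This is a sketch under assumptions: the helper names are made up, the header format is taken from the agent excerpt above (`Apr 09, 2020 9:45:50 AM ...`), and the directory holding `slave.log*` has to be supplied by the caller.

```python
from datetime import datetime
from pathlib import Path

# Timestamp layout of the java.util.logging headers seen in the agent log.
JUL_FORMAT = "%b %d, %Y %I:%M:%S %p"

def parse_jul_timestamp(header):
    """Parse the leading timestamp of a header such as
    'Apr 09, 2020 9:45:50 AM hudson.remoting.UserRequest perform'."""
    stamp = " ".join(header.split()[:5])  # month, day, year, time, AM/PM
    return datetime.strptime(stamp, JUL_FORMAT)

def log_mtimes(directory, pattern="slave.log*"):
    """Return (mtime, file name) pairs for the matching files, oldest first."""
    return sorted(
        (datetime.fromtimestamp(path.stat().st_mtime), path.name)
        for path in Path(directory).glob(pattern)
    )

if __name__ == "__main__":
    failure = parse_jul_timestamp(
        "Apr 09, 2020 9:45:50 AM hudson.remoting.UserRequest perform"
    )
    print("agent reported failure at", failure)
    # "." is a placeholder; point this at the agent's log directory on the master.
    for mtime, name in log_mtimes("."):
        print(name, mtime, "after failure" if mtime > failure else "before failure")
```

Both values are naive local times, so the comparison is only meaningful if the agent and the master log in the same timezone.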

      We used to run this agent with a direct TCP connection (with Jenkins 2.204.5 and remoting 3.36.1) and did not have this issue. We need to switch to a WebSocket connection to avoid having to stream the TCP connection through the nginx reverse proxy.

      Is there anything I can do to better analyze what's going on?

          Activity

          mbarbero Mikaël Barbero created issue -
          mbarbero Mikaël Barbero made changes -
          Resolution Not A Defect [ 7 ]
          Status Open [ 1 ] Closed [ 6 ]

            People

            Assignee:
            jthompson Jeff Thompson
            Reporter:
            mbarbero Mikaël Barbero
            Votes:
            0
            Watchers:
            2