  JENKINS-63804

master to agent connection keeps breaking every 3-4 hours

    Details

    • Type: Bug
    • Status: Open
    • Priority: Critical
    • Resolution: Unresolved
    • Component/s: remoting
    • Labels:
      None
    • Environment:
      Jenkins: 2.249.1
      Master Node: Linux RHEL 8.1
      Master Java Version: 1.8.0_242
      Slave System: macOS Catalina, Version 10.15.6
      Slave Java Version: 1.8.0_261

      Description

       

      Hi Team, we are using JNLP to connect a Mac agent to a Linux master node.

      The Jenkins agent disconnects frequently, and we see the logs below on the master.

      Can you please suggest how to resolve this, and what steps we can take to triage it further?

      Some of the questions we are trying to answer are:

      • What is EOFException?
      • Why does the agent try to connect to the master when it is already connected?
      • Why does the ping / connection eventually fail?

       

      We see this pattern in the logs very frequently. Any help would be appreciated.

      The results are the same with each of the following options:

      • Connecting with Launch agent from the browser
      • Connecting via Automator on the Mac, which runs a shell/zsh script that starts agent.jar
      • Connecting via a plist on the Mac

       

      Connection #xxx failed: java.io.EOFException
      Sep 29, 2020 2:45:21 AM INFO hudson.TcpSlaveAgentListener$ConnectionHandler run
      Accepted JNLP4-connect connection #xxx from x.x.x.x/x.x.x.x:57215
      Sep 29, 2020 2:45:21 AM INFO org.jenkinsci.remoting.protocol.impl.ConnectionHeadersFilterLayer onRecv
      [JNLP4-connect connection from x.x.x.x/x.x.x.x:57215] Refusing headers from remote: <agent_name> is already connected to this master. Rejecting this connection.
      Sep 29, 2020 2:45:31 AM INFO hudson.TcpSlaveAgentListener$ConnectionHandler run
      Connection #xxx failed: java.io.EOFException
      Sep 29, 2020 2:45:31 AM INFO hudson.TcpSlaveAgentListener$ConnectionHandler run
      Accepted JNLP4-connect connection #xxx from x.x.x.x/x.x.x.x:57218
      Sep 29, 2020 2:45:32 AM INFO org.jenkinsci.remoting.protocol.impl.ConnectionHeadersFilterLayer onRecv
      [JNLP4-connect connection from x.x.x.x/x.x.x.x] Refusing headers from remote: <agent_name> is already connected to this master. Rejecting this connection.
      Sep 29, 2020 2:45:32 AM INFO hudson.slaves.ChannelPinger$1 onDead
      Ping failed. Terminating the channel JNLP4-connect connection from x.x.x.x/x.x.x.x:57015.
      java.util.concurrent.TimeoutException: Ping started at 1601318492966 hasn't completed by 1601318732966
              at hudson.remoting.PingThread.ping(PingThread.java:134)
              at hudson.remoting.PingThread.run(PingThread.java:90)
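The two numbers in the TimeoutException message look like epoch timestamps in milliseconds, so the gap between them shows how long the ping was outstanding before the channel was terminated. A small sketch of that arithmetic, assuming that interpretation of the values (the variable names are only illustrative):

```python
from datetime import datetime, timezone

# Values copied from the TimeoutException message in the log above.
# Assumption: both are Unix epoch timestamps in milliseconds.
started_ms = 1601318492966
deadline_ms = 1601318732966

# How long the ping was allowed to run before the channel was terminated.
timeout_seconds = (deadline_ms - started_ms) / 1000
print(timeout_seconds)  # 240.0

# When the ping that never completed was started, in UTC.
started_utc = datetime.fromtimestamp(started_ms / 1000, tz=timezone.utc)
print(started_utc.isoformat())
```

So the ping that failed at 2:45:32 AM was started four minutes earlier, which means the channel had most likely stopped responding before the first rejected reconnection attempts shown in this excerpt.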
      

       

       

        Attachments

          Activity

          ashisharma888 Ashish Sharma added a comment -

          Jeff Thompson, do you have any suggestions or recommendations to help fix this issue?

          jthompson Jeff Thompson added a comment -

          Problems like this almost always originate in system or networking problems. They can result from all sorts of different issues, so it's extremely difficult to make any suggestions without deep troubleshooting of the system and the network. Sometimes it's a matter of resource overload. In some cases a plugin or combination of plugins causes an error or condition that ends up disrupting things. I would guess that's not the case for you, but it could be.

          What is probably going on here is that an agent connects properly. Something then disrupts that connection, so the agent terminates and tries to reconnect. The controller hasn't yet been notified that the other end of the socket has closed, so it refuses the new connection because an open connection is already registered for that agent. After some period of time the TCP layer times out the old connection and the controller handles the close. After that, the agent can reestablish the connection.
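The sequence described above can be sketched as a toy model. This is not Jenkins code: `ToyController`, `detect_after`, the agent name, and all the times are hypothetical, chosen only to show why reconnection attempts are rejected until the controller finally notices that the old channel is dead.

```python
from dataclasses import dataclass, field


@dataclass
class ToyController:
    """Toy model of a controller's agent registry (hypothetical, not Jenkins code)."""
    detect_after: int  # seconds until the controller notices a dead channel
    channels: dict = field(default_factory=dict)  # agent -> time the link died, or None

    def connect(self, agent, now):
        self._reap(now)
        if agent in self.channels:
            # The old channel still looks open from the controller's side.
            return "rejected"
        self.channels[agent] = None
        return "accepted"

    def link_broken(self, agent, now):
        # Silent failure: the controller is not notified, the model only
        # remembers when it happened so that detection can be simulated.
        self.channels[agent] = now

    def _reap(self, now):
        # Drop channels whose failure has been outstanding long enough to be detected.
        for agent, died_at in list(self.channels.items()):
            if died_at is not None and now - died_at >= self.detect_after:
                del self.channels[agent]


controller = ToyController(detect_after=240)
first = controller.connect("mac-agent", now=0)   # step 1: agent connects
controller.link_broken("mac-agent", now=100)     # step 2: something breaks the link
retries = [controller.connect("mac-agent", now=t) for t in (110, 200, 339, 340)]
print(first, retries)
```

In this model the retries at 110, 200, and 339 seconds are rejected and the one at 340 seconds is accepted, which mirrors the "already connected" rejections followed by an eventual successful reconnect.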

          Your best approach is to try to track down what happens in that second step, when the connection is broken, probably by something in the environment.
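One way to start on that is to pull the timestamps of the relevant events out of the controller log, so they can be lined up against system and network logs from the same moments. A minimal sketch, assuming the log uses the timestamp format and message texts shown in the description; the marker labels and the `classify` helper are made up for illustration:

```python
import re
from collections import Counter
from datetime import datetime

# Timestamp format as it appears in the log excerpt, e.g. "Sep 29, 2020 2:45:21 AM".
TIMESTAMP = r"[A-Z][a-z]{2} \d{1,2}, \d{4} \d{1,2}:\d{2}:\d{2} [AP]M"

# Message fragments from the excerpt, mapped to illustrative labels.
MARKERS = {
    "java.io.EOFException": "eof",
    "is already connected": "duplicate_connect",
    "Ping failed": "ping_failed",
}

TOKEN = re.compile(
    "(?P<ts>" + TIMESTAMP + ")|(?P<marker>" + "|".join(map(re.escape, MARKERS)) + ")"
)


def classify(log_text):
    """Return (timestamp, label) pairs, using the most recent timestamp seen."""
    events = []
    last_ts = None
    for match in TOKEN.finditer(log_text):
        if match.group("ts"):
            # %b and %p are locale-dependent; this assumes an English/C locale.
            last_ts = datetime.strptime(match.group("ts"), "%b %d, %Y %I:%M:%S %p")
        else:
            events.append((last_ts, MARKERS[match.group("marker")]))
    return events


sample = """Sep 29, 2020 2:45:31 AM INFO hudson.TcpSlaveAgentListener$ConnectionHandler run
Connection #xxx failed: java.io.EOFException
Sep 29, 2020 2:45:32 AM INFO org.jenkinsci.remoting.protocol.impl.ConnectionHeadersFilterLayer onRecv
Refusing headers from remote: <agent_name> is already connected to this master.
Sep 29, 2020 2:45:32 AM INFO hudson.slaves.ChannelPinger$1 onDead
Ping failed. Terminating the channel JNLP4-connect connection from x.x.x.x/x.x.x.x:57015.
"""

events = classify(sample)
counts = Counter(label for _, label in events)
for ts, label in events:
    print(ts, label)
```

Run over several days of logs, the `ping_failed` timestamps would show whether the disconnects really occur at a regular 3-4 hour interval, and the first `duplicate_connect` before each one gives a rough upper bound on when the link actually broke.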


            People

            Assignee:
            jthompson Jeff Thompson
            Reporter:
            ashisharma888 Ashish Sharma
            Votes:
            0
            Watchers:
            2

              Dates

              Created:
              Updated: