• Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Component: core

      Our team provides Jenkins as a managed service to internal teams. We have previously been running the following setup:

      • Single VM per controller, with the controller in a container
      • HTTPS reverse proxy
      • JNLP agents connecting directly to the controller VM via the "Tunnel connection through" option, since the HTTPS reverse proxy didn't serve JNLP connections

      We recently switched to running all of our Jenkins controllers on Kubernetes, and they share a common Azure load balancer. This load balancer listens for HTTP connections and performs host-based routing, and also listens on a unique JNLP port for each controller instance.

      We've had CI teams report node disconnections. We looked into the issue and noticed that it happened on nodes running long-running processes where nothing is printed to stdout for long periods of time. We were able to solve it for many users by increasing the TCP idle connection timeout on the load balancer from 4 to 30 minutes, but we still have builds that run longer than this without any output. The evidence now points to faulty (or missing) TCP keepalive functionality in the agent.

      I would expect the agent to send TCP keepalive packets to the controller every n seconds (configurable) to assure the load balancer that the connection is still active.
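For reference, OS-level TCP keepalive can be requested on any Java socket via the standard `java.net` API; the class and method names below are the real JDK API, but the demo program itself (`KeepAliveDemo`) is a hypothetical stand-alone sketch, not Jenkins remoting code. Note that the probe interval for socket-level keepalive is governed by the operating system (e.g. `net.ipv4.tcp_keepalive_time`, which defaults to 7200 seconds on Linux), which is typically far longer than a load balancer's idle timeout:

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;

public class KeepAliveDemo {
    /** Opens a loopback connection, enables TCP keepalive, returns the flag. */
    static boolean enableKeepAlive() throws IOException {
        // Local loopback server so the example is self-contained.
        try (ServerSocket server = new ServerSocket(0, 1, InetAddress.getLoopbackAddress());
             Socket client = new Socket(server.getInetAddress(), server.getLocalPort())) {
            // Ask the OS to send TCP keepalive probes on this connection.
            // How often probes are sent is an OS-level setting (e.g. sysctl
            // net.ipv4.tcp_keepalive_time on Linux), not per-socket here.
            client.setKeepAlive(true);
            return client.getKeepAlive();
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println("keepalive=" + enableKeepAlive());
    }
}
```

Because the OS default probe interval usually exceeds a load balancer's idle timeout, application-level pings over the channel are the practical way to keep such connections alive.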

      The errors we see in the controller console log look like this:

      03:19:33  Cannot contact NODE_NAME: hudson.remoting.ChannelClosedException: Channel "hudson.remoting.Channel@3095990c:JNLP4-connect connection from NODE_IP:54165": Remote call on JNLP4-connect connection from NODE_IP:54165 failed. The channel is closing down or has closed down 

          [JENKINS-71259] Agents don't appear to send TCP keepalive

          Markus Winter added a comment -

          There is the ChannelPinger that by default sends something over the channel every 5 minutes to ensure that the connection stays open. Maybe enable the logger for this class to see if anything unusual is happening there.

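To follow that suggestion, the log level for `hudson.slaves.ChannelPinger` can be raised to FINE. In Jenkins this is usually done through Manage Jenkins → System Log, but the underlying mechanism is plain `java.util.logging`; a minimal stand-alone sketch of that mechanism (the `PingerLogDemo` class is hypothetical):

```java
import java.util.logging.ConsoleHandler;
import java.util.logging.Level;
import java.util.logging.Logger;

public class PingerLogDemo {
    /** Raises the ChannelPinger logger to FINE so ping activity is visible. */
    static Logger configurePingerLogger() {
        Logger logger = Logger.getLogger("hudson.slaves.ChannelPinger");
        logger.setLevel(Level.FINE);
        // Handlers filter records too; attach one that passes FINE through.
        ConsoleHandler handler = new ConsoleHandler();
        handler.setLevel(Level.FINE);
        logger.addHandler(handler);
        return logger;
    }

    public static void main(String[] args) {
        Logger logger = configurePingerLogger();
        logger.fine("ping activity would be logged at this level");
    }
}
```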
          Emil added a comment -

          mawinter69 Every 5 minutes is too infrequent. Load balancer TCP timeouts are typically less than a minute, and raising the load balancer timeout is not desirable. It would be great if this pinger could ping more frequently.

          Markus Winter added a comment -

          You can modify the interval by setting the following properties at startup:

          java -Dhudson.slaves.ChannelPinger.pingIntervalSeconds=30 -Dhudson.slaves.ChannelPinger.pingTimeoutSeconds=20 -jar jenkins.war
          

          Emil added a comment -

          mawinter69 that's great - thank you!

            Assignee: Unassigned
            Reporter: brovoca (Emil)