
java.io.EOFException continues to occur when using health checks with AWS Network Load Balancer

    • Type: Bug
    • Resolution: Not A Defect
    • Priority: Minor
    • core
    • Jenkins ver. 2.90

      Hi, we are currently using the AWS Network Load Balancer (http://docs.aws.amazon.com/elasticloadbalancing/latest/network/introduction.html) for Jenkins JNLP service port auto-discovery.

      But we keep getting these exceptions:

      Nov 20, 2017 10:55:50 AM FINE hudson.TcpSlaveAgentListener
      Accepted connection #41,949 from /10.240.0.4:24748
      Nov 20, 2017 10:55:50 AM WARNING hudson.TcpSlaveAgentListener$ConnectionHandler run
      Connection #41949 failed
      java.io.EOFException
       at java.io.DataInputStream.readFully(DataInputStream.java:197)
       at java.io.DataInputStream.readFully(DataInputStream.java:169)
       at hudson.TcpSlaveAgentListener$ConnectionHandler.run(TcpSlaveAgentListener.java:241)
      

      The slaves, however, are still working normally.


          Owen Mehegan added a comment -

          Dupe of JENKINS-46893?


          Jeff Thompson added a comment -

          Are you still seeing this issue? Can you provide further information about what is happening?


          Jeff Thompson added a comment -

          Closing for lack of response providing sufficient reproduction or diagnostic information.


          Aaron Trout added a comment - - edited

          I'm also hitting this. I am guessing that Jenkins is unhappy about the TCP health check probes that the NLB is sending it. 

          My limited understanding of Jenkins internals is that this JNLP/agent port multiplexes several protocols (HTTP for JNLP protocol discovery, plus the various JNLP protocol versions). Since the load balancer operates only at L4, it performs a simple TCP health check. I have attached a pcap of the health checks coming from the AWS NLB; each one just sets up a TCP connection and immediately tears it down again without sending any data over the connection. nlb-hc.pcap

          During this time we are getting one instance of the below error for every new TCP connection received:

           

          Nov 21, 2018 11:27:52 AM hudson.TcpSlaveAgentListener$ConnectionHandler run
           WARNING: Connection #2576 failed
           java.io.EOFException
           at java.io.DataInputStream.readFully(DataInputStream.java:197)
           at java.io.DataInputStream.readFully(DataInputStream.java:169)
           at hudson.TcpSlaveAgentListener$ConnectionHandler.run(TcpSlaveAgentListener.java:242)
          

           

           I have a similar setup, running Jenkins in Kubernetes with a `Service` of type `LoadBalancer` and the annotation to make it use the new NLB (rather than ELB classic). Because of this, the problem gets worse the larger your cluster is, since the NLB will run the health check against every node in the cluster (and kube-proxy will forward all those requests to the single Jenkins instance).
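          The behaviour described above can be reproduced outside Jenkins with a minimal sketch (this is not Jenkins code; the class name is made up for illustration). A peer that connects and closes without sending any bytes causes `DataInputStream.readFully` to hit end-of-stream immediately, which is exactly the call at the top of the stack trace:

          ```java
          import java.io.DataInputStream;
          import java.io.EOFException;
          import java.net.ServerSocket;
          import java.net.Socket;

          // Minimal sketch: a peer (like an NLB TCP health check) that connects
          // and disconnects without sending data makes readFully throw EOFException.
          public class HealthCheckEof {
              public static boolean probeCausesEof() throws Exception {
                  try (ServerSocket server = new ServerSocket(0)) {
                      // Simulate the health check: connect, then close immediately.
                      new Socket("localhost", server.getLocalPort()).close();

                      try (Socket accepted = server.accept();
                           DataInputStream in = new DataInputStream(accepted.getInputStream())) {
                          byte[] header = new byte[4];
                          in.readFully(header); // same call as in the stack trace
                          return false;         // unreachable: no data ever arrives
                      } catch (EOFException e) {
                          return true;          // end-of-stream with zero bytes read
                      }
                  }
              }

              public static void main(String[] args) throws Exception {
                  System.out.println("EOFException thrown: " + probeCausesEof());
              }
          }
          ```

          This matches the pcap: the probe is a bare TCP handshake and teardown, so the listener's header read can never complete.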


          Aaron Trout added a comment - - edited

          By the way, a MUCH easier way to reproduce this is to just use netcat against the JNLP port:

          Run a local Jenkins:

          $ docker run -p 50000:50000 jenkins/jenkins:lts

          Wait for the "Jenkins is fully up and running" message, then in another terminal:

          $ echo | nc localhost 50000
          

          causes:

          WARNING: Connection #1 failed
          java.io.EOFException
                  at java.io.DataInputStream.readFully(DataInputStream.java:197)
                  at java.io.DataInputStream.readFully(DataInputStream.java:169)
                  at hudson.TcpSlaveAgentListener$ConnectionHandler.run(TcpSlaveAgentListener.java:242)
          


          chris vest added a comment -

          I am seeing the same issue as aaron465


          Aaron Trout added a comment -

          Sorry for posting three comments in a row, but to summarise: this is potentially not a Jenkins bug. It doesn't break anything; the log is just super spammy in certain environments. You could raise the log level threshold to work around that if it is a problem for you.

          Arguably, though, the case where a TCP connection is established but no data is sent should be handled by the agent port listener so that we don't trigger this exception.

          Either way, from Jenkins' point of view this has nothing to do with Kubernetes or load balancers.
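          The log-level workaround mentioned above can be sketched with plain `java.util.logging` (the class name here is made up for illustration; inside Jenkins the same threshold can be set via "Manage Jenkins" system-log configuration or a `logging.properties` entry rather than code):

          ```java
          import java.util.logging.Level;
          import java.util.logging.Logger;

          // Sketch of the workaround: raise the threshold on the
          // hudson.TcpSlaveAgentListener logger so the WARNING-level
          // "Connection #N failed" messages are no longer emitted.
          public class QuietListener {
              public static Logger quiet() {
                  Logger l = Logger.getLogger("hudson.TcpSlaveAgentListener");
                  l.setLevel(Level.SEVERE); // drop WARNING and below
                  return l;
              }

              public static void main(String[] args) {
                  Logger l = quiet();
                  System.out.println("WARNING loggable: " + l.isLoggable(Level.WARNING));
              }
          }
          ```

          The trade-off is that genuinely failed agent connections are also silenced, which is why this is only a workaround.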


          Jeff Thompson added a comment -

          Interesting. Thanks for the analysis, aaron465. If we could figure out the appropriate behavior, we could do something about it. Maybe just lowering the level of that log message would be the right thing to do; I've done that for a number of messages lately that were overly aggressive when first implemented. I don't know if there are any situations in which that message might be useful, though. Perhaps we could log it only if some specific condition is met: for example, ignore the EOFException on the attempt to read the header, but not if it comes later. I'm inclined to do that to clean up these spammy messages in these environments.

          Any thoughts?
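          A hypothetical sketch of the condition suggested above (not actual Jenkins code; the class and method names are invented for illustration): treat end-of-stream before any bytes arrive as a benign probe, but keep treating EOF mid-header as an error.

          ```java
          import java.io.DataInputStream;
          import java.io.EOFException;
          import java.io.IOException;
          import java.io.InputStream;

          // Hypothetical handling: EOF before the first byte means the peer
          // connected and closed without sending data (e.g. a TCP health check),
          // so return null quietly; EOF after partial data still propagates.
          public class HeaderReader {
              public static byte[] readHeaderOrNull(InputStream raw, int len) throws IOException {
                  int first = raw.read();
                  if (first == -1) {
                      return null; // benign: no data was ever sent
                  }
                  byte[] header = new byte[len];
                  header[0] = (byte) first;
                  // EOF here is still an error: the peer started a handshake and vanished.
                  new DataInputStream(raw).readFully(header, 1, len - 1);
                  return header;
              }

              public static void main(String[] args) throws IOException {
                  byte[] h = readHeaderOrNull(new java.io.ByteArrayInputStream(new byte[0]), 4);
                  System.out.println("empty stream -> " + h);
              }
          }
          ```

          Under this scheme the NLB probes would be logged at a low level (or not at all), while a truncated protocol header would still surface as an EOFException.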


          Jeff Thompson added a comment -

          Looking at other reports, it's not clear that downgrading or ignoring this error is the correct thing to do in all environments. In some environments this message appears to provide meaningful information.

          It sounds like Owen was correct initially: this is a duplicate of JENKINS-46893.

          Without further ideas I don't know what we could do here, and it sounds like there are some possibilities for addressing this in the system deployment configuration.


          Jeff Thompson added a comment -

          Closing this again as it appears to be Not a Defect, a Duplicate, or Cannot Reproduce, and no further information or explanation has been provided beyond the existing workarounds.


            Assignee: jthompson (Jeff Thompson)
            Reporter: protosschris (Chris Lee)
            Votes: 1
            Watchers: 5
