JENKINS-57119: maven-plugin random socket leak, leading to threads leak on slave and master

    • Type: Bug
    • Priority: Major
    • Resolution: Fixed
    • Component: maven-plugin

      I've had to deal with a leak of "Channel reader thread: Channel to Maven" for a few months now. On the master, these appear as the following, stuck on a read() operation, without the attached "Executor" thread that exists when a job is running:

       

      "Channel reader thread: Channel to Maven [/var/lib/jenkins/tools/hudson.model.JDK/AdoptOpenJDK_11.0.1u13/jdk-11.0.1+13/bin/java, -Xmx512m, -cp, /var/lib/jenkins/maven33-agent.jar:/var/lib/jenkins/tools/hudson.tasks.Maven_MavenInstallation/3.3.9/3.3.9/boot/plexus-classworlds-2.5.2.jar:/var/lib/jenkins/tools/hudson.tasks.Maven_MavenInstallation/3.3.9/3.3.9/conf/logging, jenkins.maven3.agent.Maven33Main, /var/lib/jenkins/tools/hudson.tasks.Maven_MavenInstallation/3.3.9/3.3.9, /var/lib/jenkins/remoting.jar, /var/lib/jenkins/maven33-interceptor.jar, /var/lib/jenkins/maven3-interceptor-commons.jar, 45945]" #4754401 daemon prio=5 os_prio=0 tid=0x00007f3c6c0ee000 nid=0x4d09 in Object.wait() [0x00007f3bf49c3000]
         java.lang.Thread.State: TIMED_WAITING (on object monitor)
              at java.lang.Object.wait(Native Method)
              at hudson.remoting.FastPipedInputStream.read(FastPipedInputStream.java:175)
              - locked <0x0000000607135d08> (a [B)
              at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
              at java.io.BufferedInputStream.read(BufferedInputStream.java:265)
              - locked <0x0000000607062548> (a java.io.BufferedInputStream)
              at hudson.remoting.FlightRecorderInputStream.read(FlightRecorderInputStream.java:91)
              at hudson.remoting.ChunkedInputStream.readHeader(ChunkedInputStream.java:72)
              at hudson.remoting.ChunkedInputStream.readUntilBreak(ChunkedInputStream.java:103)
              at hudson.remoting.ChunkedCommandTransport.readBlock(ChunkedCommandTransport.java:39)
              at hudson.remoting.AbstractSynchronousByteArrayCommandTransport.read(AbstractSynchronousByteArrayCommandTransport.java:35)
              at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:63)
      

        

      After about 3 weeks, we accumulate between 3000 and 4000 of these threads, eventually leading to OutOfMemory or file-descriptor-related errors.

      I found no related errors in the master logs, the slave logs, or the output of any job.

      After some digging, I found that those threads are attached to "proxy" threads on the slaves, also stuck on a read() operation with the following stack:

       

      "Stream reader: maven process at Socket[addr=/127.0.0.1,port=60304,localport=45945]" #76026 prio=5 os_prio=0 tid=0x00007fc7c40c6000 nid=0x4ab9 runnable [0x00007fc7f3bfa000]
         java.lang.Thread.State: RUNNABLE
              at java.net.SocketInputStream.socketRead0(Native Method)
              at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
              at java.net.SocketInputStream.read(SocketInputStream.java:171)
              at java.net.SocketInputStream.read(SocketInputStream.java:141)
              at java.io.FilterInputStream.read(FilterInputStream.java:133)
              at java.io.FilterInputStream.read(FilterInputStream.java:107)
              at hudson.util.StreamCopyThread.run(StreamCopyThread.java:60)
      

       

       

      When this happens, the socket that the thread tries to read() is always stuck in FIN_WAIT2 state indefinitely, with the other end of the connection (maven-interceptor) gone. 

       

      tcp6       0      0 127.0.0.1:45945         127.0.0.1:60304         FIN_WAIT2   17711/java    
      

       

      This seems to occur at the end of a Maven job execution, when close() is called on the socket. I have not been able to reproduce this state on demand, but a few instances appear per hour in our environment. I captured a tcpdump and noticed the following patterns:

      Most common closure pattern without leak (slave: 46415 | maven-interceptor: 59114): see the attached normal-no-leak.png.

      RST without leak (slave: 34531 | maven-interceptor: 47034): see the attached rst-no-leak.png.

      RST leading to a leak (slave: 45097, stack traces above | maven-interceptor: 40492): see the attached rst-leak.png.

      This looks like a race condition depending on which side attempts to close the connection first. The strange part is that the read() operation in the StreamCopyThread never returns -1, which should happen once the FIN is received, nor does it throw the SocketException expected for the RST packet.
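
      For context, the copy loop in question boils down to something like the following (a simplified sketch, not the actual StreamCopyThread source). The thread can only exit when read() returns -1 or throws, so when neither the FIN nor the RST is ever surfaced to the socket, the call parks forever:

      import java.io.IOException;
      import java.io.InputStream;
      import java.io.OutputStream;

      // Simplified sketch of a stream copy loop, not the actual StreamCopyThread source.
      // read() returns -1 on an orderly FIN and throws IOException on a visible RST;
      // if neither is ever delivered to the socket, the call blocks indefinitely.
      class CopyLoopSketch {
          static void copy(InputStream in, OutputStream out) throws IOException {
              byte[] buf = new byte[8192];
              int len;
              while ((len = in.read(buf)) != -1) { // blocks here forever on a dead peer
                  out.write(buf, 0, len);
              }
          }
      }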

      Adding socket timeout:

      I noticed that there is no timeout set on the socket created in the remoting process. I've had similar issues in the past with stateful network devices dropping connections without notifying both ends, and setting a timeout usually helps in such cases.

      Not a good option here, as every user of the socket would need to handle SocketTimeoutException, and it is hard to control them all: from running the test suite, it's not just the StreamCopyThread that would need to handle it.
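
      For illustration only, a hypothetical sketch of what the timeout option would look like (the class name and the 30-second value are made up, not code from the plugin); every consumer of the socket's streams would need an equivalent catch block:

      import java.io.IOException;
      import java.net.Socket;
      import java.net.SocketTimeoutException;

      // Hypothetical sketch of the rejected timeout option; the 30s value is illustrative.
      // Any read() blocked longer than the timeout throws SocketTimeoutException, so every
      // consumer of the socket's streams has to catch it and decide what to do.
      class TimeoutOptionSketch {
          static void configure(Socket socket) throws IOException {
              socket.setSoTimeout(30_000);
          }

          static int readWithTimeout(Socket socket, byte[] buf) throws IOException {
              try {
                  return socket.getInputStream().read(buf);
              } catch (SocketTimeoutException e) {
                  // Remote end presumed gone: stop copying instead of blocking forever.
                  return -1;
              }
          }
      }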

      Enabling socket keepalive:

      The next option was to enable the socket KEEPALIVE and let the network stack kill the socket when it detects that the remote end no longer answers. This shouldn't impact the normal reading flow; it should only cause a SocketException when a socket gets stuck in the FIN_WAIT2 state. That exception is already ignored in the StreamCopyThread: https://github.com/jenkinsci/jenkins/blob/2767b00146ce2ff2738b7fd7c6db95a26b8f9f39/core/src/main/java/hudson/util/StreamCopyThread.java#L74
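
      A minimal sketch of the idea, not necessarily the exact change in the PR linked below:

      import java.io.IOException;
      import java.net.ServerSocket;
      import java.net.Socket;

      // Minimal sketch of the keepalive option, not necessarily the exact code in the PR:
      // with SO_KEEPALIVE enabled, the kernel probes the peer once the connection goes idle
      // and eventually tears down a socket stuck in FIN_WAIT2, so the blocked read() fails
      // with a SocketException instead of hanging forever.
      class KeepAliveSketch {
          static Socket acceptWithKeepAlive(ServerSocket server) throws IOException {
              Socket socket = server.accept();
              socket.setKeepAlive(true);
              return socket;
          }
      }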

      I'm currently running a patched version with the above fix and it seems to work properly. While this doesn't prevent the leak from occurring, leaked threads no longer live forever on either the slave or the master. We can also control the behavior by adjusting kernel parameters on the slaves:

       

      ~# ss -o | grep FIN
      tcp    FIN-WAIT-2 0      0                                                     ::ffff:127.0.0.1:39615                                                               ::ffff:127.0.0.1:34032                 timer:(keepalive,83min,0)
      
      ~# sysctl -a | grep keepalive
      net.ipv4.tcp_keepalive_intvl = 75
      net.ipv4.tcp_keepalive_probes = 9
      net.ipv4.tcp_keepalive_time = 7200
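
      With the stock values above, the kernel only starts probing after tcp_keepalive_time = 7200 s and gives up after 9 probes at 75 s intervals, i.e. roughly 7200 + 9 × 75 ≈ 7875 s (a bit over two hours) before a dead connection is torn down. As an aside not covered by the PR, on JDK 11+ on Linux the same timers could in principle be tightened per socket instead of system-wide; a hedged sketch with purely illustrative values:

      import java.io.IOException;
      import java.net.Socket;
      import jdk.net.ExtendedSocketOptions;

      // Aside, not part of the PR: on JDK 11+ (Linux) the keepalive timers can be tuned
      // per socket via jdk.net.ExtendedSocketOptions instead of the system-wide sysctls.
      // The values below are purely illustrative.
      class KeepAliveTuningSketch {
          static void tune(Socket socket) throws IOException {
              socket.setKeepAlive(true);
              socket.setOption(ExtendedSocketOptions.TCP_KEEPIDLE, 600);    // probe after 10 min idle
              socket.setOption(ExtendedSocketOptions.TCP_KEEPINTERVAL, 60); // probe every 60 s
              socket.setOption(ExtendedSocketOptions.TCP_KEEPCOUNT, 5);     // give up after 5 failed probes
          }
      }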

       

       Here's a PR with the change: https://github.com/jenkinsci/maven-plugin/pull/126

       

      Attachments:
        1. rst-no-leak.png (148 kB)
        2. rst-leak.png (145 kB)
        3. normal-no-leak.png (116 kB)

            Assignee: Gabriel Lavoie (glavoie)
            Reporter: Gabriel Lavoie (glavoie)