JENKINS-57119: maven-plugin random socket leak, leading to thread leaks on slave and master


    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Component: maven-plugin

      I've had to deal with a leak of "Channel reader thread: Channel to Maven" threads for a few months now. On the master, these threads appear as follows, stuck on a read() operation, without the attached "Executor" thread that exists while a job is running:

       

      "Channel reader thread: Channel to Maven [/var/lib/jenkins/tools/hudson.model.JDK/AdoptOpenJDK_11.0.1u13/jdk-11.0.1+13/bin/java, -Xmx512m, -cp, /var/lib/jenkins/maven33-agent.jar:/var/lib/jenkins/tools/hudson.tasks.Maven_MavenInstallation/3.3.9/3.3.9/boot/plexus-classworlds-2.5.2.jar:/var/lib/jenkins/tools/hudson.tasks.Maven_MavenInstallation/3.3.9/3.3.9/conf/logging, jenkins.maven3.agent.Maven33Main, /var/lib/jenkins/tools/hudson.tasks.Maven_MavenInstallation/3.3.9/3.3.9, /var/lib/jenkins/remoting.jar, /var/lib/jenkins/maven33-interceptor.jar, /var/lib/jenkins/maven3-interceptor-commons.jar, 45945]" #4754401 daemon prio=5 os_prio=0 tid=0x00007f3c6c0ee000 nid=0x4d09 in Object.wait() [0x00007f3bf49c3000]
         java.lang.Thread.State: TIMED_WAITING (on object monitor)
              at java.lang.Object.wait(Native Method)
              at hudson.remoting.FastPipedInputStream.read(FastPipedInputStream.java:175)
              - locked <0x0000000607135d08> (a [B)
              at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
              at java.io.BufferedInputStream.read(BufferedInputStream.java:265)
              - locked <0x0000000607062548> (a java.io.BufferedInputStream)
              at hudson.remoting.FlightRecorderInputStream.read(FlightRecorderInputStream.java:91)
              at hudson.remoting.ChunkedInputStream.readHeader(ChunkedInputStream.java:72)
              at hudson.remoting.ChunkedInputStream.readUntilBreak(ChunkedInputStream.java:103)
              at hudson.remoting.ChunkedCommandTransport.readBlock(ChunkedCommandTransport.java:39)
              at hudson.remoting.AbstractSynchronousByteArrayCommandTransport.read(AbstractSynchronousByteArrayCommandTransport.java:35)
              at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:63)
      

        

      After about 3 weeks, we end up with between 3,000 and 4,000 of these threads, eventually leading to OutOfMemoryError or file-descriptor-related errors.

      I found no related error in the master logs, the slave logs, or in any job output.

      After some digging, I found that those threads are attached to "proxy" threads on the slaves, also stuck on a read() operation with the following stack:

       

      "Stream reader: maven process at Socket[addr=/127.0.0.1,port=60304,localport=45945]" #76026 prio=5 os_prio=0 tid=0x00007fc7c40c6000 nid=0x4ab9 runnable [0x00007fc7f3bfa000]
         java.lang.Thread.State: RUNNABLE
              at java.net.SocketInputStream.socketRead0(Native Method)
              at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
              at java.net.SocketInputStream.read(SocketInputStream.java:171)
              at java.net.SocketInputStream.read(SocketInputStream.java:141)
              at java.io.FilterInputStream.read(FilterInputStream.java:133)
              at java.io.FilterInputStream.read(FilterInputStream.java:107)
              at hudson.util.StreamCopyThread.run(StreamCopyThread.java:60)
      

       

       

      When this happens, the socket that the thread tries to read() from is always stuck in the FIN_WAIT2 state indefinitely, with the other end of the connection (maven-interceptor) gone.

       

      tcp6       0      0 127.0.0.1:45945         127.0.0.1:60304         FIN_WAIT2   17711/java    
      

       

      This seems to occur at the end of a Maven job execution, when the socket is being closed. I have not been able to reproduce this state on demand, but a few occurrences appear per hour in our environment. I captured a tcpdump and noticed the following patterns:

      Most common closure pattern without leak (slave: 46415 | maven-interceptor: 59114): see attached normal-no-leak.png

      RST without leak (slave: 34531 | maven-interceptor: 47034): see attached rst-no-leak.png

      RST leading to a leak (slave: 45097, stack traces above | maven-interceptor: 40492): see attached rst-leak.png

      This looks like a race condition depending on which side attempts to close the connection first. The strange part is that the read() operation in the StreamCopyThread never returns -1, which should happen once the FIN is received, nor does it throw the SocketException expected for the RST packet.
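
      To illustrate why the thread hangs, here is a simplified sketch (my own, not the actual Jenkins source) of the kind of copy loop StreamCopyThread runs; the thread can only exit when read() returns -1 or throws, and in the leaked case neither ever happens:

      import java.io.IOException;
      import java.io.InputStream;
      import java.io.OutputStream;

      // Simplified stream-copy loop: block in read() until EOF (-1) or an IOException.
      class CopyLoopSketch {
          static void copy(InputStream in, OutputStream out) throws IOException {
              byte[] buf = new byte[8192];
              int len;
              // read() should return -1 once the peer's FIN has been processed,
              // ending the loop; for the leaked sockets it just blocks forever.
              while ((len = in.read(buf)) >= 0) {
                  out.write(buf, 0, len);
              }
          }
      }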

      Adding a socket timeout:

      I noticed that there is no timeout set on the socket created in the remoting process. I've had similar issues in the past with stateful network devices dropping connections without notifying both ends, and setting a timeout usually helps in such cases.

      It's not a good option here though, as every user of the socket would need to handle SocketTimeoutException, and it seems hard to track down all of those users. From running the test suite, it's not just the StreamCopyThread that would need to handle it.
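
      For illustration, here is roughly what that option would look like (the names and the 30-second value are hypothetical, not maven-plugin code), along with the handling it forces on every reader:

      import java.io.IOException;
      import java.io.InputStream;
      import java.net.Socket;
      import java.net.SocketTimeoutException;

      // Hypothetical sketch: enable a read timeout on the interceptor socket.
      class SoTimeoutSketch {
          static void readWithTimeout(Socket socket) throws IOException {
              socket.setSoTimeout(30_000); // reads now fail after 30s of inactivity
              InputStream in = socket.getInputStream();
              try {
                  in.read(); // throws instead of blocking forever
              } catch (SocketTimeoutException e) {
                  // every consumer of the stream needs handling like this, including
                  // code that legitimately waits longer than the timeout
              }
          }
      }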

      Enabling socket keepalive:

      The next option was to enable SO_KEEPALIVE on the socket and let the network stack kill the connection once it detects that the remote end no longer answers. This shouldn't impact the normal reading flow; it should only cause a SocketException when a socket gets stuck in the FIN_WAIT2 state. That exception is already ignored in the StreamCopyThread: https://github.com/jenkinsci/jenkins/blob/2767b00146ce2ff2738b7fd7c6db95a26b8f9f39/core/src/main/java/hudson/util/StreamCopyThread.java#L74
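
      A minimal sketch of that approach (illustrative names only; the actual change is in the PR linked below):

      import java.io.IOException;
      import java.net.ServerSocket;
      import java.net.Socket;

      // Enable SO_KEEPALIVE on the socket accepted from maven-interceptor so the
      // kernel probes the peer. When the peer is gone, a blocked read() eventually
      // fails with a SocketException, which StreamCopyThread already swallows,
      // letting the reader thread exit.
      class KeepAliveSketch {
          static Socket acceptWithKeepAlive(ServerSocket server) throws IOException {
              Socket socket = server.accept();
              socket.setKeepAlive(true);
              return socket;
          }
      }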

      I'm currently running a patched version with the above fix and it seems to work properly. While this doesn't stop the leak itself, leaked threads no longer live forever on either the slave or the master. We can also control how quickly stuck sockets are reaped by adjusting the kernel keepalive parameters on the slaves:

       

      ~# ss -o | grep FIN
      tcp    FIN-WAIT-2 0      0                                                     ::ffff:127.0.0.1:39615                                                               ::ffff:127.0.0.1:34032                 timer:(keepalive,83min,0)
      
      ~# sysctl -a | grep keepalive
      net.ipv4.tcp_keepalive_intvl = 75
      net.ipv4.tcp_keepalive_probes = 9
      net.ipv4.tcp_keepalive_time = 7200

       

       Here's a PR with the change: https://github.com/jenkinsci/maven-plugin/pull/126

       

      Attachments:

        1. normal-no-leak.png (116 kB)
        2. rst-leak.png (145 kB)
        3. rst-no-leak.png (148 kB)

            Assignee: Gabriel Lavoie (glavoie)
            Reporter: Gabriel Lavoie (glavoie)