Status: Closed
Environment:
Master Java version: openjdk version "1.8.0_212"
Docker image built "FROM jenkins/jenkins:2.171"
maven-plugin: 3.12
ssh-slave-plugin: 1.29.4
ec2-fleet-plugin: 1.1.9
Slave Java version: openjdk version "1.8.0_212"
Running on AMI: debian-stretch-hvm-x86_64-gp2-2019-02-19-26620 (https://wiki.debian.org/Cloud/AmazonEC2Image/Stretch)
Kernel: Linux <hostname> 4.9.0-8-amd64 #1 SMP Debian 4.9.144-3 (2019-02-02) x86_64 GNU/Linux
I've had to deal with a leak of "Channel reader thread: Channel to Maven" for a few months now. On the master, these appear as the following, stuck on a read() operation, without the attached "Executor" thread that exists when a job is running:
After about 3 weeks, we accumulate between 3,000 and 4,000 of these threads, eventually leading to OutOfMemoryError or file-descriptor exhaustion.
I found no related errors in the master logs, the slave logs, or the output of any job.
After some digging, I found that those threads are attached to "proxy" threads on the slaves, also stuck on a read() operation with the following stack:
When this happens, the socket that the thread is trying to read() is stuck indefinitely in the FIN_WAIT2 state, with the other end of the connection (maven-interceptor) gone.
This seems to occur at the end of a Maven job execution, when the socket is being closed(). I have not been able to reproduce this state on demand, but a few occurrences appear per hour in our environment. I captured a tcpdump and noticed the following patterns:
Most common closure pattern without leak (slave: 46415 | maven-interceptor: 59114):
RST without leak (slave: 34531 | maven-interceptor: 47034):
RST leading to a leak (slave: 45097, stack traces above | maven-interceptor: 40492):
This looks like a race condition depending on which side first attempts to close the connection. The strange part is that the read() operation in the StreamCopyThread never returns -1, as it should when a FIN is received, nor throws a SocketException for the RST packet.
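For reference, a minimal sketch (the class name is mine, not Jenkins code) of the contract the StreamCopyThread relies on: once the peer closes the connection and the FIN is delivered, a blocked read() returns -1 and the copy loop can exit.

```java
import java.io.InputStream;
import java.net.ServerSocket;
import java.net.Socket;

// Illustrative only: shows the normal end-of-stream behavior that is
// NOT observed on the leaked sockets described above.
public class FinDemo {
    public static int readAfterPeerClose() throws Exception {
        try (ServerSocket server = new ServerSocket(0)) {
            Socket client = new Socket("127.0.0.1", server.getLocalPort());
            try (Socket accepted = server.accept()) {
                accepted.close();              // peer closes -> FIN is sent
                InputStream in = client.getInputStream();
                return in.read();              // end-of-stream: returns -1
            } finally {
                client.close();
            }
        }
    }
}
```

In the leaked case, this -1 never arrives, and the thread blocks forever.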
I noticed that there is no read timeout (SO_TIMEOUT) set on the socket created in the remoting process. I've had similar issues in the past with stateful network devices dropping connections without notifying both ends, and setting a timeout usually helps in such cases.
It is not a good option here, though, as every user of the socket would need to handle SocketTimeoutException, and it is hard to track them all down: from running the test suite, StreamCopyThread is not the only reader that would need to handle it.
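To illustrate why that option was rejected (class and method names are mine, for demonstration only): once SO_TIMEOUT is set, any read() that stays idle past the timeout throws SocketTimeoutException, so every consumer of the stream must catch it and decide whether the peer is really gone or just quiet.

```java
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

// Illustrative only: with SO_TIMEOUT set, an idle read() no longer
// blocks forever -- it throws, and every caller must now handle that.
public class TimeoutDemo {
    public static boolean idleReadTimesOut(int timeoutMillis) throws Exception {
        try (ServerSocket server = new ServerSocket(0);
             Socket client = new Socket("127.0.0.1", server.getLocalPort());
             Socket accepted = server.accept()) {
            client.setSoTimeout(timeoutMillis);
            try {
                client.getInputStream().read(); // nothing is ever written
                return false;
            } catch (SocketTimeoutException e) {
                return true; // what every reader would now have to handle
            }
        }
    }
}
```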
The next option was to enable SO_KEEPALIVE on the socket and let the network stack kill the connection when it detects that the remote end no longer answers. This shouldn't impact the normal read flow; it only causes a SocketException when a socket gets stuck in the FIN_WAIT2 state. That exception is currently ignored in StreamCopyThread: https://github.com/jenkinsci/jenkins/blob/2767b00146ce2ff2738b7fd7c6db95a26b8f9f39/core/src/main/java/hudson/util/StreamCopyThread.java#L74
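A minimal sketch of that approach (illustrative, not the actual plugin change): enabling SO_KEEPALIVE asks the kernel to probe an idle connection and reset it when the remote end never answers, so a read() blocked on a dead FIN_WAIT2 socket eventually fails with a SocketException instead of hanging forever.

```java
import java.net.ServerSocket;
import java.net.Socket;

// Illustrative only: SO_KEEPALIVE is a per-socket flag; the probe
// timing itself is governed by kernel parameters on the host.
public class KeepAliveDemo {
    public static boolean enableKeepAlive() throws Exception {
        try (ServerSocket server = new ServerSocket(0);
             Socket client = new Socket("127.0.0.1", server.getLocalPort());
             Socket accepted = server.accept()) {
            client.setKeepAlive(true); // ask the kernel to probe idle peers
            return client.getKeepAlive();
        }
    }
}
```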
I'm currently running a patched version with the above fix and it seems to work properly. While this doesn't stop the leak, leaked threads no longer live forever on either the slave or the master. We can also control the behavior by adjusting kernel parameters on the slaves:
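The specific values used in our environment are not listed here; for illustration, the relevant Linux knobs are the standard TCP keepalive sysctls (the values below are examples, not our settings):

```
# Example sysctl settings (illustrative values)
net.ipv4.tcp_keepalive_time = 120    # seconds of idle time before the first probe
net.ipv4.tcp_keepalive_intvl = 30    # seconds between probes
net.ipv4.tcp_keepalive_probes = 4    # unanswered probes before the connection is reset
```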
Here's a PR with the change: https://github.com/jenkinsci/maven-plugin/pull/126