Type: Bug
Resolution: Fixed
Priority: Major
I have a job that dynamically creates jobs using the CLI. Since installing this job, which verifies the existence of jobs by calling 'get-job', Jenkins has been leaking file descriptors. I am currently making around 40 calls per build, and the build runs on every CVS commit. I have a job set up to monitor the number of FDs in /proc/$jenkins_pid/fd. Calling garbage collection in the JVM doesn't release the FDs, so the only cure is to restart Jenkins before the count reaches the open-file ulimit. I have set my ulimit to 65356 so I don't have to restart so frequently. I restarted Jenkins at 7:49 this morning and the file descriptor count is currently at 6147; it's now 12:10 in the afternoon, so it has been steadily leaking FDs at approximately 1500 per hour.
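For reference, monitoring the number of FDs in /proc/$jenkins_pid/fd amounts to something along these lines (a sketch only; it needs to run as the Jenkins user or root, and how you obtain the PID will vary per install):
#!/bin/sh
# Usage: fd-count.sh <jenkins_pid>
# Prints a timestamped count of the Jenkins master's open file descriptors.
jenkins_pid=$1
echo "$(date '+%Y-%m-%d %H:%M:%S') $(ls /proc/"$jenkins_pid"/fd | wc -l) open fds"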
Is related to:
JENKINS-23572 - Repeated calls to jenkins cli results in Too many open files exception on the master (Resolved)
[JENKINS-23248] CLI calls are causing file descriptor leaks.
--- /tmp/jenk_fd.1 2014-06-01 21:03:34.006155887 +0100
+++ /tmp/jenk_fd.2 2014-06-01 21:09:27.053382015 +0100
@@ -52342 +52341,0 @@
-l-wx------ 1 kcc users 64 Jun 1 20:45 57103 -> socket:[282820885]
@@ -55657 +55655,0 @@
-lr-x------ 1 kcc users 64 Jun 1 20:59 60087 -> /user1/jenkins/jobs/scm-poll-jenkins-branch-monitor/builds/2014-06-01_21-01-01/log
@@ -55662 +55660 @@
-lr-x------ 1 kcc users 64 Jun 1 21:01 60091 -> pipe:[282533977]
+lr-x------ 1 kcc users 64 Jun 1 21:01 60091 -> socket:[283188722]
@@ -55747 +55745 @@
-lrwx------ 1 kcc users 64 Jun 1 21:03 60169 -> socket:[282813330]
+lrwx------ 1 kcc users 64 Jun 1 21:03 60169 -> socket:[283188811]
@@ -55807 +55804,0 @@
-lrwx------ 1 kcc users 64 Jun 1 21:03 60222 -> socket:[282819646]
@@ -55824,0 +55822 @@
+lrwx------ 1 kcc users 64 Jun 1 21:04 60239 -> socket:[282821225]
@@ -55825,0 +55824,5 @@
+lrwx------ 1 kcc users 64 Jun 1 21:04 60240 -> socket:[282821552]
+lrwx------ 1 kcc users 64 Jun 1 21:05 60241 -> socket:[282821858]
+l-wx------ 1 kcc users 64 Jun 1 21:05 60242 -> socket:[282947156]
+lrwx------ 1 kcc users 64 Jun 1 21:07 60243 -> socket:[282947065]
+lr-x------ 1 kcc users 64 Jun 1 21:05 60244 -> socket:[283309559]
This is just a small sample from two snapshots taken within several minutes of one another. I can send a larger delta if need be. The leak is now at >60000 files, as you can see from the line numbers in the diff.
Running garbage collection in the JVM doesn't clear them down either. I think I mentioned this already.
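For completeness, the two snapshots above were generated roughly like this (a sketch; run as the Jenkins user or root, with $jenkins_pid set as above):
# Snapshot the fd table twice, a few minutes apart, then take the delta.
ls -la /proc/"$jenkins_pid"/fd > /tmp/jenk_fd.1
sleep 300   # wait a few minutes while builds keep running
ls -la /proc/"$jenkins_pid"/fd > /tmp/jenk_fd.2
diff -U0 /tmp/jenk_fd.1 /tmp/jenk_fd.2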
I know what the cause was. I found this article: https://wiki.jenkins-ci.org/display/JENKINS/Spawning+processes+from+build
I'd say the article needs updating. It assumes that the bug/feature only affects subprocesses that hold pipes open and become detached from the main process. This is not entirely true, since I was able to reproduce the same effect by running the CLI utility as a foreground process from a job spawned on the Jenkins server. The output and input of the CLI were attached directly to the build pipeline; there was no detaching or backgrounding going on, simply execute and return the exit code. However, what I think may be happening (I am a Java noob, so this is complete speculation) is that the same bug/feature in Java that causes the problem for build processes was also exhibited by the Java process spawned to run the CLI utility. I suspect this created some kind of circular file descriptor reference inside the JVM, preventing the EOF from being transmitted by the CLI utility when it exited.
I have cured the problem completely by modifying my wrapper script to explicitly close all descriptors apart from stdin, stdout and stderr on invocation, before running the Java util. Also, after running the util, stdin, stdout and stderr are explicitly closed before the shell script exits. Since making this change, the total number of open file descriptors has been stable at 823.
My guess is that the CLI utility should be doing something similar.
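For anyone else hitting this, the wrapper change looks roughly like the following sketch (the CLI path, URL and 'get-job' call are placeholders taken from this report; bash-specific, and fd 255 is skipped because bash normally keeps the script itself open there):
#!/bin/bash
# Close every inherited descriptor except stdin/stdout/stderr before
# invoking the CLI, then close 0/1/2 once the CLI has returned.
for fd in $(ls /proc/$$/fd); do
    if [ "$fd" -gt 2 ] && [ "$fd" -ne 255 ]; then
        eval "exec $fd>&-" 2>/dev/null
    fi
done

java -jar /jenkins/jenkins-cli.jar -s http://testserver:8080/ get-job "$1"
rc=$?

# Explicitly close stdin, stdout and stderr before the script exits.
exec 0<&- 1>&- 2>&-
exit "$rc"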
I just logged JENKINS-23572 about a week ago and it looks like this exact same issue. Happy to mark mine a duplicate of this or run tests to determine the root cause. We are calling the CLI from within our Groovy test code inside our job while it's executing on the slaves. I don't see how I could work around this by closing file descriptors in my wrapper code that's executing on the slave, since it's on the master where there are too many open file descriptors. Maybe your job was executing directly on the master?
damong: Looks like a duplicate. If you could do a "binary search" for the responsible Jenkins release (1.546 good, 1.564 bad leaves too many possible versions), that would help.
Hi Daniel,
It may take me a few days to get back to you, but yes, I could run this test against the intermediate versions. I've cloned this VM off into an isolated environment for testing. Isn't there something else I could do at the same time to narrow it down? lsof just shows me that the file descriptors are leaking from Java. Is there any sort of Jenkins tracing I might want to enable at the same time that might narrow down where it's coming from?
Actually it looks like we'll be able to try this on Thursday.
damong: If you know how to search a heap dump for these, go for it. Otherwise, it's probably easiest to narrow it down to the first broken release.
If you then have Git and Maven (3.0.5+) available, check out https://github.com/jenkinsci/jenkins and try using git bisect to find the responsible commit. mvn -DskipTests=true clean verify && java -jar war/target/jenkins.war can be used to compile and run Jenkins from source.
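In other words, something along these lines (a sketch; the tag names assume the jenkins-<version> naming convention used in that repository):
git clone https://github.com/jenkinsci/jenkins && cd jenkins
git bisect start
git bisect bad jenkins-1.564    # first release known to leak
git bisect good jenkins-1.546   # last release known to be clean
# For each revision bisect checks out: build, run, try to reproduce, then mark it
mvn -DskipTests=true clean verify && java -jar war/target/jenkins.war
# git bisect good   -or-   git bisect bad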
From what I can tell, the problem is with the CLI, and specifically its use of the Jenkins remoting library. By going back to the 1.546 version of the CLI and adjusting the version of the remoting library used, I determined that the leaks occur on version 2.38 but not on 2.37. Those two versions are significantly different, so I'm not sure of the exact change that introduced the problem.
Adjusted issue metadata based on Matthew Reiter's comment. Damon G: It would be helpful if you could try doing the same to confirm you're seeing the same issue.
Damon and I are working on the same system, so I can confirm he is seeing the same issue.
Hi Daniel, so are you all set with help? Matt pinpointed the version of the remoting library. Looks like there was a pretty major refactoring done between 2.37 and 2.38.
I pinged Kohsuke about it, but he was busy with something else at the time.
Hey, just wanted to add that I'm seeing this issue as well. Running v. 1.566.
We are seeing an issue where calls to jenkins-cli.jar from a node leave a socket open each time it is called. We use these calls to update the external resource plugin, and this can lead to hundreds of file descriptors leaked every day. The only way to free them up is to restart Jenkins.
lsof shows the unclosed socket as:
java 4222 tomcat6 804u sock 0,6 0t0 230747605 can't identify protocol
This is with Version 1.573
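For what it's worth, the leaked sockets can be counted straight from lsof (a sketch; 4222 is the PID from the line above):
# Count sockets that lsof can no longer associate with a protocol
lsof -p 4222 | grep -c "can't identify protocol"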
It may be useful to write a test case and run an automated bisect: http://java.dzone.com/articles/automated-bug-finding-git ...
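For example (a sketch; check-fd-leak.sh is a hypothetical script that builds and starts the checked-out revision, hammers the CLI, and exits non-zero when the descriptor count grows):
git bisect start jenkins-1.564 jenkins-1.546   # bad release first, then good
git bisect run ./check-fd-leak.sh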
Confirmed. Based on the output from lsof, TCP connections are leaking.
Code changed in jenkins
User: Kohsuke Kawaguchi
Path:
src/main/java/hudson/remoting/ChunkedInputStream.java
src/main/java/hudson/remoting/ChunkedOutputStream.java
http://jenkins-ci.org/commit/remoting/fc5cf1a6a91d255333eff4ec936d55e5719e773c
Log:
JENKINS-23248
ChunkedInputStream.close() wasn't closing the underlying stream
Code changed in jenkins
User: Kohsuke Kawaguchi
Path:
src/main/java/hudson/remoting/SocketChannelStream.java
http://jenkins-ci.org/commit/remoting/6c7d4d110362106cf897fef6485b80dbd3755d4c
Log:
JENKINS-23248 Making the close method more robust
Unlike the usual close() method, shutdownInput/Output throws an exception for the 2nd invocation.
It's better to silently do nothing instead of dying with IOException.
Code changed in jenkins
User: Kohsuke Kawaguchi
Path:
src/main/java/hudson/remoting/SocketChannelStream.java
http://jenkins-ci.org/commit/remoting/c90fc463923cc7f4a8907a0b352204f3d561cc55
Log:
[FIXED JENKINS-23248] Seeing strange "Transport endpoint is not connected" exception
s.shutdownInput() fails with the following exception, even though s.isInputShutdown() is reporting false:
java.net.SocketException: Transport endpoint is not connected
at sun.nio.ch.SocketChannelImpl.shutdown(Native Method)
at sun.nio.ch.SocketChannelImpl.shutdownInput(SocketChannelImpl.java:667)
at sun.nio.ch.SocketAdaptor.shutdownInput(SocketAdaptor.java:378)
at hudson.remoting.SocketChannelStream$1.close(SocketChannelStream.java:39)
at sun.nio.ch.ChannelInputStream.close(ChannelInputStream.java:113)
at javax.crypto.CipherInputStream.close(CipherInputStream.java:296)
at java.io.BufferedInputStream.close(BufferedInputStream.java:468)
at hudson.remoting.FlightRecorderInputStream.close(FlightRecorderInputStream.java:112)
at hudson.remoting.ChunkedInputStream.close(ChunkedInputStream.java:102)
at hudson.remoting.ChunkedCommandTransport.closeRead(ChunkedCommandTransport.java:50)
at hudson.remoting.Channel.terminate(Channel.java:795)
at hudson.remoting.Channel$CloseCommand.execute(Channel.java:951)
at hudson.remoting.Channel$2.handle(Channel.java:475)
at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:60)
This bug report may be related: http://bugs.java.com/view_bug.do?bug_id=4516760
If we fail to call s.close(), a socket will leak, so we swallow this exception and have the code execute s.close() anyway.
Compare: https://github.com/jenkinsci/remoting/compare/d1905acf329e...c90fc463923c
Code changed in jenkins
User: Kohsuke Kawaguchi
Path:
changelog.html
pom.xml
http://jenkins-ci.org/commit/jenkins/5b7d59a71aa01d0c430d98d0b6879c3662a6248b
Log:
[FIXED JENKINS-23248]
Integrated remoting.jar that fixes the problem
Integrated in jenkins_main_trunk #3560
[FIXED JENKINS-23248] (Revision 5b7d59a71aa01d0c430d98d0b6879c3662a6248b)
Result = SUCCESS
kohsuke : 5b7d59a71aa01d0c430d98d0b6879c3662a6248b
Files :
- changelog.html
- pom.xml
Code changed in jenkins
User: Kohsuke Kawaguchi
Path:
pom.xml
http://jenkins-ci.org/commit/jenkins/0ae16264db1b07c011e76aee81442033caf5c72a
Log:
[FIXED JENKINS-23248]
Integrated remoting.jar that fixes the problem
(cherry picked from commit 5b7d59a71aa01d0c430d98d0b6879c3662a6248b)
Conflicts:
changelog.html
Is this really fixed?
I am on Jenkins ver. 2.32.3. Tested with a script that runs a dummy job; it seems to leak one handle per invocation.
#!/bin/bash
JAVA_CMD=java
CLI_JAR=/jenkins/jenkins-cli.jar
JOB_NAME=dummyJob
URL=http://testserver:8080/
for i in {1..1000}
do
    "$JAVA_CMD" -jar "$CLI_JAR" -s "$URL" build "$JOB_NAME"
done
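To quantify the leak per call, one can compare the master's descriptor count around a single invocation, roughly like this (a sketch; it must run on the master host and be given the master's PID):
#!/bin/bash
# Usage: leak-check.sh <jenkins_master_pid>
JPID=$1
BEFORE=$(ls /proc/"$JPID"/fd | wc -l)
java -jar /jenkins/jenkins-cli.jar -s http://testserver:8080/ build dummyJob
AFTER=$(ls /proc/"$JPID"/fd | wc -l)
echo "descriptors gained by one CLI call: $((AFTER - BEFORE))"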
jayanmn There's really no point testing this on an outdated version of Jenkins when 2.46.2 contained a major rewrite of the CLI. If you're affected by an issue that looks like this one, start by upgrading to a recent version of Jenkins.
If the problem persists (given the age of this issue and major changes since), please file a new bug.
Thanks, great news.
I posted my case as this was claimed to be addressed long back in 1.563. I will pick up 2.46.2 and test it in the next couple of days.
this was claimed to be addressed long back in 1.563
Right; an indicator you might be seeing a different bug, so please file a new issue if it's still present. It's very confusing to have multiple reports of the "same" issue that have very different causes.
Tested with Jenkins ver. 2.60.2; I will create a new bug as you suggested.
When you look at the ls -la /proc/$PID/fd output, can you tell which files are left open? Can you take a diff between two points in time and give us the delta?