Type: Bug
Resolution: Fixed
Priority: Blocker
Environment: Jenkins ver. 2.66 on Linux (Ubuntu 14.04)
Jenkins seems to keep an open file handle to the log file (job output) for every single build, even those that have been discarded by the "Discard old builds" policy.
This is a sample of the lsof output (the whole file is attached):
java 8870 jenkins 941w REG 252,0 1840 1332171 /data/jenkins/jobs/automation/jobs/emr-termination-policy/builds/.50063/log (deleted)
java 8870 jenkins 942w REG 252,0 2023 402006 /data/jenkins/jobs/automation/jobs/emr-termination-policy/builds/.50044/log (deleted)
java 8870 jenkins 943w REG 252,0 2193 1332217 /data/jenkins/jobs/automation/jobs/emr-termination-policy/builds/50101/log
java 8870 jenkins 944w REG 252,0 2512 1332247 /data/jenkins/jobs/automation/jobs/emr-termination-policy/builds/50106/log
java 8870 jenkins 945w REG 252,0 1840 1703994 /data/jenkins/jobs/automation/jobs/emr-termination-policy/builds/.50067/log (deleted)
java 8870 jenkins 946w REG 252,0 2350 1332230 /data/jenkins/jobs/automation/jobs/emr-termination-policy/builds/.50092/log (deleted)
java 8870 jenkins 947w REG 252,0 1840 402034 /data/jenkins/jobs/automation/jobs/emr-termination-policy/builds/.50049/log (deleted)
java 8870 jenkins 948w REG 252,0 1840 927855 /data/jenkins/jobs/automation/jobs/emr-termination-policy/builds/.50080/log (deleted)
java 8870 jenkins 949w REG 252,0 2195 1332245 /data/jenkins/jobs/automation/jobs/emr-termination-policy/builds/.50095/log (deleted)
java 8870 jenkins 950w REG 252,0 2326 1332249 /data/jenkins/jobs/automation/jobs/emr-termination-policy/builds/50107/log
java 8870 jenkins 952w REG 252,0 2195 1332227 /data/jenkins/jobs/automation/jobs/emr-termination-policy/builds/50102/log
java 8870 jenkins 953w REG 252,0 2154 1332254 /data/jenkins/jobs/automation/jobs/emr-termination-policy/builds/50109/log
java 8870 jenkins 954w REG 252,0 2356 1332282 /data/jenkins/jobs/automation/jobs/emr-termination-policy/builds/50105/log
- duplicates: JENKINS-43199 Credentials Binding plugin causes File Descriptor leak (Resolved)
- is duplicated by: JENKINS-43199 Credentials Binding plugin causes File Descriptor leak (Resolved)
- is related to: JENKINS-48280 File handle leak (Open)
- is related to: JENKINS-45903 transient file handle leak in LargeText.GzipAwareSession.isGzipStream(File) (Resolved)
- relates to: JENKINS-42934 Avoid using new FileInputStream / new FileOutputStream (Closed)
[JENKINS-45057] "too many files open": file handles leak, job output file not closed
Hi jonasatwork, I've tried your test and what I get is 4 new open files rather than the 3 you suggested.
This is the output of the diff between two lsof executions, separated by one run of a job with your code:
> java 19008 jenkins 587r REG 252,0 503 395865 /data/jenkins/jobs/automation/jobs/test-open-files/builds/7/log
> java 19008 jenkins 589r REG 252,0 503 395865 /data/jenkins/jobs/automation/jobs/test-open-files/builds/7/log
> java 19008 jenkins 590r REG 252,0 503 395865 /data/jenkins/jobs/automation/jobs/test-open-files/builds/7/log
> java 19008 jenkins 592r REG 252,0 503 395865 /data/jenkins/jobs/automation/jobs/test-open-files/builds/7/log
Hi bbonacci,
Can you post an example of your emr-termination-policy groovy code?
Please provide a list of installed plugins and a sample configuration file of an affected job.
Hi adamleggo,
the emr-termination-policy is a Freestyle job with a simple (bash) shell script.
So I've been digging and have narrowed down the problem.
It looks like the file handle leaks when the option "Use secret text(s) or file(s)" is active.
Steps to reproduce:
- create a Freestyle project
- add one build step with a shell script running "echo test"
- enable "Use secret text(s) or file(s)"
- save the job
- count the number of open files with lsof -p <pid> | wc -l
- build the job
- count the number of open files with lsof -p <pid> | wc -l
- repeat the last two steps
In my environment, one file handle (the build log) is always leaked.
The "secrets" extension has a feature whereby, if the secrets appear in the log output, they are replaced with "*******".
I guess somewhere in there, the log file isn't closed properly and the file handle leaks.
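For anyone who wants to track the count without shelling out to lsof, here is a minimal sketch for the Jenkins script console; it assumes the master runs on a Unix JVM whose OperatingSystemMXBean is a com.sun.management.UnixOperatingSystemMXBean (true for HotSpot/OpenJDK on Linux):
import java.lang.management.ManagementFactory
import com.sun.management.UnixOperatingSystemMXBean

// Ask the JVM itself how many file descriptors it currently holds.
def os = ManagementFactory.operatingSystemMXBean
if (os instanceof UnixOperatingSystemMXBean) {
    println "open FDs: ${os.openFileDescriptorCount} / max: ${os.maxFileDescriptorCount}"
} else {
    println 'Not a Unix JVM; fall back to lsof -p <pid> | wc -l'
}
Running this once before and once after a build of the job above should show the count creeping up while the leak is present.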
We're also seeing this behavior on our Jenkins master running 2.60.1. Happy to provide any relevant information if I can be of help; I'm just not sure what to gather. To put it in perspective, we're having to restart our master every ~4 days for one of our very busy jobs, with the FD limit already increased to 10k. I believe we are seeing the same thing as Bruno, as we also have secrets bound to these jobs.
We appear to have upgraded the relevant plugin (Credentials Binding Plugin) to 1.12, if that's relevant
We are also seeing this behavior on Jenkins 2.54 with credentials binding plugin 1.12.
Edit: this also seems to be the same issue: https://issues.jenkins-ci.org/browse/JENKINS-43199
Also, there is a discussion around a pull request here: https://github.com/jenkinsci/credentials-binding-plugin/pull/37
There are more and more reports in JENKINS-43199, and the maintainer declines to apply the hotfix in his plugin. So we may have to fix it in the core ("we" === "Jenkins community", feel free to contribute).
Is it possible to downgrade a plugin to resolve the FD leak? Has anyone tried downgrading from Credentials Binding plugin 1.12 to 1.11, and did it help resolve this issue? Thanks! We have to restart our Jenkins once a day.
My team has tried downgrading to 1.11, but it did not help. We're hoping to take our first stab at making a pull request for this sometime this week; we've never made any Jenkins core changes, however, so no promises.
We are also hit hard by this running LTS 2.60.1. I wonder if this Blocker was fixed in 2.60.2? The changelog does not look promising, but how can an LTS version be released with this known issue? We do NOT have the Credentials Binding Plugin installed at all.
shahmishal This is an open-source project; there is no escalation process. The best way to help with this issue is to Participate and Contribute. In the Jenkins community we always encourage it.
P.S.: If you want to do escalations, there are companies offering commercial support.
I am also encountering this issue in our production environment; the master hits 16k open FDs every 5 hours, which requires a restart of the Jenkins service before that limit is reached.
We started seeing the issue after upgrading Jenkins from 2.46.2 LTS to 2.60.2 LTS and SSH Slaves plugin from 1.9 to 1.20. Credentials Binding plugin was NOT upgraded.
If the bug is in the Credentials Binding plugin, why didn't it appear before?
Would be interesting to know whether this started between 2.52 (unaffected) and 2.53 (affected). If so, JENKINS-42934 would be a likely culprit. jonasatwork reported that 2.52 was the first to be affected, I wonder whether that report was off by one.
Currently I have no answer for that question.
What I can say is that the problem is reproducible on 2.60.2 but not on 2.50.
On both versions, if I start a build of a job with the above-mentioned Groovy, the number of file handles used by the Jenkins process increases; on 2.50 it also decreases after a while (apparently not immediately after the job finishes), but on 2.60.2 it only increases and never decreases.
I'll try to find some time to check for 2.52 and 2.53 or ask a colleague to do so.
EDIT: Earlier in this comment I was saying that the problem could be reproduced in 2.52. I was wrong. I accidentally shuffled the war's name and didn't notice the version. I apologize to anyone who took the time to verify this.
I tested this on 2.52 and 2.53 as saretter asked me:
In 2.52 I could not reproduce the problem.
In 2.53 I could reproduce it.
As danielbeck suggested, https://issues.jenkins-ci.org/browse/JENKINS-42934 might be related to this.
bbonacci carlescapdevila it would be great if you could use git bisect to find out even more precisely which commit introduced this. If you're unclear on how to use it, I can write documentation for it. Thanks!
> use git bisect to find out even more precisely which commit introduced this
Or just test https://github.com/jenkinsci/jenkins/commit/bde09f70afaf10d5e1453c257058a56b07556e8e which is assumed to break, and https://github.com/jenkinsci/jenkins/commit/0ddf2d5be77072264845a5f4cf197d91d32e4695 which is assumed to not break, to begin with, and see whether that's the cause.
Tested https://github.com/jenkinsci/jenkins/commit/bde09f70afaf10d5e1453c257058a56b07556e8e and it did indeed break, this one https://github.com/jenkinsci/jenkins/commit/0ddf2d5be77072264845a5f4cf197d91d32e4695 was OK.
By the way, I'm using Windows' "handle -s -p <jenkinsPID>" command to detect the file handles. The https://wiki.jenkins.io/display/JENKINS/File+Leak+Detector+Plugin does not show anything when the builds are over, but handle does show an increment of the files long after the builds are over.
UPDATE: Tested https://github.com/jenkinsci/jenkins/commit/a3ef5b6048d66e59e48455b48623e30c14be8df4 - OK
and then the next https://github.com/jenkinsci/jenkins/commit/f0cd7ae8ff269dd738e3377a62f3fbebebf9aef6 - has the issue, so this commit introduces the leak
carlescapdevila any chance you could try the attached patch against HEAD (probably easy to apply to most versions) and see if that resolves the issue? It seems like there may be some paths where the run's log stream does not get closed correctly.
Tested against HEAD with the patch applied and no luck. Tested too against 2.53 and 2.60.1 (both of them with the patch) and same, the leak doesn't go away. Thank you very much for the effort nevertheless.
EDIT: I'm reproducing the issue according to jonasatwork's comment: https://issues.jenkins-ci.org/browse/JENKINS-45057?focusedCommentId=304877&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-304877 (but in my case I'm using windows so I check the file usage with the "handle" command). So could it be due to some kind of interaction with the Groovy Plugin?
carlescapdevila so one interesting thing is that most of the file handles look ok except for the emr-termination-policy files.
There are 406 file handles open of the type:
java 8870 jenkins 991w REG 252,0 2194 1332198 /data/jenkins/jobs/automation/jobs/emr-termination-policy/builds/.50086/log (deleted)
So these are file handles open on a file that appears to be deleted!
406 of them to be precise:
$ grep "(deleted)" filesopen.txt | wc -l
406
And all but two of them are emr-termination-policy
$ grep "(deleted)" filesopen.txt | grep emr-termination-policy | wc -l 404 $ grep "(deleted)" filesopen.txt | grep optimus | wc -l 2
When we look at the file handles, these are WRITE file handles, so the file handle has to be opened inside Run.execute()
Just to confirm, these two jobs are Freestyle jobs and not Pipeline jobs?
Hmmm... digging some more, the Run.delete() method will rename the build directory from /XXX/ to /.XXX, so this looks very much like delete() is not waiting for the running job to complete.
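As an illustration of why lsof shows those handles as "(deleted)" (this is not Jenkins code, and the temp directory below is a hypothetical stand-in for builds/NNN): on Linux, a write handle keeps working even after the file's directory entry is renamed or removed, and the inode is only released once the handle is finally closed.
import java.nio.file.Files

// Create a throwaway "build" directory with a log file and hold a write handle on it,
// roughly what Run.execute() does for the build log.
def buildDir = Files.createTempDirectory('build-').toFile()
def logFile = new File(buildDir, 'log')
def out = new FileOutputStream(logFile)
out.write('build output\n'.bytes)

// Simulate the discard policy removing the build: unlink the file and the directory.
logFile.delete()
buildDir.delete()

// The handle is still open and writable; lsof now reports this inode as "(deleted)"
// and keeps doing so until close() is finally called (or the JVM exits).
out.write('still writing to a deleted inode\n'.bytes)
out.close()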
Hmmm I wonder if emr-termination-policy has:
- a short log rotation period,
- runs almost continually
- uses a Notifier that perhaps should be a Publisher?
What Post-build actions do you have configured carlescapdevila?
bbonacci sorry just realized that you are the one with the emr-termination-policy job
So https://github.com/jenkinsci/jenkins/pull/2953 should fix the Credentials Binding plugin issue in any case... though I think it should be part of the contract of plugins annotating the console that they pass the close through. So in a sense it is a defensive core change, and jglick should just merge https://github.com/jenkinsci/credentials-binding-plugin/pull/37
Hi, finally back after some holiday
Yes, the job is a Freestyle job, just as I wrote. I took a quick look at the changes that were done in JENKINS-42934 and noticed that in a few places the close() calls on created files have been removed completely; earlier they lived in finally{} blocks in the code. The reason for the change is https://bugs.openjdk.java.net/browse/JDK-8080225. I don't read that as meaning the calls to close() should be removed just because of this change.
jonasatwork any pointers to the cases where you believe a handle is escaping?
jonasatwork keep in mind that we moved from
InputStream is = ...;
try {
    ...
} finally {
    is.close();
}
to try-with-resources:
try (InputStream is = ...) {
...
}
So expect those close calls to be handled by try-with-resources
Ok, so the Groovy leak appears to be an issue with Stapler!!!
Groovy is querying the properties and discovers the `getLogText()` method, which results in Stapler opening a read handle using a FileInputStream... which is then left pending finalization, at which point the file handle will eventually be released...
In other words, this is a replica of JENKINS-42934, only against Stapler... it is https://github.com/stapler/stapler/blob/3ac71dce264da052186956ef06b772a91ca74d5e/core/src/main/java/org/kohsuke/stapler/framework/io/LargeText.java#L457-L467 that is responsible for the leak!!!
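To make the pattern concrete (an illustration only, not the Stapler code, and the file path is hypothetical): a stream that is opened and then simply dropped stays open until the garbage collector finalizes it, whereas try-with-resources or Groovy's withInputStream releases the handle deterministically.
// Leaky pattern: nothing closes the stream, so the descriptor lingers
// until the FileInputStream object happens to be finalized by the GC.
def leaky = new FileInputStream('/tmp/some-build/log')
leaky.read()

// Deterministic pattern: the handle is closed as soon as the closure exits,
// whether it completes normally or throws.
new File('/tmp/some-build/log').withInputStream { stream ->
    stream.read()
}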
Well I suppose the workaround is to use the less obtuse
def jobName = build.parent.builds[0].envVars.JOB_NAME
If you are using a sandboxed script, well DefaultGroovyMethods.getProperties(Object) is already blacklisted so you could not make this mistake to begin with.
Looks like we were hit by a different side-effect of the identified change. Every OutputStream returned from an instance of hudson.console.ConsoleLogFilter must close the wrapped OutputStream when close() is called. It could be that this was expected before as well; now, with core 2.53 and later, it leads to a leak of file handles if you miss this.
This change fixed the issue for our plugin: https://github.com/SoftwareBuildService/log-file-filter-plugin/commit/c1148435a454aa5a3a72bab05c3a6996ea5f42f5
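Sketching the pattern described above as a generic stream decorator (illustrative names, not the plugin's actual code): whatever filtering the returned stream does, its close() must reach the wrapped build-log stream, otherwise the handle the core opened for the log file stays open forever.
// A filtering stream of the kind a ConsoleLogFilter implementation would return.
class MaskingOutputStream extends OutputStream {
    private final OutputStream delegate

    MaskingOutputStream(OutputStream delegate) { this.delegate = delegate }

    @Override
    void write(int b) throws IOException {
        // A real filter would buffer lines here and mask secrets before forwarding.
        delegate.write(b)
    }

    @Override
    void flush() throws IOException {
        delegate.flush()
    }

    @Override
    void close() throws IOException {
        // The crucial part: propagate close() to the wrapped log stream.
        delegate.close()
    }
}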
Should be fixed, but needs verification if there is a reproducible test case.
It didn't get into 2.60.3 since it was fixed/integrated too late. It will be a candidate for the next baseline
As jglick says, the patch in credentials binding 1.13 has been released, so the partial fix can be applied via the plugin update.
Above is the change in file handle usage after upgrading to CloudBees Jenkins Enterprise 2.60.2.2-rolling. Our workaround until the core version is released is to set ulimit -n very large and reboot at least weekly.
If a better interim solution is known, we'd love to hear it.
stevenatcisco what causes it? If it is the Credentials Binding plugin, you can just update it (see the linked issues). Jenkins core just provides a generic fix for all cases, but plugins can be patched on their own without a need to bump the core. You can use http://file-leak-detector.kohsuke.org/ to triage the root cause.
In Jenkins the patch will be available in 2.73.1 LTS. Regarding CloudBees Jenkins Enterprise, please contact the vendor's support
I suppose lts-candidate can be removed given that this is already in 2.73.
oleg_nenashev the File Leak Detector plugin (better than the linked standalone tool) would not be helpful here since we already know well where the file handle is opened, when the build starts. The issue is why it is not closed, which will depend on which console-affecting plugins are activated during the build.
oleg_nenashev: We tried using the File Leak Detector Plugin... it would not run; apparently it requires Oracle Java, and we are using OpenJDK. The kohsuke leak detector, when run, crashed our Jenkins instance. It too seems to require Oracle Java.
Here is the job we are running hourly, and the results
/* JOB TO PERIODICALLY CHECK FILE HANDLES */
node('master') {
    sh '''rm -f lsof.txt
          lsof -u jenkins > lsof.txt
          cut -f 1 /proc/sys/fs/file-nr > filehandles.txt
          echo "$(cat filehandles.txt)=handles |" > numfiles.txt
          echo "$(wc -l < lsof.txt)=JenkLSOF |" >> numfiles.txt
          echo "$(grep -Fc \'(deleted)\' lsof.txt)=deleted " >> numfiles.txt
          cat numfiles.txt
       '''
    archiveArtifacts allowEmptyArchive: true, artifacts: '*.txt', caseSensitive: false
    result = readFile 'numfiles.txt'
    currentBuild.description = result
    fileHandlesInUse = readFile 'filehandles.txt'
    deleteDir()
} // node
/******* RESULTS *******/
Aug 30, 2017 6:56 AM   9472=handles | 10554=JenkLSOF | 3621=deleted
Aug 30, 2017 5:56 AM   9568=handles | 10654=JenkLSOF | 3557=deleted
Aug 30, 2017 4:56 AM   9376=handles | 10521=JenkLSOF | 3524=deleted
Aug 30, 2017 3:56 AM   9312=handles | 10417=JenkLSOF | 3462=deleted
Aug 30, 2017 2:56 AM   9216=handles | 10358=JenkLSOF | 3401=deleted
Aug 30, 2017 1:56 AM   9184=handles | 10276=JenkLSOF | 3338=deleted
Aug 30, 2017 12:56 AM  9312=handles | 10406=JenkLSOF | 3303=deleted
Aug 29, 2017 11:56 PM  9216=handles | 10338=JenkLSOF | 3236=deleted
Aug 29, 2017 10:56 PM  9408=handles | 10423=JenkLSOF | 3198=deleted
Aug 29, 2017 9:56 PM   8896=handles | 10042=JenkLSOF | 3137=deleted
Aug 29, 2017 8:56 PM   9024=handles | 10138=JenkLSOF | 3098=deleted
Aug 29, 2017 7:56 PM   9024=handles | 10243=JenkLSOF | 3028=deleted
Aug 29, 2017 6:56 PM   8896=handles | 9948=JenkLSOF  | 2981=deleted
Aug 29, 2017 5:56 PM   8768=handles | 9879=JenkLSOF  | 2913=deleted
Aug 29, 2017 4:56 PM   8832=handles | 9879=JenkLSOF  | 2844=deleted
Aug 29, 2017 3:56 PM   8608=handles | 9731=JenkLSOF  | 2773=deleted
Aug 29, 2017 2:56 PM   8448=handles | 9587=JenkLSOF  | 2741=deleted
Aug 29, 2017 1:56 PM   8384=handles | 9556=JenkLSOF  | 2681=deleted
Aug 29, 2017 12:56 PM  8192=handles | 9452=JenkLSOF  | 2650=deleted
Aug 29, 2017 11:56 AM  8096=handles | 9306=JenkLSOF  | 2590=deleted
Aug 29, 2017 1:56 AM   8064=handles | 8921=JenkLSOF  | 2081=deleted
The "deleted" items are all log entries like those described in the original incident.
NOTE: I have opened an incident under our support contract, but have posted details here in case they may help to diagnose the root cause. Is there another tool we can use? Or would the LSOF output over many hours be sufficient?
Here is confirmation that the upgrade resolved the leak... mostly.
We notice that in the last 48 hours there have been 6 file handle leaks. That would have been hundreds previously.
Even 6 leaks is quite suspicious, but I'd guess we cannot do anything about it without the File Leak Detector.
oleg_nenashev After upgrade to Jenkins 2.73.3 the issue became less severe but still we have to restart our Jenkins instance once a week (for 2.60 it was once a day).
Here's the summary of 2 lsof runs with 1 day between them. The list of top files:
Nov-17:
100632 slave.log
32294 log
7685 timestamps
4193 random
3635 urandom
Nov-18:
708532 log
297707 timestamps
98280 slave.log
90675 Common.groovy
85995 BobHelper.groovy
Does this give you more information to find the cause? Unfortunately it's a bit hard for me to provide the File Leak Detector plugin output because we use OpenJDK.
I have found a solution for the code Jonas provided. I am not sure if it fixes the problem for Bruno, since no Groovy example has been provided.
Problem code:
import hudson.model.*
def thr = Thread.currentThread()
def build = thr?.executable
def jobName = build.parent.builds[0].properties.get("envVars").get("JOB_NAME")
def jobNr = build.parent.builds[0].properties.get("envVars").get("BUILD_NUMBER")
println "This is " + jobName + " running for the $jobNr:th time"
Fixed code:
import hudson.model.*
def jobName = build.environment.get("JOB_NAME")
def jobNr = build.environment.get("BUILD_NUMBER")
println "This is " + jobName + " running for the $jobNr:th time"
No open files found after the fixed job is run.
The build object is already available for the script to use, so getting it from the currentThread causes a problem. Not sure why.