
"too many files open": file handles leak, job output file not closed

      Jenkins seems to keep an open file handle to the log file (job output) for every single build, even those that have been discarded by the "Discard old builds" policy.

       

      This is a sample of the lsof output (the whole file is attached):

      java 8870 jenkins 941w REG 252,0 1840 1332171 /data/jenkins/jobs/automation/jobs/emr-termination-policy/builds/.50063/log (deleted)
      java 8870 jenkins 942w REG 252,0 2023 402006 /data/jenkins/jobs/automation/jobs/emr-termination-policy/builds/.50044/log (deleted)
      java 8870 jenkins 943w REG 252,0 2193 1332217 /data/jenkins/jobs/automation/jobs/emr-termination-policy/builds/50101/log
      java 8870 jenkins 944w REG 252,0 2512 1332247 /data/jenkins/jobs/automation/jobs/emr-termination-policy/builds/50106/log
      java 8870 jenkins 945w REG 252,0 1840 1703994 /data/jenkins/jobs/automation/jobs/emr-termination-policy/builds/.50067/log (deleted)
      java 8870 jenkins 946w REG 252,0 2350 1332230 /data/jenkins/jobs/automation/jobs/emr-termination-policy/builds/.50092/log (deleted)
      java 8870 jenkins 947w REG 252,0 1840 402034 /data/jenkins/jobs/automation/jobs/emr-termination-policy/builds/.50049/log (deleted)
      java 8870 jenkins 948w REG 252,0 1840 927855 /data/jenkins/jobs/automation/jobs/emr-termination-policy/builds/.50080/log (deleted)
      java 8870 jenkins 949w REG 252,0 2195 1332245 /data/jenkins/jobs/automation/jobs/emr-termination-policy/builds/.50095/log (deleted)
      java 8870 jenkins 950w REG 252,0 2326 1332249 /data/jenkins/jobs/automation/jobs/emr-termination-policy/builds/50107/log
      java 8870 jenkins 952w REG 252,0 2195 1332227 /data/jenkins/jobs/automation/jobs/emr-termination-policy/builds/50102/log
      java 8870 jenkins 953w REG 252,0 2154 1332254 /data/jenkins/jobs/automation/jobs/emr-termination-policy/builds/50109/log
      java 8870 jenkins 954w REG 252,0 2356 1332282 /data/jenkins/jobs/automation/jobs/emr-termination-policy/builds/50105/log
      

       

          [JENKINS-45057] "too many files open": file handles leak, job output file not closed

          Mike Delaney added a comment -

          I'm seeing this as well on Jenkins 2.60.1 LTS on Ubuntu 14.04. It did not occur when using Jenkins 2.46.2 LTS.


          Oleg Nenashev added a comment -

          It may possibly happen if Groovy overrides the build log appenders via log decorators, but I am not sure why the Groovy plugin would need that.


          Jonas Jonsson added a comment -

          From a colleague: "The problem doesn't exist in Jenkins 2.51 (and Groovy 2.0). Jenkins 2.52 has the problem."

          The colleague's full message: "Hi Jonas, I didn't have a problem with Jenkins 2.51 and Groovy 2.0, but the problem occurred with Jenkins 2.52 and Groovy 2.0. I will downgrade Groovy to a previous version and try these two versions of Jenkins to work out the differences. Regards"


          Daniel Beck added a comment -

          Notably, there's absolutely nothing of interest in 2.52: just a major overhaul of the German localization, other localization fixes, removal of the most incomplete localizations, and this one change in the actual code:

          https://github.com/jenkinsci/jenkins/compare/jenkins-2.51...jenkins-2.52#diff-9fafdcd0712c5a5dab3acb4ea168515aR272

          So this seems to be unrelated to core.


          Adam Leggo added a comment -

          I have found a solution for the code Jonas provided. I am not sure if it fixes the problem for Bruno, since no Groovy example has been provided.

          Problem code:

          import hudson.model.*

          def thr = Thread.currentThread()
          def build = thr?.executable
          def jobName = build.parent.builds[0].properties.get("envVars").get("JOB_NAME")
          def jobNr = build.parent.builds[0].properties.get("envVars").get("BUILD_NUMBER")
          println "This is " + jobName + " running for the $jobNr:th time"

           

          Fixed code:

          import hudson.model.*

          def jobName = build.environment.get("JOB_NAME")
          def jobNr = build.environment.get("BUILD_NUMBER")
          println "This is " + jobName + " running for the $jobNr:th time"

           

          No open files found after the fixed job is run.

          The build object is already available for the script to use, so getting it from the currentThread causes a problem. Not sure why.


          Bruno Bonacci added a comment -

          Hi jonasatwork, I've tried your test, and what I get is 4 new open files rather than the 3 you suggested.

           

          This is the output of a diff between two lsof executions, with one run of a job using your code in between:

          > java 19008 jenkins 587r REG 252,0 503 395865 /data/jenkins/jobs/automation/jobs/test-open-files/builds/7/log
          > java 19008 jenkins 589r REG 252,0 503 395865 /data/jenkins/jobs/automation/jobs/test-open-files/builds/7/log
          > java 19008 jenkins 590r REG 252,0 503 395865 /data/jenkins/jobs/automation/jobs/test-open-files/builds/7/log
          > java 19008 jenkins 592r REG 252,0 503 395865 /data/jenkins/jobs/automation/jobs/test-open-files/builds/7/log
          


          Adam Leggo added a comment -

          Hi bbonacci,

          Can you post an example of your emr-termination-policy groovy code?
          Please provide a list of installed plugins and a sample configuration file of an affected job.


          Bruno Bonacci added a comment - - edited

          Hi adamleggo,
          the emr-termination-policy job is a Freestyle job with a simple (bash) shell script.
          I've been digging and have narrowed down the problem.
          It looks like the file handle leaks when the option Use secret text(s) or file(s) is active.

          Steps to reproduce:

          1. create a freestyle project
          2. add one build step with a shell script running "echo test"
          3. enable Use secret text(s) or file(s)
          4. save the job
          5. count the number of open files with lsof -p <pid> | wc -l
          6. build the job
          7. count the number of open files again with lsof -p <pid> | wc -l
          8. repeat the last two steps.

          In my environment 1 file (the build log) handle is always leaked.
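
          As an aside, a small system Groovy script can narrow the count down to build-log handles instead of all descriptors. This is only an illustrative sketch, assuming a Linux master with lsof installed; the /builds/.../log filter is just an example pattern:

          // Count open handles on build log files held by the Jenkins JVM.
          // Assumes Linux (/proc/self resolves to the JVM's PID) and lsof on the PATH.
          def pid = new File('/proc/self').canonicalFile.name
          def lsofOutput = ['lsof', '-p', pid].execute().text
          def logHandles = lsofOutput.readLines().findAll { it.contains('/builds/') && it.contains('/log') }
          def deleted = logHandles.count { it.contains('(deleted)') }
          println "Open build log handles: ${logHandles.size()} (${deleted} already deleted)"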


          Bruno Bonacci added a comment - - edited

          The "secrets" extension has a feature whereby, if the secrets appear in the log output, they are replaced with "*******".
          I guess somewhere in there the log file isn't closed properly and the file handle leaks.


          Mike Delaney added a comment -

          I see this behavior without using the "secrets" extension.


          Abhishek Mukherjee added a comment - - edited

          We're also seeing this behavior on our Jenkins master running 2.60.1. Happy to provide any relevant information if I can be of help; I'm just not sure what to gather. To put this in perspective, we're having to restart our master every ~4 days for one of our very busy jobs, with the FD limit already increased to 10k. I believe we are seeing the same thing as Bruno, as we also have secrets bound to these jobs.


          Abhishek Mukherjee added a comment -

          We appear to have upgraded the relevant plugin (Credentials Binding Plugin) to 1.12, if that's relevant.

          kyle evans added a comment - - edited

          We are also seeing this behavior on Jenkins 2.54 with credentials binding plugin 1.12.

           

          Edit: this also seems to be the same issue: https://issues.jenkins-ci.org/browse/JENKINS-43199

          Also, there is a discussion around a pull request here: https://github.com/jenkinsci/credentials-binding-plugin/pull/37


          Oleg Nenashev added a comment -

          There are more and more reports in JENKINS-43199, and the maintainer declines to apply the hotfix in his plugin. So we may have to fix it in the core ("we" === "Jenkins community"; feel free to contribute).


          mishal shah added a comment -

          Is it possible to downgrade a plugin to resolve the FD leak? Has anyone tried downgrading from Credentials Binding plugin 1.12 to 1.11, and did it help resolve this issue? Thanks! We have to restart our Jenkins once a day.


          Abhishek Mukherjee added a comment -

          My team has tried downgrading to 1.11, but it did not help. We're hoping to take our first stab at a pull request for this sometime this week; we've never made any Jenkins core changes, however, so no promises.

          mishal shah added a comment -

          abhishekmukherg Good Luck! Looking forward to your fix and Thanks! 


          Andreas Mandel added a comment -

          We are also hit hard by this running LTS 2.60.1. I wonder if this Blocker was fixed in 2.60.2? The changelog does not look promising, but how can an LTS version be released with this known issue? We do NOT have the Credentials Binding Plugin installed at all.

          mishal shah added a comment - - edited

          bbonacci, what is the process for getting this issue escalated so it is fixed soon?


          Oleg Nenashev added a comment -

          shahmishal This is an open-source project; there is no escalation process. The best way to help with this issue is to participate and contribute. In the Jenkins community we always encourage that.

          P.S.: If you want escalations, there are companies offering commercial support.

           


          mishal shah added a comment -

          oleg_nenashev Thanks!


          Alex Raddas added a comment -

          I am also encountering this issue in our production environment; the master hits 16k open FDs every 5 hours, which forces a restart of the Jenkins service before that limit is reached.


          Volodymyr Sobotovych added a comment -

          We started seeing the issue after upgrading Jenkins from 2.46.2 LTS to 2.60.2 LTS and the SSH Slaves plugin from 1.9 to 1.20. The Credentials Binding plugin was NOT upgraded.

          If the bug is in the Credentials Binding plugin, why didn't it appear before?

          Daniel Beck added a comment -

          Would be interesting to know whether this started between 2.52 (unaffected) and 2.53 (affected). If so, JENKINS-42934 would be a likely culprit. jonasatwork reported that 2.52 was the first to be affected; I wonder whether that report was off by one.


          Sascha Retter added a comment -

          Currently I have no answer to that question.

          What I can say is that the problem is reproducible on 2.60.2 but not on 2.50.

          On both versions, if I start a build of a job with the above-mentioned Groovy, the number of file handles used by the Jenkins process increases; on 2.50 it also decreases after a while (apparently not immediately after the job finishes), but on 2.60.2 it only increases and never decreases.

          I'll try to find some time to check for 2.52 and 2.53 or ask a colleague to do so.


          Carles Capdevila added a comment - - edited

          EDIT: Earlier in this comment I was saying that the problem could be reproduced in 2.52. I was wrong. I accidentally shuffled the war's name and didn't notice the version. I apologize to anyone who took the time to verify this.

           

          I tested this on 2.52 and 2.53 as saretter asked me:

          In 2.52 I could not reproduce the problem.

          In 2.53 I could reproduce it.

          As danielbeck suggested, https://issues.jenkins-ci.org/browse/JENKINS-42934 might be related to this.


          Baptiste Mathus added a comment -

          bbonacci carlescapdevila it would be great if you could use git bisect to find out even more precisely which commit introduced this. If you're unclear on how to use it, I can write some documentation for it. Thanks!

          Daniel Beck added a comment -

          "use git bisect to find out even more precisely which commit introduced this"

          Or just test https://github.com/jenkinsci/jenkins/commit/bde09f70afaf10d5e1453c257058a56b07556e8e which is assumed to break, and https://github.com/jenkinsci/jenkins/commit/0ddf2d5be77072264845a5f4cf197d91d32e4695 which is assumed to not break, to begin with, and see whether that's the cause.


          Carles Capdevila added a comment - - edited

          Tested https://github.com/jenkinsci/jenkins/commit/bde09f70afaf10d5e1453c257058a56b07556e8e and it did indeed break; this one, https://github.com/jenkinsci/jenkins/commit/0ddf2d5be77072264845a5f4cf197d91d32e4695, was OK.

           

          By the way, I'm using Windows' "handle -s -p <jenkinsPID>" command to detect the file handles. The https://wiki.jenkins.io/display/JENKINS/File+Leak+Detector+Plugin does not show anything once the builds are over, but handle does show an increase in open files long after the builds are over.

           

          UPDATE: Tested https://github.com/jenkinsci/jenkins/commit/a3ef5b6048d66e59e48455b48623e30c14be8df4 - OK.

          The next one, https://github.com/jenkinsci/jenkins/commit/f0cd7ae8ff269dd738e3377a62f3fbebebf9aef6, has the issue, so this commit introduces the leak.


          Daniel Beck added a comment -

          stephenconnolly PTAL

          Stephen Connolly added a comment -

          carlescapdevila any chance you could try the attached patch against HEAD (probably easy to apply to most versions) and see if that resolves the issue? It seems like there may be some paths where the run's log stream does not get closed correctly.

          jenkins-45057.patch

          Carles Capdevila added a comment - - edited

          Tested against HEAD with the patch applied and no luck. Also tested against 2.53 and 2.60.1 (both of them with the patch), and it's the same: the leak doesn't go away. Thank you very much for the effort nevertheless.

          EDIT: I'm reproducing the issue following jonasatwork's comment: https://issues.jenkins-ci.org/browse/JENKINS-45057?focusedCommentId=304877&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-304877 (but in my case I'm on Windows, so I check the file usage with the "handle" command). So could it be due to some kind of interaction with the Groovy Plugin?


          Stephen Connolly added a comment -

          carlescapdevila so one interesting thing is that most of the file handles look OK except for the emr-termination-policy files.

          There are 406 file handles open of the type:

          java    8870 jenkins  991w   REG              252,0        2194 1332198 /data/jenkins/jobs/automation/jobs/emr-termination-policy/builds/.50086/log (deleted)

          So these are file handles open on a file that appears to be deleted!

          406 of them to be precise:

          $ grep "(deleted)" filesopen.txt | wc -l
               406

          And all but two of them are emr-termination-policy 

          $ grep "(deleted)" filesopen.txt | grep emr-termination-policy | wc -l
               404
          $ grep "(deleted)" filesopen.txt | grep optimus | wc -l
                 2

          When we look at the file handles, these are WRITE file handles, so the file handle has to be opened inside Run.execute()

          Just to confirm, these two jobs are Freestyle jobs and not Pipeline jobs?

           


          Stephen Connolly added a comment -

          Hmmm... digging some more, the Run.delete() method does a rename of the build directory from /XXX/ to /.XXX, so this looks very much like delete() is not waiting for the running job to complete.


          Stephen Connolly added a comment -

          Hmmm, I wonder if emr-termination-policy has:

          • a short log rotation period,
          • runs almost continually
          • uses a Notifier that perhaps should be a Publisher?

          What Post-build actions do you have configured carlescapdevila?


          Stephen Connolly added a comment -

          bbonacci sorry, I just realized that you are the one with the emr-termination-policy job.


          Stephen Connolly added a comment -

          So https://github.com/jenkinsci/jenkins/pull/2953 should fix the Credentials Binding plugin issue in any case... though I think it should be part of the contract of plugins annotating the console that they pass the close through, so in a sense it is a defensive core change and jglick should just merge https://github.com/jenkinsci/credentials-binding-plugin/pull/37


          Jonas Jonsson added a comment -

          Hi, finally back after some holiday

          Yes, the job is a freestyle job, just as I wrote. I took a quick look at the changes that were done in JENKINS-42934 and noticed that in a few places the close() calls on created files have been removed completely; earlier they lived in finally{} blocks in the code. The reason for the change is https://bugs.openjdk.java.net/browse/JDK-8080225. I don't interpret that as meaning the calls to close() should be removed just because of this change.


          Stephen Connolly added a comment -

          jonasatwork any pointers to the cases where you believe a handle is escaping?


          Stephen Connolly added a comment -

          jonasatwork keep in mind that we moved from

          InputStream is = ...;
          try {
            ...
          } finally {
            is.close();
          }

          to try-with-resources:

          try (InputStream is = ...) {
            ...
          }

          So expect those close calls to be handled by try-with-resources
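
          As a side note for Groovy scripts: the closest Groovy analogue, assuming the Groovy runtime bundled with Jenkins 2.x, is withCloseable, which closes the stream even if the closure throws. The file path below is just a placeholder.

          // Groovy analogue of Java's try-with-resources
          new FileInputStream('/tmp/example.txt').withCloseable { is ->
              // ... use the stream; it is closed automatically when the closure exits
          }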

           


          Jonas Jonsson added a comment -

          Sorry, I'm not really up to date with Java...


          Stephen Connolly added a comment -

          OK, so the Groovy leak appears to be an issue with Stapler!

          Groovy is querying the properties and discovers the getLogText() method, which results in Stapler opening a read handle using a FileInputStream... which is then left pending finalization, at which point the file handle will eventually be released...

          IOW this is a replica of JENKINS-42934, only against Stapler... it is https://github.com/stapler/stapler/blob/3ac71dce264da052186956ef06b772a91ca74d5e/core/src/main/java/org/kohsuke/stapler/framework/io/LargeText.java#L457-L467 that is responsible for the leak!


          Jesse Glick added a comment -

          Well I suppose the workaround is to use the less obtuse

          def jobName = build.parent.builds[0].envVars.JOB_NAME

          If you are using a sandboxed script, well DefaultGroovyMethods.getProperties(Object) is already blacklisted so you could not make this mistake to begin with.


          Andreas Mandel added a comment -

          Looks like we have been hit by a different side effect of the identified change. Any OutputStream returned from an instance of hudson.console.ConsoleLogFilter must close the wrapped OutputStream when close() is called. Perhaps this was already expected before, but with core 2.53 and later it leads to a leak of file handles if you miss it.

          This change fixed the issue for our plugin: https://github.com/SoftwareBuildService/log-file-filter-plugin/commit/c1148435a454aa5a3a72bab05c3a6996ea5f42f5
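
          For illustration, a minimal sketch of such a filter (made-up class name, not the actual log-file-filter-plugin change) whose decorated stream propagates close() to the wrapped build log:

          import hudson.console.ConsoleLogFilter
          import hudson.console.LineTransformationOutputStream
          import hudson.model.AbstractBuild

          class ExampleMaskingFilter extends ConsoleLogFilter {
              @Override
              OutputStream decorateLogger(AbstractBuild build, OutputStream logger) {
                  return new LineTransformationOutputStream() {
                      @Override
                      protected void eol(byte[] b, int len) throws IOException {
                          logger.write(b, 0, len)  // a real filter would rewrite the line here
                      }
                      @Override
                      void close() throws IOException {
                          super.close()            // flush any partially written line
                          logger.close()           // close the wrapped build log stream as well
                      }
                  }
              }
          }

          Without the logger.close() call the decorated stream gets closed at the end of the build but the underlying log file stays open, which matches the leak described above.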


          Oleg Nenashev added a comment - - edited

          It should be solved by https://github.com/jenkinsci/jenkins/pull/2954 in 2.73.


          Jesse Glick added a comment -

          Should be fixed, but needs verification if there is a reproducible test case.


          Oleg Nenashev added a comment -

          It didn't get into 2.60.3 since it was fixed/integrated too late. It will be a candidate for the next baseline.


          Oleg Nenashev added a comment -

          As jglick says, the patch in Credentials Binding 1.13 has been released, so the partial fix can be applied via the plugin update.

           


          Steven Christenson added a comment -

          Above is the change in file handle usage after upgrading to CloudBees Jenkins Enterprise 2.60.2.2-rolling. Our workaround until the core fix is released is to set ulimit -n very large and reboot at least weekly.

          If a better interim solution is known, we'd love to hear it.


          Oleg Nenashev added a comment - - edited

          stevenatcisco what is causing it? If it is the Credentials Binding plugin, you can just update it (see the linked issues). Jenkins core just provides a generic fix for all cases, but plugins can be patched on their own without needing to bump the core. You can use http://file-leak-detector.kohsuke.org/ to triage the root cause.

          In Jenkins the patch will be available in 2.73.1 LTS. Regarding CloudBees Jenkins Enterprise, please contact the vendor's support


          Jesse Glick added a comment -

          I suppose lts-candidate can be removed given that this is already in 2.73.

          oleg_nenashev the File Leak Detector plugin (better than the linked standalone tool) would not be helpful here, since we already know where the file handle is opened: when the build starts. The issue is why it is not closed, which will depend on which console-affecting plugins are activated during the build.


          Daniel Beck added a comment -

          Right, the Stapler one is tracked in JENKINS-45903.


          Steven Christenson added a comment -

          oleg_nenashev: We tried using the File Leak Detector Plugin... it would not run; apparently it requires Oracle Java and we are using OpenJDK. The kohsuke leak detector crashed our Jenkins instance when run; it too seems to require Oracle Java.

          Here is the job we are running hourly, and the results:

          /* JOB TO PERIODICALLY CHECK FILE HANDLES */
          node('master') {
              sh '''rm -f lsof.txt
              lsof -u jenkins > lsof.txt
              cut -f 1 /proc/sys/fs/file-nr > filehandles.txt
              echo "$(cat filehandles.txt)=handles |" > numfiles.txt
              echo "$(wc -l < lsof.txt)=JenkLSOF |" >> numfiles.txt
              echo "$(grep -Fc \'(deleted)\' lsof.txt)=deleted " >> numfiles.txt
              cat numfiles.txt
              '''
              archiveArtifacts allowEmptyArchive: true, artifacts: '*.txt', caseSensitive: false
              result = readFile 'numfiles.txt'
              currentBuild.description = result
              fileHandlesInUse = readFile 'filehandles.txt'
              deleteDir()
          } // node

          /******* RESULTS *******/
          Aug 30, 2017 6:56 AM   9472=handles | 10554=JenkLSOF | 3621=deleted
          Aug 30, 2017 5:56 AM   9568=handles | 10654=JenkLSOF | 3557=deleted
          Aug 30, 2017 4:56 AM   9376=handles | 10521=JenkLSOF | 3524=deleted
          Aug 30, 2017 3:56 AM   9312=handles | 10417=JenkLSOF | 3462=deleted
          Aug 30, 2017 2:56 AM   9216=handles | 10358=JenkLSOF | 3401=deleted
          Aug 30, 2017 1:56 AM   9184=handles | 10276=JenkLSOF | 3338=deleted
          Aug 30, 2017 12:56 AM  9312=handles | 10406=JenkLSOF | 3303=deleted
          Aug 29, 2017 11:56 PM  9216=handles | 10338=JenkLSOF | 3236=deleted
          Aug 29, 2017 10:56 PM  9408=handles | 10423=JenkLSOF | 3198=deleted
          Aug 29, 2017 9:56 PM   8896=handles | 10042=JenkLSOF | 3137=deleted
          Aug 29, 2017 8:56 PM   9024=handles | 10138=JenkLSOF | 3098=deleted
          Aug 29, 2017 7:56 PM   9024=handles | 10243=JenkLSOF | 3028=deleted
          Aug 29, 2017 6:56 PM   8896=handles | 9948=JenkLSOF | 2981=deleted
          Aug 29, 2017 5:56 PM   8768=handles | 9879=JenkLSOF | 2913=deleted
          Aug 29, 2017 4:56 PM   8832=handles | 9879=JenkLSOF | 2844=deleted
          Aug 29, 2017 3:56 PM   8608=handles | 9731=JenkLSOF | 2773=deleted
          Aug 29, 2017 2:56 PM   8448=handles | 9587=JenkLSOF | 2741=deleted
          Aug 29, 2017 1:56 PM   8384=handles | 9556=JenkLSOF | 2681=deleted
          Aug 29, 2017 12:56 PM  8192=handles | 9452=JenkLSOF | 2650=deleted
          Aug 29, 2017 11:56 AM  8096=handles | 9306=JenkLSOF | 2590=deleted
          Aug 29, 2017 1:56 AM   8064=handles | 8921=JenkLSOF | 2081=deleted

          The "deleted" items are all log entries like those described in the original incident. 

          NOTE: I have opened an incident under our support contract, but have posted details here in case they may help to diagnose the root cause.  Is there another tool we can use?  Or would the LSOF output over many hours be sufficient?


          Steven Christenson added a comment -

          Here is confirmation that the upgrade resolved the leak... mostly.

          We notice that in the last 48 hours there have been 6 file handle leaks; previously that would have been hundreds.


          Oleg Nenashev added a comment -

          Even 6 leaks is quite suspicious, but I'd guess we cannot do anything about it without the File Leak Detector.


          Volodymyr Sobotovych added a comment -

          oleg_nenashev After upgrading to Jenkins 2.73.3 the issue became less severe, but we still have to restart our Jenkins instance once a week (on 2.60 it was once a day).

          Here's a summary of two lsof runs taken one day apart. The top files:

          Nov-17:

          100632 slave.log
          32294 log
          7685 timestamps
          4193 random
          3635 urandom

          Nov-18:

          708532 log
          297707 timestamps
          98280 slave.log
          90675 Common.groovy
          85995 BobHelper.groovy
          

          Does this give you more information to find the cause? Unfortunately it's a bit hard for me to provide the File Leak Detector plugin output because we use OpenJDK.


            Assignee: Jesse Glick (jglick)
            Reporter: Bruno Bonacci (bbonacci)
            Votes: 13
            Watchers: 28
