JENKINS-45057

"too many files open": file handles leak, job output file not closed

      Jenkins seems to keep an open file handle to the log file (job output) for every single build, even those that have been discarded by the "Discard old builds" policy.

       

      This is a sample of the lsof output (the whole file is attached):

      java 8870 jenkins 941w REG 252,0 1840 1332171 /data/jenkins/jobs/automation/jobs/emr-termination-policy/builds/.50063/log (deleted)
      java 8870 jenkins 942w REG 252,0 2023 402006 /data/jenkins/jobs/automation/jobs/emr-termination-policy/builds/.50044/log (deleted)
      java 8870 jenkins 943w REG 252,0 2193 1332217 /data/jenkins/jobs/automation/jobs/emr-termination-policy/builds/50101/log
      java 8870 jenkins 944w REG 252,0 2512 1332247 /data/jenkins/jobs/automation/jobs/emr-termination-policy/builds/50106/log
      java 8870 jenkins 945w REG 252,0 1840 1703994 /data/jenkins/jobs/automation/jobs/emr-termination-policy/builds/.50067/log (deleted)
      java 8870 jenkins 946w REG 252,0 2350 1332230 /data/jenkins/jobs/automation/jobs/emr-termination-policy/builds/.50092/log (deleted)
      java 8870 jenkins 947w REG 252,0 1840 402034 /data/jenkins/jobs/automation/jobs/emr-termination-policy/builds/.50049/log (deleted)
      java 8870 jenkins 948w REG 252,0 1840 927855 /data/jenkins/jobs/automation/jobs/emr-termination-policy/builds/.50080/log (deleted)
      java 8870 jenkins 949w REG 252,0 2195 1332245 /data/jenkins/jobs/automation/jobs/emr-termination-policy/builds/.50095/log (deleted)
      java 8870 jenkins 950w REG 252,0 2326 1332249 /data/jenkins/jobs/automation/jobs/emr-termination-policy/builds/50107/log
      java 8870 jenkins 952w REG 252,0 2195 1332227 /data/jenkins/jobs/automation/jobs/emr-termination-policy/builds/50102/log
      java 8870 jenkins 953w REG 252,0 2154 1332254 /data/jenkins/jobs/automation/jobs/emr-termination-policy/builds/50109/log
      java 8870 jenkins 954w REG 252,0 2356 1332282 /data/jenkins/jobs/automation/jobs/emr-termination-policy/builds/50105/log
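
      A quick way to quantify the leak on a running instance is a check along these lines (only a sketch, assuming lsof is available, Jenkins runs as the jenkins user, and build logs live under a .../builds/<number>/log path as in the sample above); it counts build-log handles that are still open even though the file has already been deleted on disk:

        # Count open handles to build log files held by the Jenkins process,
        # and how many of them point at files already removed on disk.
        lsof -u jenkins | grep '/builds/.*/log' > open-logs.txt
        wc -l < open-logs.txt                 # all open build-log handles
        grep -c '(deleted)' open-logs.txt     # handles surviving past build discard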
      

       

          [JENKINS-45057] "too many files open": file handles leak, job output file not closed

          Oleg Nenashev added a comment -

          It didn't get into 2.60.3 since it was fixed/integrated too late. It will be a candidate for the next baseline


          Oleg Nenashev added a comment -

          As jglick says, the patch has been released in Credentials Binding 1.13, so the partial fix can be applied via a plugin update.

           


          Steven Christenson added a comment -

          Above is the change in file handle usage after upgrading to CloudBees Jenkins Enterprise 2.60.2.2-rolling. Our workaround until the core version is released is to set ulimit -n very large and reboot at least weekly.

          If a better interim solution is known, we'd love to hear it.
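
          (For anyone relying on the same workaround: one way to make the higher limit persistent, rather than setting ulimit -n in a shell, is a systemd override for the Jenkins service. This is only a sketch; the unit name "jenkins", the limit value, and the pgrep pattern below are assumptions to adapt to your installation.)

            # Raise the open-files limit for the Jenkins service (sketch).
            sudo systemctl edit jenkins      # opens an override file; add:
            #   [Service]
            #   LimitNOFILE=65536
            sudo systemctl restart jenkins
            # Verify the limit seen by the running process:
            grep 'open files' /proc/"$(pgrep -f jenkins.war | head -n1)"/limits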


          Oleg Nenashev added a comment - - edited

          stevenatcisco what causes it in your case? If it is the Credentials Binding plugin, you can just update it (see the linked issues). The Jenkins core change is only a generic fix for all cases; plugins can be patched on their own without the need to bump the core. You can use http://file-leak-detector.kohsuke.org/ to triage the root cause.

          In Jenkins, the patch will be available in 2.73.1 LTS. Regarding CloudBees Jenkins Enterprise, please contact the vendor's support.
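
          (For reference, the standalone detector linked above attaches to a running JVM and needs a full JDK with the attach API available. A rough sketch of the attach usage follows; the jar name, option string, and port are assumptions based on the tool's documentation, so check them against the jar's own help output before use.)

            # Attach the file leak detector to the running Jenkins JVM (sketch).
            JENKINS_PID=$(pgrep -f jenkins.war | head -n1)
            java -jar file-leak-detector-jar-with-dependencies.jar "$JENKINS_PID" http=19999,threshold=1000
            # If the http option is supported, the currently open descriptors
            # (with the stack traces that opened them) can then be dumped with:
            curl http://localhost:19999/ > open-descriptors.txt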


          Jesse Glick added a comment -

          I suppose lts-candidate can be removed given that this is already in 2.73.

          oleg_nenashev the File Leak Detector plugin (better than the linked standalone tool) would not be helpful here, since we already know where the file handle is opened: when the build starts. The issue is why it is not closed, which will depend on which console-affecting plugins are active during the build.


          Daniel Beck added a comment -

          Right, the Stapler one is tracked in JENKINS-45903.


          Steven Christenson added a comment -

          oleg_nenashev: We tried using the File Leak Detector plugin; it would not run, apparently because it requires Oracle Java and we are using OpenJDK. The standalone Kohsuke leak detector crashed our Jenkins instance when we ran it; it too seems to require Oracle Java.

          Here is the job we are running hourly, and the results

          /* JOB TO PERIODICALLY CHECK FILE HANDLES */
          node('master') {
              sh '''rm -f lsof.txt
                    lsof -u jenkins > lsof.txt
                    cut -f 1 /proc/sys/fs/file-nr > filehandles.txt
                    echo "$(cat filehandles.txt)=handles |" > numfiles.txt
                    echo "$(wc -l < lsof.txt)=JenkLSOF |" >> numfiles.txt
                    echo "$(grep -Fc \'(deleted)\' lsof.txt)=deleted " >> numfiles.txt
                    cat numfiles.txt
                 '''
              archiveArtifacts allowEmptyArchive: true, artifacts: '*.txt', caseSensitive: false
              result = readFile 'numfiles.txt'
              currentBuild.description = result
              fileHandlesInUse = readFile 'filehandles.txt'
              deleteDir()
          } // node

          /******* RESULTS *******/
          Aug 30, 2017 6:56 AM   9472=handles | 10554=JenkLSOF | 3621=deleted
          Aug 30, 2017 5:56 AM   9568=handles | 10654=JenkLSOF | 3557=deleted
          Aug 30, 2017 4:56 AM   9376=handles | 10521=JenkLSOF | 3524=deleted
          Aug 30, 2017 3:56 AM   9312=handles | 10417=JenkLSOF | 3462=deleted
          Aug 30, 2017 2:56 AM   9216=handles | 10358=JenkLSOF | 3401=deleted
          Aug 30, 2017 1:56 AM   9184=handles | 10276=JenkLSOF | 3338=deleted
          Aug 30, 2017 12:56 AM  9312=handles | 10406=JenkLSOF | 3303=deleted
          Aug 29, 2017 11:56 PM  9216=handles | 10338=JenkLSOF | 3236=deleted
          Aug 29, 2017 10:56 PM  9408=handles | 10423=JenkLSOF | 3198=deleted
          Aug 29, 2017 9:56 PM   8896=handles | 10042=JenkLSOF | 3137=deleted
          Aug 29, 2017 8:56 PM   9024=handles | 10138=JenkLSOF | 3098=deleted
          Aug 29, 2017 7:56 PM   9024=handles | 10243=JenkLSOF | 3028=deleted
          Aug 29, 2017 6:56 PM   8896=handles | 9948=JenkLSOF | 2981=deleted
          Aug 29, 2017 5:56 PM   8768=handles | 9879=JenkLSOF | 2913=deleted
          Aug 29, 2017 4:56 PM   8832=handles | 9879=JenkLSOF | 2844=deleted
          Aug 29, 2017 3:56 PM   8608=handles | 9731=JenkLSOF | 2773=deleted
          Aug 29, 2017 2:56 PM   8448=handles | 9587=JenkLSOF | 2741=deleted
          Aug 29, 2017 1:56 PM   8384=handles | 9556=JenkLSOF | 2681=deleted
          Aug 29, 2017 12:56 PM  8192=handles | 9452=JenkLSOF | 2650=deleted
          Aug 29, 2017 11:56 AM  8096=handles | 9306=JenkLSOF | 2590=deleted
          Aug 29, 2017 1:56 AM   8064=handles | 8921=JenkLSOF | 2081=deleted

          The "deleted" items are all log entries like those described in the original incident. 

          NOTE: I have opened an incident under our support contract, but have posted the details here in case they help diagnose the root cause. Is there another tool we can use? Or would the lsof output over many hours be sufficient?


          Steven Christenson added a comment -

          Here is confirmation that the upgrade resolved the leak... mostly.

          We noticed that in the last 48 hours there have been 6 file handle leaks; previously it would have been hundreds.


          Oleg Nenashev added a comment -

          Even 6 leaks is quite suspicious, but I'd guess we cannot do anything about it without File Leak Detector output.


          Volodymyr Sobotovych added a comment -

          oleg_nenashev After upgrading to Jenkins 2.73.3 the issue became less severe, but we still have to restart our Jenkins instance once a week (with 2.60 it was once a day).

          Here is a summary of two lsof runs taken one day apart. The top files by number of open handles:

          Nov-17:

          100632 slave.log
          32294 log
          7685 timestamps
          4193 random
          3635 urandom

          Nov-18:

          708532 log
          297707 timestamps
          98280 slave.log
          90675 Common.groovy
          85995 BobHelper.groovy
          

          Does this give you more information to find the cause? Unfortunately it is a bit hard for me to provide the File Leak Detector plugin output because we use OpenJDK.
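
          (In case it helps others produce the same kind of per-file summary, a rough equivalent with plain shell tools, assuming lsof default output where the NAME column is field 9, would be:)

            # Top 5 file names by number of open handles held by the jenkins user.
            lsof -u jenkins | awk 'NR > 1 { print $9 }' | xargs -r -n1 basename | sort | uniq -c | sort -rn | head -5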


            jglick Jesse Glick
            bbonacci Bruno Bonacci
            Votes: 13
            Watchers: 28
