Jenkins / JENKINS-19022

GIT Plugin (any version) heavily bloats memory use and size of build.xml with "BuildData" fields

Details

    Description

      Hello everyone.

      Months ago, we noticed a bug/issue with the GIT plug-in. Previously, it was only a minor nuisance, but now it causes each build that we start to use up ~3MB of main memory and ~5MB of disk space in the build.xml.

      The issue is due to the following behaviour of the GIT plug-in:
      For every build that has the GIT SCM defined, it retrieves the list of branches in the remote repository. For each branch, it retrieves the last build in Jenkins that was run against this branch.

      This information is then stored in the Build object in the form of the "BuildData" field. This means that the full list of all branches, plus their last builds, is stored in each and every build, using up main memory as well as disk space in the "build.xml" file allocated for the build.

      It uses this information to populate a page for the build with the association of branches to builds:
      http://<SERVER>/job/<JOBNAME>/<BUILD-ID>/git/?

      For normal repositories, this data is relatively small, as only a limited number of unmerged branches exist. Unfortunately, we use GIT in an automated manner, where thousands of tags and branches are spawned without merging back into the mainline.

      This means that each build saves several hundred to several thousand pointless key-value pairs for GIT branches and Jenkins builds that serve no purpose whatsoever.

      In our case, this means – as outlined above – we waste 3MB of RAM per build and 5 MB of disk space. With 10k builds per day, you can imagine that this is quite a predicament.

      As a workaround, we've written a Jenkins job that removes the tags contained in "<hudson.plugins.git.util.BuildData>" in the "build.xml". This cuts its size from 5MB down to 16kB (~0.016MB). This of course also greatly boosts the speed of deserializing the builds from disk.
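The workaround described above could look roughly like the following Python sketch. It is not the reporter's actual job: the function name `strip_build_data`, the sample document, and the inner element names (such as `buildsByBranchName`) are assumptions made for illustration. Only the `hudson.plugins.git.util.BuildData` tag name comes from the report.

```python
import xml.etree.ElementTree as ET

# Tag name taken from the issue description.
BUILD_DATA_TAG = "hudson.plugins.git.util.BuildData"


def strip_build_data(xml_text: str) -> str:
    """Return xml_text with every BuildData element emptied of its children."""
    root = ET.fromstring(xml_text)
    # Collect matches first so the tree is not modified while iterating it.
    for element in list(root.iter(BUILD_DATA_TAG)):
        for child in list(element):
            element.remove(child)
        element.text = None
    return ET.tostring(root, encoding="unicode")


# Hypothetical, heavily simplified build.xml used only for this demonstration.
SAMPLE = """<build>
  <actions>
    <hudson.plugins.git.util.BuildData>
      <buildsByBranchName>
        <entry><string>origin/feature-1</string></entry>
        <entry><string>origin/feature-2</string></entry>
      </buildsByBranchName>
    </hudson.plugins.git.util.BuildData>
  </actions>
  <number>42</number>
</build>"""

stripped = strip_build_data(SAMPLE)
print(len(SAMPLE), "->", len(stripped), "characters")
```

A real build.xml may carry an XML declaration or structure that this parser does not accept as-is, so treat this as a sketch of the idea rather than a drop-in tool. As a later comment in this thread warns, deleting this data can cause previously built branches to be built again.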

      Our request would be: Either remove the collection/deserialization of this (from our POV) pointless data, or make its generation optional via a configuration option.

      Best regards,
      Martin Schröder
      Intel Mobile Communications GmbH

          Activity

            batmat Baptiste Mathus added a comment -

            markewaite I think the "BuildData" structure has been heavily refactored, hasn't it? Should this be closed maybe? Thanks
            markewaite Mark Waite added a comment - edited

            Unfortunately batmat, the three attempts (two by ndeloof and one by jekeller) were unable to significantly refactor BuildData in a compatible fashion. The most recent attempt by jekeller passed multiple months of my testing but showed compatibility issues in the accidental release of git plugin 4.0.0-rc.

            The changes were reverted before the release of git plugin 4.0.0.

            The git plugin documentation now includes a system Groovy script that removes BuildData. See https://plugins.jenkins.io/git/#remove-git-plugin-buildsbybranch-builddata-script

            jekeller Jacob Keller added a comment -

            batmat the refactor was reverted because it had unexpected side effects.

            My solution involved doing a search/lookup mechanism against all old builds and "rebuilding" the build data every job. This works but slows down significantly once you have a lot of jobs.

            I believe a better solution exists using a plugin-specific XML file, so we basically just stop storing the build data per-build and start storing it per-job as a separate file. I've thought about it on-and-off for a while but never got around to trying to implement it.

            jjardina Jason Jardina added a comment - edited

            markewaite I ran the script you listed and it kicked off several (over 50) old builds that had already been built. I use regex to scan my repositories by naming convention using git polling. A build is kicked off when the commit hash has changed on a branch matching the regex. I am glad I ran that on my older code server and not on my currently shipping code. That script is dangerous. It may solve your problems, but it definitely does not solve mine. I have to have the build history in order for Jenkins to know what it has built previously, so it doesn't get stuck in a build loop. That script is like sticking a loaded gun to Jenkins' head and pulling the trigger. Before you tell everyone to run that script and delete their build data, you should warn them that they may see unexpected results, exactly like I saw when we updated to git plugin 4.0.0-rc, which was accidentally released into the wild last year.

            The best solution I found is to keep only the last 10-20 builds in the Jenkins history by using the Discard Old Builds (log rotation) settings. That lets me keep my current git history without the history file size getting out of hand and slowing builds/reboots.

            jekeller Jacob Keller added a comment -

            jjardina, yes that's part of the problem. The current build data is stored as a map once per build. The script there will delete all build data to conserve memory and reduce the bloat. The ultimate issue is not that a single map takes that much space, but that every build keeps a map of the history up to that point, so the total scales as N^2. With N = 10k builds, roughly N^2 entries are stored across those maps (I know it's slightly less, since it's more like n*(n-1)/2).
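To make the scaling argument concrete, here is a small Python sketch. It rests on a simplifying assumption that is not stated in the thread: every build introduces exactly one new branch, so build number k stores a map with k entries. The function name `total_entries_per_build_storage` is made up for this example.

```python
def total_entries_per_build_storage(num_builds: int) -> int:
    """Total map entries kept across all builds when build k stores k entries.

    Assumption for illustration: each build adds exactly one new branch,
    so the per-build map grows by one entry per build.
    """
    return sum(range(1, num_builds + 1))


n = 10_000
per_build_total = total_entries_per_build_storage(n)
per_job_total = n  # a single shared map would hold one entry per branch

print(per_build_total)  # 50005000
print(per_build_total // per_job_total)  # 5000
```

Under this assumption the per-build layout stores N(N+1)/2 entries, which is the same quadratic order as the n*(n-1)/2 estimate above, while a single shared map would grow only linearly.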

            I firmly believe that the git plugin should be modified to store this data per job in an XmlFile in the job root. This way, we can maintain this history (as you and many others obviously require), while avoiding both the cost-complexity of storing the build data repeatedly and of trying to rebuild the data from previous jobs.
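The per-job idea can be sketched in Python as follows. This is only an illustration of the data layout, not plugin code: the class `PerJobBranchStore`, its methods, and the file name `git-branch-builds.json` are all hypothetical, and JSON is swapped in for the XmlFile mentioned above purely to keep the example short.

```python
import json
import tempfile
from pathlib import Path


class PerJobBranchStore:
    """Hypothetical per-job store: one branch -> last build number map per job."""

    def __init__(self, job_root: Path, filename: str = "git-branch-builds.json"):
        # One file in the job root, shared by all builds of that job.
        self.path = job_root / filename

    def load(self) -> dict:
        """Return the current map, or an empty map if nothing was stored yet."""
        if not self.path.exists():
            return {}
        return json.loads(self.path.read_text(encoding="utf-8"))

    def record_build(self, branch: str, build_number: int) -> None:
        """Update the single shared map instead of copying it into each build."""
        data = self.load()
        data[branch] = build_number
        self.path.write_text(
            json.dumps(data, indent=2, sort_keys=True), encoding="utf-8"
        )


# Demonstration in a temporary directory standing in for the job root.
with tempfile.TemporaryDirectory() as tmp:
    store = PerJobBranchStore(Path(tmp))
    store.record_build("origin/main", 1)
    store.record_build("origin/feature-1", 2)
    store.record_build("origin/main", 3)
    snapshot = store.load()

print(snapshot)
```

With this layout the history needed to avoid rebuilding old branches survives, but it exists once per job rather than once per build. A real implementation would also have to handle concurrent builds and migration of existing build.xml data, which this sketch ignores.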

            This task shouldn't be too difficult, but it does require someone investing time, and unfortunately I don't have time to work on this at $DAYJOB right now, so it's not something I can commit to doing in a timely manner.

            Now, one could argue that the git plugin shouldn't be saving data about builds which have been deleted, but that's neither here nor there, as clearly people desire this behavior and it's how the plugin has behaved for many years now.


            People

              Assignee: Unassigned
              Reporter: mhschroe Martin Schröder
              Votes: 39
              Watchers: 89