
Heap Histogram Collection Destabilizes Masters

      When we added Heap Stats collection (https://issues.jenkins-ci.org/browse/JENKINS-22791) in v2.42, it appears we inadvertently introduced a major performance and stability regression that shows up whenever the histogram is collected regularly.

      How? Well, this gathers a live heap histogram, which appears to trigger a Full GC. This is visible in the GC logs, which show the following cause:

      > [Full GC (Heap Inspection Initiated GC).
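
      One way to check this on a HotSpot JVM is to request a class histogram through the DiagnosticCommand MBean and compare collector counts before and after. This is only a sketch for reproducing the symptom; it is not necessarily how the plugin gathers its histogram, and the class and method names are made up for illustration. With no options the histogram covers live objects only, which is the case that forces a collection; passing "-all" also counts unreachable objects.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import javax.management.MBeanServer;
import javax.management.ObjectName;

// Hypothetical helper for reproducing the symptom; HotSpot-specific.
public class HistogramGcProbe {

    /** Sum of the collection counts reported by every collector in this JVM. */
    static long totalCollections() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long count = gc.getCollectionCount(); // -1 when the count is undefined
            if (count > 0) {
                total += count;
            }
        }
        return total;
    }

    /** Runs the GC.class_histogram diagnostic command with the given options. */
    static String classHistogram(String... options) {
        try {
            MBeanServer server = ManagementFactory.getPlatformMBeanServer();
            ObjectName name = new ObjectName("com.sun.management:type=DiagnosticCommand");
            return (String) server.invoke(
                    name,
                    "gcClassHistogram",
                    new Object[] { options },
                    new String[] { String[].class.getName() });
        } catch (Exception e) {
            // Assumption: callers only need to know that the command was unavailable.
            throw new IllegalStateException("class histogram not available on this JVM", e);
        }
    }

    public static void main(String[] args) {
        long before = totalCollections();
        String histogram = classHistogram(); // live objects only
        long after = totalCollections();
        System.out.println("histogram characters: " + histogram.length());
        System.out.println("collections during live histogram: " + (after - before));
    }
}
```

      Running this with GC logging enabled should make it possible to see whether the "Heap Inspection Initiated GC" cause appears alongside any increase in the collection count.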

      Now, because this is a Full GC and not a concurrent or young-gen GC, and we're generally using G1 GC, the slow serial collector is used for the Full GC. This is a NON-concurrent GC mode, meaning the application is fully paused until it completes, and it is SINGLE-threaded, meaning that rather than 1 GB/s of GC throughput per CPU, we get <1 GB/s in total. It also cleans and compacts the entire heap rather than just part of it, as other modes do.

      So, with 15 GB of used heap, that means a pause of up to ~15s. This matches behavior observed in the wild.
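
      The estimate above is simply used heap divided by total GC throughput. A minimal sketch of that arithmetic, assuming the rough figure of 1 GB/s per GC thread quoted in this issue (the class and method names are made up for illustration):

```java
// Back-of-the-envelope pause estimate; the throughput figure is the
// rough number from this issue, not a measured value.
public class FullGcPauseEstimate {

    /**
     * Estimated stop-the-world pause in seconds for a collection that
     * walks the whole used heap with the given number of GC threads.
     */
    static double estimatePauseSeconds(double usedHeapGb, double gbPerSecPerThread, int gcThreads) {
        if (usedHeapGb < 0 || gbPerSecPerThread <= 0 || gcThreads < 1) {
            throw new IllegalArgumentException("heap must be >= 0, throughput > 0, threads >= 1");
        }
        return usedHeapGb / (gbPerSecPerThread * gcThreads);
    }

    public static void main(String[] args) {
        // Single-threaded Full GC over 15 GB of used heap.
        System.out.println("serial:   " + estimatePauseSeconds(15.0, 1.0, 1) + " s");
        // The same heap if 8 threads could share the work.
        System.out.println("parallel: " + estimatePauseSeconds(15.0, 1.0, 8) + " s");
    }
}
```

      The comparison shows why the single-threaded mode hurts most on large heaps: the pause grows linearly with used heap and gets no help from additional CPUs.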

      I am rating this as critical because on larger-scale production masters a hang that long can cause job failures, visible UI hangs, HTTP request timeouts, and other issues – it should result in Durable Task failures for Pipelines, for example.

      Proposed solution: only gather the live heap histogram when a user is explicitly requesting a support bundle (disable it by default).
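
      The proposal could be sketched as a gate in front of the expensive collector. This is an illustration only, not the actual fix: the Trigger enum, the HistogramGate class, and the placeholder message are hypothetical names introduced here.

```java
import java.util.function.Supplier;

// Hypothetical gate: the live histogram runs only on an explicit user request.
public class HistogramGate {

    /** Why the collection was started. */
    enum Trigger { PERIODIC, USER_REQUESTED }

    /**
     * Returns the live histogram only for user-requested bundles. For
     * periodic collection the supplier is never invoked, so no heap
     * inspection (and no resulting Full GC) takes place.
     */
    static String collect(Trigger trigger, Supplier<String> liveHistogram) {
        if (trigger != Trigger.USER_REQUESTED) {
            return "(live heap histogram skipped: not explicitly requested)";
        }
        return liveHistogram.get();
    }

    public static void main(String[] args) {
        Supplier<String> expensive = () -> "num  #instances  #bytes  class name";
        System.out.println(collect(Trigger.PERIODIC, expensive));
        System.out.println(collect(Trigger.USER_REQUESTED, expensive));
    }
}
```

      Passing the collector as a Supplier keeps the skip path free of any heap inspection, which is the point of disabling it by default.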

          [JENKINS-49931] Heap Histogram Collection Destabilizes Masters

          Sam Van Oort created issue
          Emilio Escobar made changes:
          Status Original: Open [ 1 ] New: In Progress [ 3 ]
          Emilio Escobar made changes:
          Status Original: In Progress [ 3 ] New: In Review [ 10005 ]
          Emilio Escobar made changes:
          Remote Link New: This issue links to "PR-134 (Web Link)" [ 20224 ]
          Emilio Escobar made changes:
          Remote Link New: This issue links to "PR-135 (Web Link)" [ 20225 ]
          Jesse Glick made changes:
          Remote Link New: This issue links to "Page (Jenkins Wiki)" [ 20235 ]
          Jesse Glick made changes:
          Resolution New: Fixed [ 1 ]
          Status Original: In Review [ 10005 ] New: Resolved [ 5 ]
          Emilio Escobar made changes:
          Link New: This issue is related to JENKINS-50008 [ JENKINS-50008 ]
          Emilio Escobar made changes:
          Link New: This issue relates to JENKINS-50010 [ JENKINS-50010 ]
          Arnaud Héritier made changes:
          Status Original: Resolved [ 5 ] New: Closed [ 6 ]

            escoem Emilio Escobar
            svanoort Sam Van Oort