[JENKINS-56595] Regression: higher than usual CPU usage with 2.164.1

    • Type: Improvement
    • Resolution: Unresolved
    • Priority: Major
    • Component: core
    • Labels: None

      Ever since upgrading to 2.164.1 (the current LTS) from 2.150.3, we have been experiencing higher than usual CPU usage, which has also caused a crash.


          Daniel Beck added a comment -

          This report does not contain nearly enough information for us to investigate further.


          Ryan Taylor added a comment -

          I can also confirm this behavior on 2.164.1.

          Symptoms are:

          • Gerrit queue is blocked
          • CPU utilization is all user time and pegged at 100%
          • IO drops to nothing

           

          The UI was completely locked up at ~9:25 and Jenkins was restarted at 9:37. The system CPU spike at 9:33 was from inspecting the Jenkins logs before the restart.

           

          This happens every few days, so I should be able to capture more information on the next cycle. I assume a thread dump is the best resource to ascertain what's happening here?
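          (For reference: a thread dump is indeed the standard way to see what a pegged-CPU JVM is spending its time on. Below is a minimal, illustrative Java sketch, not part of this report, that captures one programmatically via the standard java.lang.management API; running jstack <pid> against the Jenkins process, or visiting the /threadDump page on the Jenkins controller as an administrator, should yield the same information. The class name ThreadDump is just an illustrative choice.)

            import java.lang.management.ManagementFactory;
            import java.lang.management.ThreadInfo;
            import java.lang.management.ThreadMXBean;

            public class ThreadDump {
                public static void main(String[] args) {
                    // Dump every live thread with its lock information and full stack.
                    ThreadMXBean mx = ManagementFactory.getThreadMXBean();
                    for (ThreadInfo info : mx.dumpAllThreads(true, true)) {
                        System.out.printf("\"%s\" id=%d state=%s%n",
                                info.getThreadName(), info.getThreadId(), info.getThreadState());
                        for (StackTraceElement frame : info.getStackTrace()) {
                            System.out.println("    at " + frame);
                        }
                        System.out.println();
                    }
                }
            }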


          Tyler Pickett added a comment -

          I was on vacation last week when rtaylor_instructure commented, or he would have included a set of thread traces. I've attached one that I captured from our instance the week before last.


          Günter Grodotzki added a comment -

          FWIW:

          • We doubled the RAM from a 4GB to an 8GB instance (with ~50% of it allocated to the JVM) - it lasted longer without a crash, but ultimately gave in after a couple of days.
          • We doubled the RAM again, from an 8GB to a 16GB instance (again with ~50% allocated to the JVM) - so far it's lasting.

           

          So for us that was quite a bump in resource requirements. Doubling is something one might expect (also because we were growing, with more repositories) - but a 4x bump seemed a bit out of the norm.

          As long as the JVM has enough RAM, the CPU usage stays relatively low - so for now we are not seeing any issues.
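          (As an aside, a quick way to confirm how much heap "~50% JVM allocation" actually gave the controller JVM is to ask the Runtime directly. A minimal, illustrative Java sketch under that assumption; the same Runtime calls also work from the Jenkins script console, since Groovy accepts Java-style expressions. The class name HeapCheck is just for illustration.)

            public class HeapCheck {
                public static void main(String[] args) {
                    long mb = 1024 * 1024;
                    Runtime rt = Runtime.getRuntime();
                    // Maximum heap the JVM will ever use (-Xmx, or the default
                    // derived from instance/container memory when no flag is set).
                    System.out.println("max heap:  " + rt.maxMemory() / mb + " MB");
                    // Heap currently reserved from the OS, and the portion in use.
                    System.out.println("committed: " + rt.totalMemory() / mb + " MB");
                    System.out.println("used:      " + (rt.totalMemory() - rt.freeMemory()) / mb + " MB");
                }
            }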


          Tyler Pickett added a comment -

          Additional info has been added, and we're willing to collect additional data as necessary.


          Ryan Taylor added a comment -

          I can also confirm this behavior with 2.164.2.


          Tyler Pickett added a comment -

          After having our Jenkins instance go down 4 times today, we rolled back to 2.150.3, and heap usage immediately dropped from pegging the 4GB limit we have set to comfortably hovering around 2GB. I've collected several thread dumps over the last couple of weeks and will start trying to track down where the issue is.
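          (To reproduce the kind of before/after comparison above - heap pegged at the 4GB limit versus hovering around 2GB - one option is to sample heap usage periodically via MemoryMXBean. A minimal, illustrative Java sketch, not taken from this report; the HeapSampler name and the one-minute interval are arbitrary choices.)

            import java.lang.management.ManagementFactory;
            import java.lang.management.MemoryMXBean;
            import java.lang.management.MemoryUsage;

            public class HeapSampler {
                public static void main(String[] args) throws InterruptedException {
                    MemoryMXBean mem = ManagementFactory.getMemoryMXBean();
                    long mb = 1024 * 1024;
                    while (true) {
                        MemoryUsage heap = mem.getHeapMemoryUsage();
                        // "used" climbing to "max" and staying there matches the
                        // pegged-heap symptom; on a healthy instance it should drop
                        // back after each GC cycle.
                        System.out.printf("heap used=%d MB committed=%d MB max=%d MB%n",
                                heap.getUsed() / mb, heap.getCommitted() / mb, heap.getMax() / mb);
                        Thread.sleep(60_000); // sample once a minute
                    }
                }
            }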


          Matthew Hall added a comment - edited

          This sounds almost exactly like our failed upgrade attempt from 2.121.3 to 2.150.3. We were comfortably running with a 4GB heap, and then suddenly the CPU was pegged and the UI was slow to unresponsive. We had to revert back to 2.121.3. I did grab one thread dump (threaddump-1552681298816.tdump) before reverting. But if you're looking for root-cause changes, you may want to include >= 2.121.3.


          Shannon Kerr added a comment - edited

          Pretty sure we are now seeing this, or something very similar, after our recent upgrade to 2.150.3. We had nothing like this back when we were on 2.138.1. I spread out some of our jobs and multibranch scans to try to mitigate the load (FYI: we run Jenkins in a Docker container). Of course this doesn't match up with the original reporter, who said it wasn't seen until an upgrade from 2.150.3. I'm considering rolling back to 2.138.x at this point, as it is disrupting my team.


            • Assignee: Unassigned
            • Reporter: Günter Grodotzki (lifeofguenter)
            • Votes: 7
            • Watchers: 14