
Slave build history page has no data and spawns a ton of very long-lived blocking threads on the master

      So I went to try to see the usage for a slave on builds.apache.org, and the page had no builds on it. I eventually noticed the "Calculation in progress" bit and thought "Oh, ok, I'll leave this up and check again later". That was a mistake. Now there are 30+ threads on the master like the ones in https://gist.github.com/abayer/88e390e3f0859f8b64e2 - i.e., a whole ton of HTTP POST requests to /computer/foo/timeline/data, all but one blocking on the one that's running, and the one that's running takes a long time to finish.

      This means (a) that the build history page for a slave is useless and (b) that we're churning CPU/IO and, I'm guessing, doing so repeatedly without caching, since when I check it now, even an hour and a half later, there's no data on the page.

          [JENKINS-23244] Slave build history page has no data and spawns a ton of very long-lived blocking threads on the master

          Kohsuke Kawaguchi added a comment -

          Adjusting the priority since it only affects relatively unvisited pages of large deployments.

          Kohsuke Kawaguchi added a comment -

          Looking at the thread dump, the call stack indicates that this call resulted in loading all the build records (via AbstractLazyLoadRunMap.all), which looks suspicious.
          I'd think this operation would only require walking newer build records.

          "Handling POST /computer/hadoop4/timeline/data/ : http-bio-8090-exec-895" Id=22685 Group=main RUNNABLE
          	at java.io.UnixFileSystem.getBooleanAttributes0(Native Method)
          	at java.io.UnixFileSystem.getBooleanAttributes(UnixFileSystem.java:242)
          	at java.io.File.exists(File.java:813)
          	at hudson.model.RunMap.retrieve(RunMap.java:219)
          	at hudson.model.RunMap.retrieve(RunMap.java:59)
          	at jenkins.model.lazy.AbstractLazyLoadRunMap.load(AbstractLazyLoadRunMap.java:687)
          	-  locked hudson.model.RunMap@3fa6ce65
          	at jenkins.model.lazy.AbstractLazyLoadRunMap.load(AbstractLazyLoadRunMap.java:649)
          	-  locked hudson.model.RunMap@3fa6ce65
          	at jenkins.model.lazy.AbstractLazyLoadRunMap.search(AbstractLazyLoadRunMap.java:381)
          	at hudson.model.AbstractBuild.getPreviousBuild(AbstractBuild.java:219)
          	at hudson.tasks.Fingerprinter$FingerprintAction.compact(Fingerprinter.java:360)
          	at hudson.tasks.Fingerprinter$FingerprintAction.onLoad(Fingerprinter.java:349)
          	at hudson.model.Run.onLoad(Run.java:337)
          	at hudson.model.RunMap.retrieve(RunMap.java:223)
          	at hudson.model.RunMap.retrieve(RunMap.java:59)
          	at jenkins.model.lazy.AbstractLazyLoadRunMap.load(AbstractLazyLoadRunMap.java:687)
          	at jenkins.model.lazy.AbstractLazyLoadRunMap.load(AbstractLazyLoadRunMap.java:670)
          	at jenkins.model.lazy.AbstractLazyLoadRunMap.all(AbstractLazyLoadRunMap.java:622)
          	-  locked hudson.model.RunMap@3fa6ce65
          	at jenkins.model.lazy.AbstractLazyLoadRunMap.entrySet(AbstractLazyLoadRunMap.java:277)
          	at java.util.AbstractMap$2$1.<init>(AbstractMap.java:378)
          	at java.util.AbstractMap$2.iterator(AbstractMap.java:377)
          	at hudson.util.RunList.iterator(RunList.java:97)
          	at com.google.common.collect.Iterables$15.apply(Iterables.java:1128)
          	at com.google.common.collect.Iterables$15.apply(Iterables.java:1125)
          	at com.google.common.collect.Iterators$8.next(Iterators.java:812)
          	at com.google.common.collect.Iterators$MergingIterator.<init>(Iterators.java:1306)
          	at com.google.common.collect.Iterators.mergeSorted(Iterators.java:1274)
          	at com.google.common.collect.Iterables$14.iterator(Iterables.java:1113)
          	at com.google.common.collect.Iterables$UnmodifiableIterable.iterator(Iterables.java:94)
          	at com.google.common.collect.Iterables$6.iterator(Iterables.java:585)
          	at hudson.util.RunList$2.iterator(RunList.java:210)
          	at hudson.util.RunList$2.iterator(RunList.java:210)
          	at com.google.common.collect.Iterables$6.iterator(Iterables.java:585)
          	at hudson.util.RunList.iterator(RunList.java:97)
          	at hudson.model.BuildTimelineWidget.doData(BuildTimelineWidget.java:63)
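
To illustrate the point about walking only newer build records, here is a minimal sketch. It is not Jenkins code: `BuildRecord` and `LazyBuildStore` are hypothetical stand-ins, and it assumes that build numbers increase with start time, so the walk can stop at the first record older than the cutoff.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;

/** Hypothetical stand-in for a build record; not a Jenkins class. */
final class BuildRecord {
    final int number;
    final long startMillis;

    BuildRecord(int number, long startMillis) {
        this.number = number;
        this.startMillis = startMillis;
    }
}

/** Hypothetical lazy store: a record is "loaded" only when it is asked for. */
final class LazyBuildStore {
    // build number -> start time; stands in for the records on disk
    private final NavigableMap<Integer, Long> onDisk = new TreeMap<>();
    int loads = 0; // counts simulated disk reads

    void put(int number, long startMillis) {
        onDisk.put(number, startMillis);
    }

    /** Collects builds started at or after the cutoff, walking newest first. */
    List<BuildRecord> newerThan(long cutoffMillis) {
        List<BuildRecord> result = new ArrayList<>();
        for (int number : onDisk.descendingKeySet()) {
            BuildRecord b = load(number);
            if (b.startMillis < cutoffMillis) {
                break; // assumption: every remaining build is older still
            }
            result.add(b);
        }
        return result;
    }

    private BuildRecord load(int number) {
        loads++;
        return new BuildRecord(number, onDisk.get(number));
    }
}
```

With 1000 stored builds and a cutoff that admits only the newest five, this walk loads six records (the five matches plus the one that ends the walk) instead of all 1000.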
          


          Andrew Bayer added a comment -

          fwiw, it's now looking a lot better - no blocked threads, build history's showing up for all slaves now, so far as I can tell.


          Daniel Beck added a comment -

          abayer: What changed?


          Andrew Bayer added a comment -

          Nothing - just time after startup and first attempt to load it.


          Jesse Glick added a comment -

          Yup.


          Ivan Kalinin added a comment -

          We are still having a great deal of trouble with the slave build history page.

          I just tried to open it for one slave and the whole Jenkins master locked up on the UI side.

          The thread that calls `AbstractLazyLoadRunMap.load` goes on forever (yes, we have a great many builds), but somehow other threads from the UI pool keep getting blocked. Eventually, Jenkins became unresponsive altogether, but the jobs were still running.

          Maybe we could use a separate thread pool for this kind of work so it won't lock up all the UI threads?

          BTW, we are running the current LTS.
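
The separate-pool idea could look roughly like the sketch below. `HistoryWorker` and its methods are hypothetical names, not part of Jenkins. One caveat: moving the scan to its own thread frees the request thread only while the scan does not hold a lock that other requests also need, and the thread dump above shows the scan holding the `RunMap` monitor, so a pool alone may not be enough.

```java
import java.util.Optional;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical helper: runs expensive history scans off the request threads. */
final class HistoryWorker {
    // A single daemon thread, so at most one expensive scan runs at a time.
    private final ExecutorService pool = Executors.newSingleThreadExecutor(r -> {
        Thread t = new Thread(r, "history-worker");
        t.setDaemon(true);
        return t;
    });

    /** Runs the task on the worker; stops waiting after timeoutMillis. */
    <T> Optional<T> compute(Callable<T> task, long timeoutMillis) {
        Future<T> future = pool.submit(task);
        try {
            return Optional.ofNullable(future.get(timeoutMillis, TimeUnit.MILLISECONDS));
        } catch (TimeoutException e) {
            future.cancel(true); // interrupt the scan; the caller is free again
            return Optional.empty();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt for the caller
            future.cancel(true);
            return Optional.empty();
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        }
    }

    void shutdown() {
        pool.shutdownNow();
    }
}
```

A request handler would call `compute(...)` with a short timeout and render "still computing" when the result is empty, rather than parking a UI thread for minutes.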

          Lukasz Karnasiewicz added a comment -

          Steps to reproduce:
          1. Display the slave build history page. Wait for it to render; there should be a small progress bar with a "Computation in progress" hint.
          2. Request any other page (e.g. the main page). It will hang.

          A sample thread dump illustrating the problem is attached.
          Thread 30745 is processing the request for the slave build history (http://jenkins/computer/slave_name/builds).
          All other requests now hang on jenkins.model.lazy.AbstractLazyLoadRunMap.load for up to 2 minutes in our case.
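
The hang described in the reproduction steps is what a coarse monitor would produce: while one thread holds it for a long scan, every other caller of a method synchronized on the same object waits. The sketch below reproduces only that mechanism. `CoarseLockDemo` is a hypothetical class, and the latch stands in for the long-running load.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

/** Hypothetical demo of one long-held monitor blocking an unrelated caller. */
final class CoarseLockDemo {
    private final CountDownLatch insideLoad = new CountDownLatch(1);
    private final CountDownLatch finishLoad = new CountDownLatch(1);

    /** Stands in for a long scan that keeps the monitor for its whole duration. */
    synchronized void slowLoad() {
        insideLoad.countDown();
        try {
            finishLoad.await(); // a latch wait does not release the monitor
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    /** Stands in for a cheap lookup that needs the same monitor. */
    synchronized int quickLookup() {
        return 1;
    }

    /** Returns the state of the second thread while the first holds the lock. */
    static Thread.State observe() {
        CoarseLockDemo map = new CoarseLockDemo();
        Thread scanner = new Thread(map::slowLoad, "history-request");
        Thread other = new Thread(map::quickLookup, "other-request");
        scanner.start();
        try {
            map.insideLoad.await(); // the scanner now owns the monitor
            other.start();
            long deadline = System.nanoTime() + TimeUnit.SECONDS.toNanos(5);
            Thread.State seen = other.getState();
            while (seen != Thread.State.BLOCKED && System.nanoTime() < deadline) {
                Thread.sleep(1);
                seen = other.getState();
            }
            map.finishLoad.countDown(); // let the scan finish
            scanner.join();
            other.join();
            return seen;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            map.finishLoad.countDown();
            return other.getState();
        }
    }
}
```

`CoarseLockDemo.observe()` reports the second thread as `BLOCKED` for as long as the first is inside `slowLoad`, which matches the "all other requests hang" symptom.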

          Matthew Mitchell added a comment -

          I'm seeing this in our installation. It severely impacts the responsiveness of the system.

          Matthew Mitchell added a comment -

          (FYI this installation runs around 6-7k builds a day.)

          Even when walking only newer builds, it seems like this would be super expensive. Maybe it's better to keep an index of build name/number to machine to avoid loading the metadata at all?
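
A rough sketch of such an index follows. `NodeBuildIndex` and `BuildRef` are hypothetical names, not Jenkins APIs. The idea is to record a lightweight (job name, build number) reference per machine when a build finishes, so that the history page can list builds without loading any build records. Persistence and cleanup of deleted builds are left out.

```java
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArrayList;

/** Hypothetical in-memory index from machine name to the builds that ran on it. */
final class NodeBuildIndex {

    /** Lightweight reference: enough to link to a build without loading it. */
    static final class BuildRef {
        final String jobName;
        final int number;

        BuildRef(String jobName, int number) {
            this.jobName = jobName;
            this.number = number;
        }
    }

    private final Map<String, List<BuildRef>> byNode = new ConcurrentHashMap<>();

    /** Called once per finished build; appends to that machine's list. */
    void recordBuild(String nodeName, String jobName, int number) {
        byNode.computeIfAbsent(nodeName, k -> new CopyOnWriteArrayList<>())
              .add(new BuildRef(jobName, number));
    }

    /** Read-only view of the builds recorded for a machine, oldest first. */
    List<BuildRef> buildsOn(String nodeName) {
        return Collections.unmodifiableList(
                byNode.getOrDefault(nodeName, Collections.emptyList()));
    }
}
```

Lookups then cost one map access per machine, independent of how many build records exist on disk.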

          SCM/JIRA link daemon added a comment -

          Code changed in jenkins
          User: Akbashev Alexander
          Path:
          core/src/main/resources/hudson/model/BuildTimelineWidget/control.jelly
          http://jenkins-ci.org/commit/jenkins/2a0ac4f0989407a20e277444a7737e9c5f7ea78a
          Log:
          [FIX JENKINS-23244] Slave build history page has no data and spawns a ton of very long-lived blocking threads on the master (#2584)

          The commit mainly does two things:
          1) Show only selected (visible) builds
          2) Query builds one by one, not in parallel

          SCM/JIRA link daemon added a comment -

          Code changed in jenkins
          User: Akbashev Alexander
          Path:
          core/src/main/resources/hudson/model/BuildTimelineWidget/control.jelly
          http://jenkins-ci.org/commit/jenkins/4421d1b94d143956475f20a03c63fb1a367321f2
          Log:
          [FIX JENKINS-23244] Slave build history page has no data and spawns a ton of very long-lived blocking threads on the master (#2584)

          The commit mainly does two things:
          1) Show only selected (visible) builds
          2) Query builds one by one, not in parallel
          (cherry picked from commit 2a0ac4f0989407a20e277444a7737e9c5f7ea78a)

            Assignee: jimilian Alexander A
            Reporter: abayer Andrew Bayer