[JENKINS-21605] Logging all UpstreamCause's floods Jenkins in large setups

      This bug is the same as https://issues.jenkins-ci.org/browse/JENKINS-15747, but it occurs again for us with Jenkins 1.548 (it had been working in the meantime).

      In cases where there are multiple paths through the job dependencies, logging all upstream causes generates a huge build log.
      The example job linked with this ticket has a log of nearly 110 MB because of that.

      In the previous ticket, the issue was addressed by "not only cap total number of transitive upstream causes but also avoid redundantly storing information about upstream causes listed elsewhere".

          Jesse Glick added a comment

          Showing a shallow list of upstream causes from every upstream job is not a real problem in and of itself, because the set of upstream projects is of a fixed size. The problem is when you start showing a somewhat deeper graph, and the upstream projects themselves have interdependencies, because then you can get into an exponentially large set of causes. Duplicated graph nodes should be pruned, and/or the total graph size bounded.
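A minimal sketch (plain Java, not Jenkins code) of why interdependent upstream projects blow up the cause list, and why pruning duplicated graph nodes fixes it. The topology and the names `CauseBlowup`, `unfoldedSize` and `prunedSize` are hypothetical: job 0 is the root, job 1 is triggered by job 0, and every job i >= 2 is triggered by both job i-1 and job i-2.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical topology (not from Jenkins): job 0 is the root, job 1 is
// triggered by job 0, and every job i >= 2 is triggered by BOTH i-1 and i-2.
final class CauseBlowup {
    static List<Integer> upstreamOf(int job) {
        if (job == 0) return List.of();
        if (job == 1) return List.of(0);
        return List.of(job - 1, job - 2);
    }

    // Entries in the cause tree when every upstream cause is copied in full:
    // shared ancestors are repeated once per path, so the size grows like Fibonacci.
    static long unfoldedSize(int job) {
        long size = 1; // this job's own "Started by upstream project" entry
        for (int up : upstreamOf(job)) {
            size += unfoldedSize(up);
        }
        return size;
    }

    // Entries when duplicated graph nodes are pruned: each job is listed once.
    static int prunedSize(int job, Set<Integer> seen) {
        if (!seen.add(job)) return 0; // already listed elsewhere in the tree
        int size = 1;
        for (int up : upstreamOf(job)) {
            size += prunedSize(up, seen);
        }
        return size;
    }

    public static void main(String[] args) {
        for (int n : new int[] {10, 20, 30}) {
            System.out.println(n + ": unfolded=" + unfoldedSize(n)
                    + " pruned=" + prunedSize(n, new HashSet<>()));
        }
    }
}
```

Running it prints `10: unfolded=232 pruned=11`, `20: unfolded=28656 pruned=21` and `30: unfolded=3524577 pruned=31`: the unfolded tree is exponential in the number of jobs, while the pruned one is linear.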

          Dirk Thomas added a comment (edited)

          I have updated the Groovy script to generate a job topology like this:

          01 depends on root
          02 depends on 01
          ...
          40 depends on 39
          leaf depends on: 01-40

          The result can be found here: http://54.183.26.131:8080/view/jenkins21605/job/jenkins21605_leaf/lastSuccessfulBuild/
          The build log is already 84 KB in size (just for logging the upstream causes).

          I think the "best" approach to address this is not to add another threshold but simply to give the user the option not to log the recursive upstream causes.
          It is perfectly fine to list all "Started by upstream project" lines for all the upstream dependencies.
          But optionally omitting the ever-growing "originally caused by" lines would reduce the output to linear size (linear in the number of upstream dependencies).
          And the user can still navigate the hierarchy of upstream causes by following the upstream build links.
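To put rough numbers on the proposal, here is a simplified line-count model of the topology above. It is an assumption, not a measurement: the root job is ignored, each upstream build listed costs one log line, and the class `LogSizeModel` and its methods are hypothetical. The depth cap of 10 is likewise an assumed value.

```java
// Simplified line-count model of the topology above (an assumption, not a
// measurement): job k (1..jobs) is triggered by job k-1, "leaf" is triggered
// by every job, and the root job is ignored. The direct cause from job k thus
// carries a chain of k upstream builds (k, k-1, ..., 1), one log line each.
final class LogSizeModel {
    // Lines when every direct cause prints its chain, cut off after maxDepth levels.
    static int recursiveLines(int jobs, int maxDepth) {
        int lines = 0;
        for (int k = 1; k <= jobs; k++) {
            lines += Math.min(k, maxDepth);
        }
        return lines;
    }

    // Lines when only the direct "Started by upstream project" causes are printed.
    static int directOnlyLines(int jobs) {
        return jobs;
    }

    public static void main(String[] args) {
        int jobs = 40;
        System.out.println("no depth cap: " + recursiveLines(jobs, Integer.MAX_VALUE));
        System.out.println("depth cap 10: " + recursiveLines(jobs, 10));
        System.out.println("direct only:  " + directOnlyLines(jobs));
    }
}
```

With 40 jobs this prints 820, 355 and 40 lines respectively. Without a cap the recursive output grows quadratically (jobs × (jobs + 1) / 2); with a cap it is still up to 10 lines per job, while the direct-only output is exactly one line per upstream dependency.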

          Do you agree that this test is sufficient?
          Do you agree that the proposed option would be a reasonable approach to address the problem or do you see a better approach?

          Jesse Glick added a comment

          No new options. Just limit the amount of information displayed to some reasonable threshold, as in JENKINS-15747.

          Dirk Thomas added a comment

          But it should show ALL direct upstream projects which triggered the build.
          And in combination with that the current threshold for the recursion depth (of 11) is already resulting in too big output.

          So what kind of threshold are you proposing then?

          Jesse Glick added a comment

          it should show ALL direct upstream projects which triggered the build

          Probably yes.

          in combination with that the current threshold for the recursion depth (of 11) is already resulting in too big output

          Right, that is the bug here.

          what kind of threshold are you proposing

          No exact proposal, but perhaps something like: always display all direct upstream causes, and then use DeeplyNestedUpstreamCause for indirect causes when either a certain depth is exceeded, or the total number of non-direct causes exceeds some threshold.
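A minimal sketch of that policy, assuming a simple tree of causes rather than Jenkins' real `Cause` classes: every direct cause is kept, and all indirect causes share one budget, so a placeholder (standing in for `DeeplyNestedUpstreamCause`) is emitted once either the depth limit or the shared budget is hit. The class `BoundedCauses`, its methods and the example limits (depth 10, 25 indirect entries) are hypothetical; the thread gives no exact numbers.

```java
import java.util.ArrayList;
import java.util.List;

final class BoundedCauses {
    // Hypothetical stand-in for an upstream cause (not Jenkins' Cause.UpstreamCause):
    // a build name plus the causes of that upstream build.
    static final class Cause {
        final String name;         // e.g. "job7"; null marks a truncation placeholder
        final List<Cause> upstream;

        Cause(String name, List<Cause> upstream) {
            this.name = name;
            this.upstream = upstream;
        }

        // Plays the role of DeeplyNestedUpstreamCause in this sketch.
        static Cause truncated() {
            return new Cause(null, List.of());
        }
    }

    // Keep every direct cause, but let all of them share ONE budget of indirect
    // causes; past maxDepth, or once the budget is spent, emit a placeholder.
    static List<Cause> trim(List<Cause> direct, int maxDepth, int maxIndirect) {
        int[] budget = {maxIndirect}; // mutable counter shared across siblings
        List<Cause> out = new ArrayList<>();
        for (Cause c : direct) {
            out.add(new Cause(c.name, trimIndirect(c.upstream, 1, maxDepth, budget)));
        }
        return out;
    }

    private static List<Cause> trimIndirect(List<Cause> causes, int depth, int maxDepth, int[] budget) {
        if (causes.isEmpty()) return List.of();
        if (depth > maxDepth || budget[0] <= 0) return List.of(Cause.truncated());
        List<Cause> out = new ArrayList<>();
        for (Cause c : causes) {
            if (budget[0] <= 0) {
                out.add(Cause.truncated());
                break;
            }
            budget[0]--;
            out.add(new Cause(c.name, trimIndirect(c.upstream, depth + 1, maxDepth, budget)));
        }
        return out;
    }

    // Number of real (non-placeholder) entries that would be logged.
    static int countNamed(List<Cause> causes) {
        int n = 0;
        for (Cause c : causes) {
            if (c.name != null) n++;
            n += countNamed(c.upstream);
        }
        return n;
    }

    // Linear chain: prefix<length> was triggered by prefix<length-1>, down to prefix1.
    static Cause chain(String prefix, int length) {
        Cause c = null;
        for (int i = 1; i <= length; i++) {
            List<Cause> up = (c == null) ? List.of() : List.of(c);
            c = new Cause(prefix + i, up);
        }
        return c;
    }

    public static void main(String[] args) {
        List<Cause> direct = new ArrayList<>();
        for (int k = 1; k <= 40; k++) direct.add(chain("job", k));
        List<Cause> trimmed = trim(direct, 10, 25);
        System.out.println("direct causes kept: " + trimmed.size());
        System.out.println("entries logged: " + countNamed(trimmed));
    }
}
```

For 40 direct causes shaped like the earlier example (job k carries the chain k-1, ..., 1), this keeps all 40 direct causes and logs 65 entries in total; with an unlimited budget and only the depth limit, the same input keeps 385 entries. The shared counter is what gives the total bound; a per-cause limit alone cannot provide it.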

          In fact this is exactly what JENKINS-15747 is supposed to do (it was JENKINS-14814 which fixed only the depth). There is a test demonstrating that it works at least in some circumstances. Perhaps you are hitting some other corner case. Add a test for it and the fix should follow.

          Dirk Thomas added a comment (edited)

          I will describe two concrete cases to have a baseline for the further discussion.

          Case (A) ( https://gist.github.com/dirk-thomas/9bbd47397e48ef3ceef8 ):

          A job "leaf" has only a single upstream dependency on "before_leaf".
          And "before_leaf" has many (in this example 40: "01" to "40") upstream dependencies.
          Each upstream dependency "N" has "N-1" as its upstream dependency.

          The "before_leaf" job will list all 40 upstream causes.
          Each upstream cause on its own is limited to a recursive depth of 10 (according to `MAX_DEPTH`).

          The "leaf" job has a single upstream cause ("before_leaf").
          The `Set<String> traversed` in the `UpstreamCause` prevents listing repeated upstream causes of the single upstream cause.

          Case (B) ( https://gist.github.com/dirk-thomas/37febb42abeb8631f946 ):

          A job "leaf" has only a single upstream dependency on "before_leaf".
          And "before_leaf" has several (in this example 5: "a15" to "e15") upstream dependencies.
          Each upstream dependency "xN" has "xN-1" as its upstream dependency.

          Recursive upstream causes are usually "terminated" by a `DeeplyNestedUpstreamCause` when `MAX_DEPTH` is reached.
          However, `MAX_LEAF` prevents adding that `DeeplyNestedUpstreamCause` at the end of the recursion once the number of distinct causes traversed has reached 25 (`MAX_LEAF`).
          This can be seen in the "leaf" job of case (B).
          (I don't understand why skipping the `DeeplyNestedUpstreamCause` when aborting the recursion makes a big difference, though: it does not affect the log size significantly, and it carries valuable information, namely that the recursion has been aborted.)
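A simplified model of the per-cause trimming described above, useful for reproducing the case (B) numbers without a Jenkins instance. It is not the Jenkins source: the class `PerCauseTrim`, its methods and the exact depth semantics (a chain contributes at most `MAX_DEPTH` entries) are assumptions. The constants 10 and 25 are the `MAX_DEPTH` and `MAX_LEAF` values quoted in this thread.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class PerCauseTrim {
    static final int MAX_DEPTH = 10; // values as quoted in this thread
    static final int MAX_LEAF = 25;

    // Hypothetical cause node; a null name marks a "deeply nested" placeholder.
    static final class Cause {
        final String name;
        final List<Cause> upstream;

        Cause(String name, List<Cause> upstream) {
            this.name = name;
            this.upstream = upstream;
        }
    }

    // Simplified model of the trimming described above, applied to ONE direct cause.
    // The traversed set lives inside that cause and knows nothing about its siblings.
    static Cause trim(Cause c, int depth, Set<String> traversed) {
        List<Cause> kept = new ArrayList<>();
        if (depth < MAX_DEPTH) {
            for (Cause up : c.upstream) {
                if (traversed.add(up.name)) { // skip builds already listed under this cause
                    kept.add(trim(up, depth + 1, traversed));
                }
            }
        } else if (!c.upstream.isEmpty() && traversed.size() < MAX_LEAF) {
            kept.add(new Cause(null, List.of())); // recursion aborted: leave a marker
        }
        return new Cause(c.name, kept);
    }

    // Returns {named entries, placeholder markers}.
    static int[] count(Cause c) {
        int[] n = {c.name != null ? 1 : 0, c.name == null ? 1 : 0};
        for (Cause up : c.upstream) {
            int[] m = count(up);
            n[0] += m[0];
            n[1] += m[1];
        }
        return n;
    }

    // Linear chain: prefix<length> was triggered by prefix<length-1>, down to prefix1.
    static Cause chain(String prefix, int length) {
        Cause c = null;
        for (int i = 1; i <= length; i++) {
            List<Cause> up = (c == null) ? List.of() : List.of(c);
            c = new Cause(prefix + i, up);
        }
        return c;
    }

    public static void main(String[] args) {
        // Case (B): before_leaf was triggered by a15..e15, each heading a 15-build chain.
        List<Cause> heads = new ArrayList<>();
        for (String p : List.of("a", "b", "c", "d", "e")) heads.add(chain(p, 15));
        Cause trimmed = trim(new Cause("before_leaf", heads), 0, new HashSet<>());
        int[] n = count(trimmed);
        System.out.println("entries: " + n[0] + ", markers: " + n[1]);
    }
}
```

In this model the single upstream cause of "leaf" expands to 51 entries (before_leaf plus 10 from each of the 5 chains). Only the first two chains get a "deeply nested" marker, because after them the traversed set already holds 20 builds and then 30, which is past `MAX_LEAF`.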

          Based on these I identified two problems.

          Problem (A): the thresholds are applied inside each `UpstreamCause` in isolation:

          The "before_leaf" job of case (A) has 40 upstream causes.
          Although each one applies some logic to limit the information it records, a single `UpstreamCause` instance does not know about its siblings.
          Therefore it cannot reduce the level of detail when there are many siblings.
          This is not "fixable" in the `UpstreamCause` class itself.
          It would require changes in the code handling the upstream causes to pass in information such as the number of siblings (which arguably an `UpstreamCause` should not need to know about).
          (The problem is the same for the "before_leaf" job of case (B).)

          Problem (B): the depth threshold is independent from the number of upstream causes:

          The "leaf" job of case (B) has only a single upstream cause.
          But this upstream cause outputs every upstream cause up to the recursion limit.
          This results in N x 10 upstream causes where N is the number of upstream causes of the single upstream cause of the job.
          A "combined" limit would probably make much more sense in this case.
          For example, limit each recursion not to a depth of 10 but potentially less as the number of sibling upstream causes at the first level increases.
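One possible form of such a combined limit, sketched under stated assumptions: a total budget (25 here, an arbitrary choice) is split evenly across the first-level siblings, and each sibling's depth is clamped between 1 and the fixed maximum of 10, so every direct cause is still listed. The class `CombinedLimit`, its methods, and the uniform 15-build chains are hypothetical.

```java
// Hypothetical combined limit: a total budget of entries is split evenly across
// the first-level sibling causes, so each sibling's recursion depth shrinks as
// the number of siblings grows (never below 1, never above maxDepth).
final class CombinedLimit {
    static int depthPerSibling(int siblings, int totalBudget, int maxDepth) {
        if (siblings <= 0) return 0;
        return Math.max(1, Math.min(maxDepth, totalBudget / siblings));
    }

    // Entries logged for `siblings` chains of `chainLength` builds, each shown to `depth` levels.
    static int entries(int siblings, int chainLength, int depth) {
        return siblings * Math.min(chainLength, depth);
    }

    public static void main(String[] args) {
        int budget = 25, maxDepth = 10, chainLength = 15;
        // 5 siblings as in case (B); 40 siblings as a hypothetical wider fan-in.
        for (int siblings : new int[] {5, 40}) {
            int fixed = entries(siblings, chainLength, maxDepth);
            int d = depthPerSibling(siblings, budget, maxDepth);
            int combined = entries(siblings, chainLength, d);
            System.out.println(siblings + " siblings: fixed depth " + maxDepth + " -> " + fixed
                    + " entries, combined depth " + d + " -> " + combined + " entries");
        }
    }
}
```

For the 5 chains of case (B) this halves the output from 50 to 25 entries (depth 5 per chain). With 40 siblings it degrades to depth 1, which is exactly the "list every direct upstream project, nothing recursive" behaviour proposed earlier in the thread. The price is that a lone sibling still gets the full depth of 10.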

          (I am unable to provide a Java unit test since I lack experience programming in Java, but the Groovy examples should be very specific and hopefully easy to transfer into a unit test by an experienced Jenkins/Java programmer.)

          Jesse Glick added a comment

          Thank you for the analysis. I do not personally expect to have time to work on a fix, but perhaps someone else will.

          Dirk Thomas added a comment

          Can someone provide some insight how the problem (A) could be addressed?

          If we could sketch the desired solution together I might try to work on a patch for it.
          But currently the algorithmic approach is not clear to me, especially how to perform the limiting without affecting too many external classes (outside of `UpstreamCause`).

          Dirk Thomas added a comment

          Since the previous URL is about to become unavailable, here is a current one that demonstrates the ridiculous amount of redundant information being logged: http://build.ros.org/job/Ibin_uT64__desktop_full__ubuntu_trusty_amd64__binary/1/

          Dirk Thomas added a comment

          Another case with the console output containing 338,653 lines with nothing more than `Started by upstream project` and `originally caused by` lines: http://build.ros.org/job/Mrel_sync-packages-to-testing_bionic_amd64/529/

            Assignee: Unassigned
            Reporter: Dirk Thomas
            Votes: 5
            Watchers: 5