Type: Bug
Resolution: Fixed
Priority: Critical
This bug is the same as https://issues.jenkins-ci.org/browse/JENKINS-15747 but it happens for us again with Jenkins version 1.548 (while it was working in the meantime).
In cases where there are multiple paths through the job dependencies, logging all upstream causes generates a huge build log.
The example job linked with this ticket has a log which is nearly 110 MB in size because of that.
In the previous ticket the issue was addressed by "not only cap total number of transitive upstream causes but also avoid redundantly storing information about upstream causes listed elsewhere".
[JENKINS-21605] Logging all UpstreamCause's floods Jenkins in large setups
Can someone please look into this issue?
The original ticket was already fixed (15747) and this regression makes our logging nearly unusable.
Since you have fixed the original ticket for us could you please look into fixing this regression?
Do not really have time for it. File a pull request if you know how to fix it, or if you are a CloudBees support customer open a ticket.
Could someone please explain to me what the critical bug is here? Checking a few of the linked builds rarely shows more than a dozen or so causes, even the linked build 813 only has a <100 KB build log and the build index page loads without problems.
There might well have been a build with a large build log because of that (the now missing original one), but it seems to have been a freak accident, possibly due to a job not getting scheduled, not something that regularly breaks production for you.
The problem is that the repeated logging of all upstream causes can result in extremely large build logs.
As mentioned in the referenced original ticket that can be even up to 100 MB for a single job.
Having numerous jobs / builds with such extensive logs basically makes Jenkins unable to operate.
The bug has been fixed before but has reappeared since then.
A recent example has "only" 1.5 MB: http://jenkins.ros.org/job/ros-hydro-metapackages_binarydeb_precise_amd64/1102/consoleFull
Just pick a random upstream build (e.g. the first line `project "ros-hydro-map-server_binarydeb_precise_amd64" build number 52`) which triggered this job and see how many repetitions are in the log.
One of our latest builds generated a 154 MB `log` file.
The corresponding `build.xml` file is another 108 MB.
The actual output of the build is only a few KB.
Please see the example job yourself: http://54.183.26.131:8080/view/Manage/job/Irel_sync-packages-to-testing_saucy_i386/1/console
Is this ever happening when it's not the first build of a job that's downstream from hundreds of programmatically triggered builds that are also #1 of their respective jobs and form a very tightly coupled dependency graph? I mean, 'Admin' triggered many builds that were downstream from a dozen others, which probably only happens once when you initially set up your projects.
Because, if you have such a setup and workflow, and cannot handle a single 100+ MB log file, you're doing something wrong IMO.
What I mean is: You mention that this results in unnecessarily long log files, but don't explain why this is a super serious problem. I regularly have log files in the gigabytes, and while they're somewhat annoying to view (definitely not browser compatible in all their glory) I just download them and I'm good. In the new example, the console tail shown by default includes everything that's relevant. I just don't see the huge problem here. Disk space?
Obviously the logged data is highly redundant. And this bug has been fixed before. The actual output of the job is only a few KB so exploding it to hundreds of MB is a serious misuse of resources.
Our Jenkins deployments have tens of thousands of jobs which have a lot of downstream dependencies between them.
The longer these dependency chains get, the longer and more redundant the output gets. It does not scale linearly with the depth of the dependency hierarchy but exponentially, because of the many repetitions of the same dependencies.
(Why are we doing that? Each job builds a certain Debian package and afterwards triggers its downstream dependencies since we cannot guarantee ABI stability.)
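To illustrate that claim, here is a minimal, self-contained counting model (hypothetical diamond-shaped topology, not Jenkins code): every build lists the full recursive cause tree of each direct upstream build, so shared sub-trees are repeated once per path through the graph and the number of printed lines grows exponentially with the depth.

```java
import java.util.*;

// Minimal counting model (hypothetical topology, not Jenkins code): every build
// lists the full recursive cause tree of each direct upstream build, so shared
// sub-trees are repeated once per path through the graph.
public class CauseLineCount {
    // upstream.get(p) = direct upstream projects of p
    static Map<String, List<String>> upstream = new HashMap<>();

    // number of cause lines a naive, uncapped printer would emit for project p
    static long lines(String p) {
        long n = 0;
        for (String up : upstream.getOrDefault(p, List.of())) {
            n += 1 + lines(up); // one line for the cause itself plus its whole sub-tree
        }
        return n;
    }

    public static void main(String[] args) {
        // "diamond" chain: two projects per level, each depending on both
        // projects of the previous level, i.e. two paths per step
        int depth = 20;
        for (int d = 1; d <= depth; d++) {
            for (String s : List.of("a", "b")) {
                upstream.put(s + d, d == 1 ? List.of("root")
                        : List.of("a" + (d - 1), "b" + (d - 1)));
            }
        }
        upstream.put("leaf", List.of("a" + depth, "b" + depth));
        System.out.println(lines("leaf")); // grows like 2^depth (~3 million here)
    }
}
```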
The example job is one of many but each and every job has the same problem. The deeper a job is in the dependency graph, the more severe the problem is. Since we trigger these jobs multiple times per day, this results in many GBs of unnecessary data which keep growing with every build we run and which the master has to keep.
We don't want to reduce the number of kept builds because we value the information available in the (useful part of the) build log. We also can't provide unlimited fast storage to the master.
The argument that the "console tail shown by default includes everything that's relevant" is simply wrong in a lot of cases.
What if the content of interest is just a little bit above the tail?
As you pointed out it is basically not feasible to view it in the browser anymore.
But why do I have to download hundreds of MBs which basically contain 99% garbage?
Yes. It's inconvenient. I don't deny that. It's no different from long SCM checkout logs (e.g. when Subversion plugin downloads a few thousand files) that could be considered unnecessary. Or verbose tools writing tons of data to standard out.
Again, why is this such a huge problem, deserving the priority used for crashes, loss of data, and memory leaks?
If you would like to reduce the severity e.g. to "Major" please feel free to do so.
But I disagree that it is only an "inconvenience" which is not worth addressing.
For us it makes Jenkins almost unusable for large scale deployments.
Either our Jenkins master dies at some point because it runs out of storage or we have to accept "loss of data" in terms of removing build logs far earlier than should be necessary.
Arguably Jenkins would benefit from a global option to not print upstream triggers at all.
The same as SVN might benefit from an option to be less verbose about what it checked out.
But the current listing of highly redundant upstream triggers is simply wrong.
It (the redundant part) does not contain any information at all!
The especially disappointing part is that it was fixed before and nobody seems to care about the regression for a year now.
I fear that this will not change with the severity lowered and the ticket will remain unresolved forever.
because it runs out of storage
Maybe this is a viable solution for you for jobs that are affected?
https://github.com/jenkinsci/compress-buildlog-plugin
Have you considered other workarounds at all, like having "Trigger" jobs that only get all the upstream triggers, and then trigger the actual downstream job via an API call? It's not as nice, but it's simple and works. Since you consider trigger information irrelevant, this would simply kill the chain off. (As would using the API technique for the default trigger mechanism, but that may be more inconvenient.) Dump the trigger job's logs earlier.
The especially disappointing part is that it was fixed before and nobody seems to care about the regression for a year now.
My guess would be that the CloudBees folks will change something for their customers, even for weak/questionable reasons, as long as it's not actively harmful, and that this just doesn't make the relevance threshold otherwise.
Thank you for pointing out potential workarounds.
The compression might be an option but would also add additional load on the master. I don't know if that would be acceptable without further testing. (Update: after taking a brief look at the repository / plugin page I don't think this is mature enough to be used in production.)
Replacing the project dependencies with a custom trigger mechanism wouldn't be difficult (since we generate the jobs anyway) but I don't want to remove that valuable information.
This seems to be working around core Jenkins elements.
Also the trigger reason in the log would be nice if it did not contain the exponentially exploded redundant information.
Like JENKINS-15747 it is clearly a bug. The fix of that issue involved bounding the depth of recursion; perhaps in your case you are hitting unbounded breadth, which is not currently checked. Given the lack of votes, I am guessing that few people have the particular job configuration to trigger this bug. If you need a fix, file a pull request with test; offer a bug bounty; or request help from a commercial support provider such as CloudBees.
I have created a Groovy script which creates some jobs which demonstrate the unbounded breadth you mentioned: https://gist.github.com/dirk-thomas/630084eefb44baa79f15
An example job can also be found here: http://54.183.26.131:8080/view/jenkins21605/job/jenkins21605_leaf/1/
Showing a shallow list of upstream causes from every upstream job is not a real problem in and of itself, because the set of upstream projects is of a fixed size. The problem is when you start showing a somewhat deeper graph, and the upstream projects themselves have interdependencies, because then you can get into an exponentially large set of causes. Duplicated graph nodes should be pruned, and/or the total graph size bounded.
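Sketched as standalone code (a simplified model, not the actual `UpstreamCause` implementation), pruning duplicated nodes could mean remembering which upstream builds have already been printed and only referencing them on later paths:

```java
import java.util.*;

// Simplified model (not Jenkins code): flatten an upstream-cause graph into log
// lines, printing each upstream build in full only once so that shared
// sub-graphs are not repeated for every path that reaches them.
public class PrunedCausePrinter {
    // hypothetical cause graph: "leaf#1" was triggered by "a#1" and "b#1",
    // both of which were originally caused by the same "root#1" build
    static Map<String, List<String>> upstream = Map.of(
            "leaf#1", List.of("a#1", "b#1"),
            "a#1", List.of("root#1"),
            "b#1", List.of("root#1"));

    static void print(String build, int indent, Set<String> seen) {
        for (String up : upstream.getOrDefault(build, List.of())) {
            String pad = " ".repeat(indent);
            if (!seen.add(up)) {
                // already printed elsewhere in this log: just reference it
                System.out.println(pad + "originally caused by " + up + " (listed above)");
                continue;
            }
            System.out.println(pad + "originally caused by " + up);
            print(up, indent + 2, seen);
        }
    }

    public static void main(String[] args) {
        print("leaf#1", 0, new HashSet<>()); // "root#1" is expanded only once
    }
}
```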
I have updated the Groovy script to generate a job topology like this:
01 depends on root
02 depends on 01
...
40 depends on 39
leaf depends on: 01-40
The result can be found here: http://54.183.26.131:8080/view/jenkins21605/job/jenkins21605_leaf/lastSuccessfulBuild/
The build log is already 84 KB in size (just for logging the upstream causes).
I think the "best" approach to address this is not to add another threshold but simply give the user the option not to log the recursive upstream cause.
It is perfectly fine to list all "Started by upstream project" lines for all the upstream dependencies.
But optionally not outputting the ever increasing "originally caused by" lines would reduce the output to linear size (linear in the number of upstream dependencies).
And the user can still navigate the hierarchy of upstream causes by following the upstream build links.
Do you agree that this test is sufficient?
Do you agree that the proposed option would be a reasonable approach to address the problem or do you see a better approach?
No new options. Just limit the amount of information displayed to some reasonable threshold, as in JENKINS-15747.
But it should show ALL direct upstream projects which triggered the build.
And in combination with that, the current threshold for the recursion depth (of 11) already results in output that is too big.
So what kind of threshold are you proposing then?
it should show ALL direct upstream projects which triggered the build
Probably yes.
in combination with that the current threshold for the recursion depth (of 11) is already resulting in too big output
Right, that is the bug here.
what kind of threshold are you proposing
No exact proposal, but perhaps something like: always display all direct upstream causes, and then use DeeplyNestedUpstreamCause for indirect causes when either a certain depth is exceeded, or the total number of non-direct causes exceeds some threshold.
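Expressed as a rough standalone sketch (placeholder names; the thresholds just reuse the 10 / 25 values mentioned in this thread; this is not the actual `UpstreamCause` code), the rule could look like this:

```java
import java.util.*;

// Rough sketch of the trimming rule described above (placeholder names and
// thresholds, NOT the actual Jenkins UpstreamCause implementation):
// - every direct upstream cause is always printed
// - indirect ("originally caused by") causes are cut off with a single marker
//   once a depth limit or a shared budget for indirect causes is exhausted.
public class TrimmedCausePrinter {
    static final int MAX_DEPTH = 10;     // assumed depth cap
    static final int MAX_INDIRECT = 25;  // assumed overall budget for indirect causes

    static Map<String, List<String>> upstream = new HashMap<>(); // example graph

    static void printCauses(String build) {
        int[] budget = {MAX_INDIRECT};
        for (String up : upstream.getOrDefault(build, List.of())) {
            System.out.println("Started by upstream project " + up); // always shown
            printIndirect(up, 1, budget);
        }
    }

    static void printIndirect(String build, int depth, int[] budget) {
        for (String up : upstream.getOrDefault(build, List.of())) {
            if (depth > MAX_DEPTH || budget[0] <= 0) {
                System.out.println("  (deeply nested upstream causes omitted)");
                return;
            }
            budget[0]--;
            System.out.println("  originally caused by " + up);
            printIndirect(up, depth + 1, budget);
        }
    }

    public static void main(String[] args) {
        // topology from earlier in this thread: 01 <- 02 <- ... <- 40, leaf depends on 01..40
        upstream.put("01", List.of("root"));
        for (int i = 2; i <= 40; i++) {
            upstream.put(String.format("%02d", i), List.of(String.format("%02d", i - 1)));
        }
        List<String> direct = new ArrayList<>();
        for (int i = 1; i <= 40; i++) direct.add(String.format("%02d", i));
        upstream.put("leaf", direct);
        printCauses("leaf"); // 40 direct lines; indirect lines bounded by MAX_INDIRECT
    }
}
```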
In fact this is exactly what JENKINS-15747 is supposed to do (it was JENKINS-14814 which fixed only the depth). There is a test demonstrating that it works at least in some circumstances. Perhaps you are hitting some other corner case. Add a test for it and the fix should follow.
I will describe two concrete cases to have a baseline for the further discussion.
Case (A) ( https://gist.github.com/dirk-thomas/9bbd47397e48ef3ceef8 ):
A job "leaf" has only a single upstream dependency on "before_leaf".
And "before leaf" has many (in this example 40: "01" to "40") upstream dependencies.
Each upstream dependency "N" has "N-1" as its upstream dependency.
The "before_leaf" job will list all 40 upstream causes.
Each upstream cause on its own is limited to a recursive depth of 10 (according to `MAX_DEPTH`).
The "leaf" job has a single upstream cause ("before_leaf").
The `Set<String> traversed` in the `UpstreamCause` prevents listing repeated upstream causes of the single upstream cause.
Case (B) ( https://gist.github.com/dirk-thomas/37febb42abeb8631f946 ):
A job "leaf" has only a single upstream dependency on "before_leaf".
And "before leaf" has several (in this example 5: "a15" to "e15") upstream dependencies.
Each upstream dependency "xN" has "xN-1" as its upstream dependency.
Recursive upstream causes are usually "terminated" by a `DeeplyNestedUpstreamCause` when `MAX_DEPTH` is reached.
`MAX_LEAF` prevents adding a `DeeplyNestedUpstreamCause` at the end of the recursion once the number of different causes has reached 25 (`MAX_LEAF`).
This can be seen in the "leaf" job of case (B).
(I don't understand why skipping the `DeeplyNestedUpstreamCause` when aborting the recursion makes a big difference though - it does not affect the log size significantly and it contains valuable information (that the recursion has been aborted)).
Based on these I identified two problems.
Problem (A): limitation of applying the thresholds within each `UpstreamCause`:
The "before_leaf" job of case (A) has 40 upstream causes.
While each of them applies some logic on its own to limit the information, a separate `UpstreamCause` instance does not know about its siblings.
Therefore it cannot adjust the level of information shown when there are many siblings.
This is not "fixable" in the `UpstreamCause` class itself.
This would require some changes in the code handling the upstream causes to pass in additional information, e.g. the number of siblings (which arguably an `UpstreamCause` should not need to know about).
(The problem is the same for the "before_leaf" job of case (B).)
Problem (B): the depth threshold is independent from the number of upstream causes:
The "leaf" job of case (B) has only a single upstream cause.
But this upstream cause outputs every upstream cause up to the recursion limit.
This results in N x 10 upstream causes where N is the number of upstream causes of the single upstream cause of the job.
A "combined" limit would probably make much more sense in this case.
E.g. limit each recursion not to a fixed depth of 10 but to potentially less when the number of sibling upstream causes on the first level increases (sketched below).
(I am unable to provide a Java unit test since I lack the experience programming in Java but the Groovy examples should be very specific and hopefully easy to transfer into a unit test by an experienced Jenkins/Java programmer.)
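A standalone sketch of what such a combined limit could look like (hypothetical code, not a patch against `UpstreamCause`): derive the per-cause recursion depth from a fixed overall budget divided by the number of first-level sibling causes.

```java
import java.util.*;

// Hypothetical sketch of a "combined" limit (not a patch against UpstreamCause):
// the allowed recursion depth per upstream cause shrinks as the number of
// sibling causes on the first level grows, so the total output stays roughly
// proportional to a fixed overall budget.
public class CombinedLimitPrinter {
    static final int TOTAL_BUDGET = 50;  // assumed overall budget of indirect lines

    static Map<String, List<String>> upstream = new HashMap<>(); // example graph

    static void printCauses(String build) {
        List<String> direct = upstream.getOrDefault(build, List.of());
        // fewer levels per cause when there are many siblings, but at least one
        int depthPerCause = Math.max(1, TOTAL_BUDGET / Math.max(1, direct.size()));
        for (String up : direct) {
            System.out.println("Started by upstream project " + up); // always shown
            printIndirect(up, depthPerCause);
        }
    }

    static void printIndirect(String build, int remainingDepth) {
        for (String up : upstream.getOrDefault(build, List.of())) {
            if (remainingDepth <= 0) {
                System.out.println("  (further upstream causes omitted)");
                return;
            }
            System.out.println("  originally caused by " + up);
            printIndirect(up, remainingDepth - 1);
        }
    }

    public static void main(String[] args) {
        // example: five chains a..e of 15 upstream jobs each, all triggering the same downstream job
        List<String> direct = new ArrayList<>();
        for (String p : List.of("a", "b", "c", "d", "e")) {
            for (int i = 2; i <= 15; i++) {
                upstream.put(p + i, List.of(p + (i - 1)));
            }
            direct.add(p + 15);
        }
        upstream.put("leaf", direct);
        printCauses("leaf"); // 5 direct lines, at most 10 indirect levels per chain
    }
}
```

This keeps the total output roughly proportional to the budget no matter how many siblings there are, while every direct upstream project is still listed.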
Thank you for the analysis. I do not personally expect to have time to work on a fix, but perhaps someone else will.
Can someone provide some insight how the problem (A) could be addressed?
If we could sketch the desired solution together I might try to work on a patch for it.
But currently the algorithmic approach is not clear to me - especially how to perform the limiting without affecting too many external classes (outside of `UpstreamCause`).
Since the previous URL is about to become unavailable I will add a current one which demonstrates the ridiculous amount of redundant information being logged: http://build.ros.org/job/Ibin_uT64__desktop_full__ubuntu_trusty_amd64__binary/1/
Another case with the console output containing 338,653 lines with nothing more than `Started by upstream project` and `originally caused by` lines: http://build.ros.org/job/Mrel_sync-packages-to-testing_bionic_amd64/529/
Is there any chance this critical regression will be addressed? This makes Jenkins basically unusable in large setups. And going back to a one year old release where the problem was addressed is not really an option with the latest security fixes.