Hi everyone. A little info on this issue that we've been hitting as well. We're using Jenkins and its Pipeline plugin to run our regression and CI builds for an SoC (system-on-chip) we're designing. As a side effect of what we're doing, some tests (under the parallel step) take a very long time to complete and are very CPU-intensive. Here are some statistics I've noticed:
- CI builds that have a small number (~10) of short tests under the parallel step almost always complete successfully.
- Regression runs that take a few days to complete and have a huge list of longer tests almost always hit this bug.
- If Jenkins was restarted fresh, the chances of hitting this bug are somewhat lower. At first I suspected a memory leak in the Jenkins test-results-analyzer plugin, which (due to our huge, sometimes over 500 MiB, log files) crashed Jenkins often, until I started filtering the logs to mitigate the issue (https://issues.jenkins-ci.org/browse/JENKINS-34134). Tuning the GC options seemed to further improve things (see the JVM-options sketch after this list).
- Since we're running builds on an NFS filesystem, so that tests can later be distributed and executed on different servers to balance the load, I first considered filesystem lag caused by other tasks running on our servers. Accounting for that seemed to improve things a little. The disk load is not evenly distributed: it's heavy at the start of a test (when the model is loaded into memory, and during reset due to extensive logging), and usually very low throughout the rest of the run.
- Another thing I've noticed: some tests seem to fail randomly. According to the simulation log, the simulation process was simply killed (by something other than the OOM killer) rather than getting stuck, and that happens quite early (e.g. the simulation hasn't even completed the reset sequence, which takes a few seconds of real time but usually spews a lot of info into the log).
- In the Blue Ocean UI, all the failures in parallel steps seem to be grouped in the very bottom half of the screen, after 150+ successfully executed tests. Since these are mostly listed in the order they were started (I assume), this makes me believe it might somehow be related to either GC taking a lot of time or memory leaking here and there.
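For reference, this is the kind of GC tuning I mean; the heap sizes and pause target below are illustrative, not a recommendation, and the file location depends on how Jenkins is installed:

```
# /etc/default/jenkins (Debian-style packaging; the location varies by distro/install method)
# Illustrative values only -- size the heap for your own master.
JAVA_ARGS="-Xmx4g -Xms4g \
  -XX:+UseG1GC \
  -XX:+ExplicitGCInvokesConcurrent \
  -XX:+ParallelRefProcEnabled \
  -XX:MaxGCPauseMillis=200"
```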
The grand total is 5-10 tests out of 300+ randomly crashing, and almost always the annoying java.lang.InterruptedException, with the whole run (which took 3-4 days to complete!) just freezing.
Since a full regression almost always triggers the bug, I'm willing to help solve this issue, but as it requires some sacred knowledge about Jenkins' guts that I don't have, I can either post my pipeline script (which will most likely be useless, due to the huge number of internal tools it calls; a stripped-down sketch of its shape is below), or give some of the proposed fixes a try and post the results, since I have the right environment.
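For what it's worth, the overall shape of the script is roughly the following; run_test.sh, the node labels, and the test names are placeholders standing in for our internal tooling:

```groovy
// Stripped-down sketch of our scripted pipeline; everything specific is replaced.
def tests = ['test_reset', 'test_dma', 'test_uart']   // real list is 300+ entries long
def branches = [:]
for (t in tests) {
    def name = t                                      // capture the loop variable for the closure
    branches[name] = {
        node('sim') {                                 // each test grabs an executor on a sim server
            // Long-running, CPU-intensive simulation; output is filtered to keep logs small.
            sh "./run_test.sh ${name} > logs/${name}.log 2>&1"
            junit "results/${name}.xml"
        }
    }
}
node {
    stage('Regression') {
        parallel branches                             // this is where the runs eventually freeze
    }
}
```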