-
Improvement
-
Resolution: Unresolved
-
Minor
-
None
-
The Jenkins core tests take 2 hours to complete on Linux and 4 hours to complete on Windows. In the Jenkins git client plugin, the tests take 8.5 minutes to complete on Linux and 21 minutes on Windows. A successful outcome would be to find at least one way to reduce this time disparity between the two operating systems when Jenkins tests are run. If we find a solution faster than expected, we may increase the scope of our performance investigation.
https://ci.jenkins.io/job/Core/job/jenkins/job/master/lastSuccessfulBuild/
https://ci.jenkins.io/job/Plugins/job/git-client-plugin/job/master/lastBuild/pipeline-graph/
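To put the disparity in the description into a single number, here is a minimal Python sketch that turns the quoted durations into a Windows-to-Linux slowdown ratio. The durations are the ones stated in the description; the function and variable names are illustrative only.

```python
# Durations quoted in the issue description, converted to minutes.
durations_min = {
    "Jenkins core": {"linux": 2 * 60, "windows": 4 * 60},
    "git-client-plugin": {"linux": 8.5, "windows": 21},
}


def slowdown(linux_min, windows_min):
    """Return how many times longer the Windows run takes than the Linux run."""
    return windows_min / linux_min


for name, d in durations_min.items():
    ratio = slowdown(d["linux"], d["windows"])
    print(f"{name}: Windows takes {ratio:.2f}x as long as Linux")
```

With the quoted numbers, both suites take roughly 2x to 2.5x as long on Windows.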
[JENKINS-72655] Decreasing Time disparity between Linux and Windows automated test run time
This isn't an issue with Jenkins infrastructure. It is an issue that highlights a measurable performance difference between Windows and Linux for the same tests, even when using the same types of processors and same types of disk drives. I consistently see slower performance on the newer Windows computers at my house than on older Linux computers. The examples from ci.jenkins.io are examples of something similar.
I don't know the root reason why Jenkins tests of the git client plugin and some other plugins are much faster on Linux than on Windows, but I invited them to consider that difference as a candidate for a performance investigation.
using the same types of processors and same types of disk drives. I consistently see slower performance on the newer Windows computers at my house than on older Linux computers. The examples from ci.jenkins.io are examples of something similar.
Quick googling finds e.g. https://www.reddit.com/r/linux/comments/w7no0p/why_are_most_operations_in_windows_much_slower/ and https://github.com/Microsoft/WSL/issues/873#issuecomment-425272829 and while neither is a good answer for this issue as filed (one is mostly about WSL, while the other has no definitive solution), I would start by trying to determine whether IO isn't just slower in general on Windows as these references indicate, and what we're seeing here is downstream from that.
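One rough way to check whether plain file IO is slower in general, independent of Jenkins or git, is to run the same small-file micro-benchmark on a Windows machine and a Linux machine with comparable hardware. The sketch below is only an illustration under that assumption: the file count, payload size, and function name are arbitrary choices, not anything used in this issue.

```python
import os
import tempfile
import time


def time_small_file_io(file_count=2000, payload=b"x" * 1024):
    """Create, read back, and delete many small files; return seconds per phase."""
    timings = {}
    with tempfile.TemporaryDirectory() as root:
        paths = [os.path.join(root, f"f{i:05d}.txt") for i in range(file_count)]

        # Phase 1: create and write each file.
        start = time.perf_counter()
        for p in paths:
            with open(p, "wb") as fh:
                fh.write(payload)
        timings["create"] = time.perf_counter() - start

        # Phase 2: open and read each file back.
        start = time.perf_counter()
        for p in paths:
            with open(p, "rb") as fh:
                fh.read()
        timings["read"] = time.perf_counter() - start

        # Phase 3: delete each file.
        start = time.perf_counter()
        for p in paths:
            os.remove(p)
        timings["delete"] = time.perf_counter() - start
    return timings


if __name__ == "__main__":
    for phase, seconds in time_small_file_io().items():
        print(f"{phase}: {seconds:.3f} s")
```

If the per-phase times differ between the two operating systems by a factor similar to the test suites, that would suggest the test slowdown is downstream of general IO cost. A single run is noisy, so repeating it several times on each machine gives a fairer comparison.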
Hi Daniel,
I am part of Joshua's team. I agree with the findings from your first Reddit link: Windows IO is generally slower than Linux because Windows is typically stepping on its own toes in many different ways that all build up to slow IO. From briefly skimming the WSL issue, the /mnt/ drives are under Windows protection, and crossing the Windows/Linux filesystem "membrane" causes a drastic slowdown, because Windows files != Linux files and conversions and checks take place behind the scenes whenever file operations cross it. I remember reading long ago that one of the big slowdowns is WSL converting files to Unix style and then back to DOS style every time a file is accessed across the "membrane".
We will begin investigating whether the test slowness is a property of Windows and, if so, exactly which properties cause it.
I work at Microsoft and develop on Windows/Linux in my day to day, and I witness the exact same issues occurring in our test and build environments. I will see if I can find a company contact that might be able to provide some more info into how we mitigate the discrepancy. One thing I am thinking off the top of my head is what if we use Tiny11 as the test OS? Should be able to remove some of the process bloat that way and bypass any anti-virus scans. This isn't ideal for local dev setups that are on full versions of Windows, but would reduce the pipeline time and improve the time it takes to check in with Windows tests enabled.
For local setups, if this is indeed a Windows property, we can narrow it down and provide recommendations for devs on OS settings to improve performance, such as exclusion folders for antimalware.
Let me know your thoughts on the above.
One thing I am thinking off the top of my head is what if we use Tiny11 as the test OS? Should be able to remove some of the process bloat that way and bypass any anti-virus scans. This … would reduce the pipeline time and improve the time it takes to check in with Windows tests enabled.
Hence my confusion whether this is a general or Jenkins infra issue
To clarify, I don't want to get in the way of people wanting to solve problems and improve the situation, whether for Jenkins infra or more generally. We just get many improperly filed issues, and the first version of this looked like it was in the wrong place (and therefore likely not seen by folks who would be able to assist with infra).
Yup makes sense, we are still learning the Jenkins contribution guidelines and just general layout of the repo. We will do some more investigating on the issue to see if we can get a better sense of where the issue lies and refile or tag along to an open issue. Looks like you might have opened something related on the infra side here: Windows agents are soooooooooo slooooooooooooooooooow · Issue #3117 · jenkins-infra/helpdesk · GitHub?
cherczeg Right, that issue has some interested parties commenting already, feel free to get involved there for any infra side things. I'll also poke colleagues working in infra for awareness.
Hi Jenkins Team, my team focused our attention on the git-client-plugin, trying to reduce the unit testing times. We found that a significant portion of the test runtime was taken by fetching the git-client-plugin repo, since the tests themselves were using the git client repo to test various git commands. Because the repo will continue to grow, the tests will run slower and slower, so we made a smaller repo in its place to check the git commands. We implemented it on this branch off of the git-client-plugin repo: https://github.com/chrisherczeg/oss-performance-git-client-plugin/pull/1. We only implemented these changes in GitClientTest, GitClientCloneTest, JGitAPIImplTest, and JGitApacheAPIImplTest, and benchmarked them using these scripts we developed: https://github.com/chrisherczeg/git-cilent-plugin-benchmark/tree/non_exclusion/cherc/20240417120858.= and https://github.com/chrisherczeg/git-cilent-plugin-benchmark/tree/exclusion/cherc/20240417120858.
Due to timing constraints with our class, we weren't able to integrate the change into all the other test cases. When you run all the tests, and not just the ones we updated, you get failures from tests trying to locate the repo branch, since we did not update it for all tests. For the tests we did update, we saw the time reduce by 18.6% for GitClientTest, 12.72% for GitClientCloneTest, 17.6% for JGitAPIImplTest, and 20.16% for JGitApacheAPIImplTest.
Apart from the small repo approach, we found that Microsoft Dev Drive, a built-in feature in Windows 11 (it can be activated in General Settings), gave a time reduction of 11.7% when running the GitClientCloneTest. But due to timing, we kept our focus on the smaller repo approach as a time saver.
Hopefully, these findings are useful.
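For anyone repeating the comparison, percentage reductions like the ones above can be derived from raw timings as (baseline - changed) / baseline. The Python sketch below shows that arithmetic using the median of several runs. The timing lists are made-up placeholders, not measurements from this thread, and the function name is illustrative only.

```python
from statistics import median


def percent_reduction(baseline_seconds, changed_seconds):
    """Percent decrease of the median changed run time relative to the median baseline."""
    base = median(baseline_seconds)
    new = median(changed_seconds)
    return (base - new) / base * 100


# Placeholder timings in seconds, NOT measurements from this issue.
baseline_runs = [100.0, 104.0, 98.0]
small_repo_runs = [80.0, 83.0, 79.0]

print(f"{percent_reduction(baseline_runs, small_repo_runs):.1f}% reduction")
```

Using the median rather than a single run reduces the influence of one unusually slow or fast execution, which matters when the differences being compared are in the 10-20% range.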
Thanks very much, jerondon. I'll copy those changes into my git client plugin repository on a development branch so that I can review them in the future.
Thanks also for the pointer to Microsoft Dev Drive. I had not heard of that facility and am interested to see how it performs on my Windows 11 machine.
The Microsoft JDK also performed marginally better than OpenJDK, by 3-5% in our tests. Surprisingly, our benchmarking showed that disabling Windows Defender antivirus (by adding an exclusion path for the Git repo) actually slowed the tests down, by about 8-10% for the tests we sampled. We don't have a clear explanation; I've done some digging into why this might be but did not find anything conclusive. My current theory is that when Windows Defender is enabled and a file write is performed, Defender intercepts the write and reports to the program that it has completed. A short time later, after the scan, Defender performs the actual disk write and releases whatever lock it held on the file. In these tests we wouldn't really see an impact from Defender holding the file for the scan, since it's rarely the case that a file needs to be opened for modification immediately after a git operation.
Jenkins infra issues are generally tracked in the helpdesk: https://github.com/jenkins-infra/helpdesk/
The INFRA project in this Jira is no longer in use, and JENKINS is also wrong unless this is more general than Jenkins project infra. Depends on what your intended scope is.