-
Bug
-
Resolution: Fixed
-
Major
-
Jenkins core 1.532.2
Build-timeout plugin 1.13
Upon upgrading plugin "build-timeout" from 1.12.2 to 1.13, all builds took about 50% longer to complete. This resulted in an unacceptably long build queue and builds timing out. This continued for 6 or 7 hours until an emergency downgrade of "build-timeout" from 1.13 back to 1.12.2 was done, which cleared the problem immediately. No other plugins were upgraded or downgraded during this time, nor were any other system wide configuration changes made.
I suspect the use of "synchronized" in the source code change made writing to the console effectively single-threaded for all running builds. (My Jenkins instance has 125 slave nodes, so I have several dozen concurrent builds running all the time.)
Priority: I've made this "Major" because merely installing this plugin version makes my Jenkins instance unusable: builds slow down, which causes a growing build queue and timed-out builds.
Environment: I'm running Jenkins core LTS version 1.532.2. I'll be glad to furnish more information as you request it.
[JENKINS-23012] Build-timeout plugin causes builds to slow
No, I cannot subject my employer's production jenkins to this problem again because that would impact hundreds of developers. That's unfortunate, because I know you need a place to investigate this problem.
Can you think of another way to proceed? If you think you know the problem and can generate a version that you think is 70% likely to fix this problem, then I will risk my jenkins instance on trying it.
I agree that it's risky to install testing versions in the production environment. I'm not yet sure what the root cause is or how to fix it.
I'll try to reproduce it in my local environment.
Please let me know the following:
- OS of the master node and slave nodes.
- Outline of your build process. For example, running maven, running gcc, or running another native process.
- I think whether builds are performed in Java or native processes can affect this problem.
- How much log output is there? I think the size of the whole log and the time builds take will be helpful.
- I think a large amount of log output may trigger the problem.
- Can you see what process gets slow in builds?
- If the build processes output timestamps, please compare them before and after downgrading the build-timeout plugin.
- If you have the Timestamper plugin installed, please compare timestamps in console outputs.
- Timestamps logged with timestamper-plugin may differ from the activity of the build process, as they can be buffered and delayed.
- If you don't have timestamper-plugin installed, it's better not to install it, as that plugin also captures log output and may cause the same problem.
I think there are the following possible causes of slow builds:
- Native processes launched in builds get slow.
- Like processes launched with "Execute shell".
- I don't think Jenkins can affect native processes, as they should be completely separated by the OS.
- But slowed log output can fill the output buffers of processes and may cause the processes to stall for a while.
- Jenkins takes a long time to proceed through build steps.
- In this case, native processes don't get slow.
- Jenkins takes a long time to start and stop builds.
- This can be caused by synchronized.
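The point above about slowed log output filling a process's output buffer can be illustrated with a small in-process analogy. This is only a sketch, not Jenkins or plugin code: `BackpressureSketch` and `timeWrite` are hypothetical names, and a `PipedOutputStream` stands in for the OS pipe between a native build process and whatever consumes its log. The writer stalls once the pipe is full and resumes only when the slow reader drains it.

```java
import java.io.IOException;
import java.io.PipedInputStream;
import java.io.PipedOutputStream;
import java.io.UncheckedIOException;

// Hypothetical illustration; none of these names come from Jenkins or the plugin.
class BackpressureSketch {

    /**
     * Writes totalBytes into a pipe of pipeSize bytes whose reader starts
     * draining only after readerDelayMillis. Returns how long the write took (ms).
     */
    static long timeWrite(int pipeSize, int totalBytes, long readerDelayMillis) {
        try {
            PipedInputStream in = new PipedInputStream(pipeSize);
            PipedOutputStream out = new PipedOutputStream(in);

            // Stand-in for a slow log consumer.
            Thread reader = new Thread(() -> {
                try {
                    Thread.sleep(readerDelayMillis);
                    byte[] buf = new byte[256];
                    while (in.read(buf) != -1) {
                        // Discard the data; only the timing matters here.
                    }
                } catch (IOException | InterruptedException e) {
                    throw new IllegalStateException(e);
                }
            });
            reader.start();

            long start = System.nanoTime();
            out.write(new byte[totalBytes]); // blocks while the pipe is full
            long elapsedMillis = (System.nanoTime() - start) / 1_000_000;

            out.close();   // signals end of stream to the reader
            reader.join();
            return elapsedMillis;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Pipe smaller than the output: the writer has to wait for the reader.
        System.out.println("small pipe: " + timeWrite(1024, 4096, 300) + " ms");
        // Pipe larger than the output: the writer finishes without waiting.
        System.out.println("large pipe: " + timeWrite(8192, 4096, 300) + " ms");
    }
}
```

With the small pipe, the write should take at least roughly as long as the reader's delay; with the large pipe it should return almost immediately. Real OS pipe sizes and the behavior of a native process vary by platform, so this only shows the mechanism.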
I've put my reply in-line with a copy of your questions. My text is dark red to make it easier to follow. Again, thank you for your attention to this.
- OS of the master node and slave nodes.
- Master and slaves are Windows Server 2008 R2 Standard.
- Outline of your build process. For example, running maven, running gcc, or running other native process.
- I think whether builds are performed in Java or native processes can affect this problem.
- There are 561 defined projects that have built in the past two weeks, so I'll have to generalize.
- Our projects are 80 to 90% "free-style software projects", the rest being "multi-configuration projects".
- We use Git and Gerrit for source code management, however some projects select "none" and use multiple git repos.
- Our build-steps are normally "execute shell" or "execute windows batch command"
- Overwhelmingly we use WAF for building C/C++ source files; that's both with licensed compilers and "free" compilers.
- For authorization, we use "project-based Matrix Authorization Strategy" with 40 defined users plus anonymous. Some projects also enable project-based security.
- How much log outputs? I think the size of whole log and the time builds take will be helpful.
- I think much log outputs may trigger the problem.
- Quick survey says... 20,000 to 40,000 lines of text for our most popular projects.
- These builds average 25 to 50 minutes; strangely, the quicker projects tend to generate more logs.
- Can you see what process gets slow in builds?
- If the build processes output timestamps, please compare them before and after downgrading the build-timeout plugin.
- If you have the Timestamper plugin installed, please compare timestamps in console outputs.
- Timestamps logged with timestamper-plugin may differ from the activity of the build process, as they can be buffered and delayed.
- If you don't have timestamper-plugin installed, it's better not to install it, as that plugin also captures log output and may cause the same problem.
- The Timestamper plugin is installed, but only some projects use it. However, the builds in question ran so long ago that they are no longer saved, so I cannot examine their timestamps.
- I do have some data captured in a database about those builds. That's data like job-name, build-number, result, build-duration, and interestingly, excerpts from the logs for failed/aborted builds.
Thanks for the information.
I tried to reproduce the problem using native processes, and I think I succeeded.
I'll continue the investigation.
How I reproduced it:
- Installed build-timeout-plugin 1.13.
- Created a free-style project with "Execute shell":
    #!/bin/bash
    # Print 300 x 65535 lines to generate a large volume of console output.
    for i in $(seq 300); do
      for j in $(seq 65535); do
        echo ${i} ${j}
      done
    done
- I tested this on Windows 8, using cygwin for 64 bit.
- Run a build with and without "Abort the build if it's stuck".
- "Absolute" Timeout strategy with 30 minutes.
Result:
Condition | Duration |
---|---|
Without build-timeout | 10 minutes |
With build-timeout | 12 minutes |
I also found that the duration rises to 32 minutes if I enable timestamper-plugin. Amazing.
- I do have some data captured in a database about those builds. That's data like job-name, build-number, result, build-duration, and interestingly, excerpts from the logs for failed/aborted builds.
If that data includes the amount of log output, I'd like to know whether the amount of log output affects how much the builds slow down.
I identified the cause: watching log output.
Condition | Duration |
---|---|
Without build-timeout | 10 minutes |
With build-timeout | 12 minutes |
With build-timeout, log watching disabled | 10 minutes |
Wow, great job reproducing the problem!
Questions:
- Do you have any special advice I should pass on to my Jenkins users regarding their behavior to avoid slow builds?
- What's the next step in getting the fix?
And thank you so much for your work on this!
Do you have any special advice I should pass on to my Jenkins users regarding their behavior to avoid slow builds?
As this is a problem in the plugin, there's nothing users must do.
But my investigation indicates the following:
- A large volume of rapid log output can slow builds.
- Reducing log output may speed up your builds.
- Writing logs to a file and archiving it as an artifact may work, but that depends on CPU, network bandwidth, and disk I/O.
What's the next step in getting the fix?
I plan to make a new release this month that includes other fixes and improvements.
Please wait for that.
Please upgrade the plugin carefully, as I'm not 100% sure this will resolve your problem. You should be ready to downgrade the plugin if the problem reappears.
If you can, it would be better to set up a testing environment.
Code changed in jenkins
User: ikedam
Path:
src/main/java/hudson/plugins/build_timeout/BuildTimeOutStrategy.java
src/main/java/hudson/plugins/build_timeout/BuildTimeoutWrapper.java
http://jenkins-ci.org/commit/build-timeout-plugin/e26ce053ef23921d164f46515d7d0ebb7e30c398
Log:
[FIXED JENKINS-23012] Resolved a performance problem introduced in 1.13 by capturing log outputs.
Code changed in jenkins
User: ikedam
Path:
src/main/java/hudson/plugins/build_timeout/BuildTimeOutStrategy.java
src/main/java/hudson/plugins/build_timeout/BuildTimeoutWrapper.java
http://jenkins-ci.org/commit/build-timeout-plugin/3a0984b42932286d21acc36f19dd96e315655fbb
Log:
Merge pull request #26 from ikedam/feature/JENKINS-23012_PerformanceProblem
JENKINS-23012 Resolved a performance problem introduced in 1.13 by capturing log outputs.
Compare: https://github.com/jenkinsci/build-timeout-plugin/compare/f71585212231...3a0984b42932
Fixed version is released as 1.14.
It will be available in a day.
Please try that version.
Please upgrade the plugin carefully, as I'm almost but not 100% sure this will resolve your problem.
You should be ready to downgrade the plugin if the problem reappears.
I'll close this ticket once you've confirmed that it works.
I might be able to install it over the weekend. And I'll be careful.
Thanks for your help with this! I'll be in touch.
Hello. Thanks for your fix. I have installed and tested 1.14 in our multi-master Jenkins CI and it seems happy. I was not able to reproduce the performance issue.
Thanks for the fix report.
Please reopen this issue if the problem remains.
I just noticed this ticket now, so sorry for the late comment. But taking a quick look at the code, the performance issue is quite obvious: in BuildTimeoutWrapper.decorateLogger you are only overriding write(byte) in the wrapper OutputStream - that is hugely inefficient, as most writes occur through write(byte[]) or write(byte[], int, int), and those translate to calling write(byte) one byte at a time. So, override write(byte[], int, int) too (write(byte[]) can be left as is) and I think you'll find a noticeable improvement.
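The difference described in this comment can be sketched as follows. This is a minimal illustration, not the plugin's actual code: `LogWatcherSketch`, `PerByteWatcher`, `BulkWatcher`, and the `calls` counter are hypothetical stand-ins for whatever "record log activity" hook the wrapper runs. Note that in Java the single-byte method is `write(int)`, and the default `OutputStream.write(byte[], int, int)` loops over it one byte at a time.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;

// Hypothetical illustration; not the build-timeout plugin's real classes.
class LogWatcherSketch {

    /** Watches activity but overrides only write(int), as the comment describes. */
    static class PerByteWatcher extends OutputStream {
        final OutputStream delegate;
        long calls; // how many times the watcher hook ran

        PerByteWatcher(OutputStream delegate) {
            this.delegate = delegate;
        }

        @Override
        public void write(int b) throws IOException {
            calls++;           // stand-in for "record log activity"
            delegate.write(b); // one delegate call per byte
        }
    }

    /** Same watcher, but a bulk write is forwarded to the delegate as one call. */
    static class BulkWatcher extends PerByteWatcher {
        BulkWatcher(OutputStream delegate) {
            super(delegate);
        }

        @Override
        public void write(byte[] b, int off, int len) throws IOException {
            calls++;                     // hook runs once per chunk
            delegate.write(b, off, len); // whole chunk forwarded at once
        }
    }

    /** Writes data once through the watcher and returns how often its hook ran. */
    static long hookCalls(PerByteWatcher watcher, byte[] data) {
        try {
            // OutputStream.write(byte[]) delegates to write(byte[], 0, length).
            watcher.write(data);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return watcher.calls;
    }

    public static void main(String[] args) {
        byte[] line = new byte[80]; // one 80-byte "log line"
        System.out.println(hookCalls(new PerByteWatcher(new ByteArrayOutputStream()), line)); // prints 80
        System.out.println(hookCalls(new BulkWatcher(new ByteArrayOutputStream()), line));    // prints 1
    }
}
```

For a single 80-byte write, the per-byte variant runs its hook and the delegate write 80 times, while the bulk variant does both once. How much of the measured slowdown this accounts for in the plugin is not established by this sketch.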
As those synchronized methods are called only at the start and the end of builds, I don't think they cause the slowdown.
Rather, the changes that watch log output may be the cause.
https://github.com/jenkinsci/build-timeout-plugin/commit/2129c5d8fc4a9d9432cf95ce34ce522e646eb3ff#diff-891dfa43e0d85dea7162d46b430299d7R184
I want to make some testing versions to identify the cause.
But I'm not sure how to reproduce the problem.
Could you try testing versions if I provide them?