Type: Bug
Resolution: Unresolved
Priority: Major
Labels: None
Environment: Jenkins 2.204.5, warnings-ng 8.1.0, Git Forensics 0.7.0
Summary
1) warnings-ng marks messages as new although they are outstanding in some cases.
2) When this happens more than once, it can also corrupt the age information.
It seems that adding Git Forensics might help to improve the detection algorithm. Some thoughts about what can be done will be added as a comment.
The problem
As you can see, the message at line 9264 was marked as new, therefore its age was reset to 1,
but all lines were committed at the same time (only part of the line is shown) and were not changed in the last years.
Moreover, every build in which the detection fails assigns a new referenceId to that message, and as a result the age of the warning is reset to 1.
Consequently, after some builds with changes on other lines, the age of lines that were committed together might be totally messed up, as can be seen below:
you can see that for the same commit we get different ages.
The reason is that those messages were identified (incorrectly) as new over time, and their age was reset to 1 each time.
See the 502_diff.xlsx attachment for more details.
Attachments:
- new_report_head.JPG (43 kB)
- blames_screen.JPG (159 kB)
- blame_cmd2.JPG (191 kB)
- 502_diff.xlsx (26 kB)
Issue links:
- is blocked by: JENKINS-6675 Invalid number of new warnings detected (Open)
- relates to: JENKINS-61392 Use information from SCM forensics to improve detection of new warnings (Resolved)
[JENKINS-61383] Current new messages detection mechanism is not working for simple line insertion in case of many similar adjacent warning
Not sure I understand the question. The file is changed from time to time; I can run raw git commands on this commit and report the result to help track the issue. I ran git blame both on the build system and on the Jenkins slave that runs cppcheck and got the same results, therefore I tend to think it is a problem in the plugin and not in our setup.
It seems that there is a problem in the way the plugin calculates the age for different lines. Moreover, when I rebuilt the same job again, I got different lines that were reported with age 1 and therefore reported as new, which is another indication of a problem.
I see. Then a warning has been marked as new due to a refactoring, am I right? But the warning actually was not new...
AFAIU, the warning has been marked as new due to the fact that the line age was reported (incorrectly) as 1 by the Git Forensics module.
This is not possible; detection of new warnings is based on comparing the warning sets of the current and the reference build. Are there some unrelated source code changes in sim_mgr.c (lines 9264 ±5) in the current build with respect to the reference?
I think that some lines were added. I will check again tomorrow.
All the adjacent lines had the same warning, and some lines were added, so the warnings shifted while some of the messages were still valid for the new lines.
Regardless of the new mechanism: how come Git Forensics reports a different age for the same commit ID?
The age has nothing to do with the Git blames. It is only computed from the properties of the warning. Two warnings are considered the same if they are equal with respect to all properties, or if the source code context is the same. The context is currently computed from the ±5 lines around the warning line.
The Git blame is computed afterwards: for every warning's file and line, git blame is called. It seems that git blame can still map the moved line to the old position.
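To illustrate the idea (this is a simplified sketch, not the actual IssueDifference implementation; the property names are hypothetical):
{code:python}
def context(lines, line_no, radius=5):
    """Source code context of a warning: the lines around its (1-based) line."""
    lo = max(line_no - 1 - radius, 0)
    return "".join(lines[lo:line_no + radius])

def same_warning(cur, ref, cur_lines, ref_lines):
    # Equal with respect to all properties...
    if (cur.type, cur.severity, cur.message, cur.file, cur.line) == \
       (ref.type, ref.severity, ref.message, ref.file, ref.line):
        return True
    # ...or the source code context around the warning line is the same.
    return context(cur_lines, cur.line) == context(ref_lines, ref.line)
{code}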
Thanks, now it makes more sense to me. Somehow I thought that the age was related to the commit, so I was confused. Maybe the name should somehow reflect that this is the reference age.
Anyway, I collected the data to answer your question. The only change in the file was that 2 lines were added before the relevant line. I created an Excel sheet with all the relevant data per my understanding. Please let me know if it helps.
See the Excel sheet for more details.
What warning properties are compared? I assume type and details, without the code line itself; am I correct? Maybe comparing the actual line could help?
How is the source code context comparison done: is it git diff or a plain text comparison?
BTW, where in the source code is this comparison done?
Thanks for the detailed analysis: this is exactly what I assumed. The warning lines changed from #502 to #503. Therefore the warning delta is using the context of the warning (±5 lines) to track new warnings. This does not work in all cases.
This algorithm just uses the texts of the source code from both builds (it is independent of any source code management system).
The source code is located here: https://github.com/jenkinsci/analysis-model/blob/master/src/main/java/edu/hm/hafner/analysis/IssueDifference.java
I saw you changed the issue to a major Improvement. Well, like they say, it depends if you are a buyer or a seller...
From a user point of view it is definitely a bug, and a critical one, because the algorithm does not work in my case (and I have many more examples), therefore it is unreliable. The whole point of characterizing a message as new is that people can devote their energy to new messages (only). But when an old message is marked as a new one, I lose the attention that people were willing to give to the static code analysis tool.
From a developer point of view the algorithm works only in most cases, so it is a call for a major improvement (or for picking a different algorithm).
I tend to think that Jira should serve mainly the user perspective for bugs and the user & developer perspective for improvements. But in this case I personally favor the user perspective.
Anyway, thanks for pointing out the exact code; I will try to come up with a proposal for improvement, hopefully soon.
You can set the priority to the value you like. So if you feel that it needs to be a blocker, then go ahead. For me it does not really make any difference. Typically we use blocker in Jenkins for things that break a plugin, like an NPE or a class loading problem during a release upgrade. These bugs I try to fix asap.
But in this case you still can use the plugin, just the number of new warnings is wrong. And this is something that we can't solve in general, we can just improve the heuristics.
Well, it seems that this comparison is not a simple thing...
Anyway, some initial thoughts about possible improvements and the algorithm, per my currently very limited understanding of the code as of writing this comment.
Currently, I tend to think that comparing message type & severity, description, and the code line itself (git blame commit ID instead of the ±5 lines context) will be strong enough to recognize outstanding warnings and will give better results; however, since I am not sure, I added some more details and thoughts.
For the sake of this discussion I want to focus on changes whose root cause is in the source code. That is to say, I am ignoring, for example, new messages that appear due to a new version of the static analyzer tool that reports more messages.
Maybe we should add another category (even just for the short term, until the algorithm is cleaned up) such as "seems new" side by side with "new", or add an attribute for the level of confidence of the category. (Same for fixed.)
The benefits are:
1) Marking something as new incorrectly does more "damage" than marking it as "seemingly new", from the user perspective.
2) This way it will be easier to find bugs in the "seemingly new" algorithm. I tend to think that in our environment people will devote their energy mainly to cleaning "new" messages and to proving that "seemingly new" messages are not really new, so this way the algorithm can converge faster.
Therefore, if we can find messages that are a "strong match" they will be defined as "outstanding"; if the messages are a "weak match" we can define them as "seemingly new"; and they become "new" if they don't even "weakly match".
The decision of which attributes to compare in order to declare a "weak match" can be changed down the road per the performance of the algorithm in the field.
We can start with an initial definition of the importance of each attribute's equivalence and decide to include only the more important attributes as sufficient for a "weak match".
I tend to think that the importance of the attributes is the following:
0) No diff in the file
1) The message type & severity, description
2) The source code line text == git blame commit ID for the line
3) The file name
4) The location (line, column) in the file
The algorithm (a runnable sketch follows below):
1) Sort the currentIssues report per file (and line).
   For each file:
     If the current file is equal to the reference file (do git diff, since it is very fast):
       For every message in the current file:
         If an "exact/strong" match is found in the reference file:
           mark the message in this file as outstanding,
           set confidence "high/absolute"
         Else if a "weak" match is found in the reference file:
           mark the message in this file as outstanding,
           set confidence "mid"
         For all the rest:
           mark the message in this file as seemingly new,
           set confidence "low"
     If the files are different, try to add git diff information about line movement to build a strong or weak match.
2) In a second pass, try to find code that moved between files.
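To make the first pass concrete, here is a minimal runnable sketch in Python. The Issue record and the strong_match/weak_match predicates are hypothetical placeholders chosen for illustration, not the plugin's API:
{code:python}
from collections import namedtuple

# Hypothetical issue record; the real plugin model has richer properties.
Issue = namedtuple("Issue", "file line type severity message text")

def strong_match(cur, ref):
    # Every property is equal, including the location and the source line text.
    return cur == ref

def weak_match(cur, ref):
    # Same type/severity/message and same line text; the location may differ.
    return (cur.type, cur.severity, cur.message, cur.text) == \
           (ref.type, ref.severity, ref.message, ref.text)

def classify(current, reference):
    """First pass: label each current issue with a category and a confidence."""
    results = {}
    for issue in sorted(current, key=lambda i: (i.file, i.line)):
        refs = [r for r in reference if r.file == issue.file]
        if any(strong_match(issue, r) for r in refs):
            results[issue] = ("outstanding", "high")
        elif any(weak_match(issue, r) for r in refs):
            results[issue] = ("outstanding", "mid")
        else:
            results[issue] = ("seemingly new", "low")
    return results
{code}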
I might try to add more details later, but I will be happy to get your initial feedback on the above.
Thanks for helping to improve that algorithm! I'm not sure if I'm missing something, but for most analysis tools I think we can assume that if the file has not been changed, then the warning properties are still the same. This is already perfectly handled by the current algorithm: it marks all warnings as outstanding that share the same properties. So the interesting part is the remaining ones: the file changed somehow and some warning properties are different. The typical cases are new line numbers due to inserted or removed lines, or renamed content due to a refactoring (method name change, etc.). For these cases we need to find a better algorithm. I see the following ways:
- Current behavior: match the source code text before and after the warning and try to find the same text in the reference file.
- Rather than using a string match we are trying to match the AST of the file. This gives some better results (see https://github.com/jenkinsci/analysis-model/pull/305).
- Rather than using a string match we are using the techniques behind MOSS (https://theory.stanford.edu/~aiken/moss/) to fingerprint the files (see https://github.com/jenkinsci/analysis-model/tree/moss).
- Additionally use the version history to improve 1, 2 or 3.
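For reference, the core technique behind MOSS is the "winnowing" fingerprint algorithm. Here is a minimal sketch of the idea, with a crude whitespace normalization chosen for illustration (this is not the code in the moss branch):
{code:python}
def winnow(text, k=5, window=4):
    """Select a robust subset of k-gram hashes (winnowing, as used by MOSS)."""
    s = "".join(text.split())  # crude normalization: drop all whitespace
    hashes = [hash(s[i:i + k]) for i in range(len(s) - k + 1)]
    if not hashes:
        return set()
    # Keep the minimum hash of every sliding window; duplicates collapse,
    # so small edits only change the fingerprints near the edit.
    return {min(hashes[i:i + window])
            for i in range(max(len(hashes) - window + 1, 1))}

def similarity(a, b):
    """Jaccard similarity of the two fingerprint sets."""
    fa, fb = winnow(a), winnow(b)
    return len(fa & fb) / max(len(fa | fb), 1)
{code}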
Thanks a lot for your detailed answers. It is very kind of you that you are so open to hearing from the community and responding in detail.
Thanks, I don’t take it for granted.
1) I totally agree with you that once the file is not changed, all (exactly) matching warnings can be treated as outstanding, and all the rest as new.
If you ask how come there will be new issues for files that were not changed, then I tend to think that it can happen for 2 (rare, I must say) reasons:
- Tool changes (such as a version upgrade, or a different configuration such as removing suppressions).
- A change in other logically linked files that created a new problem, where the report points to old code as part of the problem. In C++ for example (tool: cppcheck) one can see that some of the messages are not "contained in one line", meaning that there are problems that are multiline, e.g. allocating memory with no free, or testing a pointer against null after using it.
However, since those are quite rare, and it also makes sense from a user perspective that they pop up as new for old code, it is quite simple to handle.
2) Now for the interesting part: what happens when the file was changed?
From the 3 possibilities you mentioned I tend to think that the current behavior is the only one that is valid.
From my point of view plagiarism is a totally different issue/problem that is not relevant here, therefore both AST/MOSS will not help us here (see below why I think so).
So assuming this is the only valid option, what can be improved in the current behavior to overcome the bug?
I tend to think that the only required change is how to compare an issue and how to fingerprint a message.
Therefore my proposal is to:
- Calculate the line shift through git diff, and/or use git blame to ignore the line shift.
- Compare only the exact line (± shift) of the source code (not a range of lines) & the warning message ID & text.
- If needed, compare fingerprints, and when a match is found inform the user that this is a fingerprint match and not an exact match.
Let me explain why I think so. But before that, I want to put it in perspective by adding a disclaimer and some background.
First, I am not an expert on any of those issues, so all I can offer is my (somewhat naïve) thoughts on this issue. I googled to understand what AST/MOSS are and skimmed the code, and I am not a Jenkins expert.
Second, my view is derived from my current mission. I am responsible for introducing static code analysis (SCA) tools into the development process for a very big legacy code base. I tend to think that for this reason I prefer one type of solution over the other, as I will explain.
Ok, so let's take it one by one.
- Why is the current behavior the only one that is valid?
Well, AFAIU, AST/MOSS are plagiarism detection tools that look for copies (with small modifications) of the original code. But for me, another copy of a piece of code which was introduced between commits is new code. Therefore, even if I had 2 warnings on the first copies and now I have 2 more, the 2 new messages should be tagged as new (while the originals are outstanding). This is perfectly valid from a developer point of view, because someone added a copy in another place, so it is a new message. From a developer point of view, if he (or others) added some lines before the line that triggered the warning, or added some whitespace in the line, then "he did not change the line"; but if he/others touched the line (in the extreme, even if it was only whitespace), then it is new.
Look at it this way: let's define the exact line that created the warning message as the "message line".
So we have three types of file changes:
(a) Before the line
(b) In the line
(c) After the line
Clearly the match algorithm doesn't care about (c); the match algorithm & the developer care about (b); and regarding (a), the developer doesn't care but the match algorithm should, since it needs to understand that it will change the line number mentioned in the warning message.
So we can use whichever mechanism brings better results (a sketch of the first option follows below):
- Sum all git changes up to the current line as a line shift, and compare reference line + shift to the current line.
OR
- Compare only the git blame SHA1 of the reference line to the current line (ignore the actual line location and use the line content for comparison).
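A minimal sketch of the line-shift idea, using Python's difflib instead of git diff (an illustration written for this discussion, not the plugin's code):
{code:python}
import difflib

def shifted_line(reference_lines, current_lines, ref_line_no):
    """Map a 1-based line number in the reference file to the current file
    by summing insertions/deletions that occur entirely above it."""
    matcher = difflib.SequenceMatcher(None, reference_lines, current_lines)
    shift = 0
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        # i2 is a 0-based exclusive end, so i2 < ref_line_no means the
        # changed block lies strictly above the warning line.
        if tag != "equal" and i2 < ref_line_no:
            shift += (j2 - j1) - (i2 - i1)
    return ref_line_no + shift
{code}
With the example from earlier in this thread, inserting 2 lines before line 9264 of the reference file would make shifted_line(...) return 9266, so the warning could still be matched to its old position.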
- Why mark a real compare differently than a fingerprint compare?
As I said, I am trying to introduce SCA for a large legacy code base.
So, for this purpose I am looking for:
- An absolutely accurate delta report for "new" issues.
For my needs it is a must, since our current code base creates many warnings and I must be able to focus the energy of the organization only on real new issues; otherwise the effort to introduce SCA is doomed to fail.
- An (almost) accurate delta report for "fixed" issues.
For the next phase of the deployment, I would like to data mine all the fixes of the source code that were done during the development of old versions and try to correlate them with SCA messages. Finding old fixes that are correlated with old SCA messages means that those problems were detected by QA or customers, and can therefore help me prove that it is cost effective to fix even legacy messages, at least for message categories that show correlation with old fixes.
Therefore, I tend to think that letting the user know what is an absolute match and what merely seems to be a match is crucial to keep users' trust in the report categories {new, fixed, outstanding}.
Well, some more info.
We tried an alternative algorithm.
All we do is diff, for each file, the sorted (by line) reference warning report against the current warning report, comparing:
1) The source_file_name, error id, error message, and the source line contents (NOT the line number).
This simple algorithm gives quite excellent results for our case (currently as a POC); a sketch follows below.
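A minimal sketch of this report-diff idea, assuming a simple warning record with file, line, error_id, message, and code fields (hypothetical names chosen for illustration):
{code:python}
import difflib
from collections import namedtuple

Warning = namedtuple("Warning", "file line error_id message code")

def render(report):
    """Render each warning WITHOUT its line number, sorted by location."""
    ordered = sorted(report, key=lambda w: (w.file, w.line))
    return [f"{w.file}|{w.error_id}|{w.message}|{w.code}" for w in ordered]

def delta(reference, current):
    """Diff the two rendered reports: '-' lines are resolved, '+' lines are
    new, unchanged lines are outstanding."""
    resolved, new = [], []
    for line in difflib.ndiff(render(reference), render(current)):
        if line.startswith("- "):
            resolved.append(line[2:])
        elif line.startswith("+ "):
            new.append(line[2:])
    return resolved, new
{code}
Because the line number is omitted, pure line insertions leave the rendered reports identical, so shifted warnings stay outstanding.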
The only "flaws" that we found:
1) Someone changed the source code line but did not resolve the issue.
- This will be shown by our algorithm as resolved on one hand, with a new error reported for the same line on the other hand.
- For our case this is perfectly OK. It is true that the real error was not resolved; however, since someone bothered to touch the line, from our point of view we can mark it as new, and since the code is active it is better to show the error.
- BUT this triggered a new idea that is described below.
2) Reordering the content of the source file without a real change might cause real noise.
- Well, in the POC we did not find such a case in our code base, and we tend to think that this is rare enough and will be accepted by users as a known limitation.
- However, once we find such a case, and assuming users treat it as a bug, we will enhance the simple algorithm to overcome it.
New idea:
The case where the line was changed but the problem still exists, although acceptable to users, is still interesting to identify for data mining.
It seems that we would like to add another category to [new, resolved, outstanding]: touched.
So resolved means that the problem really was solved, and touched means that the line was changed, therefore the full comparison fails, but since the same error type is still being reported for the line we tag it as touched.
So in the case of a changed line, instead of getting 2 messages (resolved + new) we will get only one: touched.
We have not implemented it yet, but will soon.
What do you think?
Just a few comments:
1) The source_file_name, error id, error message, and the source line contents (NOT the line number).
Isn't that almost the same algorithm I am using? Just without using a multi-line context - just the line with the warning. I think we can make that property configurable.
Another thing to consider: currently there is no API to access Git (or any other version control system) to get diffs. So if you plan to extend that part it should be a generic approach that still works if nobody is using Git.
1) Yes, I agree.
The main difference is that we compare:
- the line and not a range of lines, and
- we omit the line number from the comparison.
It would be really great if you could make it configurable.
2) Since we need it quite urgently, I would appreciate your estimate of when you could release such a fix/improvement.
3) I did not understand your comment about diff. There are many diff tools. We currently use a Python lib for diff, but as you mentioned, git diff is a very strong diff tool and can be used from the command line even if the source repository is not Git; see https://stackoverflow.com/a/17194704 for example.
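For reference, git diff can indeed compare arbitrary files outside any repository via its --no-index mode; a small sketch of invoking it from Python (assuming git is on the PATH):
{code:python}
import subprocess

def git_diff_no_index(old_path, new_path):
    """Run git diff on two arbitrary files, even outside a Git repository."""
    result = subprocess.run(
        ["git", "diff", "--no-index", "--unified=0", old_path, new_path],
        capture_output=True, text=True)
    # git diff exits with code 1 when the files differ; that is not an error here.
    return result.stdout
{code}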
I created JENKINS-61607 to track the improvement.
3) I did not understand your comment about diff. There are many diff tools. We currently use a Python lib for diff, but as you mentioned, git diff is a very strong diff tool and can be used from the command line even if the source repository is not Git; see https://stackoverflow.com/a/17194704 for example.
Ah now I understand your idea. This is not possible since there is no previous code available! All we have is the source code of the current build (and the fingerprint hashes of the warnings before).
I am comparing neither the source code nor the fingerprints.
I am comparing the full report of the current warnings to the full report of the reference.
But how do you then find a warning if you add new lines?
Example:
old.file: line 1 and line 3 have the same warning type
new.file: one empty line is added in line 1, i.e. line 2 and line 4 now have the same warning type
How should we find the new warnings without inspecting the source code?
Here is an extract from a real example:
Build # | Category | Error id | File | Line | Code |
991100167 | Outstanding | useInitializationList | collector.cpp | 64 | m_szHotfixName = ""; |
991100167 | Outstanding | useInitializationList | collector.cpp | 65 | m_szBuildTake = ""; |
991100167 | Outstanding | useInitializationList | collector.cpp | 70 | m_szHotfixName = i_szHotfixName; |
991100167 | Outstanding | useInitializationList | collector.cpp | 71 | m_szBuildTake = i_szBuildTake; |
991100167 | Outstanding | useInitializationList | collector.cpp | 125 | m_szProductName = ""; |
991100167 | Outstanding | useInitializationList | collector.cpp | 130 | m_szProductName = i_szProductName; |
991100168 | Outstanding | useInitializationList | collector.cpp | 64 | m_szHotfixName = ""; |
991100168 | Outstanding | useInitializationList | collector.cpp | 65 | m_szBuildTake = ""; |
991100168 | Outstanding | useInitializationList | collector.cpp | 70 | m_szHotfixName = i_szHotfixName; |
991100168 | Outstanding | useInitializationList | collector.cpp | 71 | m_szBuildTake = i_szBuildTake; |
991100168 | Outstanding | useInitializationList | collector.cpp | 129 | m_szProductName = ""; |
991100168 | Outstanding | useInitializationList | collector.cpp | 134 | m_szProductName = i_szProductName; |
As you can see, in build 168, 4 lines were added between lines 71 and 125; as a result, the last 2 warnings moved 4 lines down. However, they are still marked as outstanding.
What we do currently (in the meantime, until we have a real solution) is the following:
1) For every build we keep 2 report files: one with the line numbers and another one without the line numbers.
2) We use diff (since it does smart comparison) between the current and the reference report files that do not contain the line numbers. As you can see, since without the line numbers all the details match, the diff knows how to handle it (even if at line 128 there is a new warning that was not in the last build).
[3) Then, in a second pass, we compare all the details with the line number but without the code itself, in order to find "touched" warnings as explained above; a sketch follows below.]
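A sketch of that second pass, reusing the hypothetical Warning record from the earlier sketch: a warning that shows up as both resolved and new on the same file/line with the same error id (but different code text) is reclassified as touched.
{code:python}
def find_touched(resolved, new):
    """Pair a resolved and a new warning that share everything except the
    code text; such pairs are 'touched' rather than resolved + new."""
    new_keys = {(w.file, w.line, w.error_id, w.message) for w in new}
    return [w for w in resolved
            if (w.file, w.line, w.error_id, w.message) in new_keys]
{code}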
Let me know if this is clear enough. If needed we can do a Zoom meeting (it is popular nowadays...).
I think we are mixing too many things in one issue (warnings that are marked as new but are not, and the case you are describing in the last comment): from my understanding the last two warnings are actually outstanding, so the current algorithm is working just fine: nothing changed for the two warnings, they just moved down 4 lines. In your understanding they are not outstanding?
Maybe it would make sense to create some scenarios (maybe based on a unit test) where we can actually discuss the current behavior and the expected behavior. Otherwise I think we've been going in circles... I'm using such an approach to show that using an AST is better at detecting new warnings. I think I need to generalize this approach in order to get a good test fixture...
Well, again, if needed we can talk.
1) I agree that a good test case is a good idea. Generally I am in favor of tests and TDD.
2) This whole bug report started when I ran cppcheck on many builds and looked at the new and resolved warnings in warnings-ng. I found many times that warnings-ng was confused by line insertions and reported warnings as new in many cases where it should not. For example, see the beginning of the thread above.
3) Since I am not fluent in Java and Jenkins development and I don't have a working setup, we wrote a Python tool, a kind of POC, to see if we can prove that the algorithm works.
4) It seems that it works quite well and does not report the amount of "false" new warnings that warnings-ng does. The problem with what we did is that it is not integrated with Jenkins and the nice graphs that warnings-ng shows.
In other words.
from my understanding the last two warnings are actually outstanding, so the current algorithm is working just fine
The output above is from OUR tool; that is why it shows outstanding. I am quite sure that the current algorithm would mark them as new (I did not test this particular example, but I do have a lot of examples where the current algorithm fails and mine works OK).
The test scenario (expressed as a unit-test sketch below):
Have a file with the same error type but different code on each of 10 lines.
Do a report.
Add 10 lines in the middle, including 2 lines with the same error type.
Add one line with a different error type.
Do another report.
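A hedged sketch of that scenario as a Python unit test, using a line-number-insensitive matcher in the spirit of the proposal (the Warning record and new_warnings helper are hypothetical, not plugin code):
{code:python}
import unittest
from collections import namedtuple

Warning = namedtuple("Warning", "file line error_id code")

def new_warnings(reference, current):
    """Warnings in the current report whose (file, error_id, code) never
    appeared in the reference report, i.e. matching without the line number."""
    seen = {(w.file, w.error_id, w.code) for w in reference}
    return [w for w in current if (w.file, w.error_id, w.code) not in seen]

class InsertedLinesTest(unittest.TestCase):
    def test_inserted_lines_do_not_create_false_new_warnings(self):
        # A file with the same error type but different code on 10 lines.
        reference = [Warning("f.c", n, "E1", f"code{n}") for n in range(1, 11)]
        # Insert 10 lines in the middle: old warnings below shift down by 10...
        shifted = [w._replace(line=w.line + 10) if w.line > 5 else w
                   for w in reference]
        # ...2 inserted lines carry the same error type (with new code),
        # and 1 inserted line carries a different error type.
        current = shifted + [Warning("f.c", 6, "E1", "new code a"),
                             Warning("f.c", 7, "E1", "new code b"),
                             Warning("f.c", 8, "E2", "new code c")]
        # Expected: exactly the 3 inserted warnings are new, despite the shift.
        self.assertEqual(3, len(new_warnings(reference, current)))

if __name__ == "__main__":
    unittest.main()
{code}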
I looked today and saw that you linked JENKINS-6675 (Invalid number of new warnings detected) to this issue.
I think that the example mentioned there can make a good test (BTW, my proposed algorithm would behave correctly there).
Did someone change the file sim_mgr.c?