Please read the whole story as the issue is quite complex and touches at least two problems in Jenkins.
The issue we face almost every day is that, from time to time, part of our Jenkins node fleet breaks and stays broken until we disconnect and reconnect the affected nodes. While a node is broken, all builds on it fail due to an exception in post-build steps (in our case in ProcessTreeKiller), even when the regular build steps succeed. As a consequence, a single broken node can fail multiple builds one after another, even though their build steps are fine. Now imagine a setup with tens or hundreds of nodes and hundreds of jobs in the queue, where suddenly a few nodes break and start failing builds one after another. Cleaning up this mess is a lot of work... And it happens almost every day.
Over time we managed to determine that part of the reason our Jenkins nodes break is described in JENKINS-61103. More than a year ago I shared the details of the exact problem we observe in our setup in a comment. The issue was fixed in one of the latest versions of the remoting library, but we haven't had the chance to try it out yet.
Now back to the Gerrit Trigger plugin. In our case it is the Gerrit Trigger plugin that actually triggers the issue described above. The problem is that the plugin can interrupt the thread executing a given task multiple times. When this happens, we see the following message multiple times on the build details page: "Aborted by new patch set."
Based on experimentation and source code analysis, the problem seems to occur when two or more Jenkins jobs have the Gerrit Trigger plugin configured to be triggered by changes from the same Gerrit project/repository. When a change in Gerrit is created or updated, Gerrit fires an event, which the Gerrit Trigger plugin then processes for each of these jobs separately. The first interrupt usually interrupts our main build step, while the remaining interrupts hit whatever post-build processing is running at that point. This often leads to the issue described in JENKINS-61103, which is fatal.
Now to the details. If a job does not have the "Build current patchsets only" option enabled (so it relies on the global one, which is enabled in our setup), then:
- scheduled() calls cancelOutDatedEvents() without providing the job name (null in the 3rd parameter) associated with the RunningJobs instance / GerritTrigger instance
- then cancelMatchingJobs() doesn't apply any filtering and interrupts the executors of all builds from all jobs triggered by the original event
- the same procedure is repeated for each job triggered by a given Gerrit event which leads to multiple interrupts / aborts...
Interestingly, if a job does have the "Build current patchsets only" option enabled, then:
- cancelTriggeredJob() calls cancelOutDatedEvents() with a specific job name which is then used in cancelMatchingJobs() to filter and interrupt only the jobs matching that name
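To make the difference between the two paths concrete, here is a minimal sketch in plain Java. It is not the plugin's actual code: the `Build` class, the `simulate` helper and the simplified `cancelMatchingJobs` signature are made up for illustration. The only thing it models is the behaviour described above, where a null job name means no filtering.

```java
import java.util.ArrayList;
import java.util.List;

class CancelSketch {
    /** Hypothetical stand-in for a running build; not a Gerrit Trigger class. */
    static final class Build {
        final String jobName;
        int interrupts = 0;
        Build(String jobName) { this.jobName = jobName; }
    }

    /**
     * Simplified model of the filtering described above (assumed signature):
     * a null jobName means "no filter", so every build triggered by the
     * event is interrupted.
     */
    static void cancelMatchingJobs(List<Build> builds, String jobName) {
        for (Build b : builds) {
            if (jobName == null || jobName.equals(b.jobName)) {
                b.interrupts++; // stands in for interrupting the executor
            }
        }
    }

    /**
     * One Gerrit event has triggered one build per job. A new patch set
     * arrives and each job processes it separately, either passing its own
     * name (per-job option enabled) or null (global option only).
     */
    static List<Build> simulate(boolean passJobName, String... jobs) {
        List<Build> builds = new ArrayList<>();
        for (String job : jobs) {
            builds.add(new Build(job));
        }
        for (String job : jobs) {
            cancelMatchingJobs(builds, passJobName ? job : null);
        }
        return builds;
    }

    public static void main(String[] args) {
        for (Build b : simulate(false, "job-a", "job-b", "job-c")) {
            System.out.println("unfiltered " + b.jobName + ": " + b.interrupts);
        }
        for (Build b : simulate(true, "job-a", "job-b", "job-c")) {
            System.out.println("filtered " + b.jobName + ": " + b.interrupts);
        }
    }
}
```

With three jobs listening to the same project, the unfiltered path interrupts every build three times, while the filtered path interrupts each build exactly once. This matches the repeated "Aborted by new patch set." messages we see.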
We made the above observation just yesterday. This morning we enabled the "Build current patchsets only" option in all jobs triggered by the plugin to see if that helps, but of course it's just a workaround. Currently we're testing and monitoring our Jenkins setup...
I'd be happy to discuss the potential fix and contribute it to the community. I just need guidance. I'll probably prepare an initial PR with a fix proposal so that we could discuss more easily.
In the meantime I checked the GitHub PullRequest Builder plugin code, as it offers similar functionality, i.e. it can abort running jobs when a new commit is pushed. The GHPRB plugin does this differently from the Gerrit Trigger plugin: it only checks and aborts builds belonging to the associated project/job. My understanding is that both plugins should be doing the exact same thing here, although I might be wrong of course.
One final observation, and this is more an architectural one for Jenkins itself, is that plugins can easily interrupt post-build steps. Sometimes this is not a big deal, but in some cases it is, for example when a post-build step is freeing critical resources or killing left-over processes like ProcessTreeKiller does. I'd expect that during the post-build phase interrupts would not be propagated directly, but only after some predefined amount of time has elapsed, giving the post-build steps time to finish. Of course one of them could be stuck, so such interrupts should not be ignored completely; they just should not be passed on immediately. I know this is a complex topic and it should be addressed separately, but I wanted to touch on it too, as it's related.
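As a rough illustration of what I mean by delaying interrupts, here is a minimal sketch using only java.util.concurrent. It is not based on any existing Jenkins API: the `GracefulCleanup` class, the `runWithGrace` method and the grace period parameter are all hypothetical. The idea is that the cleanup runs on a worker thread, the first interrupt of the caller starts a grace period, further interrupts neither extend nor shorten it, and the cleanup is only interrupted once the grace period has run out.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

class GracefulCleanup {
    /**
     * Hypothetical helper: runs cleanup on a worker thread and shields it
     * from interrupts of the calling thread for up to graceMillis.
     * Returns true if the cleanup completed normally.
     */
    static boolean runWithGrace(Runnable cleanup, long graceMillis) {
        ExecutorService worker = Executors.newSingleThreadExecutor();
        Future<?> done = worker.submit(cleanup);
        boolean interrupted = false;
        long deadline = 0L; // only meaningful after the first interrupt
        try {
            while (true) {
                try {
                    if (interrupted) {
                        long remaining = deadline - System.nanoTime();
                        if (remaining <= 0) {
                            done.cancel(true); // grace period over
                            return false;
                        }
                        done.get(remaining, TimeUnit.NANOSECONDS);
                    } else {
                        done.get();
                    }
                    return true;
                } catch (InterruptedException e) {
                    if (!interrupted) {
                        // First interrupt: start the grace period.
                        interrupted = true;
                        deadline = System.nanoTime()
                                + TimeUnit.MILLISECONDS.toNanos(graceMillis);
                    }
                    // Repeated interrupts leave the deadline unchanged.
                } catch (TimeoutException e) {
                    done.cancel(true); // cleanup seems stuck, interrupt it now
                    return false;
                } catch (ExecutionException e) {
                    return false; // cleanup itself threw
                }
            }
        } finally {
            worker.shutdownNow();
            if (interrupted) {
                // Restore the flag so the caller still sees the abort.
                Thread.currentThread().interrupt();
            }
        }
    }

    public static void main(String[] args) {
        Thread.currentThread().interrupt(); // simulate an abort arriving early
        boolean finished = runWithGrace(
                () -> System.out.println("killing left-over processes"), 1000);
        System.out.println("cleanup finished: " + finished);
    }
}
```

A stuck cleanup is still interrupted eventually, so the abort is delayed rather than ignored. Where such a wrapper would live in Jenkins, and what a sensible default grace period would be, are open questions.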
Also please note that this issue was very hard to track down. I can imagine many other Jenkins installations suffering from the same issue or a similar one, simply caused by interrupting post-build steps. And there are other plugins which can interrupt running builds, like the Timeout plugin. Also a build can be aborted twice by the user (see JENKINS-59494). In the end if the timing is bad then totally unexpected things might happen...