-
Type: Epic
-
Resolution: Unresolved
-
Priority: Major
-
Component/s: cucumber-reports-plugin, github-autostatus-plugin
-
None
-
Environment: Jenkins LTS 2.528.x
-
FIX-CME
Initial context:
As discussed on Gitter during autumn 2025, several Jenkins deployments that I know intimately have seen long-running jobs fail intermittently. This is quite annoying, since the faults are unrelated to the quality of the built/tested code, and re-runs may cost hours (complex systems, re-flashing of embedded boxes, etc.). The problem seems to have intensified after summer, but that is not certain.
Some commonalities included: running parallel stages (in some cases massively parallel, hundreds of them; in others about a dozen overall, nested-parallel or not), using lockable resources, updating job badges to signal progress one way or another, using Git of course, and possibly adjusting currentBuild.result and similar data (whether directly or through steps such as unstable, warnError, etc.).
Investigation results:
The visible effect was that, with no rhyme or reason, jobs just occasionally failed, with many of their steps and stages "aborted", shell steps in them "Terminated", etc. Sometimes, but not always, a stack trace appeared at the end of the build log, mentioning a java.util.ConcurrentModificationException, with varying preceding lines (from the actual failed activity) and varying "Caused by" exceptions listed about 10 lines below.
Searching for such markers in build logs and Jenkins controller logs showed that these events do not always kill a running job, and that they have happened over past months and years; it is inconclusive whether the frequency is now higher (and if so, why: changes to plugins, core, or the pipelines themselves?).
- One notable suspect is the Lockable Resources plugin and changes made sometime after its February release; see https://github.com/jenkinsci/lockable-resources-plugin/issues/818
- Other plugins that bit me included GitHub status submissions (usually associated with faults in the SCM checkout step) and Cucumber analysis (which I first thought was related to fingerprinting).
- It is likely there are other cases that my deployments did not hit; feel free to list more in the comments. Maybe my controllers and jobs simply do not use those plugins, or those plugins have no Actions that can fail a WorkflowRun serialization.
I fed the stack traces and other clues and thoughts to ChatGPT (to see how it would fare), and to my positive surprise its analysis seems quite spot-on, so I am linking a copy of my chat with it here: http://htmlpreview.github.io/?https://gist.github.com/jimklimov/7504643a52279406896c71c8ad3a33f0/raw
TL;DR: Various events in Jenkins itself, in certain plugin actions, and in job persistence (depending on global/job options) can all trigger saving of configuration or workflow state. Generally this cannot be controlled or caught, especially not by pipeline or library code (which is additionally walled off by CPS).
Saving implies serialization of objects (with XStream), and eventual de-serialization if it comes to that. For WorkflowRun objects, this means saving the Action list associated with the run, and whatever that involves further.
Sometimes the code (e.g. plugin classes) modifies the object being serialized while XStream is walking it, without special provisions for providing snapshots of complex properties (Lists, Maps) or for using concurrency-safe implementations of those.
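To illustrate the failure mode, here is a minimal, hypothetical sketch (not actual plugin code): a plain ArrayList stands in for an Action's property, and an iterator plays the role of the XStream walk. The fail-fast iterator throws the same java.util.ConcurrentModificationException seen in the build logs. The mutation happens in-loop here only to make the failure deterministic; on a real controller it is a concurrent thread (e.g. a badge update) racing the save.

```java
import java.util.ArrayList;
import java.util.ConcurrentModificationException;
import java.util.List;

public class CmeDemo {
    // Stand-in for a serializer (like XStream) walking a plugin Action's
    // List while another code path appends to it.
    static boolean reproduceCme() {
        List<String> badges = new ArrayList<>(List.of("stage-1", "stage-2"));
        try {
            for (String badge : badges) {       // the "serialization" walk
                badges.add("progress-update");  // mutation mid-iteration
            }
        } catch (ConcurrentModificationException e) {
            return true;  // the fail-fast iterator detected the mutation
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(reproduceCme() ? "CME reproduced" : "no CME");
    }
}
```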
There are tricks to make the code conform better, whether by following https://www.jenkins.io/doc/developer/persistence/backward-compatibility/ or other approaches as described in that chat excerpt.
That chat log also suggests coding patterns to avoid some of the issues known today; unfortunately, many of these are not acceptable for my pipelines (e.g. "only set badges after the parallel part of the job", when we need to make the progress of an hours-long job visible, and badges/statuses are the working way to do so). In any case, it is the Jenkins controller's (core + plugins) responsibility to be robust against whatever is thrown at it.