When running a certain configuration of a job chain, we intermittently, but fairly reliably, get the following error:
The fault first occurred after upgrading the Matrix Project Plugin from 1.7.1 to 1.8, and it persists in later versions.
The error seems to require a large number of nodes to be active and serving the job chain. Our test environment has only a few nodes, so at first we could not reproduce the fault there. After copying those nodes to simulate the use of 12 nodes, we got it to recur.
I have attached configs for a chain of three jobs that reproduces the fault, at least in our environment: a start job (IllegalMonitor_Start), the main matrix job (IllegalMonitor_Matrix) and a post-build job (IllegalMonitor_Matrix_Cleanup). To reproduce the fault, set up several nodes with several executors each (we used 10 per node), set up the job chain and spam-trigger builds of the start job. The fault typically occurs within 10-30 builds, although sometimes it has first occurred only after ~80-100 builds.
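The spam-triggering step can be scripted. The following is a minimal sketch, not the method we actually used: it assumes the start job takes no parameters, that the Jenkins remote trigger endpoint `/job/<name>/build` is reachable, and that a user name and API token are available. The base URL, user and token in the usage section are placeholders. Note that Jenkins may merge identical requests that are still waiting in the queue, so the number of builds that actually run can be lower than the number of requests sent.

```python
import base64
import urllib.parse
import urllib.request


def build_trigger_url(base_url: str, job_name: str) -> str:
    """Return the remote-trigger URL for a job without parameters."""
    return f"{base_url.rstrip('/')}/job/{urllib.parse.quote(job_name)}/build"


def make_request(base_url: str, job_name: str, user: str, api_token: str):
    """Build one authenticated POST request (HTTP basic auth with an API token)."""
    credentials = f"{user}:{api_token}".encode("utf-8")
    headers = {
        "Authorization": "Basic " + base64.b64encode(credentials).decode("ascii")
    }
    return urllib.request.Request(
        build_trigger_url(base_url, job_name),
        data=b"",  # empty body; the trigger endpoint expects a POST
        headers=headers,
        method="POST",
    )


def spam_trigger(base_url, job_name, user, api_token, count,
                 send=urllib.request.urlopen):
    """Send `count` trigger requests back to back; return how many were sent.

    `send` is injectable so the loop can be exercised without a server.
    """
    sent = 0
    for _ in range(count):
        with send(make_request(base_url, job_name, user, api_token)):
            sent += 1
    return sent


if __name__ == "__main__":
    # Placeholder values (assumptions); replace with your own instance details.
    spam_trigger("http://localhost:8080", "IllegalMonitor_Start",
                 "some_user", "some_api_token", count=100)
```

A count of 100 covers the upper end of the range in which we have seen the fault first appear.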
I also identified that it is the Groovy postbuild action in the post-build job that causes the error, namely:
If it is removed, the fault disappears.
The fault seems somewhat sporadic. When it occurs, it keeps occurring for builds triggered on the same node for some time, and then it moves to a different node. For example, the fault might occur when triggering builds on node_1 for a few hours or a day, and then switch to occurring when triggering builds on node_7.
There have also been situations where I stripped some features from the config (while trying to pinpoint the source of the error), saw the fault disappear, rolled back the config, and then stripped it again, piece by piece, to the same config as when the fault had previously disappeared, only to have the fault occur again. Some of this might be due to random variance in when the fault occurs, but certainly not all of it.