- Gerrit is set up with replication
- All Git operations in Jenkins jobs run against replicas
- Gerrit trigger plugin is configured to block builds until replication is completed
- A change is pushed to refs/for/master in Gerrit
- Jenkins has a pre-submit job configured for this repo, triggered by patchset-created-event in Gerrit trigger plugin.
- This job is enqueued
- It waits in the queue for replication to complete
- Job starts
- Job completes successfully.
- The same change is submitted / merged in Gerrit
- Another post-submit job for the same change should run in Jenkins, triggered via change-merged-event.
Nothing else has been merged in between, so the merge revision is exactly the same as the one that ran through the pre-submit job above.
- Job is enqueued
- ROOT CAUSE: The job does NOT await replication, because ReplicationQueueTaskDispatcher.updateFromReplicationCache finds a cache entry for the same ref, populated during the pre-submit run (#2.2 above)
(The cache key consists only of server, host, ref, and project)
- Job is started
- Code in the job assumes that the change is merged to master and that it can simply run git fetch origin master
- ERROR SYMPTOM: The replica has the change ref, but it has not yet received the push to master, so the fetch may serve a stale master head!
The scenario above has been confirmed with logs on the merge event stating "processed a replication event from the cache" even before the replica has received the push.
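To make the suspected collision concrete, here is a toy model in Python. It is not the plugin's code: the class, the key layout and all values (server name, host, project, change ref) are made up for illustration, and it only assumes what is described above, namely that the cache is keyed on server, host, ref and project and that the merge event looks up the same ref as the pre-submit event did.

```python
from collections import namedtuple

# Hypothetical key shape, following the description above:
# server, host, ref, project. Nothing about the event type or the branch.
CacheKey = namedtuple("CacheKey", ["server", "host", "ref", "project"])


class ToyReplicationCache:
    """Toy stand-in for the replication cache; not the plugin's real class."""

    def __init__(self):
        self._replicated = set()

    def record(self, server, host, ref, project):
        # Called when a ref-replicated event arrives for a ref.
        self._replicated.add(CacheKey(server, host, ref, project))

    def is_replicated(self, server, host, ref, project):
        # A hit means the queued job is released without waiting.
        return CacheKey(server, host, ref, project) in self._replicated


cache = ToyReplicationCache()
change_ref = "refs/changes/45/12345/1"  # made-up change ref

# Pre-submit: the change ref is replicated and the event is cached.
cache.record("gerrit", "replica-1", change_ref, "my/project")

# Post-submit: the lookup uses the same ref, so it hits the old entry
# even though refs/heads/master has not been replicated yet.
skips_wait = cache.is_replicated("gerrit", "replica-1", change_ref, "my/project")
master_known = cache.is_replicated("gerrit", "replica-1", "refs/heads/master", "my/project")
```

In this model `skips_wait` is true while `master_known` is false, which matches the observed behavior: the job is released based on the change ref alone.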
The thesis here is that change-merged-event should never use the replication cache, or the cache has to be extended to also take branch pushes into account, not just refs. Post-submit jobs often assume that the master branch is in a consistent state and contains the most recent changes. They don't directly operate on the GERRIT_REFSPEC as you would in a pre-submit job.
Alternatively, each job should have an option for using the replication cache or not, since only the job author knows whether the job uses "master" or "GERRIT_REFSPEC".
As of now there is no pretty alternative except sleeping or polling in the job. I see there is also a config for cache expiration, which I guess could be set to 1 second or so to effectively disable the cache, but that would have global impact.
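For reference, the polling workaround could look roughly like the Python sketch below. It is only a sketch under some assumptions: the job runs inside a clone whose origin is the replica, the expected commit SHA is available to the job (for example from GERRIT_PATCHSET_REVISION, if your trigger version sets it), and the submit type keeps that SHA on master (cherry-pick or rebase submits would produce a different SHA). The function names, timeout and interval are made up.

```python
import subprocess
import time


def master_contains(sha, remote="origin", branch="master"):
    """Fetch the branch from the replica and report whether it contains sha."""
    fetch = subprocess.run(["git", "fetch", remote, branch], capture_output=True)
    if fetch.returncode != 0:
        return False
    # Exit code 0 means sha is an ancestor of (or equal to) the fetched head.
    # Any other exit code, including an error for an unknown object, is
    # treated as "not there yet".
    check = subprocess.run(
        ["git", "merge-base", "--is-ancestor", sha, "FETCH_HEAD"],
        capture_output=True,
    )
    return check.returncode == 0


def wait_until(predicate, timeout=300.0, interval=5.0,
               sleep=time.sleep, clock=time.monotonic):
    """Call predicate until it returns True or the timeout expires."""
    deadline = clock() + timeout
    while True:
        if predicate():
            return True
        if clock() >= deadline:
            return False
        sleep(interval)
```

A job would then call something like `wait_until(lambda: master_contains(expected_sha))` before building and fail if it returns False. Checking ancestry rather than equality keeps the check valid if further changes land on master while the job is waiting.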