I've seen a similar issue where one stuck slave causes all other builds of the same job to back up. For example, I had a hundred builds that looked to be finished, but were still occupying the slaves and had this message in the console output:
00:32:03.591 Editable Email Notification is waiting for a checkpoint on build_hhvm_fbcode #13105
Build #13105 had the following message:
01:26:19.644 Looks like the node went offline during the build. Check the slave log for the details.
That machine had in fact been rebooted, but none of the other builds should have been waiting on it. Once I cancelled build #13105, all the other builds completed.
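In case it helps anyone else track down which build is holding things up, here is a rough sketch that scans console output for the "waiting for a checkpoint" message and counts which build is named most often. The function name `find_blocking_builds` and the regular expression are my own assumptions, based only on the message wording quoted above; the wording may differ in other plugin versions, and fetching the console text is left out.

```python
import re
from collections import Counter

# Assumed pattern, derived from the console message quoted in this report.
CHECKPOINT_RE = re.compile(
    r"waiting for a checkpoint on (?P<job>\S+) #(?P<number>\d+)"
)


def find_blocking_builds(console_lines):
    """Count how many console lines are waiting on each (job, build number)."""
    blockers = Counter()
    for line in console_lines:
        match = CHECKPOINT_RE.search(line)
        if match:
            key = (match.group("job"), int(match.group("number")))
            blockers[key] += 1
    return blockers


# The first line is the message from this report; the second is a made-up
# unrelated line included only to show that non-matching lines are skipped.
sample = [
    "00:32:03.591 Editable Email Notification is waiting for a checkpoint "
    "on build_hhvm_fbcode #13105",
    "00:32:04.000 unrelated line (made up for this sketch)",
]
blockers = find_blocking_builds(sample)
for (job, number), count in blockers.most_common():
    print(f"{job} #{number} is blocking {count} build(s)")
```

Feeding it the console output of every build that appears stuck should point at the one build worth cancelling.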