The Jenkins-vSphere plugin and the vSphere hypervisor are getting out of step: I'm seeing VMs in vSphere (started by Jenkins) that Jenkins doesn't know about (either as Jenkins slaves or in the vSphere plugin's internals).
The Jenkins plugin successfully creates slave VMs in vSphere, the Jenkins<->Slave connection establishes, build(s) are run - everything looks good ... until it doesn't.
Jenkins starts complaining (in the log) that it can't create a VM called "myslave-1" because "myslave-1" already exists. As far as vSphere is concerned that is true: there is a VM in vSphere with that name. However, Jenkins doesn't know about any slave called myslave-1: there's no node entry, and the plugin has no record of it either.
i.e. we end up in a situation where the vSphere hypervisor has Jenkins slave VMs running which Jenkins is not aware of.
In my case, I've told the plugin to limit the number of slaves of each type to a fixed number (rather than setting a total number for the cloud as a whole), so the plugin chooses slave names like myslave-1, myslave-2 ... myslave-N. When the plugin "forgets" about a slave, it then tries to create myslave-1, and vSphere refuses because a VM called myslave-1 already exists in vSphere.
I suspect that if I'd not limited the number of slaves of each kind, and the plugin were therefore using pseudo-random numbering, I'd not see "VM already exists" errors but would instead end up running far more VMs than I'd bargained for.
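To make the collision concrete, here is a minimal sketch of how I assume the fixed-limit naming behaves. This is not the plugin's actual code: the class SlotNaming, the method nextName and its parameters are all hypothetical, and I'm guessing that the name is chosen purely from what Jenkins currently knows about.

```java
import java.util.Set;

/** Hypothetical illustration only; NOT taken from the vSphere plugin's source. */
class SlotNaming {
    /**
     * Returns the first name of the form "<prefix>-<i>" (i = 1..limit) that is
     * not in the set of slaves Jenkins knows about, or null if every slot is
     * already in use. Assumption: vSphere itself is never consulted here.
     */
    static String nextName(String prefix, int limit, Set<String> knownToJenkins) {
        for (int i = 1; i <= limit; i++) {
            String candidate = prefix + "-" + i;
            if (!knownToJenkins.contains(candidate)) {
                return candidate;
            }
        }
        return null;
    }
}
```

If Jenkins has forgotten myslave-1 but still knows about myslave-2, then nextName("myslave", 3, ...) hands back "myslave-1", which is exactly the name that still exists in vSphere, hence the "already exists" failure.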
Unfortunately I have yet to determine exactly what triggers the "leak" (thus far, I've only gone looking in the logs when we're failing to create VMs, which is a long time after we lost track of the VMs).
What we need to do is EITHER to not "leak" these VMs in the first place (i.e. not forget about a VM until it's really gone) OR to have some form of self-healing mechanism whereby the plugin will find out about slaves which exist that it didn't already know about and then either tell Jenkins about them or kill them off.
(Or, ideally, a combination of the two, whereby it doesn't get out of sync unless users go in and delete slaves, but it still copes if users manually start messing around with slave creation/deletion in Jenkins/vSphere.)
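The self-healing option boils down to comparing two lists. The following is only a rough sketch of that comparison, under the assumption that the plugin can get hold of the names of the VMs in vSphere and the names of the slaves Jenkins knows about; how it would obtain those lists is deliberately left out. The class VmReconciler, the method findOrphans and the prefix matching are hypothetical names and choices of mine, not existing plugin API.

```java
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical sketch of the "find leaked VMs" step; NOT existing plugin code. */
class VmReconciler {
    /**
     * Returns the VMs that exist in vSphere, look like slaves of the given
     * type (name starts with "<prefix>-"), and are unknown to Jenkins.
     * The result is sorted so that it is stable when written to a log.
     */
    static Set<String> findOrphans(Set<String> vmsInVSphere,
                                   Set<String> knownToJenkins,
                                   String prefix) {
        Set<String> orphans = new TreeSet<>();
        for (String vm : vmsInVSphere) {
            if (vm.startsWith(prefix + "-") && !knownToJenkins.contains(vm)) {
                orphans.add(vm);
            }
        }
        return orphans;
    }
}
```

Each orphan found this way could then be either re-registered with Jenkins or destroyed in vSphere. Matching on the name prefix alone is an assumption made to keep the sketch short; it could just as easily catch VMs that somebody created by hand with a similar name.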