
Occasionally the plugin leaves orphaned, stopped VMs

      Occasionally this plugin leaves orphaned VMs after they are terminated / no longer used by the plugin, leaving them in a stopped state.

      This wastes compute resources and costs.

      Ideally the plugin would not do this at all, but in addition it would help to have a periodic check (every 5 minutes) that goes through the current VMs in the project, finds the ones tagged with "jenkins", and automatically terminates any of those that are not known to Jenkins. This would make the plugin resilient against unexpected Jenkins restarts, etc. (though it should be an option, in case multiple Jenkins instances share the same GCE project).
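      A rough sketch of what such a sweep could look like, assuming Jenkins' AsyncPeriodicWork extension point and the google-api-services-compute client; the project, zone, label filter, and client wiring below are placeholders rather than the plugin's actual configuration:

      import java.io.IOException;
      import java.util.concurrent.TimeUnit;

      import com.google.api.services.compute.Compute;
      import com.google.api.services.compute.model.Instance;
      import com.google.api.services.compute.model.InstanceList;
      import hudson.Extension;
      import hudson.model.AsyncPeriodicWork;
      import hudson.model.TaskListener;
      import jenkins.model.Jenkins;

      // Hypothetical periodic sweep: every 5 minutes, list instances carrying the
      // plugin's label and delete any that Jenkins no longer knows about.
      @Extension
      public class OrphanedInstanceReaper extends AsyncPeriodicWork {

          // Placeholders -- these would come from the cloud configuration in practice.
          private static final String PROJECT = "my-gce-project";
          private static final String ZONE = "us-central1-a";
          private static final String LABEL_FILTER = "labels.jenkins=true"; // assumed label

          public OrphanedInstanceReaper() {
              super("GCE orphaned instance sweep");
          }

          @Override
          public long getRecurrencePeriod() {
              return TimeUnit.MINUTES.toMillis(5);
          }

          @Override
          protected void execute(TaskListener listener) throws IOException {
              Compute compute = createComputeClient();
              InstanceList result = compute.instances()
                      .list(PROJECT, ZONE)
                      .setFilter(LABEL_FILTER)
                      .execute();
              if (result.getItems() == null) {
                  return;
              }
              for (Instance instance : result.getItems()) {
                  // If no Jenkins node has this instance's name, treat it as orphaned.
                  if (Jenkins.get().getNode(instance.getName()) == null) {
                      listener.getLogger().println("Deleting orphaned instance " + instance.getName());
                      compute.instances().delete(PROJECT, ZONE, instance.getName()).execute();
                  }
              }
          }

          private Compute createComputeClient() {
              // Omitted: credential/transport setup, which the plugin already handles.
              throw new UnsupportedOperationException("wire this up to the plugin's Compute client");
          }
      }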

          [JENKINS-52736] Occasionally the plugin leaves orphaned, stopped VMs

          Rachel Yen added a comment -

          Have you identified any other factors that lead to the orphaned/stopped VMs? I would like to replicate the situation.


          June Rhodes added a comment -

          I haven't been able to isolate the cause of the issue (I also haven't run a build that used these VMs in a month or two). I think it only really happens under high load when the plugin is firing up like 10+ compute VMs for parallel jobs.

          For reference, we use pipeline DSL with the `parallel` and `node` tasks, so we're often firing up lots of VMs in parallel to get through the work. It might be worth checking to make sure that the shutdown code paths not only send the terminate command to the VMs, but also wait to verify that the VMs actually did terminate and no longer exist (I'm not sure what the code does right now).
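          A minimal sketch of that kind of verification, using the raw google-api-services-compute client rather than whatever wrapper the plugin has (the 5-second poll interval and the timeout handling are arbitrary choices):

          import java.io.IOException;

          import com.google.api.client.googleapis.json.GoogleJsonResponseException;
          import com.google.api.services.compute.Compute;

          // Hypothetical helper: after issuing the delete, poll until the instance is
          // actually gone (GET starts returning 404) or give up and flag it.
          class InstanceDeletionVerifier {
              static boolean waitForInstanceDeletion(Compute compute, String project, String zone,
                                                     String name, long timeoutMillis)
                      throws IOException, InterruptedException {
                  long deadline = System.currentTimeMillis() + timeoutMillis;
                  while (System.currentTimeMillis() < deadline) {
                      try {
                          compute.instances().get(project, zone, name).execute();
                      } catch (GoogleJsonResponseException e) {
                          if (e.getStatusCode() == 404) {
                              return true; // instance no longer exists
                          }
                          throw e;
                      }
                      Thread.sleep(5_000); // still present; wait and re-check
                  }
                  return false; // timed out -- the instance may be orphaned and need manual cleanup
              }
          }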


          James Robson added a comment -

          I'm seeing this as well; I believe it is caused by preemptible instances actually getting preempted. I didn't check every instance that gets into this state, but all the ones I did look at had a message about getting preempted in the Stackdriver logs.


          June Rhodes added a comment -

          That's very weird - I would expect preempted instances on Google Cloud to be deleted, not stopped. Otherwise customers will continue to be billed for storage after the VM is stopped.

          Certainly in the case where we use Kubernetes Engine with preemptible VMs, the VMs don't stick around in a stopped state.


          James Robson added a comment - - edited

          In Compute Engine, a preempted instance is left stopped. From the doc https://cloud.google.com/compute/docs/instances/preemptible:

          Preempted instances still appear in your project, but you are not charged for the instance hours while it remains in a TERMINATED state. You can access and recover data from any persistent disks that are attached to the instance, but those disks still incur storage charges until you delete them


          June Rhodes added a comment - - edited

          Oh, that's very weird and not at all how similar functionality works on other platforms like AWS. The idea that the machine will be automatically terminated but Compute Engine won't automatically delete it or clean up storage for you doesn't make a lot of sense to me (if you want to keep data when using a preemptible instance then I guess it's useful, but I'd argue a normal instance makes more sense in that case).

          I don't think there's a way to change the behavior to delete instead of terminate on preemption here, but we definitely want to delete storage resources as soon as they're no longer in use. So we're going to have to do something janky here:

          • Create a Pub/Sub topic and subscription in the project, which connects to a Google Cloud Function that deletes the preemptible VM
          • Add a Stackdriver logging export specifically for preemption notices and configure that export to point at the Pub/Sub topic

          I thought the Pub/Sub topic could point at Jenkins, but that won't work for Jenkins instances that aren't accessible from the Internet, so we have to deploy a handler on GCF instead.

          If anyone has any better ideas, let me know.
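          For illustration, a hedged sketch of what the GCF handler could look like with the Java Functions Framework. The field layout of the exported log entry (resource.labels and protoPayload.resourceName) is an assumption here and should be checked against a real preemption event, and credential setup is omitted:

          import java.nio.charset.StandardCharsets;
          import java.util.Base64;

          import com.google.api.services.compute.Compute;
          import com.google.cloud.functions.BackgroundFunction;
          import com.google.cloud.functions.Context;
          import com.google.gson.JsonObject;
          import com.google.gson.JsonParser;

          // Hypothetical GCF handler: triggered by the Pub/Sub topic the logging export
          // writes to, it deletes the preempted instance so its boot disk stops costing money.
          public class DeletePreemptedInstance
                  implements BackgroundFunction<DeletePreemptedInstance.PubSubMessage> {

              // Minimal shape of the Pub/Sub event payload; only the data field is used.
              public static class PubSubMessage {
                  public String data;
              }

              @Override
              public void accept(PubSubMessage message, Context context) throws Exception {
                  // The exported LogEntry arrives base64-encoded in the message data.
                  String json = new String(Base64.getDecoder().decode(message.data), StandardCharsets.UTF_8);
                  JsonObject entry = JsonParser.parseString(json).getAsJsonObject();

                  // Field layout assumed from gce_instance audit log entries: project and zone
                  // from resource.labels, instance name from the end of protoPayload.resourceName.
                  JsonObject labels = entry.getAsJsonObject("resource").getAsJsonObject("labels");
                  String project = labels.get("project_id").getAsString();
                  String zone = labels.get("zone").getAsString();
                  String resourceName = entry.getAsJsonObject("protoPayload").get("resourceName").getAsString();
                  String instance = resourceName.substring(resourceName.lastIndexOf('/') + 1);

                  Compute compute = createComputeClient();
                  compute.instances().delete(project, zone, instance).execute();
              }

              private Compute createComputeClient() {
                  // Omitted: build the client with application default credentials.
                  throw new UnsupportedOperationException("credential/transport setup omitted");
              }
          }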


          Rachel Yen added a comment -

          Hi June,

          Were you using preemptible instances? 
          Also, thanks for pointing this out. I will have to research this and perhaps change how we're terminating instances.


          June Rhodes added a comment -

          Yup, we are using preemptible instances.


          Karol Lassak added a comment -

          I think I have found the root cause.

          Because GCP stops instances when they are preempted, they are left in that state. The plugin then tries to delete only "RUNNING" instances:

          cloud.client.terminateInstanceWithStatus(cloud.projectId, zone, name, "RUNNING");

          I think this line should be changed to:

          cloud.client.terminateInstance(cloud.projectId, zone, name);
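          To illustrate the difference (I haven't dug into the plugin's client wrapper, so this uses the raw Compute API): a status-gated delete skips a preempted VM because its status is TERMINATED, whereas an unconditional delete covers it:

          import java.io.IOException;

          import com.google.api.services.compute.Compute;
          import com.google.api.services.compute.model.Instance;

          class TerminateIllustration {
              // Status-gated delete (roughly how the "RUNNING"-only call behaves):
              // a preempted VM reports TERMINATED, so it is silently skipped.
              static void deleteIfStatus(Compute compute, String project, String zone,
                                         String name, String requiredStatus) throws IOException {
                  Instance instance = compute.instances().get(project, zone, name).execute();
                  if (requiredStatus.equals(instance.getStatus())) {
                      compute.instances().delete(project, zone, name).execute();
                  }
              }

              // Unconditional delete: removes the instance whatever state it is in,
              // which also covers the preempted/TERMINATED case.
              static void deleteAlways(Compute compute, String project, String zone,
                                       String name) throws IOException {
                  compute.instances().delete(project, zone, name).execute();
              }
          }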


          Craig Barber added a comment -

          https://github.com/jenkinsci/google-compute-engine-plugin/issues/77
