
EC2 instance not stopped/terminated when slave marked offline

    • Type: Bug
    • Resolution: Cannot Reproduce
    • Priority: Major
    • Component: ec2-plugin
    • Labels: None
    • Environment: Jenkins ver. 1.580.3
      EC2 Plugin ver. 1.24

      We heavily use this plugin, with 4 separate AMI configurations created on our master for different types of build projects.

      While the scaling capability is great, we encounter a high frequency of out-of-sync slaves: Jenkins slaves are marked offline, but the corresponding EC2 instances remain in the "running" state instead of "stopped".

      We are not sure whether the AWS side is unable to stop the instances, but at least the Jenkins log does not indicate that scenario.

      Given the high volume and high frequency of this situation, we are bleeding money without gaining any build capacity from those out-of-sync slaves.
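
      As a stopgap while the root cause is investigated, the out-of-sync slaves can be surfaced from the Jenkins script console. Below is a minimal read-only Groovy sketch, assuming the ec2-plugin's EC2AbstractSlave type; it only prints offline EC2 slaves and their instance IDs so the instances can be stopped manually:

          import hudson.plugins.ec2.EC2AbstractSlave
          import jenkins.model.Jenkins

          // Print every EC2 slave whose Jenkins computer is offline; the matching
          // AWS instance may still be running (and billing) even though the slave is down.
          Jenkins.instance.nodes
              .findAll { it instanceof EC2AbstractSlave }
              .each { node ->
                  def computer = node.toComputer()
                  if (computer != null && computer.isOffline()) {
                      println "Offline EC2 slave: ${node.nodeName}, instance: ${node.instanceId}"
                  }
              }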

          [JENKINS-26798] EC2 instance not stopped/terminated when slave marked offline

          Kevin Cheng added a comment -

          I am not quite convinced that this "out-of-sync" issue is the same as the Zombie issue; both were reported by me. After applying v1.27, the good news is that the Zombie issue (an AWS instance with no value for the Name tag and no corresponding slave in Jenkins) has been addressed. However, after observing for several days since the upgrade, I still see the out-of-sync situation.

          Francis Upton added a comment -

          Can you try this with 1.30?

          Tully Foote added a comment -

          I am seeing this issue in 1.31.

          I am seeing slaves that drop offline due to a connectivity issue (to be debugged separately), but the net effect is that the slaves stay offline.

          But the plugin does not spin down the instances, so I can end up with many more executors than I need, mostly all offline. This can happen overnight, such that in the morning I come in and find that the system processed everything overnight, but the machines that went offline are still running on EC2. If I click delete on the slave, everything cleans up correctly.

          I understand that not tearing down offline slaves might be valuable for debugging or other purposes, but it costs us monitoring time and effort; I'd rather just be able to say: tear it down if it's offline and has hit its timeout (a script-console sketch of this follows below).

          I'm tracking this for our project at: https://github.com/ros-infrastructure/buildfarm_deployment/issues/125
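
          Pending a plugin-side fix, that "tear it down if it's offline" behavior can be approximated from the script console. Below is a minimal Groovy sketch, assuming the ec2-plugin's EC2AbstractSlave API; terminate() should stop or terminate the instance (per the template's stop-on-terminate setting) and remove the node, much like clicking delete:

              import hudson.plugins.ec2.EC2AbstractSlave
              import jenkins.model.Jenkins

              // Terminate every EC2 slave that is currently offline, reclaiming the
              // instances the plugin failed to spin down. terminate() also removes the node.
              Jenkins.instance.nodes
                  .findAll { it instanceof EC2AbstractSlave }
                  .each { node ->
                      def computer = node.toComputer()
                      if (computer != null && computer.isOffline()) {
                          println "Terminating offline EC2 slave ${node.nodeName} (${node.instanceId})"
                          node.terminate()
                      }
                  }

          An idle-timeout check could be added before terminate() if offline-but-recovering slaves should be given a grace period.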


          Johnny Shields added a comment -

          francisu There's a good chance this is fixed on master.

          Francis Upton added a comment -

          Can someone try this on the current master and reopen it if it's not fixed?

          Francis Upton added a comment -

          Probably fixed in master.

          Johnny Shields added a comment -

          francisu I'm pretty sure I've seen this but haven't investigated further. I think we can close this and re-open the issue when we have a smoking gun.

            Assignee: Francis Upton (francisu)
            Reporter: Kevin Cheng (kevcheng)
            Votes: 4
            Watchers: 7
