Jenkins / JENKINS-26798

EC2 instance not stopped/terminated when slave marked offline


Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Cannot Reproduce
    • Component: ec2-plugin
    • Labels: None
    • Environment: Jenkins ver. 1.580.3, EC2 Plugin ver. 1.24

    Description

      We make heavy use of this plugin, with 4 separate models configured on our master, each using a different AMI for a different type of build project.

      While the scaling capability is great, we frequently end up with out-of-sync slaves: the Jenkins slave is marked offline, but the corresponding EC2 instance is still in the "running" state instead of "stopped".

      We are not sure whether this is because the AWS side failed to stop the instance, but at least jenkins.log does not indicate that scenario.

      Given the volume and frequency of this situation, we are bleeding money on these out-of-sync slaves without gaining any build capacity.
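      A quick way to confirm the out-of-sync state is to cross-check what Jenkins reports against EC2. The sketch below is illustrative only: the Jenkins URL, the credentials, the tag filter, and the assumption that EC2 slave names contain the instance id all need to be adapted to your setup.

      {code:python}
      import boto3
      import requests

      JENKINS_URL = "https://jenkins.example.com"   # assumption: your Jenkins base URL
      AUTH = ("user", "api-token")                  # assumption: a user / API-token pair

      # 1. Slaves Jenkins currently considers offline.
      computers = requests.get(JENKINS_URL + "/computer/api/json", auth=AUTH).json()
      offline_slaves = [c["displayName"] for c in computers["computer"] if c["offline"]]

      # 2. Plugin-launched instances that are still running in EC2.
      #    The tag filter is an assumption; adjust it to however your templates tag instances.
      ec2 = boto3.client("ec2")
      running = ec2.describe_instances(
          Filters=[{"Name": "instance-state-name", "Values": ["running"]},
                   {"Name": "tag-key", "Values": ["Name"]}]
      )

      for reservation in running["Reservations"]:
          for instance in reservation["Instances"]:
              instance_id = instance["InstanceId"]
              # Heuristic (an assumption): EC2 slave display names include the instance id.
              if any(instance_id in name for name in offline_slaves):
                  print("out of sync: %s is running but its slave is offline" % instance_id)
      {code}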

      Attachments

        Issue Links

          Activity

            kevcheng Kevin Cheng added a comment -

            I am not quite convinced this "out-of-sync" issue is the same as the Zombie issue. Both issues were reported by me. After upgrading to v1.27, the good news is that the Zombie issue (an AWS instance with no value for the Name tag and no corresponding slave in Jenkins) has been addressed. However, after observing for several days after the upgrade, I still see the out-of-sync situation.
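            For reference, a minimal sketch of spotting the Zombie symptom from the AWS side (running instances with no Name tag); the Name-tag convention here is an assumption based on the description above.

            {code:python}
            import boto3

            ec2 = boto3.client("ec2")
            running = ec2.describe_instances(
                Filters=[{"Name": "instance-state-name", "Values": ["running"]}]
            )
            for reservation in running["Reservations"]:
                for instance in reservation["Instances"]:
                    tags = {t["Key"]: t["Value"] for t in instance.get("Tags", [])}
                    # A running instance without a Name tag matches the "zombie" symptom.
                    if not tags.get("Name"):
                        print("possible zombie: %s launched %s"
                              % (instance["InstanceId"], instance["LaunchTime"]))
            {code}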

            francisu Francis Upton added a comment -

            Can you try this with 1.30?

            tfoote Tully Foote added a comment -

            I am seeing this issue in 1.31.

            I am seeing slaves that drop offline due to a connectivity issue (to be debugged separately), but the net effect is that the slaves stay offline.

            However, the plugin does not spin down the instances, so I can end up with many more executors than I need, mostly all offline. This can happen overnight, so in the morning I come in and find that the system processed everything overnight but the machines that went offline are still running on EC2. If I click "delete slave", everything cleans up correctly.

            I understand that not tearing down offline slaves might be valuable for debugging or other reasons, but it costs us time and effort in monitoring; I'd rather have a slave torn down once it is offline and has hit its timeout (a sketch of scripting that cleanup follows below).

            I'm tracking this for our project at: https://github.com/ros-infrastructure/buildfarm_deployment/issues/125
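            A rough sketch of scripting that manual cleanup, using the Jenkins JSON API to find offline slaves and the same doDelete action the UI uses. The URL, the credentials, and the name prefix used to recognize EC2 slaves are assumptions; with CSRF protection enabled you may also need to send a crumb header.

            {code:python}
            from urllib.parse import quote
            import requests

            JENKINS_URL = "https://jenkins.example.com"   # assumption
            AUTH = ("user", "api-token")                  # assumption

            computers = requests.get(JENKINS_URL + "/computer/api/json", auth=AUTH).json()
            for c in computers["computer"]:
                name = c["displayName"]
                # Skip the master; the "EC2" prefix used to pick out plugin slaves is an assumption.
                if name == "master" or not name.startswith("EC2"):
                    continue
                if c["offline"]:
                    # Same effect as clicking "delete slave" in the UI.
                    r = requests.post(JENKINS_URL + "/computer/" + quote(name) + "/doDelete", auth=AUTH)
                    print("deleted offline slave %s: HTTP %d" % (name, r.status_code))
            {code}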


            johnny_shields Johnny Shields added a comment -

            francisu There's a good chance this is fixed on master.
            francisu Francis Upton added a comment -

            Can someone try this on the current master and reopen it if it's not fixed?

            francisu Francis Upton added a comment -

            Probably fixed in master.


            johnny_shields Johnny Shields added a comment -

            francisu I'm pretty sure I've seen this but haven't investigated further. I think we can close this and reopen the issue when we have a smoking gun.

            People

              Assignee: francisu Francis Upton
              Reporter: kevcheng Kevin Cheng
              Votes: 4
              Watchers: 7

              Dates

                Created:
                Updated:
                Resolved: