JENKINS-26797

Zombie instance created by the EC2 plugins

Details

    • Bug
    • Status: Closed
    • Major
    • Resolution: Fixed
    • ec2-plugin
    • Jenkins ver. 1.580.3
      EC2 Plugin ver. 1.24

    Description

      Occasionally (probably several times a week), I observe EC2 slaves launched by the plugin that are missing the "name" tag value, while the other tags are present on the instance. The main problem is that Jenkins does not have these instances registered as slaves.

      As a result, they become zombie instances that can only be found and killed in the AWS console.
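One way to picture the symptom in the description: a zombie is any instance that is running on the AWS side but has no matching node on the Jenkins side. The sketch below is only an illustration of that set difference. `ZombieFinder` and `findZombies` are hypothetical names, not part of the EC2 plugin or the AWS SDK, and the two ID sets are assumed to have been collected by some other means.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper for illustration; not part of the EC2 plugin or the AWS SDK.
final class ZombieFinder {

    /**
     * Returns the instance IDs that are running in the cloud but are not
     * registered as nodes in Jenkins, sorted for stable output.
     */
    static Set<String> findZombies(Set<String> runningInstanceIds,
                                   Set<String> registeredInstanceIds) {
        Set<String> zombies = new TreeSet<>(runningInstanceIds);
        zombies.removeAll(registeredInstanceIds);
        return zombies;
    }

    public static void main(String[] args) {
        // Made-up IDs: three instances are running, Jenkins knows about two of them.
        Set<String> running = new HashSet<>(Arrays.asList("i-aaa", "i-bbb", "i-ccc"));
        Set<String> registered = new HashSet<>(Arrays.asList("i-aaa", "i-ccc"));
        System.out.println("Zombie candidates: " + findZombies(running, registered));
    }
}
```

Anything such a comparison reports is only a candidate: an instance that is still booting, or one that belongs to another Jenkins master, would show up the same way.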

        Activity

          coryrking Cory King added a comment -

          I see this issue often too. The plugin creates the slave, which then disappears from Jenkins but remains alive with no "name" tag and can only be terminated in AWS.

          I have yet to find anything useful in the logs.

          mlieberman85 Michael Lieberman added a comment -

          I also noticed that these instances count towards the instance cap even though they don't show up as slaves in Jenkins.
          arash arash m added a comment -

          I've noticed this behavior for a while, though I feel it may have gotten worse with the most recent version of the plugin.

          cbek Christoph Beckmann added a comment - - edited

          The bug was introduced by #79.
          It was fixed via #123, which was released in 1.25.
          1.25 is buggy, so go directly to 1.26.

          cbek Christoph Beckmann added a comment -

          Fixed in 1.25 by #123
          arash arash m added a comment -

          I've been using version 1.26 for a couple of weeks and I'm still seeing the issue.

          cbek Christoph Beckmann added a comment -

          >>I've been using version 1.26 for a couple of weeks and I'm still seeing the issue.
          OK, agreed: #123 fixes only one part of the issue.
          I checked, and we still have zombie instances as well.
          It seems that the Jenkins master loses its connection to the slaves and doesn't reconnect.
          Any idea how we could reproduce it?
          kevcheng Kevin Cheng added a comment -

          I upgraded the plugin on one of my Jenkins instances to 1.26. I still see the zombie issue (though probably not as severely). Regarding Chris' comment about the master losing its connection to the slave, I am not sure that matches my original case, in which the instances have all the tagging key-value pairs set except the "name" tag.

          If the slave had connected to the master at least once and then somehow lost its registration, I do not think the observed behavior would fall into this situation.

          I do have another ticket about the "out-of-sync" issue:
          JENKINS-26798 - High occurrence of out-sync EC2 slaves

          dsuskin_fitbit Daniel Suskin added a comment -

          I've been seeing this issue as well with a high rate of occurrence on 1.26. Here's a stack trace from the Jenkins log for one such instance. I've removed information specific to this particular request.

          WARNING: Provisioned slave EC2 XXXXXXXXXXXXXX (ami-XXXXXXXXXXX) failed to launch
          com.amazonaws.AmazonServiceException: The instance ID 'i-XXXXXXXXX' does not exist (Service: AmazonEC2; Status Code: 400; Error Code: InvalidInstanceID.NotFound; Request ID: XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX)
          at com.amazonaws.http.AmazonHttpClient.handleErrorResponse(AmazonHttpClient.java:886)
          at com.amazonaws.http.AmazonHttpClient.executeHelper(AmazonHttpClient.java:484)
          at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:256)
          at com.amazonaws.services.ec2.AmazonEC2Client.invoke(AmazonEC2Client.java:8798)
          at com.amazonaws.services.ec2.AmazonEC2Client.createTags(AmazonEC2Client.java:4990)
          at hudson.plugins.ec2.SlaveTemplate.updateRemoteTags(SlaveTemplate.java:732)
          at hudson.plugins.ec2.SlaveTemplate.provisionOndemand(SlaveTemplate.java:426)
          at hudson.plugins.ec2.SlaveTemplate.provision(SlaveTemplate.java:287)
          at hudson.plugins.ec2.EC2Cloud$1.call(EC2Cloud.java:398)
          at hudson.plugins.ec2.EC2Cloud$1.call(EC2Cloud.java:394)
          at jenkins.util.ContextResettingExecutorService$2.call(ContextResettingExecutorService.java:46)
          at java.util.concurrent.FutureTask.run(FutureTask.java:262)
          at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
          at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
          at java.lang.Thread.run(Thread.java:744)

          cbek Christoph Beckmann added a comment -

          Seeing the same issue.

          • we have instances without any name or tags
          • it seems from the jenkins.log that the tagging failed; maybe the instance wasn't up yet on the AWS side
          • if the tagging fails, the instance will not be "stored" (EC2Cloud:399) -> unknown instance

          I will provide a fix/PR today.
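The failure mode described in this comment can be modelled in a few lines. This is a simplified sketch, not the plugin's code: `ProvisionModel` and all of its members are hypothetical, the two lists stand in for AWS and for the Jenkins node list, and an `IllegalStateException` stands in for the `AmazonServiceException` seen in the stack trace. It assumes, as the comment suggests, that the instance is launched first, tagged second, and stored in Jenkins last.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified, hypothetical model of launch -> tag -> register; not the plugin's code.
final class ProvisionModel {
    /** Stand-in for AWS: every instance that was actually launched. */
    final List<String> launchedInCloud = new ArrayList<>();
    /** Stand-in for Jenkins: every instance that was stored as a node. */
    final List<String> registeredInJenkins = new ArrayList<>();

    String launch() {
        String id = "i-" + (launchedInCloud.size() + 1);
        launchedInCloud.add(id);
        return id;
    }

    /** Fails the way the stack trace does when the new ID is not yet visible. */
    void tag(String id, boolean visibleToTagging) {
        if (!visibleToTagging) {
            throw new IllegalStateException("InvalidInstanceID.NotFound: " + id);
        }
    }

    /** An exception from tag() skips the registration step below it. */
    void provision(boolean visibleToTagging) {
        String id = launch();
        tag(id, visibleToTagging);
        registeredInJenkins.add(id);
    }

    public static void main(String[] args) {
        ProvisionModel model = new ProvisionModel();
        model.provision(true);
        try {
            model.provision(false);
        } catch (IllegalStateException e) {
            System.out.println("Provisioning failed: " + e.getMessage());
        }
        System.out.println("Launched:   " + model.launchedInCloud);
        System.out.println("Registered: " + model.registeredInJenkins);
    }
}
```

In this model the second instance stays in `launchedInCloud` but never reaches `registeredInJenkins`, which matches the zombie instances reported in this ticket.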

          cbek Christoph Beckmann added a comment -

          We have had a SNAPSHOT version running in our production systems for over 3 days.
          No zombie instances in the last 3 days.

          Still waiting for merge: https://github.com/jenkinsci/ec2-plugin/pull/140
          A SNAPSHOT version with the fix is available at: https://buildhive.cloudbees.com/job/jenkinsci/job/ec2-plugin/99/org.jenkins-ci.plugins$ec2/

          scm_issue_link SCM/JIRA link daemon added a comment -

          Code changed in jenkins
          User: Christoph Beckmann
          Path:
          pom.xml
          src/main/java/hudson/plugins/ec2/SlaveTemplate.java
          src/test/java/hudson/plugins/ec2/SlaveTemplateUnitTest.java
          http://jenkins-ci.org/commit/ec2-plugin/141236b5e3739c46b7ca5b9f85b43a0c63ee7624
          Log:
          JENKINS-26797 - Zombie instance created by the EC2 plugins

          • fixing error code for provisionOndemand - InvalidInstanceID.NotFound
          • refactoring updateRemoteTags - no code duplication
          • no exception throwing if remote tagging fails - only logging
            • otherwise we got zombie instances which are not shown
          • adding unit testing
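The approach named in the commit log above (treat InvalidInstanceID.NotFound as a temporary condition, and log rather than throw when tagging fails) might look roughly like the sketch below. It is not the code from the commit: `TagRetry`, `Tagger`, `CloudException`, the attempt count and the sleep interval are all hypothetical stand-ins chosen so the example runs without the AWS SDK.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

// Hypothetical sketch of "retry on a known error code, log instead of throwing".
final class TagRetry {
    private static final Logger LOGGER = Logger.getLogger(TagRetry.class.getName());

    /** Stand-in for a cloud service error that carries an error code. */
    static final class CloudException extends RuntimeException {
        final String errorCode;

        CloudException(String errorCode) {
            super(errorCode);
            this.errorCode = errorCode;
        }
    }

    /** Stand-in for the remote tagging call. */
    interface Tagger {
        void tag(String instanceId);
    }

    /**
     * Tries to tag the instance, retrying while the service reports retryableCode.
     * Tagging failures are logged and reported as false, never thrown, so the
     * caller can still register the instance.
     */
    static boolean tagWithRetry(Tagger tagger, String instanceId, String retryableCode,
                                int maxAttempts, long sleepMillis) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                tagger.tag(instanceId);
                return true;
            } catch (CloudException e) {
                if (!retryableCode.equals(e.errorCode)) {
                    LOGGER.log(Level.WARNING, "Tagging " + instanceId + " failed", e);
                    return false;
                }
                if (attempt < maxAttempts) {
                    try {
                        Thread.sleep(sleepMillis); // give the new instance time to become visible
                    } catch (InterruptedException ie) {
                        Thread.currentThread().interrupt();
                        return false;
                    }
                }
            }
        }
        LOGGER.warning("Gave up tagging " + instanceId + " after " + maxAttempts + " attempts");
        return false;
    }
}
```

A caller would ignore or log a `false` result and continue with registration, so an untagged instance still shows up in Jenkins instead of becoming a zombie.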

          scm_issue_link SCM/JIRA link daemon added a comment -

          Code changed in jenkins
          User: Christoph Beckmann
          Path:
          pom.xml
          src/main/java/hudson/plugins/ec2/SlaveTemplate.java
          src/test/java/hudson/plugins/ec2/SlaveTemplateUnitTest.java
          http://jenkins-ci.org/commit/ec2-plugin/6b46558c8d2797bbd17cf7523d3d321b8cabb3b0
          Log:
          JENKINS-26797 - Zombie instance created by the EC2 plugins

          • changed updateRemoteTags back to private
          • introduced powermock-module-junit4 for invokeMethod

          scm_issue_link SCM/JIRA link daemon added a comment -

          Code changed in jenkins
          User: Francis Upton
          Path:
          pom.xml
          src/main/java/hudson/plugins/ec2/SlaveTemplate.java
          src/test/java/hudson/plugins/ec2/SlaveTemplateUnitTest.java
          http://jenkins-ci.org/commit/ec2-plugin/0dcf0870e6e6bf1046af0f5c64f6dfe40a93baf5
          Log:
          Merge pull request #140 from cbek/JENKINS-26797

          JENKINS-26797 - Zombie instance created by the EC2 plugins

          Compare: https://github.com/jenkinsci/ec2-plugin/compare/2c5aa2462f72...0dcf0870e6e6

          cbek Christoph Beckmann added a comment -

          Is part of release 1.27
          kevcheng Kevin Cheng added a comment -

          I upgraded to v1.27 4 days ago, and I am happy to report that I do not see zombie instances anymore. However, the out-of-sync issue remains. It was reported as a separate ticket, JENKINS-26798, which was marked as a duplicate. I do not agree that they are the same; at least, the fix that went in did not fix the OOS situation.

          cbek Christoph Beckmann added a comment -

          @Kevin Cheng
          Good to hear; we noticed the same after ~2 weeks.
          With this issue fixed, I agree that JENKINS-26798 is a different issue.
          arash arash m added a comment -

          Agreed. This issue is fixed on my side as well.
          Thank you for the fix.


          People

            cbek Christoph Beckmann
            kevcheng Kevin Cheng
            Votes: 6
            Watchers: 11
