Jenkins / JENKINS-57795

Orphaned EC2 instances after Jenkins restart


    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Critical
    • Resolution: Fixed
    • Component/s: ec2-plugin
    • Labels:
      None
    • Environment:
      Jenkins ver. 2.176.1, 2.204.2
      ec2 plugin 1.43, 1.44, 1.45, 1.49.1
    • Released As:
      ec2 1.51

      Description

      Sometimes after a Jenkins restart the plugin won't be able to spawn more agents.

      The plugin will just loop on this:

      SlaveTemplate{ami='ami-0efbb291c6e8cc847', labels='docker'}. Attempting to provision slave needed by excess workload of 1 units
      May 31, 2019 2:23:53 PM INFO hudson.plugins.ec2.EC2Cloud getNewOrExistingAvailableSlave
      SlaveTemplate{ami='ami-0efbb291c6e8cc847', labels='docker'}. Cannot provision - no capacity for instances: 0
      May 31, 2019 2:23:53 PM WARNING hudson.plugins.ec2.EC2Cloud provision
      Can't raise nodes for SlaveTemplate{ami='ami-0efbb291c6e8cc847', labels='docker'}
      

      If I go to the EC2 console and terminate the instance manually the plugin will spawn a new one and use it.

      There seems to be a mismatch in the plugin logic. The part responsible for counting instances and checking the cap sees the EC2 instance. However, the part responsible for picking up running EC2 instances doesn't seem to be able to find it.

      We use a single subnet, security group and VPC (I've seen some reports about this causing problems).

      We use the instanceCap = 1 setting because we are testing the plugin. This might make the problem more visible than it would be with a higher cap.

            Activity

            pyieh Pierson Yieh added a comment - - edited

            We've identified the cause of our issue. The orphan re-attachment logic is tied to the EC2Cloud's provision method. The issue occurs when the actual number of existing AWS nodes has hit the instance cap (i.e. no more nodes can be provisioned). Because the cap has been reached, provisioning isn't even attempted, so the orphan re-attachment logic is never triggered. Submitted a PR here: https://github.com/jenkinsci/ec2-plugin/pull/448
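
The ordering problem described in this comment can be illustrated with a small toy model. This is only a sketch under stated assumptions, not the ec2-plugin's code: `ToyCloud`, `reattach_orphans`, `provision_buggy` and `provision_fixed` are hypothetical names, and the "fixed" ordering is one possible way to avoid the problem, which may differ from what the linked PR actually does.

```python
from dataclasses import dataclass, field


@dataclass
class ToyCloud:
    """Toy model of the behaviour described in this ticket (hypothetical, not plugin code)."""

    instance_cap: int
    ec2_instances: set = field(default_factory=set)  # instances that exist in AWS
    known_agents: set = field(default_factory=set)   # instances Jenkins has a node for

    def reattach_orphans(self):
        """Register every existing instance that Jenkins has no node for."""
        orphans = self.ec2_instances - self.known_agents
        self.known_agents |= orphans
        return orphans

    def provision_buggy(self):
        # The cap check counts every instance in AWS, orphans included,
        # so with a full cap we return before re-attachment can run.
        if len(self.ec2_instances) >= self.instance_cap:
            return None  # "Cannot provision - no capacity for instances: 0"
        self.reattach_orphans()
        return self._launch()

    def provision_fixed(self):
        # Assumed alternative ordering: re-attach first, so a full cap
        # cannot hide an orphaned instance.
        orphans = self.reattach_orphans()
        if orphans:
            return sorted(orphans)[0]
        if len(self.ec2_instances) >= self.instance_cap:
            return None
        return self._launch()

    def _launch(self):
        # Stand-in for starting a new EC2 instance.
        instance_id = f"i-{len(self.ec2_instances):04d}"
        self.ec2_instances.add(instance_id)
        self.known_agents.add(instance_id)
        return instance_id


# State right after a restart: one instance exists in AWS, Jenkins knows none.
cloud = ToyCloud(instance_cap=1, ec2_instances={"i-orphan"})
print(cloud.provision_buggy())  # no agent is returned, and the orphan stays unknown
print(cloud.provision_fixed())  # the orphan is re-attached and reused
```

With `instance_cap=1`, a single orphan is enough to fill the cap, which matches the reporter's observation that a low cap makes the problem more visible.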

            manojtr Manoj added a comment -

            Hi, we still face the exact same issue with Jenkins (2.222.4) and the EC2 plugin (1.50.2.1). We see it mostly with Windows instances, which are built using the Groovy EC2 config injected as an init script. Is there any update on whether the fix has been released, or a timeline for it? Looking at this shows it is not yet released. An update would be helpful.

            manojtr Manoj added a comment -

            Today I tried version 1.53 and the issue is not resolved.

            SlaveTemplate{ami='ami-038e073abe89730b3', labels='win2016dlp'}. Attempting to provision slave needed by excess workload of 1 units
            Nov 04, 2020 12:36:01 PM INFO hudson.plugins.ec2.EC2Cloud getNewOrExistingAvailableSlave
            SlaveTemplate{ami='ami-038e073abe89730b3', labels='win2016dlp'}. Cannot provision - no capacity for instances: 0
            Nov 04, 2020 12:36:01 PM WARNING hudson.plugins.ec2.EC2Cloud provision
            Can't raise nodes for SlaveTemplate{ami='ami-038e073abe89730b3', labels='win2016dlp'}
            

            However, Jenkins identified the node, and the node details screen shows a "Launch Agent" button, but the agent is not running.

            Please note that this is a Windows agent.

            raihaan Raihaan Shouhell added a comment -

            Manoj, the identification of the agent is what this ticket is about, so I will assume this is resolved. Could you open a new one sharing more details about your situation?

             

            manojtr Manoj added a comment -

            Raihaan Shouhell, but I think the description of this issue says exactly what I described above. Do I still need to open another ticket? Sorry, I am confused: what do you mean by the identification of the agent?


              People

              Assignee:
              thoulen FABRIZIO MANFREDI
              Reporter:
              jbochenski Jakub Bochenski
              Votes:
              0
              Watchers:
              8
