• Type: Bug
    • Resolution: Fixed
    • Priority: Critical
    • Component: ec2-plugin
    • None
    • Environment: Jenkins ver. 2.176.1, 2.204.2
      ec2 plugin 1.43, 1.44, 1.45, 1.49.1
    • Released As: ec2 1.51

      Sometimes after a Jenkins restart the plugin won't be able to spawn more agents.

      The plugin will just loop on this:

      SlaveTemplate{ami='ami-0efbb291c6e8cc847', labels='docker'}. Attempting to provision slave needed by excess workload of 1 units
      May 31, 2019 2:23:53 PM INFO hudson.plugins.ec2.EC2Cloud getNewOrExistingAvailableSlave
      SlaveTemplate{ami='ami-0efbb291c6e8cc847', labels='docker'}. Cannot provision - no capacity for instances: 0
      May 31, 2019 2:23:53 PM WARNING hudson.plugins.ec2.EC2Cloud provision
      Can't raise nodes for SlaveTemplate{ami='ami-0efbb291c6e8cc847', labels='docker'}
      

      If I go to the EC2 console and terminate the instance manually, the plugin will spawn a new one and use it.

      It seems like there is some mismatch in the plugin logic. The part responsible for calculating the number of instances and checking the cap sees the EC2 instance. However, the part responsible for picking up running EC2 instances doesn't seem to be able to find it.
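
      A rough Jenkins script console sketch of one way to see the mismatch is below; it is an assumption on my part (it relies on the plugin's hudson.plugins.ec2.EC2AbstractSlave node class and is not code from the plugin itself) and simply lists the EC2 agents Jenkins still knows about:

      import jenkins.model.Jenkins
      import hudson.plugins.ec2.EC2AbstractSlave

      // List the EC2 agents Jenkins is currently aware of.
      def ec2Nodes = Jenkins.get().nodes.findAll { it instanceof EC2AbstractSlave }
      println "EC2 agents attached to Jenkins: " + ec2Nodes.collect { it.nodeName }
      // If this prints an empty list while the EC2 console still shows a running
      // instance for the template's AMI, that instance is effectively orphaned:
      // it counts against the instance cap but is never used.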

      We use a single subnet, security group and VPC (I've seen some reports about this causing problems).

      We use the instanceCap = 1 setting as we are testing the plugin; this might make the problem more visible than with a higher cap.

          [JENKINS-57795] Orphaned EC2 instances after Jenkins restart

          Jakub Bochenski added a comment - - edited

          Do you have any logs from the retention strategy? I see in your logs that the instances were stopped but I'm not certain why.

          I'm not sure what you mean. I have pasted all of the Jenkins log output here already.
          Do you want me to enable DEBUG level logging for some components?
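
          As a side note, a minimal sketch of how one could temporarily raise the plugin's logging from the script console is below (a log recorder under Manage Jenkins is the usual, persistent route; the logger name is just the plugin's package):

          import java.util.logging.Level
          import java.util.logging.Logger

          // FINE is roughly DEBUG; this lasts until the next restart unless a
          // Jenkins log recorder is also configured for the same logger.
          Logger.getLogger("hudson.plugins.ec2").setLevel(Level.FINE)
          println "hudson.plugins.ec2 level: " + Logger.getLogger("hudson.plugins.ec2").getLevel()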

          Jakub Bochenski added a comment -

          raihaan I have filed a separate issue about the agents dying during launch, as it happens independently of this issue.

          Pierson Yieh added a comment - - edited

          We've also seen this behavior before, though we're not sure how to reproduce the problem. We saw it when we'd hit our max AWS request limit: Jenkins started losing track of nodes and couldn't spin up new ones because the orphaned nodes were still being counted towards the max instance count, but weren't showing up in the Jenkins UI.

          I'm able to "simulate" the "losing track of nodes" by running a groovy script on the Jenkins Master to manually remove the node for the Jenkins object. And we're looking into implementing a feature to automatically re-attach these orphaned nodes to Jenkins. 

          Update: it seems SlaveTemplate.checkInstance() finds our orphan nodes and we were able to re-attach them to the Jenkins Master. Not sure why in the past they weren't getting re-attached.

          Jakub Bochenski added a comment -

          I was able to resolve my problem in the same way as described in https://issues.jenkins-ci.org/browse/JENKINS-61370?focusedCommentId=388247&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-388247

          The swallowing of useful error output is a big issue that should be improved.

          Pierson Yieh added a comment -

          jbochenski Was the problem you solved the issue of orphan nodes not getting reconnected, or of agents dying during launch? Our issue is orphan nodes not getting re-attached to their respective Jenkins Masters.

          Pierson Yieh added a comment - - edited

          We've identified the cause of our issue. The orphan re-attachment logic is tied to the EC2Cloud's provision method, but the issue occurs when the actual number of existing AWS nodes has hit an instance cap (i.e. no more nodes can be provisioned). Because we've hit an instance cap, provisioning isn't even attempted and the orphan re-attachment logic isn't triggered. Submitted a PR here: https://github.com/jenkinsci/ec2-plugin/pull/448
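
          A self-contained toy sketch of the control flow being described is below; it is not the plugin's actual provision() code, just a model of the short-circuit: the cap check returns before the orphan re-attachment path is ever reached.

          // Toy model of the described flow; none of this is the plugin's real code.
          List provision(int instanceCap, int runningInstances, boolean orphanPresent) {
              if (runningInstances >= instanceCap) {
                  println "Cannot provision - no capacity for instances: 0"
                  return []                 // bails out before the orphan re-attachment path
              }
              // Only reached while below the cap, so a cap-filling orphan never gets re-attached:
              if (orphanPresent) {
                  println "Re-attaching orphaned instance"
                  return ["re-attached node"]
              }
              println "Launching new instance"
              return ["new node"]
          }

          // One orphaned instance already fills a cap of 1, so nothing ever happens:
          println provision(1, 1, true)     // prints the "no capacity" message, then []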

          Manoj added a comment -

          Hi, we still face the exact same issue with Jenkins (2.222.4) and EC2 plugin (1.50.2.1). We face this issue mostly with Windows instances, which are built using the groovy EC2 config injected as an init script. Any update on whether the fix is released, or any timeline for it? Looking at this shows it is not yet released. An update on this would be helpful.

          Manoj added a comment -

          Today I tried version 1.53 and the issue is not resolved.

          SlaveTemplate{ami='ami-038e073abe89730b3', labels='win2016dlp'}. Attempting to provision slave needed by excess workload of 1 units
          Nov 04, 2020 12:36:01 PM INFO hudson.plugins.ec2.EC2Cloud getNewOrExistingAvailableSlave
          SlaveTemplate{ami='ami-038e073abe89730b3', labels='win2016dlp'}. Cannot provision - no capacity for instances: 0
          Nov 04, 2020 12:36:01 PM WARNING hudson.plugins.ec2.EC2Cloud provision
          Can't raise nodes for SlaveTemplate{ami='ami-038e073abe89730b3', labels='win2016dlp'}
          

          However, Jenkins identified the node and the node details screen shows a "Launch Agent" button, but the agent is not running.

          Please note, this is a Windows agent.

          Raihaan Shouhell added a comment -

          manojtr the identification of the agent is what this ticket is about, so I will assume this is resolved. Could you open a new one sharing more details about your situation?

          Manoj added a comment -

          raihaan but I think the description of this issue says exactly what I described above. Do I still need to open another ticket? Sorry, I am confused, what do you mean by the identification of the agent?

            Assignee: thoulen FABRIZIO MANFREDI
            Reporter: jbochenski Jakub Bochenski
            Votes: 0
            Watchers: 10
