[JENKINS-57795] Orphaned EC2 instances after Jenkins restart

Type: Bug
Resolution: Fixed
Priority: Critical
Component/s: ec2-plugin
Labels:
None
Environment:
Jenkins ver. 2.176.1, 2.204.2
ec2 plugin 1.43, 1.44, 1.45, 1.49.1

Released As:
ec2 1.51

Sometimes after a Jenkins restart the plugin won't be able to spawn more agents.

The plugin will just loop on this:

SlaveTemplate{ami='ami-0efbb291c6e8cc847', labels='docker'}. Attempting to provision slave needed by excess workload of 1 units
May 31, 2019 2:23:53 PM INFO hudson.plugins.ec2.EC2Cloud getNewOrExistingAvailableSlave
SlaveTemplate{ami='ami-0efbb291c6e8cc847', labels='docker'}. Cannot provision - no capacity for instances: 0
May 31, 2019 2:23:53 PM WARNING hudson.plugins.ec2.EC2Cloud provision
Can't raise nodes for SlaveTemplate{ami='ami-0efbb291c6e8cc847', labels='docker'}

If I go to the EC2 console and terminate the instance manually, the plugin will spawn a new one and use it.

It seems like there is some mismatch in the plugin logic. The part responsible for calculating the number of instances and checking the cap sees the EC2 instance. However, the part responsible for picking up running EC2 instances doesn't seem to be able to find it.

We use a single subnet, security group and VPC (I've seen some reports about this causing problems).

We use the instanceCap = 1 setting as we are testing the plugin; this might make the problem more visible than with a higher cap.
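The suspected mismatch can be modelled with a small Java sketch. This is not the plugin's code: the class, the method names and the `i-example` instance ID are hypothetical, and it only assumes that the cap check counts every running instance while the agent list held by Jenkins is empty after the restart.

```java
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

public class CapMismatchSketch {

    // Capacity as a cap check would see it: every running instance counts,
    // whether or not Jenkins has an agent for it. (Hypothetical helper.)
    static int remainingCapacity(int instanceCap, List<String> runningOnEc2) {
        return Math.max(0, instanceCap - runningOnEc2.size());
    }

    // Instances running on EC2 that have no matching Jenkins agent. (Hypothetical helper.)
    static List<String> orphaned(List<String> runningOnEc2, Set<String> knownToJenkins) {
        return runningOnEc2.stream()
                .filter(id -> !knownToJenkins.contains(id))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> runningOnEc2 = List.of("i-example"); // placeholder instance ID
        Set<String> knownToJenkins = Set.of();            // assumed empty after the restart

        System.out.println("remaining capacity: " + remainingCapacity(1, runningOnEc2));
        System.out.println("orphaned: " + orphaned(runningOnEc2, knownToJenkins));
    }
}
```

With instanceCap = 1 this prints a remaining capacity of 0 and one orphaned instance, which corresponds to the "no capacity for instances: 0" loop while no agent is usable.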

Attachments:
start_fresh_1.46-rc1050.43f9773eed95.txt (5 kB, 2019-09-19 11:32)
jenkins.temp_dsl.log (12 kB, 2019-09-16 10:11)
jenkins_201909121030.log (15 kB, 2019-09-12 11:28)

links to

PR-448

Jakub Bochenski created issue - 2019-05-31 15:04

Jakub Bochenski made changes - 2019-05-31 15:07

Description

Original: It seems the problems do not occur when I do a `/safeRestart` but they do if I use e.g. "restart Jenkins when no jobs are running" from the Update Center.

New: It seems the problems do not occur when I do a {{/safeRestart}} but they do if I use e.g. "restart Jenkins when no jobs are running" from the Update Center.

Jakub Bochenski added a comment - 2019-06-25 13:42

thoulen it would be nice to at least get some pointers on how to debug this further or work around it


Jakub Bochenski added a comment - 2019-06-26 13:56

raihaan maybe you would care to respond?


FABRIZIO MANFREDI added a comment - 2019-06-26 15:02

Can you tell me which version you are using?

There is a bug in the calculation, but it should not affect your case.

What is the configuration of your pool?

Do you have more than one pool with the same description, AMI and tags?

Can you try with 2?


Jakub Bochenski made changes - 2019-06-26 15:47

Description

Original: It seems the problems do not occur when I do a {{/safeRestart}} but they do if I use e.g. "restart Jenkins when no jobs are running" from the Update Center.

New: (paragraph removed)

Jakub Bochenski added a comment - 2019-06-26 15:48 - edited 2019-06-26 15:51

This has been happening since at least 1.43, and it just happened on 1.44.

I have only one EC2 cloud configured, but I also have an ECS cloud (they use separate agent labels).

This is our cloud configuration, done via a Groovy script:

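// Imports below are assumed; the original comment did not show them.
import com.amazonaws.services.ec2.model.InstanceType
import hudson.model.Node
import hudson.plugins.ec2.AmazonEC2Cloud
import hudson.plugins.ec2.EC2Tag
import hudson.plugins.ec2.SlaveTemplate
import hudson.plugins.ec2.UnixData

// `config` is assumed to be defined elsewhere in the reporter's script.
final cloud = new AmazonEC2Cloud(
        'ec2',
        false,
        config.ec2_access_key,
        config.ec2_region,
        config.ec2_ssh_key,
        config.ec2_instance_cap,
        [
                new SlaveTemplate(
                        config.ec2_ami_id,
                        '',
                        null,
                        config.ec2_security_groups,
                        '/tmp',
                        InstanceType.fromValue(config.ec2_instance_type),
                        false,
                        config.ec2_label,
                        Node.Mode.NORMAL,
                        "ec2 (${config.ec2_ami_id})",
                        '',
                        '/tmp',
                        '',
                        '1',
                        config.ec2_remote_user,
                        new UnixData(null, null, null, null),
                        '',
                        false,
                        config.ec2_subnet_id,
                        // Tags applied to every instance launched from this template
                        [
                                Name: 'acme',
                                Contact: 'acme@acme.com',
                        ].collect { new EC2Tag(it.key, it.value) },
                        '30',
                        false,
                        '',
                        config.ec2_arn_role,
                        true,
                        false,
                        false,
                        '1800',
                        false,
                        '',
                        false,
                        false,
                        false,
                        false
                )],
        config.ec2_arn_role,
        ''
)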


Jakub Bochenski added a comment - 2019-06-27 10:47

"Can you try with 2?"

If I reproduce the issue with instance cap = 1 and then increase the cap to 2, I get a new agent spawned (but only 1).

I am now trying to reproduce this with 2 instances getting orphaned.

I also tried setting the instance cap on the slave template to 2 (it was blank before); that doesn't seem to help.


Jakub Bochenski added a comment - 2019-06-27 11:38

I'm now getting this situation with instance cap = 2. I have two matching instances on EC2, and both are active.
The plugin is looping with the above message, with no agents available for the builds.

When I terminated one of the instances, an interesting thing happened: Jenkins was able to pick up the other instance and reconnect it.

SlaveTemplate{ami='ami-0efbb291c6e8cc847', labels='docker docker-bakery'}. Cannot provision - no capacity for instances: 0
Jun 27, 2019 11:35:07 AM WARNING hudson.plugins.ec2.EC2Cloud provision
Can't raise nodes for SlaveTemplate{ami='ami-0efbb291c6e8cc847', labels='docker docker-bakery'}
Jun 27, 2019 11:35:16 AM INFO hudson.plugins.ec2.EC2Cloud provision
SlaveTemplate{ami='ami-0efbb291c6e8cc847', labels='docker docker-bakery'}. Attempting to provision slave needed by excess workload of 1 units
Jun 27, 2019 11:35:17 AM INFO hudson.plugins.ec2.SlaveTemplate logProvisionInfo
SlaveTemplate{ami='ami-0efbb291c6e8cc847', labels='docker docker-bakery'}. Considering launching
Jun 27, 2019 11:35:17 AM INFO hudson.plugins.ec2.SlaveTemplate setupRootDevice
AMI had xvda
Jun 27, 2019 11:35:17 AM INFO hudson.plugins.ec2.SlaveTemplate setupRootDevice
{DeleteOnTermination: true,SnapshotId: snap-0b70f104d64ae4a48,VolumeSize: 8,VolumeType: gp2,Encrypted: false,}
Jun 27, 2019 11:35:17 AM INFO hudson.plugins.ec2.SlaveTemplate logProvisionInfo
SlaveTemplate{ami='ami-0efbb291c6e8cc847', labels='docker docker-bakery'}. Setting Instance Initiated Shutdown Behavior : ShutdownBehavior.Terminate
Jun 27, 2019 11:35:17 AM INFO hudson.plugins.ec2.SlaveTemplate logProvisionInfo
SlaveTemplate{ami='ami-0efbb291c6e8cc847', labels='docker docker-bakery'}. Looking for existing instances with describe-instance: {Filters: SNAP
Jun 27, 2019 11:35:18 AM INFO hudson.plugins.ec2.SlaveTemplate logProvisionInfo
SlaveTemplate{ami='ami-0efbb291c6e8cc847', labels='docker docker-bakery'}. checkInstance: i-0e454aea630ccb88f. true - Instance is not connected to Jenkins


Jakub Bochenski added a comment - 2019-06-27 11:40

The above looks like there may be an "off by one" error: the plugin won't attempt to reconnect instances when it is at the instance cap.
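One way to picture the behaviour this hypothesis implies is to order the decision so that reuse is tried before the cap is consulted. The Java sketch below is illustrative only: `decide`, its return strings and the instance IDs are hypothetical, and it does not claim to match the plugin's code or the eventual fix.

```java
import java.util.List;
import java.util.Set;

public class ReconnectBeforeCapSketch {

    // Decide what to do for one unit of excess workload. (Hypothetical helper.)
    static String decide(int instanceCap, List<String> runningOnEc2, Set<String> connected) {
        // First look for a running instance that has no Jenkins agent attached.
        for (String id : runningOnEc2) {
            if (!connected.contains(id)) {
                return "reconnect " + id;
            }
        }
        // Only when nothing can be reused does the cap decide whether to launch.
        if (runningOnEc2.size() < instanceCap) {
            return "launch";
        }
        return "none";
    }

    public static void main(String[] args) {
        // Two orphaned instances at cap = 2, as in the comment above (placeholder IDs).
        System.out.println(decide(2, List.of("i-a", "i-b"), Set.of()));
    }
}
```

With this ordering, being at the cap no longer blocks picking up an instance that is already running but not connected to Jenkins.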


Assignee: FABRIZIO MANFREDI
Reporter: Jakub Bochenski
Votes: 0
Watchers: 10
Created: 2019-05-31 15:04
Updated: 2020-11-16 00:30
Resolved: 2020-11-08 11:26
