The Jenkins EC2 plugin isn't starting existing (stopped) nodes; it always starts a new one instead. I can see the past instances listed as offline in the Nodes tab, but the plugin never starts them and launches a new instance every time.
Relates to: JENKINS-64520 EC2 node not start after stop/disconnect with parameter Idle termination time (Open)
[JENKINS-67190] EC2-plugin not spooling up stopped nodes, starting new nodes instead
I have set the instance cap to 1, and that just makes the queue get stuck. Jenkins says the stopped instance is offline and won't try to connect to it. The same thing happens if I turn the instance on in AWS: I need to go to Jenkins and select "Launch agent" for Jenkins to recognize the agent as online and for the build to start.
The expected behavior is that Jenkins starts the instance automatically and tries to reconnect to it.
Bruno Esteves
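The expected behavior described above can be sketched as a small decision function. This is only an illustration of what the reporters expect, not the plugin's actual provisioning code; the names `Instance`, `State`, `Action`, and `decide` are hypothetical, and it assumes that stopped instances count toward the instance cap (which would be consistent with the queue getting stuck at a cap of 1).

```java
import java.util.List;

// Hypothetical model of the expected provisioning decision; not taken from the EC2 plugin.
class ReuseStoppedInstanceSketch {
    enum State { RUNNING, STOPPED }
    enum Action { START_STOPPED, PROVISION_NEW, WAIT }

    // Minimal stand-in for an EC2 agent instance known to Jenkins.
    static final class Instance {
        final String id;
        final State state;

        Instance(String id, State state) {
            this.id = id;
            this.state = state;
        }
    }

    // Prefer restarting a stopped instance; only provision when under the cap.
    // Assumption: every known instance, stopped or running, counts toward the cap.
    static Action decide(List<Instance> instances, int instanceCap) {
        boolean hasStopped = instances.stream().anyMatch(i -> i.state == State.STOPPED);
        if (hasStopped) {
            return Action.START_STOPPED;
        }
        if (instances.size() < instanceCap) {
            return Action.PROVISION_NEW;
        }
        return Action.WAIT;
    }

    public static void main(String[] args) {
        List<Instance> known = List.of(new Instance("i-stopped", State.STOPPED));
        System.out.println(decide(known, 1));
    }
}
```

Under this model, a cap of 1 with one stopped instance leads to a restart of that instance rather than a stuck queue or a second instance.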
I'm getting this when I try to manually launch the agents:
Nov 23, 2021 4:48:10 PM INFO hudson.model.AsyncPeriodicWork lambda$doRun$0
Started EC2 alive agents monitor
Nov 23, 2021 4:48:10 PM INFO hudson.model.AsyncPeriodicWork lambda$doRun$0
Finished EC2 alive agents monitor. 110 ms
Nov 23, 2021 4:48:30 PM WARNING hudson.plugins.ec2.win.WinConnection pingFailingIfSSHHandShakeError
Failed to verify connectivity to Windows agent
java.net.SocketTimeoutException: connect timed out
    at java.base/java.net.PlainSocketImpl.waitForConnect(Native Method)
    at java.base/java.net.PlainSocketImpl.socketConnect(PlainSocketImpl.java:107)
    at java.base/java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:399)
    at java.base/java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:242)
    at java.base/java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:224)
    at java.base/java.net.Socket.connect(Socket.java:609)
    at com.hierynomus.protocol.commons.socket.ProxySocketFactory.createSocket(ProxySocketFactory.java:87)
    at com.hierynomus.protocol.commons.socket.ProxySocketFactory.createSocket(ProxySocketFactory.java:63)
    at com.hierynomus.smbj.transport.tcp.direct.DirectTcpTransport.connect(DirectTcpTransport.java:88)
    at com.hierynomus.smbj.connection.Connection.connect(Connection.java:139)
    at com.hierynomus.smbj.SMBClient.getEstablishedOrConnect(SMBClient.java:96)
    at com.hierynomus.smbj.SMBClient.connect(SMBClient.java:71)
    at hudson.plugins.ec2.win.WinConnection.pingFailingIfSSHHandShakeError(WinConnection.java:135)
    at hudson.plugins.ec2.win.EC2WindowsLauncher.connectToWinRM(EC2WindowsLauncher.java:189)
    at hudson.plugins.ec2.win.EC2WindowsLauncher.launchScript(EC2WindowsLauncher.java:52)
    at hudson.plugins.ec2.EC2ComputerLauncher.launch(EC2ComputerLauncher.java:48)
    at hudson.slaves.SlaveComputer.lambda$_connect$0(SlaveComputer.java:293)
    at jenkins.util.ContextResettingExecutorService$2.call(ContextResettingExecutorService.java:46)
    at jenkins.security.ImpersonatingExecutorService$2.call(ImpersonatingExecutorService.java:80)
    at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
    at java.base/java.lang.Thread.run(Thread.java:834)
Bruno Esteves
We're having the same issue on our Jenkins instance. I'm able to unblock it by manually turning the EC2 instance on in AWS; after that, everything continues to work as normal. This happens with both our Windows and Linux agents.
Dante Kiaunis
Same issue on my instance: no error, no exception, and no call is sent to AWS EC2 to initiate instance startup.
The instance cap is not respected at the global or AMI level.
We have to force instances to be deleted to control the number of nodes.
It's a major issue, as it leads to a loss of cost control.
florian Locqueneux
I also encountered this issue in my Jenkins setup. I can confirm that it only occurs on EC2 plugin version 1.66; after downgrading to 1.65, things start working.
My bad. That was actually a rerun build, which just used Jenkins scheduling to reuse the same node.
The issue still persists in 1.65 when scheduling a new build.
Cuong
greyjackal, have you found a way to reproduce this reliably? We experience almost the same thing but have not been able to reproduce it consistently; it seems to happen randomly. Also, for completeness, what does your EC2 template configuration look like? Do you have additional arguments, a boot delay, etc.?
Matthias Glastra
I, too, am experiencing this problem, and may have something to contribute. In my first attempt, which resulted in the plugin launching a new instance, the logs show it looking for an instance that matched:
hudson.plugins.ec2.SlaveTemplate#logProvisionInfo: SlaveTemplate{description='AWS Linux 2', labels='ec2'}. Looking for existing instances with describe-instance: {Filters: [{Name: image-id,Values: [ami-xxxxxxxxxxxxxxxxx]}, {Name: instance-type,Values: [c5.4xlarge]}, {Name: key-name,Values: [jenkins]}, {Name: availability-zone,Values: [us-east-2]}, {Name: tenancy,Values: [default]}, {Name: subnet-id,Values: [subnet-xxxxxxxx]}, {Name: tag:jenkins_slave_type,Values: [demand_AWS Linux 2]}, {Name: tag:jenkins_server_url,Values: [https://foo.com/jenkins/]}],InstanceIds: [],}
I then logged in to the AWS EC2 console Instances dialog and began entering the filters above. The list of instances was fine until I added the availability-zone filter: AWS suggested us-east-2a, us-east-2b, and us-east-2c instead of plain us-east-2. The web dialog accepted us-east-2* and continued to list the appropriate instances, so I changed the Availability Zone field in the plugin settings to that value as well. The second problematic filter was subnet-id. I have multiple subnets configured in the plugin settings for round robin, but a specific subnet-id in the filter above would interfere with that, so I removed the list of subnet IDs from the plugin settings. After making those two changes, a subsequent build succeeded in finding and starting an existing instance:
hudson.plugins.ec2.SlaveTemplate#logProvisionInfo: SlaveTemplate{description='AWS Linux 2', labels='ec2'}. Looking for existing instances with describe-instance: {Filters: [{Name: image-id,Values: [ami-xxxxxxxxxxxxxxxxx]}, {Name: instance-type,Values: [c5.4xlarge]}, {Name: key-name,Values: [jenkins]}, {Name: availability-zone,Values: [us-east-2*]}, {Name: tenancy,Values: [default]}, {Name: tag:jenkins_slave_type,Values: [demand_AWS Linux 2]}, {Name: tag:jenkins_server_url,Values: [https://mathkins.pfxdev.com/jenkins/]}],InstanceIds: [],}
I hope this helps. Also, I'd much prefer subnet name to subnet ID, as that could be wildcarded.
David Drum
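To illustrate why the availability-zone value matters in the comment above, here is a minimal Java sketch of `*` wildcard matching on a filter value. It is a local simulation only, assuming that a filter value must match the whole attribute and that `*` stands for zero or more characters; it does not call AWS or the plugin, and the class and method names are hypothetical. Other wildcard or escape characters are deliberately not handled.

```java
import java.util.List;
import java.util.regex.Pattern;

// Local simulation of '*' wildcard matching on a filter value; not AWS or plugin code.
class FilterWildcardSketch {
    // Translate a filter value into a regex: '*' becomes ".*", everything else is literal.
    static Pattern toPattern(String filterValue) {
        StringBuilder regex = new StringBuilder();
        for (char c : filterValue.toCharArray()) {
            if (c == '*') {
                regex.append(".*");
            } else {
                regex.append(Pattern.quote(String.valueOf(c)));
            }
        }
        return Pattern.compile(regex.toString());
    }

    // True when the whole attribute value matches the filter value.
    static boolean matches(String filterValue, String actual) {
        return toPattern(filterValue).matcher(actual).matches();
    }

    public static void main(String[] args) {
        List<String> zones = List.of("us-east-2a", "us-east-2b", "us-east-2c");
        for (String zone : zones) {
            System.out.println(zone
                    + " plain=" + matches("us-east-2", zone)
                    + " wildcard=" + matches("us-east-2*", zone));
        }
    }
}
```

In this model the plain value us-east-2 matches none of the three zones, while us-east-2* matches all of them, which is consistent with the describe-instance lookup finding no existing instance until the setting was changed.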