
EC2-plugin not spooling up stopped nodes, starting new nodes instead

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Blocker
    • Component: ec2-plugin
    • Labels: None
    • Environment: Jenkins 2.303.3, EC2 plugin 1.66

      The Jenkins EC2 plugin isn't starting existing (stopped) nodes; it always starts a new one instead. I can see the past instances listed as offline in the nodes tab, but the plugin never starts them and launches a new instance every time.

          [JENKINS-67190] EC2-plugin not spooling up stopped nodes, starting new nodes instead

          Bruno Esteves added a comment -

          I have set the instance cap to 1, and that just leaves the queue stuck. Jenkins reports the stopped instance as offline and won't try to connect to it. The same thing happens if I start the instance in AWS: I have to go to Jenkins and select "launch agent" before Jenkins recognizes the agent as online and the build starts.

          The expected behavior is that Jenkins starts the instance automatically and tries to reconnect to it.
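
The expected behavior described above can be sketched as a small decision function. This is an illustrative model only, not the EC2 plugin's actual code: the names `Instance` and `plan_provisioning` and the action strings are hypothetical, and the model ignores whether a running agent is busy or idle.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Instance:
    instance_id: str
    state: str  # e.g. "running", "stopped", "terminated"


def plan_provisioning(instances: List[Instance], instance_cap: int) -> Tuple[str, str]:
    """Decide what to do when a build needs an agent (hypothetical model).

    Returns an (action, instance_id) pair; instance_id is "" when no
    existing instance is involved.
    """
    # Prefer restarting a stopped instance over launching a new one.
    for inst in instances:
        if inst.state == "stopped":
            return ("start_and_reconnect", inst.instance_id)

    # Only non-terminated instances count against the cap in this model.
    live = [i for i in instances if i.state != "terminated"]
    if len(live) < instance_cap:
        return ("launch_new", "")

    # Cap reached and nothing to restart: the build has to wait.
    return ("wait", "")
```

With an instance cap of 1 and one stopped instance, this model returns `("start_and_reconnect", ...)`, whereas the behavior reported here corresponds to the build staying queued.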


          Bruno Esteves added a comment -

          I'm getting this when I try to launch the agents manually:

          Nov 23, 2021 4:48:10 PM INFO hudson.model.AsyncPeriodicWork lambda$doRun$0
          Started EC2 alive agents monitor
          Nov 23, 2021 4:48:10 PM INFO hudson.model.AsyncPeriodicWork lambda$doRun$0
          Finished EC2 alive agents monitor. 110 ms
          Nov 23, 2021 4:48:30 PM WARNING hudson.plugins.ec2.win.WinConnection pingFailingIfSSHHandShakeError
          Failed to verify connectivity to Windows agent
          java.net.SocketTimeoutException: connect timed out
              at java.base/java.net.PlainSocketImpl.waitForConnect(Native Method)
              at java.base/java.net.PlainSocketImpl.socketConnect(PlainSocketImpl.java:107)
              at java.base/java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:399)
              at java.base/java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:242)
              at java.base/java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:224)
              at java.base/java.net.Socket.connect(Socket.java:609)
              at com.hierynomus.protocol.commons.socket.ProxySocketFactory.createSocket(ProxySocketFactory.java:87)
              at com.hierynomus.protocol.commons.socket.ProxySocketFactory.createSocket(ProxySocketFactory.java:63)
              at com.hierynomus.smbj.transport.tcp.direct.DirectTcpTransport.connect(DirectTcpTransport.java:88)
              at com.hierynomus.smbj.connection.Connection.connect(Connection.java:139)
              at com.hierynomus.smbj.SMBClient.getEstablishedOrConnect(SMBClient.java:96)
              at com.hierynomus.smbj.SMBClient.connect(SMBClient.java:71)
              at hudson.plugins.ec2.win.WinConnection.pingFailingIfSSHHandShakeError(WinConnection.java:135)
              at hudson.plugins.ec2.win.EC2WindowsLauncher.connectToWinRM(EC2WindowsLauncher.java:189)
              at hudson.plugins.ec2.win.EC2WindowsLauncher.launchScript(EC2WindowsLauncher.java:52)
              at hudson.plugins.ec2.EC2ComputerLauncher.launch(EC2ComputerLauncher.java:48)
              at hudson.slaves.SlaveComputer.lambda$_connect$0(SlaveComputer.java:293)
              at jenkins.util.ContextResettingExecutorService$2.call(ContextResettingExecutorService.java:46)
              at jenkins.security.ImpersonatingExecutorService$2.call(ImpersonatingExecutorService.java:80)
              at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
              at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
              at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
              at java.base/java.lang.Thread.run(Thread.java:834)
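
The trace above ends in a plain TCP connect timeout, so one way to narrow it down is to check from the Jenkins controller whether the agent's port is reachable at all. The following is a minimal sketch using only the Python standard library; `can_connect` is a hypothetical helper, and the port to test is an assumption (the trace goes through smbj's `DirectTcpTransport`, which suggests the SMB port, 445 by default, but verify which ports your setup actually uses).

```python
import socket


def can_connect(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers timeouts, refused connections, and name-resolution failures.
        return False
```

For example, `can_connect("10.0.0.12", 445)` (the address is a placeholder) returning `False` while the instance is stopped, or before its security group allows the port, would be consistent with the timeout logged above.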

           


          Bruno Esteves added a comment -

          I should also mention that the slaves are Windows instances.


          Dante Kiaunis added a comment - - edited

          We're having the same issue on our Jenkins instance. I'm able to unblock it by manually starting the EC2 instance in AWS; after that, everything continues to work as normal. This happens with both our Windows and Linux agents.


          florian Locqueneux added a comment -

          Same issue on my instance: no error, no exception, and no call is sent to AWS EC2 to initiate instance startup.

          The instance cap is not respected at the global or AMI level.

          We have to force instances to be deleted to control the number of subnodes.

          It's a major issue, as it leads to a loss of cost control.

          Cuong added a comment - - edited

          I also encountered this issue in my Jenkins setup.
          Can confirm that it only occurs on EC2 plugin version 1.66. After downgrading to 1.65, things started working.
          My bad: that was actually a rerun build that just used Jenkins scheduling to reuse the same node.
          The issue still persists in 1.65 when scheduling a new build.


          Matthias Glastra added a comment -

          greyjackal, have you found a way to reproduce this reliably? We experience almost the same thing but have not been able to reproduce it consistently; it seems to happen randomly. Also, for completeness, what does your EC2 template configuration look like? Do you have additional arguments, a boot delay, etc.?

          David Drum added a comment - - edited

          I, too, am experiencing this problem, and may have something to contribute. In my first attempt, which resulted in the plugin launching a new instance, the logs show it looking for an instance that matched:

          hudson.plugins.ec2.SlaveTemplate#logProvisionInfo: SlaveTemplate{description='AWS Linux 2', labels='ec2'}. Looking for existing instances with describe-instance: {Filters: [{Name: image-id,Values: [ami-xxxxxxxxxxxxxxxxx]}, {Name: instance-type,Values: [c5.4xlarge]}, {Name: key-name,Values: [jenkins]}, {Name: availability-zone,Values: [us-east-2]}, {Name: tenancy,Values: [default]}, {Name: subnet-id,Values: [subnet-xxxxxxxx]}, {Name: tag:jenkins_slave_type,Values: [demand_AWS Linux 2]}, {Name: tag:jenkins_server_url,Values: [https://foo.com/jenkins/]}],InstanceIds: [],}
          

          I then logged in to the AWS EC2 console Instances dialog and began entering the filters above. The list of instances was fine until I added the availability-zone filter: AWS suggested us-east-2a, us-east-2b, and us-east-2c instead of plain us-east-2. The web dialog accepted us-east-2* and continued to list the appropriate instances, so I changed the Availability Zone field in the plugin settings to that value as well. The second problematic filter was subnet-id: I have multiple subnets configured in the plugin settings for round robin, but a single specific subnet-id in the filter above would interfere with that, so I removed the list of subnet IDs from the plugin settings. After making those two changes, a subsequent build succeeded in finding and starting an existing instance:

          hudson.plugins.ec2.SlaveTemplate#logProvisionInfo: SlaveTemplate{description='AWS Linux 2', labels='ec2'}. Looking for existing instances with describe-instance: {Filters: [{Name: image-id,Values: [ami-xxxxxxxxxxxxxxxxx]}, {Name: instance-type,Values: [c5.4xlarge]}, {Name: key-name,Values: [jenkins]}, {Name: availability-zone,Values: [us-east-2*]}, {Name: tenancy,Values: [default]}, {Name: tag:jenkins_slave_type,Values: [demand_AWS Linux 2]}, {Name: tag:jenkins_server_url,Values: [https://mathkins.pfxdev.com/jenkins/]}],InstanceIds: [],}
          

          I hope this helps. Also, I'd much prefer subnet name to subnet ID, as that could be wildcarded.
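
To illustrate why the two filter sets above behave differently, here is a minimal sketch that models describe-instances filtering locally: separate filters are ANDed, values within one filter are ORed, and `*` / `?` act as wildcards. It is not the plugin's code and makes no AWS calls; `matches_filters`, the sample instance, and the subnet IDs are hypothetical, and `fnmatchcase` is only an approximation of EC2's matching.

```python
from fnmatch import fnmatchcase
from typing import Dict, List


def matches_filters(instance: Dict[str, str], filters: Dict[str, List[str]]) -> bool:
    """Model of describe-instances filtering: filters are ANDed, values ORed."""
    for name, patterns in filters.items():
        value = instance.get(name)
        if value is None:
            return False
        # fnmatchcase handles the * and ? wildcards (approximation of EC2 matching).
        if not any(fnmatchcase(value, p) for p in patterns):
            return False
    return True


# A stopped agent as it might appear in EC2 (hypothetical values).
stopped_agent = {
    "instance-type": "c5.4xlarge",
    "availability-zone": "us-east-2b",
    "subnet-id": "subnet-bbbbbbbb",
}

# Region name instead of a zone, plus one specific subnet.
first_attempt = {
    "instance-type": ["c5.4xlarge"],
    "availability-zone": ["us-east-2"],
    "subnet-id": ["subnet-aaaaaaaa"],
}

# Wildcarded zone, no subnet filter.
second_attempt = {
    "instance-type": ["c5.4xlarge"],
    "availability-zone": ["us-east-2*"],
}

print(matches_filters(stopped_agent, first_attempt))   # False
print(matches_filters(stopped_agent, second_attempt))  # True
```

In this model the first filter set cannot match any instance, because no instance reports `us-east-2` as its availability zone; with no match, a plugin that relies on the query result would have nothing to restart.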

           


          Matthew Webber added a comment -

          Possibly related to JENKINS-64520 "EC2 node not start after stop/disconnect with parameter Idle termination time"?

            Assignee: Unassigned
            Reporter: greyjackal (Bruno Esteves)
            Votes: 4
            Watchers: 8