Jenkins / JENKINS-71554

EC2ConnectionUpdater is not reestablishing connection

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Major
    • Component: ec2-plugin

      Hi there, we have been noticing this issue where we get the following exception sporadically.

       

      hudson.plugins.ec2.EC2Cloud#provision: SlaveTemplate{description='cbci-amznl2-agent', labels='aws-aml2 aws-linux'}. Exception during provisioning
      com.amazonaws.services.ec2.model.AmazonEC2Exception: Request has expired. (Service: AmazonEC2; Status Code: 400; Error Code: RequestExpired; Request ID: 3de0d618-7b0c-4ef1-82f4-8eedf5bd8c88; Proxy: null)
          at com.amazonaws.http.AmazonHttpClient$RequestExecutor.handleErrorResponse(AmazonHttpClient.java:1879)
          at com.amazonaws.http.AmazonHttpClient$RequestExecutor.handleServiceErrorResponse(AmazonHttpClient.java:1418)
          at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeOneRequest(AmazonHttpClient.java:1387)
          at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeHelper(AmazonHttpClient.java:1157)
          at com.amazonaws.http.AmazonHttpClient$RequestExecutor.doExecute(AmazonHttpClient.java:814)
          at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeWithTimer(AmazonHttpClient.java:781)
          at com.amazonaws.http.AmazonHttpClient$RequestExecutor.execute(AmazonHttpClient.java:755)
          at com.amazonaws.http.AmazonHttpClient$RequestExecutor.access$500(AmazonHttpClient.java:715)
          at com.amazonaws.http.AmazonHttpClient$RequestExecutionBuilderImpl.execute(AmazonHttpClient.java:697)
          at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:561)
          at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:541)
          at com.amazonaws.services.ec2.AmazonEC2Client.doInvoke(AmazonEC2Client.java:34698)
          at com.amazonaws.services.ec2.AmazonEC2Client.invoke(AmazonEC2Client.java:34665)
          at com.amazonaws.services.ec2.AmazonEC2Client.invoke(AmazonEC2Client.java:34654)
          at com.amazonaws.services.ec2.AmazonEC2Client.executeDescribeImages(AmazonEC2Client.java:15117)
          at com.amazonaws.services.ec2.AmazonEC2Client.describeImages(AmazonEC2Client.java:15085)
          at hudson.plugins.ec2.SlaveTemplate.getImage(SlaveTemplate.java:1347)
          at hudson.plugins.ec2.SlaveTemplate.provision(SlaveTemplate.java:901)
          at hudson.plugins.ec2.EC2Cloud.getNewOrExistingAvailableSlave(EC2Cloud.java:717)
          at hudson.plugins.ec2.EC2Cloud.provision(EC2Cloud.java:743)
          at hudson.slaves.Cloud.provision(Cloud.java:210)
          at hudson.slaves.NodeProvisioner$StandardStrategyImpl.apply(NodeProvisioner.java:726)
          at hudson.slaves.NodeProvisioner.update(NodeProvisioner.java:325)
          at hudson.slaves.NodeProvisioner$NodeProvisionerInvoker.doRun(NodeProvisioner.java:823)
          at hudson.triggers.SafeTimerTask.run(SafeTimerTask.java:94)
          at jenkins.security.ImpersonatingScheduledExecutorService$1.run(ImpersonatingScheduledExecutorService.java:69)
          at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
          at java.base/java.util.concurrent.FutureTask.runAndReset(FutureTask.java:305)
          at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:305)
          at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
          at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
          at java.base/java.lang.Thread.run(Thread.java:829) 

       

      To unblock ourselves, we have been saving the cloud configuration with no changes, which seems to reestablish the connection to AWS.

       

      We have noticed that in some cases, there have been logs from the EC2ConnectionUpdater between instances of the above stack trace.

       

      I noticed that the EC2ConnectionUpdater works by catching an AmazonClientException and then triggering the reconnectToEc2 method on EC2Cloud.
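
The catch-and-reconnect behaviour described above can be sketched roughly as follows. This is a simplified illustration, not the plugin's source: `ClientException`, `Ec2Probe`, and `Reconnectable` are hypothetical stand-ins for `AmazonClientException`, the EC2 call made by the updater, and the `reconnectToEc2` method on `EC2Cloud`.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical stand-in for com.amazonaws.AmazonClientException. */
class ClientException extends RuntimeException {
    ClientException(String message) {
        super(message);
    }
}

/** Hypothetical stand-in for the EC2 call the updater uses as a health probe. */
interface Ec2Probe {
    void describeInstances();
}

/** Hypothetical stand-in for the cloud object that can rebuild its EC2 client. */
interface Reconnectable {
    void reconnect();
}

class ConnectionUpdaterSketch {
    /**
     * Runs one periodic check: probe the connection and reconnect if the probe
     * fails with a client exception. Returns true if a reconnect was triggered.
     */
    static boolean checkOnce(Ec2Probe probe, Reconnectable cloud) {
        try {
            probe.describeInstances();
            return false;
        } catch (ClientException e) {
            cloud.reconnect();
            return true;
        }
    }

    public static void main(String[] args) {
        AtomicInteger reconnects = new AtomicInteger();

        // A healthy probe leaves the connection alone.
        boolean first = checkOnce(() -> { }, reconnects::incrementAndGet);

        // A failing probe triggers exactly one reconnect.
        boolean second = checkOnce(() -> {
            throw new ClientException("Request has expired.");
        }, reconnects::incrementAndGet);

        System.out.println(first + " " + second + " " + reconnects.get());
    }
}
```

In a design of this shape, only a failure of the probe itself triggers a reconnect. An exception thrown on another code path, such as provisioning, never reaches this catch block.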

       

      The only real difference I can see between the calls made to EC2 from this plugin is that the EC2ConnectionUpdater calls describeInstances whereas the SlaveTemplate calls describeImages.

       

      If you need any more information from me please feel free to reach out.

       

      Thanks,

      Alan.


          lata kopalle added a comment -

          Also, we don't see this issue on controllers that have agents launching in the same AWS account as the controller.

          The scenario Alan described is specific to cross-account agents, where agents are set to launch in a different AWS account than the one hosting the controller.

          Raihaan Shouhell added a comment -

          Hi lata and a_devine, thank you for the report and your initial investigation.

          Since saving the config works, my guess is that the credentials have expired. You have also found the EC2ConnectionUpdater, whose job it is to check for this.

          Do you guys see any logs related to LOGGER.log(Level.SEVERE, "Timer task " + this + " failed", t); ? I'm guessing the EC2ConnectionUpdater would produce one with an AmazonEC2Exception instead of an AmazonClientException, which would explain why it doesn't reconnect.

          If you guys could provide me with information related to that, it would be very helpful. Thanks!

          Alan added a comment -

          Hi Raihaan,

          We will be sure to watch for that on the next occurrence.

          We have a log recorder set up targeting the EC2 Plugin, but it is quite noisy. Is there a specific class you would recommend targeting in a log recorder?

          Thanks,

          Alan.


          Raihaan Shouhell added a comment -

          Hey a_devine,

          It should be https://github.com/jenkinsci/jenkins/blob/e283ec92a086a1a96758bf578eb75349d4dcf1b8/core/src/main/java/hudson/triggers/SafeTimerTask.java#L91-L95

          This is what runs the EC2ConnectionUpdater, and it is the base class of https://github.com/jenkinsci/jenkins/blob/master/core/src/main/java/hudson/model/PeriodicWork.java#L63

          Cheers,

          Raihaan
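
To illustrate what targeting a single class means at the java.util.logging level, here is a minimal sketch that captures only SEVERE records from one named logger. It assumes the logger is named after the fully qualified class name `hudson.triggers.SafeTimerTask`. `TimerTaskLogCaptureSketch` and `CollectingHandler` are hypothetical names used for illustration and are not part of Jenkins.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.logging.Handler;
import java.util.logging.Level;
import java.util.logging.LogRecord;
import java.util.logging.Logger;

class TimerTaskLogCaptureSketch {
    /** Assumed logger name: the fully qualified class name of SafeTimerTask. */
    static final String LOGGER_NAME = "hudson.triggers.SafeTimerTask";

    /** Strong reference so the logger and its handlers are not garbage collected. */
    static final Logger LOGGER = Logger.getLogger(LOGGER_NAME);

    /** Keeps matching records in memory instead of writing them anywhere. */
    static class CollectingHandler extends Handler {
        final List<LogRecord> records = new ArrayList<>();

        @Override
        public void publish(LogRecord record) {
            // isLoggable applies this handler's level threshold.
            if (isLoggable(record)) {
                records.add(record);
            }
        }

        @Override
        public void flush() { }

        @Override
        public void close() { }
    }

    /** Attaches a collecting handler that only accepts records at or above the threshold. */
    static CollectingHandler attach(Level threshold) {
        CollectingHandler handler = new CollectingHandler();
        handler.setLevel(threshold);
        LOGGER.addHandler(handler);
        return handler;
    }

    public static void main(String[] args) {
        CollectingHandler handler = attach(Level.SEVERE);
        LOGGER.info("routine message, ignored by the handler");
        LOGGER.log(Level.SEVERE, "Timer task example failed", new RuntimeException("boom"));
        System.out.println(handler.records.size());
    }
}
```

A Jenkins log recorder is configured with the same two pieces of information: a logger name and a minimum level.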

          Edward added a comment -

          We are still having issues with this. We do not get any errors regarding the SafeTimerTask.

          In general we don't have any issues; the RequestExpired exceptions come in flurries.

          On the EC2 Cloud config page, under the Advanced tab, there are two fields, roleArn and roleSessionName, with no help or documentation around them. What are these for? The AWS credential is the IAM role ARN of the role in the target account (with the controller role in the trust policy). Are these extra settings required?

          I could well be wrong here, but this method https://github.com/jenkinsci/ec2-plugin/blob/4e585c3bab446ceac82cf7f2579b9e07a629640e/src/main/java/hudson/plugins/ec2/EC2Cloud.java#L931-L951 uses the roleSessionName and the roleArn to create an STS provider. So it's confusing whether these are linked to the cloud form fields and whether they should be filled in.

          Raihaan Shouhell added a comment -

          Hey przenie, the roleArn and roleSessionName are for cases where the credentials are used to assume a role in a different account.

          Hey a_devine and lata, regarding this issue: because RequestExpired can indicate credential issues (https://docs.aws.amazon.com/AWSEC2/latest/APIReference/errors-overview.html), I've added a workaround specifically to reconnect in these conditions. It's released as version 1609.v53b_02a_b_9e52d
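
One way such a workaround could look is sketched below. This is an illustration only and not necessarily what release 1609.v53b_02a_b_9e52d does: retrying once after the reconnect is an assumption of this sketch, and `ServiceException` and `RequestExpiredRetrySketch` are hypothetical names standing in for the AWS SDK exception type and the plugin code.

```java
import java.util.function.Supplier;

/** Hypothetical stand-in for an AWS service exception that carries an error code. */
class ServiceException extends RuntimeException {
    private final String errorCode;

    ServiceException(String errorCode, String message) {
        super(message);
        this.errorCode = errorCode;
    }

    String getErrorCode() {
        return errorCode;
    }
}

class RequestExpiredRetrySketch {
    /** Error code taken from the stack trace in this issue. */
    static final String REQUEST_EXPIRED = "RequestExpired";

    /**
     * Runs the call. If it fails with RequestExpired, reconnects and retries once.
     * Any other service error is rethrown unchanged without reconnecting.
     */
    static <T> T callWithReconnect(Supplier<T> call, Runnable reconnect) {
        try {
            return call.get();
        } catch (ServiceException e) {
            if (!REQUEST_EXPIRED.equals(e.getErrorCode())) {
                throw e;
            }
            reconnect.run();
            return call.get();
        }
    }

    public static void main(String[] args) {
        boolean[] fresh = {false};
        String image = callWithReconnect(
            () -> {
                // Simulate a stale client until reconnect has run.
                if (!fresh[0]) {
                    throw new ServiceException(REQUEST_EXPIRED, "Request has expired.");
                }
                return "image-found";
            },
            () -> fresh[0] = true);
        System.out.println(image);
    }
}
```

Matching on the error code rather than on the exception class keeps the reconnect limited to the condition reported here.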

            raihaan Raihaan Shouhell
            a_devine Alan
            Votes: 4
            Watchers: 6