JENKINS-63537

EC2SlaveMonitor terminates agent with running job when STS session expires

    Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Component/s: ec2-plugin
    • Labels:
      None

      Description

      OS: Amazon Linux 2
      JDK: Corretto-11.0.6.10.1
      Jenkins: 2.222.4

      Plugins:

      Amazon EC2 plugin - org.jenkins-ci.plugins:ec2:1.51
      Jackson-annotations - com.fasterxml.jackson.core:jackson-annotations:2.11.1
      mbassador - net.engio:mbassador:1.3.0
      SnakeYAML - org.yaml:snakeyaml:1.24
      smbj - com.hierynomus:smbj:0.10.0
      Multiline Secrets UI - io.jenkins.temp.jelly:multiline-secrets-ui:1.0
      jackson-databind - com.fasterxml.jackson.core:jackson-databind:2.11.1
      Jackson-dataformat-YAML - com.fasterxml.jackson.dataformat:jackson-dataformat-yaml:2.10.2
      asn-one - com.hierynomus:asn-one:0.4.0

      EC2 agents are launched on demand using the ec2-plugin and accessed over SSH. Credentials for creating new EC2 instances, performing operations on the instances, etc. are provided through an 'AssumeRole' call to STS. The STS session timeout is set to 15 minutes.

      Agents are created successfully, some jobs run to completion, and idle agents are terminated after 30 minutes, as configured in the EC2 plugin.

      I created a log recorder to collect entries from the following loggers:

      • hudson.plugins.ec2
      • com.amazonaws.request
      • com.amazonaws.http.response.AwsResponseHandlerAdapter
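      For reference, a minimal sketch of raising the level on the same three loggers with plain java.util.logging. This is an assumption about how one might reproduce the setup outside the Jenkins UI, not how Jenkins log recorders are implemented; the class name Ec2DebugLoggers is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.logging.Level;
import java.util.logging.Logger;

public class Ec2DebugLoggers {
    // Logger names taken from the list above.
    static final String[] NAMES = {
        "hudson.plugins.ec2",
        "com.amazonaws.request",
        "com.amazonaws.http.response.AwsResponseHandlerAdapter"
    };

    // Strong references: java.util.logging holds loggers weakly, so a
    // configured level could otherwise be lost after garbage collection.
    static final List<Logger> HELD = new ArrayList<>();

    static void enable(Level level) {
        HELD.clear();
        for (String name : NAMES) {
            Logger logger = Logger.getLogger(name);
            logger.setLevel(level);
            HELD.add(logger);
        }
    }

    public static void main(String[] args) {
        enable(Level.ALL);
        for (Logger logger : HELD) {
            System.out.println(logger.getName() + " -> " + logger.getLevel());
        }
    }
}
```

      Setting the logger level alone does not guarantee output; the attached handlers must also accept FINE/FINER/FINEST records.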

      The output of this recorder contains entries from the EC2ConnectionUpdater similar to the following:

          Aug 27, 2020 4:29:34 PM FINER hudson.plugins.ec2.EC2Cloud$EC2ConnectionUpdater doRun Checking EC2 Connection on: CLOUD
          Aug 27, 2020 4:29:34 PM FINE com.amazonaws.http.AmazonHttpClient$RequestExecutor executeOneRequest Sending Request: POST https://ec2.us-east-1.amazonaws.com/ / Parameters: ({"Action":["DescribeInstances"],"Version":["2016-11-15"]}Headers: (amz-sdk-invocation-id: ..., User-Agent: aws-sdk-java/1.11.821 Linux/4.14.121-109.96.amzn2.x86_64 OpenJDK_64-Bit_Server_VM/11.0.6+10-LTS java/11.0.6 groovy/2.4.12 vendor/Amazon.com_Inc., ) 
          Aug 27, 2020 4:29:36 PM FINEST com.amazonaws.http.StaxResponseHandler handle Parsing service response XML
          Aug 27, 2020 4:29:41 PM FINEST com.amazonaws.http.StaxResponseHandler handle Done parsing service response
          Aug 27, 2020 4:29:41 PM FINE com.amazonaws.http.response.AwsResponseHandlerAdapter handle Received successful response: 200, AWS Request ID: ...
      

       Every 15 minutes, we see messages similar to the following:

          Aug 27, 2020 4:33:34 PM FINER hudson.plugins.ec2.EC2Cloud$EC2ConnectionUpdater doRun Checking EC2 Connection on: CLOUD
          Aug 27, 2020 4:33:34 PM FINE com.amazonaws.http.AmazonHttpClient$RequestExecutor executeOneRequest Sending Request: POST https://ec2.us-east-1.amazonaws.com/ / Parameters: ({"Action":["DescribeInstances"],"Version":["2016-11-15"]}Headers: (amz-sdk-invocation-id: ..., User-Agent: aws-sdk-java/1.11.821 Linux/4.14.121-109.96.amzn2.x86_64 OpenJDK_64-Bit_Server_VM/11.0.6+10-LTS java/11.0.6 groovy/2.4.12 vendor/Amazon.com_Inc., ) 
          Aug 27, 2020 4:33:34 PM FINE com.amazonaws.http.AmazonHttpClient$RequestExecutor handleErrorResponse Received error response: com.amazonaws.services.ec2.model.AmazonEC2Exception: Request has expired. (Service: AmazonEC2; Status Code: 400; Error Code: RequestExpired; Request ID: ...; Proxy: null)
          Aug 27, 2020 4:33:34 PM FINE com.amazonaws.http.AmazonHttpClient$RequestExecutor executeOneRequest Retrying Request: POST https://ec2.us-east-1.amazonaws.com/ / Parameters: ({"Action":["DescribeInstances"],"Version":["2016-11-15"]}Headers: (amz-sdk-invocation-id: ..., User-Agent: aws-sdk-java/1.11.821 Linux/4.14.121-109.96.amzn2.x86_64 OpenJDK_64-Bit_Server_VM/11.0.6+10-LTS java/11.0.6 groovy/2.4.12 vendor/Amazon.com_Inc., ) 
          Aug 27, 2020 4:33:34 PM FINE com.amazonaws.http.AmazonHttpClient$RequestExecutor handleErrorResponse Received error response: com.amazonaws.services.ec2.model.AmazonEC2Exception: Request has expired. (Service: AmazonEC2; Status Code: 400; Error Code: RequestExpired; Request ID: ...; Proxy: null)
      

      There are always 17 HTTP 400 responses, followed by the following:

          Aug 27, 2020 4:34:51 PM FINER hudson.plugins.ec2.EC2Cloud$EC2ConnectionUpdater doRun Reconnecting to EC2 on: aws-ec2-nprod
          Aug 27, 2020 4:34:52 PM FINE com.amazonaws.http.AmazonHttpClient$RequestExecutor executeOneRequest Sending Request: POST https://sts.amazonaws.com / Parameters: ({"Action":["AssumeRole"],"Version":["2011-06-15"],"RoleArn":["..."],"RoleSessionName":["Jenkins"],"DurationSeconds":["900"]}Headers: (amz-sdk-invocation-id: ..., User-Agent: aws-sdk-java/1.11.821 Linux/4.14.121-109.96.amzn2.x86_64 OpenJDK_64-Bit_Server_VM/11.0.6+10-LTS java/11.0.6 groovy/2.4.12 vendor/Amazon.com_Inc., ) 
          Aug 27, 2020 4:34:52 PM FINEST com.amazonaws.http.StaxResponseHandler handle Parsing service response XML
          Aug 27, 2020 4:34:53 PM FINEST com.amazonaws.http.StaxResponseHandler handle Done parsing service response
          Aug 27, 2020 4:34:53 PM FINE com.amazonaws.http.response.AwsResponseHandlerAdapter handle Received successful response: 200, AWS Request ID: ...
      

      After the 'AssumeRole' call succeeds, we have a 15-minute window without issues, followed by another 17 HTTP 400 errors. The 17 retries span ~90 seconds.

      When we have jobs running on nodes created by the ec2-plugin, the node will occasionally be terminated by the EC2SlaveMonitor. This happens when the STS session timeout coincides with the check from the EC2SlaveMonitor. EC2SlaveMonitor runs every 10 minutes; EC2ConnectionUpdater runs every minute. When EC2SlaveMonitor checks connectivity with the existing agents after the STS session has expired but has not yet been restored, any existing agent will be terminated.
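      To give a rough sense of how often the two timers can collide, here is a toy model in Java. It assumes the credentials are usable for 900 seconds and then unusable for about 90 seconds in a repeating cycle, and that the monitor fires every 600 seconds starting at the same instant. The real phase offsets and the exact start of the outage window are not known from the logs, so the class name OutageOverlap and all numbers other than the 900-second session and 10-minute interval are assumptions.

```java
public class OutageOverlap {
    /**
     * Counts monitor ticks that land inside a credential outage window.
     * Model: credentials are valid for sessionSec, then unusable for outageSec,
     * repeating; the monitor fires at t = 0, periodSec, 2*periodSec, ...
     * for as long as t < horizonSec.
     */
    static int countTicksInOutage(long sessionSec, long outageSec, long periodSec, long horizonSec) {
        long cycle = sessionSec + outageSec;
        int hits = 0;
        for (long t = 0; t < horizonSec; t += periodSec) {
            if (t % cycle >= sessionSec) {
                hits++; // this check would run while the session is expired
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        long session = 900; // DurationSeconds from the AssumeRole request
        long outage = 90;   // assumed: approximate span of the 17 failed retries
        long period = 600;  // EC2SlaveMonitor interval (10 minutes)
        long day = 24 * 60 * 60;
        long ticks = (day + period - 1) / period;
        System.out.println(countTicksInOutage(session, outage, period, day)
                + " of " + ticks + " checks fall in an outage window");
    }
}
```

      Under these assumptions roughly one monitor check in eleven lands in an outage window, which is consistent with agents being terminated only occasionally.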

      The job console when this occurs looks like the following:

          17:01:22 Cannot contact EC2 (CLOUD) - NODE (...): hudson.remoting.ChannelClosedException: Channel "hudson.remoting.Channel@50725879:EC2 (CLOUD) - NODE (...)": Remote call on EC2 (CLOUD) - NODE (...) failed. The channel is closing down or has closed down
          17:06:22 Agent EC2 (CLOUD) - NODE (...) was deleted; cancelling node body
          17:06:22 Could not connect to EC2 (CLOUD) - NODE (...) to send interrupt signal to process
          17:06:22 EC2 (CLOUD) - NODE (...) was marked offline: Node is being removed
      

       

          Activity

          suthsc Scott Sutherland added a comment -

          I have created PR #493 as a potential solution to the expired STS session. Since the EC2ConnectionUpdater reconnects to AWS every minute, the PR avoids calling isAlive() on EC2AbstractSlaves that have at least one busy executor.
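          The idea can be illustrated with a small self-contained sketch. This is not the code from the PR: the Agent interface, the selectForRemoval method, and the class name BusyAwareMonitor are hypothetical stand-ins, used only to show the effect of skipping the liveness probe for agents that are running work.

```java
import java.util.ArrayList;
import java.util.List;

public class BusyAwareMonitor {
    /** Hypothetical stand-in for an EC2 agent; not the plugin's EC2AbstractSlave API. */
    interface Agent {
        String name();
        int busyExecutors();
        boolean isAlive(); // per the report, this check fails while the STS session is expired
    }

    /** Returns the names of agents a monitor pass would remove, skipping busy ones. */
    static List<String> selectForRemoval(List<Agent> agents) {
        List<String> doomed = new ArrayList<>();
        for (Agent agent : agents) {
            if (agent.busyExecutors() > 0) {
                continue; // busy: do not probe, so an expired session cannot kill a running job
            }
            if (!agent.isAlive()) {
                doomed.add(agent.name());
            }
        }
        return doomed;
    }

    /** Builds a fixed-state agent for the demonstration. */
    static Agent agent(String name, int busy, boolean alive) {
        return new Agent() {
            public String name() { return name; }
            public int busyExecutors() { return busy; }
            public boolean isAlive() { return alive; }
        };
    }

    public static void main(String[] args) {
        // Simulate an expired session: every isAlive() probe fails.
        List<Agent> agents = List.of(
                agent("busy-agent", 1, false),
                agent("idle-agent", 0, false));
        System.out.println(selectForRemoval(agents)); // prints [idle-agent]
    }
}
```

          With the probe skipped for busy agents, a failed check during the expired window can only affect idle agents, which the next successful pass can re-evaluate once the connection is restored.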

            People

            Assignee:
            thoulen FABRIZIO MANFREDI
            Reporter:
            suthsc Scott Sutherland