Type: Bug
Resolution: Unresolved
Priority: Major
Labels: None
OS: Amazon Linux 2
JDK: Corretto-11.0.6.10.1
Jenkins: 2.222.4
Plugins:
Amazon EC2 plugin - org.jenkins-ci.plugins:ec2:1.51
Jackson-annotations - com.fasterxml.jackson.core:jackson-annotations:2.11.1
mbassador - net.engio:mbassador:1.3.0
SnakeYAML - org.yaml:snakeyaml:1.24
smbj - com.hierynomus:smbj:0.10.0
Multiline Secrets UI - io.jenkins.temp.jelly:multiline-secrets-ui:1.0
jackson-databind - com.fasterxml.jackson.core:jackson-databind:2.11.1
Jackson-dataformat-YAML - com.fasterxml.jackson.dataformat:jackson-dataformat-yaml:2.10.2
asn-one - com.hierynomus:asn-one:0.4.0
EC2 agents are launched on demand using the ec2-plugin and accessed over SSH. Credentials for creating new EC2 instances, performing operations on the instances, etc. are provided through an 'AssumeRole' call to STS. The STS session timeout is set to 15 minutes.
Agents are created successfully, some jobs run to completion, and idle agents are terminated after 30 minutes, as configured in the EC2 plugin.
I created a log recorder to collect entries from the following loggers:
- hudson.plugins.ec2
- com.amazonaws.request
- com.amazonaws.http.response.AwsResponseHandlerAdapter
The output of this recorder contains entries from the EC2ConnectionUpdater similar to the following:
Aug 27, 2020 4:29:34 PM FINER hudson.plugins.ec2.EC2Cloud$EC2ConnectionUpdater doRun Checking EC2 Connection on: CLOUD
Aug 27, 2020 4:29:34 PM FINE com.amazonaws.http.AmazonHttpClient$RequestExecutor executeOneRequest Sending Request: POST https://ec2.us-east-1.amazonaws.com/ / Parameters: ({"Action":["DescribeInstances"],"Version":["2016-11-15"]}Headers: (amz-sdk-invocation-id: ..., User-Agent: aws-sdk-java/1.11.821 Linux/4.14.121-109.96.amzn2.x86_64 OpenJDK_64-Bit_Server_VM/11.0.6+10-LTS java/11.0.6 groovy/2.4.12 vendor/Amazon.com_Inc., )
Aug 27, 2020 4:29:36 PM FINEST com.amazonaws.http.StaxResponseHandler handle Parsing service response XML
Aug 27, 2020 4:29:41 PM FINEST com.amazonaws.http.StaxResponseHandler handle Done parsing service response
Aug 27, 2020 4:29:41 PM FINE com.amazonaws.http.response.AwsResponseHandlerAdapter handle Received successful response: 200, AWS Request ID: ...
Every 15 minutes, we see messages similar to the following:
Aug 27, 2020 4:33:34 PM FINER hudson.plugins.ec2.EC2Cloud$EC2ConnectionUpdater doRun Checking EC2 Connection on: CLOUD
Aug 27, 2020 4:33:34 PM FINE com.amazonaws.http.AmazonHttpClient$RequestExecutor executeOneRequest Sending Request: POST https://ec2.us-east-1.amazonaws.com/ / Parameters: ({"Action":["DescribeInstances"],"Version":["2016-11-15"]}Headers: (amz-sdk-invocation-id: ..., User-Agent: aws-sdk-java/1.11.821 Linux/4.14.121-109.96.amzn2.x86_64 OpenJDK_64-Bit_Server_VM/11.0.6+10-LTS java/11.0.6 groovy/2.4.12 vendor/Amazon.com_Inc., )
Aug 27, 2020 4:33:34 PM FINE com.amazonaws.http.AmazonHttpClient$RequestExecutor handleErrorResponse Received error response: com.amazonaws.services.ec2.model.AmazonEC2Exception: Request has expired. (Service: AmazonEC2; Status Code: 400; Error Code: RequestExpired; Request ID: ...; Proxy: null)
Aug 27, 2020 4:33:34 PM FINE com.amazonaws.http.AmazonHttpClient$RequestExecutor executeOneRequest Retrying Request: POST https://ec2.us-east-1.amazonaws.com/ / Parameters: ({"Action":["DescribeInstances"],"Version":["2016-11-15"]}Headers: (amz-sdk-invocation-id: ..., User-Agent: aws-sdk-java/1.11.821 Linux/4.14.121-109.96.amzn2.x86_64 OpenJDK_64-Bit_Server_VM/11.0.6+10-LTS java/11.0.6 groovy/2.4.12 vendor/Amazon.com_Inc., )
Aug 27, 2020 4:33:34 PM FINE com.amazonaws.http.AmazonHttpClient$RequestExecutor handleErrorResponse Received error response: com.amazonaws.services.ec2.model.AmazonEC2Exception: Request has expired. (Service: AmazonEC2; Status Code: 400; Error Code: RequestExpired; Request ID: ...; Proxy: null)
Each time, there are exactly 17 HTTP 400 responses, followed by:
Aug 27, 2020 4:34:51 PM FINER hudson.plugins.ec2.EC2Cloud$EC2ConnectionUpdater doRun Reconnecting to EC2 on: aws-ec2-nprod
Aug 27, 2020 4:34:52 PM FINE com.amazonaws.http.AmazonHttpClient$RequestExecutor executeOneRequest Sending Request: POST https://sts.amazonaws.com / Parameters: ({"Action":["AssumeRole"],"Version":["2011-06-15"],"RoleArn":["..."],"RoleSessionName":["Jenkins"],"DurationSeconds":["900"]}Headers: (amz-sdk-invocation-id: ..., User-Agent: aws-sdk-java/1.11.821 Linux/4.14.121-109.96.amzn2.x86_64 OpenJDK_64-Bit_Server_VM/11.0.6+10-LTS java/11.0.6 groovy/2.4.12 vendor/Amazon.com_Inc., )
Aug 27, 2020 4:34:52 PM FINEST com.amazonaws.http.StaxResponseHandler handle Parsing service response XML
Aug 27, 2020 4:34:53 PM FINEST com.amazonaws.http.StaxResponseHandler handle Done parsing service response
Aug 27, 2020 4:34:53 PM FINE com.amazonaws.http.response.AwsResponseHandlerAdapter handle Received successful response: 200, AWS Request ID: ...
After the 'AssumeRole' call succeeds, we have a 15-minute window without issues, followed by another 17 HTTP 400 errors. The 17 retries span ~90 seconds.
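To check the error count and retry span against a longer log recorder export, a small script along these lines may help. It is only a sketch: it assumes one log entry per line in the layout shown above, and the function name `summarize_expiry_bursts` and the abbreviated sample lines are illustrative, not part of the plugin or the AWS SDK.

```python
import re
from datetime import datetime

# Layout of the log recorder entries shown in this report:
# "<Mon DD, YYYY H:MM:SS AM/PM> <LEVEL> <rest of the entry>"
LINE_RE = re.compile(r"^(\w{3} \d{1,2}, \d{4} \d{1,2}:\d{2}:\d{2} [AP]M) (\S+) (.*)$")
TS_FORMAT = "%b %d, %Y %I:%M:%S %p"


def summarize_expiry_bursts(lines):
    """Group RequestExpired errors into bursts that end at a reconnect entry."""
    bursts = []
    first_error = None
    error_count = 0
    for line in lines:
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # skip anything that is not a timestamped entry
        timestamp = datetime.strptime(match.group(1), TS_FORMAT)
        message = match.group(3)
        if "Error Code: RequestExpired" in message:
            error_count += 1
            if first_error is None:
                first_error = timestamp
        elif "Reconnecting to EC2" in message and error_count:
            bursts.append({
                "first_error": first_error,
                "reconnect": timestamp,
                "error_count": error_count,
                "span_seconds": (timestamp - first_error).total_seconds(),
            })
            first_error, error_count = None, 0
    return bursts


# Abbreviated sample (hypothetical shortening of the entries above).
SAMPLE = [
    "Aug 27, 2020 4:33:34 PM FINE RequestExecutor handleErrorResponse (Status Code: 400; Error Code: RequestExpired)",
    "Aug 27, 2020 4:33:34 PM FINE RequestExecutor handleErrorResponse (Status Code: 400; Error Code: RequestExpired)",
    "Aug 27, 2020 4:34:51 PM FINER EC2ConnectionUpdater doRun Reconnecting to EC2 on: CLOUD",
]

for burst in summarize_expiry_bursts(SAMPLE):
    print(burst["error_count"], burst["span_seconds"])
```

Run against the full recorder output, each burst should report 17 errors if the pattern described here holds.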
When we have jobs running on nodes created by the ec2-plugin, a node will occasionally be terminated by the EC2SlaveMonitor. This happens when the STS session timeout coincides with the check from the EC2SlaveMonitor. EC2SlaveMonitor runs every 10 minutes; EC2ConnectionUpdater runs every minute. When EC2SlaveMonitor checks connectivity with the existing agents after the STS session has expired but before it has been renewed, any existing agent will be terminated.
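The timing overlap can be illustrated with a rough model. The sketch below assumes a simplified schedule that is not taken from the plugin source: each 900-second session is followed by a fixed 90-second window of failed retries, and the monitor fires exactly every 600 seconds starting at a hypothetical offset. All names are illustrative.

```python
SESSION_SECONDS = 900   # DurationSeconds in the AssumeRole request
OUTAGE_SECONDS = 90     # approximate span of the 17 failed retries (assumed fixed)
MONITOR_PERIOD = 600    # EC2SlaveMonitor interval of 10 minutes


def outage_windows(horizon):
    """Return (start, end) pairs during which the credentials are expired."""
    windows = []
    start = SESSION_SECONDS
    while start < horizon:
        windows.append((start, start + OUTAGE_SECONDS))
        # The next session starts once the reconnect succeeds.
        start += OUTAGE_SECONDS + SESSION_SECONDS
    return windows


def risky_monitor_ticks(horizon, monitor_offset=0):
    """Return monitor check times (seconds) that land inside an outage window."""
    windows = outage_windows(horizon)
    ticks = range(monitor_offset, horizon, MONITOR_PERIOD)
    return [t for t in ticks if any(s <= t < e for s, e in windows)]


one_day = 24 * 3600
hits = risky_monitor_ticks(one_day)
print(len(hits), "of", len(range(0, one_day, MONITOR_PERIOD)), "checks hit an outage window")
```

Under these assumptions only a small share of monitor checks coincide with an expired session, which would be consistent with agents being terminated occasionally rather than on every cycle.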
The job console when this occurs looks like the following:
17:01:22 Cannot contact EC2 (CLOUD) - NODE (...): hudson.remoting.ChannelClosedException: Channel "hudson.remoting.Channel@50725879:EC2 (CLOUD) - NODE (...)": Remote call on EC2 (CLOUD) - NODE (...) failed. The channel is closing down or has closed down
17:06:22 Agent EC2 (CLOUD) - NODE (...) was deleted; cancelling node body
17:06:22 Could not connect to EC2 (CLOUD) - NODE (...) to send interrupt signal to process
17:06:22 EC2 (CLOUD) - NODE (...) was marked offline: Node is being removed