Bug
Resolution: Not A Defect
Major
We run Jenkins with the EC2 plugin, using both Windows and Linux instances in AWS. From time to time, instances are terminated while a job is still building.
Observations:
- Happens only on Windows nodes. Linux works perfectly.
- Happens only outside working hours, as defined in the section [Only apply minimum number of instances during specific time range].
Configuration details
- Jenkins: 2.303.3, EC2 Plugin: 1.66
- Auto scale, From: 06:00 To: 21:00
- The "Minimum number of instances" is 0
- The "Minimum number of spare instances" is 6
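To make the reported configuration concrete, here is a minimal sketch of how the time range interacts with the spare-instance setting. It is not the EC2 plugin's implementation. The function names are hypothetical, the boundary handling (start inclusive, end exclusive) is an assumption, and so is the idea that job-log timestamps are in the same time zone as the configured range.

```python
from datetime import time

# Values copied from the configuration listed above.
RANGE_START = time(6, 0)
RANGE_END = time(21, 0)
MIN_SPARE_INSTANCES = 6


def in_minimum_range(t, start=RANGE_START, end=RANGE_END):
    """Return True if t lies in [start, end).

    Boundary handling is an assumption, not taken from the plugin.
    """
    return start <= t < end


def spare_floor(t):
    """Spare-instance floor expected at time t under the assumptions above."""
    return MIN_SPARE_INSTANCES if in_minimum_range(t) else 0


if __name__ == "__main__":
    # 05:00:44 is when the job log below reports SIGTERM.
    print(spare_floor(time(5, 0, 44)))
    print(spare_floor(time(12, 0)))
```

Under these assumptions, the failure time in the job log falls outside the range, which matches the observation that the problem only appears outside working hours.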
Jenkins Job Log:
2022-02-03 05:00:42 5: [ RUN ] ****
2022-02-03 05:00:42 5: [ RUN ] ****
2022-02-03 05:00:42 5: [ OK ] ****
2022-02-03 05:00:44 5: [ RUN ] ****
2022-02-03 05:00:44 Terminating on signal SIGTERM(15)
2022-02-03 05:00:44 FATAL: command execution failed
2022-02-03 05:00:48 java.io.EOFException
2022-02-03 05:00:48 at java.io.ObjectInputStream$PeekInputStream.readFully(ObjectInputStream.java:2798).......
2022-02-03 05:00:48 Caused: java.io.IOException: Backing channel 'EC2 (AWS) - eu-west1b-windows (i-0a6a1e55947b1e6fa)' is disconnected.
Jenkins Server log:
--
2022-02-03 05:17:04.095+0000 [id=40] INFO hudson.plugins.ec2.SlaveTemplate#logProvisionInfo: SlaveTemplate{description='eu-west1b-windows', labels='aws_win'}. checkInstance: i-03e553095dcff6c00.. false - found existing corresponding Jenkins agent: i-03e553095dcff6c00
--
2022-02-03 05:17:15.928+0000 [id=46] INFO hudson.plugins.ec2.SlaveTemplate#logProvisionInfo: SlaveTemplate{description='eu-west1b-windows', labels='aws_win'}. checkInstance: i-03e553095dcff6c00.. false - found existing corresponding Jenkins agent: i-03e553095dcff6c00
--
2022-02-03 05:17:38.354+0000 [id=46] INFO hudson.plugins.ec2.SlaveTemplate#logProvisionInfo: SlaveTemplate{description='eu-west1b-windows', labels='aws_win'}. checkInstance: i-03e553095dcff6c00.. false - found existing corresponding Jenkins agent: i-03e553095dcff6c00
--
2022-02-03 05:27:45.532+0000 [id=44] INFO hudson.plugins.ec2.SlaveTemplate#logProvisionInfo: SlaveTemplate{description='eu-west1b-windows', labels='aws_win'}. checkInstance: i-03e553095dcff6c00.. false - found existing corresponding Jenkins agent: i-03e553095dcff6c00
--
2022-02-03 06:03:12.419+0000 [id=6376824] INFO h.p.ec2.EC2RetentionStrategy#internalCheck: Idle timeout of EC2 (AWS) - eu-west1b-windows (i-03e553095dcff6c00) after 60 idle minutes, instance statusRUNNING
2022-02-03 06:03:12.419+0000 [id=6376824] INFO h.plugins.ec2.EC2AbstractSlave#idleTimeout: EC2 instance idle time expired: i-03e553095dcff6c00
2022-02-03 06:03:12.724+0000 [id=6377196] INFO h.plugins.ec2.EC2OndemandSlave#lambda$terminate$0: Terminated EC2 instance (terminated): i-03e553095dcff6c00
--
2022-02-03 06:04:13.537+0000 [id=6377017] INFO h.p.ec2.EC2RetentionStrategy#internalCheck: Idle timeout of EC2 (AWS) - eu-west1b-windows (i-03e553095dcff6c00) after 61 idle minutes, instance statusSHUTTING_DOWN
2022-02-03 06:04:13.537+0000 [id=6377017] INFO h.plugins.ec2.EC2AbstractSlave#idleTimeout: EC2 instance idle time expired: i-03e553095dcff6c00
--
2022-02-03 06:06:39.956+0000 [id=6377196] INFO h.plugins.ec2.EC2OndemandSlave#lambda$terminate$0: Removed EC2 instance from jenkins controller: i-03e553095dcff6c00
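When gathering more data on which nodes the plugin considered idle, a small script can pull every "Idle timeout" entry out of the server log. For each entry it works out when the plugin apparently believed the node went idle. This is only a sketch: the regular expression is modelled on the entries quoted above, `idle_timeouts` is a hypothetical helper, and subtracting the reported idle minutes from the log timestamp is an approximation of the plugin's own bookkeeping.

```python
import re
from datetime import datetime, timedelta

# Pattern modelled on the EC2RetentionStrategy entries quoted in this report.
IDLE_RE = re.compile(
    r"(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3}[+-]\d{4}) \[id=\d+\]\s+INFO\s+"
    r"\S*EC2RetentionStrategy#internalCheck: Idle timeout of .*?"
    r"\((i-[0-9a-f]+)\) after (\d+) idle minutes"
)


def idle_timeouts(log_text):
    """Return (instance_id, logged_at, approx_idle_since) for each idle-timeout entry."""
    events = []
    for m in IDLE_RE.finditer(log_text):
        logged_at = datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S.%f%z")
        idle_minutes = int(m.group(3))
        # Approximation: the plugin reports whole minutes only.
        idle_since = logged_at - timedelta(minutes=idle_minutes)
        events.append((m.group(2), logged_at, idle_since))
    return events


if __name__ == "__main__":
    sample = (
        "2022-02-03 06:03:12.419+0000 [id=6376824] INFO "
        "h.p.ec2.EC2RetentionStrategy#internalCheck: Idle timeout of EC2 (AWS) - "
        "eu-west1b-windows (i-03e553095dcff6c00) after 60 idle minutes, "
        "instance statusRUNNING"
    )
    for instance_id, logged_at, idle_since in idle_timeouts(sample):
        print(instance_id, logged_at.isoformat(), idle_since.isoformat())
```

Comparing the approximate idle-since time for each instance with the start and end times of builds on that same node should show whether the plugin counted a node as idle while a job was running on it.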
One observation: this problem usually happens only during the night, when we run automation jobs on a large number of nodes.
We launch ~200 small instances within 20 minutes and, after 3 hours of work, release them over a period of 60 minutes.
The master node does not seem busy according to CPU / RAM usage, and there are no network or disk issues either.
All nodes are created properly and work well. Only a random Windows node gets killed by the EC2 plugin. According to the logs, the node is killed because "idle time expired", but that is not the case, because a job is running on this node.
Is there anything else that we can do to:
a) work around this problem
b) provide more information that can help solve the issue.