Type: New Feature
Resolution: Unresolved
Priority: Minor
Labels: None
I'm not sure whether this is ECS-specific or a more general JNLP issue.
One of the main reasons to use Docker for builds is to isolate the VM/hardware performing the build from the build system. However, if an ECS host is removed from the cluster (scaling in, a spot instance being outbid, hosts being replaced or upgraded), any slave jobs running on that host are never restarted on a different host.
This manifests as a variety of errors, mostly related to being unable to reconnect to the specific ECS slave that was running the build:
Example 1:
Cannot contact ecs-mycluster-3ec9178c860: java.io.IOException: remote file operation failed: /home/jenkins/workspace/nd_my-build-3FBTDD6B4SFOB7TWNY6VBPSYYH6DRQTTK6UHDJWRBBMUEXUBV2BQ at hudson.remoting.Channel@66dfe171:Channel to /172.31.4.13: hudson.remoting.ChannelClosedException: channel is already closed
Example 2:
Cannot contact ecs-majstg-3dc6b7ce91d: hudson.remoting.RequestAbortedException: java.io.IOException: Unexpected EOF while receiving the data from the channel. FIFO buffer has been already closed
Is this an ECS plugin issue, or a more general issue/limitation of Jenkins slaves? How are people using this plugin with spot instances, or more generally coordinating ECS hosts leaving and entering the cluster?
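For what it's worth, one workaround I've seen for the spot-instance case (a sketch, not a fix for the plugin itself) is a small daemon on each container instance that polls the EC2 spot interruption notice and, when one appears, sets the instance to DRAINING via the real ECS API `update_container_instances_state`, so ECS stops placing new Jenkins slave tasks there. Note this does not re-queue a build already running on the host; it only limits the blast radius. The cluster name and container instance ARN here are caller-supplied placeholders:

```python
import json
import urllib.request

# AWS posts a spot interruption notice here roughly two minutes before
# reclaiming the instance; this is the standard EC2 instance metadata path.
METADATA_URL = "http://169.254.169.254/latest/meta-data/spot/instance-action"

def interruption_pending(url: str = METADATA_URL) -> bool:
    """Return True if EC2 has posted a spot interruption notice."""
    try:
        with urllib.request.urlopen(url, timeout=2) as resp:
            json.load(resp)  # notice body, e.g. {"action": "terminate", ...}
            return True
    except Exception:  # 404 means no notice yet; also covers metadata being unreachable
        return False

def drain_instance(cluster: str, container_instance_arn: str) -> None:
    """Mark this container instance DRAINING so ECS stops placing new tasks
    on it while running ones wind down."""
    import boto3  # assumed to be installed on the host
    ecs = boto3.client("ecs")
    ecs.update_container_instances_state(
        cluster=cluster,
        containerInstances=[container_instance_arn],
        status="DRAINING",
    )
```

A daemon would poll `interruption_pending()` every few seconds and call `drain_instance()` once it returns True. That still leaves the core question above open: whether the plugin (or Jenkins core) could detect the dead channel and reschedule the build on another host.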