Type: Bug
Resolution: Unresolved
Priority: Minor
Labels: None
Seeing ChannelClosed or "Workspace not available" errors seemingly at random when using ECS containers.
Considerations:
- Cluster load does not appear to be a factor
- Jenkins server and cluster exist in the same VPC
- pingIntervalSeconds is at the default of 300 seconds
- pingTimeoutSeconds is at the default of 240 seconds
- Unable to reproduce the issue reliably
- No other network issues seen at similar times with nodes in EC2
- Length of build/task does not appear to be a factor
I know this is a horribly vague bug to resolve, so I'm quite happy to try any reconfiguration or similar suggested.
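For reference, the two ping settings above are the standard hudson.slaves.ChannelPinger system properties, so they could be tuned on the controller JVM while experimenting. A minimal sketch (e.g. via JAVA_OPTS; the values are purely illustrative, not a recommendation):
{code}
# Illustrative only: make remoting ping failures surface faster while testing.
# Set on the Jenkins controller JVM (e.g. JAVA_OPTS in the service/container config).
JAVA_OPTS="$JAVA_OPTS \
  -Dhudson.slaves.ChannelPinger.pingIntervalSeconds=120 \
  -Dhudson.slaves.ChannelPinger.pingTimeoutSeconds=60"
{code}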
{code}
FATAL: java.nio.channels.ClosedChannelException
java.nio.channels.ClosedChannelException
Also: hudson.remoting.Channel$CallSiteStackTrace: Remote call to JNLP4-connect connection from ip-10-1-0-187.ec2.internal/10.1.0.187:43840
at hudson.remoting.Channel.attachCallSiteStackTrace(Channel.java:1741)
at hudson.remoting.Request.call(Request.java:202)
at hudson.remoting.Channel.call(Channel.java:954)
at hudson.Launcher$RemoteLauncher.kill(Launcher.java:1078)
at hudson.model.AbstractBuild$AbstractBuildExecution.run(AbstractBuild.java:510)
at com.tikal.jenkins.plugins.multijob.MultiJobBuild$MultiJobRunnerImpl.run(MultiJobBuild.java:148)
at hudson.model.Run.execute(Run.java:1794)
at com.tikal.jenkins.plugins.multijob.MultiJobBuild.run(MultiJobBuild.java:76)
at hudson.model.ResourceController.execute(ResourceController.java:97)
at hudson.model.Executor.run(Executor.java:429)
Caused: hudson.remoting.RequestAbortedException
at hudson.remoting.Request.abort(Request.java:340)
at hudson.remoting.Channel.terminate(Channel.java:1038)
at org.jenkinsci.remoting.protocol.impl.ChannelApplicationLayer.onReadClosed(ChannelApplicationLayer.java:209)
at org.jenkinsci.remoting.protocol.ApplicationLayer.onRecvClosed(ApplicationLayer.java:222)
at org.jenkinsci.remoting.protocol.ProtocolStack$Ptr.onRecvClosed(ProtocolStack.java:832)
at org.jenkinsci.remoting.protocol.FilterLayer.onRecvClosed(FilterLayer.java:287)
at org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer.onRecvClosed(SSLEngineFilterLayer.java:172)
at org.jenkinsci.remoting.protocol.ProtocolStack$Ptr.onRecvClosed(ProtocolStack.java:832)
at org.jenkinsci.remoting.protocol.NetworkLayer.onRecvClosed(NetworkLayer.java:154)
at org.jenkinsci.remoting.protocol.impl.NIONetworkLayer.ready(NIONetworkLayer.java:142)
at org.jenkinsci.remoting.protocol.IOHub$OnReady.run(IOHub.java:789)
at jenkins.util.ContextResettingExecutorService$1.run(ContextResettingExecutorService.java:28)
at jenkins.security.ImpersonatingExecutorService$1.run(ImpersonatingExecutorService.java:59)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
{code}
{code}
INFO: No logs found
Jun 11, 2018 11:34:32 PM jenkins.slaves.DefaultJnlpSlaveReceiver channelClosed
WARNING: IOHub#1: Worker[channel:java.nio.channels.SocketChannel[connected local=/10.1.0.85:7300 remote=ip-10-1-0-187.ec2.internal/10.1.0.187:43840]] / Computer.threadPoolForRemoting 1796 for ecs-rdkcmf-12377045296774 terminated
java.nio.channels.ClosedChannelException
at org.jenkinsci.remoting.protocol.NetworkLayer.onRecvClosed(NetworkLayer.java:154)
at org.jenkinsci.remoting.protocol.impl.NIONetworkLayer.ready(NIONetworkLayer.java:142)
at org.jenkinsci.remoting.protocol.IOHub$OnReady.run(IOHub.java:789)
at jenkins.util.ContextResettingExecutorService$1.run(ContextResettingExecutorService.java:28)
at jenkins.security.ImpersonatingExecutorService$1.run(ImpersonatingExecutorService.java:59)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Jun 11, 2018 11:34:33 PM jenkins.plugins.slack.StandardSlackService publish
{code}
This happens to us too sometimes – our ECS cluster will downscale while a job is still actively running on the instance, and instead of waiting and draining, it just kills the job. We've been trying to investigate ways to more intelligently downscale, but so far no luck. Does your cluster autoscale based on CPU usage?
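For what it's worth, the approach we've been sketching (untested so far) is to mark the ECS container instance as DRAINING and put scale-in protection on its EC2 instance while builds are still running, then lift the protection once the agent is idle. Roughly, with the AWS CLI (the cluster, container-instance ARN, instance id and ASG name below are placeholders):
{code}
# Untested sketch; all names and IDs are placeholders.

# Stop ECS placing new tasks on the instance; tasks already running are left alone.
aws ecs update-container-instances-state \
  --cluster jenkins-agents \
  --container-instances arn:aws:ecs:region:account:container-instance/example \
  --status DRAINING

# Keep the Auto Scaling group from terminating this instance during scale-in.
aws autoscaling set-instance-protection \
  --instance-ids i-0123456789abcdef0 \
  --auto-scaling-group-name jenkins-agents-asg \
  --protected-from-scale-in
{code}
Draining avoids new placements without killing running tasks, and the scale-in protection is what actually stops the ASG from terminating the instance mid-build; the idea is to run set-instance-protection again with --no-protected-from-scale-in once the agent goes idle.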