• Type: Bug
    • Resolution: Unresolved
    • Priority: Minor
    • Component: amazon-ecs-plugin

      Seeing ChannelClosed or "Workspace not available" errors seemingly at random when using ECS containers.

       

      Considerations:

      • Cluster load does not appear to be a factor (a spot-check sketch follows this list)
      • Jenkins server and cluster exist in the same VPC
      • pingIntervalSeconds using default 300 seconds
      • pingTimeoutSeconds using default 240 seconds
      • Unable to reproduce the issue reliably
      • No other network issues seen at similar times with nodes in EC2
      • Length of build/task does not appear to be a factor
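      Since the failures can't be reproduced on demand, one way to back up the cluster-load observation is to pull the cluster's reservation metrics around a failure timestamp after the fact. This is only a diagnostic sketch, assuming boto3 with standard AWS credentials and a hypothetical cluster name; the timestamp is taken from the log excerpt below, with the timezone assumed.

{code:python}
"""Spot-check ECS cluster load around a Jenkins channel-closed timestamp.

Diagnostic sketch only: the cluster name is hypothetical and the failure
time/timezone are placeholders taken from the log excerpt below.
"""
from datetime import datetime, timedelta, timezone

import boto3

CLUSTER = "jenkins-agents"  # hypothetical cluster name
FAILURE_AT = datetime(2018, 6, 11, 23, 34, 32, tzinfo=timezone.utc)

cloudwatch = boto3.client("cloudwatch")


def reservation(metric_name):
    """Average cluster reservation in a 30-minute window around the failure."""
    stats = cloudwatch.get_metric_statistics(
        Namespace="AWS/ECS",
        MetricName=metric_name,
        Dimensions=[{"Name": "ClusterName", "Value": CLUSTER}],
        StartTime=FAILURE_AT - timedelta(minutes=15),
        EndTime=FAILURE_AT + timedelta(minutes=15),
        Period=300,
        Statistics=["Average"],
    )
    points = sorted(stats["Datapoints"], key=lambda p: p["Timestamp"])
    return [(p["Timestamp"], round(p["Average"], 1)) for p in points]


for name in ("CPUReservation", "MemoryReservation"):
    print(name, reservation(name))
{code}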

       

      I know this is a horribly vague bug to resolve, so I'm quite happy to try any reconfiguration or similar suggestions.

       

      {code}

      FATAL: java.nio.channels.ClosedChannelException
      java.nio.channels.ClosedChannelException
      Also: hudson.remoting.Channel$CallSiteStackTrace: Remote call to JNLP4-connect connection from ip-10-1-0-187.ec2.internal/10.1.0.187:43840
      at hudson.remoting.Channel.attachCallSiteStackTrace(Channel.java:1741)
      at hudson.remoting.Request.call(Request.java:202)
      at hudson.remoting.Channel.call(Channel.java:954)
      at hudson.Launcher$RemoteLauncher.kill(Launcher.java:1078)
      at hudson.model.AbstractBuild$AbstractBuildExecution.run(AbstractBuild.java:510)
      at com.tikal.jenkins.plugins.multijob.MultiJobBuild$MultiJobRunnerImpl.run(MultiJobBuild.java:148)
      at hudson.model.Run.execute(Run.java:1794)
      at com.tikal.jenkins.plugins.multijob.MultiJobBuild.run(MultiJobBuild.java:76)
      at hudson.model.ResourceController.execute(ResourceController.java:97)
      at hudson.model.Executor.run(Executor.java:429)
      Caused: hudson.remoting.RequestAbortedException
      at hudson.remoting.Request.abort(Request.java:340)
      at hudson.remoting.Channel.terminate(Channel.java:1038)
      at org.jenkinsci.remoting.protocol.impl.ChannelApplicationLayer.onReadClosed(ChannelApplicationLayer.java:209)
      at org.jenkinsci.remoting.protocol.ApplicationLayer.onRecvClosed(ApplicationLayer.java:222)
      at org.jenkinsci.remoting.protocol.ProtocolStack$Ptr.onRecvClosed(ProtocolStack.java:832)
      at org.jenkinsci.remoting.protocol.FilterLayer.onRecvClosed(FilterLayer.java:287)
      at org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer.onRecvClosed(SSLEngineFilterLayer.java:172)
      at org.jenkinsci.remoting.protocol.ProtocolStack$Ptr.onRecvClosed(ProtocolStack.java:832)
      at org.jenkinsci.remoting.protocol.NetworkLayer.onRecvClosed(NetworkLayer.java:154)
      at org.jenkinsci.remoting.protocol.impl.NIONetworkLayer.ready(NIONetworkLayer.java:142)
      at org.jenkinsci.remoting.protocol.IOHub$OnReady.run(IOHub.java:789)
      at jenkins.util.ContextResettingExecutorService$1.run(ContextResettingExecutorService.java:28)
      at jenkins.security.ImpersonatingExecutorService$1.run(ImpersonatingExecutorService.java:59)
      at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
      at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
      at java.lang.Thread.run(Thread.java:748)
      {code}

      {code}
      INFO: No logs found
      Jun 11, 2018 11:34:32 PM jenkins.slaves.DefaultJnlpSlaveReceiver channelClosed
      WARNING: IOHub#1: Worker[channel:java.nio.channels.SocketChannel[connected local=/10.1.0.85:7300 remote=ip-10-1-0-187.ec2.internal/10.1.0.187:43840]] / Computer.threadPoolForRemoting 1796 for ecs-rdkcmf-12377045296774 terminated
      java.nio.channels.ClosedChannelException
      at org.jenkinsci.remoting.protocol.NetworkLayer.onRecvClosed(NetworkLayer.java:154)
      at org.jenkinsci.remoting.protocol.impl.NIONetworkLayer.ready(NIONetworkLayer.java:142)
      at org.jenkinsci.remoting.protocol.IOHub$OnReady.run(IOHub.java:789)
      at jenkins.util.ContextResettingExecutorService$1.run(ContextResettingExecutorService.java:28)
      at jenkins.security.ImpersonatingExecutorService$1.run(ImpersonatingExecutorService.java:59)
      at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
      at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
      at java.lang.Thread.run(Thread.java:748)

      Jun 11, 2018 11:34:33 PM jenkins.plugins.slack.StandardSlackService publish
      {code}

          [JENKINS-51878] Random Remoting issues - ChannelClosed

          Eva Connors added a comment - edited

          This happens to us too sometimes – our ECS cluster will downscale while a job is still actively running on the instance, and instead of waiting and draining, it just kills the job. We've been trying to investigate ways to more intelligently downscale, but so far no luck. Does your cluster autoscale based on CPU usage?

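          One possible shape for the "more intelligent downscaling" described above, assuming an EC2-backed cluster (not Fargate) and hypothetical cluster/Auto Scaling group names: instances that still have running ECS tasks get scale-in protection, idle ones have it released, so the group can shrink without reclaiming an instance mid-build.

{code:python}
"""Scale-in protection for ECS container instances that still run tasks.

Sketch only: the cluster and Auto Scaling group names are hypothetical, and
this assumes an EC2-backed cluster (it does not apply to Fargate).
"""
import boto3

CLUSTER = "jenkins-agents"    # hypothetical cluster name
ASG_NAME = "jenkins-ecs-asg"  # hypothetical Auto Scaling group name

ecs = boto3.client("ecs")
autoscaling = boto3.client("autoscaling")

# Map every registered container instance to its EC2 instance id.
arns = ecs.list_container_instances(cluster=CLUSTER)["containerInstanceArns"]
busy, idle = [], []
if arns:
    detail = ecs.describe_container_instances(cluster=CLUSTER, containerInstances=arns)
    for ci in detail["containerInstances"]:
        # A Jenkins agent still attached to a build shows up as a running task.
        (busy if ci["runningTasksCount"] > 0 else idle).append(ci["ec2InstanceId"])

# Protect busy instances from scale-in; release protection on idle ones.
if busy:
    autoscaling.set_instance_protection(
        InstanceIds=busy, AutoScalingGroupName=ASG_NAME, ProtectedFromScaleIn=True)
if idle:
    autoscaling.set_instance_protection(
        InstanceIds=idle, AutoScalingGroupName=ASG_NAME, ProtectedFromScaleIn=False)
{code}

          Run on a short schedule (from cron or a Lambda), this is just an outline of the approach; it isn't something the plugin does for you.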

          David Hayes added a comment -

          emconnors, no, it doesn't. We downscale based on memory reservation being 0 (we tend to have overnight periods where no jobs are running). I've ruled out termination of the EC2 host instance as a cause in this case, though it would certainly lead to the same effect as you've seen.

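          For anyone else trying to rule host termination in or out, a minimal sketch that checks CloudTrail for TerminateInstances calls around a failure window, assuming CloudTrail is enabled in the region (the timestamp is a placeholder from the log excerpt above):

{code:python}
"""Check CloudTrail for instance terminations around a failure window.

Sketch only: the timestamp/timezone are placeholders from the log excerpt
above; assumes CloudTrail is enabled in the region.
"""
from datetime import datetime, timedelta, timezone

import boto3

FAILURE_AT = datetime(2018, 6, 11, 23, 34, 32, tzinfo=timezone.utc)

cloudtrail = boto3.client("cloudtrail")

events = cloudtrail.lookup_events(
    LookupAttributes=[{"AttributeKey": "EventName",
                       "AttributeValue": "TerminateInstances"}],
    StartTime=FAILURE_AT - timedelta(minutes=30),
    EndTime=FAILURE_AT + timedelta(minutes=5),
)["Events"]

if not events:
    print("No TerminateInstances calls in the window; host termination is unlikely.")
for event in events:
    print(event["EventTime"], event.get("Username", "?"), event["EventName"])
{code}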

          Chris St. Pierre added a comment -

          I'm seeing this as well, although on Fargate.

            Assignee: Jan Roehrich (roehrijn2)
            Reporter: David Hayes (evidex)