Jenkins / JENKINS-59705

hudson.remoting.ChannelClosedException: Channel "unknown": Remote call on JNLP4-connect connection from IP/IP:58344 failed. The channel is closing down or has closed down

    • Type: Bug
    • Resolution: Not A Defect
    • Priority: Blocker
    • Component/s: kubernetes-plugin
    • Environment:
      Jenkins master version: 2.190.1
      Kubernetes Plugin: 1.19.3

      It also happened before the upgrade, with:
      Jenkins: 2.176.3
      K8S plugin: 1.19.0

      It happens frequently but not consistently, which makes it very hard to debug.

      This is my podTemplate:

      podTemplate(containers: [
          containerTemplate(
              name: 'build',
              image: 'my_builder:latest',
              command: 'cat',
              ttyEnabled: true,
              workingDir: '/mnt/jenkins'
          )
      ],
      volumes: [
          hostPathVolume(mountPath: '/var/run/docker.sock', hostPath: '/var/run/docker.sock'),
          hostPathVolume(mountPath: '/mnt/jenkins', hostPath: '/mnt/jenkins')
      ],
      yaml: """
      spec:
       containers:
         - name: build
           resources:
             requests:
               cpu: "10"
               memory: "10Gi" 
       securityContext:
         fsGroup: 995
      """
      )
      {
          node(POD_LABEL) {
              stage("Checkout") {
              }       
              // more stages
          }
      }
      

      This is the log from the pod:

      Inbound agent connected from IP/IP
      Waiting for agent to connect (0/100): my_branch
      Remoting version: 3.35
      This is a Unix agent
      Waiting for agent to connect (1/100): my_branch
      Agent successfully connected and online
      ERROR: Connection terminated
      java.nio.channels.ClosedChannelException
          at org.jenkinsci.remoting.protocol.NetworkLayer.onRecvClosed(NetworkLayer.java:154)
          at org.jenkinsci.remoting.protocol.impl.NIONetworkLayer.ready(NIONetworkLayer.java:142)
          at org.jenkinsci.remoting.protocol.IOHub$OnReady.run(IOHub.java:795)
          at jenkins.util.ContextResettingExecutorService$1.run(ContextResettingExecutorService.java:28)
          at jenkins.security.ImpersonatingExecutorService$1.run(ImpersonatingExecutorService.java:59)
          at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
          at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
          at java.lang.Thread.run(Thread.java:748)
      

      Logs from the Jenkins master ("cat /var/log/jenkins/jenkins.log"):

      2019-10-08 14:40:48.171+0000 [id=287] WARNING o.c.j.p.k.KubernetesLauncher#launch: Error in provisioning; agent=KubernetesSlave name: branch_name, template=PodTemplate{, name='pod_name', namespace='default', label='label_name', nodeUsageMode=EXCLUSIVE, volumes=[HostPathVolume [mountPath=/var/run/docker.sock, hostPath=/var/run/docker.sock], HostPathVolume [mountPath=/mnt/jenkins, hostPath=/mnt/jenkins]], containers=[ContainerTemplate{name='build', image='my_builder', workingDir='/mnt/jenkins', command='cat', ttyEnabled=true, envVars=[KeyValueEnvVar [getValue()=deploy/.dazelrc, getKey()=RC_FILE]]}], annotations=[org.csanchez.jenkins.plugins.kubernetes.PodAnnotation@aab9c821]}
      io.fabric8.kubernetes.client.KubernetesClientTimeoutException: Timed out waiting for [100000] milliseconds for [Pod] with name:[branch_name] in namespace [default].
          at org.csanchez.jenkins.plugins.kubernetes.AllContainersRunningPodWatcher.await(AllContainersRunningPodWatcher.java:130)
          at org.csanchez.jenkins.plugins.kubernetes.KubernetesLauncher.launch(KubernetesLauncher.java:134)
          at hudson.slaves.SlaveComputer$1.call(SlaveComputer.java:297)
          at jenkins.util.ContextResettingExecutorService$2.call(ContextResettingExecutorService.java:46)
          at jenkins.security.ImpersonatingExecutorService$2.call(ImpersonatingExecutorService.java:71)
          at java.util.concurrent.FutureTask.run(FutureTask.java:266)
          at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
          at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
          at java.lang.Thread.run(Thread.java:748)
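
      For reference, the [100000] milliseconds in the timeout above is the pod template's agent-connect wait (100 seconds). If pods legitimately need longer to pull images and start, that wait can be raised per template; a minimal sketch, assuming the kubernetes plugin's slaveConnectTimeout podTemplate parameter (value in seconds, illustrative only):

      // Hedged sketch: extend how long the plugin waits for the pod and agent to come up.
      // slaveConnectTimeout is assumed to be the relevant podTemplate parameter; 300 is an example value.
      podTemplate(
          slaveConnectTimeout: 300,
          containers: [
              containerTemplate(name: 'build', image: 'my_builder:latest', command: 'cat', ttyEnabled: true)
          ]
      ) {
          node(POD_LABEL) {
              // stages run here
          }
      }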
      


          Eddie Mashayev added a comment -

          I think I have found the issue. I'm using EKS with Spot instances to run my CI. When using Spot instances this issue happens frequently; when using on-demand instances it passes all the time.

          The reason is that Jenkins is getting the wrong instance IP to connect to the Jenkins master.

          Example:

          kubectl get pods -o wide --all-namespaces
          NAMESPACE       NAME                                                              READY   STATUS              RESTARTS   AGE     IP              NODE                            NOMINATED NODE
          default         some-job-5-c328g-kd-k2pfz   0/2     ContainerCreating   0          2s      <none>          ip-172-26-18-44.ec2.internal    <none>
          

          As you can see, the job runs on instance "ip-172-26-18-44.ec2.internal".

          The instance is in Ready state in Kubernetes:

          kubectl get nodes
          NAME                            STATUS                     ROLES    AGE     VERSION
          ip-172-26-18-44.ec2.internal    Ready                      <none>   11d     v1.12.10-eks-1246e3
          

           

          This is the log from the Jenkins console:

          [Pipeline] End of Pipeline
          Also:   hudson.remoting.Channel$CallSiteStackTrace: Remote call to JNLP4-connect connection from ip-172-26-30-207.ec2.internal/jenkins_master_IP:37312
                  at hudson.remoting.Channel.attachCallSiteStackTrace(Channel.java:1743)
                  at hudson.remoting.UserRequest$ExceptionResponse.retrieve(UserRequest.java:357)
                  at hudson.remoting.Channel.call(Channel.java:957)
                  at hudson.FilePath.act(FilePath.java:1072)
                  at hudson.FilePath.act(FilePath.java:1061)
                  at hudson.FilePath.mkdirs(FilePath.java:1246)
                  at hudson.plugins.git.GitSCM.createClient(GitSCM.java:811)
                  at hudson.plugins.git.GitSCM.checkout(GitSCM.java:1186)
                  at org.jenkinsci.plugins.workflow.steps.scm.SCMStep.checkout(SCMStep.java:124)
                  at org.jenkinsci.plugins.workflow.steps.scm.SCMStep$StepExecutionImpl.run(SCMStep.java:93)
                  at org.jenkinsci.plugins.workflow.steps.scm.SCMStep$StepExecutionImpl.run(SCMStep.java:80)
                  at org.jenkinsci.plugins.workflow.steps.SynchronousNonBlockingStepExecution.lambda$start$0(SynchronousNonBlockingStepExecution.java:47)
                  at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
                  at java.util.concurrent.FutureTask.run(FutureTask.java:266)
                  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
                  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
          java.nio.file.AccessDeniedException: /mnt/jenkins/workspace
              at sun.nio.fs.UnixException.translateToIOException(UnixException.java:84)
              at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
              at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107)
              at sun.nio.fs.UnixFileSystemProvider.createDirectory(UnixFileSystemProvider.java:384)
              at java.nio.file.Files.createDirectory(Files.java:674)
              at java.nio.file.Files.createAndCheckIsDirectory(Files.java:781)
              at java.nio.file.Files.createDirectories(Files.java:767)
              at hudson.FilePath.mkdirs(FilePath.java:3239)
              at hudson.FilePath.access$1300(FilePath.java:212)
              at hudson.FilePath$Mkdirs.invoke(FilePath.java:1254)
              at hudson.FilePath$Mkdirs.invoke(FilePath.java:1250)
              at hudson.FilePath$FileCallableWrapper.call(FilePath.java:3052)
              at hudson.remoting.UserRequest.perform(UserRequest.java:211)
              at hudson.remoting.UserRequest.perform(UserRequest.java:54)
              at hudson.remoting.Request$2.run(Request.java:369)
              at hudson.remoting.InterceptingExecutorService$1.call(InterceptingExecutorService.java:72)
              at java.util.concurrent.FutureTask.run(FutureTask.java:266)
              at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
              at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
              at hudson.remoting.Engine$1.lambda$newThread$0(Engine.java:97)
              at java.lang.Thread.run(Thread.java:748)
          

          It tries to connect to the Jenkins master with "ip-172-26-30-207.ec2.internal", and this instance doesn't exist.

           

          Seems like some bug in the K8S plugin and in how it resolves the correct IP for the Spot instance.

           


          Eddie Mashayev added a comment -

          It seems I needed to add more resources to the JNLP container, by editing the yaml resources (I have added 4 CPU and 4Gi RAM):

          yaml: """
          spec:
           containers:
             - name: "jnlp"
               resources:
                 requests:
                   cpu: "4"
                   memory: "4Gi"
          """
          

          I was getting a Pod evicted notification every time this error appeared in the Jenkins console.
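
          For completeness, a minimal sketch of the full podTemplate call with this override, assuming the agent container keeps its default name "jnlp" so the plugin merges the yaml entry into it by name (the resource values are just the ones mentioned above):

          podTemplate(containers: [
              containerTemplate(name: 'build', image: 'my_builder:latest', command: 'cat', ttyEnabled: true)
          ],
          yaml: """
          spec:
            containers:
              - name: "jnlp"
                resources:
                  requests:
                    cpu: "4"
                    memory: "4Gi"
          """
          )
          {
              node(POD_LABEL) {
                  // stages run here
              }
          }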


          Eddie Mashayev added a comment -

          Need to add more resources to the JNLP container; the default resources are sometimes not enough.

          Eddie Mashayev made changes -
          Resolution set to Fixed
          Status changed from Open to Fixed but Unreleased

          Karol Gil added a comment -

          eddiem21 did this stop for you after you increased resources? We're observing these failures on a daily basis, and in all cases it's trying to connect to non-existent node hostnames. We're running on-demand worker nodes on EKS (no Spot instances used).

          According to our monitoring, the JNLP container never uses more than 1.2 GB RAM and ~0.8 CPU, hence I doubt it's because of resources.


          Eddie Mashayev added a comment -

          karolgil Hey, we still face this issue once in a while. I worked on it a lot and described all my actions in this ticket.

          These things are NOT related to the issue:

          1. Increasing JNLP resources.
          2. Using Spot/on-demand instances.

           

          There is one thing that, once fixed, reduced this issue to happening only "once in a while":

          1. Increasing the root volume size for each EKS node - we build many Docker images and the root volume gets full very quickly; increasing it to 250G (the default is 20G) and cleaning the images frequently fixed the majority of the failures (see the cleanup sketch below).
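
          A hedged sketch of the kind of periodic cleanup that can keep a node's root volume from filling up, assuming the docker CLI is available in the build container and the host's /var/run/docker.sock is mounted as in the pod template above; the stage would run inside the node(POD_LABEL) block, and the 72h retention window is arbitrary:

          stage('Cleanup old images') {
              container('build') {
                  // Prune unused images on the node via the mounted host docker daemon.
                  sh 'docker image prune -af --filter "until=72h"'
              }
          }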

           

          BUT we are still facing this issue. I suspect it's related to the fact that Jenkins schedules a job on an EKS node that is going down as part of the autoscaler policy: the job is triggered and, at the same time, the autoscaler marks the same node to be cordoned. I don't have proof of this yet, and it's being investigated.

          Eddie Mashayev made changes -
          Resolution cleared (was Fixed)
          Status changed from Fixed but Unreleased to Reopened

          Karol Gil added a comment -

          Hey eddiem21, thanks for the response. I've been fighting this one for a while now as well and can confirm that your "not related" section is correct - we did both changes and issues are still being observed once in a while.

          Our monitoring shows that root volumes are far from full in any of the nodes being used for running our jobs so I doubt it's related - maybe the symptom is similar?

          I think it may be related to autoscaling, as you said - we're observing this mostly in jobs that use a specific autoscaling group that has a default capacity of 0 and scales up to 80 nodes at peak; this is when the issue is most common. What bugs me is the fact that I can't track the hostnames that are listed in the build log - these machines are not defined in AWS, nor can I see them in the autoscaler logs.

          By any chance, did you manage to reproduce this reliably? Or does it appear to be "random"?


            Assignee: Unassigned
            Reporter: Eddie Mashayev (eddiem21)
            Votes: 5
            Watchers: 16