Jenkins / JENKINS-59705

hudson.remoting.ChannelClosedException: Channel "unknown": Remote call on JNLP4-connect connection from IP/IP:58344 failed. The channel is closing down or has closed down

    • Type: Bug
    • Resolution: Not A Defect
    • Priority: Blocker
    • Component/s: kubernetes-plugin
    • Environment:
      Jenkins master version: 2.190.1
      Kubernetes Plugin: 1.19.3

      It also happened before the upgrade, with:
      Jenkins: 2.176.3
      K8S plugin: 1.19.0

      It happens frequently but not consistently, which makes it very hard to debug.

      This is my podTemplate:

      podTemplate(containers: [
          containerTemplate(
              name: 'build',
              image: 'my_builder:latest',
              command: 'cat',
              ttyEnabled: true,
              workingDir: '/mnt/jenkins'
          )
      ],
      volumes: [
          hostPathVolume(mountPath: '/var/run/docker.sock', hostPath: '/var/run/docker.sock'),
          hostPathVolume(mountPath: '/mnt/jenkins', hostPath: '/mnt/jenkins')
      ],
      yaml: """
      spec:
       containers:
         - name: build
           resources:
             requests:
               cpu: "10"
               memory: "10Gi" 
       securityContext:
         fsGroup: 995
      """
      )
      {
          node(POD_LABEL) {
              stage("Checkout") {
              }       
              // more stages
          }
      }
      

      This is the log from the pod:

      Inbound agent connected from IP/IP
      Waiting for agent to connect (0/100): my_branch
      Remoting version: 3.35
      This is a Unix agent
      Waiting for agent to connect (1/100): my_branch
      Agent successfully connected and online
      ERROR: Connection terminated
      java.nio.channels.ClosedChannelException
          at org.jenkinsci.remoting.protocol.NetworkLayer.onRecvClosed(NetworkLayer.java:154)
          at org.jenkinsci.remoting.protocol.impl.NIONetworkLayer.ready(NIONetworkLayer.java:142)
          at org.jenkinsci.remoting.protocol.IOHub$OnReady.run(IOHub.java:795)
          at jenkins.util.ContextResettingExecutorService$1.run(ContextResettingExecutorService.java:28)
          at jenkins.security.ImpersonatingExecutorService$1.run(ImpersonatingExecutorService.java:59)
          at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
          at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
          at java.lang.Thread.run(Thread.java:748)
      

      Logs from the Jenkins master ("cat /var/log/jenkins/jenkins.log"):

      2019-10-08 14:40:48.171+0000 [id=287] WARNING o.c.j.p.k.KubernetesLauncher#launch: Error in provisioning; agent=KubernetesSlave name: branch_name, template=PodTemplate{, name='pod_name', namespace='default', label='label_name', nodeUsageMode=EXCLUSIVE, volumes=[HostPathVolume [mountPath=/var/run/docker.sock, hostPath=/var/run/docker.sock], HostPathVolume [mountPath=/mnt/jenkins, hostPath=/mnt/jenkins]], containers=[ContainerTemplate{name='build', image='my_builder', workingDir='/mnt/jenkins', command='cat', ttyEnabled=true, envVars=[KeyValueEnvVar [getValue()=deploy/.dazelrc, getKey()=RC_FILE]]}], annotations=[org.csanchez.jenkins.plugins.kubernetes.PodAnnotation@aab9c821]}
      io.fabric8.kubernetes.client.KubernetesClientTimeoutException: Timed out waiting for [100000] milliseconds for [Pod] with name:[branch_name] in namespace [default].
          at org.csanchez.jenkins.plugins.kubernetes.AllContainersRunningPodWatcher.await(AllContainersRunningPodWatcher.java:130)
          at org.csanchez.jenkins.plugins.kubernetes.KubernetesLauncher.launch(KubernetesLauncher.java:134)
          at hudson.slaves.SlaveComputer$1.call(SlaveComputer.java:297)
          at jenkins.util.ContextResettingExecutorService$2.call(ContextResettingExecutorService.java:46)
          at jenkins.security.ImpersonatingExecutorService$2.call(ImpersonatingExecutorService.java:71)
          at java.util.concurrent.FutureTask.run(FutureTask.java:266)
          at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
          at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
          at java.lang.Thread.run(Thread.java:748)
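
      For reference, the [100000] milliseconds in the timeout above is the pod template's agent-connect wait (100 seconds). If pods legitimately need longer to pull images and start, that wait can be raised per template; a minimal sketch, assuming the kubernetes plugin's slaveConnectTimeout podTemplate parameter (value in seconds, illustrative only):

      // Hedged sketch: extend how long the plugin waits for the pod and agent to come up.
      // slaveConnectTimeout is assumed to be the relevant podTemplate parameter; 300 is an example value.
      podTemplate(
          slaveConnectTimeout: 300,
          containers: [
              containerTemplate(name: 'build', image: 'my_builder:latest', command: 'cat', ttyEnabled: true)
          ]
      ) {
          node(POD_LABEL) {
              // stages run here
          }
      }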
      


          Eddie Mashayev added a comment -

          I think I have found the issue. I'm using EKS with Spot instances to run my CI. When using Spot instances this issue happens frequently; when using on-demand instances it passes all the time.

          The reason is that Jenkins is getting the wrong instance IP to connect to the Jenkins master.

          Example:

          kubectl get pods -o wide --all-namespaces
          NAMESPACE       NAME                                                              READY   STATUS              RESTARTS   AGE     IP              NODE                            NOMINATED NODE
          default         some-job-5-c328g-kd-k2pfz   0/2     ContainerCreating   0          2s      <none>          ip-172-26-18-44.ec2.internal    <none>
          

          As you can see, the job runs on instance "ip-172-26-18-44.ec2.internal".

          The instance is in Ready state in Kubernetes:

          kubectl get nodes
          NAME                            STATUS                     ROLES    AGE     VERSION
          ip-172-26-18-44.ec2.internal    Ready                      <none>   11d     v1.12.10-eks-1246e3
          

           

          This is the log from the Jenkins console:

          [Pipeline] End of Pipeline
          Also:   hudson.remoting.Channel$CallSiteStackTrace: Remote call to JNLP4-connect connection from ip-172-26-30-207.ec2.internal/jenkins_master_IP:37312
                  at hudson.remoting.Channel.attachCallSiteStackTrace(Channel.java:1743)
                  at hudson.remoting.UserRequest$ExceptionResponse.retrieve(UserRequest.java:357)
                  at hudson.remoting.Channel.call(Channel.java:957)
                  at hudson.FilePath.act(FilePath.java:1072)
                  at hudson.FilePath.act(FilePath.java:1061)
                  at hudson.FilePath.mkdirs(FilePath.java:1246)
                  at hudson.plugins.git.GitSCM.createClient(GitSCM.java:811)
                  at hudson.plugins.git.GitSCM.checkout(GitSCM.java:1186)
                  at org.jenkinsci.plugins.workflow.steps.scm.SCMStep.checkout(SCMStep.java:124)
                  at org.jenkinsci.plugins.workflow.steps.scm.SCMStep$StepExecutionImpl.run(SCMStep.java:93)
                  at org.jenkinsci.plugins.workflow.steps.scm.SCMStep$StepExecutionImpl.run(SCMStep.java:80)
                  at org.jenkinsci.plugins.workflow.steps.SynchronousNonBlockingStepExecution.lambda$start$0(SynchronousNonBlockingStepExecution.java:47)
                  at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
                  at java.util.concurrent.FutureTask.run(FutureTask.java:266)
                  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
                  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
          java.nio.file.AccessDeniedException: /mnt/jenkins/workspace
              at sun.nio.fs.UnixException.translateToIOException(UnixException.java:84)
              at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
              at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107)
              at sun.nio.fs.UnixFileSystemProvider.createDirectory(UnixFileSystemProvider.java:384)
              at java.nio.file.Files.createDirectory(Files.java:674)
              at java.nio.file.Files.createAndCheckIsDirectory(Files.java:781)
              at java.nio.file.Files.createDirectories(Files.java:767)
              at hudson.FilePath.mkdirs(FilePath.java:3239)
              at hudson.FilePath.access$1300(FilePath.java:212)
              at hudson.FilePath$Mkdirs.invoke(FilePath.java:1254)
              at hudson.FilePath$Mkdirs.invoke(FilePath.java:1250)
              at hudson.FilePath$FileCallableWrapper.call(FilePath.java:3052)
              at hudson.remoting.UserRequest.perform(UserRequest.java:211)
              at hudson.remoting.UserRequest.perform(UserRequest.java:54)
              at hudson.remoting.Request$2.run(Request.java:369)
              at hudson.remoting.InterceptingExecutorService$1.call(InterceptingExecutorService.java:72)
              at java.util.concurrent.FutureTask.run(FutureTask.java:266)
              at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
              at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
              at hudson.remoting.Engine$1.lambda$newThread$0(Engine.java:97)
              at java.lang.Thread.run(Thread.java:748)
          

          It tries to connect to the Jenkins master with "ip-172-26-30-207.ec2.internal", and this instance doesn't exist.

           

          Seems like some bug in the K8S plugin and in how it resolves the correct IP for the Spot instance.

           


          Eddie Mashayev added a comment -

          It seems I needed to add more resources to the JNLP container, by editing the yaml resources (I have added 4 CPU and 4Gi RAM):

          yaml: """
          spec:
           containers:
             - name: "jnlp"
               resources:
                 requests:
                   cpu: "4"
                   memory: "4Gi"
          """
          

          I was getting a Pod evicted notification every time this error appeared in the Jenkins console.
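
          For completeness, a minimal sketch of the full podTemplate call with this override, assuming the agent container keeps its default name "jnlp" so the plugin merges the yaml entry into it by name (the resource values are just the ones mentioned above):

          podTemplate(containers: [
              containerTemplate(name: 'build', image: 'my_builder:latest', command: 'cat', ttyEnabled: true)
          ],
          yaml: """
          spec:
            containers:
              - name: "jnlp"
                resources:
                  requests:
                    cpu: "4"
                    memory: "4Gi"
          """
          )
          {
              node(POD_LABEL) {
                  // stages run here
              }
          }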


          Eddie Mashayev added a comment -

          Need to add more resources to the JNLP container; the default resources are sometimes not enough.

          Eddie Mashayev made changes -
          Resolution set to Fixed
          Status changed from Open to Fixed but Unreleased

          Karol Gil added a comment -

          eddiem21 did this stop for you after you increased resources? We're observing these failures on a daily basis, and in all cases it's trying to connect to non-existent node hostnames. We're running on-demand worker nodes on EKS (no Spot instances used).

          According to our monitoring, the JNLP container never uses more than 1.2 GB RAM and ~0.8 CPU, hence I doubt it's because of resources.


          Eddie Mashayev added a comment -

          karolgil Hey, we still face this issue once in a while. I worked on it a lot and described all my actions in this ticket.

          These things are NOT related to the issue:

          1. Increasing JNLP resources.
          2. Using Spot/on-demand instances.

           

          There is one thing that, once fixed, reduced this issue to happening only "once in a while":

          1. Increasing the root volume size for each EKS node - we build many Docker images and the root volume gets full very quickly; increasing it to 250G (the default is 20G) and cleaning the images frequently fixed the majority of the failures (see the cleanup sketch below).
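
          A hedged sketch of the kind of periodic cleanup that can keep a node's root volume from filling up, assuming the docker CLI is available in the build container and the host's /var/run/docker.sock is mounted as in the pod template above; the stage would run inside the node(POD_LABEL) block, and the 72h retention window is arbitrary:

          stage('Cleanup old images') {
              container('build') {
                  // Prune unused images on the node via the mounted host docker daemon.
                  sh 'docker image prune -af --filter "until=72h"'
              }
          }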

           

          BUT we are still facing this issue. I suspect it's related to the fact that Jenkins schedules a job on an EKS node that is going down as part of the autoscaler policy: the job is triggered and, at the same time, the autoscaler marks the same node to be cordoned. I don't have proof of this yet, and it's being investigated.

          Eddie Mashayev made changes -
          Resolution cleared (was Fixed)
          Status changed from Fixed but Unreleased to Reopened

          Karol Gil added a comment -

          Hey eddiem21, thanks for the response. I've been fighting this one for a while now as well and can confirm that your "not related" section is correct - we did both changes and issues are still being observed once in a while.

          Our monitoring shows that root volumes are far from full in any of the nodes being used for running our jobs so I doubt it's related - maybe the symptom is similar?

          I think it may be related to autoscaling, as you said - we're observing this mostly in jobs that use a specific autoscaling group that has a default capacity of 0 and scales up to 80 nodes at peak; this is when the issue is most common. What bugs me is the fact that I can't track the hostnames that are listed in the build log - these machines are not defined in AWS, nor can I see them in the autoscaler logs.

          By any chance, did you manage to reproduce this reliably? Or does it appear to be "random"?


            Assignee: Unassigned
            Reporter: Eddie Mashayev (eddiem21)
            Votes: 5
            Watchers: 16