-
Bug
-
Resolution: Unresolved
-
Major
-
None
-
Jenkins 2.414.3, Kubernetes plugin
Version4054.v2da_8e2794884, Agent jenkins/inbound-agent:3181.v78543a_987053-1-alpine-jdk17
We're running our CI pipelines on Kubernetes where each pull request can spawn up to 40 parallel stages. On occasion, we would experience a build failure because of a ClosedChannelException occurring just after the controller recognizes that the agent's pod is running. Recently, we've experienced a 4x increase in these types of failures – almost always associated during the time of a scale-up event in the cluster. From the controller:
2023-11-03 16:13:40.442+0000 [id=999274] INFO hudson.remoting.Request$2#run: Failed to send back a reply to the request RPCRequest:hudson.remoting.RemoteClassLoader$IClassLoader.fetch3[java.lang.String](2): hudson.remoting.ChannelClosedException: Channel "hudson.remoting.Channel@121ef5f4:JNLP4-connect connection from ip-172-16-23-139.us-west-2.compute.internal/172.16.23.139:57914": channel is already closed 2023-11-03 16:13:41.593+0000 [id=999274] INFO o.c.j.p.k.KubernetesLauncher#launch: Created Pod: kubernetes jenkins/github-cdf-main-4924-hpzfn-njbgn-zrsv8 2023-11-03 16:13:43.506+0000 [id=999274] INFO o.c.j.p.k.KubernetesLauncher#launch: Pod is running: kubernetes jenkins/github-cdf-main-4924-hpzfn-njbgn-zrsv8 2023-11-03 16:13:53.850+0000 [id=999274] INFO o.c.j.p.k.KubernetesLauncher#launch: Created Pod: kubernetes jenkins/github-cdf-main-4925-pcbsj-wm756-gdcsl 2023-11-03 16:15:11.482+0000 [id=999825] INFO o.c.j.p.k.p.ContainerExecDecorator$1#doLaunch: Created process inside pod: [github-cdf-main-4924-qms8v-zmxbl-xgs5s], container: [clumio-base][230 ms] 2023-11-03 16:15:23.316+0000 [id=999825] INFO o.c.j.p.k.p.ContainerExecDecorator$1#doLaunch: Created process inside pod: [github-cdf-main-4925-96lxr-n5s81-f83qm], container: [clumio-base][230 ms] 2023-11-03 16:15:25.021+0000 [id=999274] INFO o.c.j.p.k.KubernetesLauncher#launch: Pod is running: kubernetes jenkins/github-cdf-main-4925-pcbsj-wm756-gdcsl 2023-11-03 16:15:37.815+0000 [id=999825] INFO o.c.j.p.k.p.ContainerExecDecorator$1#doLaunch: Created process inside pod: [github-cdf-main-4925-bjj6c-wfdcl-rn53j], container: [clumio-base][239 ms] 2023-11-03 16:15:53.128+0000 [id=90] INFO hudson.slaves.NodeProvisioner#update: github-cdf-pr-42-4-3hd5s-wtj9l-m9z1t provisioning successfully completed. We have now 111 computer(s) 2023-11-03 16:15:53.274+0000 [id=999825] INFO o.c.j.p.k.KubernetesLauncher#launch: Created Pod: kubernetes jenkins/github-cdf-pr-42-4-3hd5s-wtj9l-m9z1t 2023-11-03 16:16:01.278+0000 [id=999274] INFO o.c.j.p.k.p.ContainerExecDecorator$1#doLaunch: Created process inside pod: [githubjira-cdf-pr-42-3-j24cp-d0x9w-6x4n0], container: [clumio-base][229 ms] 2023-11-03 16:16:13.049+0000 [id=1000278] INFO h.TcpSlaveAgentListener$ConnectionHandler#run: Accepted JNLP4-connect connection #100276 from /172.16.23.81:51200 2023-11-03 16:16:19.599+0000 [id=999274] INFO o.c.j.p.k.p.ContainerExecDecorator$1#doLaunch: Created process inside pod: [github-cdf-main-4925-ql51w-h2r2w-8mnff], container: [clumio-base][243 ms] 2023-11-03 16:16:52.713+0000 [id=999825] INFO o.c.j.p.k.KubernetesLauncher#launch: Pod is running: kubernetes jenkins/github-cdf-pr-42-4-3hd5s-wtj9l-m9z1t 2023-11-03 16:16:53.006+0000 [id=999274] INFO j.s.DefaultJnlpSlaveReceiver#channelClosed: IOHub#1: Worker[channel:java.nio.channels.SocketChannel[connected local=/172.16.26.111:50000 remote=ip-172-16-23-81.us-west-2.compute.internal/172.16.23.81:51200]] / Computer.threadPoolForRemoting [#73270] for github-cdf-pr-42-4-3hd5s-wtj9l-m9z1t terminated: java.nio.channels.ClosedChannelException 2023-11-03 16:17:41.779+0000 [id=999174] INFO o.c.j.p.k.p.r.Reaper$TerminateAgentOnContainerTerminated#lambda$onEvent$1: jenkins/github-cdf-pr-42-4-3hd5s-wtj9l-m9z1t Container clumio-base was just terminated, so removing the corresponding Jenkins agent 2023-11-03 16:17:41.779+0000 [id=999174] INFO o.c.j.p.k.p.r.Reaper$TerminateAgentOnContainerTerminated#lambda$onEvent$1: jenkins/github-cdf-pr-42-4-3hd5s-wtj9l-m9z1t Container jnlp was just terminated, so removing the corresponding Jenkins agent 2023-11-03 16:17:41.808+0000 [id=999174] INFO o.c.j.p.k.KubernetesSlave#_terminate: Terminating Kubernetes instance for agent github-cdf-pr-42-4-3hd5s-wtj9l-m9z1t 2023-11-03 16:17:41.813+0000 [id=999174] WARNING o.c.j.p.k.KubernetesSlave#deleteSlavePod: Failed to delete pod for agent jenkins/github-cdf-pr-42-4-3hd5s-wtj9l-m9z1t: not found 2023-11-03 16:17:41.814+0000 [id=999174] INFO o.c.j.p.k.KubernetesSlave#_terminate: Disconnected computer github-cdf-pr-42-4-3hd5s-wtj9l-m9z1t 2023-11-03 16:17:41.819+0000 [id=999174] INFO o.c.j.p.k.p.r.Reaper$TerminateAgentOnPodFailed#onEvent: jenkins/github-cdf-pr-42-4-3hd5s-wtj9l-m9z1t Pod just failed. Removing the corresponding Jenkins agent. Reason: null, Message: null 2023-11-03 16:17:41.835+0000 [id=999174] INFO o.c.j.p.k.KubernetesSlave#_terminate: Terminating Kubernetes instance for agent github-cdf-pr-42-4-3hd5s-wtj9l-m9z1t 2023-11-03 16:17:41.835+0000 [id=999174] SEVERE o.c.j.p.k.KubernetesSlave#_terminate: Computer for agent is null: github-cdf-pr-42-4-3hd5s-wtj9l-m9z1t 2023-11-03 16:17:41.836+0000 [id=999174] INFO hudson.slaves.AbstractCloudSlave#terminate: FATAL: Computer for agent is null: github-cdf-pr-42-4-3hd5s-wtj9l-m9z1t 2023-11-03 16:22:04.573+0000 [id=1001374] INFO h.TcpSlaveAgentListener$ConnectionHandler#run: Accepted JNLP4-connect connection #100384 from /172.16.23.81:43584
Java Runtime on Controller
openjdk 17.0.8.1 2023-08-24 OpenJDK Runtime Environment Temurin-17.0.8.1+1 (build 17.0.8.1+1) OpenJDK 64-Bit Server VM Temurin-17.0.8.1+1 (build 17.0.8.1+1, mixed mode)
One other item of note is that our pipelines can't simply be restarted from the point of failure; they must undergo the entire build cycle when the job fails. Therefore, a configurable retry w/ backoff for initial pod startup would significantly help address this issue.