- Type: Bug
- Resolution: Unresolved
- Priority: Major
- Component/s: kubernetes-plugin, remoting
- Environment: Jenkins 2.346.2 with Java java-11-openjdk-headless-11.0.15.0.9-2.el7_9.x86_64
  Kubernetes plugin 3670.v6ca_059233222
  Kubernetes client API plugin 5.12.2-193.v26a_6078f65a_9
  Kubernetes agents image inbound-agent:4.11-1-jdk11
We're seeing an issue that started right after updating Jenkins from 2.332.3 to 2.346.2. As part of this upgrade we also did the required JRE update from 8 to 11. Since then, some pipelines running on Kubernetes agents stall with the following error message in the console output:
Cannot contact e2e-integrations-e2e-personal-e2e-april-494-x1rj1-wntgr-lf3s3: java.io.IOException: Remote call on JNLP4-connect connection from kubernetes_worker_hostname/X.X.X.X:42193 failed
No further output appears in the pipeline, no matter how long we leave it. To be clear, this error happens mid-run: we get some output, then it dies, usually in a similar spot but not always. Not all pipelines experience this issue. This error doesn't make a lot of sense for a few reasons:
- We still see an established TCP connection to Jenkins in netstat for this port/agent.
- When running tcpdump we see "data" still being exchanged over this connection; packets flow constantly.
- There is no packet loss, drop, or delay issue; we have confirmed this with packet captures.
Whatever is happening here, it's not the network at fault, as the error message seems to indicate.
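For anyone who wants to repeat the netstat check across many agents, here is a minimal sketch of how it could be scripted. It assumes `netstat -tn`-style output (Proto, Recv-Q, Send-Q, local address, foreign address, state); the function name `established_on_port` and the sample addresses are hypothetical, and the port is the one from the error message above.

```python
def established_on_port(netstat_output, port):
    """Return the netstat rows that are ESTABLISHED and use the given port.

    Assumes `netstat -tn`-style columns:
    Proto Recv-Q Send-Q Local-Address Foreign-Address State
    """
    suffix = f":{port}"
    matches = []
    for line in netstat_output.splitlines():
        fields = line.split()
        # Skip headers, blank lines, and anything that is not a TCP row.
        if len(fields) < 6 or not fields[0].startswith("tcp"):
            continue
        local, foreign, state = fields[3], fields[4], fields[5]
        if state == "ESTABLISHED" and (
            local.endswith(suffix) or foreign.endswith(suffix)
        ):
            matches.append(line)
    return matches


if __name__ == "__main__":
    # Hypothetical sample; the addresses are documentation-range placeholders.
    sample = (
        "Proto Recv-Q Send-Q Local Address Foreign Address State\n"
        "tcp 0 0 192.0.2.10:50000 192.0.2.20:42193 ESTABLISHED\n"
        "tcp 0 0 192.0.2.10:50000 192.0.2.21:51000 TIME_WAIT\n"
    )
    for row in established_on_port(sample, 42193):
        print(row)
```

This only confirms the socket state; it says nothing about whether the remoting channel on top of the connection is still healthy, which is the part that appears to be failing here.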
When we look at the Kubernetes pod for this job, the JNLP container is running fine, and its last log message is:
Jul 28, 2022 1:35:38 AM hudson.remoting.jnlp.Main$CuiListener status
INFO: Connected
Looking at our executor container in this pod, our test processes are still running fine:
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
e2euser 8 0.0 0.0 5996 3760 pts/1 Ss 01:46 0:00 bash
e2euser 164 0.0 0.0 8596 3292 pts/1 R+ 01:48 0:00 \_ ps auxf
e2euser 1 0.0 0.0 2176 576 ? Ss 01:46 0:00 /usr/bin/dumb-init cat
e2euser 7 0.0 0.0 4364 512 pts/0 Ss+ 01:46 0:00 cat
e2euser 69 0.0 0.0 2432 116 ? S 01:47 0:00 sh -c ({ while [ -d '/home/jenkins/agent/workspace/test1/e2e-personal/e2e/test-code@tmp/durable-ba81b591' -a \! -f '/home/jenkins/agent/wo
e2euser 70 0.0 0.0 2432 156 ? S 01:47 0:00 \_ sh -c ({ while [ -d '/home/jenkins/agent/workspace/test1/e2e-personal/e2e/test-code@tmp/durable-ba81b591' -a \! -f '/home/jenkins/agen
e2euser 163 0.0 0.0 4236 588 ? S 01:48 0:00 | \_ sleep 3
e2euser 72 0.0 0.0 2432 544 ? S 01:47 0:00 \_ sh -xe /home/jenkins/agent/workspace/test1/e2e-personal/e2e-/test-code@tmp/durable-ba81b591/script.sh
e2euser 73 20.0 1.5 4624652 4538752 ? Sl 01:47 0:14 \_ /usr/local/bin/python /usr/local/bin/pytest --capture tee-sys -rA tests/test.py --log DEBUG --runslo
There are no errors logged on the Jenkins server side beyond what is shown in the pipeline output pasted above.
Downgrading Java or Jenkins at this point isn't really an option due to the sheer number of agents we have and the number of teams involved.