Type: Bug
Resolution: Unresolved
Priority: Major
Labels: None

Environment:
Jenkins 2.346.2 with Java java-11-openjdk-headless-11.0.15.0.9-2.el7_9.x86_64
Kubernetes plugin 3670.v6ca_059233222
Kubernetes client API plugin 5.12.2-193.v26a_6078f65a_9
Kubernetes agent image inbound-agent:4.11-1-jdk11
We're seeing an issue that started right after updating to Jenkins 2.346.2 (we were on 2.332.3); with this upgrade we also did the required JRE update from 8 to 11. Since this update, some pipelines running on Kubernetes agents stall with the error message shown in the console output:
Cannot contact e2e-integrations-e2e-personal-e2e-april-494-x1rj1-wntgr-lf3s3: java.io.IOException: Remote call on JNLP4-connect connection from kubernetes_worker_hostname/X.X.X.X:42193 failed
No further output appears in the pipeline, no matter how long we leave it. To be clear, this error happens mid-run: we get some output and then it dies, usually in a similar spot, but not always. Not all pipelines experience this issue. The error doesn't make a lot of sense, for a few reasons:
- We still see an established TCP connection to Jenkins in netstat for this port/agent (see the commands sketched after this list)
- Running tcpdump, we see data still being exchanged over this connection; packets flow constantly
- There is no packet loss/drop/delay issue; we have confirmed this with packet captures
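For reference, these are roughly the checks we run from the worker node; the port number is the ephemeral source port from the error message above and differs per connection:

# Confirm the JNLP connection to the Jenkins controller is still ESTABLISHED
netstat -tnp | grep 42193
# (or the ss equivalent)
ss -tnp | grep 42193
# Watch traffic on that connection; packets keep flowing the whole time the pipeline is "stalled"
tcpdump -nn -i any port 42193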
Whatever is happening here, it's not a network fault, despite what the error message seems to indicate.
When we take a look at the Kubernetes pod for this job, the JNLP container is running fine, with the last log message being:
Jul 28, 2022 1:35:38 AM hudson.remoting.jnlp.Main$CuiListener status INFO: Connected
Looking at our executor container in this pod, our test processes are still running fine:
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
e2euser 8 0.0 0.0 5996 3760 pts/1 Ss 01:46 0:00 bash
e2euser 164 0.0 0.0 8596 3292 pts/1 R+ 01:48 0:00 \_ ps auxf
e2euser 1 0.0 0.0 2176 576 ? Ss 01:46 0:00 /usr/bin/dumb-init cat
e2euser 7 0.0 0.0 4364 512 pts/0 Ss+ 01:46 0:00 cat
e2euser 69 0.0 0.0 2432 116 ? S 01:47 0:00 sh -c ({ while [ -d '/home/jenkins/agent/workspace/test1/e2e-personal/e2e/test-code@tmp/durable-ba81b591' -a \! -f '/home/jenkins/agent/wo
e2euser 70 0.0 0.0 2432 156 ? S 01:47 0:00 \_ sh -c ({ while [ -d '/home/jenkins/agent/workspace/test1/e2e-personal/e2e/test-code@tmp/durable-ba81b591' -a \! -f '/home/jenkins/agen
e2euser 163 0.0 0.0 4236 588 ? S 01:48 0:00 | \_ sleep 3
e2euser 72 0.0 0.0 2432 544 ? S 01:47 0:00 \_ sh -xe /home/jenkins/agent/workspace/test1/e2e-personal/e2e-/test-code@tmp/durable-ba81b591/script.sh
e2euser 73 20.0 1.5 4624652 4538752 ? Sl 01:47 0:14 \_ /usr/local/bin/python /usr/local/bin/pytest --capture tee-sys -rA tests/test.py --log DEBUG --runslo
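For reference, the pod state, JNLP container log, and the process list above were gathered with commands along these lines (namespace omitted; the pod name is taken from the error message above and <executor> is a placeholder for our own container name):

# Pod status for the stalled agent
kubectl get pod e2e-integrations-e2e-personal-e2e-april-494-x1rj1-wntgr-lf3s3 -o wide
# Last log lines from the inbound-agent (jnlp) container
kubectl logs e2e-integrations-e2e-personal-e2e-april-494-x1rj1-wntgr-lf3s3 -c jnlp --tail=20
# Process list inside our executor container
kubectl exec e2e-integrations-e2e-personal-e2e-april-494-x1rj1-wntgr-lf3s3 -c <executor> -- ps auxf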
There are no errors logged on the Jenkins server side beyond what is shown in the pipeline output pasted above.
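(For completeness, the controller-side check was along these lines; the systemd unit name is an assumption for a typical RPM install:)

# Search the controller logs for remoting/channel errors around the time of the stall
journalctl -u jenkins --since "2 hours ago" | grep -iE "remoting|jnlp|channel"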
Downgrading Java or Jenkins at this point isn't really an option, due to the sheer number of agents we have and the number of teams involved.