[JENKINS-69152] Remote call on JNLP4-connect connection from XXX failed

Type: Bug
Resolution: Unresolved
Priority: Major
Component/s: kubernetes-plugin, remoting
Labels: None
Environment:
Jenkins 2.346.2 with Java java-11-openjdk-headless-11.0.15.0.9-2.el7_9.x86_64
Kubernetes plugin 3670.v6ca_059233222
Kubernetes client API plugin 5.12.2-193.v26a_6078f65a_9
Kubernetes agents image inbound-agent:4.11-1-jdk11

We're seeing an issue that started after updating Jenkins from 2.332.3 to 2.346.2. As part of this upgrade we also did the required JRE update from 8 to 11. Since then, some pipelines running on Kubernetes agents stall with the following error message in the console output:

Cannot contact e2e-integrations-e2e-personal-e2e-april-494-x1rj1-wntgr-lf3s3: java.io.IOException: Remote call on JNLP4-connect connection from kubernetes_worker_hostname/X.X.X.X:42193 failed

No further output appears in the pipeline, no matter how long we leave it. To be clear, this error happens mid-run: we get some output, then the pipeline dies, usually (but not always) in a similar spot. Not all pipelines experience this issue. The error doesn't make a lot of sense for a few reasons:

We still see an established TCP connection to Jenkins in netstat for this port/agent.
When running tcpdump we see data still being exchanged over this connection; packets flow constantly.
There is no packet loss, drop, or delay issue; we have confirmed this with packet captures.
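
For anyone who wants to repeat the first check on their own agents, here is a minimal sketch of how it could be scripted. It is not part of our tooling. It assumes `netstat -tn`-style text output; the function name `established_peers`, the sample addresses (documentation-range IPs), and the controller port 50000 are all hypothetical. Only port 42193 comes from the error message above.

```python
def established_peers(netstat_output, port):
    """Return (local, remote) address pairs for ESTABLISHED TCP
    connections that use `port` on either end."""
    wanted = str(port)
    matches = []
    for line in netstat_output.splitlines():
        fields = line.split()
        # Assumed `netstat -tn` columns:
        # Proto Recv-Q Send-Q Local-Address Foreign-Address State
        if len(fields) < 6 or not fields[0].startswith("tcp"):
            continue  # skip banner, header, and non-TCP lines
        local, remote, state = fields[3], fields[4], fields[5]
        if state != "ESTABLISHED":
            continue
        # The port is whatever follows the last ':' in "address:port".
        ports = {local.rsplit(":", 1)[-1], remote.rsplit(":", 1)[-1]}
        if wanted in ports:
            matches.append((local, remote))
    return matches


# Hypothetical sample output; real addresses will differ.
SAMPLE = """\
Active Internet connections (w/o servers)
Proto Recv-Q Send-Q Local Address           Foreign Address         State
tcp        0      0 192.0.2.10:42193        192.0.2.1:50000         ESTABLISHED
tcp        0      0 192.0.2.10:51234        192.0.2.1:443           TIME_WAIT
"""

if __name__ == "__main__":
    for local, remote in established_peers(SAMPLE, 42193):
        print(f"{local} <-> {remote} is ESTABLISHED")
```

A non-empty result only shows that the socket is still open at the TCP level, which is what we observe. It says nothing about whether the remoting channel on top of it is healthy.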

Whatever is happening here, it's not the network at fault, as the error message seems to indicate.

When we look at the Kubernetes pod for this job, the JNLP container is running fine, and its last log message is:

Jul 28, 2022 1:35:38 AM hudson.remoting.jnlp.Main$CuiListener status
INFO: Connected

Looking at our executor container in this pod, our test processes are still running fine:

USER         PID %CPU %MEM     VSZ     RSS TTY      STAT START   TIME COMMAND
e2euser        8  0.0  0.0    5996    3760 pts/1    Ss   01:46   0:00 bash
e2euser      164  0.0  0.0    8596    3292 pts/1    R+   01:48   0:00  \_ ps auxf
e2euser        1  0.0  0.0    2176     576 ?        Ss   01:46   0:00 /usr/bin/dumb-init cat
e2euser        7  0.0  0.0    4364     512 pts/0    Ss+  01:46   0:00 cat
e2euser       69  0.0  0.0    2432     116 ?        S    01:47   0:00 sh -c ({ while [ -d '/home/jenkins/agent/workspace/test1/e2e-personal/e2e/test-code@tmp/durable-ba81b591' -a \! -f '/home/jenkins/agent/wo
e2euser       70  0.0  0.0    2432     156 ?        S    01:47   0:00  \_ sh -c ({ while [ -d '/home/jenkins/agent/workspace/test1/e2e-personal/e2e/test-code@tmp/durable-ba81b591' -a \! -f '/home/jenkins/agen
e2euser      163  0.0  0.0    4236     588 ?        S    01:48   0:00  |   \_ sleep 3
e2euser       72  0.0  0.0    2432     544 ?        S    01:47   0:00  \_ sh -xe /home/jenkins/agent/workspace/test1/e2e-personal/e2e-/test-code@tmp/durable-ba81b591/script.sh
e2euser       73 20.0  1.5 4624652 4538752 ?        Sl   01:47   0:14      \_ /usr/local/bin/python /usr/local/bin/pytest --capture tee-sys -rA tests/test.py --log DEBUG --runslo

There are no errors logged on the Jenkins server side beyond what is shown in the pipeline output pasted above.

Downgrading Java or Jenkins at this point isn't really an option due to the sheer number of agents we have and the number of teams involved.

Anthony created issue - 2022-07-28 01:54

Anthony made changes - 2022-07-28 15:00

Component/s

New: remoting [ 15489 ]

Anthony made changes - 2022-07-28 15:22

Description

