Bug
Resolution: Unresolved
Minor
None
Server: Jenkins 2.361.4, deployed on OpenShift
Remoting: multiple, up to and including 3107.v665000b_51092 (via docker.io/jenkins/inbound-agent:latest-jdk11 Docker image)
I have Jenkins running on an OpenShift cluster (cluster 1). I have configured a Kubernetes cloud where agent pods can be created on a second OpenShift cluster (cluster 2). I have a long-running job which simply echoes output and then sleeps for a minute, in a 3-hour loop.
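For anyone trying to reproduce this, the workload is roughly equivalent to the sketch below. It is written in Python purely for illustration; the actual job is a Jenkins job running on the agent, and the function name `run_job`, its parameters, and the message text are hypothetical.

```python
import time
from typing import Callable


def run_job(total_seconds: int = 3 * 60 * 60,
            interval_seconds: int = 60,
            sleep: Callable[[float], None] = time.sleep,
            echo: Callable[[str], None] = print) -> int:
    """Echo a line, then sleep, repeating until total_seconds has elapsed.

    The sleep and echo callables are injectable so the loop can be
    exercised without actually waiting three hours.
    """
    iterations = total_seconds // interval_seconds
    for i in range(1, iterations + 1):
        echo(f"iteration {i} of {iterations}")
        sleep(interval_seconds)
    return iterations
```

With the defaults this produces 180 lines of output, one per minute, which gives plenty of time to interrupt the network mid-run.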
If I introduce a temporary network issue between cluster 1 and cluster 2, the connection from the agent running on cluster 2 to the server running on cluster 1 is broken; this is expected. I can provide details on the exact mechanism I use to cause this failure if necessary, but I doubt they're relevant. The focus of this issue isn't the network failure itself, but the agent's failure to recover from it.
After the connection is broken, the agent attempts to reconnect to the controller; however, the controller rejects the connection with the message "<agent-name> is already connected to this controller. Rejecting this connection." I'm attaching a file containing a portion of the logs from the agent pod showing the disconnection and failed reconnect attempt. After a while, the agent pod is simply terminated, and the long-running job fails.
From looking in the code base, it appears the error message comes from here, which suggests that there is a mismatch between the cookie and the channelCookie.
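To make my reading of that logic concrete, here is a simplified model of the decision as I understand it. This is not the Jenkins source: the `Channel` class, the `handle_connect` function, and the "accept"/"replace" return values are hypothetical names for illustration, and the assumption that a matching cookie lets the new connection replace the old one is my interpretation of the code.

```python
import hmac
from dataclasses import dataclass
from typing import Optional


@dataclass
class Channel:
    """Hypothetical stand-in for the controller's existing channel to an agent."""
    cookie: Optional[str]


def handle_connect(agent_name: str,
                   presented_cookie: Optional[str],
                   existing: Optional[Channel]) -> str:
    """Return 'accept', 'replace', or a rejection message (simplified model)."""
    if existing is None:
        # The controller has no channel on record for this agent.
        return "accept"
    if (presented_cookie is not None
            and existing.cookie is not None
            and hmac.compare_digest(presented_cookie.encode(),
                                    existing.cookie.encode())):
        # Same peer reconnecting: the stale channel can be dropped.
        return "replace"
    # Missing or mismatched cookie while a channel is still on record.
    return (f"{agent_name} is already connected to this controller. "
            "Rejecting this connection.")
```

Under this model, the rejection I am seeing would occur whenever the controller still holds the old channel and the reconnecting agent presents a missing or different cookie.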
I have found several other issues about agents disconnecting, but those are primarily focused on addressing the causes of the disconnect, rather than the failure to recover. Most relevant to this issue seems to be JENKINS-64510, which suggests that TCP agent reconnects had not been working for quite some time, and the description there even implies that the code delivered for that issue may not have been a full solution.
I have tried using WebSocket to connect agents rather than the TCP port. This provided no benefit in terms of agents being able to reconnect after a network issue.
I'd like to understand the conditions under which reconnection ought to be possible. The presence of reconnection code in the codebase, together with the aforementioned Jira issue, implies that Jenkins agents ought to be able to recover from a network interruption; however, that does not seem to be the case in our setup. Are there configuration changes we could make that might make our agents more resilient?
I can reproduce this behavior with 100% reliability and am happy to collect more complete logs, or provide other additional information that might be helpful. Thank you!