Type: Bug
Resolution: Unresolved
Priority: Critical
Environment: Docker image based on jenkins/jenkins:2.204.5-jdk11, both with and without Nginx 1.17.6 as reverse proxy, Ubuntu 18.04
While investigating a spike in the build queue size, we found that the TcpSlaveAgentListener thread was dead, with the following logs:
2019-10-23 09:02:17.236+0000 [id=200815] SEVERE h.TcpSlaveAgentListener$ConnectionHandler#lambda$new$0: Uncaught exception in TcpSlaveAgentListener ConnectionHandler Thread[TCP agent connection handler #1715 with /10.125.100.99:47700,5,main]
java.lang.UnsupportedOperationException: Network layer is not supposed to call isSendOpen
    at org.jenkinsci.remoting.protocol.ProtocolStack$Ptr.isSendOpen(ProtocolStack.java:730)
    at org.jenkinsci.remoting.protocol.FilterLayer.isSendOpen(FilterLayer.java:340)
    at org.jenkinsci.remoting.protocol.ProtocolStack$Ptr.isSendOpen(ProtocolStack.java:738)
    at org.jenkinsci.remoting.protocol.FilterLayer.isSendOpen(FilterLayer.java:340)
    at org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer.isSendOpen(SSLEngineFilterLayer.java:237)
    at org.jenkinsci.remoting.protocol.ProtocolStack$Ptr.isSendOpen(ProtocolStack.java:738)
    at org.jenkinsci.remoting.protocol.FilterLayer.isSendOpen(FilterLayer.java:340)
    at org.jenkinsci.remoting.protocol.impl.ConnectionHeadersFilterLayer.isSendOpen(ConnectionHeadersFilterLayer.java:514)
    at org.jenkinsci.remoting.protocol.ProtocolStack$Ptr.doSend(ProtocolStack.java:690)
    at org.jenkinsci.remoting.protocol.ApplicationLayer.write(ApplicationLayer.java:157)
    at org.jenkinsci.remoting.protocol.impl.ChannelApplicationLayer.start(ChannelApplicationLayer.java:230)
    at org.jenkinsci.remoting.protocol.ProtocolStack.init(ProtocolStack.java:201)
    at org.jenkinsci.remoting.protocol.ProtocolStack.access$700(ProtocolStack.java:106)
    at org.jenkinsci.remoting.protocol.ProtocolStack$Builder.build(ProtocolStack.java:554)
    at org.jenkinsci.remoting.engine.JnlpProtocol4Handler.handle(JnlpProtocol4Handler.java:153)
    at jenkins.slaves.JnlpSlaveAgentProtocol4.handle(JnlpSlaveAgentProtocol4.java:203)
    at hudson.TcpSlaveAgentListener$ConnectionHandler.run(TcpSlaveAgentListener.java:271)
2019-10-23 09:02:17.237+0000 [id=200815] WARNING hudson.TcpSlaveAgentListener$1#run: Connection handler failed, restarting listener
java.lang.UnsupportedOperationException: Network layer is not supposed to call isSendOpen
    at org.jenkinsci.remoting.protocol.ProtocolStack$Ptr.isSendOpen(ProtocolStack.java:730)
    at org.jenkinsci.remoting.protocol.FilterLayer.isSendOpen(FilterLayer.java:340)
    at org.jenkinsci.remoting.protocol.ProtocolStack$Ptr.isSendOpen(ProtocolStack.java:738)
    at org.jenkinsci.remoting.protocol.FilterLayer.isSendOpen(FilterLayer.java:340)
    at org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer.isSendOpen(SSLEngineFilterLayer.java:237)
    at org.jenkinsci.remoting.protocol.ProtocolStack$Ptr.isSendOpen(ProtocolStack.java:738)
    at org.jenkinsci.remoting.protocol.FilterLayer.isSendOpen(FilterLayer.java:340)
    at org.jenkinsci.remoting.protocol.impl.ConnectionHeadersFilterLayer.isSendOpen(ConnectionHeadersFilterLayer.java:514)
    at org.jenkinsci.remoting.protocol.ProtocolStack$Ptr.doSend(ProtocolStack.java:690)
    at org.jenkinsci.remoting.protocol.ApplicationLayer.write(ApplicationLayer.java:157)
    at org.jenkinsci.remoting.protocol.impl.ChannelApplicationLayer.start(ChannelApplicationLayer.java:230)
    at org.jenkinsci.remoting.protocol.ProtocolStack.init(ProtocolStack.java:201)
    at org.jenkinsci.remoting.protocol.ProtocolStack.access$700(ProtocolStack.java:106)
    at org.jenkinsci.remoting.protocol.ProtocolStack$Builder.build(ProtocolStack.java:554)
    at org.jenkinsci.remoting.engine.JnlpProtocol4Handler.handle(JnlpProtocol4Handler.java:153)
    at jenkins.slaves.JnlpSlaveAgentProtocol4.handle(JnlpSlaveAgentProtocol4.java:203)
    at hudson.TcpSlaveAgentListener$ConnectionHandler.run(TcpSlaveAgentListener.java:271)
This was followed by logs from agents created by the Jenkins Kubernetes plugin:
SEVERE: http://jenkins-master.example.com/ provided port:50000 is not reachable
java.io.IOException: http://jenkins-master.example.com/ provided port:50000 is not reachable
    at org.jenkinsci.remoting.engine.JnlpAgentEndpointResolver.resolve(JnlpAgentEndpointResolver.java:287)
    at hudson.remoting.Engine.innerRun(Engine.java:523)
    at hudson.remoting.Engine.run(Engine.java:474)
Changing the JNLP port from 50000 to 50001 and back in the Jenkins settings restored the connection, and the nodes were then able to connect to the master again.
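For anyone who prefers to script that workaround instead of clicking through the UI, here is a minimal Jenkins Script Console (Groovy) sketch. The port values are hypothetical (assuming the inbound-agent port is normally 50000); Jenkins.setSlaveAgentPort should tear down and recreate the TCP listener when the port changes:

import jenkins.model.Jenkins;

Jenkins jenkins = Jenkins.get();
jenkins.setSlaveAgentPort(50001);  // move to a temporary port; the old listener is shut down
jenkins.setSlaveAgentPort(50000);  // move back; a fresh TcpSlaveAgentListener is started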
is duplicated by: JENKINS-70161 Blocked JNLP port (Closed)
is related to: JENKINS-70334 When TcpSlaveAgentListener dies it is not restarted (Reopened)
[JENKINS-59910] Java 11 agent disconnection: UnsupportedOperationException from ProtocolStack$Ptr.isSendOpen
No, it DID occur and still occurs sometimes; we are on Jenkins 2.204.5 now.
jglick could you please give me some hints on how to debug this further to find the root cause?
Sorry, I am not familiar with ProtocolStack really. You can try WebSocket mode to see if it behaves any better.
Thanks! I'll give it a shot.
For future readers of this ticket: Jesse referred to WebSocket mode, introduced in 2.217: https://www.jenkins.io/blog/2020/02/02/web-socket/
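For anyone wanting to try it: WebSocket transport can be enabled per agent via the "Use WebSocket" option of the inbound agent launcher, and recent versions of agent.jar also accept a -webSocket flag when launched by hand. The command below is only an illustration (hostname, agent name, secret, and paths are placeholders, and the exact flags depend on your Remoting version):

java -jar agent.jar -url https://jenkins.example.com/ -name my-agent -secret <secret> -workDir /home/jenkins/agent -webSocket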
I've also seen this in version 2.263.1
Stack trace
Jan 23 01:06:04 jenkins tomcat9[15234]: Uncaught exception in TcpSlaveAgentListener ConnectionHandler Thread[TCP agent connection handler #5925 with /10.0.210.153:58316,5,main]
Jan 23 01:06:04 jenkins tomcat9[15234]: java.lang.UnsupportedOperationException: Network layer is not supposed to call isSendOpen
Jan 23 01:06:04 jenkins tomcat9[15234]: at org.jenkinsci.remoting.protocol.ProtocolStack$Ptr.isSendOpen(ProtocolStack.java:730)
Jan 23 01:06:04 jenkins tomcat9[15234]: at org.jenkinsci.remoting.protocol.FilterLayer.isSendOpen(FilterLayer.java:342)
Jan 23 01:06:04 jenkins tomcat9[15234]: at org.jenkinsci.remoting.protocol.ProtocolStack$Ptr.isSendOpen(ProtocolStack.java:738)
Jan 23 01:06:04 jenkins tomcat9[15234]: at org.jenkinsci.remoting.protocol.FilterLayer.isSendOpen(FilterLayer.java:342)
Jan 23 01:06:04 jenkins tomcat9[15234]: at org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer.isSendOpen(SSLEngineFilterLayer.java:237)
Jan 23 01:06:04 jenkins tomcat9[15234]: at org.jenkinsci.remoting.protocol.ProtocolStack$Ptr.isSendOpen(ProtocolStack.java:738)
Jan 23 01:06:04 jenkins tomcat9[15234]: at org.jenkinsci.remoting.protocol.FilterLayer.isSendOpen(FilterLayer.java:342)
Jan 23 01:06:04 jenkins tomcat9[15234]: at org.jenkinsci.remoting.protocol.impl.ConnectionHeadersFilterLayer.isSendOpen(ConnectionHeadersFilterLayer.java:514)
Jan 23 01:06:04 jenkins tomcat9[15234]: at org.jenkinsci.remoting.protocol.ProtocolStack$Ptr.doSend(ProtocolStack.java:690)
Jan 23 01:06:04 jenkins tomcat9[15234]: at org.jenkinsci.remoting.protocol.ApplicationLayer.write(ApplicationLayer.java:156)
Jan 23 01:06:04 jenkins tomcat9[15234]: at org.jenkinsci.remoting.protocol.impl.ChannelApplicationLayer.start(ChannelApplicationLayer.java:230)
Jan 23 01:06:04 jenkins tomcat9[15234]: at org.jenkinsci.remoting.protocol.ProtocolStack.init(ProtocolStack.java:201)
Jan 23 01:06:04 jenkins tomcat9[15234]: at org.jenkinsci.remoting.protocol.ProtocolStack.access$700(ProtocolStack.java:106)
Jan 23 01:06:04 jenkins tomcat9[15234]: at org.jenkinsci.remoting.protocol.ProtocolStack$Builder.build(ProtocolStack.java:554)
Jan 23 01:06:04 jenkins tomcat9[15234]: at org.jenkinsci.remoting.engine.JnlpProtocol4Handler.handle(JnlpProtocol4Handler.java:155)
Jan 23 01:06:04 jenkins tomcat9[15234]: at jenkins.slaves.JnlpSlaveAgentProtocol4.handle(JnlpSlaveAgentProtocol4.java:195)
Jan 23 01:06:04 jenkins tomcat9[15234]: at hudson.TcpSlaveAgentListener$ConnectionHandler.run(TcpSlaveAgentListener.java:284)
Seeing same thing on 2.289.1
Interesting that TcpSlaveAgentListenerRescheduler doesn't seem to be doing its job.
Looks like doAperiodicRun is never called.
Seeing this with 2.263.4
Happened 5 times over the last 2 weeks.
Jenkins' TCP port seems to be bound to an IPv6 address, if that matters. When the crash occurs, the socket is no longer open:
~$ sudo netstat -tulpn | grep LISTEN | grep java
tcp6 0 0 :::xxxx :::* LISTEN 14373/java # HTTPS
tcp6 0 0 :::xxxx :::* LISTEN 14373/java # HTTP
tcp6 0 0 :::xxxx :::* LISTEN 14373/java # This port is no longer listening when the above crash is reproduced
tcp6 0 0 :::xxxx :::* LISTEN 14373/java # ?
Temporary resolution: change TCP port for inbound agents to some other value, then change back.
ngg1 tdaniely timor_raiman could you please share more details about your Jenkins setup? Let's find out what we have in common; that will narrow down the list of candidates to blame.
We never actually tried a WebSocket connection for our inbound agents - at least for this bug we have an alert and can switch the port after the failure occurs.
I don't think we've seen this issue happening since I reported it in January.
Our server is running on a Debian 10 machine with Tomcat 9.
All our agents (Linux, Windows, macOS) are connecting to the server using the downloaded agent.jar file via Inbound TCP Agent Protocol/4 (TLS encryption)
What other info do you need?
Well, it differs a lot from our configuration. What Java version do you use?
We run Jenkins 2.289.1 with JDK 11
openjdk version "11.0.11" 2021-04-20 OpenJDK Runtime Environment AdoptOpenJDK-11.0.11+9 (build 11.0.11+9) OpenJDK 64-Bit Server VM AdoptOpenJDK-11.0.11+9 (build 11.0.11+9, mixed mode)
It is official LTS Docker image - jenkins/jenkins:2.289.1-jdk11
We use Inbound TCP Agent Protocol/4 (TLS encryption) too.
We use the SSH Slaves plugin to connect macOS nodes, a downloaded JNLP jar on the Windows node, and the Kubernetes plugin to provision Linux agents.
The Kubernetes plugin uses our own Ubuntu 18 images to run the JNLP container. We install https://github.com/jenkinsci/docker-inbound-agent/blob/3.35-5/jenkins-agent and https://repo.jenkins-ci.org/public/org/jenkins-ci/main/remoting/3.36/remoting-3.36.jar, and use the default-jre package, which is
openjdk version "11.0.11" 2021-04-20 OpenJDK Runtime Environment (build 11.0.11+9-Ubuntu-0ubuntu2.18.04) OpenJDK 64-Bit Server VM (build 11.0.11+9-Ubuntu-0ubuntu2.18.04, mixed mode, sharing)
Maybe we should just try to upgrade our remoting jars.
The very same thing just happened here with the latest LTS 2.303.1 (for the first time, I have to say). Anything we can do to help track down the issue?
The same problem occurred again, twice in two days now, right after we updated from a Java 8-based Jenkins to Java 11. This is a strong indication that it has something to do with Java 11, since we never had this issue before.
sithmein hmm, I don't remember exactly, but there's a chance it started when we switched to the JDK 11 official Docker images.
We've switched to WebSocket connection recently, but I want to check if the issue will persist with JDK 8 Docker images. I'll post our observations here.
Well, we've switched the Docker image for our production Jenkins instance back to JDK 8. The issue hasn't appeared so far.
After switching back to Java 8 two weeks ago the problem did not occur any more. This is a very strong indication that something is wrong with the Java 11 version.
Is anybody looking into this? Jenkins now shows a warning in the UI when you are still using the Java 8-based version. Not sure when Java 8 support will be dropped but unless this bug is fixed you simply cannot switch to Java 11.
Anything we can do to help track down the issue?
Ideally, find a way to reproduce the error in a minimal self-contained environment.
Or someone may be able to figure out what is wrong just by thinking about the stack trace. I doubt the original author of the JNLP4-connect transport is around, though, and it is far more complex than the newer WebSocket transport.
I don't think it's the JDK bug. The problem first occurred for us a few months ago, long after that bug was supposedly already fixed (in 11.0.2).
Reliably reproducing the issue is really difficult because it only happens randomly. We are currently looking at WebSocket communication. If it works reliably and performs well, it would be a viable alternative.
> I don't think it's the JDK bug. The problem first occurred for us a few months ago, long after that bug was supposedly already fixed (in 11.0.2).
I should have clarified that I suspected the problem in Jenkins was introduced by the fix for that Java bug, which apparently changed the behavior of some methods involved in TLS handling. In other words that we would see the problem on all sufficiently new Java versions, but not Java 8. On the other hand there is mention there of the fix having been backported to 8, which if true would invalidate that hypothesis (depending on whether it was a full or partial backport).
Not sure if we are hitting the exact same issue or something different with similar behavior.
Some background first. We have a lot of Jenkins instances (regularly updated, so this is not a single-version issue), currently on 2.333, running with Java 8.
We have recurring events of getting the following log:
SEVERE hudson.TcpSlaveAgentListener lambda$new$0 Uncaught exception in TcpSlaveAgentListener Thread[TCP agent listener port=50000,5,main], attempting to reschedule thread
java.lang.OutOfMemoryError: unable to create new native thread
    at java.lang.Thread.start0(Native Method)
    at java.lang.Thread.start(Thread.java:717)
    at hudson.TcpSlaveAgentListener.run(TcpSlaveAgentListener.java:194)
After that, the listener thread is never restarted, unless we apply the workaround from the bug description.
strimpak your issue sounds unrelated, probably an environment issue (lack of RAM in the agent computer).
Thanks jglick. I would expect, though, that Jenkins would retry creating the listener thread, but it doesn't. I will create a new bug if this sounds unrelated. I correlated the two issues because of the workaround.
You also mentioned that this could be a lack of RAM on the agent computer, but I don't think so. The OutOfMemoryError seems to occur on the controller.
Once you get an OutOfMemoryError, any further behavior is undefined—everything should be assumed to be broken.
> The OutOfMemoryError seems to occur on the controller.
Inadequate RAM on the controller then.
I cannot tell yet. We switched to Websockets recently but still with the Java 8-based controller. This works nicely so far. Next week we change the controller to Java 11, after that I can report here.
The same happens in one of our build clusters once a week or so. No OutOfMemoryError there. We have increased the log level for TcpSlaveAgentListener and will see if we can get any context. If there's anything else we could do for debugging this, please advise.
Java 11 with WebSocket communication has been working without problems for about a week now.
> Could you try the web socket mode please?
Sorry, I didn't mention it. We did try websocket mode already, but we also got an error under heavy load. We didn't pursue this further, yet. I also think that this heavily depends on the environment (Jenkins version/remoting version, kubernetes-plugin version, direct/non-direct connect). FWIW, this happened with Jenkins 2.264.3.
capf can you provide more details on the error that you saw when running in websocket mode?
Unfortunately not, I'm only relaying this from a colleague who didn't copy any logs that might be helpful. They're now using 2.264.3 patched with remoting 4.13 to benefit from JENKINS-66915 (https://github.com/jenkinsci/remoting/commit/31096eff5efbe4390cc6f4b070ee337a99e0110e).
AFAIK there haven't been any JNLP issues since then. We can try WebSocket mode again after updating to a new Jenkins version.
We are also getting the same error (we are on core version 2.303.3). Coincidentally, we have been getting this error since we upgraded the Kubernetes plugin from 1.24.1 to 1.30.11.
Can you please provide additional info on the "patch remoting to 4.13" approach? How do we apply it, and is there any document to follow? And/or would moving from 'jnlp-slave' to 'inbound-agent' be helpful here?
akmjenkins In the meantime, we experienced this problem also with remoting 4.13, so I cannot recommend this as a fix. We will try Jenkins 2.346.2 with remoting 4.13.2, latest kubernetes-plugin and websockets next.
We're seeing this same issue on jenkins 2.346.2, so I don't suspect upgrading that will help.
> doubt the original author of the JNLP4-connect transport is around, though, and it is far more complex than the newer WebSocket transport.
I was heavily involved so have context here. Once the LTS is prepared I will try and block some time out to look at this.
If anyone can reproduce this easily, adding FINEST logging for `org.jenkinsci.remoting.protocol` would help (this may well be a large amount of logging!).
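On the controller that logger can be added under Manage Jenkins » System Log; on the agent side one possible approach (a sketch, assuming a plain java.util.logging setup and a hypothetical file name) is to point the agent JVM at a logging properties file:

# remoting-finest.properties (hypothetical name)
handlers=java.util.logging.ConsoleHandler
java.util.logging.ConsoleHandler.level=FINEST
org.jenkinsci.remoting.protocol.level=FINEST

java -Djava.util.logging.config.file=remoting-finest.properties -jar agent.jar ...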
Alternatively, if this is due to TLSv1.3, then disabling 1.3 (so that 1.2 is used instead) can be achieved by adding `TLSv1.3` (temporarily!) to `jdk.tls.disabledAlgorithms` in `$JRE/conf/security/java.security`.
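Concretely that means appending TLSv1.3 to whatever list is already present in that file. The pre-existing list differs between JDK builds, so the following is only a sketch:

# $JRE/conf/security/java.security (temporary diagnostic change)
# Appending TLSv1.3 means only TLSv1.2 and below can be negotiated.
jdk.tls.disabledAlgorithms=SSLv3, TLSv1, TLSv1.1, TLSv1.3, RC4, DES, MD5withRSA, \
    DH keySize < 1024, EC keySize < 224, 3DES_EDE_CBC, anon, NULL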
Notes to self.
- happens on Ubuntu and Red Hat, so it is unrelated to any Red Hat-specific TLS patches and environment (yup, they do some fun stuff)
After moving from Java 8 to Java 11, this problem occurred on 2.346.3 with remoting 4.13.3.
No FINEST logs unfortunately; this was the first occurrence. Trace:
java.lang.UnsupportedOperationException: Network layer is not supposed to call isSendOpen
    at org.jenkinsci.remoting.protocol.ProtocolStack$Ptr.isSendOpen(ProtocolStack.java:739)
    at org.jenkinsci.remoting.protocol.FilterLayer.isSendOpen(FilterLayer.java:343)
    at org.jenkinsci.remoting.protocol.ProtocolStack$Ptr.isSendOpen(ProtocolStack.java:747)
    at org.jenkinsci.remoting.protocol.FilterLayer.isSendOpen(FilterLayer.java:343)
    at org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer.isSendOpen(SSLEngineFilterLayer.java:233)
    at org.jenkinsci.remoting.protocol.ProtocolStack$Ptr.isSendOpen(ProtocolStack.java:747)
    at org.jenkinsci.remoting.protocol.FilterLayer.isSendOpen(FilterLayer.java:343)
    at org.jenkinsci.remoting.protocol.impl.ConnectionHeadersFilterLayer.isSendOpen(ConnectionHeadersFilterLayer.java:516)
    at org.jenkinsci.remoting.protocol.ProtocolStack$Ptr.doSend(ProtocolStack.java:699)
    at org.jenkinsci.remoting.protocol.ApplicationLayer.write(ApplicationLayer.java:156)
    at org.jenkinsci.remoting.protocol.impl.ChannelApplicationLayer.start(ChannelApplicationLayer.java:258)
    at org.jenkinsci.remoting.protocol.ProtocolStack.init(ProtocolStack.java:209)
    at org.jenkinsci.remoting.protocol.ProtocolStack.access$700(ProtocolStack.java:108)
    at org.jenkinsci.remoting.protocol.ProtocolStack$Builder.build(ProtocolStack.java:563)
    at org.jenkinsci.remoting.engine.JnlpProtocol4Handler.handle(JnlpProtocol4Handler.java:155)
    at jenkins.slaves.JnlpSlaveAgentProtocol4.handle(JnlpSlaveAgentProtocol4.java:196)
    at hudson.TcpSlaveAgentListener$ConnectionHandler.run(TcpSlaveAgentListener.java:281)
We have upgraded to 2.361.1 and switched from JNLP to websockets on 2022-09-26 (i.e. almost 3 weeks ago) and haven't had a problem, since. Even though https://issues.jenkins.io/browse/JENKINS-69543 popped up in the meantime and looks scary (and we will update ASAP), we're happy to have switched.
teilo I have created the logger as you described. When the issue appears again I will post the log here.
For people impacted here, a workaround - if this is https://bugs.openjdk.java.net/browse/JDK-8207009 - could be to add the system property jdk.tls.acknowledgeCloseNotify=true to switch to duplex close policy as per https://docs.oracle.com/en/java/javase/11/security/java-secure-socket-extension-jsse-reference-guide.html#GUID-901C61EF-D347-4912-8ADF-0BC4ECD598D0. This could help narrow this down to this specific JDK fix and TLS1.3.
As per my understanding, this should be added to both the Jenkins controller JVM and the agent JVM. Not 100% sure if this must be done in the $JRE/conf/security/java.security or if this can be simply passed to the starting command with -D.
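For what it's worth, the JSSE reference guide linked above documents jdk.tls.acknowledgeCloseNotify as a system property, so passing it with -D should be possible; a sketch of what that could look like for both JVMs (paths and remaining arguments are illustrative only):

# Controller JVM
java -Djdk.tls.acknowledgeCloseNotify=true -jar jenkins.war

# Agent JVM
java -Djdk.tls.acknowledgeCloseNotify=true -jar agent.jar -jnlpUrl https://jenkins.example.com/computer/my-agent/slave-agent.jnlp -secret <secret> -workDir /home/jenkins/agent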
This bug can occasionally be observed when running jenkins.slaves.restarter.JnlpSlaveRestarterInstallerTest#tcpReconnection on a very slow machine, as in this test run:
0.266 [id=147] INFO o.jvnet.hudson.test.JenkinsRule#createWebServer: Running on http://localhost:56050/jenkins/ 0.392 [id=160] INFO jenkins.InitReactorRunner$1#onAttained: Started initialization 0.405 [id=159] INFO jenkins.InitReactorRunner$1#onAttained: Listed all plugins 0.408 [id=159] INFO j.b.api.BouncyCastlePlugin#start: C:\Jenkins\workspace\Core_jenkins_PR-7561\test\target\j h3202844659674388371\plugins\bouncycastle-api\WEB-INF\optional-lib not found; for non RealJenkinsRule this is fine and can be ignored. 1.513 [id=164] INFO jenkins.InitReactorRunner$1#onAttained: Prepared all plugins 1.530 [id=164] INFO jenkins.InitReactorRunner$1#onAttained: Started all plugins 1.538 [id=166] INFO jenkins.InitReactorRunner$1#onAttained: Augmented all extensions 3.013 [id=166] INFO jenkins.InitReactorRunner$1#onAttained: System config loaded 3.013 [id=166] INFO jenkins.InitReactorRunner$1#onAttained: System config adapted 3.014 [id=166] INFO jenkins.InitReactorRunner$1#onAttained: Loaded all jobs 3.017 [id=166] INFO jenkins.InitReactorRunner$1#onAttained: Configuration for all jobs updated 3.182 [id=162] INFO jenkins.InitReactorRunner$1#onAttained: Completed initialization 3.257 [id=90] FINE j.s.r.JnlpSlaveRestarterInstaller$Install#install: Effective SlaveRestarter on : null Running: [C:\tools\jdk-17\bin\java.exe, -Xmx512m, -XX:+PrintCommandLineFlags, -Djava.awt.headless=true, -jar, C:\Jenkins\workspace\Core_jenkins_PR-7561\test\target\j h3202844659674388371\agent.jar, -jnlpUrl, http://localhost:56050/jenkins/computer/remote/slave-agent.jnlp, -secret, d84160ae18655d915885c04fc316b6b1b4cce2c45427c9107e471f670bab751b, -workDir, C:\Jenkins\workspace\Core_jenkins_PR-7561\test\target\j h3202844659674388371\agent-work-dirs\remote] [remote] -XX:ConcGCThreads=1 -XX:G1ConcRefinementThreads=4 -XX:GCDrainStackTargetSize=64 -XX:InitialHeapSize=134200512 -XX:MarkStackSize=4194304 -XX:MaxHeapSize=536870912 -XX:MinHeapSize=6815736 -XX:+PrintCommandLineFlags -XX:ReservedCodeCacheSize=251658240 -XX:+SegmentedCodeCache -XX:+UseCompressedClassPointers -XX:+UseCompressedOops -XX:+UseG1GC -XX:-UseLargePagesIndividualAllocation [remote] Jan 04, 2023 1:57:29 AM org.jenkinsci.remoting.engine.WorkDirManager initializeWorkDir [remote] INFO: Using C:\Jenkins\workspace\Core_jenkins_PR-7561\test\target\j h3202844659674388371\agent-work-dirs\remote\remoting as a remoting work directory [remote] Jan 04, 2023 1:57:30 AM org.jenkinsci.remoting.engine.WorkDirManager setupLogging [remote] INFO: Both error and output logs will be printed to C:\Jenkins\workspace\Core_jenkins_PR-7561\test\target\j h3202844659674388371\agent-work-dirs\remote\remoting [remote] Jan 04, 2023 1:57:35 AM hudson.remoting.jnlp.Main createEngine [remote] INFO: Setting up agent: remote [remote] Jan 04, 2023 1:57:35 AM hudson.remoting.Engine startEngine [remote] INFO: Using Remoting version: 3085.vc4c6977c075a [remote] Jan 04, 2023 1:57:35 AM org.jenkinsci.remoting.engine.WorkDirManager initializeWorkDir [remote] INFO: Using C:\Jenkins\workspace\Core_jenkins_PR-7561\test\target\j h3202844659674388371\agent-work-dirs\remote\remoting as a remoting work directory [remote] Jan 04, 2023 1:57:36 AM hudson.remoting.jnlp.Main$CuiListener status [remote] INFO: Locating server among [http://localhost:56050/jenkins/] [remote] Jan 04, 2023 1:57:36 AM org.jenkinsci.remoting.engine.JnlpAgentEndpointResolver resolve [remote] INFO: Remoting server accepts the following protocols: [JNLP4-connect, Ping] [remote] Jan 04, 2023 1:57:36 AM hudson.remoting.jnlp.Main$CuiListener 
status [remote] INFO: Agent discovery successful [remote] Agent address: localhost [remote] Agent port: 56051 [remote] Identity: 2d:b4:01:2f:d0:13:bc:19:f6:c9:fc:38:6e:f9:4a:41 [remote] Jan 04, 2023 1:57:36 AM hudson.remoting.jnlp.Main$CuiListener status [remote] INFO: Handshaking [remote] Jan 04, 2023 1:57:36 AM hudson.remoting.jnlp.Main$CuiListener status [remote] INFO: Connecting to localhost:56051 [remote] Jan 04, 2023 1:57:36 AM hudson.remoting.jnlp.Main$CuiListener status [remote] INFO: Trying protocol: JNLP4-connect 11.018 [id=192] INFO h.TcpSlaveAgentListener$ConnectionHandler#run: Connection #3 from /127.0.0.1:56055 failed: null [remote] Jan 04, 2023 1:57:37 AM org.jenkinsci.remoting.protocol.impl.BIONetworkLayer$Reader run [remote] INFO: Waiting for ProtocolStack to start. 11.429 [id=193] INFO h.TcpSlaveAgentListener$ConnectionHandler#run: Accepted JNLP4-connect connection #4 from /127.0.0.1:56056 [remote] Jan 04, 2023 1:57:38 AM hudson.remoting.jnlp.Main$CuiListener status [remote] INFO: Remote identity confirmed: 2d:b4:01:2f:d0:13:bc:19:f6:c9:fc:38:6e:f9:4a:41 [remote] Jan 04, 2023 1:57:39 AM hudson.remoting.jnlp.Main$CuiListener status [remote] INFO: Connected 19.837 [id=84] FINE j.s.r.JnlpSlaveRestarterInstaller$Install#install: Effective SlaveRestarter on remote: [] [remote] Jan 04, 2023 1:57:45 AM hudson.remoting.jnlp.Main$CuiListener status [remote] INFO: Terminated 19.870 [id=84] INFO j.s.DefaultJnlpSlaveReceiver#channelClosed: IOHub#3: Worker[channel:java.nio.channels.SocketChannel[connected local=/127.0.0.1:56051 remote=127.0.0.1/127.0.0.1:56056]] / Computer.threadPoolForRemoting [#5] for remote terminated: java.nio.channels.ClosedChannelException 19.920 [id=147] INFO hudson.lifecycle.Lifecycle#onStatusUpdate: Stopping Jenkins 20.019 [id=147] INFO hudson.lifecycle.Lifecycle#onStatusUpdate: Jenkins stopped 0.047 [id=201] INFO o.jvnet.hudson.test.JenkinsRule#createWebServer: Running on http://localhost:56050/jenkins/ 0.176 [id=214] INFO jenkins.InitReactorRunner$1#onAttained: Started initialization 0.179 [id=214] INFO jenkins.InitReactorRunner$1#onAttained: Listed all plugins 0.183 [id=214] INFO j.b.api.BouncyCastlePlugin#start: C:\Jenkins\workspace\Core_jenkins_PR-7561\test\target\j h3202844659674388371\plugins\bouncycastle-api\WEB-INF\optional-lib not found; for non RealJenkinsRule this is fine and can be ignored. 0.914 [id=220] INFO jenkins.InitReactorRunner$1#onAttained: Prepared all plugins 0.923 [id=220] INFO jenkins.InitReactorRunner$1#onAttained: Started all plugins 0.935 [id=216] INFO jenkins.InitReactorRunner$1#onAttained: Augmented all extensions 1.173 [id=216] INFO jenkins.InitReactorRunner$1#onAttained: System config loaded 1.176 [id=219] INFO jenkins.InitReactorRunner$1#onAttained: System config adapted 1.178 [id=219] INFO jenkins.InitReactorRunner$1#onAttained: Loaded all jobs 1.182 [id=219] INFO jenkins.InitReactorRunner$1#onAttained: Configuration for all jobs updated 1.322 [id=217] INFO jenkins.InitReactorRunner$1#onAttained: Completed initialization 1.368 [id=87] FINE j.s.r.JnlpSlaveRestarterInstaller$Install#install: Effective SlaveRestarter on : null [remote] Jan 04, 2023 1:57:57 AM hudson.remoting.jnlp.Main$CuiListener status [remote] INFO: Performing onReconnect operation. [remote] Jan 04, 2023 1:57:57 AM hudson.remoting.jnlp.Main$CuiListener status [remote] INFO: onReconnect operation completed. 
[remote] Jan 04, 2023 1:57:57 AM hudson.remoting.jnlp.Main$CuiListener status [remote] INFO: Locating server among [http://localhost:56050/jenkins/] [remote] Jan 04, 2023 1:57:57 AM org.jenkinsci.remoting.engine.JnlpAgentEndpointResolver resolve [remote] INFO: Remoting server accepts the following protocols: [JNLP4-connect, Ping] [remote] Jan 04, 2023 1:57:57 AM hudson.remoting.jnlp.Main$CuiListener status [remote] INFO: Agent discovery successful [remote] Agent address: localhost [remote] Agent port: 56058 [remote] Identity: 2d:b4:01:2f:d0:13:bc:19:f6:c9:fc:38:6e:f9:4a:41 [remote] Jan 04, 2023 1:57:57 AM hudson.remoting.jnlp.Main$CuiListener status [remote] INFO: Handshaking [remote] Jan 04, 2023 1:57:57 AM hudson.remoting.jnlp.Main$CuiListener status [remote] INFO: Connecting to localhost:56058 [remote] Jan 04, 2023 1:57:57 AM hudson.remoting.jnlp.Main$CuiListener status [remote] INFO: Trying protocol: JNLP4-connect [remote] Jan 04, 2023 1:57:57 AM org.jenkinsci.remoting.protocol.impl.BIONetworkLayer$Reader run [remote] INFO: Waiting for ProtocolStack to start. 11.417 [id=244] INFO h.TcpSlaveAgentListener$ConnectionHandler#run: Accepted JNLP4-connect connection #7 from /127.0.0.1:56062 11.417 [id=243] INFO h.TcpSlaveAgentListener$ConnectionHandler#run: Connection #6 from /127.0.0.1:56061 failed: null 11.543 [id=244] SEVERE h.TcpSlaveAgentListener$ConnectionHandler#lambda$new$0: Uncaught exception in TcpSlaveAgentListener ConnectionHandler Thread[TCP agent connection handler #7 with /127.0.0.1:56062,5,main] java.lang.UnsupportedOperationException: Network layer is not supposed to call isSendOpen at org.jenkinsci.remoting.protocol.ProtocolStack$Ptr.isSendOpen(ProtocolStack.java:739) at org.jenkinsci.remoting.protocol.FilterLayer.isSendOpen(FilterLayer.java:343) at org.jenkinsci.remoting.protocol.ProtocolStack$Ptr.isSendOpen(ProtocolStack.java:747) at org.jenkinsci.remoting.protocol.FilterLayer.isSendOpen(FilterLayer.java:343) at org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer.isSendOpen(SSLEngineFilterLayer.java:233) at org.jenkinsci.remoting.protocol.ProtocolStack$Ptr.isSendOpen(ProtocolStack.java:747) at org.jenkinsci.remoting.protocol.FilterLayer.isSendOpen(FilterLayer.java:343) at org.jenkinsci.remoting.protocol.impl.ConnectionHeadersFilterLayer.isSendOpen(ConnectionHeadersFilterLayer.java:516) at org.jenkinsci.remoting.protocol.ProtocolStack$Ptr.doSend(ProtocolStack.java:699) at org.jenkinsci.remoting.protocol.ApplicationLayer.write(ApplicationLayer.java:156) at org.jenkinsci.remoting.protocol.impl.ChannelApplicationLayer.start(ChannelApplicationLayer.java:259) at org.jenkinsci.remoting.protocol.ProtocolStack.init(ProtocolStack.java:209) at org.jenkinsci.remoting.protocol.ProtocolStack$Builder.build(ProtocolStack.java:563) at org.jenkinsci.remoting.engine.JnlpProtocol4Handler.handle(JnlpProtocol4Handler.java:156) at jenkins.slaves.JnlpSlaveAgentProtocol4.handle(JnlpSlaveAgentProtocol4.java:177) at hudson.TcpSlaveAgentListener$ConnectionHandler.run(TcpSlaveAgentListener.java:282) 11.554 [id=244] WARNING hudson.TcpSlaveAgentListener$1#run: Connection handler failed, restarting listener java.lang.UnsupportedOperationException: Network layer is not supposed to call isSendOpen at org.jenkinsci.remoting.protocol.ProtocolStack$Ptr.isSendOpen(ProtocolStack.java:739) at org.jenkinsci.remoting.protocol.FilterLayer.isSendOpen(FilterLayer.java:343) at org.jenkinsci.remoting.protocol.ProtocolStack$Ptr.isSendOpen(ProtocolStack.java:747) at 
org.jenkinsci.remoting.protocol.FilterLayer.isSendOpen(FilterLayer.java:343) at org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer.isSendOpen(SSLEngineFilterLayer.java:233) at org.jenkinsci.remoting.protocol.ProtocolStack$Ptr.isSendOpen(ProtocolStack.java:747) at org.jenkinsci.remoting.protocol.FilterLayer.isSendOpen(FilterLayer.java:343) at org.jenkinsci.remoting.protocol.impl.ConnectionHeadersFilterLayer.isSendOpen(ConnectionHeadersFilterLayer.java:516) at org.jenkinsci.remoting.protocol.ProtocolStack$Ptr.doSend(ProtocolStack.java:699) at org.jenkinsci.remoting.protocol.ApplicationLayer.write(ApplicationLayer.java:156) at org.jenkinsci.remoting.protocol.impl.ChannelApplicationLayer.start(ChannelApplicationLayer.java:259) at org.jenkinsci.remoting.protocol.ProtocolStack.init(ProtocolStack.java:209) at org.jenkinsci.remoting.protocol.ProtocolStack$Builder.build(ProtocolStack.java:563) at org.jenkinsci.remoting.engine.JnlpProtocol4Handler.handle(JnlpProtocol4Handler.java:156) at jenkins.slaves.JnlpSlaveAgentProtocol4.handle(JnlpSlaveAgentProtocol4.java:177) at hudson.TcpSlaveAgentListener$ConnectionHandler.run(TcpSlaveAgentListener.java:282) [remote] Jan 04, 2023 1:57:57 AM hudson.remoting.jnlp.Main$CuiListener status [remote] INFO: Protocol JNLP4-connect encountered an unexpected exception [remote] java.util.concurrent.ExecutionException: java.nio.channels.ClosedChannelException [remote] at org.jenkinsci.remoting.util.SettableFuture.get(SettableFuture.java:223) [remote] at hudson.remoting.Engine.innerRun(Engine.java:805) [remote] at hudson.remoting.Engine.run(Engine.java:543) [remote] Caused by: java.nio.channels.ClosedChannelException [remote] at org.jenkinsci.remoting.protocol.NetworkLayer.onRecvClosed(NetworkLayer.java:155) [remote] at org.jenkinsci.remoting.protocol.impl.BIONetworkLayer.access$700(BIONetworkLayer.java:51) [remote] at org.jenkinsci.remoting.protocol.impl.BIONetworkLayer$Reader.run(BIONetworkLayer.java:257) [remote] at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) [remote] at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [remote] at hudson.remoting.Engine$1.lambda$newThread$0(Engine.java:125) [remote] at java.base/java.lang.Thread.run(Thread.java:833) [remote] [remote] Jan 04, 2023 1:57:57 AM hudson.remoting.jnlp.Main$CuiListener status [remote] INFO: reconnect rejected, sleeping 10s: [remote] java.lang.Exception: The server rejected the connection: None of the protocols were accepted [remote] at hudson.remoting.Engine.onConnectionRejected(Engine.java:884) [remote] at hudson.remoting.Engine.innerRun(Engine.java:831) [remote] at hudson.remoting.Engine.run(Engine.java:543) [remote] [remote] Jan 04, 2023 1:58:07 AM hudson.remoting.jnlp.Main$CuiListener status [remote] INFO: Locating server among [http://localhost:56050/jenkins/] [remote] Jan 04, 2023 1:58:07 AM org.jenkinsci.remoting.engine.JnlpAgentEndpointResolver resolve [remote] INFO: Remoting server accepts the following protocols: [JNLP4-connect, Ping] [remote] Jan 04, 2023 1:58:07 AM org.jenkinsci.remoting.engine.JnlpAgentEndpointResolver isPortVisible [remote] WARNING: Connection refused: no further information [remote] Jan 04, 2023 1:58:07 AM hudson.remoting.jnlp.Main$CuiListener error [remote] SEVERE: http://localhost:56050/jenkins/ provided port:56058 is not reachable on host localhost [remote] java.io.IOException: http://localhost:56050/jenkins/ provided port:56058 is not reachable on host localhost [remote] at 
org.jenkinsci.remoting.engine.JnlpAgentEndpointResolver.resolve(JnlpAgentEndpointResolver.java:303) [remote] at hudson.remoting.Engine.innerRun(Engine.java:751) [remote] at hudson.remoting.Engine.run(Engine.java:543) [remote] 179.955 [id=1] WARNING o.j.hudson.test.JenkinsRule$2#evaluate: Test timed out (after 180 seconds). 180.057 [id=201] INFO hudson.lifecycle.Lifecycle#onStatusUpdate: Stopping Jenkins
Note that the exception "Network layer is not supposed to call isSendOpen" emanates from FilterLayer, not NetworkLayer:
java.lang.UnsupportedOperationException: Network layer is not supposed to call isSendOpen
    at org.jenkinsci.remoting.protocol.ProtocolStack$Ptr.isSendOpen(ProtocolStack.java:739)
    at org.jenkinsci.remoting.protocol.FilterLayer.isSendOpen(FilterLayer.java:343)
    at org.jenkinsci.remoting.protocol.ProtocolStack$Ptr.isSendOpen(ProtocolStack.java:747)
    at org.jenkinsci.remoting.protocol.FilterLayer.isSendOpen(FilterLayer.java:343)
    at org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer.isSendOpen(SSLEngineFilterLayer.java:233)
    at org.jenkinsci.remoting.protocol.ProtocolStack$Ptr.isSendOpen(ProtocolStack.java:747)
    at org.jenkinsci.remoting.protocol.FilterLayer.isSendOpen(FilterLayer.java:343)
    at org.jenkinsci.remoting.protocol.impl.ConnectionHeadersFilterLayer.isSendOpen(ConnectionHeadersFilterLayer.java:516)
    at org.jenkinsci.remoting.protocol.ProtocolStack$Ptr.doSend(ProtocolStack.java:699)
    at org.jenkinsci.remoting.protocol.ApplicationLayer.write(ApplicationLayer.java:156)
    at org.jenkinsci.remoting.protocol.impl.ChannelApplicationLayer.start(ChannelApplicationLayer.java:259)
    at org.jenkinsci.remoting.protocol.ProtocolStack.init(ProtocolStack.java:209)
    at org.jenkinsci.remoting.protocol.ProtocolStack$Builder.build(ProtocolStack.java:563)
    at org.jenkinsci.remoting.engine.JnlpProtocol4Handler.handle(JnlpProtocol4Handler.java:156)
    at jenkins.slaves.JnlpSlaveAgentProtocol4.handle(JnlpSlaveAgentProtocol4.java:177)
    at hudson.TcpSlaveAgentListener$ConnectionHandler.run(TcpSlaveAgentListener.java:282)
Consider the protocol stack used by Remoting (ChannelApplicationLayer → ConnectionHeadersFilterLayer → SSLEngineFilterLayer → AckFilterLayer → AgentProtocolClientFilterLayer → NetworkLayer) and the stack frames shown above. There are two or three frames in the stack trace for each logical layer, starting with ApplicationLayer.write, so by the time we are at the top of the stack trace in ProtocolStack.java:739 we must be in AckFilterLayer or AgentProtocolClientFilterLayer. Here getNextSend returned null, which is only ever supposed to happen for the head or tail of the doubly-linked list (i.e., either the NetworkLayer or the ApplicationLayer), hence the assumption in the error message "Network layer is not supposed to call isSendOpen" that we are in NetworkLayer. But it is clear from the stack trace that we are in AckFilterLayer or AgentProtocolClientFilterLayer. So how can getNextSend have returned null here? It is reading the nextSend pointer from the AckFilterLayer or AgentProtocolClientFilterLayer list node, which we know was set up correctly when the protocol stack was initialized. There is only one explanation for how it could be null by the time we reach this exception: we are reading from a garbage list node that has been unlinked from the list. The deletion logic clears the nextSend pointer after unlinking the list node, which should never be visible to consumers. Yet here we are: a consumer getting a garbage list node. https://github.com/jenkinsci/remoting/pull/615 notes two deficiencies in the locking implementation during node removal that might lead to this pathological outcome.
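To illustrate the failure mode being described (a self-contained toy, not the actual Remoting code): if list removal clears a node's forward pointer after unlinking it, a consumer that already holds a reference to that node, and that is not excluded by adequate locking, can observe the cleared pointer and wrongly conclude it is at the tail of the list:

// Toy illustration of the "garbage list node" hazard described above.
public class UnlinkRace {
    static final class Node {
        final String name;
        volatile Node next; // send direction: toward the network layer
        Node(String name) { this.name = name; }
    }

    public static void main(String[] args) throws InterruptedException {
        Node network = new Node("NetworkLayer");
        Node ack = new Node("AckFilterLayer");
        Node ssl = new Node("SSLEngineFilterLayer");
        ssl.next = ack;
        ack.next = network;

        Node observed = ssl.next; // a consumer grabs a reference to the Ack node

        // Meanwhile another thread removes the Ack node and, as part of removal,
        // clears its pointer (analogous to nextSend being nulled out).
        Thread remover = new Thread(() -> {
            ssl.next = network; // unlink the node
            ack.next = null;    // clear the pointers of the removed node
        });
        remover.start();
        remover.join();

        // The consumer now dereferences the stale node: next is null even though
        // the Ack node was never the tail of the list, which is exactly the
        // condition that trips the "Network layer is not supposed to call
        // isSendOpen" check.
        if (observed.next == null) {
            throw new UnsupportedOperationException(
                    observed.name + " unexpectedly looks like the tail of the list");
        }
    }
}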
After fixing the three bugs that I could see during code inspection I still got the same UnsupportedOperationException: Network layer is not supposed to call isSendOpen after 57 repetitions of jenkins.slaves.restarter.JnlpSlaveRestarterInstallerTest#tcpReconnection. I would recommend that anyone who is interested in working on this start running jenkins.slaves.restarter.JnlpSlaveRestarterInstallerTest#tcpReconnection in a loop and adding more logging to figure out what is going on.
JENKINS-70334 (released in 2.388, requested for backport to LTS) should at least ameliorate the problem by properly restarting the TCP agent listener in such situations.
The issue is tagged as JDK 11 although I think it can happen with JDK 8. I have seen this happening in environment with JDK 8u352 and Jenkins 2.346.x.
Support for TLS 1.3 was backported to JDK 8u261 along with the fix for https://bugs.openjdk.org/browse/JDK-8207009. So if this is the cause of the problem, it would also impact JDK 8u261+. TLS v1.3 support looks more and more like the right clue.
basil, can you give more details on how to reproduce using this test? I have set up a job that runs it in a loop with FINEST logging and rotation, but I am not able to reproduce the problem... I run this in a pod on a Unix node so I can adjust resources... Or maybe I need Windows nodes? I have attached repoPipeline.groovy as an example of what I do. The job can do many iterations without failing. It does fail sometimes, but when it does it's because the test times out, and I don't see the exception we are trying to catch here; also, the thread dump returned by the Jenkins rule still shows the TcpSlaveAgentListener thread... Maybe I just need to tune this a little.
I, too, have been unable to reproduce the issue locally. The only place I have seen the issue reproduce is on our Windows agents (which are known to have slower performance characteristics) as in jenkinsci/jenkins#7565 which could reproduce the problem after 57 iterations. These days a reproduction would probably also require reverting or disabling JENKINS-70334 so that the error can be observed in the form of a test timeout. FINEST logging could also be chasing the problem away, so it might be better to add targeted log statements that provide the data needed to confirm or refute a particular theory without having a drastic impact on timing or performance.
I see. Thanks Basil, I will keep trying to reproduce this.
In your analysis, the protocol stack in this context is ChannelApplicationLayer → ConnectionHeadersFilterLayer → SSLEngineFilterLayer → AckFilterLayer → NetworkLayer, because it is the handle stack. What the stack trace eventually does is go through the AckFilterLayer, which does not override isSendOpen, so the generic FilterLayer implementation is executed, and then we do reach the NetworkLayer. That last layer is identified by nextSend being null, per the initialization of the NetworkLayer pointer here. Per the stack trace and the protocol stack, it seems to me that we truly reach the NetworkLayer.
After further reading, I get it now. Yeah, it looks like under certain conditions (which we have not yet identified), the link downward to the NetworkLayer is lost.
Per my understanding, this might happen very early, since we are in org.jenkinsci.remoting.protocol.ProtocolStack.init. I wonder why the AckFilterLayer would have already been removed (completed) at that point.
Maybe that is the scenario that causes this: when the AckFilterLayer (or any filter layer) is removed before the end of ProtocolStack#start? Although per the stack trace, I do think that we still actually have the AckFilterLayer, since we see a FilterLayer after the SSLEngineFilterLayer:
# The following two frames originate from an object that does not override FilterLayer.isSendOpen(), most likely AckFilterLayer
at org.jenkinsci.remoting.protocol.ProtocolStack$Ptr.isSendOpen(ProtocolStack.java:730)
at org.jenkinsci.remoting.protocol.FilterLayer.isSendOpen(FilterLayer.java:340)
# The following two frames originate from super.isSendOpen() called from SSLEngineFilterLayer
at org.jenkinsci.remoting.protocol.ProtocolStack$Ptr.isSendOpen(ProtocolStack.java:738)
at org.jenkinsci.remoting.protocol.FilterLayer.isSendOpen(FilterLayer.java:340)
at org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer.isSendOpen(SSLEngineFilterLayer.java:237)
Maybe while we are in the AckFilterLayer checking ProtocolStack$Ptr.isSendOpen(), the layer is being removed and therefore nextSend becomes null. I am not sure if this is a possible scenario.
There used to be synchronization for removal, which was removed some time ago in https://github.com/jenkinsci/remoting/pull/289.
Could we maybe add some logging just before throwing the UnsupportedOperationException that would dump the current state of the stack pointers?
> Maybe that is the scenario that causes this: when the AckFilterLayer (or any filter layer) is removed before the end of ProtocolStack#start?
Try stepping through ProtocolStack#start in the normal (successful) case. It's a few hundred statements but it should give you a feel for what should be happening. I did that a few weeks ago and if I recall correctly it was normal for some of the layers to be removed during ProtocolStack#start.
There is definitely something wrong with list removal, as the pointers are being initialized in a valid state and they are in an illegal state by the time we hit the error. Both the original code and the fix in https://github.com/jenkinsci/remoting/pull/289 look incorrect to me. I did a visual inspection/audit of the list code, found several things that looked wrong to me, and fixed them all in https://github.com/jenkinsci/remoting/pull/615 — but it was not enough. Since visual inspection/audit failed, direct analysis is all that is left and I did not have the time to do that.
The heart of the problem is this: the pointer chain is becoming corrupt at some point, and by the time we hit the fatal exception we are observing a downstream symptom rather than the root cause. To make progress on the problem we need to find the smoking gun that is corrupting the pointer chain. This could be done by writing a list validation method that walks the whole doubly-linked list and validates that each previous and next pointer is as expected (i.e., as drawn in https://github.com/jenkinsci/remoting/blob/master/docs/protocols.md modulo any node removals). The latter may or may not be easier to build on top of https://github.com/jenkinsci/remoting/pull/615 which simplifies the linked list logic considerably. Once such a validation method is written, it could be called both before and after any write operations (i.e., anything that grabs stackLock.writeLock()) to catch the list corruption at the moment it occurs. From there it should be possible to reason about the root cause. In other words, the idea would be to run https://github.com/jenkinsci/jenkins/pull/7565 (modulo reverting or disabling JENKINS-70334) in a loop to get a feel for how many iterations it takes to hit the failure on our Windows agents (in my experience, about 57 iterations), then start running with the validation method which would hopefully trip in a similar number of iterations before we get to the UnsupportedOperationException (unless the cost of doing the validation perturbs the timing enough to chase away the failure).
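To make that idea concrete, here is a hedged sketch of what such a validation routine could look like, written against a generic node type rather than the real Ptr class (the real thing would have to run under stackLock and account for layers that were legitimately removed):

import java.util.function.Function;

// Sketch of a doubly-linked-list invariant checker: walk the chain from the
// head, verify that every forward pointer is matched by the corresponding
// backward pointer, and confirm that the declared tail is actually reached.
public final class ChainVerifier {
    public static <T> int verify(T head, T tail, Function<T, T> next, Function<T, T> prev) {
        T previous = null;
        T current = head;
        int size = 0;
        while (current != null) {
            if (prev.apply(current) != previous) {
                throw new IllegalStateException("Broken back-pointer at node #" + size);
            }
            previous = current;
            current = next.apply(current);
            if (++size > 1_000) { // a corrupted chain could also form a cycle
                throw new IllegalStateException("Cycle detected while walking the list");
            }
        }
        if (previous != tail) {
            throw new IllegalStateException("Walk did not end at the expected tail node");
        }
        return size; // e.g. log "Successfully verified list of size " + size
    }
}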
Hi basil and allan_burdajewicz, I've seen that JENKINS-70334 has not been included in yesterday's 2.353.3 LTS release. Is there some hesitance about backporting it to LTS because of this issue (59910) still being open?
Not that I know of. Questions about backporting should be directed to the Release Lead and/or Release Officer.
> Maybe that is the scenario that causes this: when the AckFilterLayer (or any filter layer) is removed before the end of ProtocolStack#start?
It's normal, as this logging demonstrates:
Feb 09, 2023 1:52:19 PM org.jenkinsci.remoting.engine.WorkDirManager initializeWorkDir INFO: Using /tmp/remote/remoting as a remoting work directory Feb 09, 2023 1:52:19 PM org.jenkinsci.remoting.engine.WorkDirManager setupLogging INFO: Both error and output logs will be printed to /tmp/remote/remoting Feb 09, 2023 1:52:19 PM hudson.remoting.jnlp.Main createEngine INFO: Setting up agent: test Feb 09, 2023 1:52:19 PM hudson.remoting.Engine startEngine INFO: Using Remoting version: 999999-SNAPSHOT (private-02/09/2023 21:51 GMT-basil) Feb 09, 2023 1:52:19 PM org.jenkinsci.remoting.engine.WorkDirManager initializeWorkDir INFO: Using /tmp/remote/remoting as a remoting work directory Feb 09, 2023 1:52:19 PM hudson.remoting.jnlp.Main$CuiListener status INFO: Locating server among [http://127.0.0.1/] Feb 09, 2023 1:52:19 PM org.jenkinsci.remoting.engine.JnlpAgentEndpointResolver resolve INFO: Remoting server accepts the following protocols: [JNLP4-connect, Ping] Feb 09, 2023 1:52:19 PM hudson.remoting.jnlp.Main$CuiListener status INFO: Agent discovery successful Agent address: 127.0.0.1 Agent port: 59100 Identity: 34:ee:ff:fd:a6:3e:ed:fc:aa:76:9d:4b:aa:d6:e1:a0 Feb 09, 2023 1:52:19 PM hudson.remoting.jnlp.Main$CuiListener status INFO: Handshaking Feb 09, 2023 1:52:19 PM hudson.remoting.jnlp.Main$CuiListener status INFO: Connecting to 127.0.0.1:59100 Feb 09, 2023 1:52:19 PM hudson.remoting.jnlp.Main$CuiListener status INFO: Trying protocol: JNLP4-connect Successfully verified list of size 6 org.jenkinsci.remoting.protocol.impl.BIONetworkLayer org.jenkinsci.remoting.protocol.impl.AgentProtocolClientFilterLayer org.jenkinsci.remoting.protocol.impl.AckFilterLayer org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer org.jenkinsci.remoting.protocol.impl.ConnectionHeadersFilterLayer org.jenkinsci.remoting.protocol.impl.ChannelApplicationLayer […] Feb 09, 2023 1:52:20 PM org.jenkinsci.remoting.protocol.impl.BIONetworkLayer$Reader run INFO: Waiting for ProtocolStack to start. 
Successfully verified list of size 6 org.jenkinsci.remoting.protocol.impl.BIONetworkLayer org.jenkinsci.remoting.protocol.impl.AgentProtocolClientFilterLayer org.jenkinsci.remoting.protocol.impl.AckFilterLayer org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer org.jenkinsci.remoting.protocol.impl.ConnectionHeadersFilterLayer org.jenkinsci.remoting.protocol.impl.ChannelApplicationLayer […] Successfully verified list of size 5 org.jenkinsci.remoting.protocol.impl.BIONetworkLayer org.jenkinsci.remoting.protocol.impl.AckFilterLayer org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer org.jenkinsci.remoting.protocol.impl.ConnectionHeadersFilterLayer org.jenkinsci.remoting.protocol.impl.ChannelApplicationLayer […] Successfully verified list of size 4 org.jenkinsci.remoting.protocol.impl.BIONetworkLayer org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer org.jenkinsci.remoting.protocol.impl.ConnectionHeadersFilterLayer org.jenkinsci.remoting.protocol.impl.ChannelApplicationLayer Feb 09, 2023 1:52:22 PM hudson.remoting.jnlp.Main$CuiListener status INFO: Remote identity confirmed: 34:ee:ff:fd:a6:3e:ed:fc:aa:76:9d:4b:aa:d6:e1:a0 Successfully verified list of size 4 org.jenkinsci.remoting.protocol.impl.BIONetworkLayer org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer org.jenkinsci.remoting.protocol.impl.ConnectionHeadersFilterLayer org.jenkinsci.remoting.protocol.impl.ChannelApplicationLayer […] Successfully verified list of size 3 org.jenkinsci.remoting.protocol.impl.BIONetworkLayer org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer org.jenkinsci.remoting.protocol.impl.ChannelApplicationLayer Feb 09, 2023 1:52:22 PM hudson.remoting.jnlp.Main$CuiListener status INFO: Connected Successfully verified list of size 3 org.jenkinsci.remoting.protocol.impl.BIONetworkLayer org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer org.jenkinsci.remoting.protocol.impl.ChannelApplicationLayer […]
In other words AgentProtocolClientFilterLayer is first to be removed, followed by AckFilterLayer, followed by ConnectionHeadersFilterLayer. Of the filter layers, the only one that remains by the end is SSLEngineFilterLayer.
> This could be done by writing a list validation method that walks the whole doubly-linked list and validates that each previous and next pointer is as expected (i.e., as drawn in https://github.com/jenkinsci/remoting/blob/master/docs/protocols.md modulo any node removals). The latter may or may not be easier to build on top of https://github.com/jenkinsci/remoting/pull/615 which simplifies the linked list logic considerably.
https://github.com/jenkinsci/remoting/commit/9f6785c250d03881aab8dbce4f3b7f805c1f87c3 (implemented on top of the linked list simplification in https://github.com/jenkinsci/remoting/commit/d65debc3f5409c40f0831d6445b4f932cc335965) is an example of what I mean. The current linked list logic in trunk is so bad that 9f6785c250d03881aab8dbce4f3b7f805c1f87c3 fails right away (reproducibly) when applied to trunk (without d65debc3f5409c40f0831d6445b4f932cc335965). But d65debc3f5409c40f0831d6445b4f932cc335965 + 9f6785c250d03881aab8dbce4f3b7f805c1f87c3 passes in my local testing. When I originally saw this, I had great hope that I had fixed the problem, but running d65debc3f5409c40f0831d6445b4f932cc335965 in a loop in https://github.com/jenkinsci/jenkins/pull/7567 still failed. At that point I surmised the next step was to run d65debc3f5409c40f0831d6445b4f932cc335965 + 9f6785c250d03881aab8dbce4f3b7f805c1f87c3 in a loop and see where the verification routine trips, but I lost the will to keep going.
A different idea would be to completely delete the problematic Ptr class, instead representing the protocol stack as two (synchronized) lists of ProtocolLayers (one for the send direction and one for the receive direction) in ProtocolStack. The existing methods in Ptr could all be reimplemented in terms of these lists with much less code. The main functionality in Ptr is to allow one ProtocolLayer to pass control to the next one (in either the send or receive direction), to allow a ProtocolLayer to be closed in one direction first and then the other, and to notify the ProtocolLayer when it has been closed in both directions. This could all be implemented in ProtocolStack with standard Java functionality like ArrayList#indexOf without the need for all the complexity of Ptr. I suspect this approach would have the greatest likelihood of success. But it is also the most work, tantamount to reimplementing the whole class from scratch.
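Purely to illustrate the shape of that idea, here is a minimal sketch with invented names (not the code in any existing branch); only the send direction is shown, and the receive direction would be the mirror image:

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Sketch of the "plain lists instead of Ptr" idea. LayerStub stands in for
// ProtocolLayer; removal just deletes the element, and "next layer" lookups
// become index arithmetic rather than pointer chasing.
public final class TwoListStackSketch {
    interface LayerStub {
        void onSend(String data, TwoListStackSketch stack);
    }

    private final List<LayerStub> sendPath = Collections.synchronizedList(new ArrayList<>());

    void add(LayerStub layer) {
        sendPath.add(layer);
    }

    void remove(LayerStub layer) {
        sendPath.remove(layer); // no pointers to repair, so no chain to corrupt
    }

    // Pass data from the given layer to the next one in the send direction.
    void sendFrom(LayerStub from, String data) {
        LayerStub next;
        synchronized (sendPath) {
            int index = sendPath.indexOf(from);
            if (index < 0 || index + 1 >= sendPath.size()) {
                return; // "from" was removed, or is already the last (network) layer
            }
            next = sendPath.get(index + 1);
        }
        next.onSend(data, this); // call outside the lock to avoid re-entrancy issues
    }
}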
> A different idea would be to completely delete the problematic Ptr class, instead representing the protocol stack as two (synchronized) lists of ProtocolLayers (one for the send direction and one for the receive direction) in ProtocolStack. The existing methods in Ptr could all be reimplemented in terms of these lists with much less code.
https://github.com/basil/remoting/tree/rewrite is a sketch of what a complete rewrite of this class could look like. The result is 200 lines shorter than the original while also being simpler and easier to reason about: a simpler locking scheme & no custom linked list implementation (to name a few simplifications). It is not quite as optimized as the original code, but I suspect the original code was prematurely optimized and that this rewrite is likely good enough or close to it. It seems to hold up fine to local testing, though I am a bit too afraid to try JnlpSlaveRestarterInstallerTest#tcpReconnection in a loop with this code.
FYI: For us this error seems to be caused by Java 8 JNLP-based Kubernetes pod connection attempts. While the connection does not work, it seems to kill the TcpSlaveAgentListener thread, leaving the exact same stack trace as described in the ticket. For the past 6 occurrences, we can say for sure that 5 of them were caused by a wrong configuration in the YAML, using a Java 8 JNLP image. For the 6th one we simply cannot tell...
We are running Jenkins 2.375.3 on Java 11.
Heya! Sorry to be that person, but is there any update on a fix for this?
Is it confirmed that either downgrading TLS to v1.2 or setting the `jdk.tls.acknowledgeCloseNotify=true` system property fixes this?
We are running Jenkins `2.346.3`, would upgrading and pulling in the fix to restart the TCP agent listener sufficiently "fix" this?
Thanks!
It didn't occur after rolling back to Jenkins 2.176.4.