While investigating a spike in the build queue size, we found that the TcpSlaveAgentListener thread was dead, with the following logs:
2019-10-23 09:02:17.236+0000 [id=200815] SEVERE h.TcpSlaveAgentListener$ConnectionHandler#lambda$new$0: Uncaught exception in TcpSlaveAgentListener ConnectionHandler Thread[TCP agent connection handler #1715 with /10.125.100.99:47700,5,main]
java.lang.UnsupportedOperationException: Network layer is not supposed to call isSendOpen
    at org.jenkinsci.remoting.protocol.ProtocolStack$Ptr.isSendOpen(ProtocolStack.java:730)
    at org.jenkinsci.remoting.protocol.FilterLayer.isSendOpen(FilterLayer.java:340)
    at org.jenkinsci.remoting.protocol.ProtocolStack$Ptr.isSendOpen(ProtocolStack.java:738)
    at org.jenkinsci.remoting.protocol.FilterLayer.isSendOpen(FilterLayer.java:340)
    at org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer.isSendOpen(SSLEngineFilterLayer.java:237)
    at org.jenkinsci.remoting.protocol.ProtocolStack$Ptr.isSendOpen(ProtocolStack.java:738)
    at org.jenkinsci.remoting.protocol.FilterLayer.isSendOpen(FilterLayer.java:340)
    at org.jenkinsci.remoting.protocol.impl.ConnectionHeadersFilterLayer.isSendOpen(ConnectionHeadersFilterLayer.java:514)
    at org.jenkinsci.remoting.protocol.ProtocolStack$Ptr.doSend(ProtocolStack.java:690)
    at org.jenkinsci.remoting.protocol.ApplicationLayer.write(ApplicationLayer.java:157)
    at org.jenkinsci.remoting.protocol.impl.ChannelApplicationLayer.start(ChannelApplicationLayer.java:230)
    at org.jenkinsci.remoting.protocol.ProtocolStack.init(ProtocolStack.java:201)
    at org.jenkinsci.remoting.protocol.ProtocolStack.access$700(ProtocolStack.java:106)
    at org.jenkinsci.remoting.protocol.ProtocolStack$Builder.build(ProtocolStack.java:554)
    at org.jenkinsci.remoting.engine.JnlpProtocol4Handler.handle(JnlpProtocol4Handler.java:153)
    at jenkins.slaves.JnlpSlaveAgentProtocol4.handle(JnlpSlaveAgentProtocol4.java:203)
    at hudson.TcpSlaveAgentListener$ConnectionHandler.run(TcpSlaveAgentListener.java:271)
2019-10-23 09:02:17.237+0000 [id=200815] WARNING hudson.TcpSlaveAgentListener$1#run: Connection handler failed, restarting listener
java.lang.UnsupportedOperationException: Network layer is not supposed to call isSendOpen
    at org.jenkinsci.remoting.protocol.ProtocolStack$Ptr.isSendOpen(ProtocolStack.java:730)
    at org.jenkinsci.remoting.protocol.FilterLayer.isSendOpen(FilterLayer.java:340)
    at org.jenkinsci.remoting.protocol.ProtocolStack$Ptr.isSendOpen(ProtocolStack.java:738)
    at org.jenkinsci.remoting.protocol.FilterLayer.isSendOpen(FilterLayer.java:340)
    at org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer.isSendOpen(SSLEngineFilterLayer.java:237)
    at org.jenkinsci.remoting.protocol.ProtocolStack$Ptr.isSendOpen(ProtocolStack.java:738)
    at org.jenkinsci.remoting.protocol.FilterLayer.isSendOpen(FilterLayer.java:340)
    at org.jenkinsci.remoting.protocol.impl.ConnectionHeadersFilterLayer.isSendOpen(ConnectionHeadersFilterLayer.java:514)
    at org.jenkinsci.remoting.protocol.ProtocolStack$Ptr.doSend(ProtocolStack.java:690)
    at org.jenkinsci.remoting.protocol.ApplicationLayer.write(ApplicationLayer.java:157)
    at org.jenkinsci.remoting.protocol.impl.ChannelApplicationLayer.start(ChannelApplicationLayer.java:230)
    at org.jenkinsci.remoting.protocol.ProtocolStack.init(ProtocolStack.java:201)
    at org.jenkinsci.remoting.protocol.ProtocolStack.access$700(ProtocolStack.java:106)
    at org.jenkinsci.remoting.protocol.ProtocolStack$Builder.build(ProtocolStack.java:554)
    at org.jenkinsci.remoting.engine.JnlpProtocol4Handler.handle(JnlpProtocol4Handler.java:153)
    at jenkins.slaves.JnlpSlaveAgentProtocol4.handle(JnlpSlaveAgentProtocol4.java:203)
    at hudson.TcpSlaveAgentListener$ConnectionHandler.run(TcpSlaveAgentListener.java:271)
This was followed by logs from nodes created by the Jenkins Kubernetes Plugin:
SEVERE: http://jenkins-master.example.com/ provided port:50000 is not reachable
java.io.IOException: http://jenkins-master.example.com/ provided port:50000 is not reachable
    at org.jenkinsci.remoting.engine.JnlpAgentEndpointResolver.resolve(JnlpAgentEndpointResolver.java:287)
    at hudson.remoting.Engine.innerRun(Engine.java:523)
    at hudson.remoting.Engine.run(Engine.java:474)
Changing the JNLP port from 50000 to 50001 and back in the Jenkins settings restored connectivity, and nodes were then able to connect to the master again.
- is duplicated by: JENKINS-70161 Blocked JNLP port (Closed)
- is related to: JENKINS-70334 When TcpSlaveAgentListener dies it is not restarted (Reopened)
After further reading, I get it now. It looks like under certain conditions (which we have not yet identified), the downward link to the NetworkLayer is lost.
As I understand it, this happens very early, since we are still inside org.jenkinsci.remoting.protocol.ProtocolStack.init. I wonder why the AckFilterLayer would already have been removed (completed) at that point.
Maybe that is the scenario that causes this: the AckFilterLayer (or any filter layer) is removed before the end of ProtocolStack#start? On the other hand, judging by the stack trace, I think the AckFilterLayer is actually still there, since we see a plain FilterLayer after the SSLEngineFilterLayer:
# Following 2 originates from an object that does not override FilterLayer.isSendOpen() most likely AckFilterLayer
    at org.jenkinsci.remoting.protocol.ProtocolStack$Ptr.isSendOpen(ProtocolStack.java:730)
    at org.jenkinsci.remoting.protocol.FilterLayer.isSendOpen(FilterLayer.java:340)
# Following 2 originates from super.isSendOpen() from SSLEngineFilterLayer
    at org.jenkinsci.remoting.protocol.ProtocolStack$Ptr.isSendOpen(ProtocolStack.java:738)
    at org.jenkinsci.remoting.protocol.FilterLayer.isSendOpen(FilterLayer.java:340)
    at org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer.isSendOpen(SSLEngineFilterLayer.java:237)
Maybe while we are in the AckFilterLayer, in the middle of the ProtocolStack$Ptr.isSendOpen() check, that layer is being removed and its nextSend therefore becomes null. I am not sure whether this is a possible scenario.
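To make that hypothesis concrete, here is a toy model of it. This is not the remoting code: SendChainModel, Ptr, the network flag and unlink() are all hypothetical names, and whether removal really clears nextSend is exactly the open question. It only shows that if a pointer's downward link is cleared while the layer above still references it, the call lands in the "network layer" branch.

```java
/** Toy model of a send-side pointer chain; every name here is hypothetical. */
class SendChainModel {
    static final class Ptr {
        final String name;
        final boolean network;  // true only for the bottom (network) entry
        volatile Ptr nextSend;  // link towards the network, null at the bottom

        Ptr(String name, boolean network, Ptr nextSend) {
            this.name = name;
            this.network = network;
            this.nextSend = nextSend;
        }

        /** Asks whether the entry below this one can still send. */
        boolean isSendOpen() {
            Ptr next = nextSend;
            if (next == null) {
                // Only the network entry is expected to have no downward link.
                throw new UnsupportedOperationException(
                        "Network layer is not supposed to call isSendOpen");
            }
            return next.network || next.isSendOpen();
        }

        /** Assumed unsynchronized removal that clears the downward link. */
        void unlink() {
            nextSend = null;
        }
    }

    public static void main(String[] args) {
        Ptr net = new Ptr("network", true, null);
        Ptr ack = new Ptr("ack", false, net);
        Ptr ssl = new Ptr("ssl", false, ack);

        System.out.println("intact chain open: " + ssl.isSendOpen());

        ack.unlink(); // removal happens while ssl still points at ack
        try {
            ssl.isSendOpen();
        } catch (UnsupportedOperationException e) {
            System.out.println("after unlink: " + e.getMessage());
        }
    }
}
```

In this model the failure needs no second thread, only a stale reference from the upper pointer to an already unlinked one. In a real stack the removal would presumably race with the in-flight check.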
There used to be synchronization around removal, but it was removed some time ago: https://github.com/jenkinsci/remoting/pull/289.
Could we maybe add some logging just before throwing the UnsupportedOperationException that would dump the current state of the stack pointers?
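A rough sketch of what such a dump could look like, using a simplified stand-in rather than the real ProtocolStack.Ptr. The class, its fields (layerName, nextSend, removed) and the dump format are assumptions made for illustration only.

```java
/** Hypothetical, simplified pointer chain used only to illustrate the diagnostic. */
class StackPointerDump {
    static final class Ptr {
        final String layerName; // label for the layer this pointer wraps
        Ptr nextSend;           // link towards the network layer, null at the bottom
        boolean removed;        // true once the layer has completed and been removed

        Ptr(String layerName) {
            this.layerName = layerName;
        }
    }

    /** Renders every pointer reachable from head, top to bottom, on one line. */
    static String dump(Ptr head) {
        StringBuilder sb = new StringBuilder();
        for (Ptr p = head; p != null; p = p.nextSend) {
            sb.append(p.layerName).append(p.removed ? "[removed]" : "[active]");
            if (p.nextSend != null) {
                sb.append(" -> ");
            }
        }
        return sb.toString();
    }

    /** Throws like the failing check, but with the chain state attached to the message. */
    static void failWithDump(Ptr head) {
        throw new UnsupportedOperationException(
                "Network layer is not supposed to call isSendOpen; stack: " + dump(head));
    }

    public static void main(String[] args) {
        Ptr headers = new Ptr("ConnectionHeadersFilterLayer");
        Ptr ssl = new Ptr("SSLEngineFilterLayer");
        Ptr ack = new Ptr("AckFilterLayer");
        headers.nextSend = ssl;
        ssl.nextSend = ack;
        ack.removed = true; // pretend the ack layer completed early

        try {
            failWithDump(headers);
        } catch (UnsupportedOperationException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

Putting the dump into the exception message means it would show up in the existing SEVERE log entry without extra logger configuration. A separate log call at a finer level would work as well.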