Java 11 agent disconnection: UnsupportedOperationException from ProtocolStack$Ptr.isSendOpen

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Critical
    • Components: core, remoting
    • Environment: Docker image based on jenkins/jenkins:2.204.5-jdk11
      Both with and without Nginx 1.17.6 as reverse proxy
      Ubuntu 18.04

      While investigating a spike in build queue size, we found that the TcpSlaveAgentListener thread was dead, with the following logs:

      2019-10-23 09:02:17.236+0000 [id=200815]        SEVERE  h.TcpSlaveAgentListener$ConnectionHandler#lambda$new$0: Uncaught exception in TcpSlaveAgentListener ConnectionHandler Thread[TCP agent connection handler #1715 with /10.125.100.99:47700,5,main]
      java.lang.UnsupportedOperationException: Network layer is not supposed to call isSendOpen
              at org.jenkinsci.remoting.protocol.ProtocolStack$Ptr.isSendOpen(ProtocolStack.java:730)
              at org.jenkinsci.remoting.protocol.FilterLayer.isSendOpen(FilterLayer.java:340)
              at org.jenkinsci.remoting.protocol.ProtocolStack$Ptr.isSendOpen(ProtocolStack.java:738)
              at org.jenkinsci.remoting.protocol.FilterLayer.isSendOpen(FilterLayer.java:340)
              at org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer.isSendOpen(SSLEngineFilterLayer.java:237)
              at org.jenkinsci.remoting.protocol.ProtocolStack$Ptr.isSendOpen(ProtocolStack.java:738)
              at org.jenkinsci.remoting.protocol.FilterLayer.isSendOpen(FilterLayer.java:340)
              at org.jenkinsci.remoting.protocol.impl.ConnectionHeadersFilterLayer.isSendOpen(ConnectionHeadersFilterLayer.java:514)
              at org.jenkinsci.remoting.protocol.ProtocolStack$Ptr.doSend(ProtocolStack.java:690)
              at org.jenkinsci.remoting.protocol.ApplicationLayer.write(ApplicationLayer.java:157)
              at org.jenkinsci.remoting.protocol.impl.ChannelApplicationLayer.start(ChannelApplicationLayer.java:230)
              at org.jenkinsci.remoting.protocol.ProtocolStack.init(ProtocolStack.java:201)
              at org.jenkinsci.remoting.protocol.ProtocolStack.access$700(ProtocolStack.java:106)
              at org.jenkinsci.remoting.protocol.ProtocolStack$Builder.build(ProtocolStack.java:554)
              at org.jenkinsci.remoting.engine.JnlpProtocol4Handler.handle(JnlpProtocol4Handler.java:153)
              at jenkins.slaves.JnlpSlaveAgentProtocol4.handle(JnlpSlaveAgentProtocol4.java:203)
              at hudson.TcpSlaveAgentListener$ConnectionHandler.run(TcpSlaveAgentListener.java:271)
      2019-10-23 09:02:17.237+0000 [id=200815]        WARNING hudson.TcpSlaveAgentListener$1#run: Connection handler failed, restarting listener
      java.lang.UnsupportedOperationException: Network layer is not supposed to call isSendOpen
              at org.jenkinsci.remoting.protocol.ProtocolStack$Ptr.isSendOpen(ProtocolStack.java:730)
              at org.jenkinsci.remoting.protocol.FilterLayer.isSendOpen(FilterLayer.java:340)
              at org.jenkinsci.remoting.protocol.ProtocolStack$Ptr.isSendOpen(ProtocolStack.java:738)
              at org.jenkinsci.remoting.protocol.FilterLayer.isSendOpen(FilterLayer.java:340)
              at org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer.isSendOpen(SSLEngineFilterLayer.java:237)
              at org.jenkinsci.remoting.protocol.ProtocolStack$Ptr.isSendOpen(ProtocolStack.java:738)
              at org.jenkinsci.remoting.protocol.FilterLayer.isSendOpen(FilterLayer.java:340)
              at org.jenkinsci.remoting.protocol.impl.ConnectionHeadersFilterLayer.isSendOpen(ConnectionHeadersFilterLayer.java:514)
              at org.jenkinsci.remoting.protocol.ProtocolStack$Ptr.doSend(ProtocolStack.java:690)
              at org.jenkinsci.remoting.protocol.ApplicationLayer.write(ApplicationLayer.java:157)
              at org.jenkinsci.remoting.protocol.impl.ChannelApplicationLayer.start(ChannelApplicationLayer.java:230)
              at org.jenkinsci.remoting.protocol.ProtocolStack.init(ProtocolStack.java:201)
              at org.jenkinsci.remoting.protocol.ProtocolStack.access$700(ProtocolStack.java:106)
              at org.jenkinsci.remoting.protocol.ProtocolStack$Builder.build(ProtocolStack.java:554)
              at org.jenkinsci.remoting.engine.JnlpProtocol4Handler.handle(JnlpProtocol4Handler.java:153)
              at jenkins.slaves.JnlpSlaveAgentProtocol4.handle(JnlpSlaveAgentProtocol4.java:203)
              at hudson.TcpSlaveAgentListener$ConnectionHandler.run(TcpSlaveAgentListener.java:271) 

      Followed by logs from nodes created by Jenkins Kubernetes Plugin:

      SEVERE: http://jenkins-master.example.com/ provided port:50000 is not reachable
      java.io.IOException: http://jenkins-master.example.com/ provided port:50000 is not reachable
              at org.jenkinsci.remoting.engine.JnlpAgentEndpointResolver.resolve(JnlpAgentEndpointResolver.java:287)
              at hudson.remoting.Engine.innerRun(Engine.java:523)
              at hudson.remoting.Engine.run(Engine.java:474)
       

      Changing JNLP port from 50000 to 50001 and back in Jenkins settings helped to restore connection and then nodes were able to connect to master again.

          [JENKINS-59910] Java 11 agent disconnection: UnsupportedOperationException from ProtocolStack$Ptr.isSendOpen

          Allan BURDAJEWICZ added a comment - - edited

          After further reading, I get it now. Yeah, it looks like under certain conditions (that we have not yet identified), the link downward to the NetworkLayer is lost.

          As I understand it, this might happen very early, since we are in org.jenkinsci.remoting.protocol.ProtocolStack.init. I wonder why the AckFilterLayer would already have been removed (completed) at that point.

          Maybe that is the scenario that causes this.. that when the AckFilterLayer (or any filter layer) is removed before the end of the ProtocolStack#start ? Or per the stacktrace, I do think that we actually have the AckFilterLayer since we see a FilterLayer after the SSLEngineFilterLayer:

                  # The following 2 frames originate from an object that does not override FilterLayer.isSendOpen(), most likely AckFilterLayer
                  at org.jenkinsci.remoting.protocol.ProtocolStack$Ptr.isSendOpen(ProtocolStack.java:730)
                  at org.jenkinsci.remoting.protocol.FilterLayer.isSendOpen(FilterLayer.java:340)
                  # The following 2 frames originate from super.isSendOpen() in SSLEngineFilterLayer
                  at org.jenkinsci.remoting.protocol.ProtocolStack$Ptr.isSendOpen(ProtocolStack.java:738)
                  at org.jenkinsci.remoting.protocol.FilterLayer.isSendOpen(FilterLayer.java:340)
                  at org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer.isSendOpen(SSLEngineFilterLayer.java:237)
          

          Maybe while we are on the AckFilterLayer layer checking on ProtocolStack$Ptr.isSendOpen(), this layer is being removed and therefore the nextSend becomes null. I am not sure if this is a possible scenario.

          There used to be synchronization for removal, which was removed some time ago (see https://github.com/jenkinsci/remoting/pull/289).

          Could we maybe add some logging just before throwing the UnsupportedOperationException that would dump the current state of the stack pointers?

          Basil Crow added a comment - - edited

          Maybe that is the scenario that causes this.. that when the AckFilterLayer (or any filter layer) is removed before the end of the ProtocolStack#start

          Try stepping through ProtocolStack#start in the normal (successful) case. It's a few hundred statements but it should give you a feel for what should be happening. I did that a few weeks ago and if I recall correctly it was normal for some of the layers to be removed during ProtocolStack#start.

          There is definitely something wrong with list removal, as the pointers are being initialized in a valid state and they are in an illegal state by the time we hit the error. Both the original code and the fix in https://github.com/jenkinsci/remoting/pull/289 look incorrect to me. I did a visual inspection/audit of the list code, found several things that looked wrong to me, and fixed them all in https://github.com/jenkinsci/remoting/pull/615 — but it was not enough. Since visual inspection/audit failed, direct analysis is all that is left and I did not have the time to do that.

          The heart of the problem is this: the pointer chain is becoming corrupt at some point, and by the time we hit the fatal exception we are observing a downstream symptom rather than the root cause. To make progress on the problem we need to find the smoking gun that is corrupting the pointer chain. This could be done by writing a list validation method that walks the whole doubly-linked list and validates that each previous and next pointer is as expected (i.e., as drawn in https://github.com/jenkinsci/remoting/blob/master/docs/protocols.md modulo any node removals). The latter may or may not be easier to build on top of https://github.com/jenkinsci/remoting/pull/615 which simplifies the linked list logic considerably. Once such a validation method is written, it could be called both before and after any write operations (i.e., anything that grabs stackLock.writeLock()) to catch the list corruption at the moment it occurs. From there it should be possible to reason about the root cause. In other words, the idea would be to run https://github.com/jenkinsci/jenkins/pull/7565 (modulo reverting or disabling JENKINS-70334) in a loop to get a feel for how many iterations it takes to hit the failure on our Windows agents (in my experience, about 57 iterations), then start running with the validation method which would hopefully trip in a similar number of iterations before we get to the UnsupportedOperationException (unless the cost of doing the validation perturbs the timing enough to chase away the failure).
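
          The validation idea above can be sketched as follows. This is only an illustration under stated assumptions, not the code in the remoting repository: LayerNode, its prev/next fields, and LayerChainValidator are hypothetical stand-ins for the real Ptr bookkeeping, and a single prev/next pair is used instead of separate send and receive pointers.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for one node of the protocol stack's doubly-linked list. */
final class LayerNode {
    final String name;
    LayerNode prev; // toward the network layer
    LayerNode next; // toward the application layer

    LayerNode(String name) {
        this.name = name;
    }
}

/** Hypothetical helper: walks the whole list and checks every link in both directions. */
final class LayerChainValidator {

    /** Links two nodes so that a is immediately before b. */
    static void link(LayerNode a, LayerNode b) {
        a.next = b;
        b.prev = a;
    }

    /** Unlinks an interior node (one that has both a prev and a next). */
    static void unlink(LayerNode node) {
        node.prev.next = node.next;
        node.next.prev = node.prev;
        node.prev = null;
        node.next = null;
    }

    /**
     * Returns the layer names from head to tail, or throws IllegalStateException
     * at the first inconsistent pointer. A cycle cannot go undetected: re-entering
     * a node would require a second predecessor, which fails the back-link check.
     */
    static List<String> validate(LayerNode head, LayerNode tail) {
        if (head == null || tail == null) {
            throw new IllegalStateException("head or tail is null");
        }
        if (head.prev != null) {
            throw new IllegalStateException(head.name + " is the head but has a prev pointer");
        }
        List<String> names = new ArrayList<>();
        LayerNode current = head;
        names.add(current.name);
        while (current.next != null) {
            LayerNode next = current.next;
            if (next.prev != current) {
                throw new IllegalStateException(
                        next.name + ".prev does not point back to " + current.name);
            }
            current = next;
            names.add(current.name);
        }
        if (current != tail) {
            throw new IllegalStateException(
                    "walk ended at " + current.name + " instead of tail " + tail.name);
        }
        return names;
    }
}
```

          In this sketch, validate would be called immediately before and after every mutation such as unlink, while holding the write lock, so that the first failing call points at the operation that corrupted the chain.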

          Janek Suchocki added a comment -

          Hi basil and allan_burdajewicz, I've seen that JENKINS-70334 was not included in yesterday's 2.353.3 LTS release. Is there some hesitance about backporting it to LTS because this issue (59910) is still open?

          Basil Crow added a comment -

          Not that I know of. Questions about backporting should be directed to the Release Lead and/or Release Officer.

          Basil Crow added a comment -

          Maybe that is the scenario that causes this.. that when the AckFilterLayer (or any filter layer) is removed before the end of the ProtocolStack#start?

          It's normal, as this logging demonstrates:

          Feb 09, 2023 1:52:19 PM org.jenkinsci.remoting.engine.WorkDirManager initializeWorkDir
          INFO: Using /tmp/remote/remoting as a remoting work directory
          Feb 09, 2023 1:52:19 PM org.jenkinsci.remoting.engine.WorkDirManager setupLogging
          INFO: Both error and output logs will be printed to /tmp/remote/remoting
          Feb 09, 2023 1:52:19 PM hudson.remoting.jnlp.Main createEngine
          INFO: Setting up agent: test
          Feb 09, 2023 1:52:19 PM hudson.remoting.Engine startEngine
          INFO: Using Remoting version: 999999-SNAPSHOT (private-02/09/2023 21:51 GMT-basil)
          Feb 09, 2023 1:52:19 PM org.jenkinsci.remoting.engine.WorkDirManager initializeWorkDir
          INFO: Using /tmp/remote/remoting as a remoting work directory
          Feb 09, 2023 1:52:19 PM hudson.remoting.jnlp.Main$CuiListener status
          INFO: Locating server among [http://127.0.0.1/]
          Feb 09, 2023 1:52:19 PM org.jenkinsci.remoting.engine.JnlpAgentEndpointResolver resolve
          INFO: Remoting server accepts the following protocols: [JNLP4-connect, Ping]
          Feb 09, 2023 1:52:19 PM hudson.remoting.jnlp.Main$CuiListener status
          INFO: Agent discovery successful
            Agent address: 127.0.0.1
            Agent port:    59100
            Identity:      34:ee:ff:fd:a6:3e:ed:fc:aa:76:9d:4b:aa:d6:e1:a0
          Feb 09, 2023 1:52:19 PM hudson.remoting.jnlp.Main$CuiListener status
          INFO: Handshaking
          Feb 09, 2023 1:52:19 PM hudson.remoting.jnlp.Main$CuiListener status
          INFO: Connecting to 127.0.0.1:59100
          Feb 09, 2023 1:52:19 PM hudson.remoting.jnlp.Main$CuiListener status
          INFO: Trying protocol: JNLP4-connect
          Successfully verified list of size 6
          org.jenkinsci.remoting.protocol.impl.BIONetworkLayer
          org.jenkinsci.remoting.protocol.impl.AgentProtocolClientFilterLayer
          org.jenkinsci.remoting.protocol.impl.AckFilterLayer
          org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer
          org.jenkinsci.remoting.protocol.impl.ConnectionHeadersFilterLayer
          org.jenkinsci.remoting.protocol.impl.ChannelApplicationLayer
          […]
          Feb 09, 2023 1:52:20 PM org.jenkinsci.remoting.protocol.impl.BIONetworkLayer$Reader run
          INFO: Waiting for ProtocolStack to start.
          Successfully verified list of size 6
          org.jenkinsci.remoting.protocol.impl.BIONetworkLayer
          org.jenkinsci.remoting.protocol.impl.AgentProtocolClientFilterLayer
          org.jenkinsci.remoting.protocol.impl.AckFilterLayer
          org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer
          org.jenkinsci.remoting.protocol.impl.ConnectionHeadersFilterLayer
          org.jenkinsci.remoting.protocol.impl.ChannelApplicationLayer
          […]
          Successfully verified list of size 5
          org.jenkinsci.remoting.protocol.impl.BIONetworkLayer
          org.jenkinsci.remoting.protocol.impl.AckFilterLayer
          org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer
          org.jenkinsci.remoting.protocol.impl.ConnectionHeadersFilterLayer
          org.jenkinsci.remoting.protocol.impl.ChannelApplicationLayer
          […]
          Successfully verified list of size 4
          org.jenkinsci.remoting.protocol.impl.BIONetworkLayer
          org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer
          org.jenkinsci.remoting.protocol.impl.ConnectionHeadersFilterLayer
          org.jenkinsci.remoting.protocol.impl.ChannelApplicationLayer
          Feb 09, 2023 1:52:22 PM hudson.remoting.jnlp.Main$CuiListener status
          INFO: Remote identity confirmed: 34:ee:ff:fd:a6:3e:ed:fc:aa:76:9d:4b:aa:d6:e1:a0
          Successfully verified list of size 4
          org.jenkinsci.remoting.protocol.impl.BIONetworkLayer
          org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer
          org.jenkinsci.remoting.protocol.impl.ConnectionHeadersFilterLayer
          org.jenkinsci.remoting.protocol.impl.ChannelApplicationLayer
          […]
          Successfully verified list of size 3
          org.jenkinsci.remoting.protocol.impl.BIONetworkLayer
          org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer
          org.jenkinsci.remoting.protocol.impl.ChannelApplicationLayer
          Feb 09, 2023 1:52:22 PM hudson.remoting.jnlp.Main$CuiListener status
          INFO: Connected
          Successfully verified list of size 3
          org.jenkinsci.remoting.protocol.impl.BIONetworkLayer
          org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer
          org.jenkinsci.remoting.protocol.impl.ChannelApplicationLayer
          […]
          

          In other words AgentProtocolClientFilterLayer is first to be removed, followed by AckFilterLayer, followed by ConnectionHeadersFilterLayer. Of the filter layers, the only one that remains by the end is SSLEngineFilterLayer.

          This could be done by writing a list validation method that walks the whole doubly-linked list and validates that each previous and next pointer is as expected (i.e., as drawn in https://github.com/jenkinsci/remoting/blob/master/docs/protocols.md modulo any node removals). The latter may or may not be easier to build on top of https://github.com/jenkinsci/remoting/pull/615 which simplifies the linked list logic considerably.

          https://github.com/jenkinsci/remoting/commit/9f6785c250d03881aab8dbce4f3b7f805c1f87c3 (implemented on top of the linked list simplification in https://github.com/jenkinsci/remoting/commit/d65debc3f5409c40f0831d6445b4f932cc335965) is an example of what I mean. The current linked list logic in trunk is so bad that 9f6785c250d03881aab8dbce4f3b7f805c1f87c3 fails right away (reproducibly) when applied to trunk (without d65debc3f5409c40f0831d6445b4f932cc335965). But d65debc3f5409c40f0831d6445b4f932cc335965 + 9f6785c250d03881aab8dbce4f3b7f805c1f87c3 passes in my local testing. When I originally saw this, I had great hope that I had fixed the problem, but running d65debc3f5409c40f0831d6445b4f932cc335965 in a loop in https://github.com/jenkinsci/jenkins/pull/7567 still failed. At that point I surmised the next step was to run d65debc3f5409c40f0831d6445b4f932cc335965 + 9f6785c250d03881aab8dbce4f3b7f805c1f87c3 in a loop and see where the verification routine trips, but I lost the will to keep going.

          Basil Crow added a comment -

          A different idea would be to completely delete the problematic Ptr class, instead representing the protocol stack as two (synchronized) lists of ProtocolLayers (one for the send direction and one for the receive direction) in ProtocolStack. The existing methods in Ptr could all be reimplemented in terms of these lists with much less code. The main functionality in Ptr is to allow one ProtocolLayer to pass control to the next one (in either the send or receive direction), to allow a ProtocolLayer to be closed in one direction first and then the other, and to notify the ProtocolLayer when it has been closed in both directions. This could all be implemented in ProtocolStack with standard Java functionality like ArrayList#indexOf without the need for all the complexity of Ptr. I suspect this approach would have the greatest likelihood of success. But it is also the most work, tantamount to reimplementing the whole class from scratch.
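
          A minimal sketch of the two-list idea follows. It is an illustration only and is not taken from any actual rewrite: TwoListStack and its method names are hypothetical, layers are represented by plain strings instead of ProtocolLayer objects, and "notifying" a layer is reduced to recording its name once it has been closed in both directions.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: the stack as two ordered lists guarded by one monitor. */
final class TwoListStack {
    private final List<String> sendOrder = new ArrayList<>(); // application -> network
    private final List<String> recvOrder = new ArrayList<>(); // network -> application
    private final List<String> fullyClosed = new ArrayList<>();

    /** Takes the layers ordered from the network layer up to the application layer. */
    TwoListStack(List<String> networkToApplication) {
        recvOrder.addAll(networkToApplication);
        for (int i = networkToApplication.size() - 1; i >= 0; i--) {
            sendOrder.add(networkToApplication.get(i));
        }
    }

    /** Next layer in the send direction, or null if the given layer is the last one. */
    synchronized String nextSend(String layer) {
        return nextIn(sendOrder, layer);
    }

    /** Next layer in the receive direction, or null if the given layer is the last one. */
    synchronized String nextRecv(String layer) {
        return nextIn(recvOrder, layer);
    }

    private static String nextIn(List<String> order, String layer) {
        int i = order.indexOf(layer);
        if (i < 0) {
            throw new IllegalStateException(layer + " is not open in this direction");
        }
        return i + 1 < order.size() ? order.get(i + 1) : null;
    }

    /** Closes a layer in the send direction only. */
    synchronized void closeSend(String layer) {
        sendOrder.remove(layer);
        recordIfClosedBothWays(layer);
    }

    /** Closes a layer in the receive direction only. */
    synchronized void closeRecv(String layer) {
        recvOrder.remove(layer);
        recordIfClosedBothWays(layer);
    }

    // Stands in for notifying the layer that it has been removed from both directions.
    private void recordIfClosedBothWays(String layer) {
        if (!sendOrder.contains(layer) && !recvOrder.contains(layer)
                && !fullyClosed.contains(layer)) {
            fullyClosed.add(layer);
        }
    }

    /** Snapshot of the layers that have been closed in both directions, in order. */
    synchronized List<String> fullyClosed() {
        return new ArrayList<>(fullyClosed);
    }
}
```

          Every lookup here is a linear scan, which costs more than following a pointer, but with at most a handful of layers in the stack the difference is unlikely to matter, and there are no pointers left to corrupt.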

          Basil Crow added a comment -

          A different idea would be to completely delete the problematic Ptr class, instead representing the protocol stack as two (synchronized) lists of ProtocolLayers (one for the send direction and one for the receive direction) in ProtocolStack. The existing methods in Ptr could all be reimplemented in terms of these lists with much less code.

          https://github.com/basil/remoting/tree/rewrite is a sketch of what a complete rewrite of this class could look like. The result is 200 lines shorter than the original while also being simpler and easier to reason about: a simpler locking scheme & no custom linked list implementation (to name a few simplifications). It is not quite as optimized as the original code, but I suspect the original code was prematurely optimized and that this rewrite is likely good enough or close to it. It seems to hold up fine to local testing, though I am a bit too afraid to try JnlpSlaveRestarterInstallerTest#tcpReconnection in a loop with this code.

          Joerg Schwaerzler added a comment - - edited

          FYI: for us this error seems to be caused by connection attempts from Java 8 JNLP-based Kubernetes pods. While the connection does not work, it seems to kill the TcpSlaveAgentListener thread, leaving the exact same stack trace as described in the ticket. Of the past 6 occurrences, we can say for sure that 5 were caused by a wrong configuration in the YAML that used a Java 8 JNLP image. For the 6th one we simply cannot tell...

          We are running Jenkins 2.375.3 on Java 11.

          Adam Dougal added a comment -

          Heya! Sorry to be that person, but is there any update on a fix for this?

          Is it confirmed that either downgrading TLS to v1.2 or setting the `jdk.tls.acknowledgeCloseNotify=true` system property fixes this?

          We are running Jenkins `2.346.3`. Would upgrading and pulling in the fix to restart the TCP agent listener sufficiently "fix" this?

          Thanks!

          Dima Rudaev added a comment - - edited

          We have the same issue on 2.319.3 with Java 8 and ECS plugin

          OS: OS Linux, 5.15.93-55.139.amzn2.x86_64 , amd64/64 (128 cores)
          Java: OpenJDK Runtime Environment, 1.8.0_322-b06
          JVM: OpenJDK 64-Bit Server VM, 25.322-b06, mixed mode

          Server side log:
          [id=8348096] SEVERE h.TcpSlaveAgentListener$ConnectionHandler#lambda$new$0: Uncaught exception in TcpSlaveAgentListener ConnectionHandler Thread[TCP agent connection handler #431578 with /IP:PORT,5,main]

          Slave side log:
          SEVERE: https://XXXXXX:9080/ provided port:8443 is not reachable
          java.io.IOException: https://XXXXXX:9080/ provided port:8443 is not reachable

            Assignee: Unassigned
            Reporter: Andrey Babushkin (oxygenxo)
            Votes: 22
            Watchers: 40