Jenkins / JENKINS-59910

Java 11 agent disconnection: UnsupportedOperationException from ProtocolStack$Ptr.isSendOpen

Details

    • Type: Bug
    • Status: Open
    • Priority: Critical
    • Resolution: Unresolved
    • Component/s: core, remoting
    • Environment: Docker image based on jenkins/jenkins:2.204.5-jdk11, both with and without Nginx 1.17.6 as reverse proxy; Ubuntu 18.04

    Description

      While investigating a spike in build queue size, we found that the TcpSlaveAgentListener connection handler thread had died, with the following logs:

      2019-10-23 09:02:17.236+0000 [id=200815]        SEVERE  h.TcpSlaveAgentListener$ConnectionHandler#lambda$new$0: Uncaught exception in TcpSlaveAgentListener ConnectionHandler Thread[TCP agent connection handler #1715 with /10.125.100.99:47700,5,main]
      java.lang.UnsupportedOperationException: Network layer is not supposed to call isSendOpen
              at org.jenkinsci.remoting.protocol.ProtocolStack$Ptr.isSendOpen(ProtocolStack.java:730)
              at org.jenkinsci.remoting.protocol.FilterLayer.isSendOpen(FilterLayer.java:340)
              at org.jenkinsci.remoting.protocol.ProtocolStack$Ptr.isSendOpen(ProtocolStack.java:738)
              at org.jenkinsci.remoting.protocol.FilterLayer.isSendOpen(FilterLayer.java:340)
              at org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer.isSendOpen(SSLEngineFilterLayer.java:237)
              at org.jenkinsci.remoting.protocol.ProtocolStack$Ptr.isSendOpen(ProtocolStack.java:738)
              at org.jenkinsci.remoting.protocol.FilterLayer.isSendOpen(FilterLayer.java:340)
              at org.jenkinsci.remoting.protocol.impl.ConnectionHeadersFilterLayer.isSendOpen(ConnectionHeadersFilterLayer.java:514)
              at org.jenkinsci.remoting.protocol.ProtocolStack$Ptr.doSend(ProtocolStack.java:690)
              at org.jenkinsci.remoting.protocol.ApplicationLayer.write(ApplicationLayer.java:157)
              at org.jenkinsci.remoting.protocol.impl.ChannelApplicationLayer.start(ChannelApplicationLayer.java:230)
              at org.jenkinsci.remoting.protocol.ProtocolStack.init(ProtocolStack.java:201)
              at org.jenkinsci.remoting.protocol.ProtocolStack.access$700(ProtocolStack.java:106)
              at org.jenkinsci.remoting.protocol.ProtocolStack$Builder.build(ProtocolStack.java:554)
              at org.jenkinsci.remoting.engine.JnlpProtocol4Handler.handle(JnlpProtocol4Handler.java:153)
              at jenkins.slaves.JnlpSlaveAgentProtocol4.handle(JnlpSlaveAgentProtocol4.java:203)
              at hudson.TcpSlaveAgentListener$ConnectionHandler.run(TcpSlaveAgentListener.java:271)
      2019-10-23 09:02:17.237+0000 [id=200815]        WARNING hudson.TcpSlaveAgentListener$1#run: Connection handler failed, restarting listener
      java.lang.UnsupportedOperationException: Network layer is not supposed to call isSendOpen
              at org.jenkinsci.remoting.protocol.ProtocolStack$Ptr.isSendOpen(ProtocolStack.java:730)
              at org.jenkinsci.remoting.protocol.FilterLayer.isSendOpen(FilterLayer.java:340)
              at org.jenkinsci.remoting.protocol.ProtocolStack$Ptr.isSendOpen(ProtocolStack.java:738)
              at org.jenkinsci.remoting.protocol.FilterLayer.isSendOpen(FilterLayer.java:340)
              at org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer.isSendOpen(SSLEngineFilterLayer.java:237)
              at org.jenkinsci.remoting.protocol.ProtocolStack$Ptr.isSendOpen(ProtocolStack.java:738)
              at org.jenkinsci.remoting.protocol.FilterLayer.isSendOpen(FilterLayer.java:340)
              at org.jenkinsci.remoting.protocol.impl.ConnectionHeadersFilterLayer.isSendOpen(ConnectionHeadersFilterLayer.java:514)
              at org.jenkinsci.remoting.protocol.ProtocolStack$Ptr.doSend(ProtocolStack.java:690)
              at org.jenkinsci.remoting.protocol.ApplicationLayer.write(ApplicationLayer.java:157)
              at org.jenkinsci.remoting.protocol.impl.ChannelApplicationLayer.start(ChannelApplicationLayer.java:230)
              at org.jenkinsci.remoting.protocol.ProtocolStack.init(ProtocolStack.java:201)
              at org.jenkinsci.remoting.protocol.ProtocolStack.access$700(ProtocolStack.java:106)
              at org.jenkinsci.remoting.protocol.ProtocolStack$Builder.build(ProtocolStack.java:554)
              at org.jenkinsci.remoting.engine.JnlpProtocol4Handler.handle(JnlpProtocol4Handler.java:153)
              at jenkins.slaves.JnlpSlaveAgentProtocol4.handle(JnlpSlaveAgentProtocol4.java:203)
              at hudson.TcpSlaveAgentListener$ConnectionHandler.run(TcpSlaveAgentListener.java:271) 

      These were followed by logs from nodes created by the Jenkins Kubernetes Plugin:

      SEVERE: http://jenkins-master.example.com/ provided port:50000 is not reachable
      java.io.IOException: http://jenkins-master.example.com/ provided port:50000 is not reachable
              at org.jenkinsci.remoting.engine.JnlpAgentEndpointResolver.resolve(JnlpAgentEndpointResolver.java:287)
              at hudson.remoting.Engine.innerRun(Engine.java:523)
              at hudson.remoting.Engine.run(Engine.java:474)
       

      Changing the JNLP port from 50000 to 50001 and back in the Jenkins settings restored connectivity, after which nodes were able to connect to the master again.

          Activity

            jsuchocki Janek Suchocki added a comment -

            Hi basil and allan_burdajewicz, I've seen that JENKINS-70334 was not included in yesterday's 2.353.3 LTS release. Is there some hesitance about backporting it to LTS because this issue (59910) is still open?
            basil Basil Crow added a comment -

            Not that I know of. Questions about backporting should be directed to the Release Lead and/or Release Officer.

            basil Basil Crow added a comment -

            Maybe that is the scenario that causes this: the AckFilterLayer (or any filter layer) is removed before the end of ProtocolStack#start?

            It's normal, as this logging demonstrates:

            Feb 09, 2023 1:52:19 PM org.jenkinsci.remoting.engine.WorkDirManager initializeWorkDir
            INFO: Using /tmp/remote/remoting as a remoting work directory
            Feb 09, 2023 1:52:19 PM org.jenkinsci.remoting.engine.WorkDirManager setupLogging
            INFO: Both error and output logs will be printed to /tmp/remote/remoting
            Feb 09, 2023 1:52:19 PM hudson.remoting.jnlp.Main createEngine
            INFO: Setting up agent: test
            Feb 09, 2023 1:52:19 PM hudson.remoting.Engine startEngine
            INFO: Using Remoting version: 999999-SNAPSHOT (private-02/09/2023 21:51 GMT-basil)
            Feb 09, 2023 1:52:19 PM org.jenkinsci.remoting.engine.WorkDirManager initializeWorkDir
            INFO: Using /tmp/remote/remoting as a remoting work directory
            Feb 09, 2023 1:52:19 PM hudson.remoting.jnlp.Main$CuiListener status
            INFO: Locating server among [http://127.0.0.1/]
            Feb 09, 2023 1:52:19 PM org.jenkinsci.remoting.engine.JnlpAgentEndpointResolver resolve
            INFO: Remoting server accepts the following protocols: [JNLP4-connect, Ping]
            Feb 09, 2023 1:52:19 PM hudson.remoting.jnlp.Main$CuiListener status
            INFO: Agent discovery successful
              Agent address: 127.0.0.1
              Agent port:    59100
              Identity:      34:ee:ff:fd:a6:3e:ed:fc:aa:76:9d:4b:aa:d6:e1:a0
            Feb 09, 2023 1:52:19 PM hudson.remoting.jnlp.Main$CuiListener status
            INFO: Handshaking
            Feb 09, 2023 1:52:19 PM hudson.remoting.jnlp.Main$CuiListener status
            INFO: Connecting to 127.0.0.1:59100
            Feb 09, 2023 1:52:19 PM hudson.remoting.jnlp.Main$CuiListener status
            INFO: Trying protocol: JNLP4-connect
            Successfully verified list of size 6
            org.jenkinsci.remoting.protocol.impl.BIONetworkLayer
            org.jenkinsci.remoting.protocol.impl.AgentProtocolClientFilterLayer
            org.jenkinsci.remoting.protocol.impl.AckFilterLayer
            org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer
            org.jenkinsci.remoting.protocol.impl.ConnectionHeadersFilterLayer
            org.jenkinsci.remoting.protocol.impl.ChannelApplicationLayer
            […]
            Feb 09, 2023 1:52:20 PM org.jenkinsci.remoting.protocol.impl.BIONetworkLayer$Reader run
            INFO: Waiting for ProtocolStack to start.
            Successfully verified list of size 6
            org.jenkinsci.remoting.protocol.impl.BIONetworkLayer
            org.jenkinsci.remoting.protocol.impl.AgentProtocolClientFilterLayer
            org.jenkinsci.remoting.protocol.impl.AckFilterLayer
            org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer
            org.jenkinsci.remoting.protocol.impl.ConnectionHeadersFilterLayer
            org.jenkinsci.remoting.protocol.impl.ChannelApplicationLayer
            […]
            Successfully verified list of size 5
            org.jenkinsci.remoting.protocol.impl.BIONetworkLayer
            org.jenkinsci.remoting.protocol.impl.AckFilterLayer
            org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer
            org.jenkinsci.remoting.protocol.impl.ConnectionHeadersFilterLayer
            org.jenkinsci.remoting.protocol.impl.ChannelApplicationLayer
            […]
            Successfully verified list of size 4
            org.jenkinsci.remoting.protocol.impl.BIONetworkLayer
            org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer
            org.jenkinsci.remoting.protocol.impl.ConnectionHeadersFilterLayer
            org.jenkinsci.remoting.protocol.impl.ChannelApplicationLayer
            Feb 09, 2023 1:52:22 PM hudson.remoting.jnlp.Main$CuiListener status
            INFO: Remote identity confirmed: 34:ee:ff:fd:a6:3e:ed:fc:aa:76:9d:4b:aa:d6:e1:a0
            Successfully verified list of size 4
            org.jenkinsci.remoting.protocol.impl.BIONetworkLayer
            org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer
            org.jenkinsci.remoting.protocol.impl.ConnectionHeadersFilterLayer
            org.jenkinsci.remoting.protocol.impl.ChannelApplicationLayer
            […]
            Successfully verified list of size 3
            org.jenkinsci.remoting.protocol.impl.BIONetworkLayer
            org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer
            org.jenkinsci.remoting.protocol.impl.ChannelApplicationLayer
            Feb 09, 2023 1:52:22 PM hudson.remoting.jnlp.Main$CuiListener status
            INFO: Connected
            Successfully verified list of size 3
            org.jenkinsci.remoting.protocol.impl.BIONetworkLayer
            org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer
            org.jenkinsci.remoting.protocol.impl.ChannelApplicationLayer
            […]
            

            In other words AgentProtocolClientFilterLayer is first to be removed, followed by AckFilterLayer, followed by ConnectionHeadersFilterLayer. Of the filter layers, the only one that remains by the end is SSLEngineFilterLayer.

            This could be done by writing a list validation method that walks the whole doubly-linked list and validates that each previous and next pointer is as expected (i.e., as drawn in https://github.com/jenkinsci/remoting/blob/master/docs/protocols.md modulo any node removals). The latter may or may not be easier to build on top of https://github.com/jenkinsci/remoting/pull/615 which simplifies the linked list logic considerably.

            https://github.com/jenkinsci/remoting/commit/9f6785c250d03881aab8dbce4f3b7f805c1f87c3 (implemented on top of the linked list simplification in https://github.com/jenkinsci/remoting/commit/d65debc3f5409c40f0831d6445b4f932cc335965) is an example of what I mean. The current linked list logic in trunk is so bad that 9f6785c250d03881aab8dbce4f3b7f805c1f87c3 fails right away (reproducibly) when applied to trunk (without d65debc3f5409c40f0831d6445b4f932cc335965). But d65debc3f5409c40f0831d6445b4f932cc335965 + 9f6785c250d03881aab8dbce4f3b7f805c1f87c3 passes in my local testing. When I originally saw this, I had great hope that I had fixed the problem, but running d65debc3f5409c40f0831d6445b4f932cc335965 in a loop in https://github.com/jenkinsci/jenkins/pull/7567 still failed. At that point I surmised the next step was to run d65debc3f5409c40f0831d6445b4f932cc335965 + 9f6785c250d03881aab8dbce4f3b7f805c1f87c3 in a loop and see where the verification routine trips, but I lost the will to keep going.
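For readers who do not want to open the commits, the idea of a list validation routine can be sketched roughly as follows. This is not the code from the commits linked above: `SketchNode` and `SketchListValidator` are hypothetical names, and the real `ProtocolStack.Ptr` has different fields and locking. The sketch only shows walking a doubly-linked list from the head and checking that every `prev` pointer agrees with the node actually visited before it.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical node type for illustration; not the real ProtocolStack.Ptr. */
final class SketchNode {
    final String name;
    SketchNode prev;
    SketchNode next;

    SketchNode(String name) {
        this.name = name;
    }
}

final class SketchListValidator {
    /** Builds a doubly-linked list from the given names and returns its head. */
    static SketchNode link(String... names) {
        SketchNode head = null;
        SketchNode tail = null;
        for (String n : names) {
            SketchNode node = new SketchNode(n);
            if (head == null) {
                head = node;
            } else {
                tail.next = node;
                node.prev = tail;
            }
            tail = node;
        }
        return head;
    }

    /** Unlinks a node by rewiring both of its neighbours. */
    static void remove(SketchNode node) {
        if (node.prev != null) {
            node.prev.next = node.next;
        }
        if (node.next != null) {
            node.next.prev = node.prev;
        }
        node.prev = null;
        node.next = null;
    }

    /**
     * Walks the list from the head and returns the node names in order.
     * Throws IllegalStateException if any prev pointer disagrees with the
     * node that was visited immediately before it.
     */
    static List<String> verify(SketchNode head) {
        if (head == null) {
            throw new IllegalStateException("empty list");
        }
        List<String> names = new ArrayList<>();
        SketchNode expectedPrev = null;
        for (SketchNode cur = head; cur != null; cur = cur.next) {
            if (cur.prev != expectedPrev) {
                throw new IllegalStateException("bad prev pointer at " + cur.name);
            }
            names.add(cur.name);
            expectedPrev = cur;
        }
        return names;
    }
}
```

Calling `verify` after every removal, as in the "Successfully verified list of size N" lines in the log above, would flag the first point at which the pointers stop agreeing.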

            basil Basil Crow added a comment -

            A different idea would be to completely delete the problematic Ptr class, instead representing the protocol stack as two (synchronized) lists of ProtocolLayers (one for the send direction and one for the receive direction) in ProtocolStack. The existing methods in Ptr could all be reimplemented in terms of these lists with much less code. The main functionality in Ptr is to allow one ProtocolLayer to pass control to the next one (in either the send or receive direction), to allow a ProtocolLayer to be closed in one direction first and then the other, and to notify the ProtocolLayer when it has been closed in both directions. This could all be implemented in ProtocolStack with standard Java functionality like ArrayList#indexOf without the need for all the complexity of Ptr. I suspect this approach would have the greatest likelihood of success. But it is also the most work, tantamount to reimplementing the whole class from scratch.
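To make the list-based idea concrete, here is a rough sketch under stated assumptions. It is not the remoting API and not the code in any linked branch: `SketchLayer` and `ListBackedStack` are hypothetical names, layers are reduced to a name and a flag, and a single monitor stands in for whatever locking the real class would need. It shows passing control to the next layer via `indexOf`, closing one direction at a time, and notifying a layer once it is closed in both directions.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a protocol layer; not the remoting ProtocolLayer. */
final class SketchLayer {
    final String name;
    boolean fullyClosed; // set once the layer is closed in both directions

    SketchLayer(String name) {
        this.name = name;
    }
}

final class ListBackedStack {
    // Both lists are ordered from the network layer (index 0)
    // to the application layer (last index).
    private final List<SketchLayer> sendPath = new ArrayList<>();
    private final List<SketchLayer> recvPath = new ArrayList<>();

    ListBackedStack(List<SketchLayer> layers) {
        sendPath.addAll(layers);
        recvPath.addAll(layers);
    }

    /** Next layer toward the network for outgoing data, or null if there is none. */
    synchronized SketchLayer nextSend(SketchLayer from) {
        int i = sendPath.indexOf(from);
        return i > 0 ? sendPath.get(i - 1) : null;
    }

    /** Next layer toward the application for incoming data, or null if there is none. */
    synchronized SketchLayer nextRecv(SketchLayer from) {
        int i = recvPath.indexOf(from);
        return (i >= 0 && i < recvPath.size() - 1) ? recvPath.get(i + 1) : null;
    }

    /** Removes the layer from the send direction only. */
    synchronized void closeSend(SketchLayer layer) {
        if (sendPath.remove(layer) && !recvPath.contains(layer)) {
            layer.fullyClosed = true; // closed in both directions: notify
        }
    }

    /** Removes the layer from the receive direction only. */
    synchronized void closeRecv(SketchLayer layer) {
        if (recvPath.remove(layer) && !sendPath.contains(layer)) {
            layer.fullyClosed = true; // closed in both directions: notify
        }
    }
}
```

The trade-off is a linear `indexOf` lookup on every hop instead of a pointer dereference; with a stack of at most six layers, as in the log above, that cost is likely negligible.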

            basil Basil Crow added a comment -

            A different idea would be to completely delete the problematic Ptr class, instead representing the protocol stack as two (synchronized) lists of ProtocolLayers (one for the send direction and one for the receive direction) in ProtocolStack. The existing methods in Ptr could all be reimplemented in terms of these lists with much less code.

            https://github.com/basil/remoting/tree/rewrite is a sketch of what a complete rewrite of this class could look like. The result is 200 lines shorter than the original while also being simpler and easier to reason about: a simpler locking scheme & no custom linked list implementation (to name a few simplifications). It is not quite as optimized as the original code, but I suspect the original code was prematurely optimized and that this rewrite is likely good enough or close to it. It seems to hold up fine to local testing, though I am a bit too afraid to try JnlpSlaveRestarterInstallerTest#tcpReconnection in a loop with this code.


            People

              allan_burdajewicz Allan BURDAJEWICZ
              oxygenxo Andrey Babushkin
              Votes: 19
              Watchers: 36