JENKINS-70334

When TcpSlaveAgentListener dies it is not restarted

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Major
    • core
    • core:2.375.1
    • 2.388

      When the TCP Agent listener crashes, it prints:

      2022-12-22 01:29:20.541+0000 [id=632]	WARNING	hudson.TcpSlaveAgentListener$1#run: Connection handler failed, restarting listener
      

      However, it never seems to be restarted.

      How to Reproduce

      (thanks duemir for providing those steps)

      It can be reproduced with a debugger (at least with IntelliJ's):

      • Start a 2.375.1 Jenkins instance with the debugger enabled
      • Prepare the IDE
      • Check out tag jenkins-2.375.1 from the jenkinsci/jenkins repo
      • Open hudson.TcpSlaveAgentListener
      • Set a breakpoint somewhere in the run method of the ConnectionHandler (line 279)
      • Enable the TCP port
      • Set up an inbound agent and test that it connects
      • Connect the debugger to the controller
      • When the breakpoint is hit, use the "Throw Exception" action to throw an exception that is not handled in run, e.g. new IllegalStateException("BOOM")

      As a result, something similar to the lines below should be printed in the controller logs:

      2022-12-22 01:29:20.540+0000 [id=632]	SEVERE	h.TcpSlaveAgentListener$ConnectionHandler#lambda$new$0: Uncaught exception in TcpSlaveAgentListener ConnectionHandler Thread[TCP agent connection handler #6 with /127.0.0.1:61392,5,main]
      java.lang.IllegalStateException: BOOM
      	at hudson.TcpSlaveAgentListener$ConnectionHandler.run(TcpSlaveAgentListener.java:280)
      2022-12-22 01:29:20.541+0000 [id=632]	WARNING	hudson.TcpSlaveAgentListener$1#run: Connection handler failed, restarting listener
      java.lang.IllegalStateException: BOOM
      	at hudson.TcpSlaveAgentListener$ConnectionHandler.run(TcpSlaveAgentListener.java:280)
      

      You may need to fiddle with the breakpoint a bit. It did not work the first time for me for some reason: I ended up with a breakpoint that suspended only the thread, and I had to run the "Throw Exception" action twice.

      Expected: As the log says, the TCP Agent listener is restarted after the crash.
      Actual: It is not restarted. The port is not up, and agents cannot connect.

      The workaround is to disable the port and then enable it again.
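
The expected behaviour (a failed connection handler leads to the listening socket being closed and reopened on the same port) can be sketched in plain Java. This is only an illustration under stated assumptions, not the Jenkins implementation: the class RestartableListener and its methods onHandlerFailure, isListening, getRestarts and shutdown are hypothetical names, and the sketch binds to the loopback address only.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.ServerSocket;

/** Hypothetical stand-in for a TCP listener; this is NOT hudson.TcpSlaveAgentListener. */
final class RestartableListener {
    private final int port;
    private ServerSocket serverSocket;
    private int restarts;

    RestartableListener(int requestedPort) {
        this.serverSocket = open(requestedPort); // 0 lets the OS pick a free port
        this.port = serverSocket.getLocalPort(); // remember the port for later restarts
    }

    private static ServerSocket open(int port) {
        try {
            ServerSocket socket = new ServerSocket();
            socket.setReuseAddress(true); // allow rebinding the same port right away
            socket.bind(new InetSocketAddress(InetAddress.getLoopbackAddress(), port));
            return socket;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Called when a connection handler fails: close the old socket, bind a new one. */
    synchronized void onHandlerFailure(Throwable cause) {
        System.err.println("Connection handler failed, restarting listener: " + cause);
        closeQuietly();
        serverSocket = open(port);
        restarts++;
    }

    synchronized boolean isListening() {
        return serverSocket.isBound() && !serverSocket.isClosed();
    }

    int getPort() {
        return port;
    }

    synchronized int getRestarts() {
        return restarts;
    }

    synchronized void shutdown() {
        closeQuietly();
    }

    private void closeQuietly() {
        try {
            serverSocket.close();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        RestartableListener listener = new RestartableListener(0);
        listener.onHandlerFailure(new IllegalStateException("BOOM"));
        System.out.println("listening=" + listener.isListening() + " restarts=" + listener.getRestarts());
        listener.shutdown();
    }
}
```

In this sketch the port stays reachable after a handler failure, which is the behaviour the log message promises; the bug reported here is that the real listener logged the restart but the port stayed down.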

          [JENKINS-70334] When TcpSlaveAgentListener dies it is not restarted

          Joerg Schwaerzler added a comment - - edited

          For us the fix does not seem to work. We updated to Jenkins 2.387.2 to get a fix for our JNLP nodes randomly not being able to connect with an error like:

          SEVERE: https://our.jenkins/ provided port:34047 is not reachable on host our.jenkins
          java.io.IOException: https://sdk90-jenkinstest.vih.infineon.com/ provided port:34047 is not reachable on host our.jenkins
          	at org.jenkinsci.remoting.engine.JnlpAgentEndpointResolver.resolve(JnlpAgentEndpointResolver.java:303)
          	at hudson.remoting.Engine.innerRun(Engine.java:755)
          	at hudson.remoting.Engine.run(Engine.java:543)
          

          Now when I do the steps to reproduce the issue mentioned in this ticket, I get exactly that JNLP-agents-cannot-connect issue (after disconnecting an agent).

          The mentioned workaround to disable (or change) the port number still works for us.

          To summarize:

          • After throwing the exception as mentioned in the PR, we cannot connect any JNLP agents until changing the port number
          • The 'TCP agent listener port=xxxxx' is not running until changing the port number

          Am I possibly missing anything?

          Joerg Schwaerzler added a comment -

          Maybe I am using the wrong method to reproduce the issue?

          jenkins.model.Jenkins.get().tcpSlaveAgentListener.getUncaughtExceptionHandler().uncaughtException(jenkins.model.Jenkins.get().tcpSlaveAgentListener, new UnsupportedOperationException("Test"));
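
For context, in plain Java a call of this shape only runs the handler's code on the calling thread; it does not throw anything inside the target thread, so that thread keeps running. The sketch below demonstrates this general behaviour with a hypothetical worker thread. DirectHandlerCallDemo and aliveAfterDirectCall are illustrative names, and nothing here uses Jenkins classes, so what the real handler does when invoked this way is not shown.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical demo: invoking an UncaughtExceptionHandler by hand does not kill the thread. */
final class DirectHandlerCallDemo {

    /** Returns true if the worker is still alive after its handler was invoked directly. */
    static boolean aliveAfterDirectCall(AtomicInteger handlerCalls) {
        CountDownLatch stop = new CountDownLatch(1);
        Thread worker = new Thread(() -> {
            try {
                stop.await(); // stand-in for a listener blocked in its accept loop
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, "hypothetical-listener");
        worker.setUncaughtExceptionHandler((thread, error) -> handlerCalls.incrementAndGet());
        worker.start();

        // Same shape as the script console call: the handler runs here, on the caller's thread.
        worker.getUncaughtExceptionHandler()
              .uncaughtException(worker, new UnsupportedOperationException("Test"));

        boolean alive = worker.isAlive(); // the worker never saw the exception
        stop.countDown();                 // let the worker finish normally
        try {
            worker.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return alive;
    }

    public static void main(String[] args) {
        AtomicInteger calls = new AtomicInteger();
        boolean alive = aliveAfterDirectCall(calls);
        System.out.println("handlerCalls=" + calls.get() + " workerStillAlive=" + alive);
    }
}
```

The handler is invoked once, yet the worker stays alive, so a reproduction based on calling the handler directly exercises a different path than an exception actually thrown inside a connection handler.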

          Allan BURDAJEWICZ added a comment -

          macdrega We actually do not restart on an uncaught exception from the parent thread, as discussed in the PR: https://github.com/jenkinsci/jenkins/pull/7547#issuecomment-1375189467

          Are you able to capture the actual exception that is killing the TCP Agent Listener? (Essentially, look for TcpSlaveAgentListener in your Jenkins logs.)

          Joerg Schwaerzler added a comment -

          allan_burdajewicz By "actual exception" you are not referring to the exception I used to reproduce the issue, are you?

          • Last time, on our production Jenkins, the TcpSlaveAgentListener was killed by the same exception as in the linked ticket, JENKINS-59910.
          • On the test instance (where we are already running Jenkins 2.387.2) I cannot (and could not) reproduce this easily, unfortunately.

          Allan BURDAJEWICZ added a comment -

          Right. I am not referring to the exception used to reproduce the problem. That reproduction script was used initially because anything that wasn't an IOException or InterruptedException was uncaught. Now we catch Throwable, so we should not expect to reach the uncaught exception handler.
          If the thread died, there should be logs and/or a stack trace mentioning TcpSlaveAgentListener in the log.
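
The difference between the two catch scopes can be shown with a small standalone Java sketch. It is an illustration of the general language behaviour, not the Jenkins code: CatchScopeDemo, runWith and failingHandler are hypothetical names, and the narrow variant catches only IOException for brevity.

```java
import java.io.IOException;
import java.util.concurrent.atomic.AtomicReference;

/** Hypothetical demo of narrow vs. broad catch scope inside a thread's run method. */
final class CatchScopeDemo {

    /** Stand-in for a connection handler that fails with an unchecked exception. */
    static void failingHandler() throws IOException {
        throw new IllegalStateException("BOOM");
    }

    /** Runs the failing handler on a thread and reports where the exception ended up. */
    static String runWith(boolean catchThrowable) {
        AtomicReference<String> outcome = new AtomicReference<>("completed");
        Thread thread = new Thread(() -> {
            if (catchThrowable) {
                try {
                    failingHandler();
                } catch (Throwable e) {
                    // Broad catch: the failure is handled (and can be logged) in place.
                    outcome.set("caught in run");
                }
            } else {
                try {
                    failingHandler();
                } catch (IOException e) {
                    // Narrow catch: an IllegalStateException does not match and escapes.
                    outcome.set("caught in run");
                }
            }
        });
        // Reached only when an exception escapes run() and terminates the thread.
        thread.setUncaughtExceptionHandler((t, e) -> outcome.set("uncaught"));
        thread.start();
        try {
            thread.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return outcome.get();
    }

    public static void main(String[] args) {
        System.out.println("narrow catch: " + runWith(false));
        System.out.println("catch Throwable: " + runWith(true));
    }
}
```

With the narrow catch the thread dies and only the uncaught exception handler sees the failure; with catch Throwable the same failure is handled inside run and the handler is never reached.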

          Joerg Schwaerzler added a comment -

          I believe that the stack trace is essentially the same as in JENKINS-59910. Nevertheless, I am pasting a copy of it here.
          Please note that this stack trace is taken from Jenkins 2.375.3.

          Uncaught exception in TcpSlaveAgentListener ConnectionHandler Thread[TCP agent connection handler #2795266 with /10.162.132.16:60750,5,main]
          java.lang.UnsupportedOperationException: Network layer is not supposed to call isSendOpen
          	at org.jenkinsci.remoting.protocol.ProtocolStack$Ptr.isSendOpen(ProtocolStack.java:739)
          	at org.jenkinsci.remoting.protocol.FilterLayer.isSendOpen(FilterLayer.java:343)
          	at org.jenkinsci.remoting.protocol.ProtocolStack$Ptr.isSendOpen(ProtocolStack.java:747)
          	at org.jenkinsci.remoting.protocol.FilterLayer.isSendOpen(FilterLayer.java:343)
          	at org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer.isSendOpen(SSLEngineFilterLayer.java:233)
          	at org.jenkinsci.remoting.protocol.ProtocolStack$Ptr.doSend(ProtocolStack.java:699)
          	at org.jenkinsci.remoting.protocol.impl.ConnectionHeadersFilterLayer.doSend(ConnectionHeadersFilterLayer.java:474)
          	at org.jenkinsci.remoting.protocol.impl.ConnectionHeadersFilterLayer.start(ConnectionHeadersFilterLayer.java:138)
          	at org.jenkinsci.remoting.protocol.ProtocolStack.init(ProtocolStack.java:209)
          	at org.jenkinsci.remoting.protocol.ProtocolStack$Builder.build(ProtocolStack.java:563)
          	at org.jenkinsci.remoting.engine.JnlpProtocol4Handler.handle(JnlpProtocol4Handler.java:156)
          	at jenkins.slaves.JnlpSlaveAgentProtocol4.handle(JnlpSlaveAgentProtocol4.java:196)
          	at hudson.TcpSlaveAgentListener$ConnectionHandler.run(TcpSlaveAgentListener.java:282)

          I am not sure what has been fixed in the PR. I would have expected the thread to be restarted no matter which exception killed it.

          Allan BURDAJEWICZ added a comment -

          We really need to see what happens in 2.387.1+. The isSendOpen exception may still happen, but it will not be an "Uncaught exception" and the TCP Agent Listener socket will be restarted.

          Joerg Schwaerzler added a comment -

          I see.
          However, I was wondering why we would not want to have the thread restarted in the case of an "Uncaught exception".
          I have to admit that it was not easy for me to follow all the discussions in the PR.
          Does it mean that with 2.387.1+ there will be no more uncaught exceptions in the TcpSlaveAgentListener thread?

          I might have found a way to reproduce the isSendOpen exception. I will have to check. It could be related to a Java 8 JNLP Kubernetes client trying to connect to the Java 11 Jenkins master.

          Joerg Schwaerzler added a comment -

          allan_burdajewicz As I am currently not able to see the isSendOpen exception on 2.378.2: Could it be that, because of the change, the exception no longer appears in the logs because it is caught earlier?

          Allan BURDAJEWICZ added a comment -

          > Does it mean that with 2.387.1+ there will be no more uncaught exceptions in the TcpSlaveAgentListener thread?

          Kind of. Given that we catch Throwable, we don't expect to face uncaught exceptions. I am not sure in what scenario in Java we would still get there; basil may have an answer to that.

          > As I am currently not able to see the isSendOpen exception on 2.378.2: Could it be that because of the change the exception may no longer appear in the logs as it has been caught earlier?

          It should still be logged in the catch blocks. The exception was usually happening in the ConnectionHandler:

          https://github.com/jenkinsci/jenkins/blob/jenkins-2.387.2/core/src/main/java/hudson/TcpSlaveAgentListener.java#L287
          https://github.com/jenkinsci/jenkins/blob/jenkins-2.387.2/core/src/main/java/hudson/TcpSlaveAgentListener.java#L294-L298

          In the rare case where it might happen in the Agent Listener thread itself:

          https://github.com/jenkinsci/jenkins/blob/jenkins-2.387.2/core/src/main/java/hudson/TcpSlaveAgentListener.java#L192-L201

          There is one case where it would not be logged, and that is when the Agent Listener is being shut down (that would only happen when Jenkins is shutting down or when you are changing the port through the UI):

          https://github.com/jenkinsci/jenkins/blob/jenkins-2.387.2/core/src/main/java/hudson/TcpSlaveAgentListener.java#L191
          https://github.com/jenkinsci/jenkins/blob/jenkins-2.387.2/core/src/main/java/jenkins/model/Jenkins.java#L1333-L1334

          Joerg Schwaerzler added a comment -

          Thanks for the explanation.
          In that case I will try to downgrade our test instance to see whether we can reproduce the issue there.

          FYI: On the production instance it really looks like the issue is caused by Java 8 JNLP images. I will post that in the linked ticket, too.

          Allan BURDAJEWICZ added a comment -

          macdrega any update?

          Rahali added a comment -

          macdrega any update on this issue, please?

          Joerg Schwaerzler added a comment - - edited

          We fully migrated to Java 11 and do not see this issue anymore. Currently we are running 2.401.3.
          Sorry for the late response.

            allan_burdajewicz Allan BURDAJEWICZ
            allan_burdajewicz Allan BURDAJEWICZ