[JENKINS-70334] When TcpSlaveAgentListener dies it is not restarted

Type: Bug
Resolution: Unresolved
Priority: Major
Component/s: core
Labels:
- 2.387.1-fixed
Environment:
core:2.375.1

Released As:
2.388

When the TCP Agent listener crashes, it prints:

2022-12-22 01:29:20.541+0000 [id=632]	WARNING	hudson.TcpSlaveAgentListener$1#run: Connection handler failed, restarting listener

However, it never seems to be restarted.

How to Reproduce

(thanks duemir for providing those steps)

It can be reproduced with a debugger (at least an IntelliJ one)

Start a 2.375.1 Jenkins instance with the debugger enabled
Prepare the IDE
Check out tag jenkins-2.375.1 from the jenkinsci/jenkins repo
Open hudson.TcpSlaveAgentListener
Set a breakpoint somewhere in the run method of the ConnectionHandler (line 279)
Enable the TCP port
Set up an inbound agent and test that it connects
Connect the debugger to the controller
Use the debugger's "Throw Exception" action to throw an exception that is not handled in run, e.g. new IllegalStateException("BOOM")

As a result, something similar to the lines below should be printed in the controller logs

2022-12-22 01:29:20.540+0000 [id=632]	SEVERE	h.TcpSlaveAgentListener$ConnectionHandler#lambda$new$0: Uncaught exception in TcpSlaveAgentListener ConnectionHandler Thread[TCP agent connection handler #6 with /127.0.0.1:61392,5,main]
java.lang.IllegalStateException: BOOM
	at hudson.TcpSlaveAgentListener$ConnectionHandler.run(TcpSlaveAgentListener.java:280)
2022-12-22 01:29:20.541+0000 [id=632]	WARNING	hudson.TcpSlaveAgentListener$1#run: Connection handler failed, restarting listener
java.lang.IllegalStateException: BOOM
	at hudson.TcpSlaveAgentListener$ConnectionHandler.run(TcpSlaveAgentListener.java:280)

You may need to fiddle a bit with the breakpoint. It didn't work the first time for me for some reason: I ended up with a breakpoint that only suspended a Thread, and I had to do the "Throw exception" action twice.

Expected: As the log says, the TCP Agent listener is restarted after the crash
Actual: It is not. The Port is not up, and agents cannot connect.

The workaround is to disable the port and then enable it again.
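
To confirm the "Actual" state from outside the controller, the agent port can be probed directly. The following is a minimal sketch using only the JDK. The class name AgentPortProbe, the default host, the example port 50000 and the timeout are illustrative assumptions, not part of Jenkins.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

/** Hypothetical helper (not part of Jenkins): checks whether a TCP port accepts connections. */
final class AgentPortProbe {

    /** Returns true if a TCP connection to host:port succeeds within timeoutMillis. */
    static boolean isListening(String host, int port, int timeoutMillis) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            return true;
        } catch (IOException e) {
            // Connection refused or timed out: nothing is accepting on that port.
            return false;
        }
    }

    public static void main(String[] args) {
        // Usage: java AgentPortProbe.java <host> <port>
        String host = args.length > 0 ? args[0] : "localhost";
        // 50000 is only an example value; use the port configured on your controller.
        int port = args.length > 1 ? Integer.parseInt(args[1]) : 50000;
        System.out.println(host + ":" + port + " listening: " + isListening(host, port, 2000));
    }
}
```

Running the probe before and after toggling the port should show whether the listener socket came back.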

is related to: JENKINS-59910 Java 11 agent disconnection: UnsupportedOperationException from ProtocolStack$Ptr.isSendOpen (Open)

links to: jenkins #7547

Joerg Schwaerzler added a comment - 2023-04-14 08:38 - edited

For us the fix does not seem to work. We updated to Jenkins 2.387.2 to get a fix for our JNLP nodes randomly not being able to connect with an error like:

SEVERE: https://our.jenkins/ provided port:34047 is not reachable on host our.jenkins
java.io.IOException: https://sdk90-jenkinstest.vih.infineon.com/ provided port:34047 is not reachable on host our.jenkins
	at org.jenkinsci.remoting.engine.JnlpAgentEndpointResolver.resolve(JnlpAgentEndpointResolver.java:303)
	at hudson.remoting.Engine.innerRun(Engine.java:755)
	at hudson.remoting.Engine.run(Engine.java:543)

Now when I do the steps to reproduce the issue mentioned in this ticket, I get exactly that JNLP-agents-cannot-connect issue (after disconnecting an agent).

The mentioned workaround to disable (or change) the port number still works for us.

To summarize:

After throwing the exception as mentioned in the PR, we cannot connect any JNLP agents until changing the port number
The 'TCP agent listener port=xxxxx' is not running until changing the port number

Am I possibly missing anything?

Joerg Schwaerzler added a comment - 2023-04-14 08:51

Maybe I am using the wrong method to reproduce the issue?

jenkins.model.Jenkins.get().tcpSlaveAgentListener.getUncaughtExceptionHandler().uncaughtException(jenkins.model.Jenkins.get().tcpSlaveAgentListener, new UnsupportedOperationException("Test"));
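
In plain Java, calling a thread's UncaughtExceptionHandler directly only runs the handler code. It does not make the target thread throw, so no catch block inside that thread is involved and the thread keeps running. The JDK-only sketch below shows that difference. It is not Jenkins code, and the class and variable names are hypothetical.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical demo, not Jenkins code. */
final class DirectHandlerCallDemo {

    /** Returns {handler invocations, catch-block hits, 1 if the thread was still alive afterwards}. */
    static int[] run() {
        AtomicInteger handlerCalls = new AtomicInteger();
        AtomicInteger catchHits = new AtomicInteger();
        CountDownLatch stop = new CountDownLatch(1);

        Thread worker = new Thread(() -> {
            try {
                stop.await(); // stands in for the thread's normal blocking work
            } catch (Throwable t) {
                catchHits.incrementAndGet(); // reached only if the thread itself throws
            }
        }, "demo-listener");
        worker.setUncaughtExceptionHandler((thread, error) -> handlerCalls.incrementAndGet());
        worker.start();

        // Same shape as the script above: another thread calls the handler directly.
        worker.getUncaughtExceptionHandler()
              .uncaughtException(worker, new UnsupportedOperationException("Test"));
        boolean stillAlive = worker.isAlive();

        stop.countDown(); // let the worker finish
        try {
            worker.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return new int[] {handlerCalls.get(), catchHits.get(), stillAlive ? 1 : 0};
    }

    public static void main(String[] args) {
        int[] r = run();
        System.out.println("handler calls=" + r[0] + ", catch hits=" + r[1]
                + ", thread alive=" + (r[2] == 1));
    }
}
```

The handler runs once, the catch block is never entered, and the worker thread is still alive after the call.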

Allan BURDAJEWICZ added a comment - 2023-04-17 12:03

macdrega We actually do not restart on uncaught exception from the parent thread as discussed in the PR https://github.com/jenkinsci/jenkins/pull/7547#issuecomment-1375189467.

Are you able to capture the actual exception that is killing the TCP Agent Listener (essentially, by looking for TcpSlaveAgentListener in your Jenkins logs)?
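
A sketch of such a log search, assuming the controller log is available as a plain-text file. The class name ListenerLogFilter and the default file name jenkins.log are hypothetical; the actual log location depends on the installation.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.stream.Collectors;

/** Hypothetical helper (not part of Jenkins): keeps only log lines that mention the listener class. */
final class ListenerLogFilter {

    /** Returns the lines containing "TcpSlaveAgentListener", including matching stack frames. */
    static List<String> linesMentioningListener(List<String> logLines) {
        return logLines.stream()
                .filter(line -> line.contains("TcpSlaveAgentListener"))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) throws IOException {
        // The default path is an assumption; pass the real log file as the first argument.
        Path log = Paths.get(args.length > 0 ? args[0] : "jenkins.log");
        linesMentioningListener(Files.readAllLines(log)).forEach(System.out::println);
    }
}
```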

Joerg Schwaerzler added a comment - 2023-04-17 12:15

allan_burdajewicz By actual exception, you are not referring to the exception I used to reproduce the issue, are you?

Last time on our production Jenkins, the TcpSlaveAgentListener kill was caused by the same exception as in the linked ticket: JENKINS-59910.
On the test instance (where we are already running Jenkins 2.387.2) I cannot (and could not) reproduce this easily, unfortunately.

Allan BURDAJEWICZ added a comment - 2023-04-17 21:57

Right. I am not referring to the exception used to reproduce the problem. This reproduction script was used initially to reproduce the problem because anything that wasn't an IOException or InterruptedException was uncaught. Now we catch Throwable, so we should not expect to hit an uncaught exception.
If the thread died, there should be logs and/or a stacktrace mentioning TcpSlaveAgentListener in the log.
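
The difference between the two catch scopes can be illustrated with a JDK-only sketch. It is not the Jenkins code: the class CatchScopeDemo, the Task interface and both runner methods are hypothetical. They only contrast a catch limited to IOException and InterruptedException with a catch of Throwable.

```java
import java.io.IOException;
import java.util.concurrent.atomic.AtomicReference;

/** Hypothetical demo, not Jenkins code. */
final class CatchScopeDemo {

    interface Task {
        void run() throws IOException, InterruptedException;
    }

    /** Narrow catch: anything other than IOException/InterruptedException escapes the thread. */
    static Throwable runWithNarrowCatch(Task task) {
        return runOnThread(() -> {
            try {
                task.run();
            } catch (IOException | InterruptedException e) {
                // handled here; other exception types are not
            }
        });
    }

    /** Broad catch: nothing thrown by the task reaches the uncaught exception handler. */
    static Throwable runWithBroadCatch(Task task) {
        return runOnThread(() -> {
            try {
                task.run();
            } catch (Throwable t) {
                // handled here; a real implementation would log and decide whether to restart
            }
        });
    }

    /** Runs body on a new thread and returns whatever reached its uncaught handler, or null. */
    private static Throwable runOnThread(Runnable body) {
        AtomicReference<Throwable> uncaught = new AtomicReference<>();
        Thread thread = new Thread(body, "demo-handler");
        thread.setUncaughtExceptionHandler((t, e) -> uncaught.set(e));
        thread.start();
        try {
            thread.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return uncaught.get();
    }

    public static void main(String[] args) {
        Task boom = () -> { throw new IllegalStateException("BOOM"); };
        System.out.println("narrow catch, uncaught: " + runWithNarrowCatch(boom));
        System.out.println("broad catch, uncaught: " + runWithBroadCatch(boom));
    }
}
```

With the narrow catch, the IllegalStateException reaches the uncaught exception handler. With the broad catch, the handler is never invoked.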

Joerg Schwaerzler added a comment - 2023-04-18 12:16

I believe that the stack trace is essentially the same as in JENKINS-59910. Nevertheless, I made a copy of the stack trace to paste it here:
Please note that this stack trace is taken from Jenkins 2.375.3.

Uncaught exception in TcpSlaveAgentListener ConnectionHandler Thread[TCP agent connection handler #2795266 with /10.162.132.16:60750,5,main]
java.lang.UnsupportedOperationException: Network layer is not supposed to call isSendOpen
	at org.jenkinsci.remoting.protocol.ProtocolStack$Ptr.isSendOpen(ProtocolStack.java:739)
	at org.jenkinsci.remoting.protocol.FilterLayer.isSendOpen(FilterLayer.java:343)
	at org.jenkinsci.remoting.protocol.ProtocolStack$Ptr.isSendOpen(ProtocolStack.java:747)
	at org.jenkinsci.remoting.protocol.FilterLayer.isSendOpen(FilterLayer.java:343)
	at org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer.isSendOpen(SSLEngineFilterLayer.java:233)
	at org.jenkinsci.remoting.protocol.ProtocolStack$Ptr.doSend(ProtocolStack.java:699)
	at org.jenkinsci.remoting.protocol.impl.ConnectionHeadersFilterLayer.doSend(ConnectionHeadersFilterLayer.java:474)
	at org.jenkinsci.remoting.protocol.impl.ConnectionHeadersFilterLayer.start(ConnectionHeadersFilterLayer.java:138)
	at org.jenkinsci.remoting.protocol.ProtocolStack.init(ProtocolStack.java:209)
	at org.jenkinsci.remoting.protocol.ProtocolStack$Builder.build(ProtocolStack.java:563)
	at org.jenkinsci.remoting.engine.JnlpProtocol4Handler.handle(JnlpProtocol4Handler.java:156)
	at jenkins.slaves.JnlpSlaveAgentProtocol4.handle(JnlpSlaveAgentProtocol4.java:196)
	at hudson.TcpSlaveAgentListener$ConnectionHandler.run(TcpSlaveAgentListener.java:282)

I am not sure what has been fixed in the PR. I would have expected that the thread would be restarted no matter which exception killed the thread?

Allan BURDAJEWICZ added a comment - 2023-04-18 23:34

We really need to see what happens in 2.387.1+. The isSendOpen exception may still happen, but it will not be an "Uncaught exception" and the TCP Agent Listener socket will be restarted.
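
As an illustration of "caught and restarted" (not the actual Jenkins implementation), here is a sketch of a supervising loop that re-runs a listener body whenever it fails with any Throwable. The class RestartingLoopDemo, the ListenerBody interface and the maxAttempts cap are hypothetical and exist only to keep the example bounded.

```java
/** Hypothetical demo, not Jenkins code. */
final class RestartingLoopDemo {

    /** A listener body that may fail; attempt is 1-based. */
    interface ListenerBody {
        void serve(int attempt) throws Exception;
    }

    /**
     * Runs body until it returns normally or maxAttempts is reached.
     * Returns the number of attempts that were made.
     */
    static int superviseUntilCleanExit(ListenerBody body, int maxAttempts) {
        int attempt = 0;
        while (attempt < maxAttempts) {
            attempt++;
            try {
                body.serve(attempt);
                return attempt; // clean exit, e.g. an orderly shutdown
            } catch (Throwable t) {
                // The failure is caught and logged, then the loop starts the body again.
                System.err.println("Listener failed on attempt " + attempt + ", restarting: " + t);
            }
        }
        return attempt;
    }

    public static void main(String[] args) {
        int attempts = superviseUntilCleanExit(attempt -> {
            if (attempt < 3) {
                throw new UnsupportedOperationException("simulated failure");
            }
        }, 5);
        System.out.println("attempts made: " + attempts);
    }
}
```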

Joerg Schwaerzler added a comment - 2023-04-20 14:33

I see.
However, I was wondering why we would not want to have the thread restarted in the case of an "Uncaught exception"?
I have to admit that for me it was not easy to follow all the discussions in the PR.
Does it mean that with 2.387.1+ there will be no more uncaught exceptions in the TcpSlaveAgentListener thread?

I might have found a way to reproduce the isSendOpen exception. Will have to check. It could be related to a Java 8 JNLP Kubernetes client trying to connect to the Java 11 Jenkins master.

Joerg Schwaerzler added a comment - 2023-04-20 15:25

allan_burdajewicz As I am currently not able to see the isSendOpen exception on 2.378.2: Could it be that because of the change the exception may no longer appear in the logs as it has been caught earlier?

Allan BURDAJEWICZ added a comment - 2023-04-21 06:40

> Does it mean that with 2.387.1+ there will be no more uncaught exceptions in the TcpSlaveAgentListener thread?

Kind of. Given that we catch Throwable, we don't expect to face uncaught exceptions. I am not sure in what scenario in Java we would still get there; basil may have an answer to that.

> As I am currently not able to see the isSendOpen exception on 2.378.2: Could it be that because of the change the exception may no longer appear in the logs as it has been caught earlier?

It should still be logged in the catch blocks. The exception was usually happening in the ConnectionHandler:

https://github.com/jenkinsci/jenkins/blob/jenkins-2.387.2/core/src/main/java/hudson/TcpSlaveAgentListener.java#L287
https://github.com/jenkinsci/jenkins/blob/jenkins-2.387.2/core/src/main/java/hudson/TcpSlaveAgentListener.java#L294-L298

In the rare case where it might happen in the Agent Listener thread itself:

https://github.com/jenkinsci/jenkins/blob/jenkins-2.387.2/core/src/main/java/hudson/TcpSlaveAgentListener.java#L192-L201

There is one case where it would not be logged, and that is when the Agent Listener is being shut down (that would only happen when Jenkins is shutting down or when you are changing the port through the UI):

https://github.com/jenkinsci/jenkins/blob/jenkins-2.387.2/core/src/main/java/hudson/TcpSlaveAgentListener.java#L191
https://github.com/jenkinsci/jenkins/blob/jenkins-2.387.2/core/src/main/java/jenkins/model/Jenkins.java#L1333-L1334
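
The "not logged during shutdown" case can be sketched with a flag that is set before the server socket is closed. This is a JDK-only illustration under assumptions, not the linked Jenkins source: the class QuietShutdownListener, its logged list (a stand-in for a logger) and the simulateFailure method are hypothetical.

```java
import java.io.IOException;
import java.net.ServerSocket;
import java.net.Socket;
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

/** Hypothetical demo, not Jenkins code. */
final class QuietShutdownListener extends Thread {
    private final ServerSocket serverSocket;
    private volatile boolean shuttingDown;
    /** Stand-in for the log; a real listener would use a logger. */
    final List<String> logged = new CopyOnWriteArrayList<>();

    QuietShutdownListener() throws IOException {
        super("demo-listener");
        serverSocket = new ServerSocket(0); // ephemeral port
    }

    @Override
    public void run() {
        try {
            while (!shuttingDown) {
                try (Socket connection = serverSocket.accept()) {
                    // a real listener would hand the connection to a handler thread
                }
            }
        } catch (IOException e) {
            // Closing the socket makes accept() throw; stay quiet if that was intentional.
            if (!shuttingDown) {
                logged.add("Failed to accept connections: " + e);
            }
        }
    }

    /** Orderly shutdown: set the flag first, then close the socket to unblock accept(). */
    void shutdown() throws IOException {
        shuttingDown = true;
        serverSocket.close();
    }

    /** Simulates an unexpected failure: the socket dies without the flag being set. */
    void simulateFailure() throws IOException {
        serverSocket.close();
    }
}
```

An orderly shutdown leaves the log empty, while the same socket closure without the flag produces one log entry.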

Joerg Schwaerzler added a comment - 2023-04-21 07:15

Thanks for the explanation.
In that case I will try to downgrade our test instance to see whether we will be able to reproduce the issue then.

FYI: On the production instance it really looks like the issue is caused by Java 8 JNLP images. I will post that in the linked ticket, too.

Allan BURDAJEWICZ added a comment - 2023-11-08 00:59

macdrega any update?

Rahali added a comment - 2023-12-03 01:42

macdrega any update on this issue, please?

Joerg Schwaerzler added a comment - 2023-12-04 19:10 - edited

We fully migrated to Java 11 and do not see this issue anymore. Currently we are running 2.401.3.
Sorry for the late response.

Assignee: Allan BURDAJEWICZ
Reporter: Allan BURDAJEWICZ
Votes: 1
Watchers: 6
Created: 2022-12-23 02:33
Updated: 2023-12-04 19:11
