[JENKINS-48865] JNLP Agents/Slaves Disconnecting Unpredictably

Type: Bug
Resolution: Unresolved
Priority: Major
Component/s: remoting
Labels:
- slave
Environment:
Jenkins Master - 2.100, Ubuntu
Linux Agent - Running inside a container on Ubuntu, 2.100 agent jar
Windows Agent - Running inside a container on Windows Server 1709

Similar Issues:
Powered by SuggestiMate

Show

I've set up some permanent build agents that run as containers for my build server which I've got running on Azure virtual machines at the moment.

Overall, the agents are able to connect and perform builds through to completion. Unfortunately, I am experiencing unpredictable disconnects from both the linux and Windows based agents. Especially after they've been idle for a bit.

I've not been unable to establish any kind of common reason for the disconnects between both of them. Specifically for Azure, I've adjusted the "Idle Timeout" setting for all IP addresses (including the jenkins master) on Azure to be the maximum value, to no avail. I've also made sure that the TCP socket connect timeout is set to 6 on all my linux based machines, this hasn't helped.

I've been through a lot of the log information from both the master and the agents, but I can't piece together a clear idea of which side is necessarily failing. One recent disconnect produced this on the linux agent:

Jan 09, 2018 2:33:40 PM hudson.slaves.ChannelPinger$1 onDead INFO: Ping failed. Terminating the channel JNLP4-connect connection to 123.123.123.123/234.234.234.234:49187. java.util.concurrent.TimeoutException: Ping started at 1515508180945 hasn't completed by 1515508420945 at hudson.remoting.PingThread.ping(PingThread.java:134) at hudson.remoting.PingThread.run(PingThread.java:90)

This seems to indicate a ping timeout, but the networking on the machine is fine. If I connect and restart the agent container, it connects right away and seems to be healthy for a while again. Here's what the Jenkins master reports for the agent:

java.nio.channels.ClosedChannelException at org.jenkinsci.remoting.protocol.impl.ChannelApplicationLayer.onReadClosed(ChannelApplicationLayer.java:208) at org.jenkinsci.remoting.protocol.ApplicationLayer.onRecvClosed(ApplicationLayer.java:222) at org.jenkinsci.remoting.protocol.ProtocolStack$Ptr.onRecvClosed(ProtocolStack.java:832) at org.jenkinsci.remoting.protocol.FilterLayer.onRecvClosed(FilterLayer.java:287) at org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer.onRecvClosed(SSLEngineFilterLayer.java:181) at org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer.switchToNoSecure(SSLEngineFilterLayer.java:283) at org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer.processWrite(SSLEngineFilterLayer.java:503) at org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer.processQueuedWrites(SSLEngineFilterLayer.java:248) at org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer.doSend(SSLEngineFilterLayer.java:200) at org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer.doCloseSend(SSLEngineFilterLayer.java:213) at org.jenkinsci.remoting.protocol.ProtocolStack$Ptr.doCloseSend(ProtocolStack.java:800) at org.jenkinsci.remoting.protocol.ApplicationLayer.doCloseWrite(ApplicationLayer.java:173) at org.jenkinsci.remoting.protocol.impl.ChannelApplicationLayer$ByteBufferCommandTransport.closeWrite(ChannelApplicationLayer.java:313) at hudson.remoting.Channel.close(Channel.java:1405) at hudson.remoting.Channel.close(Channel.java:1358) at hudson.slaves.SlaveComputer.closeChannel(SlaveComputer.java:737) at hudson.slaves.SlaveComputer.access$800(SlaveComputer.java:96) at hudson.slaves.SlaveComputer$3.run(SlaveComputer.java:655) at jenkins.util.ContextResettingExecutorService$1.run(ContextResettingExecutorService.java:28) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748)

This message seems to come up quite often, but generally speaking seems to indicate that the agent vanished and Jenkins doesn't know why? So I don't know if it's any help.

I've been researching this issue for a while, so I've been trying quite a few suggestions from existing bugs here on this bug tracker. If there's anything I can do to get more conclusive information about the disconnects, let me know and I'll reply with it.

I'm pretty much at the end of my rope in trying to figure out what's going on here, so all help is appreciated!

- - Sort By Name
  - Sort By Date
  - Ascending
  - Descending
  - Thumbnails
  - List
  - Download All

jenkins-from-jacob-anderson.log
171 kB
2022-02-01 17:59
jswum_jenkins_log.txt
68 kB
2019-04-01 21:19

relates to

JENKINS-48895 Channels closed exception after upgrade Jenkins version 2.90

Closed

Oleg Nenashev added a comment - 2018-01-09 20:08

Yeah, generally these messages appear when a virtualized agent VM/container gets terminated. In ~~JENKINS-48616~~ I see the same for EC2.

Any chance it is somehow related to Meltdown restarts? CC gjphilp

Oleg Nenashev added a comment - 2018-01-09 20:08 Yeah, generally these messages appear when a virtualized agent VM/container gets terminated. In JENKINS-48616 I see the same for EC2. Any chance it is somehow related to Meltdown restarts? CC gjphilp

Alexander Trauzzi added a comment - 2018-01-09 20:15

This has been happening before meltdown, since about mid December when I started working on moving our build infrastructure over.

Alexander Trauzzi added a comment - 2018-01-09 20:15 This has been happening before meltdown, since about mid December when I started working on moving our build infrastructure over.

Oleg Nenashev added a comment - 2018-01-25 12:26

It seems to be the same as JENKINS-48865

Oleg Nenashev added a comment - 2018-01-25 12:26 It seems to be the same as JENKINS-48865

Piotr Plenik added a comment - 2018-10-26 10:53

oleg_nenashev indeed JENKINS-48865 and JENKINS-48865 is precisely the same issue.

I guess that you mean ~~JENKINS-44132~~. Isn't?

Piotr Plenik added a comment - 2018-10-26 10:53 oleg_nenashev indeed JENKINS-48865 and JENKINS-48865 is precisely the same issue. I guess that you mean JENKINS-44132 . Isn't?

Jeff Thompson added a comment - 2018-10-26 17:57

I suspect Oleg meant ~~JENKINS-48895~~.

Ping failures on the agent can occur because of some issue on the master, perhaps a restart, or excessive resource issue causing it to delay in responding to the ping, or some other system or networking issue.

Jeff Thompson added a comment - 2018-10-26 17:57 I suspect Oleg meant JENKINS-48895 . Ping failures on the agent can occur because of some issue on the master, perhaps a restart, or excessive resource issue causing it to delay in responding to the ping, or some other system or networking issue.

Jeff Thompson added a comment - 2018-12-11 19:08

Closing for lack of sufficient diagnostics and information to reproduce after no response for quite a while.

Jeff Thompson added a comment - 2018-12-11 19:08 Closing for lack of sufficient diagnostics and information to reproduce after no response for quite a while.

Alfred Wong added a comment - 2019-04-01 21:19

I have an issue very similar to this issue. My observation is that the slave has lost connectivity and tried to re-establish a connection and the master is rejecting the connection because master thinks it already have the connection. While at the same time master is trying to ping the slave and waiting for the 4 minutes timeout. I think the error condition can be handle a bit differently, if ping is not responding and a new connection request is coming in, it should accept the new connection instead of waiting for 4 minutes before destroying the old connection. I have attached a log file from the master. The only thing I am not sure is why the slave needs to request a new connection, maybe because the connection to the master is not very stable. It would be nice to have more slave logs to see why the connection is dropped.

The Jenkins version is 2.150.3 and run under Kunbernetes and the slaves are Windows slaves started using JNLP.

Alfred Wong added a comment - 2019-04-01 21:19 I have an issue very similar to this issue. My observation is that the slave has lost connectivity and tried to re-establish a connection and the master is rejecting the connection because master thinks it already have the connection. While at the same time master is trying to ping the slave and waiting for the 4 minutes timeout. I think the error condition can be handle a bit differently, if ping is not responding and a new connection request is coming in, it should accept the new connection instead of waiting for 4 minutes before destroying the old connection. I have attached a log file from the master. The only thing I am not sure is why the slave needs to request a new connection, maybe because the connection to the master is not very stable. It would be nice to have more slave logs to see why the connection is dropped. The Jenkins version is 2.150.3 and run under Kunbernetes and the slaves are Windows slaves started using JNLP.

Jeff Thompson added a comment - 2019-04-25 17:09

awong29, your description sounds different from the original report. The original report was about unpredictable disconnects. These can happen for many reasons, but often occur because of system, network, or environmental issues. Your description concerns re-connection problems. I think it would be better for you to create a separate ticket for your issue.

Could you share more information about what is occurring? Information about how you launch your agents. Anything relevant about their configuration. Agent logs would be essential.

Jeff Thompson added a comment - 2019-04-25 17:09 awong29 , your description sounds different from the original report. The original report was about unpredictable disconnects. These can happen for many reasons, but often occur because of system, network, or environmental issues. Your description concerns re-connection problems. I think it would be better for you to create a separate ticket for your issue. Could you share more information about what is occurring? Information about how you launch your agents. Anything relevant about their configuration. Agent logs would be essential.

Alfred Wong added a comment - 2019-04-25 19:14

Sure, I can create a new JIRA, I think the original problem I got was the disconnect and it is still happening a few times a day. Our vendor OpenShift and our container team has been spending the last few weeks investigating the issue. I will put the re-connection issue in another JIRA. Thanks.

Alfred Wong added a comment - 2019-04-25 19:14 Sure, I can create a new JIRA, I think the original problem I got was the disconnect and it is still happening a few times a day. Our vendor OpenShift and our container team has been spending the last few weeks investigating the issue. I will put the re-connection issue in another JIRA. Thanks.

Jeff Thompson added a comment - 2019-04-25 20:33

Yes, disconnect issues can be very difficult to track down. They're usually due to something closing the connection at the TCP layer. Or one end being overloaded and unable to maintain its side.

I think we should re-close this ticket.

Jeff Thompson added a comment - 2019-04-25 20:33 Yes, disconnect issues can be very difficult to track down. They're usually due to something closing the connection at the TCP layer. Or one end being overloaded and unable to maintain its side. I think we should re-close this ticket.

Alfred Wong added a comment - 2019-04-26 16:24

I will update if we find anything more about why the disconnection happen from our IT. Thanks.

Alfred Wong added a comment - 2019-04-26 16:24 I will update if we find anything more about why the disconnection happen from our IT. Thanks.

Anargyros Tomaras added a comment - 2020-03-22 00:04

Why is this closed? I have the same problem. Remoting v3.36, Jenkins v2.213

Anargyros Tomaras added a comment - 2020-03-22 00:04 Why is this closed? I have the same problem. Remoting v3.36, Jenkins v2.213

Mark Waite added a comment - 2020-03-22 22:01

tomahawk1187 there is a comment from Jeff Thompson which says that he is closing it for lack of information that will allow the problem to be duplicated. If you can provide a set of steps which will allow someone else to duplicate the failures, I'm sure he'd be delighted to see those steps and experiment with them.

Mark Waite added a comment - 2020-03-22 22:01 tomahawk1187 there is a comment from Jeff Thompson which says that he is closing it for lack of information that will allow the problem to be duplicated. If you can provide a set of steps which will allow someone else to duplicate the failures, I'm sure he'd be delighted to see those steps and experiment with them.

Chris Valean added a comment - 2020-03-22 22:08

markewaite you need to understand this was not an issue of "step 1, 2, 3, repro".

Everyone's environment is different, and errors do go away after people are trying to do anything to pass this issue.

personally I'm going to unsubscribe from this thread as it's no longer relevant to me. Reporting issues here is very disappointing when trash get hidden under the "no repro" tag, rather than trying to understand the problem and offer any type of suggestions.

Chris Valean added a comment - 2020-03-22 22:08 markewaite you need to understand this was not an issue of "step 1, 2, 3, repro". Everyone's environment is different, and errors do go away after people are trying to do anything to pass this issue. personally I'm going to unsubscribe from this thread as it's no longer relevant to me. Reporting issues here is very disappointing when trash get hidden under the "no repro" tag, rather than trying to understand the problem and offer any type of suggestions.

Mark Waite added a comment - 2020-03-22 23:42

chvalean I accept that many issues are not "step 1, 2, 3, repro", many environments are different, and that workarounds often help users find ways to avoid issues. I was trying to answer the question from tomahawk1187.

I'm open to any suggestions that a volunteer maintainer should do to fix a bug that can't be duplicated. What would persuade a volunteer maintainer to be more interested in this issue than the other issues they are investigating or the other features they are adding?

I've spent many hours making guesses about bug reports, trying various experiments in hopes of seeing the problem that the user reported. The investigations are usually focused on helping a user find an alternative which will allow them to avoid an issue they have detected. Those investigations have the added hope that if I understand how to duplicate the problem, I can assess how many other users will see the problem. The investigations may also help me understand how to fix the problem. The investigations are done on personal time and for personal passion.

I empathize with user frustration that the issue they are seeing is not visible to the maintainer. I don't see what maintainers can do to fix a problem they cannot see.

I empathize with maintainers that don't receive enough information from submitters. I understand that users may not want to spend any more time reporting an issue than is absolutely necessary.

I don't see a lot of benefit to leaving an issue open as a maintainer when I've tried my best to duplicate it and I cannot duplicate it. If it is left open, it may mislead users that someone might work on it. If I can't duplicate the problem, it is much less likely that I will work on the problem. I don't see any loss of information in marking an issue as "Cannot reproduce" and closing it. If others find a way to duplicate the problem, they can provide the detailed information to duplicate the problem and reopen the issue.

Mark Waite added a comment - 2020-03-22 23:42 chvalean I accept that many issues are not "step 1, 2, 3, repro", many environments are different, and that workarounds often help users find ways to avoid issues. I was trying to answer the question from tomahawk1187 . I'm open to any suggestions that a volunteer maintainer should do to fix a bug that can't be duplicated. What would persuade a volunteer maintainer to be more interested in this issue than the other issues they are investigating or the other features they are adding? I've spent many hours making guesses about bug reports, trying various experiments in hopes of seeing the problem that the user reported. The investigations are usually focused on helping a user find an alternative which will allow them to avoid an issue they have detected. Those investigations have the added hope that if I understand how to duplicate the problem, I can assess how many other users will see the problem. The investigations may also help me understand how to fix the problem. The investigations are done on personal time and for personal passion. I empathize with user frustration that the issue they are seeing is not visible to the maintainer. I don't see what maintainers can do to fix a problem they cannot see. I empathize with maintainers that don't receive enough information from submitters. I understand that users may not want to spend any more time reporting an issue than is absolutely necessary. I don't see a lot of benefit to leaving an issue open as a maintainer when I've tried my best to duplicate it and I cannot duplicate it. If it is left open, it may mislead users that someone might work on it. If I can't duplicate the problem, it is much less likely that I will work on the problem. I don't see any loss of information in marking an issue as "Cannot reproduce" and closing it. If others find a way to duplicate the problem, they can provide the detailed information to duplicate the problem and reopen the issue.

Jeff Thompson added a comment - 2020-03-23 14:43

As I mentioned previously, these sorts of issues are almost always caused by some problem in the local environment. Something to do with system, network, or environment configuration. Sometimes it results from a conflict between plugins or job execution errors, which mistakenly appear as Remoting issues. All of these types of issues require troubleshooting in the local environment. Without providing a substantial amount of troubleshooting data, which usually ends up identifying the configuration issue anyway, there is nothing that anyone else can do.

Frequently with these issues, when someone reports they have the same issue, it often turns out to be something quite different. Alfred's, earlier here, is an excellent example. On another similar ticket, there were multiple reports from different people as to how they resolved the issue, most of them different.

If someone can provide sufficient diagnostics or reproduction steps, I'd be happy to take a look. Even better, submit a PR, as several people have done.

Jeff Thompson added a comment - 2020-03-23 14:43 As I mentioned previously, these sorts of issues are almost always caused by some problem in the local environment. Something to do with system, network, or environment configuration. Sometimes it results from a conflict between plugins or job execution errors, which mistakenly appear as Remoting issues. All of these types of issues require troubleshooting in the local environment. Without providing a substantial amount of troubleshooting data, which usually ends up identifying the configuration issue anyway, there is nothing that anyone else can do. Frequently with these issues, when someone reports they have the same issue, it often turns out to be something quite different. Alfred's, earlier here, is an excellent example. On another similar ticket, there were multiple reports from different people as to how they resolved the issue, most of them different. If someone can provide sufficient diagnostics or reproduction steps, I'd be happy to take a look. Even better, submit a PR, as several people have done.

Gopal Ahir added a comment - 2020-05-08 05:13

Any fix for this? I am also facing the same issue. Jenkins v2.204.1. ssh plugin version 1.31.0

Slave OS:- Windows Server 2016

I am facing this issue only when the build is in progress and there is no logs in job output for some time. The build is getting failed.

12:29:24 Z:\>rem \\zmy19nap01\HOME\pcrscm\PuTTY\plink.exe -ssh -i \\zmy19nap01\home\pcrscm\.ssh\pcrscm.ppk pcrscm@zmy33lxclient04 "/usr/atria/bin/cleartool setview -exec 'perl /view/cars_CARS_PCR_SU_PLIGHT1.1.50_SCM/vobs/ltd_tools/cars/common/cleartool_lscheckout.pl' pcrscm_Crete_host_I9998" 12:29:24 12:29:24 Z:\>exit 0 12:39:16
[Agent went offline during the build

https://pcrsub-jenkins.mot-solutions.com/computer/ZMY33-WIN2016/log]
12:39:16 ERROR: Connection was broken: java.util.concurrent.TimeoutException: Ping started at 1588912516061 hasn't completed by 1588912756062*12:39:16* at hudson.remoting.PingThread.ping(PingThread.java:133)12:39:16 at hudson.remoting.PingThread.run(PingThread.java:89)12:39:16 12:39:16 Build step 'Console output (build log) parsing' marked build as failure*12:39:16* ERROR: ZMY33-WIN2016 is offline; cannot locate JAVA_HOME

Gopal Ahir added a comment - 2020-05-08 05:13 Any fix for this? I am also facing the same issue. Jenkins v2.204.1. ssh plugin version 1.31.0 Slave OS:- Windows Server 2016 I am facing this issue only when the build is in progress and there is no logs in job output for some time. The build is getting failed. 12:29:24 Z:\>rem \\zmy19nap01\HOME\pcrscm\PuTTY\plink.exe -ssh -i \\zmy19nap01\home\pcrscm\.ssh\pcrscm.ppk pcrscm@zmy33lxclient04 "/usr/atria/bin/cleartool setview -exec 'perl /view/cars_CARS_PCR_SU_PLIGHT1.1.50_SCM/vobs/ltd_tools/cars/common/cleartool_lscheckout.pl' pcrscm_Crete_host_I9998" 12:29:24 12:29:24 Z:\>exit 0 12:39:16 [Agent went offline during the build https://pcrsub-jenkins.mot-solutions.com/computer/ZMY33-WIN2016/log] 12:39:16 ERROR: Connection was broken: java.util.concurrent.TimeoutException: Ping started at 1588912516061 hasn't completed by 1588912756062*12:39:16* at hudson.remoting.PingThread.ping(PingThread.java:133) 12:39:16 at hudson.remoting.PingThread.run(PingThread.java:89) 12:39:16 12:39:16 Build step 'Console output (build log) parsing' marked build as failure*12:39:16* ERROR: ZMY33-WIN2016 is offline; cannot locate JAVA_HOME

jacob anderson added a comment - 2022-02-01 16:06 - edited

I've noticed a similar behavior that seems to be correlated with jenkins' updates that also include changes to the agent.jar. If the subordinating agent (windows mx) does not update the agent.jar then the likelihood of the connection terminating at the jenkins main side is higher. The recovery is equally mysterious because I can restart the agents after the agent update and they will appear to be online only to go offline a few minutes later. After some amount of delay, maybe 10 minutes, the agent has a higher likelihood of becoming stable.

The exception are those days where the agent does not reach stability after several attempts to recover. This is usually around days when windows does an update at the same time that jenkins has an update.

My usual recovery steps (which I had to follow today):

Download and distribute the agent.jar to all agent machines
Check jenkins main host for pending updates or restarts and comply with OS recommendations
Once the jenkins main host is stable (wait about 5 minutes after availability), then continue to the next steps
Restart all agent machines
Monitor each agent and restart the services that manage the agents. Sometimes the agent will "lockup" during the first step in a build, e.g. the git clone. This is an indicator of instability in the agent and it will go offline in a few minutes. That will require a restart of the service, not the machine. The console logs on the client never show anything being wrong. The only indicators are on the jenkins main host where you see the same/similar stack trace that op posted (SSL connection termination on read).
If instability ensues then make sure all agents have up to date Java software. After update, go back to step 4 and repeat.

If there are some logs that the maintainers would like to see, I am happy to provide. Having maintained a popular open source project once, I am sympathetic...

From the logs today:

2022-02-01 13:14:04.759+0000 [id=27] INFO hudson.PluginManager#loadDetachedPlugins: Upgrading Jenkins. The last running version was 2.332. This Jenkins is version 2.333.
2022-02-01 13:14:04.819+0000 [id=27] INFO hudson.PluginManager#loadDetachedPlugins: Upgraded Jenkins from version 2.332 to version 2.333. Loaded detached plugins (and dependencies): []
2022-02-01 13:14:07.757+0000 [id=31] INFO jenkins.InitReactorRunner$1#onAttained: Listed all plugins
2022-02-01 13:14:14.431+0000 [id=32] INFO jenkins.InitReactorRunner$1#onAttained: Prepared all plugins
2022-02-01 13:14:14.478+0000 [id=32] INFO jenkins.InitReactorRunner$1#onAttained: Started all plugins
2022-02-01 13:14:14.500+0000 [id=29] INFO jenkins.InitReactorRunner$1#onAttained: Augmented all extensions
2022-02-01 13:14:14.679+0000 [id=34] INFO jenkins.model.Jenkins#setBuildsAndWorkspacesDir: Using non default workspaces directories: ${JENKINS_HOME}/workspace/${ITEM_FULLNAME}.
2022-02-01 13:14:25.261+0000 [id=34] INFO hudson.slaves.SlaveComputer#tryReconnect: Attempting to reconnect gitsync
2022-02-01 13:14:25.319+0000 [id=34] INFO jenkins.InitReactorRunner$1#onAttained: System config loaded
2022-02-01 13:14:28.094+0000 [id=33] INFO jenkins.InitReactorRunner$1#onAttained: System config adapted
2022-02-01 13:14:28.440+0000 [id=31] INFO jenkins.InitReactorRunner$1#onAttained: Loaded all jobs
2022-02-01 13:14:28.444+0000 [id=31] INFO jenkins.InitReactorRunner$1#onAttained: Configuration for all jobs updated
2022-02-01 13:14:28.493+0000 [id=74] INFO hudson.model.AsyncPeriodicWork#lambda$doRun$1: Started Download metadata
2022-02-01 13:14:28.499+0000 [id=74] INFO hudson.util.Retrier#start: Attempt #1 to do the action check updates server
2022-02-01 13:14:28.645+0000 [id=34] INFO jenkins.InitReactorRunner$1#onAttained: Completed initialization
2022-02-01 13:14:28.998+0000 [id=21] INFO hudson.lifecycle.Lifecycle#onReady: Jenkins is fully up and running

followed by:
2022-02-01 13:14:36.807+0000 [id=138] INFO h.TcpSlaveAgentListener$ConnectionHandler#run: Connection #1 failed: java.io.EOFException
2022-02-01 13:14:36.807+0000 [id=139] INFO h.TcpSlaveAgentListener$ConnectionHandler#run: Connection #2 failed: java.io.EOFException
2022-02-01 13:14:36.812+0000 [id=146] INFO h.TcpSlaveAgentListener$ConnectionHandler#run: Connection #8 failed: java.io.EOFException
2022-02-01 13:14:36.817+0000 [id=143] INFO h.TcpSlaveAgentListener$ConnectionHandler#run: Connection #6 failed: java.io.EOFException
2022-02-01 13:14:36.819+0000 [id=142] INFO h.TcpSlaveAgentListener$ConnectionHandler#run: Connection #5 failed: java.io.EOFException

jacob anderson added a comment - 2022-02-01 16:06 - edited I've noticed a similar behavior that seems to be correlated with jenkins' updates that also include changes to the agent.jar. If the subordinating agent (windows mx) does not update the agent.jar then the likelihood of the connection terminating at the jenkins main side is higher. The recovery is equally mysterious because I can restart the agents after the agent update and they will appear to be online only to go offline a few minutes later. After some amount of delay, maybe 10 minutes, the agent has a higher likelihood of becoming stable. The exception are those days where the agent does not reach stability after several attempts to recover. This is usually around days when windows does an update at the same time that jenkins has an update. My usual recovery steps (which I had to follow today): Download and distribute the agent.jar to all agent machines Check jenkins main host for pending updates or restarts and comply with OS recommendations Once the jenkins main host is stable (wait about 5 minutes after availability), then continue to the next steps Restart all agent machines Monitor each agent and restart the services that manage the agents. Sometimes the agent will "lockup" during the first step in a build, e.g. the git clone. This is an indicator of instability in the agent and it will go offline in a few minutes. That will require a restart of the service, not the machine. The console logs on the client never show anything being wrong. The only indicators are on the jenkins main host where you see the same/similar stack trace that op posted (SSL connection termination on read). If instability ensues then make sure all agents have up to date Java software. After update, go back to step 4 and repeat. If there are some logs that the maintainers would like to see, I am happy to provide. Having maintained a popular open source project once, I am sympathetic... From the logs today: 2022-02-01 13:14:04.759+0000 [id=27] INFO hudson.PluginManager#loadDetachedPlugins: Upgrading Jenkins. The last running version was 2.332. This Jenkins is version 2.333. 2022-02-01 13:14:04.819+0000 [id=27] INFO hudson.PluginManager#loadDetachedPlugins: Upgraded Jenkins from version 2.332 to version 2.333. Loaded detached plugins (and dependencies): [] 2022-02-01 13:14:07.757+0000 [id=31] INFO jenkins.InitReactorRunner$1#onAttained: Listed all plugins 2022-02-01 13:14:14.431+0000 [id=32] INFO jenkins.InitReactorRunner$1#onAttained: Prepared all plugins 2022-02-01 13:14:14.478+0000 [id=32] INFO jenkins.InitReactorRunner$1#onAttained: Started all plugins 2022-02-01 13:14:14.500+0000 [id=29] INFO jenkins.InitReactorRunner$1#onAttained: Augmented all extensions 2022-02-01 13:14:14.679+0000 [id=34] INFO jenkins.model.Jenkins#setBuildsAndWorkspacesDir: Using non default workspaces directories: ${JENKINS_HOME}/workspace/${ITEM_FULLNAME}. 2022-02-01 13:14:25.261+0000 [id=34] INFO hudson.slaves.SlaveComputer#tryReconnect: Attempting to reconnect gitsync 2022-02-01 13:14:25.319+0000 [id=34] INFO jenkins.InitReactorRunner$1#onAttained: System config loaded 2022-02-01 13:14:28.094+0000 [id=33] INFO jenkins.InitReactorRunner$1#onAttained: System config adapted 2022-02-01 13:14:28.440+0000 [id=31] INFO jenkins.InitReactorRunner$1#onAttained: Loaded all jobs 2022-02-01 13:14:28.444+0000 [id=31] INFO jenkins.InitReactorRunner$1#onAttained: Configuration for all jobs updated 2022-02-01 13:14:28.493+0000 [id=74] INFO hudson.model.AsyncPeriodicWork#lambda$doRun$1: Started Download metadata 2022-02-01 13:14:28.499+0000 [id=74] INFO hudson.util.Retrier#start: Attempt #1 to do the action check updates server 2022-02-01 13:14:28.645+0000 [id=34] INFO jenkins.InitReactorRunner$1#onAttained: Completed initialization 2022-02-01 13:14:28.998+0000 [id=21] INFO hudson.lifecycle.Lifecycle#onReady: Jenkins is fully up and running followed by: 2022-02-01 13:14:36.807+0000 [id=138] INFO h.TcpSlaveAgentListener$ConnectionHandler#run: Connection #1 failed: java.io.EOFException 2022-02-01 13:14:36.807+0000 [id=139] INFO h.TcpSlaveAgentListener$ConnectionHandler#run: Connection #2 failed: java.io.EOFException 2022-02-01 13:14:36.812+0000 [id=146] INFO h.TcpSlaveAgentListener$ConnectionHandler#run: Connection #8 failed: java.io.EOFException 2022-02-01 13:14:36.817+0000 [id=143] INFO h.TcpSlaveAgentListener$ConnectionHandler#run: Connection #6 failed: java.io.EOFException 2022-02-01 13:14:36.819+0000 [id=142] INFO h.TcpSlaveAgentListener$ConnectionHandler#run: Connection #5 failed: java.io.EOFException

jacob anderson added a comment - 2022-02-01 16:12 - edited

See prior comment. This occurs in 2.333 as of today on a fully patched OS with no kernel or java updates pending. Agents are Windows 10 and Windows Server 2019 running as services. Agent mx have malware countermeasures in place which could be a problem. Exceptions are configured for the build environments to prevent cyber tool interference.

Note that I have added my sanitize log from today as evidence for reopening this issue.

jacob anderson added a comment - 2022-02-01 16:12 - edited See prior comment. This occurs in 2.333 as of today on a fully patched OS with no kernel or java updates pending. Agents are Windows 10 and Windows Server 2019 running as services. Agent mx have malware countermeasures in place which could be a problem. Exceptions are configured for the build environments to prevent cyber tool interference. Note that I have added my sanitize log from today as evidence for reopening this issue.

jacob anderson added a comment - 2022-02-01 19:19 - edited

I tried a restart of jenkins now and the agents all recovered as expected, but two of them went offline (with red sign icon). This restart was caused by the main indicating that it was going to shutdown (not initiated by anyone). The Thin Client plugin was doing a differential backup during this time, so maybe that plugin is the cause of this issue with the agents not connecting...

The two agents did recover with just a service restart.

jacob anderson added a comment - 2022-02-01 19:19 - edited I tried a restart of jenkins now and the agents all recovered as expected, but two of them went offline (with red sign icon). This restart was caused by the main indicating that it was going to shutdown (not initiated by anyone). The Thin Client plugin was doing a differential backup during this time, so maybe that plugin is the cause of this issue with the agents not connecting... The two agents did recover with just a service restart.

jacob anderson added a comment - 2022-03-03 23:02

I rebuilt my Jenkins on a clean RedHat 8 and now the agents are very reliable. Where they would disconnect with every update, now that is not happening. I think the problem may have been related to JDK version mismatching where one party was using JDK8 instead of JDK11. All parties are now on JDK11.

jacob anderson added a comment - 2022-03-03 23:02 I rebuilt my Jenkins on a clean RedHat 8 and now the agents are very reliable. Where they would disconnect with every update, now that is not happening. I think the problem may have been related to JDK version mismatching where one party was using JDK8 instead of JDK11. All parties are now on JDK11.

Kapa Wo added a comment - 2022-04-05 17:22

We experience the same issue after upgrade from 2.319 to 2.332. Master and Agent running the same JDK11. I had open issue [~~JENKINS-68122~~|~~JENKINS-68122~~ Slave connection broken (randomly) with error java.util.concurrent.TimeoutException - Jenkins Jira]

The problem here is it is not consistent failed, it just failed randomly.

Kapa Wo added a comment - 2022-04-05 17:22 We experience the same issue after upgrade from 2.319 to 2.332. Master and Agent running the same JDK11. I had open issue [ JENKINS-68122 | JENKINS-68122 Slave connection broken (randomly) with error java.util.concurrent.TimeoutException - Jenkins Jira] The problem here is it is not consistent failed, it just failed randomly.

kredens added a comment - 2022-07-21 06:02 - edited

I managed to almost pinpoint the issue (at least in its current incarnation) - after updating to 2.346.2 everything was fine, then some plugin updates happened and I started having serious problems with agent connections + the ones that connected were very slow to update infos on Nodes page.

First I rolled back to 2.346.1 as it was the easiest solution - it didn't help

After some more investigating, I noticed Jenkins also stopped sending any emails, and email plugin(s) generated a lot of errors in the logs:

WARNING    jenkins.util.Listeners#lambda$notify$0
java.lang.NoSuchMethodError: 'javax.mail.Session hudson.tasks.Mailer$DescriptorImpl.createSession()'
    at org.jenkinsci.plugins.mailwatcher.MailWatcherMailer.send(MailWatcherMailer.java:116)
    at org.jenkinsci.plugins.mailwatcher.MailWatcherNotification.send(MailWatcherNotification.java:156)
    at org.jenkinsci.plugins.mailwatcher.WatcherComputerListener$Notification$Builder.send(WatcherComputerListener.java:181)
    at org.jenkinsci.plugins.mailwatcher.WatcherComputerListener.onOffline(WatcherComputerListener.java:91)
    at hudson.slaves.SlaveComputer.lambda$closeChannel$1(SlaveComputer.java:927)
    at jenkins.util.Listeners.lambda$notify$0(Listeners.java:59)
    at jenkins.util.Listeners.notify(Listeners.java:67)
    at hudson.slaves.SlaveComputer.closeChannel(SlaveComputer.java:927)
    at hudson.slaves.SlaveComputer.setChannel(SlaveComputer.java:756)
    at jenkins.slaves.DefaultJnlpSlaveReceiver.afterChannel(DefaultJnlpSlaveReceiver.java:175)
    at org.jenkinsci.remoting.engine.JnlpConnectionState.fire(JnlpConnectionState.java:337)
    at org.jenkinsci.remoting.engine.JnlpConnectionState.fireAfterChannel(JnlpConnectionState.java:428)
    at org.jenkinsci.remoting.engine.JnlpProtocol4Handler$Handler.lambda$onChannel$0(JnlpProtocol4Handler.java:334)
    at jenkins.util.ContextResettingExecutorService$1.run(ContextResettingExecutorService.java:28)
    at jenkins.security.ImpersonatingExecutorService$1.run(ImpersonatingExecutorService.java:68)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
    at java.base/java.lang.Thread.run(Unknown Source)

So I rolled back:

Mailer Plugin from 435.something to 414.something (excuse me, but could we please move back to actually useful and human friendly numbering?)

Email Extension from 2.90 to 2.89

And the problem was immediately solved. Not sure which one of these actually caused the issue, as I rolled them back simultaneously (also it was dependency hell, had to roll back about 10 more plugins because it was somehow crucial for them to have latest mail plugins, even though they don't send any mails).

Now I'm stuck with old blueocean/pipeline plugins due to weird dependencies, but at least agents work fine.

My Jenkins instances are running on Java 11 (latest Adoptium jre)

kredens added a comment - 2022-07-21 06:02 - edited I managed to almost pinpoint the issue (at least in its current incarnation) - after updating to 2.346.2 everything was fine, then some plugin updates happened and I started having serious problems with agent connections + the ones that connected were very slow to update infos on Nodes page. First I rolled back to 2.346.1 as it was the easiest solution - it didn't help After some more investigating, I noticed Jenkins also stopped sending any emails, and email plugin(s) generated a lot of errors in the logs: WARNING jenkins.util.Listeners#lambda$notify$0 java.lang.NoSuchMethodError: 'javax.mail.Session hudson.tasks.Mailer$DescriptorImpl.createSession()' at org.jenkinsci.plugins.mailwatcher.MailWatcherMailer.send(MailWatcherMailer.java:116) at org.jenkinsci.plugins.mailwatcher.MailWatcherNotification.send(MailWatcherNotification.java:156) at org.jenkinsci.plugins.mailwatcher.WatcherComputerListener$Notification$Builder.send(WatcherComputerListener.java:181) at org.jenkinsci.plugins.mailwatcher.WatcherComputerListener.onOffline(WatcherComputerListener.java:91) at hudson.slaves.SlaveComputer.lambda$closeChannel$1(SlaveComputer.java:927) at jenkins.util.Listeners.lambda$notify$0(Listeners.java:59) at jenkins.util.Listeners.notify(Listeners.java:67) at hudson.slaves.SlaveComputer.closeChannel(SlaveComputer.java:927) at hudson.slaves.SlaveComputer.setChannel(SlaveComputer.java:756) at jenkins.slaves.DefaultJnlpSlaveReceiver.afterChannel(DefaultJnlpSlaveReceiver.java:175) at org.jenkinsci.remoting.engine.JnlpConnectionState.fire(JnlpConnectionState.java:337) at org.jenkinsci.remoting.engine.JnlpConnectionState.fireAfterChannel(JnlpConnectionState.java:428) at org.jenkinsci.remoting.engine.JnlpProtocol4Handler$Handler.lambda$onChannel$0(JnlpProtocol4Handler.java:334) at jenkins.util.ContextResettingExecutorService$1.run(ContextResettingExecutorService.java:28) at jenkins.security.ImpersonatingExecutorService$1.run(ImpersonatingExecutorService.java:68) at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) at java.base/java.lang. Thread .run(Unknown Source) So I rolled back: Mailer Plugin from 435.something to 414.something (excuse me, but could we please move back to actually useful and human friendly numbering?) Email Extension from 2.90 to 2.89 And the problem was immediately solved. Not sure which one of these actually caused the issue, as I rolled them back simultaneously (also it was dependency hell, had to roll back about 10 more plugins because it was somehow crucial for them to have latest mail plugins, even though they don't send any mails). Now I'm stuck with old blueocean/pipeline plugins due to weird dependencies, but at least agents work fine. My Jenkins instances are running on Java 11 (latest Adoptium jre)

Mark Waite added a comment - 2022-07-21 17:19

kredens could you report that message from the mailer plugin as a separate issue and include steps that will allow someone else to duplicate the issue from a fresh Jenkins installation?

I'm surprised that a plugin upgrade would have any impact on agent connection reliability. I'd like to do more investigation, but your message does not provide enough context to do more investigation.

Mark Waite added a comment - 2022-07-21 17:19 kredens could you report that message from the mailer plugin as a separate issue and include steps that will allow someone else to duplicate the issue from a fresh Jenkins installation? I'm surprised that a plugin upgrade would have any impact on agent connection reliability. I'd like to do more investigation, but your message does not provide enough context to do more investigation.

Basil Crow added a comment - 2022-07-21 20:51

The error in Mail Watcher is ~~JENKINS-69088~~, which was fixed in jenkinsci/mail-watcher-plugin#11 and released in 1.17.

Basil Crow added a comment - 2022-07-21 20:51 The error in Mail Watcher is JENKINS-69088 , which was fixed in jenkinsci/mail-watcher-plugin#11 and released in 1.17 .

kredens added a comment - 2022-07-22 09:16

markewaite the only connection I can think of between agent connection reliability and those email plugins, is that on most agents I have enabled email agent offline/online status notifications - maybe without properly working email "subsystem", something awry happens.

I'm yet to try with updated Mail Watcher plugin, will report back whether the issue reappears or not when all three plugins get updated

kredens added a comment - 2022-07-22 09:16 markewaite the only connection I can think of between agent connection reliability and those email plugins, is that on most agents I have enabled email agent offline/online status notifications - maybe without properly working email "subsystem", something awry happens. I'm yet to try with updated Mail Watcher plugin, will report back whether the issue reappears or not when all three plugins get updated

kredens added a comment - 2022-07-25 06:01

With fixed Mail Watcher plugin everything seems to be back to normal.

kredens added a comment - 2022-07-25 06:01 With fixed Mail Watcher plugin everything seems to be back to normal.

Etienne Weiler added a comment - 2022-10-17 09:24

I have a similar issue. is a fix planned?

Etienne Weiler added a comment - 2022-10-17 09:24 I have a similar issue. is a fix planned?

Vishal added a comment - 2023-06-13 10:38

Still experiencing "Ping failed. Terminating the channel JNLP4-connect" / "TimeoutException" errors with Jenkins version 2.375.2 (and jdk 11).
Is there any work around ?

Vishal added a comment - 2023-06-13 10:38 Still experiencing "Ping failed. Terminating the channel JNLP4-connect" / "TimeoutException" errors with Jenkins version 2.375.2 (and jdk 11) . Is there any work around ?

Vishal added a comment - 2024-04-03 08:34

We still experience the issue with Jenkins 2.401.1 version, even though some workaround was made in Jenkins's 2.387.2 release.
Here is changelog link : https://www.jenkins.io/changelog-stable/#v2.387.2 .

The issue with our Jenkins server is that, we are blocked to upgrade Jenkins to latest release as server was set up with "docker run" command so when I try to deploy latest release "jenkins.war", agent fails to connect to Jenkins controller.

Your inputs / help would be greatly appreciated.

Vishal added a comment - 2024-04-03 08:34 We still experience the issue with Jenkins 2.401.1 version, even though some workaround was made in Jenkins's 2.387.2 release. Here is changelog link : https://www.jenkins.io/changelog-stable/#v2.387.2 . The issue with our Jenkins server is that, we are blocked to upgrade Jenkins to latest release as server was set up with "docker run" command so when I try to deploy latest release "jenkins.war", agent fails to connect to Jenkins controller. Your inputs / help would be greatly appreciated.

Vishal added a comment - 2024-09-26 11:15

Jenkins agent disconnects and reconnects back after few minutes without manual intervention, is there any workaround for this issue ?
Any input would be greatly appreciated. Thanks !

Vishal added a comment - 2024-09-26 11:15 Jenkins agent disconnects and reconnects back after few minutes without manual intervention, is there any workaround for this issue ? Any input would be greatly appreciated. Thanks !

Mark Waite added a comment - 2024-09-27 02:07

The issue with our Jenkins server is that, we are blocked to upgrade Jenkins to latest release as server was set up with "docker run" command so when I try to deploy latest release "jenkins.war", agent fails to connect to Jenkins controller.

That suggests that you are probably using the wrong technique to upgrade the container image.

The Jenkins war file inside the container image should not be upgraded. A new container image should be built with the newer Jenkins version. The new container image can then be tested to confirm it works in your environment. However, that is a question outside this issue. Please use the Jenkins community forum for question and answer, rather than using the issue tracker for question and answer.

Mark Waite added a comment - 2024-09-27 02:07 The issue with our Jenkins server is that, we are blocked to upgrade Jenkins to latest release as server was set up with "docker run" command so when I try to deploy latest release "jenkins.war", agent fails to connect to Jenkins controller. That suggests that you are probably using the wrong technique to upgrade the container image. The Jenkins war file inside the container image should not be upgraded. A new container image should be built with the newer Jenkins version. The new container image can then be tested to confirm it works in your environment. However, that is a question outside this issue. Please use the Jenkins community forum for question and answer, rather than using the issue tracker for question and answer.

Assignee:: Unassigned

Reporter:: Alexander Trauzzi

Votes:: 7 Vote for this issue

Watchers:: 22 Start watching this issue

Created:: 2018-01-09 15:49

Updated:: 2024-09-27 02:07

Jenkins

Details

Description

Attachments

Attachments

Issue Links

Activity

Collapse comment: Oleg Nenashev added a comment - 2018-01-09 20:08

Expand comment: Oleg Nenashev added a comment - 2018-01-09 20:08

Collapse comment: Alexander Trauzzi added a comment - 2018-01-09 20:15

Expand comment: Alexander Trauzzi added a comment - 2018-01-09 20:15

Collapse comment: Oleg Nenashev added a comment - 2018-01-25 12:26

Expand comment: Oleg Nenashev added a comment - 2018-01-25 12:26

Collapse comment: Piotr Plenik added a comment - 2018-10-26 10:53

Expand comment: Piotr Plenik added a comment - 2018-10-26 10:53

Collapse comment: Jeff Thompson added a comment - 2018-10-26 17:57

Expand comment: Jeff Thompson added a comment - 2018-10-26 17:57

Collapse comment: Jeff Thompson added a comment - 2018-12-11 19:08

Expand comment: Jeff Thompson added a comment - 2018-12-11 19:08

Collapse comment: Alfred Wong added a comment - 2019-04-01 21:19

Expand comment: Alfred Wong added a comment - 2019-04-01 21:19

Collapse comment: Jeff Thompson added a comment - 2019-04-25 17:09

Expand comment: Jeff Thompson added a comment - 2019-04-25 17:09

Collapse comment: Alfred Wong added a comment - 2019-04-25 19:14

Expand comment: Alfred Wong added a comment - 2019-04-25 19:14

Collapse comment: Jeff Thompson added a comment - 2019-04-25 20:33

Expand comment: Jeff Thompson added a comment - 2019-04-25 20:33

Collapse comment: Alfred Wong added a comment - 2019-04-26 16:24

Expand comment: Alfred Wong added a comment - 2019-04-26 16:24

Collapse comment: Anargyros Tomaras added a comment - 2020-03-22 00:04

Expand comment: Anargyros Tomaras added a comment - 2020-03-22 00:04

Collapse comment: Mark Waite added a comment - 2020-03-22 22:01

Expand comment: Mark Waite added a comment - 2020-03-22 22:01

Collapse comment: Chris Valean added a comment - 2020-03-22 22:08

Expand comment: Chris Valean added a comment - 2020-03-22 22:08

Collapse comment: Mark Waite added a comment - 2020-03-22 23:42

Expand comment: Mark Waite added a comment - 2020-03-22 23:42

Collapse comment: Jeff Thompson added a comment - 2020-03-23 14:43

Expand comment: Jeff Thompson added a comment - 2020-03-23 14:43

Collapse comment: Gopal Ahir added a comment - 2020-05-08 05:13

Expand comment: Gopal Ahir added a comment - 2020-05-08 05:13

Collapse comment: jacob anderson added a comment - 2022-02-01 16:06, Edited by jacob anderson - 2022-02-01 16:22

Expand comment: jacob anderson added a comment - 2022-02-01 16:06, Edited by jacob anderson - 2022-02-01 16:22

Collapse comment: jacob anderson added a comment - 2022-02-01 16:12, Edited by jacob anderson - 2022-02-01 18:00

Expand comment: jacob anderson added a comment - 2022-02-01 16:12, Edited by jacob anderson - 2022-02-01 18:00

Collapse comment: jacob anderson added a comment - 2022-02-01 19:19, Edited by jacob anderson - 2022-02-01 19:37

Expand comment: jacob anderson added a comment - 2022-02-01 19:19, Edited by jacob anderson - 2022-02-01 19:37

Collapse comment: jacob anderson added a comment - 2022-03-03 23:02

Expand comment: jacob anderson added a comment - 2022-03-03 23:02

Collapse comment: Kapa Wo added a comment - 2022-04-05 17:22

Expand comment: Kapa Wo added a comment - 2022-04-05 17:22

Collapse comment: kredens added a comment - 2022-07-21 06:02, Edited by kredens - 2022-07-21 12:03

Expand comment: kredens added a comment - 2022-07-21 06:02, Edited by kredens - 2022-07-21 12:03

Collapse comment: Mark Waite added a comment - 2022-07-21 17:19

Expand comment: Mark Waite added a comment - 2022-07-21 17:19

Collapse comment: Basil Crow added a comment - 2022-07-21 20:51

Expand comment: Basil Crow added a comment - 2022-07-21 20:51

Collapse comment: kredens added a comment - 2022-07-22 09:16

Expand comment: kredens added a comment - 2022-07-22 09:16

Collapse comment: kredens added a comment - 2022-07-25 06:01

Expand comment: kredens added a comment - 2022-07-25 06:01

Collapse comment: Etienne Weiler added a comment - 2022-10-17 09:24

Expand comment: Etienne Weiler added a comment - 2022-10-17 09:24

Collapse comment: Vishal added a comment - 2023-06-13 10:38

Expand comment: Vishal added a comment - 2023-06-13 10:38

Collapse comment: Vishal added a comment - 2024-04-03 08:34

Expand comment: Vishal added a comment - 2024-04-03 08:34

Collapse comment: Vishal added a comment - 2024-09-26 11:15

Expand comment: Vishal added a comment - 2024-09-26 11:15

Collapse comment: Mark Waite added a comment - 2024-09-27 02:07

Expand comment: Mark Waite added a comment - 2024-09-27 02:07

People

Dates