Type: Bug
Resolution: Unresolved
Priority: Major
Environment:
Jenkins Master - 2.100, Ubuntu
Linux Agent - Running inside a container on Ubuntu, 2.100 agent jar
Windows Agent - Running inside a container on Windows Server 1709
I've set up some permanent build agents that run as containers for my build server which I've got running on Azure virtual machines at the moment.
Overall, the agents are able to connect and perform builds through to completion. Unfortunately, I am experiencing unpredictable disconnects from both the Linux and Windows based agents, especially after they've been idle for a while.
I've not been able to establish any kind of common reason for the disconnects between the two of them. Specifically for Azure, I've adjusted the "Idle Timeout" setting for all IP addresses (including the Jenkins master) to the maximum value, to no avail. I've also made sure that the TCP socket connect timeout is set to 6 on all my Linux-based machines; this hasn't helped.
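For reference, this is roughly what I've been checking and tweaking on the Linux hosts. A sketch only: which kernel tunables actually matter here is my assumption, and the values are just what I've been experimenting with, not a confirmed fix.

# Inspect the current TCP keepalive settings (standard Linux tunables)
sysctl net.ipv4.tcp_keepalive_time net.ipv4.tcp_keepalive_intvl net.ipv4.tcp_keepalive_probes

# Send keepalives sooner so idle agent connections aren't silently dropped
# by intermediate devices (experimental values)
sudo sysctl -w net.ipv4.tcp_keepalive_time=120
sudo sysctl -w net.ipv4.tcp_keepalive_intvl=30
sudo sysctl -w net.ipv4.tcp_keepalive_probes=5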
I've been through a lot of the log information from both the master and the agents, but I can't piece together a clear idea of which side is necessarily failing. One recent disconnect produced this on the linux agent:
Jan 09, 2018 2:33:40 PM hudson.slaves.ChannelPinger$1 onDead
INFO: Ping failed. Terminating the channel JNLP4-connect connection to 123.123.123.123/234.234.234.234:49187.
java.util.concurrent.TimeoutException: Ping started at 1515508180945 hasn't completed by 1515508420945
	at hudson.remoting.PingThread.ping(PingThread.java:134)
	at hudson.remoting.PingThread.run(PingThread.java:90)
This seems to indicate a ping timeout, but the networking on the machine is fine. If I connect and restart the agent container, it connects right away and seems to be healthy for a while again. Here's what the Jenkins master reports for the agent:
java.nio.channels.ClosedChannelException
	at org.jenkinsci.remoting.protocol.impl.ChannelApplicationLayer.onReadClosed(ChannelApplicationLayer.java:208)
	at org.jenkinsci.remoting.protocol.ApplicationLayer.onRecvClosed(ApplicationLayer.java:222)
	at org.jenkinsci.remoting.protocol.ProtocolStack$Ptr.onRecvClosed(ProtocolStack.java:832)
	at org.jenkinsci.remoting.protocol.FilterLayer.onRecvClosed(FilterLayer.java:287)
	at org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer.onRecvClosed(SSLEngineFilterLayer.java:181)
	at org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer.switchToNoSecure(SSLEngineFilterLayer.java:283)
	at org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer.processWrite(SSLEngineFilterLayer.java:503)
	at org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer.processQueuedWrites(SSLEngineFilterLayer.java:248)
	at org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer.doSend(SSLEngineFilterLayer.java:200)
	at org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer.doCloseSend(SSLEngineFilterLayer.java:213)
	at org.jenkinsci.remoting.protocol.ProtocolStack$Ptr.doCloseSend(ProtocolStack.java:800)
	at org.jenkinsci.remoting.protocol.ApplicationLayer.doCloseWrite(ApplicationLayer.java:173)
	at org.jenkinsci.remoting.protocol.impl.ChannelApplicationLayer$ByteBufferCommandTransport.closeWrite(ChannelApplicationLayer.java:313)
	at hudson.remoting.Channel.close(Channel.java:1405)
	at hudson.remoting.Channel.close(Channel.java:1358)
	at hudson.slaves.SlaveComputer.closeChannel(SlaveComputer.java:737)
	at hudson.slaves.SlaveComputer.access$800(SlaveComputer.java:96)
	at hudson.slaves.SlaveComputer$3.run(SlaveComputer.java:655)
	at jenkins.util.ContextResettingExecutorService$1.run(ContextResettingExecutorService.java:28)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
This message comes up quite often, but it generally just seems to indicate that the agent vanished without Jenkins knowing why, so I don't know if it's any help.
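For what it's worth, the two timestamps in the agent log above are exactly 240 seconds apart (1515508420945 - 1515508180945 = 240000 ms), which matches a four-minute ping timeout. From what I've read, the ping interval and timeout can be tuned with system properties when starting the controller. A sketch, assuming the hudson.slaves.ChannelPinger properties are the right knobs; the values are illustrative, not a recommendation:

# Illustrative controller JVM options (assumed property names; values are examples)
java -Dhudson.slaves.ChannelPinger.pingIntervalSeconds=60 \
     -Dhudson.slaves.ChannelPinger.pingTimeoutSeconds=120 \
     -jar jenkins.war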
I've been researching this issue for a while, so I've been trying quite a few suggestions from existing bugs here on this bug tracker. If there's anything I can do to get more conclusive information about the disconnects, let me know and I'll reply with it.
I'm pretty much at the end of my rope in trying to figure out what's going on here, so all help is appreciated!
[JENKINS-48865] JNLP Agents/Slaves Disconnecting Unpredictably
This has been happening since before Meltdown, from about mid-December when I started working on moving our build infrastructure over.
oleg_nenashev indeed JENKINS-48865 and JENKINS-48865 is precisely the same issue.
I guess that you mean JENKINS-44132. Isn't it?
I suspect Oleg meant JENKINS-48895.
Ping failures on the agent can occur because of some issue on the master: perhaps a restart, excessive resource usage causing it to delay in responding to the ping, or some other system or networking issue.
Closing for lack of sufficient diagnostics and information to reproduce after no response for quite a while.
I have an issue very similar to this one. My observation is that the slave has lost connectivity and tried to re-establish a connection, and the master is rejecting the new connection because the master thinks it already has a connection. At the same time, the master is trying to ping the slave and waiting for the 4-minute timeout. I think the error condition could be handled a bit differently: if the ping is not responding and a new connection request comes in, it should accept the new connection instead of waiting for 4 minutes before destroying the old one. I have attached a log file from the master. The only thing I am not sure about is why the slave needs to request a new connection; maybe the connection to the master is not very stable. It would be nice to have more slave logs to see why the connection is dropped.
The Jenkins version is 2.150.3, running under Kubernetes, and the slaves are Windows slaves started using JNLP.
awong29, your description sounds different from the original report. The original report was about unpredictable disconnects. These can happen for many reasons, but often occur because of system, network, or environmental issues. Your description concerns re-connection problems. I think it would be better for you to create a separate ticket for your issue.
Could you share more information about what is occurring? Information about how you launch your agents and anything relevant about their configuration would help. Agent logs would be essential.
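For example, something along these lines would capture a verbose agent-side log of the disconnect. This is only a sketch: the URL, secret, paths, and file names are placeholders, not values from your setup.

# logging.properties (placeholder file) - raise java.util.logging output to FINE
handlers=java.util.logging.FileHandler
java.util.logging.FileHandler.pattern=remoting-agent-%g.log
java.util.logging.FileHandler.formatter=java.util.logging.SimpleFormatter
.level=FINE

# Launch the inbound agent with that logging configuration
java -Djava.util.logging.config.file=logging.properties -jar agent.jar \
  -jnlpUrl https://jenkins.example.com/computer/my-agent/slave-agent.jnlp \
  -secret <agent-secret> \
  -workDir /home/jenkins/agent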
Sure, I can create a new JIRA. I think the original problem I had was the disconnect, and it is still happening a few times a day. Our vendor OpenShift and our container team have been spending the last few weeks investigating the issue. I will put the re-connection issue in another JIRA. Thanks.
Yes, disconnect issues can be very difficult to track down. They're usually due to something closing the connection at the TCP layer. Or one end being overloaded and unable to maintain its side.
I think we should re-close this ticket.
I will update if we find anything more from our IT about why the disconnections happen. Thanks.
Why is this closed? I have the same problem. Remoting v3.36, Jenkins v2.213
tomahawk1187 there is a comment from Jeff Thompson which says that he is closing it for lack of information that will allow the problem to be duplicated. If you can provide a set of steps which will allow someone else to duplicate the failures, I'm sure he'd be delighted to see those steps and experiment with them.
markewaite you need to understand this was not an issue of "step 1, 2, 3, repro".
Everyone's environment is different, and the errors sometimes go away after people try various things to work around the issue.
Personally, I'm going to unsubscribe from this thread as it's no longer relevant to me. Reporting issues here is very disappointing when reports get tossed under the "no repro" tag rather than anyone trying to understand the problem and offer any kind of suggestions.
chvalean I accept that many issues are not "step 1, 2, 3, repro", many environments are different, and that workarounds often help users find ways to avoid issues. I was trying to answer the question from tomahawk1187.
I'm open to any suggestions about what a volunteer maintainer should do to fix a bug that can't be duplicated. What would persuade a volunteer maintainer to be more interested in this issue than the other issues they are investigating or the other features they are adding?
I've spent many hours making guesses about bug reports, trying various experiments in hopes of seeing the problem that the user reported. The investigations are usually focused on helping a user find an alternative which will allow them to avoid an issue they have detected. Those investigations have the added hope that if I understand how to duplicate the problem, I can assess how many other users will see the problem. The investigations may also help me understand how to fix the problem. The investigations are done on personal time and for personal passion.
I empathize with user frustration that the issue they are seeing is not visible to the maintainer. I don't see what maintainers can do to fix a problem they cannot see.
I empathize with maintainers that don't receive enough information from submitters. I understand that users may not want to spend any more time reporting an issue than is absolutely necessary.
I don't see a lot of benefit to leaving an issue open as a maintainer when I've tried my best to duplicate it and I cannot duplicate it. If it is left open, it may mislead users that someone might work on it. If I can't duplicate the problem, it is much less likely that I will work on the problem. I don't see any loss of information in marking an issue as "Cannot reproduce" and closing it. If others find a way to duplicate the problem, they can provide the detailed information to duplicate the problem and reopen the issue.
As I mentioned previously, these sorts of issues are almost always caused by some problem in the local environment. Something to do with system, network, or environment configuration. Sometimes it results from a conflict between plugins or job execution errors, which mistakenly appear as Remoting issues. All of these types of issues require troubleshooting in the local environment. Without providing a substantial amount of troubleshooting data, which usually ends up identifying the configuration issue anyway, there is nothing that anyone else can do.
Frequently with these issues, when someone reports they have the same issue, it often turns out to be something quite different. Alfred's report, earlier here, is an excellent example. On another similar ticket, there were multiple reports from different people as to how they resolved the issue, most of them different.
If someone can provide sufficient diagnostics or reproduction steps, I'd be happy to take a look. Even better, submit a PR, as several people have done.
Any fix for this? I am also facing the same issue. Jenkins v2.204.1, SSH plugin version 1.31.0.
Slave OS: Windows Server 2016
I am facing this issue only when a build is in progress and there are no logs in the job output for some time. The build then fails.
12:29:24 Z:\>rem \\zmy19nap01\HOME\pcrscm\PuTTY\plink.exe -ssh -i \\zmy19nap01\home\pcrscm\.ssh\pcrscm.ppk pcrscm@zmy33lxclient04 "/usr/atria/bin/cleartool setview -exec 'perl /view/cars_CARS_PCR_SU_PLIGHT1.1.50_SCM/vobs/ltd_tools/cars/common/cleartool_lscheckout.pl' pcrscm_Crete_host_I9998"
12:29:24
12:29:24 Z:\>exit 0
12:39:16 Agent went offline during the build: https://pcrsub-jenkins.mot-solutions.com/computer/ZMY33-WIN2016/log
12:39:16 ERROR: Connection was broken: java.util.concurrent.TimeoutException: Ping started at 1588912516061 hasn't completed by 1588912756062
12:39:16 	at hudson.remoting.PingThread.ping(PingThread.java:133)
12:39:16 	at hudson.remoting.PingThread.run(PingThread.java:89)
12:39:16
12:39:16 Build step 'Console output (build log) parsing' marked build as failure
12:39:16 ERROR: ZMY33-WIN2016 is offline; cannot locate JAVA_HOME
I've noticed similar behavior that seems to be correlated with Jenkins updates that also include changes to the agent.jar. If the subordinate agent (a Windows machine) does not update its agent.jar, then the likelihood of the connection terminating on the Jenkins main side is higher. The recovery is equally mysterious, because I can restart the agents after the agent update and they will appear to be online, only to go offline a few minutes later. After some amount of delay, maybe 10 minutes, the agent has a higher likelihood of becoming stable.
The exception is those days when the agent does not reach stability after several attempts to recover. This is usually around days when Windows does an update at the same time that Jenkins has an update.
My usual recovery steps (which I had to follow today):
- Download and distribute the agent.jar to all agent machines (see the sketch after this list)
- Check the Jenkins main host for pending updates or restarts and comply with OS recommendations
- Once the Jenkins main host is stable (wait about 5 minutes after availability), then continue to the next steps
- Restart all agent machines
- Monitor each agent and restart the services that manage the agents. Sometimes the agent will "lock up" during the first step in a build, e.g. the git clone. This is an indicator of instability in the agent, and it will go offline in a few minutes. That will require a restart of the service, not the machine. The console logs on the client never show anything being wrong. The only indicators are on the Jenkins main host, where you see the same/similar stack trace that the OP posted (SSL connection termination on read).
- If instability ensues, then make sure all agents have up-to-date Java software. After the update, go back to step 4 and repeat.
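A rough sketch of what steps 1 and 5 look like on my agents; the controller URL and service name below are examples from my setup, so substitute your own.

rem Step 1: fetch the agent.jar that matches the current controller version
curl.exe -sO https://jenkins.example.com/jnlpJars/agent.jar

rem Step 5: restart only the agent service, not the whole machine
sc stop "jenkins-agent"
sc start "jenkins-agent"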
If there are some logs that the maintainers would like to see, I am happy to provide. Having maintained a popular open source project once, I am sympathetic...
From the logs today:
2022-02-01 13:14:04.759+0000 [id=27] INFO hudson.PluginManager#loadDetachedPlugins: Upgrading Jenkins. The last running version was 2.332. This Jenkins is version 2.333.
2022-02-01 13:14:04.819+0000 [id=27] INFO hudson.PluginManager#loadDetachedPlugins: Upgraded Jenkins from version 2.332 to version 2.333. Loaded detached plugins (and dependencies): []
2022-02-01 13:14:07.757+0000 [id=31] INFO jenkins.InitReactorRunner$1#onAttained: Listed all plugins
2022-02-01 13:14:14.431+0000 [id=32] INFO jenkins.InitReactorRunner$1#onAttained: Prepared all plugins
2022-02-01 13:14:14.478+0000 [id=32] INFO jenkins.InitReactorRunner$1#onAttained: Started all plugins
2022-02-01 13:14:14.500+0000 [id=29] INFO jenkins.InitReactorRunner$1#onAttained: Augmented all extensions
2022-02-01 13:14:14.679+0000 [id=34] INFO jenkins.model.Jenkins#setBuildsAndWorkspacesDir: Using non default workspaces directories: ${JENKINS_HOME}/workspace/${ITEM_FULLNAME}.
2022-02-01 13:14:25.261+0000 [id=34] INFO hudson.slaves.SlaveComputer#tryReconnect: Attempting to reconnect gitsync
2022-02-01 13:14:25.319+0000 [id=34] INFO jenkins.InitReactorRunner$1#onAttained: System config loaded
2022-02-01 13:14:28.094+0000 [id=33] INFO jenkins.InitReactorRunner$1#onAttained: System config adapted
2022-02-01 13:14:28.440+0000 [id=31] INFO jenkins.InitReactorRunner$1#onAttained: Loaded all jobs
2022-02-01 13:14:28.444+0000 [id=31] INFO jenkins.InitReactorRunner$1#onAttained: Configuration for all jobs updated
2022-02-01 13:14:28.493+0000 [id=74] INFO hudson.model.AsyncPeriodicWork#lambda$doRun$1: Started Download metadata
2022-02-01 13:14:28.499+0000 [id=74] INFO hudson.util.Retrier#start: Attempt #1 to do the action check updates server
2022-02-01 13:14:28.645+0000 [id=34] INFO jenkins.InitReactorRunner$1#onAttained: Completed initialization
2022-02-01 13:14:28.998+0000 [id=21] INFO hudson.lifecycle.Lifecycle#onReady: Jenkins is fully up and running
followed by:
2022-02-01 13:14:36.807+0000 [id=138] INFO h.TcpSlaveAgentListener$ConnectionHandler#run: Connection #1 failed: java.io.EOFException
2022-02-01 13:14:36.807+0000 [id=139] INFO h.TcpSlaveAgentListener$ConnectionHandler#run: Connection #2 failed: java.io.EOFException
2022-02-01 13:14:36.812+0000 [id=146] INFO h.TcpSlaveAgentListener$ConnectionHandler#run: Connection #8 failed: java.io.EOFException
2022-02-01 13:14:36.817+0000 [id=143] INFO h.TcpSlaveAgentListener$ConnectionHandler#run: Connection #6 failed: java.io.EOFException
2022-02-01 13:14:36.819+0000 [id=142] INFO h.TcpSlaveAgentListener$ConnectionHandler#run: Connection #5 failed: java.io.EOFException
See prior comment. This occurs in 2.333 as of today on a fully patched OS with no kernel or Java updates pending. Agents are Windows 10 and Windows Server 2019 running as services. Agent machines have malware countermeasures in place, which could be a problem. Exceptions are configured for the build environments to prevent cyber tool interference.
Note that I have added my sanitized log from today as evidence for reopening this issue.
I tried a restart of Jenkins now, and the agents all recovered as expected, but two of them went offline (with the red sign icon). This restart was caused by the main indicating that it was going to shut down (not initiated by anyone). The Thin Client plugin was doing a differential backup during this time, so maybe that plugin is the cause of the agents not connecting...
The two agents did recover with just a service restart.
I rebuilt my Jenkins on a clean Red Hat 8 install, and now the agents are very reliable. Where they would previously disconnect with every update, that is no longer happening. I think the problem may have been a JDK version mismatch, where one party was using JDK 8 instead of JDK 11. All parties are now on JDK 11.
We experience the same issue after upgrading from 2.319 to 2.332. Master and agent are running the same JDK 11. I had opened issue JENKINS-68122 (Slave connection broken (randomly) with error java.util.concurrent.TimeoutException).
The problem here is that it does not fail consistently; it just fails randomly.
I managed to almost pinpoint the issue (at least in its current incarnation). After updating to 2.346.2 everything was fine; then some plugin updates happened, and I started having serious problems with agent connections, and the agents that did connect were very slow to update their info on the Nodes page.
First I rolled back to 2.346.1, as that was the easiest thing to try; it didn't help.
After some more investigating, I noticed Jenkins also stopped sending any emails, and email plugin(s) generated a lot of errors in the logs:
WARNING jenkins.util.Listeners#lambda$notify$0
java.lang.NoSuchMethodError: 'javax.mail.Session hudson.tasks.Mailer$DescriptorImpl.createSession()'
	at org.jenkinsci.plugins.mailwatcher.MailWatcherMailer.send(MailWatcherMailer.java:116)
	at org.jenkinsci.plugins.mailwatcher.MailWatcherNotification.send(MailWatcherNotification.java:156)
	at org.jenkinsci.plugins.mailwatcher.WatcherComputerListener$Notification$Builder.send(WatcherComputerListener.java:181)
	at org.jenkinsci.plugins.mailwatcher.WatcherComputerListener.onOffline(WatcherComputerListener.java:91)
	at hudson.slaves.SlaveComputer.lambda$closeChannel$1(SlaveComputer.java:927)
	at jenkins.util.Listeners.lambda$notify$0(Listeners.java:59)
	at jenkins.util.Listeners.notify(Listeners.java:67)
	at hudson.slaves.SlaveComputer.closeChannel(SlaveComputer.java:927)
	at hudson.slaves.SlaveComputer.setChannel(SlaveComputer.java:756)
	at jenkins.slaves.DefaultJnlpSlaveReceiver.afterChannel(DefaultJnlpSlaveReceiver.java:175)
	at org.jenkinsci.remoting.engine.JnlpConnectionState.fire(JnlpConnectionState.java:337)
	at org.jenkinsci.remoting.engine.JnlpConnectionState.fireAfterChannel(JnlpConnectionState.java:428)
	at org.jenkinsci.remoting.engine.JnlpProtocol4Handler$Handler.lambda$onChannel$0(JnlpProtocol4Handler.java:334)
	at jenkins.util.ContextResettingExecutorService$1.run(ContextResettingExecutorService.java:28)
	at jenkins.security.ImpersonatingExecutorService$1.run(ImpersonatingExecutorService.java:68)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
	at java.base/java.lang.Thread.run(Unknown Source)
So I rolled back:
Mailer Plugin from 435.something to 414.something (excuse me, but could we please move back to actually useful and human friendly numbering?)
Email Extension from 2.90 to 2.89
And the problem was immediately solved. Not sure which one of these actually caused the issue, as I rolled them back simultaneously (also it was dependency hell, had to roll back about 10 more plugins because it was somehow crucial for them to have latest mail plugins, even though they don't send any mails).
Now I'm stuck with old blueocean/pipeline plugins due to weird dependencies, but at least agents work fine.
My Jenkins instances are running on Java 11 (latest Adoptium jre)
kredens could you report that message from the mailer plugin as a separate issue and include steps that will allow someone else to duplicate the issue from a fresh Jenkins installation?
I'm surprised that a plugin upgrade would have any impact on agent connection reliability. I'd like to do more investigation, but your message does not provide enough context to do more investigation.
The error in Mail Watcher is JENKINS-69088, which was fixed in jenkinsci/mail-watcher-plugin#11 and released in 1.17.
markewaite the only connection I can think of between agent connection reliability and those email plugins is that on most agents I have enabled email notifications for agent offline/online status; maybe without a properly working email "subsystem", something goes awry.
I've yet to try the updated Mail Watcher plugin; I will report back on whether the issue reappears once all three plugins are updated.
Still experiencing "Ping failed. Terminating the channel JNLP4-connect" / "TimeoutException" errors with Jenkins version 2.375.2 (and JDK 11).
Is there any workaround?
We still experience the issue with Jenkins 2.401.1, even though a workaround was included in the 2.387.2 release.
Here is the changelog link: https://www.jenkins.io/changelog-stable/#v2.387.2
The issue with our Jenkins server is that we are blocked from upgrading Jenkins to the latest release: the server was set up with a "docker run" command, so when I try to deploy the latest release "jenkins.war", the agent fails to connect to the Jenkins controller.
Your inputs / help would be greatly appreciated.
The Jenkins agent disconnects and reconnects after a few minutes without manual intervention. Is there any workaround for this issue?
Any input would be greatly appreciated. Thanks !
The issue with our Jenkins server is that we are blocked from upgrading Jenkins to the latest release: the server was set up with a "docker run" command, so when I try to deploy the latest release "jenkins.war", the agent fails to connect to the Jenkins controller.
That suggests that you are probably using the wrong technique to upgrade the container image.
The Jenkins war file inside the container image should not be upgraded. A new container image should be built with the newer Jenkins version. The new container image can then be tested to confirm it works in your environment. However, that is a question outside this issue. Please use the Jenkins community forum for questions and answers rather than the issue tracker.
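As an illustration only (the image tag, container name, and volume name here are placeholders, not your actual values), the usual pattern is to keep JENKINS_HOME on a volume and replace the container with one running the newer image:

# Pull the newer controller image and swap the container; JENKINS_HOME lives
# on the named volume, so jobs and configuration survive the replacement.
docker pull jenkins/jenkins:lts-jdk11
docker stop jenkins && docker rm jenkins
docker run -d --name jenkins \
  -p 8080:8080 -p 50000:50000 \
  -v jenkins_home:/var/jenkins_home \
  jenkins/jenkins:lts-jdk11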
Yeah, generally these messages appear when a virtualized agent VM/container gets terminated. In JENKINS-48616 I see the same for EC2. Any chance it is somehow related to Meltdown restarts? CC gjphilp