Type: Bug
Resolution: Unresolved
Priority: Major
Environment: Windows 7, Windows Server 2008
I am running Jenkins 1.570.
Occasionally, out of the blue, a large chunk of my Jenkins slaves will go offline and, more importantly, stay offline until Jenkins is rebooted. All of the slaves that go offline this way report the following as the reason:
The current peer is reconnecting.
If I look in my Jenkins logs, I see this for some of my slaves that remain online:
Aug 07, 2014 11:13:07 AM INFO hudson.TcpSlaveAgentListener$ConnectionHandler run
Accepted connection #2018 from /172.16.100.79:51299
Aug 07, 2014 11:13:07 AM WARNING jenkins.slaves.JnlpSlaveHandshake error
TCP slave agent connection handler #2018 with /172.16.100.79:51299 is aborted: dev-build-03 is already connected to this master. Rejecting this connection.
Aug 07, 2014 11:13:07 AM WARNING jenkins.slaves.JnlpSlaveHandshake error
TCP slave agent connection handler #2018 with /172.16.100.79:51299 is aborted: Unrecognized name: dev-build-03
The logs are flooded with entries like these, with another pair arriving every second.
Lastly, there is one slave that Jenkins still shows as online but that should be offline: that machine is fully shut down, yet Jenkins sees it as fully online. All of the offline slaves are running Jenkins' slave.jar in headless mode, so I can see their console output. Each of them believes that, on its end, it is "Online", but Jenkins itself shows them all as offline.
This bug has been haunting me for quite a while now, and it is killing production for me. I really need to know if there's a fix for this, or at the very least a version of Jenkins I can downgrade to that doesn't have this issue.
Thank you!
Attachments:
- jenkins-slave.0.err.log (427 kB)
- masterJenkins.log (370 kB)
- log.txt (220 kB)
Is related to:
- JENKINS-25218 Channel hangs due to the infinite loop in FifoBuffer within the lock (Resolved)
- JENKINS-39231 WinSW: Automatically terminate runaway processes in Windows services (Resolved)
[JENKINS-24155] Jenkins Slaves Go Offline In Large Quantities and Don't Reconnect Until Reboot
What's the last Jenkins version that did not have this problem? When you downgrade, does it go away?
I'm afraid I am infrequent with updates and have always had issues with my nodes in one way or another, so it's hard to pinpoint exactly when this started. I would say at least since 1.565, but probably before then too.
When I say that the node still claims it is connected, I am referring to the console log that is displayed on the node itself. Jenkins still sees the node as offline and says "The current peer is reconnecting." in the node status.
I've seen the same issue on a Windows slave running a self-built version of 1.577-SNAPSHOT. The slave error log suggests that the slave saw a connection reset, but when it reconnected the master thought the slave was still connected and the connection retries failed.
Aug 17, 2014 11:05:02 PM hudson.remoting.SynchronousCommandTransport$ReaderThread run
SEVERE: I/O error in channel channel
java.net.SocketException: Connection reset
    at java.net.SocketInputStream.read(Unknown Source)
    at java.net.SocketInputStream.read(Unknown Source)
    at java.io.BufferedInputStream.fill(Unknown Source)
    at java.io.BufferedInputStream.read(Unknown Source)
    at hudson.remoting.FlightRecorderInputStream.read(FlightRecorderInputStream.java:82)
    at hudson.remoting.ChunkedInputStream.readHeader(ChunkedInputStream.java:67)
    at hudson.remoting.ChunkedInputStream.readUntilBreak(ChunkedInputStream.java:93)
    at hudson.remoting.ChunkedCommandTransport.readBlock(ChunkedCommandTransport.java:33)
    at hudson.remoting.AbstractSynchronousByteArrayCommandTransport.read(AbstractSynchronousByteArrayCommandTransport.java:34)
    at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:48)
Aug 17, 2014 11:05:02 PM hudson.remoting.jnlp.Main$CuiListener status
INFO: Terminated
Aug 17, 2014 11:05:12 PM jenkins.slaves.restarter.JnlpSlaveRestarterInstaller$2$1 onReconnect
INFO: Restarting slave via jenkins.slaves.restarter.WinswSlaveRestarter@5f849b
Aug 17, 2014 11:05:17 PM hudson.remoting.jnlp.Main createEngine
INFO: Setting up slave: Cygnet
Aug 17, 2014 11:05:17 PM hudson.remoting.jnlp.Main$CuiListener <init>
INFO: Jenkins agent is running in headless mode.
Aug 17, 2014 11:05:17 PM hudson.remoting.jnlp.Main$CuiListener status
INFO: Locating server among [http://jenkins.example/]
Aug 17, 2014 11:05:17 PM hudson.remoting.jnlp.Main$CuiListener status
INFO: Connecting to jenkins.example:42715
Aug 17, 2014 11:05:17 PM hudson.remoting.jnlp.Main$CuiListener status
INFO: Handshaking
Aug 17, 2014 11:05:17 PM hudson.remoting.jnlp.Main$CuiListener error
SEVERE: The server rejected the connection: Cygnet is already connected to this master. Rejecting this connection.
java.lang.Exception: The server rejected the connection: Cygnet is already connected to this master. Rejecting this connection.
    at hudson.remoting.Engine.onConnectionRejected(Engine.java:306)
    at hudson.remoting.Engine.run(Engine.java:276)
Aug 17, 2014 11:06:17 PM hudson.remoting.jnlp.Main createEngine
INFO: Setting up slave: Cygnet
Aug 17, 2014 11:06:17 PM hudson.remoting.jnlp.Main$CuiListener <init>
INFO: Jenkins agent is running in headless mode.
Aug 17, 2014 11:06:17 PM hudson.remoting.jnlp.Main$CuiListener status
INFO: Locating server among [http://jenkins.example/]
Aug 17, 2014 11:06:17 PM hudson.remoting.jnlp.Main$CuiListener status
INFO: Connecting to jenkins.example:42715
Aug 17, 2014 11:06:17 PM hudson.remoting.jnlp.Main$CuiListener status
INFO: Handshaking
Aug 17, 2014 11:06:17 PM hudson.remoting.jnlp.Main$CuiListener error
SEVERE: The server rejected the connection: Cygnet is already connected to this master. Rejecting this connection.
java.lang.Exception: The server rejected the connection: Cygnet is already connected to this master. Rejecting this connection.
    at hudson.remoting.Engine.onConnectionRejected(Engine.java:306)
    at hudson.remoting.Engine.run(Engine.java:276)
Aug 17, 2014 11:07:18 PM hudson.remoting.jnlp.Main createEngine
INFO: Setting up slave: Cygnet
Aug 17, 2014 11:07:18 PM hudson.remoting.jnlp.Main$CuiListener <init>
INFO: Jenkins agent is running in headless mode.
Aug 17, 2014 11:07:18 PM hudson.remoting.jnlp.Main$CuiListener status
INFO: Locating server among [http://jenkins.example/]
Aug 17, 2014 11:07:18 PM hudson.remoting.jnlp.Main$CuiListener status
INFO: Connecting to jenkins.example:42715
Aug 17, 2014 11:07:18 PM hudson.remoting.jnlp.Main$CuiListener status
INFO: Handshaking
Aug 17, 2014 11:07:18 PM hudson.remoting.jnlp.Main$CuiListener error
SEVERE: The server rejected the connection: Cygnet is already connected to this master. Rejecting this connection.
java.lang.Exception: The server rejected the connection: Cygnet is already connected to this master. Rejecting this connection.
    at hudson.remoting.Engine.onConnectionRejected(Engine.java:306)
    at hudson.remoting.Engine.run(Engine.java:276)
After 3 retries at restarting, the Windows service restarter gave up, and unfortunately I didn't attempt to reconnect until after I had restarted the master over 12 hours later.
The equivalent part of the master's log is as follows (only the first restart is included here, but the others are equivalent).
Aug 17, 2014 11:05:18 PM hudson.TcpSlaveAgentListener$ConnectionHandler run
INFO: Accepted connection #7 from /192.168.1.115:60293
Aug 17, 2014 11:05:18 PM jenkins.slaves.JnlpSlaveHandshake error
WARNING: TCP slave agent connection handler #7 with /192.168.1.115:60293 is aborted: Cygnet is already connected to this master. Rejecting this connection.
Aug 17, 2014 11:05:18 PM jenkins.slaves.JnlpSlaveHandshake error
WARNING: TCP slave agent connection handler #7 with /192.168.1.115:60293 is aborted: Unrecognized name: Cygnet
Aug 17, 2014 11:06:19 PM hudson.TcpSlaveAgentListener$ConnectionHandler run
INFO: Accepted connection #8 from /192.168.1.115:60308
The master log does not show any evidence of the slave connection being broken in the 12 hours before the master restart.
If the master does not detect that the connection has closed, it will still think that the slave is connected and will refuse reconnections, as seen above.
Sadly I didn't get a stacktrace of the master's threads to see if any threads were blocked anywhere.
I wonder whether this issue is related to the changes made circa 4th April 2014 to switch JNLP slaves to NIO.
See commit d4c74bf35d4 in 1.599 and also the corresponding changes made in remoting 2.38.
It may also be related to JENKINS-23248, but my build was definitely using the integrated fixes for that, so they cannot be the whole solution.
After a bit of experimenting I can reproduce this scenario fairly easily using an Ubuntu 12.04 master and a Windows 7 slave (both running Java 7). Once the slave is connected, I forcibly suspend the Windows 7 computer and monitor the TCP connection between the master and slave (on the master) using netstat.
$ netstat -na | grep 42715
tcp6       0      0 :::42715                :::*                    LISTEN
tcp6       0      0 192.168.1.23:42715      192.168.1.24:58905      ESTABLISHED
tcp6       0   2479 192.168.1.23:42715      192.168.1.115:61283     ESTABLISHED
When the master attempts to "ping" the slave the slave does not respond and the TCP send queue builds up (2479 in the example above).
Once the queue has been building for a few minutes, bring the Windows 7 machine back to life and let things recover naturally.
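For anyone trying to reproduce this, a loop like the following records how the send queue grows over time (just a sketch; it assumes the JNLP agent port is 42715 as in my setup and appends a timestamped netstat snapshot every few seconds):

# Hypothetical monitoring loop; adjust the port to match your master's JNLP port
$ while true; do date; netstat -na | grep 42715; sleep 5; done >> /tmp/jnlp-queue.log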
I observe that the Windows 7 machine issues a TCP RST on the connection, but the Linux master does not seem to react to the RST and continues to add data to the send queue.
During this time the slave has attempted to restart the connection and failed, because the master thinks that the slave is still connected. The Windows slave service stops attempting a restart after a couple of failures.
After a few minutes the channel pinger on the master declares that the slave is dead
Aug 19, 2014 9:34:24 PM hudson.slaves.ChannelPinger$1 onDead
INFO: Ping failed. Terminating the channel.
java.util.concurrent.TimeoutException: Ping started on 1408480224640 hasn't completed at 1408480464640
    at hudson.remoting.PingThread.ping(PingThread.java:120)
    at hudson.remoting.PingThread.run(PingThread.java:81)
But even at this time the TCP stream stays open and the slave connection continues to operate.
After a further 10 minutes the connection does close. It seems like this is a standard TCP timeout.
WARNING: Communication problem
java.io.IOException: Connection timed out
    at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
    at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
    at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
    at sun.nio.ch.IOUtil.read(IOUtil.java:197)
    at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:379)
    at org.jenkinsci.remoting.nio.FifoBuffer$Pointer.receive(FifoBuffer.java:136)
    at org.jenkinsci.remoting.nio.FifoBuffer.receive(FifoBuffer.java:306)
    at org.jenkinsci.remoting.nio.NioChannelHub.run(NioChannelHub.java:514)
    at jenkins.util.ContextResettingExecutorService$1.run(ContextResettingExecutorService.java:28)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
    at java.util.concurrent.FutureTask.run(FutureTask.java:262)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:744)
Aug 19, 2014 9:44:13 PM jenkins.slaves.JnlpSlaveAgentProtocol$Handler$1 onClosed
WARNING: NioChannelHub keys=2 gen=2823: Computer.threadPoolForRemoting [#2] for + Cygnet terminated
java.io.IOException: Failed to abort
    at org.jenkinsci.remoting.nio.NioChannelHub$NioTransport.abort(NioChannelHub.java:195)
    at org.jenkinsci.remoting.nio.NioChannelHub.run(NioChannelHub.java:581)
    at jenkins.util.ContextResettingExecutorService$1.run(ContextResettingExecutorService.java:28)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
    at java.util.concurrent.FutureTask.run(FutureTask.java:262)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:744)
Caused by: java.io.IOException: Connection timed out
    at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
    at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
    at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
    at sun.nio.ch.IOUtil.read(IOUtil.java:197)
    at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:379)
    at org.jenkinsci.remoting.nio.FifoBuffer$Pointer.receive(FifoBuffer.java:136)
    at org.jenkinsci.remoting.nio.FifoBuffer.receive(FifoBuffer.java:306)
    at org.jenkinsci.remoting.nio.NioChannelHub.run(NioChannelHub.java:514)
    ... 6 more
There does not seem to be a "fail fast" method in operation. It isn't clear whether this is due to the Linux networking stack or whether Java could have failed a lot quicker when it determined that the connection ping had timed out.
Sadly it is not immediately obvious that the disconnect could be done instantly because it all seems to be tied up with standard TCP retries.
One potential workaround to try is adding
-Djenkins.slaves.NioChannelSelector.disabled=true
to the Jenkins master's launcher command-line arguments. On Debian/Ubuntu that is as simple as adding the option above to JAVA_ARGS in /etc/default/jenkins.
If launching Jenkins from the command line, it would be
java -Djenkins.slaves.NioChannelSelector.disabled=true -jar jenkins.war
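For reference, on a Debian/Ubuntu package install the resulting line in /etc/default/jenkins might look like this (a sketch only; keep whatever flags are already in your JAVA_ARGS and just append the new one):

# /etc/default/jenkins -- sketch; -Djava.awt.headless=true is the usual packaged default
JAVA_ARGS="-Djava.awt.headless=true -Djenkins.slaves.NioChannelSelector.disabled=true"

The master needs to be restarted (e.g. sudo service jenkins restart) for the option to take effect.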
I just tested this on my system and it does seem to change the behaviour when I run my test case. In 3 tests the slave continued working correctly on all 3 occasions. In 2 of these the queued traffic was simply delivered and things continued as before; in the other, the original TCP connection entered the TIME_WAIT state and a new connection was started successfully by the recently suspended slave.
Wed Aug 20 11:18:23 BST 2014
tcp6       0      0 :::42715                :::*                    LISTEN
tcp6       0      0 192.168.1.23:42715      192.168.1.115:50570     TIME_WAIT
tcp6       0      0 192.168.1.23:42715      192.168.1.24:47835      ESTABLISHED
tcp6       0      0 192.168.1.23:42715      192.168.1.115:50619     ESTABLISHED
From this I suspect that the new NIO-based method of communicating with slaves does not cause the underlying TCP socket to be closed until the TCP timers time out, whereas with the thread-per-slave method the connection is torn down almost immediately.
It would be good to know if the NioChannelSelector workaround outlined above helps others.
What's likely happening is that the NIO selector thread has died on the master, which then terminates all the JNLP connections via NioChannelHub.abortAll(). That is consistent with the observation that the problem persists until Jenkins restarts. If anyone can take a thread dump on the master when Jenkins is in this state, we can verify this by checking whether said thread is still listed.
When the NIO selector thread is killed, it leaves a message in the Jenkins log. I'd like you to look for it, so that we can see why the NIO selector thread was killed. The reconnecting slaves failing to get through is a red herring.
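For example, something along these lines should surface the relevant entries (a sketch; it assumes the default Debian/Ubuntu log location and simply greps for the selector class name rather than the exact message text):

# Sketch: look for the NIO selector's death notice around the time the slaves dropped
$ grep -i "NioChannelHub" /var/log/jenkins/jenkins.log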
My slave and server are in this state now. How do I take a thread dump?
funeeldy: Go to the /threadDump URL, or install the Support Core Plugin, go to /support, and generate a bundle that contains at least the thread dumps.
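If the web UI is too unresponsive to reach those pages, a couple of command-line alternatives (sketches; substitute your own master URL, admin credentials, and process ID):

# jstack ships with the JDK; run it as the user that owns the Jenkins process
$ sudo -u jenkins jstack -l <jenkins-pid> > /tmp/master-threads.txt
# or fetch the built-in /threadDump page with an admin user and API token
$ curl -u admin:APITOKEN http://your-jenkins-host/threadDump > /tmp/master-threads.txt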
$200 is up for grabs for this issue on freedomsponsors.org:
https://freedomsponsors.org/issue/591/jenkins-slaves-go-offline-in-large-quantities-and-dont-reconnect-until-reboot
We have this problem; I have a thread dump here:
http://pastebin.com/chMPus8C
Relevant bits from the thread dump.
NioChannelHub (and a few other things not included for compactness) is waiting on Channel@2d23a3cd:
NioChannelHub keys=4 gen=5352736: Computer.threadPoolForRemoting [#2]
"NioChannelHub keys=4 gen=5352736: Computer.threadPoolForRemoting [#2]" Id=144 Group=main BLOCKED on hudson.remoting.Channel@2d23a3cd owned by "Finalizer" Id=3
    at hudson.remoting.Channel.terminate(Channel.java:804)
    -  blocked on hudson.remoting.Channel@2d23a3cd
    at hudson.remoting.Channel$2.terminate(Channel.java:491)
    at hudson.remoting.AbstractByteArrayCommandTransport$1.terminate(AbstractByteArrayCommandTransport.java:72)
    at org.jenkinsci.remoting.nio.NioChannelHub$NioTransport.abort(NioChannelHub.java:211)
    at org.jenkinsci.remoting.nio.NioChannelHub.run(NioChannelHub.java:631)
    at jenkins.util.ContextResettingExecutorService$1.run(ContextResettingExecutorService.java:28)
    at java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
    at java.util.concurrent.FutureTask$Sync.innerRun(Unknown Source)
    at java.util.concurrent.FutureTask.run(Unknown Source)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
    at java.lang.Thread.run(Unknown Source)

    Number of locked synchronizers = 1
    - java.util.concurrent.ThreadPoolExecutor$Worker@1b9eef5
This is held by the Finalizer, which is waiting on FifoBuffer@782372ec:
Finalizer "Finalizer" Id=3 Group=system WAITING on org.jenkinsci.remoting.nio.FifoBuffer@782372ec at java.lang.Object.wait(Native Method) - waiting on org.jenkinsci.remoting.nio.FifoBuffer@782372ec at java.lang.Object.wait(Object.java:503) at org.jenkinsci.remoting.nio.FifoBuffer.write(FifoBuffer.java:336) at org.jenkinsci.remoting.nio.FifoBuffer.write(FifoBuffer.java:324) at org.jenkinsci.remoting.nio.NioChannelHub$NioTransport.writeBlock(NioChannelHub.java:222) at hudson.remoting.AbstractByteArrayCommandTransport.write(AbstractByteArrayCommandTransport.java:83) at hudson.remoting.Channel.send(Channel.java:553) - locked hudson.remoting.Channel@2d23a3cd at hudson.remoting.RemoteInvocationHandler.finalize(RemoteInvocationHandler.java:240) at java.lang.ref.Finalizer.invokeFinalizeMethod(Native Method) at java.lang.ref.Finalizer.runFinalizer(Unknown Source) at java.lang.ref.Finalizer.access$100(Unknown Source) at java.lang.ref.Finalizer$FinalizerThread.run(Unknown Source)
It looks like the Finalizer is trying to write to a channel to clean up an object with remote state, and has effectively locked things up because nothing can come along and force the FifoBuffer write to terminate.
I'm still having this issue on 1.580.2.
I have a large number of slaves that reboot as part of our testing process (tests complete, systems shut down).
These slaves go offline in this fashion.
Still happens, daily.
Jenkins 1.596.1 with all plugins up to date.
I encountered the same problem with my Jenkins master (version 1.565) installed on a Debian machine and slaves installed on Windows 7 machines.
When the slaves disconnect and reconnect, the Jenkins master does not detect the reconnection of the slaves.
I followed the proposed solution of adding the following to the JAVA_ARGS variable in /etc/default/jenkins:
-Djenkins.slaves.NioChannelSelector.disabled=true
This has solved the problem, but since it is only a workaround, is there a permanent solution?
We have "peer reconnecting" issue ones per 2 weeks. After applying the WA proposed by Amal connections are crashing during heavy jobs (high CPU load and long duration) on regular basis.
So be aware...
I've been told that this issue is the same as JENKINS-28844 and has been resolved in the 1.609.3 LTS.
We are facing this issue on Jenkins ver. 1.605.
On most of the offline slaves I am seeing:
"JNLP agent connected from /x.y.z.a" in the node log.
Here is a link to the thread dump of the affected Jenkins instance:
http://pastebin.com/9hUR1Awf
Please provide a temporary workaround so that this can be avoided in the future.
Note:
We are using 50+ nodes on a single master.
Same problem for several days with Jenkins 2.23. Here is an extract of the log showing:
- first, an 'OutOfMemory' error
- then, repeated 'java.lang.OutOfMemoryError: unable to create new native thread' errors
- then, disconnection of all slaves
2 slaves were not disconnected; slave.jar is more recent on those. I will update slave.jar on all of them and check whether it happens again (also waiting for the automatic update of slave.jar files, which is pending in another ticket).
Hope this helps.
In my case this was an OutOfMemory problem: I fixed it by increasing -Xmx in the Jenkins arguments, and everything seems to be OK since.
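In case it helps anyone else, that change amounts to adding an -Xmx setting to the master's startup arguments, e.g. in /etc/default/jenkins on a Debian-style install (a sketch; pick a heap size that suits your hardware and keep your existing flags):

# Sketch: raise the master's maximum heap to 2 GB
JAVA_ARGS="-Djava.awt.headless=true -Xmx2g"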
We are also facing the same issue on Jenkins 1.624. I had to reboot it. Please, someone help. This looks like it's been going on for a while.
Just seen the same issue on a Jenkins 1.642.1 Linux master. The fix was to restart Tomcat, and the Windows slaves reconnected automatically.
Found several instances of "Ping started at xxxxxx hasn't completed by xxxxxxx" in the logs.
Is setting the jenkins.slaves.NioChannelSelector.disabled property to true a viable workaround?
Same issue here. Only this time, my tests never get done. The slaves are always dropping during the tests. Please help!
I am not sure we can proceed much on this issue. Just to summarize changes related to several reports above...
- Jenkins 2.50+ introduced runaway-process termination in the Windows service wrapper (WinSW). It should help with the "is already connected to this master" issues reported for Windows service agents. See JENKINS-39231.
- Whatever happens in Jenkins after an "OutOfMemory" exception belongs to the "undefined behavior" area. Jenkins should ideally switch to a disabled state after it, since the impact is not predictable.
- JENKINS-25218 introduced fixes to the FifoBuffer handling logic; all fixes are available in 2.60.1.
In order to proceed with this issue, I need somebody to confirm it still happens on 2.60.1 and to provide new diagnostics info.
I'm having what seems to be this issue with Jenkins 2.138.3.
Every 3-4 days all the slave nodes go offline, although there seems to be no network problem. They come back online once the master has been restarted.
Attached you'll find the logs jenkins-slave.0.err.log and masterJenkins.log.
Can one of you do the following? To help narrow down the possible leak areas, it will be useful to capture process memory usage and JVM heap usage. Start your master process as normal, then start 2 tools on the system and redirect their output to separate files. Both tools have low system resource usage.
Memory stats can be captured using pidstat, specifically the resident set size:
$ pidstat -r -p <pid> 8 > /tmp/pidstat-capture.txt
JVM heap size and GC behavior can be captured with jstat, specifically the percentage of reclaimed heap space after a full collection:
$ jstat -gcutil -t -h12 <pid> 8s > /tmp/jstat-capture.txt
Please attach the generated files to this issue.
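If it is easier, both captures can be started together with a small wrapper (a sketch; it assumes pidstat from the sysstat package and jstat from the JDK are on the PATH, and takes the master's PID as its only argument):

#!/bin/sh
# Hypothetical helper: run both captures in the background until interrupted
PID="$1"
pidstat -r -p "$PID" 8 > /tmp/pidstat-capture.txt &
jstat -gcutil -t -h12 "$PID" 8s > /tmp/jstat-capture.txt &
wait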
Why would Jenkins think the nodes are already connected? Does it show the computers as still online? Is there anything in a specific node's log (assuming the excerpt is from the main Jenkins log)?