I was going to raise a second defect, but I think this is similar enough.
When the problem occurs, each slave's console shows 'Connected', but the master shows them all as disconnected. The only way we have found to recover so far is to restart Jenkins.
We are running the master on Windows Server 2012, on VMware. We are running about 70 slaves, a mix of OS X 10.9, Windows 7, and SUSE Linux Enterprise Desktop (SLED) 11 on VMware, plus some other variants. We are running Jenkins 1.563.
This issue has occurred three times for us. Two of the cases are independent; the other occurred shortly after the first, and the JVM was not restarted in between, so recovery from the first occurrence may not have been complete. We have not identified a trigger for this problem.
The thread count starts to increase linearly once the problem occurs, but we believe this is a symptom rather than the cause. The JavaMelody Monitoring Plugin reports the machine's thread count in two different places, and the numbers can differ: the graph showed 4000 (the instance was running but unresponsive for 30 hours), while the thread count shown below the graph was 400. I believe the first figure may be the JVM's total thread count while the second is Jenkins' own. In normal operation we see about 200 threads. (However, we have since restarted, so I am not 100% sure this is correct.)
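As a way to cross-check which figure reflects the JVM's actual thread count, a small sketch using the standard `ThreadMXBean` API (this is our own diagnostic idea, not something the JavaMelody plugin documents):

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;

public class ThreadCountCheck {
    public static void main(String[] args) {
        // Total live threads in the JVM, including daemon threads.
        // This is the kind of figure a monitoring graph would chart.
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        int jvmThreads = mx.getThreadCount();

        // A second, independent estimate: walk up to the root thread
        // group and count its active threads.
        ThreadGroup root = Thread.currentThread().getThreadGroup();
        while (root.getParent() != null) {
            root = root.getParent();
        }
        int groupEstimate = root.activeCount();

        System.out.println("JVM live threads: " + jvmThreads);
        System.out.println("Root group estimate: " + groupEstimate);
    }
}
```

Running something like this (for example from the Jenkins script console) on a stuck instance would show whether the 4000 figure is real JVM threads or a plugin artifact.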
We see the following messages in the error log. The same exception occurs for each of our slaves within a short period of time.
Jul 31, 2014 5:13:17 AM jenkins.slaves.JnlpSlaveAgentProtocol$Handler$1 onClosed
WARNING: NioChannelHub keys=86 gen=1625477529: Computer.threadPoolForRemoting 58 for + XXXXXXXX terminated
java.io.IOException: Failed to abort
at java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
at java.util.concurrent.FutureTask.run(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
at java.lang.Thread.run(Unknown Source)
Caused by: java.nio.channels.ClosedChannelException
at sun.nio.ch.SocketChannelImpl.shutdownInput(Unknown Source)
at sun.nio.ch.SocketAdaptor.shutdownInput(Unknown Source)
... 6 more
In the first case, we also saw ping timeouts at about the same time as the problem; these were not present in the other cases. In the latest case, a single slave lost network connectivity and we saw this exception before the 'crash' happened. However, I believe this to be a coincidence: the exception appears in the logs from time to time without all slaves losing connectivity.
We see other exceptions in the logs, but these appear to be related to us shutting down idle machines, or to the Disk Usage plugin, and seem unrelated to this issue.
Last week, we increased the load on our master from about 40 slaves to 70 and also increased the number of jobs. Before this, we had not seen this problem.
We are planning to upgrade to pick up the (now reopened) fix for 22932.