Two days ago we encountered an issue where all builds attempted on two of our nodes failed. Console output for each failed build was the following:
ERROR: SEVERE ERROR occurs
org.jenkinsci.lib.envinject.EnvInjectException: hudson.remoting.ChannelClosedException: channel is already closed
Caused by: hudson.remoting.ChannelClosedException: channel is already closed
... 8 more
Caused by: java.io.IOException
Caused by: java.util.concurrent.TimeoutException: Ping started at 1444039585493 hasn't completed by 1444039825494
... 2 more
These are windows slaves (JNLP) which run the slave jar as a service. We attempted to restart the slave services first, then rebooted the slaves and ultimately has to restart the Jenkins master to recover from the problem.
Our after action root cause analysis turned up the following in the Jenkins masters's log file:
Exception in thread "Channel reader thread: DC-JENWIN-001" java.lang.OutOfMemoryError: Java heap space
Exception in thread "Channel reader thread: DC-JENWIN-003" java.lang.OutOfMemoryError: Java heap space
These entries are remarkable in that they don't have any date-time stamps. It seems as if they were not emitted by the normal logger. None the less by looking at the log entries immediately preceding and following them we were able to localize the entries to within a minute accuracy and they are clearly the cause of the problem.
Moreover, after these two entires occurred, entries similar to the following appears in the master log a regular intervals:
Oct 05, 2015 10:08:54 AM com.cloudbees.jenkins.support.AsyncResultCache run
INFO: Could not retrieve metrics data from DC-JENWIN-001 for caching
These errors appeared for both of and only the affected slave nodes for retrieving "metrics" retrieving the "java version", retrieving the "environment", and all the various other statuses.
Connectivity to these slaves suffered a critical and apparently unrecoverable failure. This failure originated on the the Master side of the connection due apparently to a heap space issue. Yet the master did not properly recognizing the nodes as off line and continued to send builds to them. Given that jobs tend to be "sticky" to a certain slave as long as it is available, this effectively rendered several critical jobs un-buildable.
Though exact reproduction of this problem may be impossible, it is hopped that some steps can be taken to enable Jenkins to correctly identify this problem and take corrective or at least mitigating action, such as re-initializing whatever thread/component was inoperable or marking the node as off-line.