Loading...

Type: Bug
Resolution: Unresolved
Priority: Major
Component/s: remoting
Labels:
- jenkins
- remoting
- slave
Environment:
Jenkins 1.609.1 LTS. Attaching support plugin bundle with additional details.

Similar Issues:

Show
Epic Link:
Remoting - Agent Stability improvements

Two days ago we encountered an issue where all builds attempted on two of our nodes failed. Console output for each failed build was the following:

ERROR: SEVERE ERROR occurs 
org.jenkinsci.lib.envinject.EnvInjectException: hudson.remoting.ChannelClosedException: channel is already closed 
at org.jenkinsci.plugins.envinject.service.EnvironmentVariablesNodeLoader.gatherEnvironmentVariablesNode(EnvironmentVariablesNodeLoader.java:75)
at org.jenkinsci.plugins.envinject.EnvInjectListener.loadEnvironmentVariablesNode(EnvInjectListener.java:81) 
at org.jenkinsci.plugins.envinject.EnvInjectListener.setUpEnvironment(EnvInjectListener.java:39) 
at hudson.model.AbstractBuild$AbstractBuildExecution.createLauncher(AbstractBuild.java:574) 
at hudson.model.AbstractBuild$AbstractBuildExecution.run(AbstractBuild.java:494) 
at hudson.model.Run.execute(Run.java:1741) 
at hudson.model.FreeStyleBuild.run(FreeStyleBuild.java:43) 
at hudson.model.ResourceController.execute(ResourceController.java:98) 
at hudson.model.Executor.run(Executor.java:374) 
Caused by: hudson.remoting.ChannelClosedException: channel is already closed 
at hudson.remoting.Channel.send(Channel.java:550) 
at hudson.remoting.Request.call(Request.java:129) 
at hudson.remoting.Channel.call(Channel.java:752) 
at hudson.FilePath.act(FilePath.java:1073) 
at org.jenkinsci.plugins.envinject.service.EnvironmentVariablesNodeLoader.gatherEnvironmentVariablesNode(EnvironmentVariablesNodeLoader.java:44)
... 8 more 
Caused by: java.io.IOException 
at hudson.remoting.Channel.close(Channel.java:1109) 
at hudson.slaves.ChannelPinger$1.onDead(ChannelPinger.java:118) 
at hudson.remoting.PingThread.ping(PingThread.java:126) 
at hudson.remoting.PingThread.run(PingThread.java:85) 
Caused by: java.util.concurrent.TimeoutException: Ping started at 1444039585493 hasn't completed by 1444039825494 
... 2 more

These are windows slaves (JNLP) which run the slave jar as a service. We attempted to restart the slave services first, then rebooted the slaves and ultimately has to restart the Jenkins master to recover from the problem.

Our after action root cause analysis turned up the following in the Jenkins masters's log file:

Exception in thread "Channel reader thread: DC-JENWIN-001" java.lang.OutOfMemoryError: Java heap space
at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:75)
Exception in thread "Channel reader thread: DC-JENWIN-003" java.lang.OutOfMemoryError: Java heap space
at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:75)

These entries are remarkable in that they don't have any date-time stamps. It seems as if they were not emitted by the normal logger. None the less by looking at the log entries immediately preceding and following them we were able to localize the entries to within a minute accuracy and they are clearly the cause of the problem.

Moreover, after these two entires occurred, entries similar to the following appears in the master log a regular intervals:

Oct 05, 2015 10:08:54 AM com.cloudbees.jenkins.support.AsyncResultCache run
INFO: Could not retrieve metrics data from DC-JENWIN-001 for caching
java.util.concurrent.TimeoutException
at hudson.remoting.Request$1.get(Request.java:272)
at hudson.remoting.Request$1.get(Request.java:206)
at hudson.remoting.FutureAdapter.get(FutureAdapter.java:59)
at com.cloudbees.jenkins.support.AsyncResultCache.run(AsyncResultCache.java:95)
at jenkins.util.ContextResettingExecutorService$1.run(ContextResettingExecutorService.java:28)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

These errors appeared for both of and only the affected slave nodes for retrieving "metrics" retrieving the "java version", retrieving the "environment", and all the various other statuses.

Bottom Line

Connectivity to these slaves suffered a critical and apparently unrecoverable failure. This failure originated on the the Master side of the connection due apparently to a heap space issue. Yet the master did not properly recognizing the nodes as off line and continued to send builds to them. Given that jobs tend to be "sticky" to a certain slave as long as it is available, this effectively rendered several critical jobs un-buildable.

Recommended Resolution.

Though exact reproduction of this problem may be impossible, it is hopped that some steps can be taken to enable Jenkins to correctly identify this problem and take corrective or at least mitigating action, such as re-initializing whatever thread/component was inoperable or marking the node as off-line.

- - Sort By Name
  - Sort By Date
  - Ascending
  - Descending
  - Thumbnails
  - List
  - Download All

support (1).zip
5.06 MB
2015-10-07 12:08

is related to

JENKINS-28492 The server rejected the connection: *** is already connected to this master. Rejecting this connection.

Resolved

links to

CloudBees Internal OSS-809

Details

Description

Bottom Line

Recommended Resolution.

Attachments

Attachments

Issue Links

Activity

People

Dates