Alright, so I root caused most of this. While there certainly are issues around the error handling, the errors we saw are all caused by memory pressure. As the JVM begins to run out of memory, the finally blocks that should zero out the channel object never get called. This causes a sort of cascading failure that manifests in a number of ways, including the error message above: the number of threads jumps, reflection starts to hang (Job DSL starts to fail), etc.
For my instance, the root cause was the workspace cleanup plugin + node recycling. This was keeping channel objects around forever in some cases, causing a slow leak.
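To make the leak mechanism concrete, here is a minimal, hypothetical sketch (not the actual Jenkins remoting code): cleanup that looks safe because it sits in a finally block can itself allocate (here, the string concatenation for the log line), so under heavy memory pressure a secondary OutOfMemoryError can abort it before the channel reference is cleared, leaving the object reachable.

import java.util.ArrayList;
import java.util.List;

// Hypothetical names; illustrates how a finally-based cleanup can be defeated
// under memory pressure, leaving a "channel" object referenced forever.
public class ChannelHolder {
    private Object channel = new Object(); // stand-in for a remoting Channel

    void runJob() {
        try {
            doWork();
        } finally {
            // This line allocates (string concatenation). Under memory pressure
            // it can throw a secondary OutOfMemoryError, skipping the line below.
            System.out.println("Cleaning up channel " + channel);
            channel = null; // never reached if the statement above throws
        }
    }

    private void doWork() {
        // Simulate the work that exhausts the heap in the first place.
        List<byte[]> hog = new ArrayList<>();
        while (true) {
            hog.add(new byte[1 << 20]);
        }
    }

    public static void main(String[] args) {
        try {
            new ChannelHolder().runJob();
        } catch (Throwable t) {
            // If the cleanup in finally also failed, this holder still
            // references the channel: one slow leak per failed job.
            System.err.println("Job failed with: " + t);
        }
    }
}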
I would first verify whether memory is in fact the cause of the failure. I do the following:
Watch number of threads:
watch -n1 'find /proc/<jenkins pid>/task -maxdepth 1 -type d -print | wc -l'
Watch GC stats:
jstat -gccause -t -h25 <pid> 10s
If the number of threads starts to jump into the high thousands (depending on your heap setup), that's a good indication.
jstat will eventually show allocation failures and a high number of full GCs.
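If you want a cross-check from inside the JVM rather than from the shell, the standard java.lang.management beans report the same signals. This is a generic sketch; for Jenkins, the statements in main() could be run from the script console (plain Java like this is also valid Groovy).

import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.lang.management.MemoryUsage;
import java.lang.management.ThreadMXBean;

// In-JVM cross-check of the same signals: live thread count and heap usage.
public class JvmPressureCheck {
    public static void main(String[] args) {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        MemoryMXBean memory = ManagementFactory.getMemoryMXBean();
        MemoryUsage heap = memory.getHeapMemoryUsage();

        System.out.println("Live threads: " + threads.getThreadCount()
                + " (peak " + threads.getPeakThreadCount() + ")");
        System.out.println("Heap used: " + (heap.getUsed() >> 20) + " MiB of "
                + (heap.getMax() >> 20) + " MiB max");
    }
}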
Code changed in jenkins
User: James Nord
Path:
src/main/java/org/jenkinsci/remoting/protocol/impl/NIONetworkLayer.java
http://jenkins-ci.org/commit/remoting/ec9b5c13b879f44c04fa28ee6c8b113a165c9e57
Log:
Be extra defensive about Errors and Exceptions
JENKINS-39835 Be even more defensive then against leaving connections dangling.
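For context, the pattern the commit message describes is roughly the following, sketched here with hypothetical names rather than the actual NIONetworkLayer code: catch Throwable rather than just IOException, close the connection before rethrowing, so an Error such as OutOfMemoryError cannot leave the socket dangling.

import java.io.Closeable;
import java.io.IOException;

// Illustrative only: a read loop that is defensive about Errors as well as
// Exceptions, so the connection is released even when something fatal is thrown.
public class DefensiveReader {
    private final Closeable connection;

    public DefensiveReader(Closeable connection) {
        this.connection = connection;
    }

    void pump(Runnable readLoop) throws IOException {
        try {
            readLoop.run();
        } catch (Throwable t) {          // Errors included, not just Exceptions
            closeQuietly();              // make sure the connection is released
            if (t instanceof Error) {
                throw (Error) t;         // rethrow fatal errors after cleanup
            }
            throw new IOException("read loop failed", t);
        }
    }

    private void closeQuietly() {
        try {
            connection.close();
        } catch (IOException ignored) {
            // Best effort: never mask the original failure during cleanup.
        }
    }
}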