JENKINS-27514: Core - Thread spikes in Computer.threadPoolForRemoting leading to eventual server OOM

    • Type: Epic
    • Resolution: Unresolved
    • Priority: Major
    • Epic Name: Core - Thread spikes in Computer.threadPoolForRemoting

      This issue has been converted to an EPIC because there are reports of various independent issues inside it.

      Issue:

      • Remoting threadPool is being widely used in Jenkins: https://github.com/search?q=org%3Ajenkinsci+threadPoolForRemoting&type=Code
      • For starters, not all usages of Computer.threadPoolForRemoting are valid
      • Computer.threadPoolForRemoting has downscaling logic: idle threads are killed after a 60-second timeout
      • The pool has no thread limit by default, so it may grow without bound until the number of threads kills the JVM or causes an OOM
      • Some Jenkins use-cases cause burst Computer.threadPoolForRemoting load by design (e.g. Jenkins startup or agent reconnection after a failure)
      • Deadlocks or long waits in the thread pool may also make it grow without bound

      Proposed fixes:

      • Define usage policy for this thread pool in the documentation
      • Limit the number of threads created depending on the system scale, and make the limit configurable (256 by default?); a rough sketch follows this list
      • Fix the most significant issues where the thread pool gets misused or blocked
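
        A minimal sketch of what a bounded but still downscaling replacement could look like. The property name and the 256 default are illustrative only; nothing here is decided in the Jenkins core:

        import java.util.concurrent.ExecutorService;
        import java.util.concurrent.LinkedBlockingQueue;
        import java.util.concurrent.ThreadFactory;
        import java.util.concurrent.ThreadPoolExecutor;
        import java.util.concurrent.TimeUnit;
        import java.util.concurrent.atomic.AtomicInteger;

        public class BoundedRemotingPool {
            // Hypothetical property name, used here only for illustration
            private static final int MAX_THREADS =
                    Integer.getInteger("hudson.model.Computer.threadPoolMaxSize", 256);

            public static ExecutorService create() {
                ThreadPoolExecutor executor = new ThreadPoolExecutor(
                        MAX_THREADS, MAX_THREADS,            // hard cap on the number of threads
                        60L, TimeUnit.SECONDS,               // idle threads still time out after 60s...
                        new LinkedBlockingQueue<Runnable>(), // ...and excess tasks queue up instead of spawning threads
                        new NamingThreadFactory());
                executor.allowCoreThreadTimeOut(true);       // keep the downscaling behavior of the cached pool
                return executor;
            }

            // Minimal daemon-thread factory mimicking the "Computer.threadPoolForRemoting" naming
            private static class NamingThreadFactory implements ThreadFactory {
                private final AtomicInteger counter = new AtomicInteger();
                @Override
                public Thread newThread(Runnable r) {
                    Thread t = new Thread(r, "Computer.threadPoolForRemoting [#" + counter.incrementAndGet() + "]");
                    t.setDaemon(true);
                    return t;
                }
            }
        }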
         
        Original report (tracked as JENKINS-47012):

      > After some period of time the Jenkins master will have up to ten thousand or so threads, most of which are Computer.threadPoolForRemoting threads that have leaked. This forces us to restart the Jenkins master.

      > We do add and delete slave nodes frequently (thousands per day per master) which I think may be part of the problem.

      > I thought https://github.com/jenkinsci/ssh-slaves-plugin/commit/b5f26ae3c685496ba942a7c18fc9659167293e43 might be the fix because stack traces indicated threads are hanging in the plugin's afterDisconnect() method. I have updated half of our Jenkins masters to ssh-slaves plugin version 1.9, which includes that change, but earlier today we had a master with the ssh-slaves plugin fall over from this issue.

      > Unfortunately I don't have any stacktraces handy (we had to force reboot the master today), but will update this bug if we get another case of this problem. Hoping that by filing it with as much info as I can we can at least start to diagnose the problem.

        1. 20150904-jenkins03.txt
          2.08 MB
        2. file-leak-detector.log
          41 kB
        3. Jenkins_Dump_2017-06-12-10-52.zip
          1.58 MB
        4. jenkins_watchdog_report.txt
          267 kB
        5. jenkins_watchdog.sh
          2 kB
        6. jenkins02-thread-dump.txt
          1.49 MB
        7. support_2015-08-04_14.10.32.zip
          2.17 MB
        8. support_2016-06-29_13.17.36 (2).zip
          3.90 MB
        9. thread-dump.txt
          5.48 MB

          [JENKINS-27514] Core - Thread spikes in Computer.threadPoolForRemoting leading to eventual server OOM

          Oleg Nenashev added a comment -

          nurupo Yes, the Digital Ocean issue seems to be unrelated to the one originally reported here. Please create a separate ticket for that plugin.


          nurupo added a comment -

          Are you sure that the original author and I are not experiencing the same issue? It might be an issue in Jenkins itself that we are experiencing, not an issue in the SSH Launcher plugin. When I googled the issue of too many "Computer.threadPoolForRemoting" threads, this is the bug report that came up, and it matches my circumstances quite closely: lots of slaves being frequently created and destroyed results in many "Computer.threadPoolForRemoting" threads being created and locked, which eventually exhausts the machine's/JVM's resources and kills Jenkins.


          Oleg Nenashev added a comment -

          > Are you sure that both the original author and me are not experiencing the same issue?

          Well, the Digital Ocean issue requires extra investigation by the plugin maintainer. This one was originally reported against the SSH Slaves Plugin, and the deadlock cause definitely comes from the SSH Launcher implementation.

          Even if the issues share a common part (Core API flaws), the fixes will be different.

           


          Micheal Waltz added a comment - edited

          oleg_nenashev do you think that this issue may have been addressed in any of the recent Remoting Updates?

          https://github.com/jenkinsci/remoting/blob/master/CHANGELOG.md#311

          We're running Jenkins 2.74 with 8 agents using the Swarm agent v3.4, and have to reboot our Jenkins master once a week when CPU gets too high.

          It seems to be related to the hundreds of Computer.threadPoolForRemoting messages that pile up when viewing the Jenkins monitor. Here's a screenshot of what the Jenkins Monitoring Threads page shows:

          https://keybase.pub/ecliptik/jenkins/Screen%20Shot%202017-08-28%20at%2012.58.54%20PM.png


          Oleg Nenashev added a comment -

          ecliptik I checked the implementation in the Jenkins core, and it appears that the thread pool may still be growing without a limit on the master side. Implementation in Executors:

           

          /**
           * Creates a thread pool that creates new threads as needed, but
           * will reuse previously constructed threads when they are
           * available, and uses the provided
           * ThreadFactory to create new threads when needed.
           * @param threadFactory the factory to use when creating new threads
           * @return the newly created thread pool
           * @throws NullPointerException if threadFactory is null
           */
          public static ExecutorService newCachedThreadPool(ThreadFactory threadFactory) {
              return new ThreadPoolExecutor(0, Integer.MAX_VALUE,
                                            60L, TimeUnit.SECONDS,
                                            new SynchronousQueue<Runnable>(),
                                            threadFactory);
          }

           

          Computer.threadPoolForRemoting is being used in multiple places, and IMHO we should at least make the limit configurable. Maybe we should also set a reasonable default limit. I will check if there is a ThreadPoolExecutor variant with downscale support.
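
          For illustration, a minimal standalone demo (not Jenkins code; the class name and numbers are made up) of why this implementation grows by one thread per task during a burst of blocked tasks: the SynchronousQueue never holds work, so any task that cannot be handed off to an idle thread forces a new one.

          import java.util.concurrent.CountDownLatch;
          import java.util.concurrent.Executors;
          import java.util.concurrent.ThreadPoolExecutor;
          import java.util.concurrent.TimeUnit;

          public class CachedPoolBurstDemo {
              public static void main(String[] args) throws Exception {
                  // newCachedThreadPool() returns a ThreadPoolExecutor, so the cast is safe here
                  ThreadPoolExecutor pool = (ThreadPoolExecutor) Executors.newCachedThreadPool();
                  final CountDownLatch release = new CountDownLatch(1);

                  // Simulate a burst of tasks that all block (e.g. hung afterDisconnect() calls)
                  int burst = 500;
                  for (int i = 0; i < burst; i++) {
                      pool.submit(new Runnable() {
                          public void run() {
                              try {
                                  release.await(); // every task blocks, so no thread becomes reusable
                              } catch (InterruptedException e) {
                                  Thread.currentThread().interrupt();
                              }
                          }
                      });
                  }
                  System.out.println("Pool size after burst: " + pool.getPoolSize()); // 500 threads, one per task
                  release.countDown();
                  pool.shutdown();
                  pool.awaitTermination(1, TimeUnit.MINUTES);
              }
          }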


          Oleg Nenashev added a comment -

          Actually I am wrong. The cached thread pool implementation should be able to terminate threads if they are unused for more than 60 seconds. Probably the threads are being created so intensively that the executor service goes crazy. I will see if I can create logging for that, at least.
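
          One way such logging could look, sketched as a wrapper around the existing thread factory that warns when the number of live worker threads crosses a threshold (the threshold and logger are arbitrary, purely for illustration):

          import java.util.concurrent.ThreadFactory;
          import java.util.concurrent.atomic.AtomicInteger;
          import java.util.logging.Level;
          import java.util.logging.Logger;

          public class LoggingThreadFactory implements ThreadFactory {
              private static final Logger LOGGER = Logger.getLogger(LoggingThreadFactory.class.getName());
              private static final int WARN_THRESHOLD = 100; // arbitrary threshold for illustration

              private final ThreadFactory delegate;
              private final AtomicInteger liveThreads = new AtomicInteger();

              public LoggingThreadFactory(ThreadFactory delegate) {
                  this.delegate = delegate;
              }

              @Override
              public Thread newThread(final Runnable worker) {
                  return delegate.newThread(new Runnable() {
                      @Override
                      public void run() {
                          int count = liveThreads.incrementAndGet();
                          if (count > WARN_THRESHOLD) {
                              LOGGER.log(Level.WARNING, "Remoting pool has {0} live threads", count);
                          }
                          try {
                              worker.run(); // the pool's worker loop; exits when the thread is reclaimed
                          } finally {
                              liveThreads.decrementAndGet();
                          }
                      }
                  });
              }
          }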


          Kanstantsin Shautsou added a comment - edited

          stephenconnolly https://issues.jenkins-ci.org/browse/JENKINS-27514?focusedCommentId=306267&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-306267 ? I remember that connect/disconnect has always had issues in the UI, i.e. you try to disconnect, but it keeps trying to reconnect indefinitely. Maybe related?


          Micheal Waltz added a comment -

          This still appears to be an issue, but I figured out what was triggering high CPU and thousands of Computer.threadPoolForRemoting threads with our setup.

          Architecture:

          • 1 Ubuntu 16.04 Master running Jenkins v2.76
          • 6 Ubuntu 16.04 Agents via swarm plugin v3.4
          • Agents are connected to the master via an ELB since they are in multiple AWS regions

          There were a few jobs that used containers to mount a reports volume within the job workspace. The container would generate reports as root:root and they would appear within $WORKSPACE with these same permissions. The Jenkins agent runs as user jenkins and couldn't remove these files when it tried to clean up $WORKSPACE after each run.

          jenkins@ip-10-0-0-5:~/workspace/automated-matador-pull_requests_ws-cleanup_1504807099908$ ls -l coverage/
          total 6352
          drwxr-xr-x 3 root root    4096 Sep  7 15:51 assets
          -rw-r--r-- 1 root root 6498213 Sep  7 15:51 index.htm
          

          The jobs that wrote these reports ran regularly, on every push and pull request to a repository, causing the files to build up quickly. On the master, thousands of files named atomic*.tmp would start to appear in /var/lib/jenkins

          ubuntu@jenkins:/var/lib/jenkins$ ls atomic*.tmp | wc -l
          6521
          

          and each file would contain hundreds of lines like,

          <detailMessage>Unable to delete &apos;/var/lib/jenkins/workspace/automated-matador-develop-build-on-push_ws-cleanup_1504728487305/coverage/.resultset.json.lock&apos;. Tried 3 times (of a maximum of 3) waiting 0.1 sec between attempts.</detailMessage>
          

          Eventually the Computer.threadPoolForRemoting errors would reach into the thousands and the master CPU would hit 100%. A reboot would temporarily fix it, but CPU would jump again until all the /var/lib/jenkins/atomic*.tmp files were removed.

          We resolved the issue by doing a chown jenkins:jenkins on the report directories created by the containers in a job, so there are no longer "Unable to delete" errors or atomic*.tmp files created. We haven't seen a CPU or Computer.threadPoolForRemoting spike in the two weeks since doing this.

          Hopefully this helps anyone else who may be experiencing this issue, and maybe provides some guidance on its root cause.


          Oleg Nenashev added a comment -

          ecliptik Ideally I need a stack trace to confirm what causes it, but I am pretty sure it happens due to the workspace cleanup. Jenkins-initiated workspace cleanup happens in bursts and it definitely uses the Remoting thread pool, so it may cause such behavior.

          Regarding this ticket, I am going to convert it to an EPIC. The Remoting thread pool is a shared resource in the system, and it may be consumed by various things. I ask everybody to re-report their cases under the EPIC.


          Oleg Nenashev added a comment -

          The original ticket has been cross-posted as JENKINS-47012. In this EPIC I will be handling only issues related to the Jenkins core, the SSH Slaves Plugin, and Remoting. Issues related to Remoting thread pool usage and misuse by other plugins are separate.


            Assignee: Unassigned
            Reporter: Clark Boylan (cboylan)
            Votes: 13
            Watchers: 29