Jenkins / JENKINS-51057

EventDispatcher and ConcurrentLinkedQueue ate my JVM

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Component: sse-gateway-plugin

      We started running out of memory in our JVM (Xmx 8G) and when looking at Melody's memory (heap) histogram (JENKINS_URL/monitoring?part=heaphisto) the top two items were:


      Class Size (Kb) % size Instances % instances Source
      org.jenkinsci.plugins.ssegateway.sse.EventDispatcher$Retry 2,890,399 44 92,492,793 43  
      java.util.concurrent.ConcurrentLinkedQueue$Node 2,167,981 33 92,500,553 43  

      77% (and growing as we were researching the problem) of the memory was being used by these two items.

      I have two support bundles from this time and an .hprof as well.

      I can either screen share with someone or if you can tell me how to analyze these files I would be happy to.
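
      For reference, a similar class histogram can be captured from the Jenkins script console without the Monitoring plugin. This is only a minimal sketch and assumes a HotSpot JVM with the jmap tool available on the controller's PATH:

      // Script console sketch: print the top of the live-object class histogram,
      // roughly the same data as the Monitoring heap-histogram page above.
      def pid = java.lang.management.ManagementFactory.runtimeMXBean.name.split('@')[0]
      def histo = ['jmap', '-histo:live', pid].execute().text
      println histo.readLines().take(25).join('\n')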

          [JENKINS-51057] EventDispatcher and ConcurrentLinkedQueue ate my JVM

          Christian Höltje added a comment - - edited

          Our Jenkins server has been up 23 hours and we're already seeing large numbers of EventDispatcher objects:


           Class  Size (Kb)  % size  Instances  % instances  Source
          org.jenkinsci.plugins.ssegateway.sse.EventDispatcher$Retry 96,964 2 3,102,853 4  
          java.util.concurrent.ConcurrentLinkedQueue$Node 73,163 1 3,121,633 4  


          It isn't a problem (yet) but this is alarming.


          Our other Jenkins server has no references to EventDispatcher$Retry in the memory histogram, even when expanding details.


          Christian Höltje added a comment -

          In our logs, we're seeing messages like:

          May 02, 2018 1:04:41 PM WARNING org.jenkinsci.plugins.ssegateway.sse.EventDispatcher unsubscribe
          Invalid SSE unsubscribe configuration. No active subscription matching filter: 
          May 02, 2018 1:04:41 PM WARNING org.jenkinsci.plugins.ssegateway.sse.EventDispatcher unsubscribe
          Invalid SSE unsubscribe configuration. No active subscription matching filter: 
          May 02, 2018 1:04:41 PM WARNING org.jenkinsci.plugins.ssegateway.sse.EventDispatcher unsubscribe
          Invalid SSE unsubscribe configuration. No active subscription matching filter: 
          May 02, 2018 1:04:41 PM WARNING org.jenkinsci.plugins.ssegateway.sse.EventDispatcher unsubscribe
          Invalid SSE unsubscribe configuration. No active subscription matching filter: 

          Could that be related?



          Christian Höltje added a comment -

          And now it's at:

          Class  Size (Kb)  % size  Instances  % instances  Source
          org.jenkinsci.plugins.ssegateway.sse.EventDispatcher$Retry 105,174 3 3,365,570 4  
          java.util.concurrent.ConcurrentLinkedQueue$Node 79,051 2 3,372,854 4  


          Christian Höltje added a comment -

          There are a lot of changes/improvements in sse-gateway (mostly by rtyler) since 1.15 was released (back on Jan 16th, 2017).

          In the commit log, I see a commit titled "message leaks", which sounds interesting.

          I tried compiling the master branch and I get errors:

          $ docker run --rm -it -v maven-repo:/root/.m2 -v $PWD:/src:rw -w /src maven:3-jdk-8-alpine mvn verify 
          ...
          [ERROR] Failed to execute goal com.github.eirslett:frontend-maven-plugin:1.0:npm (npm install) on project sse-gateway: Failed to run task: 'npm install' failed. java.io.IOException: Cannot run program "/src/node/node" (in directory "/src"): error=2, No such file or directory -> [Help 1]
          



          Christian Höltje added a comment -

          I got further when using the non-alpine openjdk image. It still doesn't build, but that is due to tests failing.


          $ docker run --rm -it -v maven-repo:/root/.m2 -v $PWD:/src:rw -w /src maven:3-jdk-8 mvn verify
          ...
          Results :
          Failed tests:
          org.jenkinsci.plugins.ssegateway.EventHistoryStoreTest.test_autoDeleteOnExpire(org.jenkinsci.plugins.ssegateway.EventHistoryStoreTest)
           Run 1: EventHistoryStoreTest.test_autoDeleteOnExpire:119 expected:<100> but was:<50>
           Run 2: EventHistoryStoreTest.test_autoDeleteOnExpire:119 expected:<100> but was:<50>
           Run 3: EventHistoryStoreTest.test_autoDeleteOnExpire:119 expected:<100> but was:<50>
           Run 4: EventHistoryStoreTest.test_autoDeleteOnExpire:119 expected:<100> but was:<50>
           Run 5: EventHistoryStoreTest.test_autoDeleteOnExpire:119 expected:<100> but was:<50>
          org.jenkinsci.plugins.ssegateway.EventHistoryStoreTest.test_delete_stale_events(org.jenkinsci.plugins.ssegateway.EventHistoryStoreTest)
           Run 1: EventHistoryStoreTest.test_delete_stale_events:69
           Run 2: EventHistoryStoreTest.test_delete_stale_events:69
           Run 3: EventHistoryStoreTest.test_delete_stale_events:69
           Run 4: EventHistoryStoreTest.test_delete_stale_events:69
           Run 5: EventHistoryStoreTest.test_delete_stale_events:69
          
          Tests run: 12, Failures: 2, Errors: 0, Skipped: 0
          ...
          [ERROR] Failed to execute goal org.apache.maven.plugins:maven-surefire-plugin:2.19.1:test (default-test) on project sse-gateway: There are test failures.
          ...
          



          Christian Höltje added a comment -

          Some more events from our logs:


          May 03, 2018 1:36:00 PM FINE org.jenkinsci.plugins.ssegateway.sse.EventDispatcher doDispatch
          Error dispatching event to SSE channel. Write failed.
          May 03, 2018 1:36:36 PM FINE org.jenkinsci.plugins.ssegateway.sse.EventDispatcher doDispatch
          Error dispatching event to SSE channel. Write failed.
          May 03, 2018 1:36:39 PM FINE org.jenkinsci.plugins.ssegateway.sse.EventDispatcher processRetries
          Error dispatching retry event to SSE channel. Write failed. Dispatcher jenkins-blueocean-core-js-1525286577663-o9g92 (1006446612).
          May 03, 2018 1:36:39 PM FINE org.jenkinsci.plugins.ssegateway.sse.EventDispatcher processRetries
          Error dispatching retry event to SSE channel. Write failed. Dispatcher jenkins-blueocean-core-js-1525286577663-o9g92 (1006446612).
          May 03, 2018 1:36:39 PM FINE org.jenkinsci.plugins.ssegateway.sse.EventDispatcher processRetries
          Error dispatching retry event to SSE channel. Write failed. Dispatcher jenkins-blueocean-core-js-1525286577663-o9g92 (1006446612).


          Christian Höltje added a comment -

          This morning:

           Class  Size (Kb)  % size  Instances  % instances  Source
          org.jenkinsci.plugins.ssegateway.sse.EventDispatcher$Retry 283,979 5 9,087,330 10  
          java.util.concurrent.ConcurrentLinkedQueue$Node 214,241 4 9,140,981 10  


          Is there a way I can find out what's in the queues to track what's causing the Retry to happen?
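
          A rough script-console sketch for peeking into those queues (the retryQueue field and the session attribute name are taken from the workaround script posted further down in this thread):

          def retryQueueField = org.jenkinsci.plugins.ssegateway.sse.EventDispatcher.getDeclaredField('retryQueue')
          retryQueueField.setAccessible(true)
          def sessions = Jenkins.instance.servletContext.this$0._sessionHandler._sessionCache._sessions
          sessions.each { id, session ->
            def dispatchers = session.sessionData._attributes['org.jenkinsci.plugins.ssegateway.sse.EventDispatcher']
            dispatchers?.each { dispatcherId, dispatcher ->
              def queue = retryQueueField.get(dispatcher)
              // Each Retry carries a timestamp, so the head of the queue shows how long retries have been piling up.
              println "${dispatcherId}: ${queue.size()} queued retries" + (queue.peek() ? ", oldest from ${new Date(queue.peek().timestamp)}" : "")
            }
          }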


          Christian Höltje added a comment -

          Looking in $JENKINS_HOME/logs/sse-events/ I see 16 .json files in jobs and 108 .json files in pipeline, and they change frequently (e.g. a minute later they are 6 and 54).

          Is there another place I can look to find out why there are so many Retry objects?
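
          For keeping an eye on that over time, a small script-console sketch (the directory layout is assumed from the paths mentioned above) can snapshot the counts:

          // Count the .json event files per subdirectory of $JENKINS_HOME/logs/sse-events.
          def base = new File(Jenkins.instance.rootDir, 'logs/sse-events')
          if (base.directory) {
            base.eachDir { dir ->
              def count = dir.listFiles().count { it.name.endsWith('.json') }
              println "${dir.name}: ${count} .json files"
            }
          }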


          Christian Höltje added a comment -

          I managed to use jvisualvm to look at the .hprof from the initial comment, and it was our friends EventDispatcher$Retry and ConcurrentLinkedQueue$Node using up 90% of the memory.


          Christian Höltje added a comment -

          I finally got VisualVM working reasonably well (using lots of RAM and a RAM disk for temp files).

          I found that there are 0 instances of org.jenkinsci.plugins.ssegateway.EventHistoryStore and 0 instances of org.jenkinsci.plugins.ssegateway.sse.EventDispatcher.

          Is it possible that the Retry objects are not being reclaimed by GC?

          When I look at the objects in VisualVM, the Retry object has a reference (item) to a ConcurrentLinkedQueue$Node object.

          The Node object has a reference (item) to the Retry object plus a reference to the next Node in the queue.
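
          For what it's worth, that reference chain is just how ConcurrentLinkedQueue retains its contents; a standalone illustration (not plugin code) of why a queue that is never drained pins everything it holds:

          // Illustration only: a queue that is filled but never polled keeps every wrapping
          // Node, and therefore every queued item, strongly reachable, which matches the
          // Retry/Node pair growing in lock-step in the histograms above.
          def queue = new java.util.concurrent.ConcurrentLinkedQueue()
          100000.times { queue.add('stand-in for an EventDispatcher$Retry') }
          println queue.size()   // all 100000 items stay on the heap until the queue itself becomes unreachable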


          Christian Höltje added a comment -

          Since I didn't see this problem in Blue Ocean 1.4.2 but I'm seeing it in 1.5.0, I went looking for changes in Blue Ocean that may have triggered this problem...

          I found PR #1667 by vivek. It was to fix JENKINS-48820. I doubt it caused this problem, but maybe it made it worse?


          Wilfred Hughes added a comment - - edited

          We're seeing this problem with Blue Ocean 1.4.2 and Jenkins 2.89.4.


          Christian Höltje added a comment -

          The only workaround I found was to completely uninstall Blue Ocean and the pubsub plugins.


          rami stern added a comment -

          We think we might've found the root cause, we're checking it now on prod.

          https://github.com/jenkinsci/sse-gateway-plugin/pull/27


          Gustaf Lundh added a comment -

          We just had to restart a critical Jenkins master due to this memory leak, where org.jenkinsci.plugins.ssegateway.sse.EventDispatcher$Retry and java.util.concurrent.ConcurrentLinkedQueue$Node consumed 20GB of our 40GB heap. We need to get rid of blueocean due to this issue. The priority should be raised to critical IMHO.


          Jon Sten added a comment -

          As a temporary workaround, here's a script which will go through and look for old event dispatchers:

          // First off: due to Groovy and private fields in super classes, we need to
          // change the visibility of retryQueue so that we can use reflection instead...
          def retryQueueField = org.jenkinsci.plugins.ssegateway.sse.EventDispatcher.getDeclaredField('retryQueue')
          retryQueueField.setAccessible(true)
          
          def sessions = Jenkins.instance.servletContext.this$0._sessionHandler._sessionCache._sessions
          println("There are ${sessions.size()} sessions...")
          def numEventsRemoved = 0
          def numEventDispatchersPurged = 0
          sessions.each{id, session->
            def eventDispatchers = session.sessionData._attributes['org.jenkinsci.plugins.ssegateway.sse.EventDispatcher']
            if (eventDispatchers) {
              println("Session ${id} has a ssegateway EventDispatcher map...")
              eventDispatchers.findAll{k, v -> k.startsWith('jenkins-blueocean-core-js')}.each { dispatcherId, dispatcher ->
                def retryQueue = retryQueueField.get(dispatcher) // Need to use reflection since retryQueue is private in super class...
                if (retryQueue.isEmpty()) {
                  println("  Found one EventDispatcher, '${dispatcherId}', it has no retry events.")
                } else {
                  def oldestAge = (System.currentTimeMillis() - retryQueue.peek().timestamp)/1000
                  if (oldestAge > 300) {
                    println("  Found one EventDispatcher, '${dispatcherId}', its oldest retry event is ${oldestAge} seconds old, it contains ${retryQueue.size()} retry events, removing events and unsubscribing.")
                    numEventsRemoved += retryQueue.size()
                    numEventDispatchersPurged++
                    dispatcher.unsubscribeAll()
                    retryQueue.clear()
                  } else {
                    println("  Found one EventDispatcher, its oldest retry event is ${oldestAge} seconds old, so sparing it for now...")
                  }
                }
              }
            }
          }
          
          println("Removed ${numEventsRemoved} retry events from ${numEventDispatchersPurged} EventDispatchers!")
          

          Btw, I've verified this bug in a fresh Jenkins install (2.121.3) and with the latest version of Blue Ocean (1.8.2), which pulls in SSE Gateway (1.15).


          Maxfield Stewart added a comment -

          We're having this problem with Blue Ocean 1.4.2 and Jenkins 2.112. Out of an 18 GB heap, 16 GB is the GuavaPubSub object, which contains the concurrent queue and EventDispatchers. We ran jons's script and it cleaned up only 2.5 GB of Retry objects, but we still had 14 GB of references to EventDispatchers. It takes out our Jenkins server in about 14-15 days.

          I took a heap dump and am adding a screenshot of a MAT analyzer view of the memory consumption. I was hoping someone would accept the PR for the memory leak fix, which seems to have been outstanding for some time now.


          Giorgio Sironi added a comment -

          Experiencing the same problem; it takes ~30 days for Jenkins to build up to its VM memory limit. We are on SSE Gateway 1.15 too.


          Heap     Classes: 10,648,     Instances: 57,541,468,     Kilo-Bytes: 1,715,402

           Class  Size (Kb)  % size  Instances  % instances  Source
          org.jenkinsci.plugins.ssegateway.sse.EventDispatcher$Retry 800,661 46 25,621,179 44  
          java.util.concurrent.ConcurrentLinkedQueue$Node 600,538 35 25,622,966 44  
          char[] 125,236 7 1,298,894 2  
          java.lang.String 30,420 1 1,297,943 2  
          short[] 20,271 1 9,119 0  


          Mikkel S. Andersen added a comment -

          Bump. Same issue here on SSE Gateway 1.16, it seems.

          Any chance for a fix in the near future?

          Heap     Classes: 20,087,     Instances: 280,652,984,     Kilo-Bytes: 9,119,689

           Class  Size (Kb)  % size  Instances  % instances  Source
          org.jenkinsci.plugins.ssegateway.sse.EventDispatcher$Retry 3,383,758 37 108,280,267 38  
          java.util.concurrent.ConcurrentLinkedQueue$Node 2,538,166 27 108,295,118 38  
          char[] 1,075,470 11 9,044,530 3  
          java.lang.Object[] 205,241 2 5,771,490 2  
          int[] 182,430 2 1,633,983 0  
          byte[] 165,738 1 558,281 0  
          java.lang.String 157,701 1 6,728,615 2  
          java.util.HashMap$Node 99,299 1 3,177,591 1  


          Maxfield Stewart added a comment -

          This problem continues. It keeps taking production Jenkins offline, forcing restarts every 15-25 days. We're beginning to see if there's a way to deploy Jenkins with zero Blue Ocean elements, as it's not production ready, though the plugin seems to install by default. It's a shame there's been zero ownership of this for almost a year now.


          Álvaro Iradier added a comment -

          Just my 2 cents...

          jons, I tried your cleanup script with mixed results. It was able to empty retryQueues but we still had a big leak. After isolating the problem, I noticed that if the HTTP session from the user expires, that script is not able to find the leaking EventDispatchers. They are, however, still in the GuavaPubSubBus subscribers map, and can be seen with:


          import org.jenkinsci.plugins.pubsub.PubsubBus;
          this.bus = PubsubBus.getBus();
          this.bus.subscribers.each{ channelSubscriber, guavaSubscriber -> println channelSubscriber }
          println "Done"
          


          This outputs some lines containing:

          org.jenkinsci.plugins.ssegateway.sse.EventDispatcher$SSEChannelSubscriber@3472875b 

          which is the inner class that implements the ChannelSubscriber interface. I suspect that the memory is still leaking as it is being referenced through this subscriber.
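
          A quick follow-up sketch, building on the snippet above, to gauge how many of those subscribers the bus is still holding, grouped by class:

          import org.jenkinsci.plugins.pubsub.PubsubBus
          // Count bus subscribers by class; a large SSEChannelSubscriber count here means the
          // bus is still keeping that many dispatchers (and their retry queues) reachable.
          def counts = [:].withDefault { 0 }
          PubsubBus.getBus().subscribers.each { channelSubscriber, guavaSubscriber ->
            counts[channelSubscriber.getClass().getName()]++
          }
          counts.each { name, n -> println "${n}\t${name}" }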

          I saw there is already a PR https://github.com/jenkinsci/sse-gateway-plugin/pull/27 and a fork of the plugin https://github.com/taboola/sse-gateway-plugin

          The approach of this fix is to change the handling of retries so they are aborted and the queue is cleared after some amount of time or number of retries. However, I wonder if the problem is that the cleanup should be done when the HttpSession is destroyed. Currently, I can see in https://github.com/jenkinsci/sse-gateway-plugin/blob/master/src/main/java/org/jenkinsci/plugins/ssegateway/sse/EventDispatcher.java that the cleanup code for sessionDestroyed is:

              /**
               * Http session listener.
               */
              @Extension
              public static final class SSEHttpSessionListener extends HttpSessionListener {
                  @Override
                  public void sessionDestroyed(HttpSessionEvent httpSessionEvent) {
                      try {
                          Map<String, EventDispatcher> dispatchers = EventDispatcherFactory.getDispatchers(httpSessionEvent.getSession());
                          try {
                              for (EventDispatcher dispatcher : dispatchers.values()) {
                                  try {
                                      dispatcher.unsubscribeAll();
                                  } catch (Exception e) {
                                      LOGGER.log(Level.FINE, "Error during unsubscribeAll() for dispatcher " + dispatcher.getId() + ".", e);
                                  }
                              }
                          } finally {
                              dispatchers.clear();
                          }
                      } catch (Exception e) {
                          LOGGER.log(Level.FINE, "Error during session cleanup. The session has probably timed out.", e);
                      }
                  }
              }
          

          but although I can see that for every dispatcher there is a call to dispatcher.unsubscribeAll(), I am missing a call to retryQueue.clear(). But I am just guessing here...

          Anyone knowing the internals of the plugin can confirm this? I might try forking the plugin and testing a modified version.

          Regards.


          Álvaro Iradier added a comment - - edited

          After many tests and cleanup scripts, we were still able to find some EventDispatcher$Retry elements in the heap but no subscribers in the GuavaPubSubBus or EventDispatchers in the HttpSessions. Some additional verification in the plugin made me notice that, as AsyncEventDispatcher was being used, it is possible that the EventDispatchers were still being referenced by some asyncContexts or threads that were not completed or released. Finally, I also added a dispatcher.stop() call to my cleanup script, and it looks like there are no traces of leaked EventDispatcher$Retry classes in the heap anymore, but we are still observing our instance and analyzing heap dumps, so it is too soon to confirm.

          Just in case it can help, this is the cleanup script we are running daily:

          import org.jenkinsci.plugins.pubsub.PubsubBus;
          import org.jenkinsci.plugins.ssegateway.sse.*;
          
          def dryRun = false
          this.bus = PubsubBus.getBus();
          
          // change visibility of retryQueue so that we can use reflection instead...
          def retryQueueField = org.jenkinsci.plugins.ssegateway.sse.EventDispatcher.getDeclaredField('retryQueue')
          retryQueueField.setAccessible(true)
          
          def dispatcherCount = 0
          def dispatchersList = []
          
          //Build a list of EventDispatchers in all existing HTTP sessions
          println "DISPATCHERS IN HTTP SESSIONS"
          println "----------------------------"
          def sessions = Jenkins.instance.servletContext.this$0._sessionHandler._sessionCache._sessions
          sessions.each{id, session->
            def eventDispatchers = EventDispatcherFactory.getDispatchers(session)
            if (eventDispatchers) {
              eventDispatchers.each { dispatcherId, dispatcher ->
                dispatchersList.add(dispatcherId)
                def retryQueue = retryQueueField.get(dispatcher) // Need to use reflection since retryQueue is private in super class...
                if (retryQueue.peek() != null) {
                  def oldestAge = (System.currentTimeMillis() - retryQueue.peek().timestamp)/1000
                  println "Dispatcher: " + dispatcher.getClass().getName() + " - " + dispatcher.id + " with " + retryQueue.size() + " events, oldest is " + oldestAge + " seconds."      
                } else {
                  println "Dispatcher: " + dispatcher.getClass().getName() + " - " + dispatcher.id + " with no retryEvents"
                }      
              }
            }
          }
          
          println "There are " + dispatchersList.size() + " dispatchers in HTTP sessions"
          println ""
          
          //Find all subscribers in bus
          println "DISPATCHERS IN PUBSUBBUS"
          println "------------------------"
          this.bus.subscribers.any{ channelSubscriber, guavaSubscriber ->
            if (channelSubscriber.getClass().getName().equals('org.jenkinsci.plugins.ssegateway.sse.EventDispatcher$SSEChannelSubscriber')) {
              dispatcherCount++
              def dispatcher = channelSubscriber.this$0

              def retryQueue = retryQueueField.get(dispatcher) // Need to use reflection since retryQueue is private in super class...
              if (retryQueue.peek() != null) {
                def oldestAge = (System.currentTimeMillis() - retryQueue.peek().timestamp)/1000
                println "Dispatcher: " + dispatcher.id + " with " + retryQueue.size() + " events, oldest is " + oldestAge + " seconds."
                if (oldestAge > 300) {
                  println "  Clearing retryQueue with " + retryQueue.size() + " events"
                  if (!dryRun) {
                    retryQueue.clear()
                    dispatcher.unsubscribeAll()
                    try {
                      dispatcher.stop()
                    } catch (Exception ex) {
                      println "  !! Exception stopping AsyncDispatcher"
                    }
                  } else {
                    println "  Ignoring, dryrun"
                  }
                }
              } else {
                println "Dispatcher: " + dispatcher.id + " with no retryEvents"
              }

              if (dispatchersList.indexOf(dispatcher.id) < 0) {
                println "  Dispatcher is not in HTTP Sessions. Clearing"
                if (!dryRun) {
                  dispatcher.unsubscribeAll()
                  try {
                    dispatcher.stop()
                  } catch (Exception ex) {
                    println "  !! Exception stopping AsyncDispatcher"
                  }
                } else {
                  println "  Ignoring, dryrun"
                }
              }
            }
          }
          
          println ""
          println "Dispatchers in PubSubBus: " + dispatcherCount
          



          Jon Sten added a comment -

          airadier Thank you for the investigation. In our case I've only noticed EventDispatcher objects being kept alive by the HTTP sessions (for some unknown reason sessions aren't terminated, might be related to the LDAP plugin, not sure though). However, as you noted it seems like there are other places where the objects are kept referenced on the heap. So some kind of timeout of the dispatcher objects and sessions is really needed.


          Maxfield Stewart added a comment -

          airadier Thank you for the hard work; your script almost works on some of our servers. I had to modify the code to put an exception trap around "dispatcher.stop()", as in some cases that was blowing up on my server; after that it cleared memory.

          However, on another (older) version of Jenkins the above code explodes trying to access `def dispatcher = channelSubscriber.this$0`, as the reflection property fails and doesn't return the parent class (Dispatcher). Not sure what's up there, but we're upgrading that server soon and I'll try again.


          Álvaro Iradier added a comment -

          Hi maxfields2000, yes, I also got the exception after some tests in our production environment, and had to wrap it in try-catch. I just updated my comment with the latest version of the script we are running. Probably the dispatcher.stop() is not necessary, as I almost always get an exception, but it won't hurt. Also, I noticed that the "Dispatcher is not in HTTP Sessions. Clearing" part was not executed most of the time, so I copied the cleanup part (unsubscribeAll() and the .stop()) to the "Clearing retryQueue with..." part.

          After a couple of weeks running, I examined the heap dumps and saw no trace of any leak regarding the EventDispatcher$Retry elements, so for us it is working. Looking forward to a fix in the official upstream version of the plugin.


          Maxfield Stewart added a comment -

          airadier Confirmed, by the way, that your script needs a newer/newest version of the plugins, not entirely sure which ones. Of the 3-4 deployments I manage, there was 1 the script was not working for (couldn't get the parent dispatcher via this$0) until I moved to the latest versions. Tested on Blue Ocean 1.13 and 1.14, Jenkins v2.168. It doesn't seem to work on versions of Blue Ocean older than 1.10/Jenkins 2.147.


          Dan Mordechai added a comment - - edited

          airadier Thank you for this script! I will definitely try it as we are facing the same issue on our 2.107.2 Jenkins.

          Some things to point out:

          1. I had to change the line

               def dispatcher = channelSubscriber.this$0

             to

               def dispatcher = channelSubscriber.eventDispatcher

          2. Your script will only work for SSE Gateway plugin versions > 1.15, as in older versions the SSEChannelSubscriber inner class in EventDispatcher.java does not have the eventDispatcher member.


          Raihaan Shouhell added a comment -

          A fix got merged to master that definitely helped our masters.


          Olivier Lamy added a comment -

          Fixed with https://github.com/jenkinsci/sse-gateway-plugin/commit/9cfe9d9c4d9e284ee7d099d4e734b3677ee677f9

          Andreas Galek added a comment -

          Having installed the newest plugin version, 1.19, I was facing error messages like this:

          Aug 09, 2019 6:36:46 PM hudson.init.impl.InstallUncaughtExceptionHandler$DefaultUncaughtExceptionHandler uncaughtException
          SEVERE: A thread (EventDispatcher.retryProcessor/37063) died unexpectedly due to an uncaught exception, this may leave your Jenkins in a bad way and is usually indicative of a bug in the code.
          java.lang.OutOfMemoryError: unable to create new native thread
              at java.lang.Thread.start0(Native Method)
              at java.lang.Thread.start(Thread.java:717)
              at java.util.Timer.<init>(Timer.java:160)
              at org.jenkinsci.plugins.ssegateway.sse.EventDispatcher.scheduleRetryQueueProcessing(EventDispatcher.java:296)
              at org.jenkinsci.plugins.ssegateway.sse.EventDispatcher.processRetries(EventDispatcher.java:437)
              at org.jenkinsci.plugins.ssegateway.sse.EventDispatcher$1.run(EventDispatcher.java:299)
              at java.util.TimerThread.mainLoop(Timer.java:555)
              at java.util.TimerThread.run(Timer.java:505)
          Is it the same problem as the one you fixed in this Jira ticket?


          Olivier Lamy added a comment -

          andreasgk It is a side-effect bug of the fix; please read here: https://issues.jenkins-ci.org/browse/JENKINS-58684


            Assignee: olamy Olivier Lamy
            Reporter: docwhat Christian Höltje
            Votes: 16
            Watchers: 24
