Alright, so I root caused most of this. While there certainly are issues around the error handling, the errors we saw are all caused by memory pressure. As the JVM begins to run out of memory, the finally blocks that should zero out the channel object never get called. This causes a sort of cascading failure that manifests in a number of ways, including the error message above: the number of threads jumps, reflection starts to hang (Job DSL starts to fail), etc.
For my instance, the root cause was the workspace cleanup plugin + node recycling. This was keeping channel objects around forever in some cases, causing a slow leak.
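To make the leak mechanism concrete, here is a minimal, hypothetical sketch (not the actual Jenkins remoting code): cleanup that looks safe because it sits in a finally block can itself allocate (here, the string concatenation for the log line), so under heavy memory pressure a secondary OutOfMemoryError can abort it before the channel reference is cleared, leaving the object reachable.

import java.util.ArrayList;
import java.util.List;

// Hypothetical names; illustrates how a finally-based cleanup can be defeated
// under memory pressure, leaving a "channel" object referenced forever.
public class ChannelHolder {
    private Object channel = new Object(); // stand-in for a remoting Channel

    void runJob() {
        try {
            doWork();
        } finally {
            // This line allocates (string concatenation). Under memory pressure
            // it can throw a secondary OutOfMemoryError, skipping the line below.
            System.out.println("Cleaning up channel " + channel);
            channel = null; // never reached if the statement above throws
        }
    }

    private void doWork() {
        // Simulate the work that exhausts the heap in the first place.
        List<byte[]> hog = new ArrayList<>();
        while (true) {
            hog.add(new byte[1 << 20]);
        }
    }

    public static void main(String[] args) {
        try {
            new ChannelHolder().runJob();
        } catch (Throwable t) {
            // If the cleanup in finally also failed, this holder still
            // references the channel: one slow leak per failed job.
            System.err.println("Job failed with: " + t);
        }
    }
}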
I would first verify whether memory is in fact the cause of the failure. I do the following:
Watch number of threads:
watch -n1 'find /proc/<jenkins pid>/task -maxdepth 1 -type d -print | wc -l'
Watch GC stats:
jstat -gccause -t -h25 <pid> 10s
If the number of threads starts to jump into the high thousands (depending on your heap setup), that's a good indication.
jstat will eventually show allocation failures and a high number of full GCs.
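If you want a cross-check from inside the JVM rather than from the shell, the standard java.lang.management beans report the same signals. This is a generic sketch; for Jenkins, the statements in main() could be run from the script console (plain Java like this is also valid Groovy).

import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.lang.management.MemoryUsage;
import java.lang.management.ThreadMXBean;

// In-JVM cross-check of the same signals: live thread count and heap usage.
public class JvmPressureCheck {
    public static void main(String[] args) {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        MemoryMXBean memory = ManagementFactory.getMemoryMXBean();
        MemoryUsage heap = memory.getHeapMemoryUsage();

        System.out.println("Live threads: " + threads.getThreadCount()
                + " (peak " + threads.getPeakThreadCount() + ")");
        System.out.println("Heap used: " + (heap.getUsed() >> 20) + " MiB of "
                + (heap.getMax() >> 20) + " MiB max");
    }
}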
Code changed in jenkins
User: James Nord
Path:
src/main/java/org/jenkinsci/remoting/protocol/impl/NIONetworkLayer.java
http://jenkins-ci.org/commit/remoting/ec9b5c13b879f44c04fa28ee6c8b113a165c9e57
Log:
Be extra defensive about Errors and Exceptions
JENKINS-39835 Be even more defensive then against leaving connections dangling.
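For context, the pattern the commit message describes is roughly the following, sketched here with hypothetical names rather than the actual NIONetworkLayer code: catch Throwable rather than just IOException, close the connection before rethrowing, so an Error such as OutOfMemoryError cannot leave the socket dangling.

import java.io.Closeable;
import java.io.IOException;

// Illustrative only: a read loop that is defensive about Errors as well as
// Exceptions, so the connection is released even when something fatal is thrown.
public class DefensiveReader {
    private final Closeable connection;

    public DefensiveReader(Closeable connection) {
        this.connection = connection;
    }

    void pump(Runnable readLoop) throws IOException {
        try {
            readLoop.run();
        } catch (Throwable t) {          // Errors included, not just Exceptions
            closeQuietly();              // make sure the connection is released
            if (t instanceof Error) {
                throw (Error) t;         // rethrow fatal errors after cleanup
            }
            throw new IOException("read loop failed", t);
        }
    }

    private void closeQuietly() {
        try {
            connection.close();
        } catch (IOException ignored) {
            // Best effort: never mask the original failure during cleanup.
        }
    }
}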