Not observed directly, but an OutOfMemoryError could kill an agent connection, because we would not reset the read ops if a throwable was thrown that was not a RuntimeException (i.e. any subclass of Error).

      The code should be defensive against this and terminate the connection so it can be re-established, rather than being left hung.
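      As a rough illustration of the shape of the fix (a minimal sketch only; the class and method names below are hypothetical and this is not the actual NIONetworkLayer code), the read handler should catch Throwable rather than only RuntimeException, so that an Error also tears the connection down instead of leaving it registered but never readable again:

          import java.io.IOException;
          import java.nio.channels.SelectionKey;

          // Hypothetical NIO read handler, for illustration only.
          final class DefensiveReadHandler {

              private final SelectionKey key;

              DefensiveReadHandler(SelectionKey key) {
                  this.key = key;
              }

              void onReadable() {
                  try {
                      processPendingData(); // may throw anything, including OutOfMemoryError
                      // Re-arm read interest only after the read completed normally.
                      key.interestOps(key.interestOps() | SelectionKey.OP_READ);
                  } catch (Throwable t) {
                      // Catching Throwable (not just RuntimeException) means an Error
                      // also terminates the connection, so it can be re-established,
                      // rather than leaving a channel that will never be read again.
                      key.cancel();
                      closeQuietly();
                  }
              }

              private void closeQuietly() {
                  try {
                      key.channel().close();
                  } catch (IOException ignored) {
                      // best effort; the connection is being torn down anyway
                  }
              }

              private void processPendingData() {
                  // placeholder for the real protocol read logic
              }
          }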

          [JENKINS-39835] Be super defensive in remoting read

          SCM/JIRA link daemon added a comment -

          Code changed in jenkins
          User: James Nord
          Path:
          src/main/java/org/jenkinsci/remoting/protocol/impl/NIONetworkLayer.java
          http://jenkins-ci.org/commit/remoting/ec9b5c13b879f44c04fa28ee6c8b113a165c9e57
          Log:
          Be extra defensive about Errors and Exceptions

          JENKINS-39835 Be even more defensive then against leaving connections dangling.


          James Nord added a comment -

          I believe this issue has now been observed on a live site.


          SCM/JIRA link daemon added a comment -

          Code changed in jenkins
          User: Oleg Nenashev
          Path:
          src/main/java/org/jenkinsci/remoting/protocol/impl/NIONetworkLayer.java
          http://jenkins-ci.org/commit/remoting/32674f6221cb93c7b5217231afc1b5fbec554d77
          Log:
          Merge pull request #133 from jenkinsci/jtnord-patch-1

          JENKINS-39835 - Be extra defensive about Errors and Exceptions

          Compare: https://github.com/jenkinsci/remoting/compare/b50beca9e888...32674f6221cb


          Matthew Mitchell added a comment -

          Alright, so I root-caused most of this. While there certainly are issues around the error handling, the errors we saw were all caused by memory. As we begin to run out of memory, the finally blocks that should zero out the channel object never get called. This causes a sort of cascading failure that manifests in a number of ways, including the error message above: the number of threads jumps, reflection starts to hang (Job DSL starts to fail), and so on.

          For my instance, the root cause was the workspace cleanup plugin + node recycling. This was keeping channel objects around forever in some cases, causing a slow leak.
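          To make the shape of that leak concrete, here is a minimal, purely hypothetical sketch (not the workspace cleanup plugin's actual code) of how a cache keyed by node name can pin channel objects forever when nodes are recycled under new names:

              import java.util.Map;
              import java.util.concurrent.ConcurrentHashMap;

              // Hypothetical retention leak, for illustration only.
              final class ChannelCache {

                  // Strong references keyed by node name. Recycled nodes come back under
                  // new names, so old entries are never overwritten or removed and the
                  // channel objects they reference can never be garbage collected.
                  private static final Map<String, Object> CHANNELS = new ConcurrentHashMap<>();

                  static void remember(String nodeName, Object channel) {
                      CHANNELS.put(nodeName, channel);
                  }

                  // Missing piece: nothing calls CHANNELS.remove(nodeName) when a node is
                  // taken offline, so heap usage grows slowly with every recycled node.
              }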

          I would first verify that memory isn't the cause of the failure. I do the following:

          Watch number of threads:

          watch -n1 'find /proc/<jenkins pid>/task -maxdepth 1 -type d -print | wc -l'

          Watch gc stats:

          jstat -gccause -t -h25 <pid> 10s

          If the number of threads starts to jump into the high thousands (depending on your heap setup), that's a good indication.
          jstat will eventually show allocation failures and a high number of full GCs.


          SCM/JIRA link daemon added a comment -

          Code changed in jenkins
          User: Oleg Nenashev
          Path:
          pom.xml
          http://jenkins-ci.org/commit/jenkins/7c2e1b2ece1770874eedd69cf20142aad4b491b9
          Log:
          [FIXED JENKINS-39835] - Update remoting to 3.4 (#2679)

