Uploaded image for project: 'Jenkins'
  1. Jenkins
  2. JENKINS-22938

SSH slave connections die after the slave outputs 4MB of stderr, usually during findbugs analysis

      == The current state of the bug ==

      After an SSH slave outputs ~4MB of data to stderr, eventually the writes to stderr block. After about 5 minutes, a timeout mechanism will kill the entire slave connection.

      == Why someone would have 4MB of stderr output ==

      1. Gradle creates strange findbugs.xml files that has a srcDir tag for every java source file.
      2. The Jenkins findbugs plugin, when analyzing generated findbugs.xml files, will sometimes spew the following errors. (It seems to happen when it finds bugs in a java file that also has an inner class.)

      WARNING: Can't resolve absolute file name for file SomeFile.java, dir list = [ <Every srcDir> ]

      3. Because of how gradle creates its findbugs.xml files, the error message contains a list of every single source file. For large projects, these errors are huge and there are a lot of them.

      I understand that this is a complicate thing to reproduce. Maybe it can be reproduced easier by using the ssh-slaves code to launch a command that simulates a lot of stderr output.

      == My analysis of why this happens ==

      SSH supports multiple channels of communication within the same SSH connection. com.trilead.ssh2.channel.Channel.freeupWindow is a function that must be called once the data is read off of the channel, which sends a message to the other side letting it know that you're ready for more data. If this function is never called, the other side will quit sending data and everything will stall. This is what is happening.

      ssh-slaves-plugin/src/main/java/hudson/plugins/sshslaves/SSHLauncher.java:892

      Older versions of ssh-slaves would use a thread to read the output of stderr. Newer versions, in an attempt to get rid of the thread, provide an OutputStream to the ssh session, so that the ssh session can write to it directly with no extra thread. When this newer method is used, there is no way that Channel.freeupWindow can ever get called. Currently the only way to call Channel.freeupWindow is to call ChannelInputStream.getChannelData. Since the new method never calls session.getStderr(), then there is no possible way for SSH slaves to call getChannelData or freeupWindow.

      == Past incarnations of this bug ==

      JENKINS-18836
      JENKINS-18879
      JENKINS-19619

      In the past, there used to be a call to freeupWindow for the stderr Channel. But as far as I can tell, this never worked, since Channel.freeupWindow calls TransportManager.sendMessage, which ensures that the thread calling it is not the same as the receiving thread. Check the error message in JENKINS-18879.

      These bugs were supposedly fixed 7 months ago with this change:

      https://github.com/jenkinsci/trilead-ssh2/commit/f1353cc0e0aa1b1e6bc845236e4a2530ea3103fd

      He believed that the call to freeupWindow was duplicate and unnecessary. I believe that this was a mistake. It was not duplicate, but it also never worked to begin with.

      == Solutions ==

      I think there are 2 solutions:

      1. Give up on getting rid of the thread that reads stderr data. Delete the code trilead-ssh that provides a way to pipe output to an OutputStream, since they never worked to begin with. Delete the code in ssh-slaves that attempts to use those methods.

      2. Continue on the path to get rid of the thread. Revert the change made 7 months ago. Eliminate the check in TransportManager.sendMessage that ensures it doesn't get called from a receiving thread.

          [JENKINS-22938] SSH slave connections die after the slave outputs 4MB of stderr, usually during findbugs analysis

          There is a 3rd path. Call freeupWindow, but from the async message queue which retains the purity of the receiver thread. Implemented in build217-jenkins-5

          Stephen Connolly added a comment - There is a 3rd path. Call freeupWindow, but from the async message queue which retains the purity of the receiver thread. Implemented in build217-jenkins-5

          Stephen Connolly added a comment - https://github.com/jenkinsci/trilead-ssh2/commit/5811ddd7ae15670a4f9ad345352613b3f2f2db97 https://github.com/jenkinsci/jenkins/commit/7c620e9fd321029e05f7cc994d3da1dde48411f7

          dogfood added a comment -

          Integrated in jenkins_main_trunk #3380
          [FIXED JENKINS-22938] SSH slave connections die after the slave outputs 4MB of stderr, usually during findbugs analysis (Revision 7c620e9fd321029e05f7cc994d3da1dde48411f7)

          Result = SUCCESS
          Stephen Connolly : 7c620e9fd321029e05f7cc994d3da1dde48411f7
          Files :

          • core/pom.xml
          • changelog.html

          dogfood added a comment - Integrated in jenkins_main_trunk #3380 [FIXED JENKINS-22938] SSH slave connections die after the slave outputs 4MB of stderr, usually during findbugs analysis (Revision 7c620e9fd321029e05f7cc994d3da1dde48411f7) Result = SUCCESS Stephen Connolly : 7c620e9fd321029e05f7cc994d3da1dde48411f7 Files : core/pom.xml changelog.html

          Code changed in jenkins
          User: Stephen Connolly
          Path:
          src/com/trilead/ssh2/channel/Channel.java
          http://jenkins-ci.org/commit/trilead-ssh2/5811ddd7ae15670a4f9ad345352613b3f2f2db97
          Log:
          JENKINS-22938 SSH slave connections die after the slave outputs 4MB of stderr, usually during findbugs analysis

          The fix for JENKINS-18836, JENKINS-18879, JENKINS-19619 was incorrect in its analysis.

          • There is no call to getChannelData() on the new code path, so thus you cannot have two calls of freeupWindow()
          • The problem with the original call to freeupWindow() is that it is on the receiver thread. You should not mix the responsibilities. Blocking the receiver thread to send a message will negatively impact performance and connection stability.
          • The correct solution is to push the freeupWindow onto the async queue thus the ACK gets sent and the purity of the receiving thread can be maintained.

          SCM/JIRA link daemon added a comment - Code changed in jenkins User: Stephen Connolly Path: src/com/trilead/ssh2/channel/Channel.java http://jenkins-ci.org/commit/trilead-ssh2/5811ddd7ae15670a4f9ad345352613b3f2f2db97 Log: JENKINS-22938 SSH slave connections die after the slave outputs 4MB of stderr, usually during findbugs analysis The fix for JENKINS-18836 , JENKINS-18879 , JENKINS-19619 was incorrect in its analysis. There is no call to getChannelData() on the new code path, so thus you cannot have two calls of freeupWindow() The problem with the original call to freeupWindow() is that it is on the receiver thread. You should not mix the responsibilities. Blocking the receiver thread to send a message will negatively impact performance and connection stability. The correct solution is to push the freeupWindow onto the async queue thus the ACK gets sent and the purity of the receiving thread can be maintained.

          Code changed in jenkins
          User: Stephen Connolly
          Path:
          changelog.html
          core/pom.xml
          http://jenkins-ci.org/commit/jenkins/7c620e9fd321029e05f7cc994d3da1dde48411f7
          Log:
          [FIXED JENKINS-22938] SSH slave connections die after the slave outputs 4MB of stderr, usually during findbugs analysis

          SCM/JIRA link daemon added a comment - Code changed in jenkins User: Stephen Connolly Path: changelog.html core/pom.xml http://jenkins-ci.org/commit/jenkins/7c620e9fd321029e05f7cc994d3da1dde48411f7 Log: [FIXED JENKINS-22938] SSH slave connections die after the slave outputs 4MB of stderr, usually during findbugs analysis

          boris ivan added a comment - - edited

          Regarding the "== Why someone would have 4MB of stderr output =="

          Some of us use Jenkins for things like running Maven builds which have a goal of "integration-test". In other words, there is very little 'build' in the traditional sense, and in fact the 'build' itself may not even be unit tests associated with a build of a product-under-test, but an entirely separate test project being built that interacts with something completely external. This generates reams and reams of output from testNG tests, etc, and there might be a valid reason to use stderr in some of that output. Couple that with potentially thousands of tests... Just adding this commentary in case it helps people think of scenarios like this. We use it in this fashion nightly.

          boris ivan added a comment - - edited Regarding the "== Why someone would have 4MB of stderr output ==" Some of us use Jenkins for things like running Maven builds which have a goal of "integration-test". In other words, there is very little 'build' in the traditional sense, and in fact the 'build' itself may not even be unit tests associated with a build of a product-under-test, but an entirely separate test project being built that interacts with something completely external. This generates reams and reams of output from testNG tests, etc, and there might be a valid reason to use stderr in some of that output. Couple that with potentially thousands of tests... Just adding this commentary in case it helps people think of scenarios like this. We use it in this fashion nightly.

          Code changed in jenkins
          User: Stephen Connolly
          Path:
          core/pom.xml
          http://jenkins-ci.org/commit/jenkins/0cfa00d1e7f0dd91d2aa0998e226c986636c3e7b
          Log:
          [FIXED JENKINS-22938] SSH slave connections die after the slave outputs 4MB of stderr, usually during findbugs analysis

          (cherry picked from commit 7c620e9fd321029e05f7cc994d3da1dde48411f7)

          Conflicts:
          changelog.html

          SCM/JIRA link daemon added a comment - Code changed in jenkins User: Stephen Connolly Path: core/pom.xml http://jenkins-ci.org/commit/jenkins/0cfa00d1e7f0dd91d2aa0998e226c986636c3e7b Log: [FIXED JENKINS-22938] SSH slave connections die after the slave outputs 4MB of stderr, usually during findbugs analysis (cherry picked from commit 7c620e9fd321029e05f7cc994d3da1dde48411f7) Conflicts: changelog.html

            stephenconnolly Stephen Connolly
            bvinc Brian Vincent
            Votes:
            0 Vote for this issue
            Watchers:
            7 Start watching this issue

              Created:
              Updated:
              Resolved: