Jenkins / JENKINS-22938

SSH slave connections die after the slave outputs 4MB of stderr, usually during findbugs analysis

      == The current state of the bug ==

      Once an SSH slave has written roughly 4MB of data to stderr, further writes to stderr block. About 5 minutes later, a timeout mechanism kills the entire slave connection.

      == Why someone would have 4MB of stderr output ==

      1. Gradle creates strange findbugs.xml files that have a srcDir tag for every Java source file.
      2. The Jenkins findbugs plugin, when analyzing generated findbugs.xml files, will sometimes spew the following errors. (It seems to happen when it finds bugs in a java file that also has an inner class.)

      WARNING: Can't resolve absolute file name for file SomeFile.java, dir list = [ <Every srcDir> ]

      3. Because of how Gradle creates its findbugs.xml files, each error message contains a list of every single source file. For large projects, these errors are huge and there are a lot of them.

      I understand that this is a complicated thing to reproduce. It may be easier to reproduce by using the ssh-slaves code to launch a command that simulates a lot of stderr output.
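      A command of that kind could look like the sketch below. It is only an illustration: the class name StderrFlood, the 8MB default, and the 1KB line length are all assumptions made for this example, not anything taken from ssh-slaves. It writes warning-like lines to stderr until the requested byte count is reached.

```java
import java.io.PrintStream;
import java.nio.charset.StandardCharsets;

/** Hypothetical reproduction helper: floods a stream with warning-like lines. */
class StderrFlood {

    /** Builds one ASCII line that is exactly lineLength bytes long, newline included. */
    static byte[] buildLine(int lineLength) {
        if (lineLength < 1) {
            throw new IllegalArgumentException("lineLength must be at least 1");
        }
        StringBuilder sb = new StringBuilder("WARNING: simulated stderr noise ");
        while (sb.length() < lineLength - 1) {
            sb.append('x');
        }
        sb.setLength(lineLength - 1); // truncate if the prefix alone is too long
        sb.append('\n');
        return sb.toString().getBytes(StandardCharsets.US_ASCII);
    }

    /** Writes whole lines until at least totalBytes have been written; returns the byte count. */
    static long flood(PrintStream out, long totalBytes, int lineLength) {
        byte[] line = buildLine(lineLength);
        long written = 0;
        while (written < totalBytes) {
            out.write(line, 0, line.length);
            written += line.length;
        }
        out.flush();
        return written;
    }

    public static void main(String[] args) {
        // Default to 8MB (an assumed value), comfortably above the ~4MB described in this report.
        long total = args.length > 0 ? Long.parseLong(args[0]) : 8L * 1024 * 1024;
        long written = flood(System.err, total, 1024);
        // If stderr writes block, as described above, this line is never reached.
        System.out.println("wrote " + written + " bytes to stderr");
    }
}
```

      Launched on the slave side of the connection, a program like this should block partway through if the stall described in this report occurs. Run locally, it simply finishes and prints the byte count.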

      == My analysis of why this happens ==

      SSH supports multiple channels of communication within the same SSH connection. com.trilead.ssh2.channel.Channel.freeupWindow is a function that must be called once the data is read off of the channel, which sends a message to the other side letting it know that you're ready for more data. If this function is never called, the other side will quit sending data and everything will stall. This is what is happening.
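      The stall can be illustrated with a toy model of per-channel flow control. This is not trilead-ssh2 code: ToyWindowChannel, its methods, and the 4MB window used in the usage note are assumptions chosen to mirror the symptom. The sender may only send while the window has room, and only the receiver can reopen the window.

```java
/** Toy model of SSH-style channel flow control; a sketch, not trilead-ssh2 code. */
class ToyWindowChannel {
    private final int initialWindow;
    private int remoteWindow; // bytes the sender is still allowed to send

    ToyWindowChannel(int initialWindow) {
        this.initialWindow = initialWindow;
        this.remoteWindow = initialWindow;
    }

    /** Sender side: accepts at most as many bytes as the window allows. */
    int send(int length) {
        int accepted = Math.min(length, remoteWindow);
        remoteWindow -= accepted;
        return accepted;
    }

    /** Receiver side: after consuming data, tell the sender it may send more. */
    void freeUpWindow(int consumed) {
        remoteWindow = Math.min(initialWindow, remoteWindow + consumed);
    }

    /**
     * Tries to push totalBytes through a fresh channel in chunks.
     * Returns how many bytes got through before the sender stalled or finished.
     */
    static long pump(int window, long totalBytes, int chunk, boolean receiverAdjusts) {
        ToyWindowChannel channel = new ToyWindowChannel(window);
        long sent = 0;
        while (sent < totalBytes) {
            int want = (int) Math.min(chunk, totalBytes - sent);
            int accepted = channel.send(want);
            if (accepted == 0) {
                break; // window exhausted and nobody reopened it: the sender is stuck
            }
            sent += accepted;
            if (receiverAdjusts) {
                channel.freeUpWindow(accepted);
            }
        }
        return sent;
    }
}
```

      With a 4MB window and 10MB to send, pump(4 * 1024 * 1024, 10L * 1024 * 1024, 32768, false) stops after exactly 4MB, while the same call with receiverAdjusts set to true delivers all 10MB. The real connection blocks instead of returning, but the cause is the same: the window is never reopened.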

      ssh-slaves-plugin/src/main/java/hudson/plugins/sshslaves/SSHLauncher.java:892

      Older versions of ssh-slaves would use a thread to read the output of stderr. Newer versions, in an attempt to get rid of the thread, provide an OutputStream to the ssh session, so that the ssh session can write to it directly with no extra thread. When this newer method is used, there is no way that Channel.freeupWindow can ever get called. Currently the only way to call Channel.freeupWindow is to call ChannelInputStream.getChannelData. Because the new method never calls session.getStderr(), SSH slaves have no way to reach getChannelData or freeupWindow.
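      For comparison, the older thread-based approach amounts to something like the following sketch. StreamDrainer and its method names are hypothetical, and this is not the code that ssh-slaves actually shipped. In the plugin, the input would presumably be the stream returned by session.getStderr(), so every read would go through the channel's input stream and give the library a chance to free up the window.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

/** Hypothetical sketch of a background thread that drains one stream into another. */
class StreamDrainer {

    /** Starts a daemon thread that copies everything from in to out until end of stream. */
    static Thread start(InputStream in, OutputStream out) {
        Thread t = new Thread(() -> {
            byte[] buf = new byte[8192];
            try {
                int n;
                while ((n = in.read(buf)) != -1) {
                    out.write(buf, 0, n);
                }
                out.flush();
            } catch (IOException e) {
                // Stream closed or connection lost: stop draining.
            }
        }, "stderr-drainer");
        t.setDaemon(true);
        t.start();
        return t;
    }

    /** Waits for the drainer to finish, preserving the interrupt flag if interrupted. */
    static void awaitQuietly(Thread t) {
        try {
            t.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```

      The cost of this design is one extra thread per connection, which is what the newer OutputStream-based code tried to avoid.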

      == Past incarnations of this bug ==

      JENKINS-18836
      JENKINS-18879
      JENKINS-19619

      In the past, there used to be a call to freeupWindow for the stderr Channel. But as far as I can tell, this never worked, since Channel.freeupWindow calls TransportManager.sendMessage, which ensures that the thread calling it is not the same as the receiving thread. Check the error message in JENKINS-18879.

      These bugs were supposedly fixed 7 months ago with this change:

      https://github.com/jenkinsci/trilead-ssh2/commit/f1353cc0e0aa1b1e6bc845236e4a2530ea3103fd

      The author of that commit believed that the call to freeupWindow was a duplicate and unnecessary. I believe that this was a mistake. It was not a duplicate, but it also never worked to begin with.

      == Solutions ==

      I think there are 2 solutions:

      1. Give up on getting rid of the thread that reads stderr data. Delete the code in trilead-ssh2 that provides a way to pipe output to an OutputStream, since it never worked to begin with. Delete the code in ssh-slaves that attempts to use those methods.

      2. Continue on the path to get rid of the thread. Revert the change made 7 months ago. Eliminate the check in TransportManager.sendMessage that ensures it doesn't get called from a receiving thread.

            stephenconnolly Stephen Connolly
            bvinc Brian Vincent