Jenkins / JENKINS-22938

SSH slave connections die after the slave outputs 4MB of stderr, usually during findbugs analysis




      == The current state of the bug ==

      Once an SSH slave has written ~4MB of data to stderr, further writes to stderr block. After about 5 minutes, a timeout mechanism kills the entire slave connection.

      == Why someone would have 4MB of stderr output ==

      1. Gradle creates strange findbugs.xml files that have a srcDir tag for every java source file.
      2. The Jenkins findbugs plugin, when analyzing generated findbugs.xml files, will sometimes spew the following errors. (It seems to happen when it finds bugs in a java file that also has an inner class.)

      WARNING: Can't resolve absolute file name for file SomeFile.java, dir list = [ <Every srcDir> ]

      3. Because of how gradle creates its findbugs.xml files, the error message contains a list of every single source file. For large projects, these errors are huge and there are a lot of them.

      I understand that this is a complicated thing to reproduce. It may be easier to reproduce by using the ssh-slaves code to launch a command that simulates a lot of stderr output.
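
      A minimal sketch of such a command, assuming any program that writes well past 4MB to stderr is enough to trigger the stall. The class name StderrFlood, the 8MB total, and the line text are all made up for illustration; nothing here comes from ssh-slaves or trilead-ssh.

```java
import java.io.PrintStream;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Jenkins: floods a stream with warning-like lines.
class StderrFlood {

    /** Writes whole lines to out until at least totalBytes have been written; returns the byte count. */
    static long flood(PrintStream out, long totalBytes) {
        byte[] line = "WARNING: simulated stderr noise to fill the SSH channel window\n"
                .getBytes(StandardCharsets.US_ASCII);
        long written = 0;
        while (written < totalBytes) {
            out.write(line, 0, line.length);
            written += line.length;
        }
        out.flush();
        return written;
    }

    public static void main(String[] args) {
        // 8MB is an arbitrary choice that sits comfortably above the ~4MB threshold.
        long written = flood(System.err, 8L * 1024 * 1024);
        System.out.println("wrote " + written + " bytes to stderr");
    }
}
```

      If the analysis below is right, launching this through an SSH slave should block partway through the loop, and the final line on stdout should never appear.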

      == My analysis of why this happens ==

      SSH supports multiple channels of communication within the same SSH connection. com.trilead.ssh2.channel.Channel.freeupWindow is a function that must be called once data has been read off the channel; it sends a message to the other side letting it know that you're ready for more data. If this function is never called, the other side will quit sending data and everything will stall. This is what is happening.
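
      The stall can be illustrated with a toy model of per-channel flow control. This is only a sketch: ToyChannelWindow and its methods are hypothetical names, the 4MB window is assumed from the observed threshold, and the real trilead-ssh code is considerably more involved. The adjust method plays the role that freeupWindow plays in the description above.

```java
// Toy model of an SSH channel window; not the trilead-ssh implementation.
class ToyChannelWindow {
    private long remoteWindow; // bytes the sender is still allowed to send

    ToyChannelWindow(long initialWindow) {
        this.remoteWindow = initialWindow;
    }

    /** Sender side: tries to send len bytes and returns how many fit in the window (0 means stalled). */
    long send(long len) {
        long n = Math.min(len, remoteWindow);
        remoteWindow -= n;
        return n;
    }

    /** Receiver side: reopens the window after consuming data (the freeupWindow role). */
    void adjust(long bytesConsumed) {
        remoteWindow += bytesConsumed;
    }

    /** Pushes total bytes through a channel and returns how many got through before a stall. */
    static long transfer(long total, long window, boolean receiverAdjusts) {
        ToyChannelWindow ch = new ToyChannelWindow(window);
        long sent = 0;
        while (sent < total) {
            long n = ch.send(Math.min(total - sent, 32 * 1024));
            if (n == 0) {
                break; // window exhausted and never reopened: the sender blocks here
            }
            sent += n;
            if (receiverAdjusts) {
                ch.adjust(n);
            }
        }
        return sent;
    }
}
```

      With a 4MB window and 8MB to send, transfer returns 4MB when the receiver never adjusts and the full 8MB when it does, which matches the symptom of writes blocking after ~4MB.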


      Older versions of ssh-slaves would use a thread to read the output of stderr. Newer versions, in an attempt to get rid of the thread, provide an OutputStream to the ssh session, so that the ssh session can write to it directly with no extra thread. When this newer method is used, Channel.freeupWindow can never get called. Currently the only way to call Channel.freeupWindow is to call ChannelInputStream.getChannelData. Since the new method never calls session.getStderr(), ssh-slaves has no way to reach getChannelData or freeupWindow.

      == Past incarnations of this bug ==


      There used to be a call to freeupWindow for the stderr Channel. But as far as I can tell, this never worked, since Channel.freeupWindow calls TransportManager.sendMessage, which ensures that the thread calling it is not the same as the receiving thread. Check the error message in JENKINS-18879.
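
      To make the conflict concrete, here is a toy model of a transport that refuses sends from its own receive thread. ToyTransport, its fields, and the IllegalStateException are all assumptions for illustration; the real TransportManager.sendMessage may do the check and report the error differently.

```java
// Toy model of a send guard; not the trilead-ssh TransportManager.
class ToyTransport {
    private Thread receiveThread; // hypothetical: the thread that reads incoming packets
    private int sentCount = 0;

    synchronized void markReceiveThread(Thread t) {
        receiveThread = t;
    }

    /** Rejects the send if the caller is the receive thread, otherwise counts it as sent. */
    synchronized void sendMessage(byte[] msg) {
        if (Thread.currentThread() == receiveThread) {
            throw new IllegalStateException("sendMessage called from the receive thread");
        }
        sentCount++;
    }

    synchronized int sentCount() {
        return sentCount;
    }

    /** Simulates a window adjust issued while handling incoming data; returns true if it was rejected. */
    static boolean rejectedOnReceiveThread() {
        ToyTransport t = new ToyTransport();
        t.markReceiveThread(Thread.currentThread());
        try {
            t.sendMessage(new byte[0]);
            return false;
        } catch (IllegalStateException e) {
            return true;
        }
    }
}
```

      When the ssh session writes stderr data straight into an OutputStream, that write happens while incoming data is being processed, so any window adjust issued at that point comes from the receive thread and is rejected in this model.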

      These bugs were supposedly fixed 7 months ago with this change:


      The author of that change believed that the call to freeupWindow was a duplicate and unnecessary. I believe that this was a mistake. It was not a duplicate, but it also never worked to begin with.

      == Solutions ==

      I think there are 2 solutions:

      1. Give up on getting rid of the thread that reads stderr data. Delete the code in trilead-ssh that provides a way to pipe output to an OutputStream, since it never worked to begin with. Delete the code in ssh-slaves that attempts to use those methods.

      2. Continue on the path to get rid of the thread. Revert the change made 7 months ago. Eliminate the check in TransportManager.sendMessage that ensures it doesn't get called from a receiving thread.




            stephenconnolly Stephen Connolly
            bvinc Brian Vincent