JENKINS-3922

Slave is slow copying maven artifacts to master

    • Type: Bug
    • Resolution: Duplicate
    • Priority: Critical
    • Component: remoting
    • Platform: All, OS: All

      The artifact transfer is currently a 3-4x penalty for the project that I am
      working on. I have reproduced the issue with a simple test pom that does
      nothing but jar hudson.war. I performed this test in a heterogeneous
      environment: both master and slave are running Fedora 10, but the master is a
      faster machine. Still, the numbers highlight the issue.

      Here are some stats (all stats are after caching dependencies in the local repos):
      Master build through Hudson: 19s
      Master build from command line (no Hudson): 9s
      Slave build through Hudson: 1m46s
      Slave build from command line (no Hudson): 16s

      To be fair, the command-line numbers should at least include the time for a
      straight scp of the artifact from slave to master. The two nodes share a
      100 Mbit switch:

      $ scp target/slow-rider-1.0.0-SNAPSHOT.jar master_node:
      slow-rider-1.0.0-SNAPSHOT.jar 100% 25MB 12.7MB/s 00:02

      Of course this example exaggerates the issue to make it clearer, but not by
      much. I originally noticed this in a completely separate environment that
      was all virtual. I reproduced it on two physical machines using a different
      switch and different ethernet drivers (both virtual and physical). The
      reproducibility, plus the comparison against command line + scp, leads me to
      suspect eager flushing.
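
      As a rough illustration of the eager-flushing suspicion (hypothetical code,
      not taken from Hudson): if every small chunk is flushed as soon as it is
      written, each chunk pays its own network round trip, whereas a buffered
      stream coalesces them. Class name and buffer sizes below are made up.

      import java.io.BufferedOutputStream;
      import java.io.IOException;
      import java.io.InputStream;
      import java.io.OutputStream;

      // Illustrative only: copy a stream either flushing after every small chunk
      // ("eager") or letting BufferedOutputStream coalesce the writes. On a
      // high-latency link the eager variant is dramatically slower.
      public class FlushDemo {
          static void copy(InputStream in, OutputStream out, boolean eagerFlush)
                  throws IOException {
              OutputStream sink = eagerFlush ? out : new BufferedOutputStream(out, 64 * 1024);
              byte[] buf = new byte[512];      // deliberately small chunks
              int n;
              while ((n = in.read(buf)) != -1) {
                  sink.write(buf, 0, n);
                  if (eagerFlush) {
                      sink.flush();            // forces a tiny packet per chunk
                  }
              }
              sink.flush();
          }
      }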


          Vitalii Tymchyshyn added a comment -

          For me (FreeBSD), it is stuck in the following stack trace:

          "Channel reader thread: Channel to Maven [/usr/local/openjdk6//bin/java, -cp, /usr/home/builder/jenkins/builder/maven-agent.jar:/usr/home/builder/jenkins/builder/classworlds.jar, hudson.maven.agent.Main, /usr/local/share/java/maven2, /usr/home/builder/jenkins/builder/slave.jar, /usr/home/builder/jenkins/builder/maven-interceptor.jar, 40088, /usr/home/builder/jenkins/builder/maven2.1-interceptor.jar] / waiting for hudson.remoting.Channel@77cd18d:builder" prio=5 tid=0x0000000851778000 nid=0x84e460740 in Object.wait() [0x00007ffffa9ac000..0x00007ffffa9ac920]
          java.lang.Thread.State: TIMED_WAITING (on object monitor)
            at java.lang.Object.wait(Native Method)
            at hudson.remoting.Request.call(Request.java:127)
            - locked <0x000000083eb71ca8> (a hudson.remoting.ProxyInputStream$Chunk)
            at hudson.remoting.ProxyInputStream._read(ProxyInputStream.java:74)
            - locked <0x000000081ae366b8> (a hudson.remoting.ProxyInputStream)
            at hudson.remoting.ProxyInputStream.read(ProxyInputStream.java:80)
            at hudson.remoting.RemoteInputStream.read(RemoteInputStream.java:91)
            at java.io.BufferedInputStream.read1(BufferedInputStream.java:256)
            at java.io.BufferedInputStream.read(BufferedInputStream.java:317)
            - locked <0x000000081ae36668> (a java.io.BufferedInputStream)
            at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
            at java.io.BufferedInputStream.read1(BufferedInputStream.java:258)
            at java.io.BufferedInputStream.read(BufferedInputStream.java:317)
            - locked <0x000000081ae36638> (a java.io.BufferedInputStream)
            at java.io.ObjectInputStream$PeekInputStream.read(ObjectInputStream.java:2264)
            at java.io.ObjectInputStream$BlockDataInputStream.read(ObjectInputStream.java:2666)
            at java.io.ObjectInputStream$BlockDataInputStream.readFully(ObjectInputStream.java:2696)
            at java.io.ObjectInputStream.readArray(ObjectInputStream.java:1648)
            at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1323)
            at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1945)
            at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1869)
            at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1753)
            at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1329)
            at java.io.ObjectInputStream.readObject(ObjectInputStream.java:351)
            at hudson.remoting.Channel$ReaderThread.run(Channel.java:1087)

          Looking at the code, there is a comment here:
          // I don't know exactly when this can happen, as pendingCalls are cleaned up by Channel,
          // but in production I've observed that in rare occasion it can block forever, even after a channel
          // is gone. So be defensive against that.
          wait(30*1000);

          It seems it is time to find out when this actually occurs.


          Vitalii Tymchyshyn added a comment -

          BTW: Something is brain-damaging in this stack trace. It is the channel reader thread that is executing. It sends a read request (hudson.remoting.ProxyInputStream.Chunk) to the remote side and waits for an answer, but that answer should be read by exactly the thread that is waiting (unless there are two channels and two threads).


          Vitalii Tymchyshyn added a comment -

          OK, it seems I have tracked this down. The problem is in SSH connection buffering. The fix is in https://github.com/jenkinsci/ssh-slaves-plugin/pull/4


          SCM/JIRA link daemon added a comment -

          Code changed in jenkins
          User: Seiji Sogabe
          Path:
          src/main/java/hudson/plugins/sshslaves/SSHLauncher.java
          http://jenkins-ci.org/commit/ssh-slaves-plugin/aa61fad787d7c49d5c5b417d6e38371ffa7e6397
          Log:
          Merge pull request #4 from tivv/master

          A fix for https://issues.jenkins-ci.org/browse/JENKINS-3922

          Compare: https://github.com/jenkinsci/ssh-slaves-plugin/compare/2c0afb6...aa61fad


          Boris Granveaud added a comment -

          I don't see the fix in the latest released versions?


          Jimmi Dyson added a comment -

          The fix is present, but doesn't actually seem to fix the problem. Rather it just improves the speed a bit, but it is still very slow compared to native SSH.

          I have done some tests & this seems to be due to the library that Jenkins uses for SSH - org.jvnet.hudson:trilead-ssh2:build212-hudson-5. Simple tests show really slow SFTPing. I've experimented with JSch & get comparable speeds against native SFTP. I'm working on porting the SSH slaves plugin to use JSch (which is released under a BSD-style license - any compatibility issues there?).

          One drawback of the JSch library is the lack of Putty key support. I don't know if this is such a big deal as users can always convert Putty keys to OpenSSH format keys using puttygen?

          I notice that the org.jvnet.hudson:trilead-ssh2:build212-hudson-5 dependency comes as transitive from jenkins-core. Should the SSH library actually be a part of core dependencies?
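
          For context, a minimal JSch-based SFTP download of the kind such a comparison
          test could use might look like the sketch below. Host, user, key path and file
          paths are placeholders, not values from this issue.

          import com.jcraft.jsch.ChannelSftp;
          import com.jcraft.jsch.JSch;
          import com.jcraft.jsch.Session;

          // Illustrative JSch SFTP download, timed so it can be compared against
          // native sftp/scp of the same file. All connection details are placeholders.
          public class SftpSpeedTest {
              public static void main(String[] args) throws Exception {
                  JSch jsch = new JSch();
                  jsch.addIdentity("/home/jenkins/.ssh/id_rsa");     // key-based auth
                  Session session = jsch.getSession("builder", "slave-host", 22);
                  session.setConfig("StrictHostKeyChecking", "no");  // test setup only
                  session.connect();

                  ChannelSftp sftp = (ChannelSftp) session.openChannel("sftp");
                  sftp.connect();
                  long start = System.currentTimeMillis();
                  sftp.get("/var/jenkins/workspace/big-artifact.jar", "/tmp/big-artifact.jar");
                  System.out.println("transfer took " + (System.currentTimeMillis() - start) + " ms");

                  sftp.disconnect();
                  session.disconnect();
              }
          }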


          Jimmi Dyson added a comment -

          The fix slightly speeds it up, but it is still slow enough to extend build times considerably with big artifacts. I am experimenting with a different SSH library that shows good initial signs of speeding things up to near-native SSH speed.


          Jimmi Dyson added a comment - edited

          I've updated the ssh slaves plugin to use JSch & all works fine for connection, starting, running builds, disconnecting, etc. But it doesn't solve the SFTP speed issue... I now realise that it doesn't actually use SFTP for archiving artifacts back to the master - that is done through the FilePath abstraction I believe, although using the streams created by the SSHLauncher.

          So why is this slow? In our environment, a native SSH transfer of 100MB takes around 10 seconds. Jenkins archiving a 100MB artifact takes about 50 seconds using an SSH slave. Using a JSch-based SFTP client, 100MB is transferred in the same time as native (10 seconds).
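
          As a rough sketch of the archiving path described above, assuming it ultimately
          reduces to a FilePath copy over the remoting channel rather than SFTP (method
          and variable names below are illustrative):

          import hudson.FilePath;
          import hudson.remoting.VirtualChannel;

          // Illustrative only: archiving copies a remote workspace file to the master
          // through the remoting channel, so its throughput is bounded by the channel's
          // flow-control window rather than by SFTP.
          class ArchiveSketch {
              static void archive(VirtualChannel slaveChannel, String remotePath,
                                  FilePath masterArchiveDir) throws Exception {
                  FilePath remoteFile = new FilePath(slaveChannel, remotePath);
                  remoteFile.copyTo(masterArchiveDir.child(remoteFile.getName()));
              }
          }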


          David Reiss added a comment -

          This issue affected us in a big way once we moved our slaves to a remote datacenter. From the descriptions, it seems like not everyone has the same problem that we did, but I'll explain how we fixed it.

          Diagnostics

          • Make sure you can log into your master and run "scp slave:somefile ." and get the bandwidth that you expect. If not, Jenkins is not your problem. Check out http://wwwx.cs.unc.edu/~sparkst/howto/network_tuning.php if you are on a high-latency link.
          • Compute your bandwidth-delay product. This is the bandwidth you get from a raw scp in bytes per second times the round-trip time you get from ping. In my case, this was about 4,000,000 (4 MB/s) * 0.06 (60ms) = 240,000 bytes (240 kB). (A worked version of this arithmetic follows the list.)
          • If you are using ssh slaves and your BDP is greater than 16 kB, you are definitely having the same problem that we were. This is the trilead ssh window problem.
          • If you are using any type of slave and your BDP is greater than 128 kB, then you are also affected by the jenkins remoting pipe window problem.
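
          A worked version of the bandwidth-delay-product arithmetic from the list above,
          using the numbers quoted in this comment:

          // Bandwidth-delay product: the bytes that must be in flight to keep the link busy.
          public class BdpCalc {
              public static void main(String[] args) {
                  double bandwidthBytesPerSec = 4_000_000; // ~4 MB/s measured with raw scp
                  double roundTripSec = 0.06;              // 60 ms from ping
                  double bdp = bandwidthBytesPerSec * roundTripSec;
                  System.out.printf("BDP = %.0f bytes (~%.0f kB)%n", bdp, bdp / 1000);
                  // 240,000 bytes here: well above both the 16 kB effective trilead window
                  // and the 128 kB remoting pipe window discussed below.
              }
          }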

          trilead ssh window problem

          The ssh-slaves-plugin uses the trilead ssh library to connect to the slaves. Unfortunately, that library uses a hard-coded 30,000-byte receive buffer, which limits the amount of in-flight data to 30,000 bytes. In practice, the algorithm it uses for updating its receive window rounds that down to a power of two, so you only get 16kB.
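
          To make the rounding concrete, assuming the window update effectively rounds
          down to the nearest power of two as described above (hypothetical illustration,
          not the trilead code itself):

          // Why a 30,000-byte window behaves like 16 kB under power-of-two rounding.
          public class WindowRounding {
              public static void main(String[] args) {
                  int configured = 30_000;
                  int effective = Integer.highestOneBit(configured); // largest power of two <= 30,000
                  System.out.println("configured = " + configured + ", effective = " + effective); // 16384
              }
          }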

          I created a pull request at https://github.com/jenkinsci/trilead-ssh2/pull/1 to make this configurable at JVM startup time. Making this window large increased our bandwidth by a factor of almost 8. Note that two of these buffers are allocated for each slave, so turning this up can consume memory quickly if you have several slaves. In our case, we have memory to spare, so it wasn't a problem. It might be useful to switch to another ssh library that allocates window memory dynamically.

          Fixing this will get your BDP up to almost 128kB, but beyond that, you run into another problem.

          jenkins remoting pipe window problem

          The archiving process uses a hudson.remoting.Pipe object to send the data back. This object uses flow control to avoid overwhelming the receiver. By default, it only allows 128kB of in-flight data. There is already a system property that controls this constant, but it has a space in its name, which makes it a bit complicated to set. I created a pull request at https://github.com/jenkinsci/remoting/pull/4 to fix the name.

          Note that this property must be set on the slave's JVM, not the master's. To set it, go into your ssh slave configuration, click the Advanced button, find the "JVM Options" input, and enter "-Dclass\ hudson.remoting.Channel.pipeWindowSize=1234567" (no quotes; change the number to whatever is appropriate for your environment). If my pull request is accepted, this will change to "-Dhudson.remoting.Channel.pipeWindowSize=1234567". Note that this window is not preallocated, so you can make the number fairly large; excess memory will not be consumed unless the master is unable to keep up with data from the slave.
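
          For reference, a property like this is typically read on the slave side roughly
          as sketched below (illustrative only; the exact name handling and default live
          in the remoting source and the pull request above):

          // Illustrative sketch: read the pipe window size from a system property with a
          // 128 kB fallback. The slave JVM must be started with the -D option for the
          // larger window to take effect during archiving.
          public class PipeWindow {
              static final int PIPE_WINDOW_SIZE =
                      Integer.getInteger("hudson.remoting.Channel.pipeWindowSize", 128 * 1024);

              public static void main(String[] args) {
                  System.out.println("pipe window = " + PIPE_WINDOW_SIZE + " bytes");
              }
          }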

          Increasing both of these windows increased our bandwidth by a factor of about 15, matching the 4MB/s we were getting from raw scp.

          Good luck!


          Jesse Glick added a comment -

          Probably improved by JENKINS-7813 fixes.


            Assignee: Kohsuke Kawaguchi (kohsuke)
            Reporter: John McNair (pamdirac)
            Votes: 35
            Watchers: 34
