Type: Bug
Resolution: Duplicate
Priority: Critical
Labels: None
Platform: All, OS: All
The artifact transfer is currently a 3-4x penalty for the project that I am
working on. I have reproduced the issue with a simple test pom that does
nothing but jar up hudson.war. I performed this test in a heterogeneous
environment: both master and slave are running Fedora 10, but the master is a
faster machine. Still, it highlights the issue.
Here are some stats (all stats are after caching dependencies in the local repos):
Master build through Hudson: 19s
Master build from command line (no Hudson): 9s
Slave build through Hudson: 1m46s
Slave build from command line (no Hudson): 16s
To be fair, we should at least add the time to do a straight scp of the artifact from
slave to master. The two nodes share a 100 Mbit switch:
$ scp target/slow-rider-1.0.0-SNAPSHOT.jar master_node:
slow-rider-1.0.0-SNAPSHOT.jar 100% 25MB 12.7MB/s 00:02
Of course this example exaggerates the issue to make it clearer, but not by
too much. I originally noticed this in a completely separate environment that
was all virtual. I reproduced this on two physical machines using a different
switch and different ethernet drivers (both virtual and physical). The
reproducibility plus the comparison against command line + scp leads me to
suspect eager flushing.
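To illustrate what I mean by eager flushing, here is a generic Java sketch (not Jenkins/Hudson code) of copying through a buffered stream while flushing after every small write; this is the kind of pattern that would produce lots of tiny packets on the wire:
import java.io.BufferedOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

// Generic illustration only, not taken from the Hudson sources.
class EagerFlushExample {
    static void copy(InputStream in, OutputStream rawNetworkOut) throws IOException {
        OutputStream out = new BufferedOutputStream(rawNetworkOut, 8192);
        byte[] buf = new byte[256];
        int n;
        while ((n = in.read(buf)) != -1) {
            out.write(buf, 0, n);
            out.flush(); // flushing inside the loop pushes every 256-byte chunk out immediately,
                         // so the 8K buffer never coalesces writes and the wire sees many tiny packets
        }
        out.flush(); // moving the flush here (outside the loop) would let the buffer do its job
    }
}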
is related to:
- JENKINS-3799 Slave-to-master copies can be extremely slow (Resolved)
- JENKINS-7813 Archiving artifacts very slow (Resolved)
- JENKINS-7921 Archiving very slow between slave and master in unix driven by ssh (Resolved)
- JENKINS-3524 sending artifacts to master is very slow (Closed)
[JENKINS-3922] Slave is slow copying maven artifacts to master
Oops. I just noticed that I forgot to upgrade this environment. The above
stats were collected on 309. On 312 we have:
Master: 20s
Slave: 1m18s
There seems to be a definite improvement, but still a big penalty for the slave.
Would it be possible for you to run a packet capturing tool like Wireshark to
obtain the network packet dump between the master and the slave?
You want the ssh traffic? Is that helpful? Also, it is ~63MB for this build.
Is there a subset that you would want to see?
Just noting that this is still very definitely the case - and for what it's worth, I had the same speed problems both with the current MavenArtifact contents and with a test I did using FilePath.copyRecursiveTo instead of FilePath.copyTo.
We're seeing similar problems in the ASF Hudson environment. Archiving frequently makes up 90% of the build time for projects. Is there currently any work ongoing in this area? Is there anything we could assist with that would help in debugging the problems we're seeing?
I diagnosed this issue a bit and I see a stack trace being sent with every 8K data chunk for these transfers. There are some 12 packets being sent per 8K chunk of data.
I'm seeing this problem also. It turns a 10-minute build into a 60-minute build.
Hudson claims to have resolved this ticket in their system:
http://issues.hudson-ci.org/browse/HUDSON-3922
More details in a duplicate:
http://issues.hudson-ci.org/browse/HUDSON-7813
Fix is here:
https://github.com/hudson/hudson/commit/953af4eabc03be58abe8405a35090b4e5fd08933
The fix is to give users an option to disable compression altogether for remoting. Can we have something similar in Jenkins? I realize that there will be a class of use cases where good, working compression makes sense, but in a setup where all Jenkins nodes are on the same physical switch, compression rarely makes sense. So the deeper fix is probably to provide an option to turn off compression, perhaps per slave node, AND fix the compression performance on Linux. I'd be ecstatic to get the first part in the short term.
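To make the request concrete, something along these lines would do. This is a sketch only, with a made-up property name; FilePath.TarCompression with its GZIP/NONE values is the existing Jenkins enum mentioned later in this thread, and this is not the Hudson patch:
import hudson.FilePath.TarCompression;

// Sketch only: pick the tar compression used for slave-to-master copies from a flag,
// so nodes that share a switch can skip the gzip overhead.
// The property name "jenkins.remoting.disableCompression" is made up for this example.
class CompressionOption {
    static TarCompression forTransfer() {
        boolean disable = Boolean.getBoolean("jenkins.remoting.disableCompression");
        return disable ? TarCompression.NONE : TarCompression.GZIP;
    }
}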
While I haven't tested that fix, I am highly skeptical that it addresses this problem. From my debugging, the problem seemed to be excess network traffic and not the overhead of compression. Specifically, there were something like 10-12 packets per 8K chunk of data sent, there was a stack trace attached to every chunk (this last one seems like an easy fix), and who knows what else.
Just spend a little time stepping through the file transmission loop and watch that Wireshark output.
I'm open to considering multiple causes, but I'll add another data point or two. I changed the chunking size to multiple megabytes to limit the number of object serializations, and it had a very small impact. I hacked together an enhancement of the remoting API so that files could be exchanged in a single object serialization. Even testing with one big file showed only limited improvement. I initially had exactly the same suspicion that there were simply too many packets exchanged for the amount of data flowing. My own testing showed otherwise for my particular setup. I convinced myself that there were some small gains to be made along these lines, but the real problem was elsewhere. I never got to the bottom of it though. I don't have proof, but the testing done on Hudson and the explanation of their fix jibe with my experience.
Never mind. I changed TarCompression.GZIP to TarCompression.NONE and tested that. No difference. Back to square one.
We have the same issue in our environment. All our slaves are started over SSH. I have moved a single job to a JNLP slave and artifacts are copied much faster. Retrieving files from git and console output are also almost instant in comparison to the job running over SSH.
Do you think it is worth moving all jobs to JNLP, or will the congestion just move along with them once everything has been moved?
For me (FreeBSD), it is stuck in the following stack trace:
"Channel reader thread: Channel to Maven [/usr/local/openjdk6//bin/java, -cp, /usr/home/builder/jenkins/builder/maven-agent.jar:/usr/home/builder/jenkins/builder/classworlds.jar, hudson.maven.agent.Main, /usr/local/share/java/maven2, /usr/home/builder/jenkins/builder/slave.jar, /usr/home/builder/jenkins/builder/maven-interceptor.jar, 40088, /usr/home/builder/jenkins/builder/maven2.1-interceptor.jar] / waiting for hudson.remoting.Channel@77cd18d:builder" prio=5 tid=0x0000000851778000 nid=0x84e460740 in Object.wait() [0x00007ffffa9ac000..0x00007ffffa9ac920]
java.lang.Thread.State: TIMED_WAITING (on object monitor)
at java.lang.Object.wait(Native Method)
at hudson.remoting.Request.call(Request.java:127)
- locked <0x000000083eb71ca8> (a hudson.remoting.ProxyInputStream$Chunk)
at hudson.remoting.ProxyInputStream._read(ProxyInputStream.java:74)
- locked <0x000000081ae366b8> (a hudson.remoting.ProxyInputStream)
at hudson.remoting.ProxyInputStream.read(ProxyInputStream.java:80)
at hudson.remoting.RemoteInputStream.read(RemoteInputStream.java:91)
at java.io.BufferedInputStream.read1(BufferedInputStream.java:256)
at java.io.BufferedInputStream.read(BufferedInputStream.java:317)
- locked <0x000000081ae36668> (a java.io.BufferedInputStream)
at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
at java.io.BufferedInputStream.read1(BufferedInputStream.java:258)
at java.io.BufferedInputStream.read(BufferedInputStream.java:317)
- locked <0x000000081ae36638> (a java.io.BufferedInputStream)
at java.io.ObjectInputStream$PeekInputStream.read(ObjectInputStream.java:2264)
at java.io.ObjectInputStream$BlockDataInputStream.read(ObjectInputStream.java:2666)
at java.io.ObjectInputStream$BlockDataInputStream.readFully(ObjectInputStream.java:2696)
at java.io.ObjectInputStream.readArray(ObjectInputStream.java:1648)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1323)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1945)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1869)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1753)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1329)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:351)
at hudson.remoting.Channel$ReaderThread.run(Channel.java:1087)
Looking at the code, there is a comment here:
// I don't know exactly when this can happen, as pendingCalls are cleaned up by Channel,
// but in production I've observed that in rare occasion it can block forever, even after a channel
// is gone. So be defensive against that.
wait(30*1000);
It seems that it is time to figure out when this does occur.
BTW: Something is brain-damaging in this stack trace. It is the channel reader thread that is executing. It then sends a read request (hudson.remoting.ProxyInputStream.Chunk) to the remote side and waits for an answer. But the answer should be read by exactly this thread that is waiting (unless there are two channels and two threads).
OK, it seems I got to the bottom of this. The problem is in SSH connection buffering. The fix is in https://github.com/jenkinsci/ssh-slaves-plugin/pull/4
Code changed in jenkins
User: Seiji Sogabe
Path:
src/main/java/hudson/plugins/sshslaves/SSHLauncher.java
http://jenkins-ci.org/commit/ssh-slaves-plugin/aa61fad787d7c49d5c5b417d6e38371ffa7e6397
Log:
Merge pull request #4 from tivv/master
A fix for https://issues.jenkins-ci.org/browse/JENKINS-3922
Compare: https://github.com/jenkinsci/ssh-slaves-plugin/compare/2c0afb6...aa61fad
The fix is present, but doesn't actually seem to fix the problem. Rather, it just improves the speed a bit; it is still very slow compared to native SSH.
I have done some tests & this seems to be due to the library that Jenkins uses for SSH - org.jvnet.hudson:trilead-ssh2:build212-hudson-5. Simple tests show really slow SFTPing. I've experimented with JSch & get comparable speeds against native SFTP. I'm working on porting the SSH slaves plugin to use JSch (which is released under a BSD-style license - any compatibility issues there?).
One drawback of the JSch library is the lack of Putty key support. I don't know if this is such a big deal as users can always convert Putty keys to OpenSSH format keys using puttygen?
I notice that the org.jvnet.hudson:trilead-ssh2:build212-hudson-5 dependency comes as transitive from jenkins-core. Should the SSH library actually be a part of core dependencies?
The fix slightly speeds it up, but it is still slow enough to extend build times considerably with big artifacts. I'm experimenting with a different SSH library that shows good initial signs of speeding things up to near-native SSH speed.
I've updated the ssh slaves plugin to use JSch & all works fine for connection, starting, running builds, disconnecting, etc. But it doesn't solve the SFTP speed issue... I now realise that it doesn't actually use SFTP for archiving artifacts back to the master - that is done through the FilePath abstraction I believe, although using the streams created by the SSHLauncher.
So why is this slow? In our environment, doing native SSH transfers takes around 10 seconds for a 100MB transfer. Jenkins archiving a 100MB artifact takes about 50 seconds using an SSH slave. Using an SFTP client using JSch 100MB is transferred in the same time as native (10 seconds).
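For reference, my rough understanding of the copy path as a sketch (not the exact archiver code; the paths and channel variable are placeholders): the data goes through the FilePath abstraction over the remoting channel, so every byte travels over the streams set up by SSHLauncher rather than over a separate SFTP session.
import java.io.File;
import hudson.FilePath;
import hudson.remoting.VirtualChannel;

// Rough sketch of a channel-based copy; placeholder paths.
class ArchiveSketch {
    static void archive(VirtualChannel channelToSlave) throws Exception {
        FilePath onSlave = new FilePath(channelToSlave, "/home/builder/workspace/job/target/artifact.jar");
        FilePath onMaster = new FilePath(new File("/var/lib/jenkins/jobs/job/builds/1/archive/artifact.jar"));
        onSlave.copyTo(onMaster); // all data flows over the remoting channel, not over SFTP
    }
}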
This issue affected us in a big way once we moved our slaves to a remote datacenter. From the descriptions, it seems like not everyone has the same problem that we did, but I'll explain how we fixed it.
Diagnostics
- Make sure you can log into your master and run "scp slave:somefile ." and get the bandwidth that you expect. If not, Jenkins is not your problem. Check out http://wwwx.cs.unc.edu/~sparkst/howto/network_tuning.php if you are on a high-latency link.
- Compute your bandwidth-delay product (BDP). This is the bandwidth you get from a raw scp in bytes per second times the round-trip time you get from ping. In my case, this was about 4,000,000 bytes/s (4 MB/s) * 0.06 s (60 ms) = 240,000 bytes (240 kB); see the sketch after this list.
- If you are using ssh slaves and your BDP is greater than 16 kB, you are definitely having the same problem that we were. This is the trilead ssh window problem.
- If you are using any type of slave and your BDP is greater than 128 kB, then you are also affected by the jenkins remoting pipe window problem.
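The sketch referenced above, spelling out the BDP arithmetic with the example numbers (4 MB/s raw scp, 60 ms ping); anything beyond those numbers is illustrative:
// The example numbers from the list above: 4 MB/s raw scp bandwidth, 60 ms round trip.
class BdpCalc {
    public static void main(String[] args) {
        double bandwidthBytesPerSec = 4_000_000; // measured with raw scp
        double rttSeconds = 0.060;               // measured with ping
        double bdpBytes = bandwidthBytesPerSec * rttSeconds;
        System.out.printf("BDP = %.0f bytes (~%.0f kB)%n", bdpBytes, bdpBytes / 1000);
        // Prints roughly 240000 bytes (~240 kB), well above both the ~16 kB trilead window
        // and the 128 kB remoting pipe window described below.
    }
}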
trilead ssh window problem
The ssh-slaves-plugin uses the trilead ssh library to connect to the slaves. Unfortunately, that library uses a hard-coded 30,000-byte receive buffer, which limits the amount of in-flight data to 30,000 bytes. In practice, the algorithm it uses for updating its receive window rounds that down to a power of two, so you only get 16kB.
I created a pull request at https://github.com/jenkinsci/trilead-ssh2/pull/1 to make this configurable at JVM startup time. Making this window large increased our bandwidth by a factor of almost 8. Note that two of these buffers are allocated for each slave, so turning this up can consume memory quickly if you have several slaves. In our case, we have memory to spare, so it wasn't a problem. It might be useful to switch to another ssh library that allocates window memory dynamically.
Fixing this will get your usable window up to almost 128 kB, but beyond that, you run into another problem.
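To see where the factor of almost 8 comes from: with a window-limited transfer, throughput is capped at roughly window / RTT. A small sketch using the 60 ms RTT from the diagnostics above (the ceilings are approximate):
// Throughput ceiling = window / RTT, with the 60 ms RTT from the diagnostics above.
class WindowCeiling {
    public static void main(String[] args) {
        double rttSeconds = 0.060;
        int[] windowsBytes = {16 * 1024, 128 * 1024, 240_000};
        for (int w : windowsBytes) {
            double ceiling = w / rttSeconds;
            System.out.printf("window %7d bytes -> at most %.2f MB/s%n", w, ceiling / 1_000_000);
        }
        // ~16 kB  -> ~0.27 MB/s (the trilead limit)
        // ~128 kB -> ~2.2 MB/s  (the remoting pipe limit; roughly 8x better)
        // ~240 kB -> ~4.0 MB/s  (matches the raw scp bandwidth on this link)
    }
}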
jenkins remoting pipe window problem
The archiving process uses a hudson.remoting.Pipe object to send the data back. This object uses flow control to avoid overwhelming the receiver. By default, it only allows 128kB of in-flight data. There is already a system property that controls this constant, but it has a space in its name, which makes it a bit complicated to set. I created a pull request at https://github.com/jenkinsci/remoting/pull/4 to fix the name.
Note that this property must be set on the slave's JVM, not the master's. Therefore, to set it, you must go into your ssh slave configuration, open the advanced button, find the "JVM Options" input, and enter "-Dclass\ hudson.remoting.Channel.pipeWindowSize=1234567" (no quotes, change the number to whatever is appropriate for your environment). If my pull request is accepted, this will change to "-Dhudson.remoting.Channel.pipeWindowSize=1234567". Note that this window is not preallocated, so you can make this number fairly large and excess memory will not be consumed unless the master is unable to keep up with data from the slave.
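As an aside, here is a sketch of how the stray "class " prefix can get into the property name. I am assuming the constant is read with Integer.getInteger, so treat this as an illustration of the naming bug rather than a quote of the remoting source:
import hudson.remoting.Channel;

// Illustration of the naming bug, not a verbatim quote of the remoting source:
// string-concatenating a Class object yields "class <fully.qualified.Name>",
// so the property name silently grows a "class " prefix containing a space.
class PipeWindowNameSketch {
    public static void main(String[] args) {
        String buggy = Channel.class + ".pipeWindowSize";           // "class hudson.remoting.Channel.pipeWindowSize"
        String fixed = Channel.class.getName() + ".pipeWindowSize"; // "hudson.remoting.Channel.pipeWindowSize"
        System.out.println(buggy);
        System.out.println(fixed);
        int window = Integer.getInteger(fixed, 128 * 1024); // 128 kB default, per the description above
        System.out.println("pipe window = " + window + " bytes");
    }
}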
Increasing both of these windows increased our bandwidth by a factor of about 15, matching the 4 MB/s we were getting from raw scp.
Created an attachment (id=754)
pom that simply jars hudson.war