I have encountered several reports of slow slave performance on a Unix master with many slaves. In each case the thread dumps show all but one slave connection thread waiting for a single lock, which is held by a thread that looks like this:
"Pipe writer thread: ..." - Thread ...
java.lang.Thread.State: RUNNABLE
at sun.security.provider.NativePRNG$RandomIO.implNextBytes(NativePRNG.java:255)
- locked <598aec0c> (a java.lang.Object)
at sun.security.provider.NativePRNG$RandomIO.access$200(NativePRNG.java:108)
at sun.security.provider.NativePRNG.engineNextBytes(NativePRNG.java:97)
at java.security.SecureRandom.nextBytes(SecureRandom.java:433)
- locked <329129da> (a java.security.SecureRandom)
at java.security.SecureRandom.next(SecureRandom.java:455)
at java.util.Random.nextInt(Random.java:189)
at com.trilead.ssh2.transport.TransportConnection.sendMessage(TransportConnection.java:154)
From what I can tell, neither the Jenkins SSH Slaves plugin nor the Trilead SSH library is to blame, as they produce a different SecureRandom instance for each slave. Rather, it is NativePRNG (the default implementation on typical Linux installations, among others) that uses a global lock to synchronize access to /dev/random and /dev/urandom, and /dev/random can block while waiting for sufficient entropy to accumulate.
It might help for the SSH Slaves plugin to offer a java.security.SecureRandom based on sun.security.provider.SecureRandom, which does not acquire a global lock to process connection data. (It may take longer to set up a connection, since it needs to seed the random-number generator based on thread activity.)
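A minimal sketch of how such an instance could be obtained, assuming an Oracle/OpenJDK runtime where the "SHA1PRNG" algorithm is backed by sun.security.provider.SecureRandom. The class name PerConnectionRandom and the method create() are hypothetical, and how the plugin would hand the result to the SSH library is not shown here.

```java
import java.security.NoSuchAlgorithmException;
import java.security.SecureRandom;

// Hypothetical helper, not part of the SSH Slaves plugin or Trilead.
public class PerConnectionRandom {

    /**
     * Returns a new generator for one connection. On Oracle/OpenJDK,
     * "SHA1PRNG" is assumed to map to sun.security.provider.SecureRandom.
     * The instance seeds itself on the first call that draws random bytes.
     */
    public static SecureRandom create() {
        try {
            return SecureRandom.getInstance("SHA1PRNG");
        } catch (NoSuchAlgorithmException e) {
            // Algorithm not available on this JVM: fall back to the platform default.
            return new SecureRandom();
        }
    }

    public static void main(String[] args) {
        SecureRandom random = create();
        // Report which implementation was actually selected.
        System.out.println(random.getAlgorithm() + " from " + random.getProvider().getName());
    }
}
```

Each caller gets its own instance, so any locking inside the generator is limited to that one connection.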
Unconfirmed workarounds:
- Editing the JRE's $JAVA_HOME/lib/security/java.security to comment out the line securerandom.source=file:/dev/urandom (should switch back to the generic implementation)
- Running with -Djava.security.egd=file:/dev/./urandom (should force use of /dev/urandom, which is supposed to be nonblocking)
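To check whether either workaround changed anything, it may help to print which implementation the JVM selects by default along with the two settings involved. The following is only a diagnostic sketch. The class name SecureRandomInfo and the method describe() are made up for illustration, and it should be run with the same JVM and options as the master.

```java
import java.security.SecureRandom;
import java.security.Security;

// Hypothetical diagnostic helper: reports the default SecureRandom and related settings.
public class SecureRandomInfo {

    public static String describe() {
        SecureRandom random = new SecureRandom(); // platform default implementation
        return "algorithm=" + random.getAlgorithm()
                + ", provider=" + random.getProvider().getName()
                // Value from lib/security/java.security (null if the line is commented out)
                + ", securerandom.source=" + Security.getProperty("securerandom.source")
                // Value passed with -Djava.security.egd=... (null if not set)
                + ", java.security.egd=" + System.getProperty("java.security.egd");
    }

    public static void main(String[] args) {
        System.out.println(describe());
    }
}
```

If the reported algorithm is still NativePRNG after applying a workaround, the setting presumably did not take effect for that JVM.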