Bug
Resolution: Won't Fix
Major
Jenkins master ver. 1.574-SNAPSHOT (git 17d9093) on x64 Debian 7 with:
java version "1.6.0_32"
OpenJDK Runtime Environment (IcedTea6 1.13.4) (6b32-1.13.4-1~deb7u1)
OpenJDK 64-Bit Server VM (build 23.25-b01, mixed mode)
Jenkins slave on POWER7 ppc64 Fedora 16 (an OSU OSL community machine), with Java:
java version "1.6.0_24"
OpenJDK Runtime Environment (IcedTea6 1.11.1) (fedora-65.1.11.1.fc16-ppc64)
OpenJDK 64-Bit Zero VM (build 20.0-b12, interpreted mode)
I've encountered an ssh2 channel protocol issue when a ppc64 slave communicates with an x64 master.
Most operations, like sending build logs, work fine. However, when artifacts are uploaded at the end of the build, the build stalls indefinitely at:
Archiving artifacts
If I get stack dumps of slave and master using jstack, I see the master waiting to read from the slave:
"Channel reader thread: Fedora16-ppc64-Power7-osuosl-karman" prio=10 tid=0x00000000038c2800 nid=0x6de7 in Object.wait() [0x00007f825ef8b000]
   java.lang.Thread.State: WAITING (on object monitor)
        at java.lang.Object.wait(Native Method)
        - waiting on <0x00000000bf5802e0> (a com.trilead.ssh2.channel.Channel)
        at java.lang.Object.wait(Object.java:502)
        at com.trilead.ssh2.channel.FifoBuffer.read(FifoBuffer.java:212)
        - locked <0x00000000bf5802e0> (a com.trilead.ssh2.channel.Channel)
        at com.trilead.ssh2.channel.Channel$Output.read(Channel.java:127)
        at com.trilead.ssh2.channel.ChannelManager.getChannelData(ChannelManager.java:946)
        - locked <0x00000000bf5802e0> (a com.trilead.ssh2.channel.Channel)
        at com.trilead.ssh2.channel.ChannelInputStream.read(ChannelInputStream.java:58)
        at com.trilead.ssh2.channel.ChannelInputStream.read(ChannelInputStream.java:79)
        at hudson.remoting.FlightRecorderInputStream.read(FlightRecorderInputStream.java:82)
        at hudson.remoting.ChunkedInputStream.readHeader(ChunkedInputStream.java:67)
        at hudson.remoting.ChunkedInputStream.readUntilBreak(ChunkedInputStream.java:93)
        at hudson.remoting.ChunkedCommandTransport.readBlock(ChunkedCommandTransport.java:33)
        at hudson.remoting.AbstractSynchronousByteArrayCommandTransport.read(AbstractSynchronousByteArrayCommandTransport.java:34)
        at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:48)
and the slave is waiting for data from the master:
"Channel reader thread: channel" prio=10 tid=0x00000fff940fedd0 nid=0x558e runnable [0x00000fff6dc6d000]
   java.lang.Thread.State: RUNNABLE
        at java.io.FileInputStream.readBytes(Native Method)
        at java.io.FileInputStream.read(FileInputStream.java:236)
        at java.io.BufferedInputStream.fill(BufferedInputStream.java:235)
        at java.io.BufferedInputStream.read(BufferedInputStream.java:254)
        - locked <0x00000fff78ba9f98> (a java.io.BufferedInputStream)
        at hudson.remoting.FlightRecorderInputStream.read(FlightRecorderInputStream.java:82)
        at hudson.remoting.ChunkedInputStream.readHeader(ChunkedInputStream.java:67)
        at hudson.remoting.ChunkedInputStream.readUntilBreak(ChunkedInputStream.java:93)
        at hudson.remoting.ChunkedCommandTransport.readBlock(ChunkedCommandTransport.java:33)
        at hudson.remoting.AbstractSynchronousByteArrayCommandTransport.read(AbstractSynchronousByteArrayCommandTransport.java:34)
        at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:48)
Of course, I can't get those dumps at exactly the same moment (even if that were meaningful given network latency and buffering), but repeated runs never show any other state for either thread.
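For anyone who wants to repeat the "sample the reader thread several times" check from inside a JVM rather than with jstack, here is a minimal sketch. It is only an illustration: the class name `ReaderThreadSampler`, the `sample` helper and the stand-in thread are hypothetical, and it can only see threads in the JVM it runs in, so it would have to be run separately on each side.

```java
import java.util.Map;

public class ReaderThreadSampler {
    /**
     * Returns one line per live thread whose name starts with namePrefix:
     * the thread name, its state, and its top stack frame.
     */
    static String sample(String namePrefix) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<Thread, StackTraceElement[]> e : Thread.getAllStackTraces().entrySet()) {
            Thread t = e.getKey();
            if (!t.getName().startsWith(namePrefix)) {
                continue;
            }
            StackTraceElement[] frames = e.getValue();
            String top = frames.length > 0 ? frames[0].toString() : "(no frames)";
            sb.append(t.getName()).append(' ').append(t.getState())
              .append(" at ").append(top).append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) throws InterruptedException {
        // Stand-in for a stuck reader thread (assumption: not a real remoting thread).
        final Object lock = new Object();
        Thread standIn = new Thread(() -> {
            synchronized (lock) {
                try {
                    lock.wait(); // blocks forever, like a reader waiting for data
                } catch (InterruptedException ignored) {
                    // exit quietly
                }
            }
        }, "Channel reader thread: demo");
        standIn.setDaemon(true);
        standIn.start();

        // Take a few samples; a thread that never changes state or top frame is likely stuck.
        for (int i = 0; i < 3; i++) {
            Thread.sleep(200);
            System.out.print(sample("Channel reader thread"));
        }
    }
}
```

If every sample shows the same state and top frame, that matches what the jstack dumps above show, though it says nothing about why the data never arrives.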
tshark shows that there's some SSH chatter going on:
0.000000 SLAVE -> MASTER SSH 126 Encrypted response packet len=60
0.176121 MASTER -> SLAVE SSH 94 Encrypted request packet len=28
0.176151 SLAVE -> MASTER TCP 66 ssh > 37501 [ACK] Seq=61 Ack=29 Win=707 Len=0 TSval=4141397874 TSecr=2808266826
but it could well be low-level SSH keepalives or similar, as it occurs at precise 5-second intervals with nothing much else happening. There are three master->slave SSH connections, so it's not even guaranteed that this is the one associated with the stuck channel.
My first thought is endianness.
I don't really know how to begin debugging this issue, though.
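To make the endianness suspicion concrete, here is a small sketch of how a byte-order mismatch on a length header could leave a reader waiting for data that never comes. The 2-byte length header and the names `ByteOrderDemo`, `encodeLength` and `decodeLength` are assumptions for illustration only, not the actual hudson.remoting or trilead wire format. Note that `ByteBuffer` and `DataInputStream`/`DataOutputStream` default to big-endian on every platform, so pure Java code would normally only hit this if something explicitly used the native byte order.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class ByteOrderDemo {
    /** Encodes a 16-bit length in the given byte order (hypothetical header format). */
    static byte[] encodeLength(int length, ByteOrder order) {
        return ByteBuffer.allocate(2).order(order).putShort((short) length).array();
    }

    /** Decodes a 16-bit unsigned length in the given byte order. */
    static int decodeLength(byte[] header, ByteOrder order) {
        return ByteBuffer.wrap(header).order(order).getShort() & 0xFFFF;
    }

    public static void main(String[] args) {
        int length = 60;
        // The sender writes the header in network (big-endian) order.
        byte[] wire = encodeLength(length, ByteOrder.BIG_ENDIAN);

        System.out.println("native order on this machine: " + ByteOrder.nativeOrder());
        System.out.println("decoded with matching order:   " + decodeLength(wire, ByteOrder.BIG_ENDIAN));
        // A receiver using the wrong order sees a much larger length and would
        // block waiting for bytes the sender never intends to send.
        System.out.println("decoded with mismatched order: " + decodeLength(wire, ByteOrder.LITTLE_ENDIAN));
    }
}
```

Comparing `ByteOrder.nativeOrder()` on the master and on the slave would at least confirm whether the two JVMs differ in native byte order, which is the precondition for this kind of bug.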