Jenkins / JENKINS-23917

Protocol deadlock while uploading artifacts from ppc64

    • Type: Bug
    • Resolution: Won't Fix
    • Priority: Major
    • Component/s: core, remoting

      I've encountered an ssh2 channel protocol issue when a ppc64 slave communicates with an x64 master.

      Most operations, like sending build logs, work fine. When it's time to upload artifacts at the end of the build, however, the build stalls indefinitely at:

      Archiving artifacts
      

      If I get stack dumps of slave and master using jstack, I see the master waiting to read from the slave:

      "Channel reader thread: Fedora16-ppc64-Power7-osuosl-karman" prio=10 tid=0x00000000038c2800 nid=0x6de7 in Object.wait() [0x00007f825ef8b000]
         java.lang.Thread.State: WAITING (on object monitor)
              at java.lang.Object.wait(Native Method)
              - waiting on <0x00000000bf5802e0> (a com.trilead.ssh2.channel.Channel)
              at java.lang.Object.wait(Object.java:502)
              at com.trilead.ssh2.channel.FifoBuffer.read(FifoBuffer.java:212)
              - locked <0x00000000bf5802e0> (a com.trilead.ssh2.channel.Channel)
              at com.trilead.ssh2.channel.Channel$Output.read(Channel.java:127)
              at com.trilead.ssh2.channel.ChannelManager.getChannelData(ChannelManager.java:946)
              - locked <0x00000000bf5802e0> (a com.trilead.ssh2.channel.Channel)
              at com.trilead.ssh2.channel.ChannelInputStream.read(ChannelInputStream.java:58)
              at com.trilead.ssh2.channel.ChannelInputStream.read(ChannelInputStream.java:79)
              at hudson.remoting.FlightRecorderInputStream.read(FlightRecorderInputStream.java:82)
              at hudson.remoting.ChunkedInputStream.readHeader(ChunkedInputStream.java:67)
              at hudson.remoting.ChunkedInputStream.readUntilBreak(ChunkedInputStream.java:93)
              at hudson.remoting.ChunkedCommandTransport.readBlock(ChunkedCommandTransport.java:33)
              at hudson.remoting.AbstractSynchronousByteArrayCommandTransport.read(AbstractSynchronousByteArrayCommandTransport.java:34)
              at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:48)
      

      and the slave is waiting for data from the master:

      "Channel reader thread: channel" prio=10 tid=0x00000fff940fedd0 nid=0x558e runnable [0x00000fff6dc6d000]
         java.lang.Thread.State: RUNNABLE
              at java.io.FileInputStream.readBytes(Native Method)
              at java.io.FileInputStream.read(FileInputStream.java:236)
              at java.io.BufferedInputStream.fill(BufferedInputStream.java:235)
              at java.io.BufferedInputStream.read(BufferedInputStream.java:254)
              - locked <0x00000fff78ba9f98> (a java.io.BufferedInputStream)
              at hudson.remoting.FlightRecorderInputStream.read(FlightRecorderInputStream.java:82)
              at hudson.remoting.ChunkedInputStream.readHeader(ChunkedInputStream.java:67)
              at hudson.remoting.ChunkedInputStream.readUntilBreak(ChunkedInputStream.java:93)
              at hudson.remoting.ChunkedCommandTransport.readBlock(ChunkedCommandTransport.java:33)
              at hudson.remoting.AbstractSynchronousByteArrayCommandTransport.read(AbstractSynchronousByteArrayCommandTransport.java:34)
              at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:48)
      

      Of course, I can't capture both dumps at exactly the same moment (even if that were meaningful given network latency and buffering), but repeated runs never show any other state for either thread.
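For illustration, the symptom (both reader threads parked in a blocking read(), each waiting for the peer to write) can be reproduced in isolation with a pair of pipes. This is a minimal sketch of the stuck-on-both-ends state, not Jenkins remoting code:

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.PipedInputStream;
import java.io.PipedOutputStream;

public class MutualReadStall {
    static Thread reader(String name, InputStream in) {
        Thread t = new Thread(() -> {
            try {
                in.read(); // blocks until the peer writes -- which it never does
            } catch (IOException ignored) {
            }
        }, "Channel reader thread: " + name);
        t.setDaemon(true);
        return t;
    }

    public static void main(String[] args) throws Exception {
        // Each side reads from a pipe that only the other side can write to.
        PipedInputStream masterIn = new PipedInputStream();
        PipedOutputStream slaveOut = new PipedOutputStream(masterIn);
        PipedInputStream slaveIn = new PipedInputStream();
        PipedOutputStream masterOut = new PipedOutputStream(slaveIn);

        Thread master = reader("master", masterIn);
        Thread slave = reader("slave", slaveIn);
        master.start();
        slave.start();

        Thread.sleep(500); // give both threads time to park inside read()

        // Neither side has sent anything, so both stay blocked,
        // which is the state the jstack dumps above show.
        System.out.println("master alive: " + master.isAlive());
        System.out.println("slave alive: " + slave.isAlive());
        master.interrupt();
        slave.interrupt();
    }
}
```

Once both sides are inside a blocking read, no amount of waiting resolves it; something has to have been lost or misparsed on the wire beforehand.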

      tshark shows that there's some SSH chatter going on:

        0.000000 SLAVE -> MASTER SSH 126 Encrypted response packet len=60
        0.176121 MASTER -> SLAVE SSH 94 Encrypted request packet len=28
        0.176151 SLAVE -> MASTER TCP 66 ssh > 37501 [ACK] Seq=61 Ack=29 Win=707 Len=0 TSval=4141397874 TSecr=2808266826
      

      but it could well be low-level SSH keepalives or similar, since it occurs at precise 5-second intervals with nothing much else happening. There are three master->slave SSH connections, so it's not even guaranteed to be the one associated with the stuck channel.

      My first thought is endianness.
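To flesh out that hypothesis: the remoting transport frames commands with a length-prefixed chunk header (the ChunkedInputStream.readHeader frames in the traces above). If the two ends ever disagreed on the byte order of that length, the reader would wait for far more data than was sent and block exactly like this. A hypothetical sketch with an assumed 2-byte prefix, not the actual hudson.remoting header layout:

```java
public class HeaderByteOrderDemo {
    // Interpret a 2-byte length prefix as big-endian.
    static int lengthBE(byte[] h) {
        return ((h[0] & 0xFF) << 8) | (h[1] & 0xFF);
    }

    // The same two bytes read with the opposite byte order.
    static int lengthLE(byte[] h) {
        return ((h[1] & 0xFF) << 8) | (h[0] & 0xFF);
    }

    public static void main(String[] args) {
        byte[] header = {0x00, 0x0A}; // sender framed a 10-byte chunk
        System.out.println("as written: " + lengthBE(header));
        System.out.println("byte-swapped: " + lengthLE(header));
        // A reader that misparses the prefix waits for 2560 bytes when
        // only 10 are coming, so its read() blocks indefinitely.
    }
}
```

That said, the JVM's stream APIs read such headers byte by byte in a platform-independent order, so a pure-Java endianness bug would be surprising; the sketch only shows what a byte-order disagreement would look like if some layer introduced one.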

      I don't really know how to begin debugging this issue, though.

        1. slavelog-from-master.txt
          4 kB
        2. slavelog-from-slave.txt
          1 kB
        3. jenkins-master-idle-stack.txt
          47 kB
        4. config.xml
          1.0 kB
        5. jenkins-master-stack.txt
          49 kB
        6. jenkins-slave-stack.txt
          6 kB

          [JENKINS-23917] Protocol deadlock while uploading artifacts from ppc64

          Craig Ringer created issue -
          Craig Ringer made changes -
          Summary Original: "Protocol deadlock while uploading artifacts" New: "Protocol deadlock while uploading artifacts from ppc64"

          Craig Ringer added a comment -

          I'm going to re-test with a simplified build and a single executor configured.

          Craig Ringer added a comment -

          Interestingly, archiving worked with a trivial configuration - a dd command to create a dummy file and a trivial archiving command to copy it.

          Craig Ringer added a comment -

          I've been able to reproduce this with a trivial job after limiting the node to a single executor. It is not consistently reproducible; it's somewhat random. I suspect it depends on the input being archived (10 MB of randomly generated data), but it could also be just plain random.

          I'll attach the config.xml and thread dumps.

          Craig Ringer added a comment -

          Build log is:

          Started by user Craig Ringer
          [EnvInject] - Loading node environment variables.
          Building remotely on fedora16-ppc64-Power7-osuosl-karman (ppc64 fedora16 ppc linux fedora) in workspace /home/jenkins/workspace/ppctest
          [ppctest] $ /bin/sh -xe /tmp/hudson9160124340407748260.sh
          + dd if=/dev/urandom of=dummy.out bs=1M count=10
          10+0 records in
          10+0 records out
          10485760 bytes (10 MB) copied, 0.604488 s, 17.3 MB/s
          Archiving artifacts
          

          At the time the master stack was taken there was another build running. I'll see if I can capture another once it's idle.

          Craig Ringer made changes -
          Attachment New: config.xml [ 26420 ]
          Attachment New: jenkins-master-stack.txt [ 26421 ]
          Attachment New: jenkins-slave-stack.txt [ 26422 ]

          Craig Ringer added a comment -

          If it is helpful, I can provision access to the build worker for anyone interested in this issue.

          Craig Ringer added a comment - edited

          Attached a jstack for the master at idle except for the stuck connection.

          Craig Ringer made changes -
          Attachment New: jenkins-master-idle-stack.txt [ 26423 ]

            Assignee: Unassigned
            Reporter: Craig Ringer (ringerc)
            Votes: 2
            Watchers: 5
