truncation or corruption of zip workspace archive from slave

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Component: core
    • Labels: None
    • Environment: Hudson/Jenkins >= 1.378, Debian, slave connected via SSH

      Downloading a ZIP archive of the workspace from a project that was built on a slave appears to be broken since Hudson 1.378 (found by bisection between 1.365 and 1.406-SNAPSHOT). It worked, and still works, when the project was built on the master, where no remoting takes place.

      How to reproduce:

      1. Set up a free-style project that just creates a few files in the workspace, such as:
        env > env.txt
        ls -la > ls-la.txt
        dd if=/dev/urandom of=random.bin bs=512 count=2048
        
      2. Restrict this project to run on a slave (connected via SSH in my case).
      3. Run this project.
      4. Using the "(all files in zip)" link, download the workspace and verify the downloaded Zip archive. With 1.377 and earlier, you can run the download and verification step in a loop from the command line 100 times in a row without error. Since 1.378, it will usually fail at the second attempt and, at first glance at the hexdump, the result looks like a correct but truncated Zip archive. The script I used for testing is this:
        $ cat test.sh 
        i=0
        while [ $i -lt 100 ]; do
                i=`expr $i + 1`
                echo $i
                wget -q -O test.zip 'http://localhost/jenkins/job/test/ws/*zip*/test.zip' && \
                unzip -l test.zip > /dev/null || exit $?
        done
        exit 0
        

      Known workaround:

      • Run the job on the Jenkins master. (This isn't an option in our setup.)

      Possibly related issues:

      The changelog of 1.378 mentions JENKINS-5977 "Improving the master/slave communication to avoid pipe clogging problem.", and I suspect that this change introduced the problem. A later changelog entry, for 1.397, mentions a fix for "a master/slave communication problem since 1.378" (JENKINS-7745). However, using the steps described above I can still reproduce this issue, even with the current version 1.404 and the latest snapshot.

      As suggested in comments on other issues touching on master/slave communication, it seems reasonable to assume that this issue is caused by a missing flush operation on an output stream, or something to that effect. Another possibility, perhaps more likely, is the suspected thread concurrency problem noted in remoting/src/main/java/hudson/remoting/PipeWindow.java, which also mentions the issues JENKINS-7745 and JENKINS-7581.
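The missing-flush hypothesis can be illustrated with a minimal sketch (plain Java, not Jenkins remoting code; NoFlushClose is a made-up class): if close() tears down the underlying stream without flushing buffered data first, the receiver sees a correct-looking but truncated stream, which matches the observed symptom.

```java
import java.io.*;

public class FlushDemo {
    // A buffered writer whose close() forgets to flush -- analogous to an
    // EOF command racing ahead of still-buffered Chunk data.
    static class NoFlushClose extends FilterOutputStream {
        private final ByteArrayOutputStream buf = new ByteArrayOutputStream();
        NoFlushClose(OutputStream out) { super(out); }
        @Override public void write(int b) { buf.write(b); }        // buffer only
        @Override public void flush() throws IOException {
            out.write(buf.toByteArray());                           // deliver buffered data
            buf.reset();
            out.flush();
        }
        @Override public void close() throws IOException {
            out.close();                                            // BUG: no flush() first
        }
    }

    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream receiver = new ByteArrayOutputStream();
        try (NoFlushClose pipe = new NoFlushClose(receiver)) {
            pipe.write("first chunk ".getBytes());
            pipe.flush();                            // delivered to the receiver
            pipe.write("second chunk".getBytes());   // still buffered when close() runs
        }
        System.out.println(receiver);                // prints "first chunk " -- truncated
    }
}
```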

          [JENKINS-9189] truncation or corruption of zip workspace archive from slave

          Uwe Stuehler created issue -

          Uwe Stuehler added a comment -

          This patch, which reverts to the old behavior of using a SynchronousExecutorService in hudson.remoting.Channel, seems to be another workaround.

          Uwe Stuehler made changes -
          Attachment New: Channel_createPipeWriter.diff [ 20336 ]

          Uwe Stuehler added a comment - edited

          With this diff and the log level set to FINE, we get something like this when retrieving the workspace Zip file:

          Apr 11, 2011 8:33:41 PM hudson.remoting.ProxyOutputStream
          FINE: oid=5 Yaaawn!
          Apr 11, 2011 8:33:41 PM hudson.model.Queue
          FINE: Queue maintenance started hudson.model.Queue@6e135779
          Apr 11, 2011 8:33:38 PM hudson.remoting.Channel
          FINE: Send Pipe.Ack(5,2)
          [...]
          Apr 11, 2011 8:33:38 PM hudson.remoting.Channel
          FINE: Send Pipe.Ack(5,18)
          Apr 11, 2011 8:33:38 PM hudson.remoting.Channel
          FINE: Send Pipe.Ack(5,512)
          Apr 11, 2011 8:33:38 PM hudson.remoting.Channel
          FINE: Received Response[retVal=hudson.remoting.UserResponse@4b76ffeb,exception=null]
          Apr 11, 2011 8:33:38 PM hudson.remoting.Channel
          FINE: Received Pipe.EOF(5)
          Apr 11, 2011 8:33:38 PM hudson.remoting.Channel
          FINE: Received Pipe.Pause(5)
          Apr 11, 2011 8:33:38 PM hudson.remoting.Channel
          FINE: Received Pipe.Chunk(5,2)
          

          What this shows is that commands are executed in parallel: while Chunk commands are still being executed, an EOF command can start executing at the same time and close the stream. This seems at least a bit counterintuitive, given the documentation for Channel.send():

          /**
           * Sends a command to the remote end and executes it there.
           *
           * <p>
           * This is the lowest layer of abstraction in {@link Channel}.
           * {@link Command}s are executed on a remote system in the order they are sent.
           */
          /*package*/ synchronized void send(Command cmd) throws IOException {
          
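The observed interleaving can be reproduced in isolation with a short sketch (illustrative Java, not the actual hudson.remoting classes): send() is synchronized, so commands leave the channel in order, but executing each received command on a thread pool lets a cheap EOF overtake Chunk commands that are still running.

```java
import java.util.concurrent.*;
import java.util.*;

public class OrderDemo {
    // Submits three slow "Chunk" commands and one cheap "EOF" command, in
    // order, to a thread pool and records the order in which they complete.
    static List<String> run() throws Exception {
        List<String> log = Collections.synchronizedList(new ArrayList<>());
        ExecutorService pool = Executors.newFixedThreadPool(4);
        for (int i = 0; i < 3; i++) {
            final int n = i;
            pool.submit(() -> { sleep(50); log.add("Chunk(" + n + ")"); });
        }
        pool.submit(() -> log.add("EOF"));   // submitted last, usually finishes first
        pool.shutdown();
        pool.awaitTermination(5, TimeUnit.SECONDS);
        return log;
    }

    static void sleep(long ms) {
        try { Thread.sleep(ms); } catch (InterruptedException e) { }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(run());   // EOF typically appears before the Chunks
    }
}
```

With a SynchronousExecutorService (as in the attached workaround patch), each command would instead run to completion on the receiving thread before the next one starts, preserving send order.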

          Uwe Stuehler made changes -
          Attachment New: command-completion-fail.diff [ 20372 ]
          Uwe Stuehler made changes -
          Priority Original: Critical [ 2 ] New: Major [ 3 ]
          Harald Kahlfeld made changes -
          Attachment New: fix-JENKINS-9189.diff [ 20486 ]

          Harald Kahlfeld added a comment -

          After some more investigation we found that the issue is still present with the current master and the 1.412 release. Even after the flow was corrected, probably by other changes or fixes, so that the commands seemingly execute in order of appearance, the zip files still get corrupted.

          The issue seems to be a race condition between a Chunk command that has yet to complete and the EOF command closing the stream without flushing. The attached fix handles this situation by introducing, within the doClose() method, a call that flushes the stream by invoking the corresponding command. Furthermore, the implementation of the Flush command had to be changed to wait for the completion of its thread, ensuring this way that flush() has executed before the following EOF command closes the stream.

          Remark: The Flush command was added as described above because we didn't find a better place to put it. On the other hand, it allows a somewhat lazy call of the doClose method.
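The idea behind the fix can be sketched as follows (illustrative names, not the actual hudson.remoting API or the attached diff): before EOF, a flush task is submitted to the single pipe-writer executor and its Future is waited on, which guarantees that every previously queued chunk has been written before the stream is closed.

```java
import java.util.concurrent.*;

public class CloseWithFlush {
    // Single-threaded writer, standing in for the pipe-writer thread.
    private final ExecutorService pipeWriter = Executors.newSingleThreadExecutor();
    private final StringBuilder stream = new StringBuilder();

    // Chunks are written asynchronously, like Pipe.Chunk commands.
    void writeChunk(String data) {
        pipeWriter.submit(() -> stream.append(data));
    }

    // The .get() is the crucial part: it blocks until the flush task runs,
    // and since the executor is FIFO, until all earlier chunks have been
    // written -- only then is the EOF appended and the stream closed.
    void doClose() throws Exception {
        pipeWriter.submit(() -> { /* flush buffered data here */ }).get();
        stream.append("<EOF>");
        pipeWriter.shutdown();
    }

    String contents() { return stream.toString(); }
}
```

Without the .get(), doClose() could append the EOF while chunk tasks are still queued, reproducing the truncation.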


          Uwe Stuehler added a comment -

          I have reviewed Harald's diff and I agree with him on the fix and his comment, but I would like someone to review it again and commit it if it's correct.

          Kawaguchi-san, I've assigned the issue to you since, as far as I can tell, you implemented the pipe throttling mechanism and probably know that area best.

          Regarding the fix: the added .get() does indeed seem necessary to ensure the flush actually finishes, and this is what Channel.localSyncIO() does as well.

          Thanks a lot.

          Uwe Stuehler made changes -
          Assignee New: Kohsuke Kawaguchi [ kohsuke ]

            Assignee: Kohsuke Kawaguchi (kohsuke)
            Reporter: Uwe Stuehler (ustuehler)
            Votes: 6
            Watchers: 7