JENKINS-9189: truncation or corruption of zip workspace archive from slave

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Component: core
    • Labels: None
    • Environment: Hudson/Jenkins >= 1.378, Debian, slave connected via SSH

      Downloading a ZIP archive of the workspace from a project that was built on a slave appears to be broken since Hudson 1.378 (found by bisection between 1.365 and 1.406-SNAPSHOT). It worked and still works when the project was built on the master, so no remoting takes place.

      How to reproduce:

      1. Set up a free-style project that just creates a few files in the workspace, such as:
        env > env.txt
        ls -la > ls-la.txt
        dd if=/dev/urandom of=random.bin bs=512 count=2048
        
      2. Restrict this project to run on a slave (connected via SSH in my case).
      3. Run this project.
      4. Using the "(all files in zip)" link, download the workspace and verify the downloaded Zip archive. With 1.377 and before, you can run the download and verification step in a loop from the command line 100 times in a row without error. Since 1.378, it will usually fail at the second attempt and will, at first glance at the hexdump, look like a correct but truncated Zip archive. The script that I used for testing is this:
        $ cat test.sh 
        i=0
        while [ $i -lt 100 ]; do
                i=`expr $i + 1`
                echo $i
                wget -q -O test.zip 'http://localhost/jenkins/job/test/ws/*zip*/test.zip' && \
                unzip -l test.zip > /dev/null || exit $?
        done
        exit 0
        

      Known workaround:

      • Run the job on the Jenkins master. (This isn't an option in our setup.)

      Possibly related issues:

      The changelog of 1.378 mentions JENKINS-5977 "Improving the master/slave communication to avoid pipe clogging problem." and I suspect that this change introduced the problem. A later changelog entry for 1.397 mentions that it fixed "a master/slave communication problem since 1.378" (JENKINS-7745). However, using the steps described above I can still reproduce at least this issue, even in the current version 1.404 and the latest snapshot.

      As suggested in comments on other issues touching on the field of master/slave communication, it seems reasonable to assume that this issue could be caused by a missing flush operation on an output stream, or something to that effect. Another possibility, however likely that may be, is the suspected thread concurrency problem noted in remoting/src/main/java/hudson/remoting/PipeWindow.java, where the issues JENKINS-7745 and JENKINS-7581 are also mentioned.
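
      For illustration only (this is not Jenkins code, just a self-contained sketch of the "missing flush" hypothesis): bytes written through a buffered stream stay invisible downstream until flush() pushes them out, so anything that reads the underlying stream too early sees a truncated result.

        import java.io.BufferedOutputStream;
        import java.io.ByteArrayOutputStream;
        import java.io.IOException;

        // Sketch of the "missing flush" hypothesis: data written into a buffered
        // stream does not reach the underlying stream until flush() (or close()).
        public class MissingFlushDemo {
            public static void main(String[] args) throws IOException {
                ByteArrayOutputStream underlying = new ByteArrayOutputStream();
                BufferedOutputStream buffered = new BufferedOutputStream(underlying, 8192);

                buffered.write(new byte[1024]);  // stays in the 8 KiB buffer for now
                System.out.println("before flush: " + underlying.size() + " bytes"); // 0

                buffered.flush();                // only now does the data arrive
                System.out.println("after flush:  " + underlying.size() + " bytes"); // 1024
            }
        }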

          Uwe Stuehler added a comment -

          This patch, which reverts to the old behavior of using a SynchronousExecutorService in hudson.remoting.Channel, seems to be another workaround.
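
          For readers not familiar with the term: a "synchronous" executor in this sense simply runs every submitted task inline in the calling thread, so commands are necessarily handled one at a time and in the order they were submitted. A minimal sketch of that idea (not the actual hudson.remoting class):

            import java.util.concurrent.Executor;

            // Minimal illustration of a "synchronous" executor: each task runs
            // immediately in the caller's thread, so there is no reordering and
            // no concurrency between submitted commands. (Sketch only.)
            class InlineExecutor implements Executor {
                @Override
                public void execute(Runnable command) {
                    command.run();
                }
            }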

          Uwe Stuehler added a comment - edited

          With this diff and the log level set to FINE, we get something like this when retrieving the workspace Zip file:

          Apr 11, 2011 8:33:41 PM hudson.remoting.ProxyOutputStream
          FINE: oid=5 Yaaawn!
          Apr 11, 2011 8:33:41 PM hudson.model.Queue
          FINE: Queue maintenance started hudson.model.Queue@6e135779
          Apr 11, 2011 8:33:38 PM hudson.remoting.Channel
          FINE: Send Pipe.Ack(5,2)
          [...]
          Apr 11, 2011 8:33:38 PM hudson.remoting.Channel
          FINE: Send Pipe.Ack(5,18)
          Apr 11, 2011 8:33:38 PM hudson.remoting.Channel
          FINE: Send Pipe.Ack(5,512)
          Apr 11, 2011 8:33:38 PM hudson.remoting.Channel
          FINE: Received Response[retVal=hudson.remoting.UserResponse@4b76ffeb,exception=null]
          Apr 11, 2011 8:33:38 PM hudson.remoting.Channel
          FINE: Received Pipe.EOF(5)
          Apr 11, 2011 8:33:38 PM hudson.remoting.Channel
          FINE: Received Pipe.Pause(5)
          Apr 11, 2011 8:33:38 PM hudson.remoting.Channel
          FINE: Received Pipe.Chunk(5,2)
          

          What it shows is that commands are executed in parallel, so while Chunk commands are being executed, an EOF command can start executing at the same time and close the stream. This seems at least a bit counterintuitive, given the documentation for Channel.send():

          /**
           * Sends a command to the remote end and executes it there.
           *
           * <p>
           * This is the lowest layer of abstraction in {@link Channel}.
           * {@link Command}s are executed on a remote system in the order they are sent.
           */
          /*package*/ synchronized void send(Command cmd) throws IOException {
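
          The following toy program (not Jenkins code, the names are made up) makes the hazard concrete: "Chunk" and "EOF" arrive in order, but once they are handed to a multi-threaded executor the close can overtake a still-pending write, and the receiving side ends up with a truncated stream. With an inline (synchronous) executor the same two commands would always execute in order.

            import java.util.concurrent.ExecutorService;
            import java.util.concurrent.Executors;
            import java.util.concurrent.TimeUnit;
            import java.util.concurrent.atomic.AtomicInteger;

            public class PipeRaceDemo {
                // Toy stand-in for the receiving end of a remoted pipe: writes after
                // close() are silently dropped, which is how truncation shows up.
                static class ToyPipe {
                    final AtomicInteger received = new AtomicInteger();
                    volatile boolean closed;
                    void write(byte[] chunk) {
                        if (!closed) received.addAndGet(chunk.length);
                    }
                    void close() { closed = true; }
                }

                public static void main(String[] args) throws Exception {
                    ToyPipe pipe = new ToyPipe();
                    ExecutorService pool = Executors.newCachedThreadPool();

                    // Command 1 (sent first): write a 512-byte chunk, with some latency.
                    pool.execute(() -> {
                        try { Thread.sleep(50); } catch (InterruptedException ignored) { }
                        pipe.write(new byte[512]);
                    });
                    // Command 2 (sent second): EOF/close, which may run before the write.
                    pool.execute(pipe::close);

                    pool.shutdown();
                    pool.awaitTermination(5, TimeUnit.SECONDS);
                    // Frequently prints 0 here (truncation); an in-order executor
                    // would always print 512.
                    System.out.println("bytes received: " + pipe.received.get());
                }
            }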
          

          Harald Kahlfeld added a comment -

          After some more investigation we found that the issue is still present with the current master and the 1.412 release. Even after the flow was corrected - probably by other changes or fixes - so that the commands seemingly execute in order of appearance, the zip files still get corrupted.

          The issue seems to be a race condition between a chunk command that has not yet completed and the eof command closing the stream without flushing. The attached fix handles this situation by introducing - within the doClose() method - a call to flush the stream by invoking the corresponding command. Furthermore, the implementation of the flush command had to be changed to wait for the completion of the thread, making sure this way that flush() is executed before the stream is cut off by the following eof command.

          Remark: The flush call was added as described above because we didn't find a better place to put it. On the other hand, it allows a somewhat lazy call of the doClose() method.
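
          For illustration, a rough sketch of the ordering the patch establishes (hypothetical names, not the actual hudson.remoting code): doClose() first issues a flush and blocks on its completion via Future.get(), and only then sends EOF, so the EOF can no longer overtake chunks that are still queued.

            import java.io.IOException;
            import java.util.concurrent.ExecutorService;
            import java.util.concurrent.Executors;
            import java.util.concurrent.Future;

            class ProxyStreamSketch {
                // A single worker processes commands in FIFO order, so waiting for the
                // flush below also waits for every chunk submitted before it.
                private final ExecutorService commandExecutor = Executors.newSingleThreadExecutor();

                // Pretend transmission of a protocol command to the remote side.
                private void send(String command) {
                    System.out.println("send " + command);
                }

                // Chunk: queued asynchronously, just as before the patch.
                void doWrite(byte[] chunk) {
                    commandExecutor.execute(() -> send("Pipe.Chunk(" + chunk.length + ")"));
                }

                // Flush: submitted like any other command, but awaited before returning.
                void doFlush() throws IOException {
                    Future<?> done = commandExecutor.submit(() -> send("Pipe.Flush"));
                    try {
                        done.get();               // block until the flush has really run
                    } catch (Exception e) {
                        throw new IOException("flush did not complete", e);
                    }
                }

                // Close: flush (and wait) first, then EOF, so no pending chunk is cut off.
                void doClose() throws IOException {
                    doFlush();
                    send("Pipe.EOF");
                    commandExecutor.shutdown();
                }
            }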

          Uwe Stuehler added a comment -

          I have reviewed Harald's diff and I agree with him on the fix and his comment, but I would like someone to review it again and commit it if it's correct.

          Kawaguchi-san, I've assigned the issue to you since as far as I can tell you implemented the pipe throttling mechanism and might know that area the best.

          Regarding the fix: the added .get() does indeed seem to be necessary to ensure that the flush actually finishes, and this is what Channel.localSyncIO() does as well.

          Thanks a lot.

          Rob Johnston added a comment -

          I've applied Harald Kahlfeld's patch to the latest code and modified the supplied test case so that 100 attempts are made and the number of failures is reported, as such:

          $ cat test.sh
          i=0
          fails=0
          while [ $i -lt 100 ]; do
                  i=`expr $i + 1`
                  echo $i
                  wget -q -O test.zip 'http://localhost/jenkins/job/test/ws/*zip*/test.zip' && \
                  unzip -l test.zip > /dev/null || fails=`expr $fails + 1`
          done
          echo "failures: " $fails
          exit 0
          

          The patch is not a perfect fix, but it's considerably better than the current released version. Using the patched war, 6/100 attempts to download and extract the zip failed. The unpatched war gave a 100% failure rate.

          I think this is acceptable, given that it's currently impossible for users to extract zips downloaded from Jenkins.

          Harald Kahlfeld added a comment -

          First of all, thank you Rob for checking the above patch and requesting a pull on GitHub.

          The 6/100 ratio you mentioned gave me some unease, so I got the 1.414 sources, applied my patch and re-checked. My ratio was 0/100, i.e. no failures with the patched war.

          Just to be on the safe side: my environment is an x64 Ubuntu 11.04 (classic mode though) with Sun Java 1.6.0_24 installed.

          Kohsuke Kawaguchi added a comment -

          Thank you for filing the pull request and bringing this to my attention, and my apologies for not noticing it sooner. I think I understand the issue now and I believe the proposed patch should fix the problem (so I'm with Harald in that I don't understand why it still fails 6/100 in Rob's deployment).

          However, I think the patch stops short of addressing the root cause of the problem, which is that there's a race condition between the queued remote I/O operations and the return from the closure call. In this case, the remote call (at FilePath.archive) executes close, so this patch fixes it, but generally speaking there's no such requirement. It would also have been OK to have the closure write a bunch of stuff to RemoteOutputStream, then return from the closure, then close it (or, instead of closing it, keep writing more stuff to it). Such patterns can be seen elsewhere, for example in JENKINS-7871.

          So I'm fixing this in a bit different way, such that we keep track of the I/O operations made during a closure execution.

          If you can try it out, that would be greatly appreciated.
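
          For anyone trying to follow the change, the general shape of "keep track of the I/O operations made during a closure execution" might look roughly like this (hypothetical names, an illustration of the idea only, not the actual commit):

            import java.util.ArrayList;
            import java.util.List;
            import java.util.concurrent.ExecutorService;
            import java.util.concurrent.Executors;
            import java.util.concurrent.Future;

            // Illustration only: every pipe write scheduled while a remote call runs is
            // recorded, and the call's response is returned only after all of those
            // writes have completed, so the caller can never observe a half-written pipe.
            class ClosureIoTracker {
                private final ExecutorService ioExecutor = Executors.newCachedThreadPool();
                private final List<Future<?>> pendingIo = new ArrayList<>();

                // Called whenever the closure being executed writes to a remoted pipe.
                synchronized void scheduleWrite(Runnable write) {
                    pendingIo.add(ioExecutor.submit(write));
                }

                // Called just before the Response for the closure is sent back.
                synchronized void syncIo() throws Exception {
                    for (Future<?> f : pendingIo) {
                        f.get();                 // block until this write has landed
                    }
                    pendingIo.clear();
                }
            }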

          dogfood added a comment -

          Integrated in jenkins_main_trunk #864
          [FIXED JENKINS-9189] fixed the race condition between I/O operation and the return of the Channel.call() execution in more fundamental way.

          Kohsuke Kawaguchi : 9cdd9cc0c5640beeb6bf36a4b26fa1ddcce7fd60
          Files :

          • remoting/src/main/java/hudson/remoting/ProxyOutputStream.java
          • remoting/src/main/java/hudson/remoting/Request.java

          Harald Kahlfeld added a comment -

          After checking with the latest jenkins.war (http://ci.jenkins-ci.org/job/jenkins_main_trunk/864/artifact/war/target/jenkins.war) and some test runs with differently sized zip files, all runs have had a 0/100 failure rate so far. It seems the new handling solved the initial problem. Thanks again for solving this issue in a more appropriate way.

          Rob Johnston added a comment - edited

          Sorry for the delay in replying; we've had a long weekend here.

          My deployment is on an x64 CentOS 5.5 VM (LabManager), with a Sun 64-bit JDK 1.6.0_22. I retried the patched war (original patch) and got the same result. All failures show the following error:

            End-of-central-directory signature not found.  Either this file is not
            a zipfile, or it constitutes one disk of a multi-part archive.  In the
            latter case the central directory and zipfile comment will be found on
            the last disk(s) of this archive.
          unzip:  cannot find zipfile directory in one of test.zip or
                  test.zip.zip, and cannot find test.zip.ZIP, period.
          

          I've re-run the test using the latest jenkins war (http://ci.jenkins-ci.org/job/jenkins_main_trunk/880/artifact/war/target/jenkins.war) and consistently get a 0/100 failure rate, so I guess the more general fix has overcome whatever strange problem my setup was causing.

          Thanks for the fix(es)!

          edit: hit "3" instead of "6" - we're using a 1.6 JDK

          SCM/JIRA link daemon added a comment -

          Code changed in jenkins
          User: Kohsuke Kawaguchi
          Path:
          remoting/src/main/java/hudson/remoting/ProxyOutputStream.java
          remoting/src/main/java/hudson/remoting/Request.java
          http://jenkins-ci.org/commit/jenkins/ef2c8a7d119611a40dfbca91e8c26af9ad8dcbb5
          Log:
          [FIXED JENKINS-9189] fixed the race condition between I/O operation and the return of the Channel.call() execution in more fundamental way.
          (cherry picked from commit 9cdd9cc0c5640beeb6bf36a4b26fa1ddcce7fd60)

          SCM/JIRA link daemon added a comment -

          Code changed in jenkins
          User: Kohsuke Kawaguchi
          Path:
          remoting/src/main/java/hudson/remoting/ProxyOutputStream.java
          remoting/src/main/java/hudson/remoting/Request.java
          http://jenkins-ci.org/commit/jenkins/9cdd9cc0c5640beeb6bf36a4b26fa1ddcce7fd60
          Log:
          [FIXED JENKINS-9189] fixed the race condition between I/O operation and the return of the Channel.call() execution in more fundamental way.

          SCM/JIRA link daemon added a comment -

          Code changed in jenkins
          User: Kohsuke Kawaguchi
          Path:
          src/main/java/hudson/remoting/ProxyOutputStream.java
          src/main/java/hudson/remoting/Request.java
          http://jenkins-ci.org/commit/remoting/8ffed0da4996934bfc28bf6b08c258d367a1c526
          Log:
          [JENKINS-11251 JENKINS-9189] Resurrecting what's deleted in e0e154d12d7a10759287b187467389c6e643c12b

          When communicating with remoting < 2.15, this allows them to continue to
          perform some degree of syncing, so that they can still enjoy the fix for
          JENKINS-9189.

          None of these code is exposed via API outside remoting, so at some point
          we can revert this change to simplify the code a bit and eliminate the
          redundancy, because as long as >= 2.15 remoting talk to each other,
          PipeWriter does everything we need.

          SCM/JIRA link daemon added a comment -

          Code changed in jenkins
          User: Kohsuke Kawaguchi
          Path:
          src/main/java/hudson/remoting/Channel.java
          src/main/java/hudson/remoting/Pipe.java
          src/main/java/hudson/remoting/PipeWriter.java
          src/main/java/hudson/remoting/ProxyOutputStream.java
          src/main/java/hudson/remoting/Request.java
          src/main/java/hudson/remoting/Response.java
          http://jenkins-ci.org/commit/remoting/e0e154d12d7a10759287b187467389c6e643c12b
          Log:
          [FIXED JENKINS-11251] reimplemented I/O and Request/Response sync

          See PipeWriter javadoc for the discussion and the context of this.
          This change re-implements the original fix for JENKINS-9189.

            Assignee: Kohsuke Kawaguchi (kohsuke)
            Reporter: Uwe Stuehler (ustuehler)
            Votes: 6
            Watchers: 7
