Uploaded image for project: 'Jenkins'
  1. Jenkins
  2. JENKINS-9189

truncation or corruption of zip workspace archive from slave

    • Icon: Bug Bug
    • Resolution: Fixed
    • Icon: Major Major
    • core
    • None
    • Hudson/Jenkins >= 1.378, Debian, slave connected via SSH

      Downloading a ZIP archive of the workspace from a project that was built on a slave appears to be broken since Hudson 1.378 (found by bisection between 1.365 and 1.406-SNAPSHOT). It worked and still works when the project was built on the master, so no remoting takes place.

      How to reproduce:

      1. Set up a free-style project that just creates a few files in the workspace, such as:
        env > env.txt
        ls -la > ls-la.txt
        dd if=/dev/urandom of=random.bin bs=512 count=2048
        
      2. Restrict this project to run on a slave (connected via SSH in my case).
      3. Run this project.
      4. Using the "(all files in zip)" link, download the workspace and verify the downloaded Zip archive. With 1.377 and before, you can run the download and verification step in a loop from the command line for 100 times in a row without error. Since 1.378, it will usually fail at the second attempt and will, on first glance at the hexdump, look like a correct but truncated Zip archive. The script that I used for testing is this:
        $ cat test.sh 
        i=0
        while [ $i -lt 100 ]; do
                i=`expr $i + 1`
                echo $i
                wget -q -O test.zip 'http://localhost/jenkins/job/test/ws/*zip*/test.zip' && \
                unzip -l test.zip > /dev/null || exit $?
        done
        exit 0
        

      Known workaround:

      • Run the job on the Jenkins master. (This isn't an option in our setup.)

      Possibly related issues:

      The changelog of 1.378 mentions JENKINS-5977 "Improving the master/slave communication to avoid pipe clogging problem." and I suspect that this change introduced the problem. A later changelog entry for 1.397 mentions that it fixed "a master/slave communication problem since 1.378" (JENKINS-7745). However, using the steps described above I can still reproduce at least this issue, even in the current version 1.404 and the latest snapshot.

      As suggested in comments of other issues touching on the field of master/slave communication, it would seem reasonable to assume that this issue could be caused by a missing flush operation on an output stream, or something to that effect. Another possibility, however likely, might be the suspected thread concurrency problem noted in remoting/src/main/java/hudson/remoting/PipeWindow.java, where it also mentions the issues JENKINS-7745 or JENKINS-7581.

          [JENKINS-9189] truncation or corruption of zip workspace archive from slave

          First of all, thank you Rob for checking the above patch and requesting a pull at git-hub.

          The ratio 6/100 you mentioned gave me some unease and so I got the 1.414 sources, applied my patch and re-checked again. My ratio was 0/100 or no failures with the patched war.

          Just to be on the safe side. My environment is a x64 ubuntu 11.04 (classic mode thought) with sun java 1.6.0_24 installed.

          Harald Kahlfeld added a comment - First of all, thank you Rob for checking the above patch and requesting a pull at git-hub. The ratio 6/100 you mentioned gave me some unease and so I got the 1.414 sources, applied my patch and re-checked again. My ratio was 0/100 or no failures with the patched war. Just to be on the safe side. My environment is a x64 ubuntu 11.04 (classic mode thought) with sun java 1.6.0_24 installed.

          Thank you for filing the pull request and bringing this to my attention, and my apologies for not noticing it sooner. I think I understand the issue now and I believe the proposed patch should fix the problem (so I'm with Harald that I don't understand why it still fails 6/100 at Rob's deployment.)

          However, I think the patch stops short of addressing the root cause of the problem, which is that there's a race condition between the queued remote I/O operations vs the return from the closure call. In this case, the remote call (at FilePath.archive) executes close, so this patch fixes it, but generally speaking there's no such requirement. It would have been also OK to have the closure write a bunch of stuff to RemoteOutputStream, then return from the closure, then close it (or instead of closing it, keep writing more stuff at it.) Such patterns can be seen elsewhere, such as JENKINS-7871.

          So I'm fixing this in a bit different way, such that we keep track of the I/O operations made during a closure execution.

          If you can try it out, that would be greatly appreciated.

          Kohsuke Kawaguchi added a comment - Thank you for filing the pull request and bringing this to my attention, and my apologies for not noticing it sooner. I think I understand the issue now and I believe the proposed patch should fix the problem (so I'm with Harald that I don't understand why it still fails 6/100 at Rob's deployment.) However, I think the patch stops short of addressing the root cause of the problem, which is that there's a race condition between the queued remote I/O operations vs the return from the closure call. In this case, the remote call (at FilePath.archive) executes close, so this patch fixes it, but generally speaking there's no such requirement. It would have been also OK to have the closure write a bunch of stuff to RemoteOutputStream, then return from the closure, then close it (or instead of closing it, keep writing more stuff at it.) Such patterns can be seen elsewhere, such as JENKINS-7871 . So I'm fixing this in a bit different way, such that we keep track of the I/O operations made during a closure execution. If you can try it out, that would be greatly appreciated.

          dogfood added a comment -

          Integrated in jenkins_main_trunk #864
          [FIXED JENKINS-9189] fixed the race condition between I/O operation and the return of the Channel.call() execution in more fundamental way.

          Kohsuke Kawaguchi : 9cdd9cc0c5640beeb6bf36a4b26fa1ddcce7fd60
          Files :

          • remoting/src/main/java/hudson/remoting/ProxyOutputStream.java
          • remoting/src/main/java/hudson/remoting/Request.java

          dogfood added a comment - Integrated in jenkins_main_trunk #864 [FIXED JENKINS-9189] fixed the race condition between I/O operation and the return of the Channel.call() execution in more fundamental way. Kohsuke Kawaguchi : 9cdd9cc0c5640beeb6bf36a4b26fa1ddcce7fd60 Files : remoting/src/main/java/hudson/remoting/ProxyOutputStream.java remoting/src/main/java/hudson/remoting/Request.java

          After checking with the latest jenkins.war (http://ci.jenkins-ci.org/job/jenkins_main_trunk/864/artifact/war/target/jenkins.war) and some test runs with differently sized zip files, all runs had a 0/100 failure rate so far. Seems the new handling solved the initial problem. Thanks again for solving this issue in a more appropriate way.

          Harald Kahlfeld added a comment - After checking with the latest jenkins.war ( http://ci.jenkins-ci.org/job/jenkins_main_trunk/864/artifact/war/target/jenkins.war ) and some test runs with differently sized zip files, all runs had a 0/100 failure rate so far. Seems the new handling solved the initial problem. Thanks again for solving this issue in a more appropriate way.

          Rob Johnston added a comment - - edited

          Sorry of the delay in replying, we've had a long weekend here

          My deployment is on an x64 CentOS 5.5 VM (LabManager), with a Sun 64bit JDK 1.6.0_22. I retried the patched war (original patch) and got the same result. All failures are with the following error:

            End-of-central-directory signature not found.  Either this file is not
            a zipfile, or it constitutes one disk of a multi-part archive.  In the
            latter case the central directory and zipfile comment will be found on
            the last disk(s) of this archive.
          unzip:  cannot find zipfile directory in one of test.zip or
                  test.zip.zip, and cannot find test.zip.ZIP, period.
          

          I've re-run the test using the latest jenkins war (http://ci.jenkins-ci.org/job/jenkins_main_trunk/880/artifact/war/target/jenkins.war) and consistently get a 0/100 failure rate, so I guess the more general fix has overcome whatever strange problem my setp was causing.

          Thanks for the fix(es)!

          edit: hit "3" instead of "6" - we're using a 1.6 JDK

          Rob Johnston added a comment - - edited Sorry of the delay in replying, we've had a long weekend here My deployment is on an x64 CentOS 5.5 VM (LabManager), with a Sun 64bit JDK 1.6.0_22. I retried the patched war (original patch) and got the same result. All failures are with the following error: End-of-central-directory signature not found. Either this file is not a zipfile, or it constitutes one disk of a multi-part archive. In the latter case the central directory and zipfile comment will be found on the last disk(s) of this archive. unzip: cannot find zipfile directory in one of test.zip or test.zip.zip, and cannot find test.zip.ZIP, period. I've re-run the test using the latest jenkins war ( http://ci.jenkins-ci.org/job/jenkins_main_trunk/880/artifact/war/target/jenkins.war ) and consistently get a 0/100 failure rate, so I guess the more general fix has overcome whatever strange problem my setp was causing. Thanks for the fix(es)! edit: hit "3" instead of "6" - we're using a 1.6 JDK

          Code changed in jenkins
          User: Kohsuke Kawaguchi
          Path:
          remoting/src/main/java/hudson/remoting/ProxyOutputStream.java
          remoting/src/main/java/hudson/remoting/Request.java
          http://jenkins-ci.org/commit/jenkins/ef2c8a7d119611a40dfbca91e8c26af9ad8dcbb5
          Log:
          [FIXED JENKINS-9189] fixed the race condition between I/O operation and the return of the Channel.call() execution in more fundamental way.
          (cherry picked from commit 9cdd9cc0c5640beeb6bf36a4b26fa1ddcce7fd60)

          SCM/JIRA link daemon added a comment - Code changed in jenkins User: Kohsuke Kawaguchi Path: remoting/src/main/java/hudson/remoting/ProxyOutputStream.java remoting/src/main/java/hudson/remoting/Request.java http://jenkins-ci.org/commit/jenkins/ef2c8a7d119611a40dfbca91e8c26af9ad8dcbb5 Log: [FIXED JENKINS-9189] fixed the race condition between I/O operation and the return of the Channel.call() execution in more fundamental way. (cherry picked from commit 9cdd9cc0c5640beeb6bf36a4b26fa1ddcce7fd60)

          Code changed in jenkins
          User: Kohsuke Kawaguchi
          Path:
          remoting/src/main/java/hudson/remoting/ProxyOutputStream.java
          remoting/src/main/java/hudson/remoting/Request.java
          http://jenkins-ci.org/commit/jenkins/9cdd9cc0c5640beeb6bf36a4b26fa1ddcce7fd60
          Log:
          [FIXED JENKINS-9189] fixed the race condition between I/O operation and the return of the Channel.call() execution in more fundamental way.

          SCM/JIRA link daemon added a comment - Code changed in jenkins User: Kohsuke Kawaguchi Path: remoting/src/main/java/hudson/remoting/ProxyOutputStream.java remoting/src/main/java/hudson/remoting/Request.java http://jenkins-ci.org/commit/jenkins/9cdd9cc0c5640beeb6bf36a4b26fa1ddcce7fd60 Log: [FIXED JENKINS-9189] fixed the race condition between I/O operation and the return of the Channel.call() execution in more fundamental way.

          Code changed in jenkins
          User: Kohsuke Kawaguchi
          Path:
          remoting/src/main/java/hudson/remoting/ProxyOutputStream.java
          remoting/src/main/java/hudson/remoting/Request.java
          http://jenkins-ci.org/commit/jenkins/9cdd9cc0c5640beeb6bf36a4b26fa1ddcce7fd60
          Log:
          [FIXED JENKINS-9189] fixed the race condition between I/O operation and the return of the Channel.call() execution in more fundamental way.

          SCM/JIRA link daemon added a comment - Code changed in jenkins User: Kohsuke Kawaguchi Path: remoting/src/main/java/hudson/remoting/ProxyOutputStream.java remoting/src/main/java/hudson/remoting/Request.java http://jenkins-ci.org/commit/jenkins/9cdd9cc0c5640beeb6bf36a4b26fa1ddcce7fd60 Log: [FIXED JENKINS-9189] fixed the race condition between I/O operation and the return of the Channel.call() execution in more fundamental way.

          Code changed in jenkins
          User: Kohsuke Kawaguchi
          Path:
          src/main/java/hudson/remoting/ProxyOutputStream.java
          src/main/java/hudson/remoting/Request.java
          http://jenkins-ci.org/commit/remoting/8ffed0da4996934bfc28bf6b08c258d367a1c526
          Log:
          [JENKINS-11251 JENKINS-9189] Resurrecting what's deleted in e0e154d12d7a10759287b187467389c6e643c12b

          When communicating with remoting < 2.15, this allows them to continue to
          perform some degree of syncing, so that they can still enjoy the fix for
          JENKINS-9189.

          None of these code is exposed via API outside remoting, so at some point
          we can revert this change to simplify the code a bit and eliminate the
          redundancy, because as long as >= 2.15 remoting talk to each other,
          PipeWriter does everything we need.

          SCM/JIRA link daemon added a comment - Code changed in jenkins User: Kohsuke Kawaguchi Path: src/main/java/hudson/remoting/ProxyOutputStream.java src/main/java/hudson/remoting/Request.java http://jenkins-ci.org/commit/remoting/8ffed0da4996934bfc28bf6b08c258d367a1c526 Log: [JENKINS-11251 JENKINS-9189] Resurrecting what's deleted in e0e154d12d7a10759287b187467389c6e643c12b When communicating with remoting < 2.15, this allows them to continue to perform some degree of syncing, so that they can still enjoy the fix for JENKINS-9189 . None of these code is exposed via API outside remoting, so at some point we can revert this change to simplify the code a bit and eliminate the redundancy, because as long as >= 2.15 remoting talk to each other, PipeWriter does everything we need.

          Code changed in jenkins
          User: Kohsuke Kawaguchi
          Path:
          src/main/java/hudson/remoting/Channel.java
          src/main/java/hudson/remoting/Pipe.java
          src/main/java/hudson/remoting/PipeWriter.java
          src/main/java/hudson/remoting/ProxyOutputStream.java
          src/main/java/hudson/remoting/Request.java
          src/main/java/hudson/remoting/Response.java
          http://jenkins-ci.org/commit/remoting/e0e154d12d7a10759287b187467389c6e643c12b
          Log:
          [FIXED JENKINS-11251] reimplemented I/O and Request/Response sync

          See PipeWriter javadoc for the discussion and the context of this.
          This change re-implements the original fix for JENKINS-9189.

          SCM/JIRA link daemon added a comment - Code changed in jenkins User: Kohsuke Kawaguchi Path: src/main/java/hudson/remoting/Channel.java src/main/java/hudson/remoting/Pipe.java src/main/java/hudson/remoting/PipeWriter.java src/main/java/hudson/remoting/ProxyOutputStream.java src/main/java/hudson/remoting/Request.java src/main/java/hudson/remoting/Response.java http://jenkins-ci.org/commit/remoting/e0e154d12d7a10759287b187467389c6e643c12b Log: [FIXED JENKINS-11251] reimplemented I/O and Request/Response sync See PipeWriter javadoc for the discussion and the context of this. This change re-implements the original fix for JENKINS-9189 .

            kohsuke Kohsuke Kawaguchi
            ustuehler Uwe Stuehler
            Votes:
            6 Vote for this issue
            Watchers:
            7 Start watching this issue

              Created:
              Updated:
              Resolved: