truncation or corruption of zip workspace archive from slave

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Component: core
    • Labels: None
    • Environment: Hudson/Jenkins >= 1.378, Debian, slave connected via SSH

      Downloading a ZIP archive of the workspace from a project that was built on a slave appears to be broken since Hudson 1.378 (found by bisection between 1.365 and 1.406-SNAPSHOT). It worked, and still works, when the project was built on the master, where no remoting takes place.

      How to reproduce:

      1. Set up a free-style project that just creates a few files in the workspace, such as:
        env > env.txt
        ls -la > ls-la.txt
        dd if=/dev/urandom of=random.bin bs=512 count=2048
        
      2. Restrict this project to run on a slave (connected via SSH in my case).
      3. Run this project.
      4. Using the "(all files in zip)" link, download the workspace and verify the downloaded Zip archive. With 1.377 and earlier, you can run the download and verification step in a loop from the command line 100 times in a row without error. Since 1.378, it will usually fail at the second attempt and, at first glance at the hexdump, the result looks like a correct but truncated Zip archive. The script I used for testing is this:
        $ cat test.sh 
        i=0
        while [ $i -lt 100 ]; do
                i=`expr $i + 1`
                echo $i
                wget -q -O test.zip 'http://localhost/jenkins/job/test/ws/*zip*/test.zip' && \
                unzip -l test.zip > /dev/null || exit $?
        done
        exit 0
        

      Known workaround:

      • Run the job on the Jenkins master. (This isn't an option in our setup.)

      Possibly related issues:

      The changelog of 1.378 mentions JENKINS-5977 "Improving the master/slave communication to avoid pipe clogging problem.", and I suspect that this change introduced the problem. A later changelog entry, for 1.397, mentions a fix for "a master/slave communication problem since 1.378" (JENKINS-7745). However, using the steps described above I can still reproduce this issue, even with the current version 1.404 and the latest snapshot.

      As suggested in comments on other issues touching on master/slave communication, it seems reasonable to assume that this issue is caused by a missing flush operation on an output stream, or something to that effect. Another possibility, perhaps more likely, is the suspected thread concurrency problem noted in remoting/src/main/java/hudson/remoting/PipeWindow.java, which also mentions the issues JENKINS-7745 and JENKINS-7581.
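The missing-flush hypothesis can be illustrated with a minimal sketch (plain Java, not Jenkins remoting code; NoFlushClose is a made-up class): if close() tears down the underlying stream without flushing buffered data first, the receiver sees a correct-looking but truncated stream, which matches the observed symptom.

```java
import java.io.*;

public class FlushDemo {
    // A buffered writer whose close() forgets to flush -- analogous to an
    // EOF command racing ahead of still-buffered Chunk data.
    static class NoFlushClose extends FilterOutputStream {
        private final ByteArrayOutputStream buf = new ByteArrayOutputStream();
        NoFlushClose(OutputStream out) { super(out); }
        @Override public void write(int b) { buf.write(b); }        // buffer only
        @Override public void flush() throws IOException {
            out.write(buf.toByteArray());                           // deliver buffered data
            buf.reset();
            out.flush();
        }
        @Override public void close() throws IOException {
            out.close();                                            // BUG: no flush() first
        }
    }

    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream receiver = new ByteArrayOutputStream();
        try (NoFlushClose pipe = new NoFlushClose(receiver)) {
            pipe.write("first chunk ".getBytes());
            pipe.flush();                            // delivered to the receiver
            pipe.write("second chunk".getBytes());   // still buffered when close() runs
        }
        System.out.println(receiver);                // prints "first chunk " -- truncated
    }
}
```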

          [JENKINS-9189] truncation or corruption of zip workspace archive from slave

          Uwe Stuehler created issue -

          Uwe Stuehler added a comment -

          This patch, which reverts to the old behavior of using a SynchronousExecutorService in hudson.remoting.Channel, seems to be another workaround.

          Uwe Stuehler made changes -
          Attachment New: Channel_createPipeWriter.diff [ 20336 ]

          Uwe Stuehler added a comment - edited

          With this diff and the log level set to FINE, we get something like this when retrieving the workspace Zip file:

          Apr 11, 2011 8:33:41 PM hudson.remoting.ProxyOutputStream
          FINE: oid=5 Yaaawn!
          Apr 11, 2011 8:33:41 PM hudson.model.Queue
          FINE: Queue maintenance started hudson.model.Queue@6e135779
          Apr 11, 2011 8:33:38 PM hudson.remoting.Channel
          FINE: Send Pipe.Ack(5,2)
          [...]
          Apr 11, 2011 8:33:38 PM hudson.remoting.Channel
          FINE: Send Pipe.Ack(5,18)
          Apr 11, 2011 8:33:38 PM hudson.remoting.Channel
          FINE: Send Pipe.Ack(5,512)
          Apr 11, 2011 8:33:38 PM hudson.remoting.Channel
          FINE: Received Response[retVal=hudson.remoting.UserResponse@4b76ffeb,exception=null]
          Apr 11, 2011 8:33:38 PM hudson.remoting.Channel
          FINE: Received Pipe.EOF(5)
          Apr 11, 2011 8:33:38 PM hudson.remoting.Channel
          FINE: Received Pipe.Pause(5)
          Apr 11, 2011 8:33:38 PM hudson.remoting.Channel
          FINE: Received Pipe.Chunk(5,2)
          

          What this shows is that commands are executed in parallel: while Chunk commands are still being executed, an EOF command can start executing at the same time and close the stream. This seems at least a bit counterintuitive, given the documentation for Channel.send():

          /**
           * Sends a command to the remote end and executes it there.
           *
           * <p>
           * This is the lowest layer of abstraction in {@link Channel}.
           * {@link Command}s are executed on a remote system in the order they are sent.
           */
          /*package*/ synchronized void send(Command cmd) throws IOException {
          
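The observed interleaving can be reproduced in isolation with a short sketch (illustrative Java, not the actual hudson.remoting classes): send() is synchronized, so commands leave the channel in order, but executing each received command on a thread pool lets a cheap EOF overtake Chunk commands that are still running.

```java
import java.util.concurrent.*;
import java.util.*;

public class OrderDemo {
    // Submits three slow "Chunk" commands and one cheap "EOF" command, in
    // order, to a thread pool and records the order in which they complete.
    static List<String> run() throws Exception {
        List<String> log = Collections.synchronizedList(new ArrayList<>());
        ExecutorService pool = Executors.newFixedThreadPool(4);
        for (int i = 0; i < 3; i++) {
            final int n = i;
            pool.submit(() -> { sleep(50); log.add("Chunk(" + n + ")"); });
        }
        pool.submit(() -> log.add("EOF"));   // submitted last, usually finishes first
        pool.shutdown();
        pool.awaitTermination(5, TimeUnit.SECONDS);
        return log;
    }

    static void sleep(long ms) {
        try { Thread.sleep(ms); } catch (InterruptedException e) { }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(run());   // EOF typically appears before the Chunks
    }
}
```

With a SynchronousExecutorService (as in the attached workaround patch), each command would instead run to completion on the receiving thread before the next one starts, preserving send order.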

          Uwe Stuehler made changes -
          Attachment New: command-completion-fail.diff [ 20372 ]
          Uwe Stuehler made changes -
          Priority Original: Critical [ 2 ] New: Major [ 3 ]
          Harald Kahlfeld made changes -
          Attachment New: fix-JENKINS-9189.diff [ 20486 ]

          Harald Kahlfeld added a comment -

          After some more investigation we found that the issue is still present with the current master and the 1.412 release. Even after the flow was corrected, probably by other changes or fixes, so that the commands seemingly execute in order of appearance, the zip files still get corrupted.

          The issue seems to be a race condition between a Chunk command that has yet to complete and the EOF command closing the stream without flushing. The attached fix handles this situation by introducing, within the doClose() method, a call that flushes the stream by invoking the corresponding command. Furthermore, the implementation of the Flush command had to be changed to wait for the completion of its thread, ensuring this way that flush() has executed before the following EOF command closes the stream.

          Remark: The Flush command was added as described above because we didn't find a better place to put it. On the other hand, it allows a somewhat lazy call of the doClose method.
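The idea behind the fix can be sketched as follows (illustrative names, not the actual hudson.remoting API or the attached diff): before EOF, a flush task is submitted to the single pipe-writer executor and its Future is waited on, which guarantees that every previously queued chunk has been written before the stream is closed.

```java
import java.util.concurrent.*;

public class CloseWithFlush {
    // Single-threaded writer, standing in for the pipe-writer thread.
    private final ExecutorService pipeWriter = Executors.newSingleThreadExecutor();
    private final StringBuilder stream = new StringBuilder();

    // Chunks are written asynchronously, like Pipe.Chunk commands.
    void writeChunk(String data) {
        pipeWriter.submit(() -> stream.append(data));
    }

    // The .get() is the crucial part: it blocks until the flush task runs,
    // and since the executor is FIFO, until all earlier chunks have been
    // written -- only then is the EOF appended and the stream closed.
    void doClose() throws Exception {
        pipeWriter.submit(() -> { /* flush buffered data here */ }).get();
        stream.append("<EOF>");
        pipeWriter.shutdown();
    }

    String contents() { return stream.toString(); }
}
```

Without the .get(), doClose() could append the EOF while chunk tasks are still queued, reproducing the truncation.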


          Uwe Stuehler added a comment -

          I have reviewed Harald's diff and I agree with him on the fix and his comment, but I would like someone to review it again and commit it if it's correct.

          Kawaguchi-san, I've assigned the issue to you since, as far as I can tell, you implemented the pipe throttling mechanism and probably know that area best.

          Regarding the fix: the added .get() does indeed seem necessary to ensure the flush actually finishes, and this is what Channel.localSyncIO() does as well.

          Thanks a lot.

          Uwe Stuehler made changes -
          Assignee New: Kohsuke Kawaguchi [ kohsuke ]

            Assignee: Kohsuke Kawaguchi (kohsuke)
            Reporter: Uwe Stuehler (ustuehler)
            Votes: 6
            Watchers: 7