
Clover and cobertura parsing on hudson master fails because of invalid XML

      For a few days now, on our Apache Hudson installation, parsing of Clover's clover.xml or Cobertura's coverage.xml file fails (but not in all cases; sometimes it simply passes with the same build and the same job configuration). This only happens after the transfer to the master; the reports and the XML file are created on a Hudson slave. It seems like the network code somehow breaks the XML file during the transfer to the master.

      Downloading the clover.xml from the workspace to my local computer and validating it confirms that it is correctly formatted and has no XML parse errors.
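
      As a sketch of that local check (the /tmp/clover.xml path is only an example, not the real job layout), the JDK's built-in parser is enough to confirm well-formedness:

      import java.io.File;
      import javax.xml.parsers.DocumentBuilderFactory;

      // Minimal well-formedness check for a coverage report downloaded from the
      // slave workspace; parse() throws SAXParseException on malformed XML.
      public class ValidateCoverageXml {
          public static void main(String[] args) throws Exception {
              File report = new File(args.length > 0 ? args[0] : "/tmp/clover.xml");
              DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
              dbf.setNamespaceAware(true);
              dbf.newDocumentBuilder().parse(report);
              System.out.println("Parsed OK: " + report);
          }
      }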

          [JENKINS-7836] Clover and cobertura parsing on hudson master fails because of invalid XML

          Uwe Schindler created issue -

          tdunning added a comment -

          Also occurs with the Apache Mahout project:

          https://hudson.apache.org/hudson/job/Mahout-Quality/buildTimeTrend

          Note how one slave (solaris1) seems reliable and the other seems to fail about 50% of the time. This job has been stable for eons with no changes
          to the project itself.

          Apache recently changed to the new remoting code and has restarted the slaves in question.


          Simon Westcott added a comment -

          We started to experience this issue about a week ago; previously stable builds now intermittently fail with:

          FATAL: Unable to copy coverage from /xxx/ to /yyy/
          hudson.util.IOException2: Cannot parse coverage results

          Downloading the XML file locally proves the file is valid XML.
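
          Since the same file parses fine locally, one way to narrow this down (a sketch; the path is a placeholder) would be to hash the report on the slave and again on the master after the copy and compare the digests; a mismatch would point at corruption during the remoting transfer:

          import java.io.InputStream;
          import java.nio.file.Files;
          import java.nio.file.Paths;
          import java.security.MessageDigest;

          // Prints an MD5 digest of a coverage report so the copy on the slave and
          // the copy that reached the master can be compared byte-for-byte.
          public class HashCoverageFile {
              public static void main(String[] args) throws Exception {
                  String path = args.length > 0 ? args[0] : "/tmp/coverage.xml";
                  MessageDigest md5 = MessageDigest.getInstance("MD5");
                  try (InputStream in = Files.newInputStream(Paths.get(path))) {
                      byte[] buf = new byte[8192];
                      for (int n; (n = in.read(buf)) != -1; ) {
                          md5.update(buf, 0, n);
                      }
                  }
                  StringBuilder hex = new StringBuilder();
                  for (byte b : md5.digest()) {
                      hex.append(String.format("%02x", b));
                  }
                  System.out.println(hex + "  " + path);
              }
          }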

          torbent made changes -
          Link New: This issue is related to JENKINS-7809 [ JENKINS-7809 ]

          Stubbs added a comment -

          I'd like to add to Simon's comments and say that our build box is almost useless to us at the moment. We currently have 27 "broken" builds, but 5 minutes ago it was a lot higher. I kicked off a build for each of the broken jobs, and about 6 or 7 started to work again; others are still running and will no doubt bring that number down further.

          I know, though, that when I come in in the morning the wallboard will be a sea of red and I'll have no idea which are actually broken builds and which are caused by this problem. We have over 100 builds and at present ~30 are showing as broken. Some are config issues because we have just migrated from CruiseControl, a couple are genuine failures, and others are caused by this bug, but without going through each one, I can't tell which is which.

          I'd argue that this needs to be a higher priority than Major; it pretty much means we have to turn off publishing for our Cobertura/Clover-based reports.


          tdunning added a comment -

          One additional note from the Apache side.

          This problem seems to be both intermittent and host-specific for us. We have two nearly identical build VMs; one fails about 50% of the time and one doesn't (so far). Our infra guys claim there is no difference, but the faster one is the one that fails.

          I know that this isn't a big hint, but it might eliminate some hypotheses since it does (weakly) imply that the problem is somehow environmentally related.

          Uwe Schindler made changes -
          Priority Original: Major [ 3 ] New: Critical [ 2 ]

          Stubbs added a comment -

          We have 15 slaves; they're all virtual machines running on a Xen host. They all started from the same image, and the only differences between them are the jobs they may or may not have run in the months since we made the cluster live.

          We see the same issue, with the problem being intermittent and host-specific. The affected slaves are called BuildSlave03, BuildSlave05 & BuildSlave15. There may be more affected, but it's hard to tell right now.

          rshelley made changes -
          Link New: This issue is duplicated by JENKINS-7897 [ JENKINS-7897 ]

          Stubbs added a comment -

          Stack trace from the master Hudson's log at the time the build failed with this error.

          Stubbs made changes -
          Attachment New: JENKINS-7836-stacktrace.txt [ 19949 ]

            Assignee: Kohsuke Kawaguchi (kohsuke)
            Reporter: Uwe Schindler (thetaphi)
            Votes: 68
            Watchers: 65
