Jenkins / JENKINS-7836

Clover and Cobertura parsing on Hudson master fails because of invalid XML


      Description

      For a few days now, on our Apache Hudson installation, parsing of Clover's clover.xml or Cobertura's coverage.xml file has been failing (but not in all cases; sometimes it simply passes with the same build and the same job configuration). This only happens after the file is transferred to the master; the reports and the XML file are created on the Hudson slave. It seems like the network code somehow breaks the XML file during transfer to the master.

      Downloading the clover.xml from the workspace to my local computer and validating it confirms that it is correctly formatted and has no XML parse errors.
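
      For reference, that kind of local well-formedness check can be done with the standard JDK parser (a minimal sketch; the file path is just an example):

      import java.io.File;
      import javax.xml.parsers.DocumentBuilderFactory;

      // Minimal well-formedness check for a downloaded coverage report
      // (example only; point it at wherever the file was saved locally).
      public class ValidateCoverageXml {
          public static void main(String[] args) throws Exception {
              File report = new File(args.length > 0 ? args[0] : "clover.xml");
              // parse() throws a SAXParseException if the document is not well-formed XML.
              DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(report);
              System.out.println(report + " is well-formed XML");
          }
      }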

            Activity

            tdunning tdunning added a comment -

            Also occurs with the Mahout Apache project:

            https://hudson.apache.org/hudson/job/Mahout-Quality/buildTimeTrend

            Note how one slave (solaris1) seems reliable and the other seems to fail about 50% of the time. This job has been stable for eons with no changes
            to the project itself.

            Apache recently changed to the new remoting code and has restarted the slaves in question.

            swestcott Simon Westcott added a comment -

            We started to experience this issue about a week ago - previously stable builds now intermittently fail with:

            FATAL: Unable to copy coverage from /xxx/ to /yyy/
            hudson.util.IOException2: Cannot parse coverage results

            Downloading the XML file locally proves the file is valid XML.

            stubbs Stubbs added a comment -

            I'd like to add to Simon's comments and say that our build box is almost useless to us at the moment. We currently have 27 "broken" builds, but 5 minutes ago it was a lot higher. I kicked off a build for each of the broken jobs and about 6 or 7 started to work again; others are still running and will no doubt bring that number down lower again.

            I know, though, that when I come in in the morning the wallboard will be a sea of red and I'll have no idea which are actual broken builds and which are caused by this problem. We have over 100 builds and at present ~30 are showing as broken. Some are config issues because we have just migrated from CruiseControl, a couple are genuine failures and others are caused by this bug, but without going through each one, I can't tell which is which.

            I'd argue that this needs to be a higher priority than Major; it pretty much means we have to turn off publishing for our Cobertura/Clover-based reports.

            tdunning tdunning added a comment -

            One additional note from the Apache side.

            This problem seems to be both intermittent and host-specific for us. We have two nearly identical build VMs and one fails about 50% of the time and one doesn't (so far). Our infra guys claim no difference, but the faster one is the one that fails.

            I know that this isn't a big hint, but it might eliminate some hypotheses since it does (weakly) imply that the problem is somehow environmentally related.

            stubbs Stubbs added a comment -

            We have 15 slaves; they're all virtual machines running on a Xen host, they all started from the same image, and the only differences between them are the jobs they may or may not have run in the months since we made the cluster live.

            We see the same issue, with the problem being intermittent and host-specific. The slaves are called BuildSlave03, BuildSlave05 and BuildSlave15. There may be more affected, but it's hard to tell right now.

            stubbs Stubbs added a comment -

            Stack trace from the Hudson master's log at the time the build failed with this error.

            fpavageau fpavageau added a comment -

            The problem started with 1.378. I've tried almost all versions since; they all exhibit the same behavior. We're staying with 1.377 at the moment.

            I think it might be related to the fixes for http://issues.jenkins-ci.org/browse/JENKINS-5977, with the channel to the slave being closed too early sometimes. It seems to happen more under heavy load.
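
            To make that hypothesis concrete, here is a simplified sketch (not the actual Hudson remoting code) of how closing the reading side before the slave has flushed everything yields a truncated but plausible-looking file:

            import java.io.*;

            // Simplified illustration of the suspected failure mode: the master-side
            // stream is closed before the slave has delivered the whole document, so
            // the copy contains only the beginning of the XML.
            public class TruncatedCopyDemo {
                public static void main(String[] args) throws Exception {
                    PipedOutputStream slaveSide = new PipedOutputStream();
                    PipedInputStream masterSide = new PipedInputStream(slaveSide);

                    Thread writer = new Thread(() -> {
                        try (OutputStream out = slaveSide) {
                            out.write("<coverage>".getBytes());
                            Thread.sleep(200);                    // slave is still producing output...
                            out.write("</coverage>".getBytes());  // ...but the pipe is already closed
                        } catch (Exception ignored) {
                        }
                    });
                    writer.start();

                    byte[] buf = new byte[1024];
                    int n = masterSide.read(buf);  // first chunk arrives
                    masterSide.close();            // reading side torn down too early
                    writer.join();

                    System.out.println("archived copy: " + new String(buf, 0, n));  // only "<coverage>"
                }
            }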

            carltonb carltonb added a comment -

            We are seeing the problem as well... for every build that threw an IOException when publishing cobertura.xml, the original cobertura.xml is valid but the archived copy is truncated. We are using Hudson 1.384.

            hudsonfsc hudsonfsc added a comment -

            Same issue

            maxence maxence added a comment -

            Same issue here, occurs with Hudson 1.384 on JDK 1.6.0_u22, with master running Debian 5.0 and slave running Ubuntu 10.04.

            erickerickson erickerickson added a comment -

            +1. It's really hard to convince people to try any trunk code when they look and see all the "failures", even when they don't really have anything to do with the project.

            I'm coming at this from a Solr angle.

            Erick

            npellow npellow added a comment -

            Hi,
            As far as I can tell, this is neither an issue with Clover nor Cobertura, but rather with how build artefacts are passed between agent and server.

            If there is anything Clover could do to improve this, please let us know.

            Cheers,
            Nick Pellow
            Atlassian Clover.

            thetaphi Uwe Schindler added a comment -

            Hi Nick,

            Thanks for the support! I also think this is a problem with artifact publishing; I just wanted to report it to you as well.

            hbjastad hbjastad added a comment -

            +1
            We tried earlier to upgrade from 1.377, but due to this problem we had to revert to 1.377. We tried to upgrade again yesterday, but got the same problem. So it definitely seems to be a problem that was introduced in 1.378 - and hasn't been fixed since.

            stubbs Stubbs added a comment -

            We get multiple examples of this every night, so if you need any extra info just give me a shout.

            tdunning tdunning added a comment -

            Is anybody actually looking into this issue?

            It seems to affect lots of high-profile installs of Hudson. Surely it isn't good for dozens of projects at Apache to have the impression that Hudson is this unstable.

            jlaurila jlaurila added a comment -

            +1 - Have this happening on 1.385.

            Our configuration has a master and 2-7 slaves. All run CentOS 5.5, Sun JDK from RPM jdk-1.6.0_21-fcs except one slave which runs Fedora 12. Each machine has only one executor due to Ivy locking issues.

            thetaphi Uwe Schindler added a comment -

            Hudson at Apache was updated to v1.395 at the weekend. Since then the error happens less often (most builds succeed now), but today we got a new stack trace related to this:

            Publishing Clover coverage report...
            Publishing Clover HTML report...
            Publishing Clover XML report...
            FATAL: Unable to copy coverage from /home/hudson/hudson-slave/workspace/Solr-trunk/checkout/solr/build/tests/clover/reports to /home/hudson/hudson/jobs/Solr-trunk/builds/2011-01-25_08-13-29
            hudson.util.IOException2: Failed to copy /home/hudson/hudson-slave/workspace/Solr-trunk/checkout/solr/build/tests/clover/reports/clover.xml to /home/hudson/hudson/jobs/Solr-trunk/builds/2011-01-25_08-13-29/clover.xml
            	at hudson.FilePath.copyTo(FilePath.java:1374)
            	at hudson.plugins.clover.CloverPublisher.copyXmlReport(CloverPublisher.java:233)
            	at hudson.plugins.clover.CloverPublisher.perform(CloverPublisher.java:157)
            	at hudson.tasks.BuildStepMonitor$3.perform(BuildStepMonitor.java:36)
            	at hudson.model.AbstractBuild$AbstractRunner.perform(AbstractBuild.java:622)
            	at hudson.model.AbstractBuild$AbstractRunner.performAllBuildSteps(AbstractBuild.java:601)
            	at hudson.model.AbstractBuild$AbstractRunner.performAllBuildSteps(AbstractBuild.java:579)
            	at hudson.model.Build$RunnerImpl.post2(Build.java:156)
            	at hudson.model.AbstractBuild$AbstractRunner.post(AbstractBuild.java:548)
            	at hudson.model.Run.run(Run.java:1386)
            	at hudson.model.FreeStyleBuild.run(FreeStyleBuild.java:46)
            	at hudson.model.ResourceController.execute(ResourceController.java:88)
            	at hudson.model.Executor.run(Executor.java:145)
            Caused by: java.io.IOException: Bad file descriptor
            	at java.io.FileOutputStream.close0(Native Method)
            	at java.io.FileOutputStream.close(FileOutputStream.java:279)
            	at hudson.FilePath.copyTo(FilePath.java:1371)
            	... 12 more
            Email was triggered for: Failure
            

            Is this connected to this issue, or is it a new one?

            lacostej lacostej added a comment -

            If you are affected by this problem and aren't afraid of patching and building from source, you might want to try
            https://github.com/lacostej/jenkins/commit/31b8361d3152fb7970e1c11c906a763fa1aa5c25

            I will try making a test build for those who want to test it.

            scm_issue_link SCM/JIRA link daemon added a comment -

            Code changed in jenkins
            User: Jerome Lacoste
            Path:
            remoting/src/main/java/hudson/remoting/ProxyInputStream.java
            http://jenkins-ci.org/commit/core/31b8361d3152fb7970e1c11c906a763fa1aa5c25
            Log:
            JENKINS-7836 tentative fix for the copy from slave to master issues. The problem looks similar to JENKINS-7745, so we might as well synchronized the ProxyInputStream.
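
            In essence, the change serializes access to the proxied stream so that a concurrent close cannot interleave with an in-flight read. A simplified illustration of that idea (not the actual remoting code) looks like this:

            import java.io.IOException;
            import java.io.InputStream;

            // Simplified illustration of the approach in the commit above: guard the
            // read/close operations of the wrapped stream with a single lock so that a
            // concurrent close cannot interrupt a read that is still in progress.
            class SynchronizedProxyInputStream extends InputStream {
                private final InputStream delegate;

                SynchronizedProxyInputStream(InputStream delegate) {
                    this.delegate = delegate;
                }

                @Override
                public synchronized int read() throws IOException {
                    return delegate.read();
                }

                @Override
                public synchronized int read(byte[] b, int off, int len) throws IOException {
                    return delegate.read(b, off, len);
                }

                @Override
                public synchronized void close() throws IOException {
                    delegate.close();
                }
            }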

            thetaphi Uwe Schindler added a comment -

            Hi Jerome,
            Thanks for the work done. As Apache is planning to move over to Jenkins, we hope to see this solved soon. Please keep me informed when a new release containing this fix is out!

            rickard_von_essen Rickard von Essen added a comment -

            Hi Jerome,

            I have built:
            https://github.com/jenkinsci/jenkins/commit/c15378ef53bbcbaac1df11662ac4e16d28a008ea
            where your fix is included, but it actually only seems to get worse with this build.

            (The only change I have made is to use jmdns-3.4.0 instead of the "bundled" version, see http://issues.hudson-ci.org/browse/HUDSON-8647)

            If you need any information (logs, stats, dumps) or if I can run any experimental code for you to help resolve this issue send me an e-mail at rickard.von.essen 'at' gmail.com

            robertredd robertredd added a comment -

            I can confirm this issue is still present in 1.398. I had thought some of the other related cases might have resolved it, but they haven't. I just wanted to confirm it's still an issue, as I know it is being worked on. Out of about 80 builds I run at night for statistics, about 10 will fail with this error.

            ccutrer Cody Cutrer added a comment -

            I also just experienced it again on 1.397, though with a much lower frequency than pre-1.397.

            oeuftete oeuftete added a comment -

            Still an issue in 1.399 (which included the fix for JENKINS-7809... so I guess that wasn't it).

            rogerzhang Roger Zhang added a comment -

            Does anyone know when this problem will be resolved?

            rshelley rshelley added a comment -

            To work around this, I've set up Sonar and am pushing all my code quality reports to it instead of reporting them in Jenkins. Even if this gets fixed, I doubt I'll go back; Sonar is just too nice and useful.

            kohsuke Kohsuke Kawaguchi added a comment -

            The links in the original description of this issue are no longer working, so I cannot be sure of the failure mode that the reporter saw. As such, I'm also unclear what later "me, too" comments are referring to.

            Now, I've just committed a fix to JENKINS-7871 toward 1.402, and I suspect that fixes this issue as well.

            If you continue to see a problem after 1.402, please open a separate issue, or reopen JENKINS-7871 (provided that the failure mode is the same), instead of reopening this issue, since it's unclear exactly what this issue is referring to.

            thetaphi Uwe Schindler added a comment -

            The links are no longer working, but the "URL" field still shows the issue report on the Apache mailing list: http://mail-archives.apache.org/mod_mbox/www-builds/201010.mbox/%3c007b01cb6af6$e5f74e00$b1e5ea00$@thetaphi.de%3e


              People

              Assignee:
              kohsuke Kohsuke Kawaguchi
              Reporter:
              thetaphi Uwe Schindler
              Votes:
              68
              Watchers:
              65
