
Clover and cobertura parsing on hudson master fails because of invalid XML

      For a few days now, on our Apache Hudson installation, parsing of Clover's clover.xml or Cobertura's coverage.xml fails (but not in all cases; sometimes the same build with the same job configuration simply passes). The reports and XML files are created on a Hudson slave, and the failure only happens after they are transferred to the master. It seems like the network code somehow breaks the XML file during the transfer to the master.

      Downloading the clover.xml from the workspace to my local computer and validating it confirms that it is correctly formatted and has no XML parse errors.
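
      As a side note, a minimal sketch of that local validation step (the local file name and class name here are placeholders, not from the report) could look like this:

      import java.io.File;
      import javax.xml.parsers.DocumentBuilderFactory;

      public class CheckCloverXml {
          public static void main(String[] args) throws Exception {
              // Parse the locally downloaded report; a SAXParseException here would
              // indicate the same malformed XML the publisher complains about on the master.
              File report = new File("clover.xml"); // hypothetical local copy
              DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(report);
              System.out.println("clover.xml is well-formed XML");
          }
      }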


          tdunning added a comment -

          Also occurs with the Apache Mahout project:

          https://hudson.apache.org/hudson/job/Mahout-Quality/buildTimeTrend

          Note how one slave (solaris1) seems reliable and the other seems to fail about 50% of the time. This job has been stable for eons with no changes
          to the project itself.

          Apache recently changed to the new remoting code and has restarted the slaves in question.


          Simon Westcott added a comment -

          We started to experience this issue about a week ago; previously stable builds now intermittently fail with:

          FATAL: Unable to copy coverage from /xxx/ to /yyy/
          hudson.util.IOException2: Cannot parse coverage results

          Downloading the XML file locally proves the file is valid XML.


          Stubbs added a comment -

          I'd like to add to Simon's comments and say that our build box is almost useless to us at the moment. We currently have 27 "broken" builds, but 5 minutes ago it was a lot higher. I kicked off a build for each of the broken jobs and about 6 or 7 started to work again; others are still running and will no doubt bring that number down lower again.

          I know, though, that when I come in in the morning the wallboard will be a sea of red and I'll have no idea which are actual broken builds and which are caused by this problem. We have over 100 builds and at present ~30 are showing as broken. Some are config issues because we have just migrated from CruiseControl, a couple are genuine failures and others are caused by this bug, but without going through each one, I can't tell which is which.

          I'd argue that this needs to be a higher priority than Major; it pretty much means we have to turn off publishing for our Cobertura/Clover based reports.


          tdunning added a comment -

          One additional note from the Apache side.

          This problem seems to be both intermittent and host-specific for us. We have two nearly identical build VMs, and one fails about 50% of the time while the other doesn't (so far). Our infra guys claim there is no difference between them, but the faster one is the one that fails.

          I know that this isn't a big hint, but it might eliminate some hypotheses since it does (weakly) imply that the problem is somehow environmentally related.


          Stubbs added a comment -

          We have 15 slaves. They're all virtual machines running on a Xen host, they all started from the same image, and the only difference between them is the jobs they may or may not have run in the months since we made the cluster live.

          We see the same issue, with the problem being intermittent and host-specific. The affected slaves are called BuildSlave03, BuildSlave05 & BuildSlave15. There may be more affected, but it's hard to tell right now.


          Stubbs added a comment -

          Stack trace from the master Hudson's log at the time the build failed with this error.


          fpavageau added a comment -

          The problem started with 1.378. I've tried almost all the versions since, and they all exhibit the same behavior. We're staying on 1.377 for the moment.

          I think it might be related to the fixes for http://issues.jenkins-ci.org/browse/JENKINS-5977, with the channel to the slave being closed too early sometimes. It seems to happen more under heavy load.


          carltonb added a comment -

          We are seeing the problem as well... for every build that threw an IOException when publishing cobertura.xml, the original cobertura.xml is valid but the archived copy is truncated. We are using Hudson 1.384.
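
          For reference, a minimal sketch of one way to confirm the truncation carltonb describes (the paths are placeholders standing in for the redacted workspace and archive locations, not actual paths from this issue):

          import java.io.File;

          public class CompareCoverageCopies {
              public static void main(String[] args) {
                  File original = new File("/xxx/coverage.xml"); // copy produced in the slave workspace
                  File archived = new File("/yyy/coverage.xml"); // copy archived on the master
                  System.out.println("original: " + original.length() + " bytes");
                  System.out.println("archived: " + archived.length() + " bytes");
                  if (archived.length() < original.length()) {
                      System.out.println("archived copy is truncated");
                  }
              }
          }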


          hudsonfsc added a comment -

          Same issue


          maxence added a comment -

          Same issue here, occurs with Hudson 1.384 on JDK 1.6.0_u22, with master running Debian 5.0 and slave running Ubuntu 10.04.


          erickerickson added a comment -

          +1. It's really hard to convince people to try any trunk code when they look and see all the "failures", even when the failures don't really have anything to do with the project.

          I'm coming at this from a Solr angle.

          Erick


          npellow added a comment -

          Hi,
          As far as I can tell, this is neither an issue with Clover nor Cobertura, but rather with how build artefacts are passed between agent and server.

          If there is anything Clover could do to improve this, please let us know.

          Cheers,
          Nick Pellow
          Atlassian Clover.


          Uwe Schindler added a comment -

          Hi Nick,

          Thanks for the support! I also think this is a problem with artifact publishing; I just wanted to report it to you as well.


          hbjastad added a comment -

          +1
          We tried earlier to upgrade from 1.377, but due to this problem we had to revert back to it. We tried to upgrade again yesterday, but got the same problem. So it definitely seems to be a problem that was introduced in 1.378 - and hasn't been fixed since.


          Stubbs added a comment -

          We get multiple examples of this every night, so if you need any extra info just give me a shout.


          tdunning added a comment -

          Is anybody actually looking into this issue?

          It seems to affect lots of high-profile installs of Hudson. Surely it isn't good to have dozens of projects at Apache getting the impression that Hudson is this unstable.


          jlaurila added a comment -

          +1 - We have this happening on 1.385.

          Our configuration has a master and 2-7 slaves. All run CentOS 5.5, Sun JDK from RPM jdk-1.6.0_21-fcs except one slave which runs Fedora 12. Each machine has only one executor due to Ivy locking issues.


          Uwe Schindler added a comment -

          Hudson at Apache was updated to v1.395 over the weekend. Since then the error happens less often (most builds succeed now), but today we got a new stack trace related to this:

          Publishing Clover coverage report...
          Publishing Clover HTML report...
          Publishing Clover XML report...
          FATAL: Unable to copy coverage from /home/hudson/hudson-slave/workspace/Solr-trunk/checkout/solr/build/tests/clover/reports to /home/hudson/hudson/jobs/Solr-trunk/builds/2011-01-25_08-13-29
          hudson.util.IOException2: Failed to copy /home/hudson/hudson-slave/workspace/Solr-trunk/checkout/solr/build/tests/clover/reports/clover.xml to /home/hudson/hudson/jobs/Solr-trunk/builds/2011-01-25_08-13-29/clover.xml
          	at hudson.FilePath.copyTo(FilePath.java:1374)
          	at hudson.plugins.clover.CloverPublisher.copyXmlReport(CloverPublisher.java:233)
          	at hudson.plugins.clover.CloverPublisher.perform(CloverPublisher.java:157)
          	at hudson.tasks.BuildStepMonitor$3.perform(BuildStepMonitor.java:36)
          	at hudson.model.AbstractBuild$AbstractRunner.perform(AbstractBuild.java:622)
          	at hudson.model.AbstractBuild$AbstractRunner.performAllBuildSteps(AbstractBuild.java:601)
          	at hudson.model.AbstractBuild$AbstractRunner.performAllBuildSteps(AbstractBuild.java:579)
          	at hudson.model.Build$RunnerImpl.post2(Build.java:156)
          	at hudson.model.AbstractBuild$AbstractRunner.post(AbstractBuild.java:548)
          	at hudson.model.Run.run(Run.java:1386)
          	at hudson.model.FreeStyleBuild.run(FreeStyleBuild.java:46)
          	at hudson.model.ResourceController.execute(ResourceController.java:88)
          	at hudson.model.Executor.run(Executor.java:145)
          Caused by: java.io.IOException: Bad file descriptor
          	at java.io.FileOutputStream.close0(Native Method)
          	at java.io.FileOutputStream.close(FileOutputStream.java:279)
          	at hudson.FilePath.copyTo(FilePath.java:1371)
          	... 12 more
          Email was triggered for: Failure
          

          Is this connected to this issue, or is it a new one?


          lacostej added a comment -

          If you are affected by this problem and aren't afraid of patching & building from source, you might want to try
          https://github.com/lacostej/jenkins/commit/31b8361d3152fb7970e1c11c906a763fa1aa5c25

          I will try making a test build for those who want to test it.


          SCM/JIRA link daemon added a comment -

          Code changed in jenkins
          User: Jerome Lacoste
          Path:
          remoting/src/main/java/hudson/remoting/ProxyInputStream.java
          http://jenkins-ci.org/commit/core/31b8361d3152fb7970e1c11c906a763fa1aa5c25
          Log:
          JENKINS-7836 tentative fix for the copy from slave to master issues. The problem looks similar to JENKINS-7745, so we might as well synchronized the ProxyInputStream.
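
          For readers who can't follow the commit link, a rough Java sketch of the idea behind that change (not the actual patch; the class and field names here are illustrative stand-ins for the remoting internals):

          import java.io.IOException;
          import java.io.InputStream;

          // Serialize operations on a stream that proxies data from the remoting channel,
          // so concurrent callers cannot interleave chunks and hand the master a
          // corrupted (e.g. truncated) copy of the report being copied.
          class SynchronizedProxyInputStream extends InputStream {
              private final InputStream channelSource; // stand-in for the channel-backed stream

              SynchronizedProxyInputStream(InputStream channelSource) {
                  this.channelSource = channelSource;
              }

              @Override
              public synchronized int read() throws IOException {
                  return channelSource.read();
              }

              @Override
              public synchronized int read(byte[] b, int off, int len) throws IOException {
                  return channelSource.read(b, off, len);
              }

              @Override
              public synchronized void close() throws IOException {
                  channelSource.close();
              }
          }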


          Uwe Schindler added a comment -

          Hi Jerome,
          Thanks for the work done. As Apache is planning to move over to Jenkins, we hope this gets solved soon. Please keep me informed when a new release containing this fix is out!


          Rickard von Essen added a comment -

          Hi Jerome,

          I have built:
          https://github.com/jenkinsci/jenkins/commit/c15378ef53bbcbaac1df11662ac4e16d28a008ea
          where your fix is included, but it actually only seems to get worse with this build.

          (The only change I have made is to use jmdns-3.4.0 instead of the "bundled" version, see http://issues.hudson-ci.org/browse/HUDSON-8647)

          If you need any information (logs, stats, dumps) or if I can run any experimental code for you to help resolve this issue send me an e-mail at rickard.von.essen 'at' gmail.com


          robertredd added a comment -

          I can confirm this issue is still present in 1.398. I had thought the fixes for some of the other related issues might have resolved it, but they haven't. I just wanted to confirm it's still an issue that I know is being worked on. Out of about 80 builds I run at night for statistics, about 10 will fail with this error.


          Cody Cutrer added a comment -

          I also just experienced it again on 1.397, though with a much lower frequency than pre-1.397.


          oeuftete added a comment -

          Still an issue in 1.399 (which included the fix for JENKINS-7809... so I guess that wasn't it).


          Roger Zhang added a comment -

          Does anyone know when this problem will be resolved?


          rshelley added a comment -

          To work around this, I've setup Sonar and am pushing all my code quality reports to it instead of reporting it in Jenkins. Even if this gets fixed, I doubt I'll go back, Sonar is just too nice and useful.


          Kohsuke Kawaguchi added a comment -

          The links in the original description of this issue are no longer working, so I cannot be sure of the failure mode that the reporter saw. As such, I'm also unclear what the later "me, too" comments are referring to.

          Now, I've just committed a fix to JENKINS-7871 toward 1.402, and I suspect that fixes this issue as well.

          If you continue to see a problem after 1.402, please open a separate issue, or reopen JENKINS-7871 (provided that the failure mode is the same), instead of reopening this issue, since it's unclear exactly what this issue is referring to.


          Uwe Schindler added a comment -

          The links are no longer working, but the "URL" field still points to the report on the Apache mailing list: http://mail-archives.apache.org/mod_mbox/www-builds/201010.mbox/%3c007b01cb6af6$e5f74e00$b1e5ea00$@thetaphi.de%3e


            Assignee: Kohsuke Kawaguchi (kohsuke)
            Reporter: Uwe Schindler (thetaphi)
            Votes: 68
            Watchers: 65
