
Clover and cobertura parsing on hudson master fails because of invalid XML

      For a few days now, on our Apache Hudson installation, parsing of Clover's clover.xml or Cobertura's coverage.xml file fails (but not in all cases; sometimes it simply passes with the same build and the same job configuration). This only happens after the transfer to the master; the reports and the XML file are created on a Hudson slave. It seems like the network code somehow breaks the XML file during the transfer to the master.

      Downloading the clover.xml from the workspace to my local computer and validating it confirms that it is correctly formatted and has no XML parse errors.
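
      As a sketch of that local check (the /tmp/clover.xml path is only an example, not the real job layout), the JDK's built-in parser is enough to confirm well-formedness:

      import java.io.File;
      import javax.xml.parsers.DocumentBuilderFactory;

      // Minimal well-formedness check for a coverage report downloaded from the
      // slave workspace; parse() throws SAXParseException on malformed XML.
      public class ValidateCoverageXml {
          public static void main(String[] args) throws Exception {
              File report = new File(args.length > 0 ? args[0] : "/tmp/clover.xml");
              DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
              dbf.setNamespaceAware(true);
              dbf.newDocumentBuilder().parse(report);
              System.out.println("Parsed OK: " + report);
          }
      }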

          [JENKINS-7836] Clover and cobertura parsing on hudson master fails because of invalid XML

          Uwe Schindler created issue -

          tdunning added a comment -

          Also occurs with the Apache Mahout project:

          https://hudson.apache.org/hudson/job/Mahout-Quality/buildTimeTrend

          Note how one slave (solaris1) seems reliable and the other seems to fail about 50% of the time. This job has been stable for eons with no changes
          to the project itself.

          Apache recently changed to the new remoting code and has restarted the slaves in question.


          Simon Westcott added a comment -

          We started to experience this issue about a week ago; previously stable builds now intermittently fail with:

          FATAL: Unable to copy coverage from /xxx/ to /yyy/
          hudson.util.IOException2: Cannot parse coverage results

          Downloading the XML file locally proves the file is valid XML.
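
          Since the same file parses fine locally, one way to narrow this down (a sketch; the path is a placeholder) would be to hash the report on the slave and again on the master after the copy and compare the digests; a mismatch would point at corruption during the remoting transfer:

          import java.io.InputStream;
          import java.nio.file.Files;
          import java.nio.file.Paths;
          import java.security.MessageDigest;

          // Prints an MD5 digest of a coverage report so the copy on the slave and
          // the copy that reached the master can be compared byte-for-byte.
          public class HashCoverageFile {
              public static void main(String[] args) throws Exception {
                  String path = args.length > 0 ? args[0] : "/tmp/coverage.xml";
                  MessageDigest md5 = MessageDigest.getInstance("MD5");
                  try (InputStream in = Files.newInputStream(Paths.get(path))) {
                      byte[] buf = new byte[8192];
                      for (int n; (n = in.read(buf)) != -1; ) {
                          md5.update(buf, 0, n);
                      }
                  }
                  StringBuilder hex = new StringBuilder();
                  for (byte b : md5.digest()) {
                      hex.append(String.format("%02x", b));
                  }
                  System.out.println(hex + "  " + path);
              }
          }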

          torbent made changes -
          Link New: This issue is related to JENKINS-7809 [ JENKINS-7809 ]

          Stubbs added a comment -

          I'd like to add to Simon's comments and say that our build box is almost useless to us at the moment. We currently have 27 "broken" builds, but 5 minutes ago it was a lot higher. I kicked off a build for each of the broken jobs, and about 6 or 7 started to work again; others are still running and will no doubt bring that number down further.

          I know, though, that when I come in in the morning the wallboard will be a sea of red and I'll have no idea which are actually broken builds and which are caused by this problem. We have over 100 builds and at present ~30 are showing as broken. Some are config issues because we have just migrated from CruiseControl, a couple are genuine failures, and others are caused by this bug, but without going through each one, I can't tell which is which.

          I'd argue that this needs to be a higher priority than Major; it pretty much means we have to turn off publishing for our Cobertura/Clover-based reports.


          tdunning added a comment -

          One additional note from the Apache side.

          This problem seems to be both intermittent and host-specific for us. We have two nearly identical build VMs; one fails about 50% of the time and one doesn't (so far). Our infra guys claim there is no difference, but the faster one is the one that fails.

          I know that this isn't a big hint, but it might eliminate some hypotheses since it does (weakly) imply that the problem is somehow environmentally related.

          Uwe Schindler made changes -
          Priority Original: Major [ 3 ] New: Critical [ 2 ]

          Stubbs added a comment -

          We have 15 slaves; they're all virtual machines running on a Xen host. They all started from the same image, and the only differences between them are the jobs they may or may not have run in the months since we made the cluster live.

          We see the same issue, with the problem being intermittent and host-specific. The affected slaves are called BuildSlave03, BuildSlave05 & BuildSlave15. There may be more affected, but it's hard to tell right now.

          rshelley made changes -
          Link New: This issue is duplicated by JENKINS-7897 [ JENKINS-7897 ]

          Stubbs added a comment -

          Stack trace from the master Hudson's log at the time the build failed with this error.

          Stubbs made changes -
          Attachment New: JENKINS-7836-stacktrace.txt [ 19949 ]

            Assignee: Kohsuke Kawaguchi (kohsuke)
            Reporter: Uwe Schindler (thetaphi)
            Votes: 68
            Watchers: 65
