[JENKINS-43176] MercurialChangeLogParser fails in parallel checkouts

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Major
    • Operating System: Centos 7.2, 64-bit
      JDK: 1.8.0
      Jenkins: 2.48
      Mercurial Plugin: 1.59
      Running Jenkins directly
      No reverse proxy
      installed via yum
      issue occurs on the master node

      Using parallel nodes and checkout sometimes causes the following error:

      hudson.util.IOException2: Failed to parse /var/lib/jenkins/jobs/parallel-test/builds/28/changelog0.xml: '<?xml version="1.0" encoding="UTF-8"?>
       <changesets>
       '
       at hudson.plugins.mercurial.MercurialChangeLogParser.parse(MercurialChangeLogParser.java:55)
       at hudson.plugins.mercurial.MercurialChangeLogParser.parse(MercurialChangeLogParser.java:26)
       at org.jenkinsci.plugins.workflow.job.WorkflowRun.onCheckout(WorkflowRun.java:746)
       at org.jenkinsci.plugins.workflow.job.WorkflowRun.access$1500(WorkflowRun.java:125)
       at org.jenkinsci.plugins.workflow.job.WorkflowRun$SCMListenerImpl.onCheckout(WorkflowRun.java:936)
       at org.jenkinsci.plugins.workflow.steps.scm.SCMStep.checkout(SCMStep.java:123)
       at org.jenkinsci.plugins.workflow.steps.scm.SCMStep$StepExecutionImpl.run(SCMStep.java:83)
       at org.jenkinsci.plugins.workflow.steps.scm.SCMStep$StepExecutionImpl.run(SCMStep.java:73)
       at org.jenkinsci.plugins.workflow.steps.AbstractSynchronousNonBlockingStepExecution$1$1.call(AbstractSynchronousNonBlockingStepExecution.java:47)
       at hudson.security.ACL.impersonate(ACL.java:260)
       at org.jenkinsci.plugins.workflow.steps.AbstractSynchronousNonBlockingStepExecution$1.run(AbstractSynchronousNonBlockingStepExecution.java:44)
       at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
       at java.util.concurrent.FutureTask.run(FutureTask.java:266)
       at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
       at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
       at java.lang.Thread.run(Thread.java:745)
       Suppressed: hudson.util.IOException2: Failed to parse /var/lib/jenkins/jobs/parallel-test/builds/28/changelog0.xml: '<?xml version="1.0" encoding="UTF-8"?>
       <changesets>
       '
       ... 16 more
       Caused by: org.xml.sax.SAXParseException; systemId: file:/var/lib/jenkins/jobs/parallel-test/builds/28/changelog0.xml; lineNumber: 3; columnNumber: 1; XML document structures must start and end within the same entity.
       at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(AbstractSAXParser.java:1239)
       at com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl$JAXPSAXParser.parse(SAXParserImpl.java:643)
       at org.apache.commons.digester.Digester.parse(Digester.java:1871)
       at hudson.plugins.mercurial.MercurialChangeLogParser.parse(MercurialChangeLogParser.java:51)
       ... 15 more
       Caused by: org.xml.sax.SAXParseException; systemId: file:/var/lib/jenkins/jobs/parallel-test/builds/28/changelog0.xml; lineNumber: 3; columnNumber: 1; XML document structures must start and end within the same entity.
       at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(AbstractSAXParser.java:1239)
       at com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl$JAXPSAXParser.parse(SAXParserImpl.java:643)
       at org.apache.commons.digester.Digester.parse(Digester.java:1871)
       at hudson.plugins.mercurial.MercurialChangeLogParser.parse(MercurialChangeLogParser.java:51)
       ... 15 more
       Finished: FAILURE

      Here is the code that I used to generate that error:

       

      #!groovy
      
      try
      {
          parallel (
              0: { node { checkout($class: 'MercurialSCM', source: 'https://bitbucket.org/vicyap/jenkins-parallel-test', clean: true) } },
              1: { node { checkout($class: 'MercurialSCM', source: 'https://bitbucket.org/vicyap/jenkins-parallel-test', clean: true) } },
              2: { node { checkout($class: 'MercurialSCM', source: 'https://bitbucket.org/vicyap/jenkins-parallel-test', clean: true) } },
              3: { node { checkout($class: 'MercurialSCM', source: 'https://bitbucket.org/vicyap/jenkins-parallel-test', clean: true) } },
              4: { node { checkout($class: 'MercurialSCM', source: 'https://bitbucket.org/vicyap/jenkins-parallel-test', clean: true) } },
              5: { node { checkout($class: 'MercurialSCM', source: 'https://bitbucket.org/vicyap/jenkins-parallel-test', clean: true) } },
              6: { node { checkout($class: 'MercurialSCM', source: 'https://bitbucket.org/vicyap/jenkins-parallel-test', clean: true) } },
              7: { node { checkout($class: 'MercurialSCM', source: 'https://bitbucket.org/vicyap/jenkins-parallel-test', clean: true) } },
              8: { node { checkout($class: 'MercurialSCM', source: 'https://bitbucket.org/vicyap/jenkins-parallel-test', clean: true) } },
              9: { node { checkout($class: 'MercurialSCM', source: 'https://bitbucket.org/vicyap/jenkins-parallel-test', clean: true) } }
          )
      }
      catch (e)
      {
          throw e
      }
      

       

      However, this error is intermittent. The attachments show two identical runs: one passes, and the other has multiple failed nodes.

       

      Also I was able to reproduce this error with just two parallel nodes.

       

      My current workaround is to not use parallel nodes, but then jobs run much slower. Does anyone have an alternative workaround or solution? Or how do I even get started trying to debug this?

       


          Martin Filteau added a comment -

          I'm seeing the same problem with the Subversion SCM.

          Olivier Sechet added a comment -

          The problem is starting to block us, so we tried to understand what is happening. It seems that all the parallel branches write to the same changelog.xml file at the same time, which at some point leaves a file that cannot be parsed. Most of the time, the file looks like the one in the provided log:

          '<?xml version="1.0" encoding="UTF-8"?>
          <changesets>
          '

          But once or twice we got `NUL` (0x00 in hex) characters inside:

          '<?xml version="1.0" encoding="UTF-8"?>
          <changesets>
          \0\0\0\0\0\0\0\0\0\author='...' rev='9230' date='1499864372.0-7200'><msg>...
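The parse failure described above can be reproduced outside Jenkins. The following is a minimal sketch, assuming only the JDK's built-in SAX parser; the class name `TruncatedChangelogDemo` and the helper `tryParse` are hypothetical and are not part of the Mercurial plugin. It feeds the parser the same truncated content that appears in the log (a prolog and an unclosed `<changesets>` element) and reports the resulting `SAXParseException`.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

public class TruncatedChangelogDemo {

    /** Returns the parse error for the given XML text, or null if it parses cleanly. */
    public static SAXParseException tryParse(String xml) throws Exception {
        try {
            SAXParserFactory.newInstance().newSAXParser().parse(
                    new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)),
                    new DefaultHandler()); // DefaultHandler rethrows fatal parse errors
            return null;
        } catch (SAXParseException e) {
            return e;
        }
    }

    public static void main(String[] args) throws Exception {
        // Same content as the broken changelog0.xml quoted in the report.
        String truncated = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<changesets>\n";
        // What the file would look like if the writer had been allowed to finish.
        String complete = truncated + "</changesets>\n";

        System.out.println("complete:  " + tryParse(complete));
        SAXParseException error = tryParse(truncated);
        System.out.println("truncated: line " + error.getLineNumber() + ": " + error.getMessage());
    }
}
```

This only shows that a half-written changelog is enough to trigger the reported exception; it does not show how the file ends up truncated.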

          Jesse Glick added a comment -

          I suspect this is a race condition in SCMStep.checkout. Should try to atomically create the file before moving on to another choice. Or should use some other means of uniquifying files per build.

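The "atomically create the file before moving on to another choice" idea could look roughly like the sketch below. This is not the SCMStep code and not a proposed patch: the class `AtomicChangelogName` and the method `claim` are hypothetical, and the `changelogN.xml` naming is taken from the file name in the stack trace. It relies on `java.nio.file.Files.createFile`, which checks for existence and creates the file as a single atomic operation.

```java
import java.io.IOException;
import java.nio.file.FileAlreadyExistsException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class AtomicChangelogName {

    /** Claims the first free changelogN.xml in buildDir by creating it atomically. */
    public static Path claim(Path buildDir) throws IOException {
        for (int i = 0; ; i++) {
            Path candidate = buildDir.resolve("changelog" + i + ".xml");
            try {
                // Existence check and creation happen in one step, so only one caller can win.
                return Files.createFile(candidate);
            } catch (FileAlreadyExistsException taken) {
                // Another branch already owns this name; move on to the next one.
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Path buildDir = Files.createTempDirectory("build");
        ExecutorService pool = Executors.newFixedThreadPool(10);
        List<Future<Path>> claims = new ArrayList<>();
        for (int i = 0; i < 10; i++) {
            claims.add(pool.submit(() -> claim(buildDir)));
        }
        Set<Path> distinct = new HashSet<>();
        for (Future<Path> claimed : claims) {
            distinct.add(claimed.get());
        }
        pool.shutdown();
        System.out.println(distinct.size() + " distinct changelog files claimed by 10 branches");
    }
}
```

Because a name is handed out only to the caller that actually created the file, two parallel branches cannot end up writing to the same changelog.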

          Hector Miuler Malpica Gallegos added a comment -

          I have the same problem, but using Subversion.

          Olivier Sechet added a comment -

          After digging into the source of MercurialSCM and SCMStep, I found a working workaround: set the SCMStep's changelog to false. Just replace:

          checkout scm

          with:

          checkout(scm: scm, changelog: false)

          As jglick suggested, the creation of the file name in SCMStep is not thread-safe. Two concurrent threads can get the same changelog file name, which leads to the failure. However, when SCMStep.changelog is false, the file name is set to null and MercurialSCM won't try to write the changes.
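The race described in this comment is a classic check-then-act problem. The sketch below is an illustration under assumptions, not the actual SCMStep implementation: `ChangelogNameRace`, `pickUnsafe`, and `collide` are hypothetical names, and a `CyclicBarrier` is used to force the unlucky interleaving that happens only occasionally in a real build. Both branches look for the first changelog name that does not exist yet, and both look before either one has written its file.

```java
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.concurrent.CyclicBarrier;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ChangelogNameRace {

    /** Non-atomic choice: returns the first changelogN.xml that does not exist yet, without creating it. */
    public static Path pickUnsafe(Path buildDir) {
        for (int i = 0; ; i++) {
            Path candidate = buildDir.resolve("changelog" + i + ".xml");
            if (!Files.exists(candidate)) {
                return candidate;
            }
        }
    }

    /** One parallel branch: pick a name, wait for the other branch to pick too, then write. */
    private static Path branch(Path buildDir, CyclicBarrier bothPicked) throws Exception {
        Path chosen = pickUnsafe(buildDir);
        bothPicked.await(); // neither branch writes until both have chosen a name
        Files.write(chosen, "<changesets>\n".getBytes(StandardCharsets.UTF_8));
        return chosen;
    }

    /** Runs two branches and reports whether they ended up with the same changelog file. */
    public static boolean collide(Path buildDir) throws Exception {
        CyclicBarrier bothPicked = new CyclicBarrier(2);
        ExecutorService pool = Executors.newFixedThreadPool(2);
        try {
            Future<Path> first = pool.submit(() -> branch(buildDir, bothPicked));
            Future<Path> second = pool.submit(() -> branch(buildDir, bothPicked));
            return first.get().equals(second.get());
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        Path buildDir = Files.createTempDirectory("build");
        System.out.println("both branches chose the same file: " + collide(buildDir));
    }
}
```

With `changelog: false` no file name is chosen at all, which is why that workaround sidesteps the collision at the cost of losing the changelog.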

          Konstantin Veretennicov added a comment -

          We see the same issue and stack trace in our declarative multi-branch pipeline. It has parallel stages and does checkouts in each (dealing with x86 and x64 builds).

          We don't want to sacrifice the changelog, so our workaround is to wrap every checkout scm in a lock:

          lock("master-changelog-${BUILD_TAG}") {
            checkout scm
          }

          This means using the skipDefaultCheckout() option and an explicit checkout scm in all stages.

          Sam Van Oort added a comment -

          Reducing priority to reflect that this (a) is intermittent, (b) only applies to one specific SCM, (c) has a known workaround, and (d) doesn't completely break the system, only a fraction of the jobs.


          Konstantin Veretennicov added a comment -

          Some comments on your priority justification, svanoort:

          a) Intermittent is actually worse than failing reliably.

          b) A comment above reports this issue for Subversion too. Anyway, even if it's somehow Mercurial-specific, it's of little consolation when that's your VCS - it's not like we can snap our fingers and switch to another one. It might be easier to switch the CI instead.

          d) That fraction can be 100% of the jobs that really matter.

          But c) is correct. We still use the workaround. It clutters our pipeline code to some extent, but we learned to look away.

          Sam Van Oort added a comment -

          kveretennicov, I understand your frustration at seeing the priority downgraded, but it's only been downgraded a step – please note that Major priority still denotes an important issue, one that causes (to quote the Wiki) a "Major loss of function", whereas Critical is reserved for issues that cause "Crashes, loss of data, severe memory leak." It will still be fixed, but please bear in mind that we are responsible for a huge amount of functionality and it's important we don't miss issues that cause catastrophic failures.

          That said, perhaps rsandell could take a look and might have some insight into this – it sounds superficially like there's some sort of synchronization/race condition at play here with changelog read/write? It seems to me that while the likely bug is in the SCM implementation itself, there may be something we can do in Pipeline to protect against this...?


          Konstantin Veretennicov added a comment -

          svanoort, to be clear, for me it's completely fair to downgrade from "Blocker" when a workaround is available. I appreciate the sheer number of all the other issues you have to deal with in a project like Jenkins. I only wanted to correct the assessment of the impact before it's used by someone else to decrease the priority even further.

            Assignee: Unassigned
            Reporter: Victor Yap (vicyap)
            Votes: 13
            Watchers: 14