Uploaded image for project: 'Jenkins'
  1. Jenkins
  2. JENKINS-14537

Template SCMs + Multiple SCMs results in failure parsing change log

      Using a configuration similar that outlined in the Environment field, Jenkins is unable to parse the changelog.xml file, because it results in a nested CDATA section. That is because Multiple-SCMs creates an XML structure similar to:

      <multi-scm-log>
        <sub-log scm="hudson.plugins.git.GitSCM">
          <![CDATA[Changes in origin/master ...
          ...
          ...]]>
        </sub-log>
        <sub-log scm="hudson.plugins.git.GitSCM">
        </sub-log>
      </multi-scm-log>
      

      That works fine for just the multiple-scm configuration. But when combined with the template-project "Use SCM from another project" option, the sub-log itself is completely wrapped in CDATA, thus resulting in an invalid XML document (it is invalid to nest CDATA sections):

      <multiple-scms>
        <sub-log scm="hudson.plugins.git.GitSCM">
          <![CDATA[Changes in origin/master, ...
          ...
          ...]]>
        </sub-log>
        <sub-log scm="hudson.plugins.templateproject.ProxySCM">
          <![CDATA[<multiple-scms>
            <sub-log scm="hudson.plugins.git.GitSCM">
              <![CDATA[Changes in projectA/master, ...
              ...
              ...]]>
            </sub-log>
            <sub-log scm="hudson.plugins.git.GitSCM">
              <![CDATA[Changes in projectB/master, ...
              ...
              ...]]>
            </sub-log>
          </multiple-scms>]]>
        </sub-log>
      </multiple-scms>
      

      As you can see, the GitSCM sub-log entries here are wrapped in CDATA, but the ProxySCM entry was already wrapped in CDATA.

      Ideally, CDATA should only be used if needed (e.g., in this case, the nodes could simply be nested, and only use CDATA around the actual commit data [from GitSCM]).

          [JENKINS-14537] Template SCMs + Multiple SCMs results in failure parsing change log

          Joe Hansche added a comment -

          To elaborate, I think the problem is that ProxySCM should avoid the CDATA entirely, because it should be understood that the remote ("proxied") SCM will already use CDATA where it needs to:

          <multiple-scms>
            <sub-log scm="hudson.plugins.git.GitSCM">
              <![CDATA[Changes in origin/master, ...
              ...
              ...]]>
            </sub-log>
            <sub-log scm="hudson.plugins.templateproject.ProxySCM">
              <sub-log scm="hudson.plugins.git.GitSCM">
                <![CDATA[Changes in projectA/master, ...
                ...
                ...]]>
              </sub-log>
            </sub-log>
          </multiple-scms>
          

          That way, even if ProxySCM uses only a single remote GitSCM configuration, it is still valid. I haven't looked into the code, so the problem may actually be Multiple-SCMs using CDATA all the time, when it is only necessary for the actual final SCM log data. That may be difficult to determine though, since neither plugin was probably designed to work with the other. In our case though (as mentioned in the Environment field above), this combination is actually very useful, and we use it for dozens of jobs. It works well in general, the only thing that fails is the changelog parsing (which in turn causes our log file to fill up because of the error in parsing the changelog.xml file)

          Joe Hansche added a comment - To elaborate, I think the problem is that ProxySCM should avoid the CDATA entirely, because it should be understood that the remote ("proxied") SCM will already use CDATA where it needs to: <multiple-scms> <sub-log scm= "hudson.plugins.git.GitSCM" > <![CDATA[Changes in origin/master, ... ... ...]]> </sub-log> <sub-log scm= "hudson.plugins.templateproject.ProxySCM" > <sub-log scm= "hudson.plugins.git.GitSCM" > <![CDATA[Changes in projectA/master, ... ... ...]]> </sub-log> </sub-log> </multiple-scms> That way, even if ProxySCM uses only a single remote GitSCM configuration, it is still valid. I haven't looked into the code, so the problem may actually be Multiple-SCMs using CDATA all the time, when it is only necessary for the actual final SCM log data. That may be difficult to determine though, since neither plugin was probably designed to work with the other. In our case though (as mentioned in the Environment field above), this combination is actually very useful, and we use it for dozens of jobs. It works well in general, the only thing that fails is the changelog parsing (which in turn causes our log file to fill up because of the error in parsing the changelog.xml file)

          Joe Hansche added a comment - - edited

          Looking into this more, I guess the real problem is that the changelog itself is generally not XML (but plain text). ProxySCM generally does not inject any sort of identifier that it is the one responsible for the changelog (just proxies what the original SCM changelog said, verbatim). I see that Multiple-SCMs actually expects to never have a nested <multiple-scms/> node (because it removes the MultiSCM descriptor from the list of available SCMs to choose from). However, the ProxySCM now makes it possible to have that nested changelog, and because it's a template project, it actually does make sense to allow for that.

          One way to get around this would be to strip the <multiple-scms></multiple-scms> tags from the ProxySCM sublog, but then you're relying on text-based magic to achieve it, and it still wouldn't really be perfect.

          Instead, I think the most appropriate way to fix this is to go back to what the standard ChangeLogParser does (at least, how GitSCM works), and NOT expect any XML structure at all. Instead, maybe a kind of binary-safe parser that inserts a marker with the SCM class identifier, plus the length of the next "sub-log chunk". The reader would then read the length number, then read that many bytes from the file, and treat that as one separate sublog file. Then the sublog writer call would look something more like:

                                  String subLogText = FileUtils.readFileToString(subChangeLog);
                                  logWriter.write(String.format("MultiSCM:\"%s\"\n%d\n%s\n",
                                                  scm.getType(),
                                                  subLogText.length(),
                                                  subLogText);
          

          And the output (e.g., from my initial description) would be more like:

          MultiSCM:hudson.plugins.git.GitSCM
          512
          Changes in origin/master, ...
          ...
          total of 512 bytes here
          ...
          MultiSCM:hudson.plugins.templateproject.ProxySCM
          1122
          MultiSCM:hudson.plugins.git.GitSCM
          512
          Changes in projectA/master, ...
          ...
          total of 512 bytes here
          ...
          MultiSCM:hudson.plugins.git.GitSCM
          512
          Changes in projectB/master, ...
          ...
          total of 512 bytes here
          ...
          

          The tokenization is still not perfect, and particularly with the way ProxySCM works (since there is no easy way to tell that the proxied SCM is in fact a MultiSCM changelog).

          Although, for that matter, ... It would actually make sense to use a true libxml document generator, and let it decide whether to do CDATA or not, based on what the text is. Could also just encode the cdata section (e.g, using &lt; &gt; ), so that the nested <![CDATA[]]> is not interpreted as such.

          In general, I think it's a mistake to use the SAX parser to read the file, but not use the SAX framework to generate the XML in the first place. By using plain String.format(), you are not guaranteeing that the resulting XML file is valid, thus the SAX parser will barf on the invalid document, because you didn't use a proper XML-generating library to create the file initially. I'm sure using <![CDATA[]]> was your way of getting around that, but as you can see here, that is still not perfect (and in fact, you would still have the same problem, if a commit message was written with something like:

          $ git commit -m'Wrap the unknown content in <![CDATA[]]> to avoid parsing issues'

          Because it would result in the same bug described here.

          Joe Hansche added a comment - - edited Looking into this more, I guess the real problem is that the changelog itself is generally not XML (but plain text). ProxySCM generally does not inject any sort of identifier that it is the one responsible for the changelog (just proxies what the original SCM changelog said, verbatim). I see that Multiple-SCMs actually expects to never have a nested <multiple-scms/> node (because it removes the MultiSCM descriptor from the list of available SCMs to choose from). However, the ProxySCM now makes it possible to have that nested changelog, and because it's a template project, it actually does make sense to allow for that. One way to get around this would be to strip the <multiple-scms></multiple-scms> tags from the ProxySCM sublog, but then you're relying on text-based magic to achieve it, and it still wouldn't really be perfect. Instead, I think the most appropriate way to fix this is to go back to what the standard ChangeLogParser does (at least, how GitSCM works), and NOT expect any XML structure at all. Instead, maybe a kind of binary-safe parser that inserts a marker with the SCM class identifier, plus the length of the next "sub-log chunk". The reader would then read the length number, then read that many bytes from the file, and treat that as one separate sublog file. Then the sublog writer call would look something more like: String subLogText = FileUtils.readFileToString(subChangeLog); logWriter.write( String .format( "MultiSCM:\" %s\ "\n%d\n%s\n" , scm.getType(), subLogText.length(), subLogText); And the output (e.g., from my initial description) would be more like: MultiSCM:hudson.plugins.git.GitSCM 512 Changes in origin/master, ... ... total of 512 bytes here ... MultiSCM:hudson.plugins.templateproject.ProxySCM 1122 MultiSCM:hudson.plugins.git.GitSCM 512 Changes in projectA/master, ... ... total of 512 bytes here ... MultiSCM:hudson.plugins.git.GitSCM 512 Changes in projectB/master, ... ... total of 512 bytes here ... The tokenization is still not perfect, and particularly with the way ProxySCM works (since there is no easy way to tell that the proxied SCM is in fact a MultiSCM changelog). Although, for that matter, ... It would actually make sense to use a true libxml document generator, and let it decide whether to do CDATA or not, based on what the text is. Could also just encode the cdata section (e.g, using &lt; &gt; ), so that the nested <![CDATA[]]> is not interpreted as such. In general, I think it's a mistake to use the SAX parser to read the file, but not use the SAX framework to generate the XML in the first place. By using plain String.format(), you are not guaranteeing that the resulting XML file is valid, thus the SAX parser will barf on the invalid document, because you didn't use a proper XML-generating library to create the file initially. I'm sure using <![CDATA[]]> was your way of getting around that, but as you can see here, that is still not perfect (and in fact, you would still have the same problem, if a commit message was written with something like: $ git commit -m'Wrap the unknown content in <![CDATA[]]> to avoid parsing issues' Because it would result in the same bug described here.

          Joe Hansche added a comment -

          After some discussion and trying to find the best way to avoid this problem, I decided to XML entity-encode any "]]>" (and &) in the sublog text, which allows for recursive encoding and decoding:

          <multi-scm-log version="2">
          <sub-log scm="hudson.plugins.templateproject.ProxySCM">
          <![CDATA[<multi-scm-log version="2">
          <sub-log scm="hudson.plugins.git.GitSCM">
          <![CDATA[&93;&93;&gt;
          </sub-log>
          <sub-log scm="hudson.plugins.git.GitSCM">
          <![CDATA[&93;&93;&gt;
          </sub-log>
          </multi-scm-log>
          ]]>
          </sub-log>
          </multi-scm-log>

          In this case, each log was blank, but you can see that the nested multi-scm-log's sub-log nodes have the ]]> encoded to &93;&93;&gt;. Before that, it will encode any "&" into "&amp;" – which means if it were nested even further, you would end up with "&amp;93&amp;93&amp;gt;". On the way out, I decode &amp; into &, and &93;&93;&gt; back into ]]>.

          Now if I introduce an actual commit (e.g., if the commit contains "]]>" in the commit message), you can see how the nested encoding and decoding works:

          <multi-scm-log version="2">
          <sub-log scm="hudson.plugins.templateproject.ProxySCM">
          <![CDATA[<multi-scm-log version="2">
          <sub-log scm="hudson.plugins.git.GitSCM">
          <![CDATA[Changes in branch origin/HEAD, between 8683af570511301fc8ea3ebeae3a8315f607bb63 and 8683af570511301fc8ea3ebeae3a8315f607bb63
          Changes in branch origin/master, between 8683af570511301fc8ea3ebeae3a8315f607bb63 and 8683af570511301fc8ea3ebeae3a8315f607bb63
          &93;&93;&gt;
          </sub-log>
          <sub-log scm="hudson.plugins.git.GitSCM">
          <![CDATA[Changes in branch origin/master, between cd4b6a92715d10b63aa2f9d84101233034c20a85 and e3a66cdc1f8b3aac2ec585f1649d959482aecd11
          commit e3a66cdc1f8b3aac2ec585f1649d959482aecd11
          tree 6db2b3a0c13c04b459b5376c9aeb343edb09fb87
          parent da5cb23b4e2989553fc14cdcf305595fcaa4820e
          author Joe Hansche <jhansche@myyearbook.com> 1343852323 -0400
          committer Joe Hansche <jhansche@myyearbook.com> 1343852323 -0400
          
              Testing 5 &amp;93;&amp;93;&amp;gt; x
          
          :100644 100644 c9f2a7b2f5ea69d3eb178486fbe15c9757accbd6 097039f4f62342d3253b42297f19eb90aacb026f M	README
          
          commit da5cb23b4e2989553fc14cdcf305595fcaa4820e
          tree 9e004ae6f87c001f30c569feb1360eafb62cd3d4
          parent cd4b6a92715d10b63aa2f9d84101233034c20a85
          author Joe Hansche <jhansche@myyearbook.com> 1343851983 -0400
          committer Joe Hansche <jhansche@myyearbook.com> 1343851983 -0400
          
              Testing 4 &amp;93;&amp;93;&amp;gt; x
          
          :100644 100644 8e1178035834ac70cd49c258dbbe898d3badd476 c9f2a7b2f5ea69d3eb178486fbe15c9757accbd6 M	README
          &93;&93;&gt;
          </sub-log>
          </multi-scm-log>
          ]]>
          </sub-log>
          </multi-scm-log>
          

          And the changelog shows the expected "Testing 4 ]]> x"

          Joe Hansche added a comment - After some discussion and trying to find the best way to avoid this problem, I decided to XML entity-encode any "]]>" (and &) in the sublog text, which allows for recursive encoding and decoding: <multi-scm-log version= "2" > <sub-log scm= "hudson.plugins.templateproject.ProxySCM" > <![CDATA[<multi-scm-log version= "2" > <sub-log scm= "hudson.plugins.git.GitSCM" > <![CDATA[&93;&93;&gt; </sub-log> <sub-log scm= "hudson.plugins.git.GitSCM" > <![CDATA[&93;&93;&gt; </sub-log> </multi-scm-log> ]]> </sub-log> </multi-scm-log> In this case, each log was blank, but you can see that the nested multi-scm-log's sub-log nodes have the ]]> encoded to &93;&93;&gt; . Before that, it will encode any "&" into "&amp;" – which means if it were nested even further, you would end up with " &amp;93&amp;93&amp;gt; ". On the way out, I decode &amp; into & , and &93;&93;&gt; back into ]]> . Now if I introduce an actual commit (e.g., if the commit contains " ]]> " in the commit message), you can see how the nested encoding and decoding works: <multi-scm-log version= "2" > <sub-log scm= "hudson.plugins.templateproject.ProxySCM" > <![CDATA[<multi-scm-log version= "2" > <sub-log scm= "hudson.plugins.git.GitSCM" > <![CDATA[Changes in branch origin/HEAD, between 8683af570511301fc8ea3ebeae3a8315f607bb63 and 8683af570511301fc8ea3ebeae3a8315f607bb63 Changes in branch origin/master, between 8683af570511301fc8ea3ebeae3a8315f607bb63 and 8683af570511301fc8ea3ebeae3a8315f607bb63 &93;&93;&gt; </sub-log> <sub-log scm= "hudson.plugins.git.GitSCM" > <![CDATA[Changes in branch origin/master, between cd4b6a92715d10b63aa2f9d84101233034c20a85 and e3a66cdc1f8b3aac2ec585f1649d959482aecd11 commit e3a66cdc1f8b3aac2ec585f1649d959482aecd11 tree 6db2b3a0c13c04b459b5376c9aeb343edb09fb87 parent da5cb23b4e2989553fc14cdcf305595fcaa4820e author Joe Hansche <jhansche@myyearbook.com> 1343852323 -0400 committer Joe Hansche <jhansche@myyearbook.com> 1343852323 -0400 Testing 5 &amp;93;&amp;93;&amp;gt; x :100644 100644 c9f2a7b2f5ea69d3eb178486fbe15c9757accbd6 097039f4f62342d3253b42297f19eb90aacb026f M README commit da5cb23b4e2989553fc14cdcf305595fcaa4820e tree 9e004ae6f87c001f30c569feb1360eafb62cd3d4 parent cd4b6a92715d10b63aa2f9d84101233034c20a85 author Joe Hansche <jhansche@myyearbook.com> 1343851983 -0400 committer Joe Hansche <jhansche@myyearbook.com> 1343851983 -0400 Testing 4 &amp;93;&amp;93;&amp;gt; x :100644 100644 8e1178035834ac70cd49c258dbbe898d3badd476 c9f2a7b2f5ea69d3eb178486fbe15c9757accbd6 M README &93;&93;&gt; </sub-log> </multi-scm-log> ]]> </sub-log> </multi-scm-log> And the changelog shows the expected " Testing 4 ]]> x "

          Joe Hansche added a comment -

          A pull-request has been submitted to fix this issue, at https://github.com/jenkinsci/multiple-scms-plugin/pull/2

          Joe Hansche added a comment - A pull-request has been submitted to fix this issue, at https://github.com/jenkinsci/multiple-scms-plugin/pull/2

          Joe Hansche added a comment -

          I've written a sed script that worked to resolve the issue on my existing changelog.xml files. It will need to be tweaked for any file that is not in the same format as mine.

          $ cat fix-changelog.sh
          #!/bin/sh
          
          sed -n -r -i '1h;1!H;$ {;g;s#<sub-log scm="hudson.plugins.templateproject.ProxySCM">\n<!\[CDATA\[<multi-scm-log>\n<sub-log scm="hudson.plugins.git.GitSCM">\n<!\[CDATA\[((.*\n)*?)\]\]>\n</sub-log>\n<sub-log scm="hudson.plugins.git.GitSCM">\n<!\[CDATA\[((.*\n)*?)\]\]>\n</sub-log>\n</multi-scm-log>\n\]\]>\n</sub-log>#<sub-log scm="hudson.plugins.templateproject.ProxySCM">\n<!\[CDATA\[<multi-scm-log>\n<sub-log scm="hudson.plugins.git.GitSCM">\n<!\[CDATA\[\2\&93;\&93;\&gt;\n</sub-log>\n<sub-log scm="hudson.plugins.git.GitSCM">\n<!\[CDATA\[\4\&93;\&93;\&gt;\n</sub-log>\n</multi-scm-log>\n\]\]>\n</sub-log>#g;p;}' "$1"
          

          Then just pass the filename you want to change into the script, or do all at once:

          $ ./fix-changelog.sh path/to/changelog.xml
          
          : or to find and fix all:
          
          $ find . -name 'changelog.xml' -print0 | xargs -0 -n1 ./fix-changelog.sh
          

          Joe Hansche added a comment - I've written a sed script that worked to resolve the issue on my existing changelog.xml files. It will need to be tweaked for any file that is not in the same format as mine. $ cat fix-changelog.sh #!/bin/sh sed -n -r -i '1h;1!H;$ {;g;s#<sub-log scm= "hudson.plugins.templateproject.ProxySCM" >\n<!\[CDATA\[<multi-scm-log>\n<sub-log scm= "hudson.plugins.git.GitSCM" >\n<!\[CDATA\[((.*\n)*?)\]\]>\n</sub-log>\n<sub-log scm= "hudson.plugins.git.GitSCM" >\n<!\[CDATA\[((.*\n)*?)\]\]>\n</sub-log>\n</multi-scm-log>\n\]\]>\n</sub-log>#<sub-log scm= "hudson.plugins.templateproject.ProxySCM" >\n<!\[CDATA\[<multi-scm-log>\n<sub-log scm= "hudson.plugins.git.GitSCM" >\n<!\[CDATA\[\2\&93;\&93;\&gt;\n</sub-log>\n<sub-log scm= "hudson.plugins.git.GitSCM" >\n<!\[CDATA\[\4\&93;\&93;\&gt;\n</sub-log>\n</multi-scm-log>\n\]\]>\n</sub-log>#g;p;}' "$1" Then just pass the filename you want to change into the script, or do all at once: $ ./fix-changelog.sh path/to/changelog.xml : or to find and fix all: $ find . -name 'changelog.xml' -print0 | xargs -0 -n1 ./fix-changelog.sh

          Joe Hansche added a comment -

          kbell, bump. I've been running this snapshot for about 3 weeks. Seems to have fixed the issue.

          Joe Hansche added a comment - kbell , bump. I've been running this snapshot for about 3 weeks. Seems to have fixed the issue.

          Code changed in jenkins
          User: Jesse Glick
          Path:
          pom.xml
          src/test/java/hudson/plugins/git/MultipleSCMTest.java
          http://jenkins-ci.org/commit/git-plugin/7b8cc1b20264c19d009cb79cbd916245e2a81c17
          Log:
          Updated to branched multiple-scms-plugin, but JENKINS-14537 https://github.com/jenkinsci/multiple-scms-plugin/pull/5 broke MultipleSCMTest, so skipping it for now.

          SCM/JIRA link daemon added a comment - Code changed in jenkins User: Jesse Glick Path: pom.xml src/test/java/hudson/plugins/git/MultipleSCMTest.java http://jenkins-ci.org/commit/git-plugin/7b8cc1b20264c19d009cb79cbd916245e2a81c17 Log: Updated to branched multiple-scms-plugin, but JENKINS-14537 https://github.com/jenkinsci/multiple-scms-plugin/pull/5 broke MultipleSCMTest, so skipping it for now.

            kbell Kevin Bell
            jhansche Joe Hansche
            Votes:
            1 Vote for this issue
            Watchers:
            5 Start watching this issue

              Created:
              Updated: