[JENKINS-14537] Template SCMs + Multiple SCMs results in failure parsing change log

Type: Bug
Resolution: Unresolved
Priority: Major
Component/s: multiple-scms-plugin, template-project-plugin
Labels:
None
Environment:

* Template1:
** Multiple SCMs:
*** Git[subprojectA]: git://server/subprojectA.git
*** Git[subprojectB]: git://server/subprojectB.git
* ProjectJob:
** Multiple SCMs:
*** Use SCM from another job: Template1
*** Git[project]: git://server/project.git


Using a configuration similar to the one outlined in the Environment field, Jenkins is unable to parse the changelog.xml file, because the file ends up with a nested CDATA section. That is because Multiple-SCMs creates an XML structure similar to:

<multi-scm-log>
  <sub-log scm="hudson.plugins.git.GitSCM">
    <![CDATA[Changes in origin/master ...
    ...
    ...]]>
  </sub-log>
  <sub-log scm="hudson.plugins.git.GitSCM">
  </sub-log>
</multi-scm-log>

That works fine for just the multiple-scm configuration. But when combined with the template-project "Use SCM from another project" option, the sub-log itself is completely wrapped in CDATA, thus resulting in an invalid XML document (it is invalid to nest CDATA sections):

<multiple-scms>
  <sub-log scm="hudson.plugins.git.GitSCM">
    <![CDATA[Changes in origin/master, ...
    ...
    ...]]>
  </sub-log>
  <sub-log scm="hudson.plugins.templateproject.ProxySCM">
    <![CDATA[<multiple-scms>
      <sub-log scm="hudson.plugins.git.GitSCM">
        <![CDATA[Changes in projectA/master, ...
        ...
        ...]]>
      </sub-log>
      <sub-log scm="hudson.plugins.git.GitSCM">
        <![CDATA[Changes in projectB/master, ...
        ...
        ...]]>
      </sub-log>
    </multiple-scms>]]>
  </sub-log>
</multiple-scms>

As you can see, the GitSCM sub-log entries here are wrapped in CDATA, but the ProxySCM entry was already wrapped in CDATA.
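The failure can be reproduced outside Jenkins. The following is a minimal sketch, not code from either plugin: the class name `NestedCdataDemo` and the shortened sample strings are made up for illustration, and it uses the JDK's stock DOM parser rather than whatever parser setup Multiple-SCMs uses. It wraps one well-formed changelog inside another CDATA section, the way the ProxySCM entry above does, and checks whether each variant parses.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

class NestedCdataDemo {
    // A single-level changelog: one CDATA section, well-formed.
    static final String FLAT =
        "<multiple-scms>"
      + "<sub-log scm=\"hudson.plugins.git.GitSCM\">"
      + "<![CDATA[Changes in origin/master]]>"
      + "</sub-log>"
      + "</multiple-scms>";

    // The same document wrapped in another CDATA section, as a proxied
    // sub-log would be. The outer CDATA ends at the FIRST "]]>", so the
    // remainder no longer lines up with the element structure.
    static final String NESTED =
        "<multiple-scms>"
      + "<sub-log scm=\"hudson.plugins.templateproject.ProxySCM\">"
      + "<![CDATA[" + FLAT + "]]>"
      + "</sub-log>"
      + "</multiple-scms>";

    /** Returns true if the text parses as well-formed XML. */
    static boolean isWellFormed(String xml) {
        try {
            DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false; // the parser rejected the document
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("flat well-formed:   " + isWellFormed(FLAT));
        System.out.println("nested well-formed: " + isWellFormed(NESTED));
    }
}
```

The flat document should parse and the nested one should be rejected, because the text left over after the first `]]>` is no longer valid markup.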

Ideally, CDATA should only be used if needed (e.g., in this case, the nodes could simply be nested, and only use CDATA around the actual commit data [from GitSCM]).

is related to

JENKINS-14099 Excessive exception logging

Resolved

Joe Hansche created issue - 2012-07-23 18:21

Joe Hansche added a comment - 2012-07-23 18:29

To elaborate, I think the problem is that ProxySCM should avoid the CDATA entirely, because it should be understood that the remote ("proxied") SCM will already use CDATA where it needs to:

<multiple-scms>
  <sub-log scm="hudson.plugins.git.GitSCM">
    <![CDATA[Changes in origin/master, ...
    ...
    ...]]>
  </sub-log>
  <sub-log scm="hudson.plugins.templateproject.ProxySCM">
    <sub-log scm="hudson.plugins.git.GitSCM">
      <![CDATA[Changes in projectA/master, ...
      ...
      ...]]>
    </sub-log>
  </sub-log>
</multiple-scms>

That way, even if ProxySCM uses only a single remote GitSCM configuration, it is still valid. I haven't looked into the code, so the problem may actually be that Multiple-SCMs uses CDATA all the time, when it is only necessary for the actual final SCM log data. That may be difficult to determine though, since neither plugin was probably designed to work with the other. In our case though (as mentioned in the Environment field above), this combination is actually very useful, and we use it for dozens of jobs. It works well in general; the only thing that fails is the changelog parsing, which in turn causes our log file to fill up because of the error in parsing the changelog.xml file.


Joe Hansche added a comment - 2012-07-23 19:15 - edited

Looking into this more, I guess the real problem is that the changelog itself is generally not XML (but plain text). ProxySCM generally does not inject any sort of identifier that it is the one responsible for the changelog (just proxies what the original SCM changelog said, verbatim). I see that Multiple-SCMs actually expects to never have a nested <multiple-scms/> node (because it removes the MultiSCM descriptor from the list of available SCMs to choose from). However, the ProxySCM now makes it possible to have that nested changelog, and because it's a template project, it actually does make sense to allow for that.

One way to get around this would be to strip the <multiple-scms></multiple-scms> tags from the ProxySCM sublog, but then you're relying on text-based magic to achieve it, and it still wouldn't really be perfect.

Instead, I think the most appropriate way to fix this is to go back to what the standard ChangeLogParser does (at least, how GitSCM works), and NOT expect any XML structure at all. Instead, maybe a kind of binary-safe parser that inserts a marker with the SCM class identifier, plus the length of the next "sub-log chunk". The reader would then read the length number, then read that many bytes from the file, and treat that as one separate sublog file. Then the sublog writer call would look something more like:

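                        // Read the whole sub-log, then write it prefixed by its SCM type and length.
                        // Note: length() counts UTF-16 chars; a byte-exact reader needs the encoded byte count.
                        String subLogText = FileUtils.readFileToString(subChangeLog);
                        logWriter.write(String.format("MultiSCM:\"%s\"\n%d\n%s\n",
                                        scm.getType(),
                                        subLogText.length(),
                                        subLogText));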

And the output (e.g., from my initial description) would be more like:

MultiSCM:hudson.plugins.git.GitSCM
512
Changes in origin/master, ...
...
total of 512 bytes here
...
MultiSCM:hudson.plugins.templateproject.ProxySCM
1122
MultiSCM:hudson.plugins.git.GitSCM
512
Changes in projectA/master, ...
...
total of 512 bytes here
...
MultiSCM:hudson.plugins.git.GitSCM
512
Changes in projectB/master, ...
...
total of 512 bytes here
...

The tokenization is still not perfect, and particularly with the way ProxySCM works (since there is no easy way to tell that the proxied SCM is in fact a MultiSCM changelog).
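A minimal sketch of the length-prefixed framing described above, assuming UTF-8 text and a byte count (rather than a character count) as the length. The class `LengthPrefixedLog` and its methods `frame` and `parse` are hypothetical names for illustration; nothing here is taken from the Multiple-SCMs code.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

class LengthPrefixedLog {
    static final String MARKER = "MultiSCM:";

    /** Frames one sub-log as: marker + SCM type, payload byte length, payload. */
    static byte[] frame(String scmType, String subLog) {
        byte[] payload = subLog.getBytes(StandardCharsets.UTF_8);
        byte[] header = (MARKER + scmType + "\n" + payload.length + "\n")
                .getBytes(StandardCharsets.UTF_8);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        out.write(header, 0, header.length);
        out.write(payload, 0, payload.length);
        out.write('\n'); // separator after the payload, as in the sample output
        return out.toByteArray();
    }

    /** Splits framed data back into {scmType, subLog} pairs. */
    static List<String[]> parse(byte[] data) {
        List<String[]> chunks = new ArrayList<>();
        int pos = 0;
        while (pos < data.length) {
            int eol = indexOfNewline(data, pos);
            String header = new String(data, pos, eol - pos, StandardCharsets.UTF_8);
            if (!header.startsWith(MARKER)) {
                throw new IllegalArgumentException("missing marker at byte " + pos);
            }
            String scmType = header.substring(MARKER.length());
            pos = eol + 1;

            eol = indexOfNewline(data, pos);
            int length = Integer.parseInt(
                    new String(data, pos, eol - pos, StandardCharsets.UTF_8));
            pos = eol + 1;

            // Read exactly `length` bytes; the payload is never scanned for
            // delimiters, so it may contain anything, including "]]>".
            String subLog = new String(data, pos, length, StandardCharsets.UTF_8);
            pos += length + 1; // skip the payload and its trailing newline
            chunks.add(new String[] {scmType, subLog});
        }
        return chunks;
    }

    private static int indexOfNewline(byte[] data, int from) {
        for (int i = from; i < data.length; i++) {
            if (data[i] == '\n') return i;
        }
        throw new IllegalArgumentException("truncated header");
    }
}
```

A proxied Multiple-SCMs log is then just a payload that happens to be framed data itself, so calling `parse` again on that payload recovers the inner sub-logs. As noted above, the reader still has to know or guess when a payload is a nested log.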

Although, for that matter, it would actually make sense to use a true libxml document generator, and let it decide whether to use CDATA or not, based on what the text is. It could also just entity-encode the contents of the CDATA section (e.g., using &lt; and &gt;), so that the nested <![CDATA[]]> is not interpreted as such.

In general, I think it's a mistake to use the SAX parser to read the file, but not use the SAX framework to generate the XML in the first place. By using plain String.format(), you are not guaranteeing that the resulting XML file is valid, so the SAX parser will barf on the invalid document. I'm sure using <![CDATA[]]> was your way of getting around that, but as you can see here, that is still not perfect. In fact, you would still have the same problem if a commit message was written with something like:

$ git commit -m'Wrap the unknown content in <![CDATA[]]> to avoid parsing issues'

Because it would result in the same bug described here.
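As a rough illustration of generating the file with an XML library instead of String.format(), here is a sketch that uses the JDK's StAX `XMLStreamWriter`. This is one possible choice, not what either plugin does, and the class name `EscapedSubLogDemo` and its methods are hypothetical. `writeCharacters` escapes markup characters itself, so no CDATA section is needed and a sub-log may contain `<![CDATA[]]>` or even another complete changelog.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamWriter;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class EscapedSubLogDemo {
    /** Writes a one-entry changelog; the library escapes the sub-log text. */
    static String write(String scmClass, String subLog) {
        try {
            StringWriter out = new StringWriter();
            XMLStreamWriter xml =
                    XMLOutputFactory.newInstance().createXMLStreamWriter(out);
            xml.writeStartElement("multi-scm-log");
            xml.writeStartElement("sub-log");
            xml.writeAttribute("scm", scmClass);
            xml.writeCharacters(subLog); // escaped by the writer, no CDATA needed
            xml.writeEndElement();
            xml.writeEndElement();
            xml.close();
            return out.toString();
        } catch (XMLStreamException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Parses a changelog and returns the text of its first sub-log. */
    static String readSubLog(String xmlText) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xmlText)));
            return doc.getElementsByTagName("sub-log").item(0).getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String message =
                "Wrap the unknown content in <![CDATA[]]> to avoid parsing issues";
        String inner = write("hudson.plugins.git.GitSCM", message);
        // A proxied log is just more text for the outer writer to escape.
        String outer = write("hudson.plugins.templateproject.ProxySCM", inner);
        System.out.println(readSubLog(readSubLog(outer)).equals(message));
    }
}
```

The trade-off is that each nesting level escapes the text again, so a deeply nested log grows and is harder to read by eye, but the parser and the generator can no longer disagree about what is valid.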


Joe Hansche made changes - 2012-08-01 16:54

Link

New: This issue is related to JENKINS-14099 [ JENKINS-14099 ]

Joe Hansche made changes - 2012-08-01 20:14

Assignee

Original: Kevin Bell [ kbell ]

New: Joe Hansche [ jhansche ]

Joe Hansche made changes - 2012-08-01 20:14

Status

Original: Open [ 1 ]

New: In Progress [ 3 ]

Joe Hansche added a comment - 2012-08-01 20:21

After some discussion and trying to find the best way to avoid this problem, I decided to XML entity-encode any "]]>" (and &) in the sublog text, which allows for recursive encoding and decoding:

<multi-scm-log version="2">
<sub-log scm="hudson.plugins.templateproject.ProxySCM">
<![CDATA[<multi-scm-log version="2">
<sub-log scm="hudson.plugins.git.GitSCM">
<![CDATA[&93;&93;&gt;
</sub-log>
<sub-log scm="hudson.plugins.git.GitSCM">
<![CDATA[&93;&93;&gt;
</sub-log>
</multi-scm-log>
]]>
</sub-log>
</multi-scm-log>

In this case, each log was blank, but you can see that the nested multi-scm-log's sub-log nodes have the ]]> encoded to &93;&93;&gt;. Before that, it will encode any "&" into "&amp;", which means that if it were nested even further, you would end up with "&amp;93;&amp;93;&amp;gt;". On the way out, I decode &amp; back into &, and &93;&93;&gt; back into ]]>.

Now if I introduce an actual commit (e.g., if the commit contains "]]>" in the commit message), you can see how the nested encoding and decoding works:

<multi-scm-log version="2">
<sub-log scm="hudson.plugins.templateproject.ProxySCM">
<![CDATA[<multi-scm-log version="2">
<sub-log scm="hudson.plugins.git.GitSCM">
<![CDATA[Changes in branch origin/HEAD, between 8683af570511301fc8ea3ebeae3a8315f607bb63 and 8683af570511301fc8ea3ebeae3a8315f607bb63
Changes in branch origin/master, between 8683af570511301fc8ea3ebeae3a8315f607bb63 and 8683af570511301fc8ea3ebeae3a8315f607bb63
&93;&93;&gt;
</sub-log>
<sub-log scm="hudson.plugins.git.GitSCM">
<![CDATA[Changes in branch origin/master, between cd4b6a92715d10b63aa2f9d84101233034c20a85 and e3a66cdc1f8b3aac2ec585f1649d959482aecd11
commit e3a66cdc1f8b3aac2ec585f1649d959482aecd11
tree 6db2b3a0c13c04b459b5376c9aeb343edb09fb87
parent da5cb23b4e2989553fc14cdcf305595fcaa4820e
author Joe Hansche <jhansche@myyearbook.com> 1343852323 -0400
committer Joe Hansche <jhansche@myyearbook.com> 1343852323 -0400

    Testing 5 &amp;93;&amp;93;&amp;gt; x

:100644 100644 c9f2a7b2f5ea69d3eb178486fbe15c9757accbd6 097039f4f62342d3253b42297f19eb90aacb026f M	README

commit da5cb23b4e2989553fc14cdcf305595fcaa4820e
tree 9e004ae6f87c001f30c569feb1360eafb62cd3d4
parent cd4b6a92715d10b63aa2f9d84101233034c20a85
author Joe Hansche <jhansche@myyearbook.com> 1343851983 -0400
committer Joe Hansche <jhansche@myyearbook.com> 1343851983 -0400

    Testing 4 &amp;93;&amp;93;&amp;gt; x

:100644 100644 8e1178035834ac70cd49c258dbbe898d3badd476 c9f2a7b2f5ea69d3eb178486fbe15c9757accbd6 M	README
&93;&93;&gt;
</sub-log>
</multi-scm-log>
]]>
</sub-log>
</multi-scm-log>

And the changelog shows the expected "Testing 4 ]]> x"
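The encode and decode steps can be sketched as follows. This is an illustration of the scheme as described in this comment, not the code from the pull request: the class `SubLogEscaper` and its methods are hypothetical, and the `&93;&93;&gt;` token is copied from the sample output above. One assumption of the sketch is that decoding undoes the two replacements in reverse order, so that a commit message which itself contains the literal text `&93;&93;&gt;` survives a round trip.

```java
class SubLogEscaper {
    // Replacement token for "]]>", as it appears in the sample changelogs.
    static final String CDATA_END_TOKEN = "&93;&93;&gt;";

    /** Makes text safe to place inside a CDATA section. */
    static String encode(String text) {
        return text
                .replace("&", "&amp;")            // first, protect existing ampersands
                .replace("]]>", CDATA_END_TOKEN); // then hide CDATA terminators
    }

    /** Reverses encode(), undoing the replacements in the opposite order. */
    static String decode(String text) {
        return text
                .replace(CDATA_END_TOKEN, "]]>")
                .replace("&amp;", "&");
    }

    /** Wraps one encoded sub-log in the version 2 layout shown above. */
    static String wrap(String scmClass, String subLog) {
        return "<multi-scm-log version=\"2\">\n"
                + "<sub-log scm=\"" + scmClass + "\">\n"
                + "<![CDATA[" + encode(subLog) + "]]>\n"
                + "</sub-log>\n"
                + "</multi-scm-log>\n";
    }

    public static void main(String[] args) {
        String message = "Testing 4 ]]> x";
        String once = encode(message);  // as stored in a Git sub-log
        String twice = encode(once);    // as stored after being proxied
        System.out.println(once);
        System.out.println(twice);
        System.out.println(decode(decode(twice)));
    }
}
```

Each nesting level applies `encode` once more on the way in and `decode` once more on the way out, which is why the commit message in the proxied log above shows `&amp;93;&amp;93;&amp;gt;` while the sub-log terminators show `&93;&93;&gt;`.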


Joe Hansche added a comment - 2012-08-01 20:46

A pull-request has been submitted to fix this issue, at https://github.com/jenkinsci/multiple-scms-plugin/pull/2


Joe Hansche made changes - 2012-08-01 20:46

Assignee

Original: Joe Hansche [ jhansche ]

New: Kevin Bell [ kbell ]

Joe Hansche made changes - 2012-08-02 19:25

Status

Original: In Progress [ 3 ]

New: Open [ 1 ]

Assignee: Kevin Bell

Reporter: Joe Hansche

Votes: 1

Watchers: 5

Created: 2012-07-23 18:21

Updated: 2014-06-16 19:15
