Looking into this more, I think the real problem is that the changelog itself is generally plain text, not XML. ProxySCM does not inject any identifier saying that it is the one responsible for the changelog; it just proxies what the original SCM changelog said, verbatim. Multiple-SCMs apparently expects never to see a nested <multiple-scms/> node, because it removes the MultiSCM descriptor from the list of available SCMs to choose from. However, ProxySCM now makes that nested changelog possible, and because it's a template project, it actually makes sense to allow it.
One way to get around this would be to strip the <multiple-scms></multiple-scms> tags from the ProxySCM sublog, but then you're relying on text-based magic to achieve it, and it still wouldn't really be perfect.
Instead, I think the most appropriate way to fix this is to go back to what the standard ChangeLogParser does (at least, how GitSCM works), and NOT expect any XML structure at all. That could be a kind of binary-safe format in which the writer inserts a marker with the SCM class identifier, plus the length of the next "sub-log chunk". The reader would then read the length number, read that many bytes from the file, and treat them as one separate sublog file. The sublog writer call would then look something more like:
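// The length is counted in UTF-8 bytes (not chars) so the reader can skip by byte count.
// This assumes logWriter also encodes its output as UTF-8.
String subLogText = FileUtils.readFileToString(subChangeLog, "UTF-8");
logWriter.write(String.format("MultiSCM:%s\n%d\n%s\n",
    scm.getType(),
    subLogText.getBytes("UTF-8").length,
    subLogText));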
And the output (e.g., from my initial description) would be more like:
MultiSCM:hudson.plugins.git.GitSCM
512
Changes in origin/master, ...
...
total of 512 bytes here
...
MultiSCM:hudson.plugins.templateproject.ProxySCM
1122
MultiSCM:hudson.plugins.git.GitSCM
512
Changes in projectA/master, ...
...
total of 512 bytes here
...
MultiSCM:hudson.plugins.git.GitSCM
512
Changes in projectB/master, ...
...
total of 512 bytes here
...
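The reader side of that format could be sketched roughly as follows. This is only an illustration under my own assumptions: ChunkedChangeLog, Chunk, writeChunk and readChunks are hypothetical names (nothing like this exists in Multiple-SCMs or ProxySCM), the length is counted in bytes, and each payload is followed by a single newline as in the sample output above.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of either plugin.
class ChunkedChangeLog {

    /** One sub-log: the SCM class name from the header plus the raw payload bytes. */
    static final class Chunk {
        final String scmClass;
        final byte[] payload;

        Chunk(String scmClass, byte[] payload) {
            this.scmClass = scmClass;
            this.payload = payload;
        }
    }

    static final String PREFIX = "MultiSCM:";

    /** Writes "MultiSCM:<class>\n<byte length>\n<payload>\n". */
    static void writeChunk(OutputStream out, String scmClass, byte[] payload) throws IOException {
        String header = PREFIX + scmClass + "\n" + payload.length + "\n";
        out.write(header.getBytes(StandardCharsets.UTF_8));
        out.write(payload);
        out.write('\n');
    }

    /** Reads every top-level chunk; the payload bytes are never scanned for markers. */
    static List<Chunk> readChunks(InputStream in) throws IOException {
        List<Chunk> chunks = new ArrayList<>();
        DataInputStream data = new DataInputStream(in);
        String header;
        while ((header = readLine(data)) != null) {
            if (!header.startsWith(PREFIX)) {
                throw new IOException("Unexpected chunk header: " + header);
            }
            String lengthLine = readLine(data);
            if (lengthLine == null) {
                throw new EOFException("Missing length after header: " + header);
            }
            byte[] payload = new byte[Integer.parseInt(lengthLine.trim())];
            data.readFully(payload); // throws EOFException if the file is truncated
            if (data.read() != '\n') {
                throw new IOException("Missing newline after payload of: " + header);
            }
            chunks.add(new Chunk(header.substring(PREFIX.length()), payload));
        }
        return chunks;
    }

    /** Reads bytes up to (not including) the next '\n'; returns null at end of stream. */
    private static String readLine(InputStream in) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        int b;
        while ((b = in.read()) != -1 && b != '\n') {
            buf.write(b);
        }
        if (b == -1 && buf.size() == 0) {
            return null;
        }
        return new String(buf.toByteArray(), StandardCharsets.UTF_8);
    }
}
```

A nested log (the ProxySCM case) would be handled by calling readChunks again on the payload of the outer chunk, but the caller still has to know, or guess, that the payload is itself a chunked log.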
The tokenization is still not perfect, particularly given the way ProxySCM works, since there is no easy way to tell that the proxied SCM's changelog is in fact a MultiSCM changelog.
Although, for that matter, ... It would actually make sense to use a true libxml document generator, and let it decide whether to use CDATA or not, based on what the text is. You could also just encode the CDATA section (e.g., using &lt; and &gt;), so that the nested <![CDATA[]]> is not interpreted as such.
In general, I think it's a mistake to use the SAX parser to read the file but not use a proper XML-generating library to create it in the first place. Plain String.format() does not guarantee that the resulting XML file is well-formed, so the SAX parser will barf on the invalid document. I'm sure using <![CDATA[]]> was your way of getting around that, but as you can see here, that is still not perfect. In fact, you would still have the same problem if a commit message itself contained CDATA markup such as <![CDATA[]]>, because it would result in the same bug described here.
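To illustrate the "generate with an XML library" idea, here is a minimal sketch using the JDK's StAX XMLStreamWriter rather than String.format(). SubLogXmlWriter and wrapSubLog are hypothetical names, and the sub-log element and scm attribute are only my guess at a plausible layout, not necessarily what Multiple-SCMs writes. The point is that writeCharacters() escapes markup characters, so a nested <![CDATA[]]> in the sub-log text ends up as ordinary text.

```java
import java.io.StringWriter;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamWriter;

// Hypothetical helper, not part of either plugin.
class SubLogXmlWriter {

    /** Wraps one sub-log in XML, letting the writer escape the text. */
    static String wrapSubLog(String scmClass, String subLogText) throws XMLStreamException {
        StringWriter out = new StringWriter();
        XMLStreamWriter xml = XMLOutputFactory.newInstance().createXMLStreamWriter(out);
        xml.writeStartDocument();
        xml.writeStartElement("multiple-scms");
        xml.writeStartElement("sub-log");   // element name is an assumption
        xml.writeAttribute("scm", scmClass); // attribute name is an assumption
        // Escapes '<' and '&', so nested CDATA markers become plain text.
        xml.writeCharacters(subLogText);
        xml.writeEndElement();
        xml.writeEndElement();
        xml.writeEndDocument();
        xml.close();
        return out.toString();
    }
}
```

Whatever parser reads this back hands the sub-log text to the consumer unescaped, so a nested changelog survives the round trip without any text-based stripping. This sketch does not deal with characters that are illegal in XML 1.0 (some control characters), which would still need separate handling.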
To elaborate, I think the problem is that ProxySCM should avoid the CDATA entirely, because it should be understood that the remote ("proxied") SCM will already use CDATA where it needs to:
That way, even if ProxySCM uses only a single remote GitSCM configuration, it is still valid. I haven't looked into the code, so the problem may actually be Multiple-SCMs using CDATA all the time, when it is only necessary for the actual final SCM log data. That may be difficult to determine though, since probably neither plugin was designed to work with the other. In our case though (as mentioned in the Environment field above), this combination is actually very useful, and we use it for dozens of jobs. It works well in general; the only thing that fails is the changelog parsing, which in turn causes our log file to fill up with errors from parsing the changelog.xml file.