Details
Type: Bug
Status: Reopened
Priority: Blocker
Resolution: Unresolved
Component/s: workflow-basic-steps-plugin
Labels: None
Environment: Jenkins 2.121.2 and Jenkins 2.81, Pipeline Groovy Plugin 2.54
Description
I'm extracting the XML file (nuspec) from some NuGet packages and trying to parse it. In most cases this works fine, but in some the XML was written as UTF-8 with a BOM, and then the parser fails and reports:
org.xml.sax.SAXParseException; lineNumber: 1; columnNumber: 1; Content is not allowed in prolog.
The way I'm parsing xml is:
    @NonCPS
    def parsePackage(packageName, packageVersion) {
        def packageFullName = "${packageName}.${packageVersion}"
        bat """curl -L https://www.nuget.org/api/v2/package/${packageName}/${packageVersion} -o ${packageFullName}.nupkg"""
        bat """unzip ${packageFullName}.nupkg -d ${packageFullName}"""
        def nuspecPath = """${packageFullName}\\${packageName}.nuspec"""
        def nuspecContent = readFile file:nuspecPath
        def nuspecXML = new XmlSlurper( false, false ).parseText(nuspecContent)
        println nuspecXML.metadata.version
        def newXml = XmlUtil.serialize(nuspecXML)
        return newXml
    }
It looks like readFile does not support UTF-8 with BOM, as it passes the leading BOM character into the returned string.
I tried to replicate this directly in Groovy with:
    def xmldata = new File("Newtonsoft.Json.nuspec").text
    def pkg = new XmlSlurper().parseText(xmldata)
    println pkg.metadata.version.text()
But here the leading BOM character is not passed into the xmldata variable.
Attached is an example nuspec with a BOM in it.
Jakub Pawlinski, this is a known issue with the Unicode spec and the Java platform's implementation of it, not with Pipeline. In UTF-8 the BOM is neither required nor recommended, and since the BOM is essentially meaningless in UTF-8, Java passes it through transparently.
First I'd make sure to add the "encoding: 'UTF-8'" argument to your readFile step to ensure the file is read as UTF-8. Then postprocess the result to correct for the nonstandard input.
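To illustrate the pass-through behaviour described above, here is a minimal sketch in plain Java (runnable outside Jenkins). It decodes a byte array that starts with the UTF-8 BOM bytes EF BB BF and shows that the resulting String begins with the character U+FEFF. The class and method names are hypothetical and are not part of Pipeline or any plugin.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; the names here are illustrative only.
class BomDecodeDemo {
    // UTF-8 encoding of U+FEFF, the byte order mark: EF BB BF.
    static final byte[] UTF8_BOM = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF};

    // Decode bytes as UTF-8 with the standard decoder; no BOM handling is added.
    static String decode(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    // Build a sample payload: the BOM bytes followed by the UTF-8 bytes of the text.
    static byte[] withBom(String text) {
        byte[] body = text.getBytes(StandardCharsets.UTF_8);
        byte[] out = new byte[UTF8_BOM.length + body.length];
        System.arraycopy(UTF8_BOM, 0, out, 0, UTF8_BOM.length);
        System.arraycopy(body, 0, out, UTF8_BOM.length, body.length);
        return out;
    }

    public static void main(String[] args) {
        String text = decode(withBom("<package/>"));
        // The decoder keeps the BOM as the first character of the String.
        System.out.println(text.charAt(0) == 0xFEFF); // true
        System.out.println(text.length());            // 11: one BOM char plus 10 chars of XML
    }
}
```

An XML parser that is handed this String, rather than the raw bytes, sees the U+FEFF character before the first `<`, which is consistent with the "Content is not allowed in prolog" error at line 1, column 1.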
Some suggested solutions are available on StackOverflow.
Personally, I'd do something like this to sanitize your input:
(might need to be \u FEFF, try it both ways).
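One way such sanitizing could look, offered as a sketch under assumptions and not necessarily the commenter's exact code: remove every U+FEFF character from the text before handing it to the parser. It is written as plain Java; `BomSanitizer` and `removeBomChars` are hypothetical names, and the same `String.replace` call is available from Groovy.

```java
// Hypothetical helper; not part of Jenkins or any plugin.
class BomSanitizer {
    // Remove every U+FEFF character from the text.
    // Simple, but it scans the whole string and also drops any U+FEFF
    // that appears later in the content, not only a leading one.
    static String removeBomChars(String text) {
        return text.replace("\uFEFF", "");
    }

    public static void main(String[] args) {
        String raw = "\uFEFF<package/>";
        String clean = removeBomChars(raw);
        System.out.println(clean);                        // <package/>
        System.out.println(raw.length() - clean.length()); // 1
    }
}
```

In the reported pipeline this would be applied to the string returned by readFile, before it is passed to parseText.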
There are also code snippets out there that take a more efficient approach and only inspect the start of the String.
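A minimal sketch of that leading-only check, in plain Java: it looks at the first character alone and drops it if it is U+FEFF, then parses the result with the JDK's built-in DOM parser to confirm the XML is accepted. `LeadingBomStripper`, `stripLeadingBom`, and `rootElementName` are hypothetical names, and the sample XML is made up for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper; not part of Jenkins or any plugin.
class LeadingBomStripper {
    // Drop a single leading U+FEFF if present; leave everything else untouched.
    static String stripLeadingBom(String text) {
        if (text != null && !text.isEmpty() && text.charAt(0) == '\uFEFF') {
            return text.substring(1);
        }
        return text;
    }

    // Parse the XML text and return the name of its root element.
    // Checked parser exceptions are wrapped to keep the call sites short.
    static String rootElementName(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            return doc.getDocumentElement().getNodeName();
        } catch (Exception e) {
            throw new IllegalStateException("XML parse failed", e);
        }
    }

    public static void main(String[] args) {
        // Made-up sample resembling a nuspec, with a leading BOM character.
        String raw = "\uFEFF<package><metadata><version>1.0.0</version></metadata></package>";
        String clean = stripLeadingBom(raw);
        System.out.println(rootElementName(clean)); // package
    }
}
```

Compared with replacing every occurrence, this touches only the first character and leaves any U+FEFF inside the document content alone.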