[JENKINS-48493] Compress Artifacts Plugin corrupts non-ASCII file names on Windows

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Major
    • compress-artifacts-plugin 1.10
      Jenkins 2.73.3
      Java(TM) SE Runtime Environment 1.8.0_144-b01
      Windows Server 2012 R2

      I have a project that produces "cite/käyttötapakuvaus.html" as an artifact. With compress-artifacts-plugin 1.10 installed, the resulting archive.zip has the following file header in its central directory:

      • central file header signature: 50 4B 01 02
      • version made by: 3F 00, i.e. spec v6.3
      • version needed to extract: 14 00, i.e. spec v2.0
      • general purpose bit flag: 08 00, i.e. the file name is not claimed to be UTF-8
      • compression method: 08 00
      • last mod file time: 0D A7
      • last mod file date: 88 4B
      • crc-32: D2 3A 07 C1
      • compressed size: 05 07 00 00
      • uncompressed size: DF 17 00 00
      • file name length: 1A 00
      • extra field length: 00 00, i.e. no alternative file name is stored in the extra field
      • file comment length: 00 00
      • disk number start: 00 00
      • internal file attributes: 00 00
      • external file attributes: 00 00 00 00
      • relative offset of local header: 2E 90 03 00
      • file name: 63 69 74 65 2F 6B E4 79 74 74 F6 74 61 70 61 6B 75 76 61 75 73 2E 68 74 6D 6C, i.e. "ä" was encoded as 0xE4, and "ö" was encoded as 0xF6. This matches Latin-1 and Windows-1252, but neither CP437 nor UTF-8.
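
The charset comparison for the file name bytes can be checked with a small Java sketch. It is only an illustration and not part of the plugin; the class and method names (FileNameEncodings, hex, encode) are made up for this example, and it assumes the JDK in use ships the windows-1252 and IBM437 charsets.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class FileNameEncodings {
    /** Formats bytes as space-separated uppercase hex, like the header dump above. */
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    /** Encodes the text in the given charset and returns the bytes as hex. */
    static String encode(String text, Charset charset) {
        return hex(text.getBytes(charset));
    }

    public static void main(String[] args) {
        String sample = "k\u00E4ytt\u00F6"; // "käyttö", written with escapes to avoid source-encoding issues
        // IBM437 and windows-1252 are assumed to be available; Charset.forName throws if they are not.
        for (String name : new String[] {"ISO-8859-1", "windows-1252", "IBM437", "UTF-8"}) {
            System.out.println(name + ": " + encode(sample, Charset.forName(name)));
        }
    }
}
```

Only the ISO-8859-1 and windows-1252 lines should contain E4 and F6. CP437 encodes "ä" and "ö" as 84 and 94, and UTF-8 encodes them as C3 A4 and C3 B6.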

      However, when I view the artifacts listing in Jenkins, it includes a link <a href="k%EF%BF%BDytt%EF%BF%BDtapakuvaus.html">k�ytt�tapakuvaus.html</a>, i.e. the non-ASCII characters have been replaced with U+FFFD REPLACEMENT CHARACTER. This link actually works, but it looks very ugly. Other HTML artifacts contain links like <a href="k%C3%A4ytt%C3%B6tapakuvaus.html">käyttötapakuvaus</a>, and those links do not work.
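
The %EF%BF%BD sequences are consistent with the Latin-1 bytes being decoded as UTF-8 somewhere along the way. The following sketch reproduces that effect outside Jenkins. It is an assumption about the mechanism, not a trace of the actual Jenkins code path; the class name ReplacementCharDemo and its helpers are hypothetical, and URLEncoder performs form-style encoding, which may differ in detail from how Jenkins builds the href.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

public class ReplacementCharDemo {
    /** The file name bytes as stored in archive.zip (Latin-1 encoded). */
    static byte[] latin1NameBytes() {
        return "k\u00E4ytt\u00F6tapakuvaus.html".getBytes(StandardCharsets.ISO_8859_1);
    }

    /** Decodes bytes as UTF-8; new String(...) replaces malformed input with U+FFFD. */
    static String decodeAsUtf8(byte[] raw) {
        return new String(raw, StandardCharsets.UTF_8);
    }

    /** Percent-encodes the text as UTF-8 (form-style encoding). */
    static String percentEncode(String text) {
        try {
            return URLEncoder.encode(text, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always supported
        }
    }

    public static void main(String[] args) {
        String decoded = decodeAsUtf8(latin1NameBytes());
        // prints k%EF%BF%BDytt%EF%BF%BDtapakuvaus.html
        System.out.println(percentEncode(decoded));
    }
}
```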

      If I understand correctly, the file names in archive.zip should not be Latin-1 at all. APPNOTE.TXT - .ZIP File Format Specification v6.3.4 says they should be CP437 by default, or UTF-8 if bit 11 of the general purpose bit flag is set. However, TrueZipArchiver.java does zip = new ZipOutputStream(out, Charset.defaultCharset()), and I suppose the default charset is Windows-1252 here.

      I'm not sure which charset ZipFile expects when ZipStorage.java constructs it as new ZipFile(archive); the javadocs used to be at java.net, which has been shut down. RawZipFile.DEFAULT_CHARSET suggests it may be expecting UTF-8.

      Because the archive.zip files are intended to be read back by the compress-artifacts-plugin itself rather than published as is, I think it would be best to hardcode UTF-8 in TrueZipArchiver.java.
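
To illustrate what hardcoding UTF-8 buys, here is a minimal round-trip sketch. It uses java.util.zip.ZipOutputStream as a stand-in for the TrueZIP ZipOutputStream that TrueZipArchiver.java actually uses, so it only demonstrates the idea; the class Utf8ZipRoundTrip and its methods are hypothetical names.

```java
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Enumeration;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import java.util.zip.ZipOutputStream;

public class Utf8ZipRoundTrip {
    /** Writes a one-entry archive with UTF-8 entry names, regardless of the platform default charset. */
    static void writeZip(File archive, String entryName) throws IOException {
        try (ZipOutputStream zip =
                new ZipOutputStream(new FileOutputStream(archive), StandardCharsets.UTF_8)) {
            zip.putNextEntry(new ZipEntry(entryName));
            zip.write("<html></html>".getBytes(StandardCharsets.UTF_8));
            zip.closeEntry();
        }
    }

    /** Lists entry names using ZipFile(File), which decodes names as UTF-8. */
    static List<String> listNames(File archive) throws IOException {
        List<String> names = new ArrayList<>();
        try (ZipFile zf = new ZipFile(archive)) {
            Enumeration<? extends ZipEntry> entries = zf.entries();
            while (entries.hasMoreElements()) {
                names.add(entries.nextElement().getName());
            }
        }
        return names;
    }

    /** Writes a temporary archive and reads the names back. */
    static List<String> roundTrip(String entryName) {
        try {
            File archive = File.createTempFile("archive", ".zip");
            archive.deleteOnExit();
            writeZip(archive, entryName);
            return listNames(archive);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(roundTrip("cite/k\u00E4ytt\u00F6tapakuvaus.html"));
    }
}
```

Because writer and reader agree on UTF-8, the name survives unchanged even when the default charset is Windows-1252.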

          A similar problem still occurs with Compress Artifacts Plugin 112.v52b_808b_85a_e8, but the symptoms are worse than before: when archive.zip contains such a file name, Jenkins does not show that the build has any artifacts at all. Instead, it writes an exception to the system log:

          java.util.zip.ZipException: invalid CEN header (bad entry name)
          	at java.base/java.util.zip.ZipFile$Source.zerror(ZipFile.java:1762)
          	at java.base/java.util.zip.ZipFile$Source.checkAndAddEntry(ZipFile.java:1243)
          	at java.base/java.util.zip.ZipFile$Source.initCEN(ZipFile.java:1701)
          	at java.base/java.util.zip.ZipFile$Source.<init>(ZipFile.java:1479)
          	at java.base/java.util.zip.ZipFile$Source.get(ZipFile.java:1441)
          	at java.base/java.util.zip.ZipFile$CleanableResource.<init>(ZipFile.java:718)
          	at java.base/java.util.zip.ZipFile.<init>(ZipFile.java:252)
          	at java.base/java.util.zip.ZipFile.<init>(ZipFile.java:181)
          	at java.base/java.util.zip.ZipFile.<init>(ZipFile.java:195)
          	at PluginClassLoader for compress-artifacts//org.jenkinsci.plugins.compress_artifacts.ZipStorage.list(ZipStorage.java:181)
          	at hudson.model.Run.addArtifacts(Run.java:1139)
          	at hudson.model.Run$AddArtifacts.call(Run.java:1131)
          	at hudson.model.Run$AddArtifacts.call(Run.java:1118)
          	at jenkins.util.VirtualFile.run(VirtualFile.java:510)
          	at hudson.model.Run.getArtifactsUpTo(Run.java:1098)
          [omitted the rest of the stack trace]
          

          I have not checked whether the file header in the central directory is the same as before.

          (Comment added by Kalle Niemitalo.)

          Looking at the file header in the central directory, as encoded by Compress Artifacts Plugin 112.v52b_808b_85a_e8 on JDK 17:

          • central file header signature: 50 4B 01 02 (like before)
          • version made by: 14 00 (previously 3F 00)
          • version needed to extract: 14 00 (same as before), i.e. spec v2.0
          • general purpose bit flag: 08 00 (same as before), i.e. the file name is not claimed to be UTF-8
          • compression method: 08 00 (same as before)
          • last mod file time: 4F 69
          • last mod file date: 3D 5A
          • crc-32: B8 25 6C 07
          • compressed size: A2 04 00 00
          • uncompressed size: 1E 11 00 00
          • file name length: 49 00
          • extra field length: 00 00, i.e. no alternative file name is stored in the extra field
          • file comment length: 00 00
          • disk number start: 00 00
          • internal file attributes: 00 00
          • external file attributes: 00 00 00 00
          • relative offset of local header: BF A9 7A 05
          • file name: "ä" is still encoded as 0xE4. This matches Latin-1 and Windows-1252, but neither CP437 nor UTF-8.
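
For readers who want to decode the two-byte fields themselves, here is a small sketch of how the values above are interpreted according to APPNOTE.TXT. The class CentralHeaderFields and its methods are hypothetical helpers written for this illustration.

```java
public class CentralHeaderFields {
    /** Combines two bytes stored in little-endian order into a 16-bit value. */
    static int le16(int lo, int hi) {
        return (lo & 0xFF) | ((hi & 0xFF) << 8);
    }

    /** The low byte of a version field is the spec version times ten (e.g. 63 means 6.3). */
    static String specVersion(int versionField) {
        int v = versionField & 0xFF;
        return (v / 10) + "." + (v % 10);
    }

    /** Bit 11 of the general purpose bit flag marks the name and comment as UTF-8. */
    static boolean claimsUtf8(int flags) {
        return (flags & 0x0800) != 0;
    }

    /** Bit 3 means sizes and CRC follow the data in a data descriptor. */
    static boolean hasDataDescriptor(int flags) {
        return (flags & 0x0008) != 0;
    }

    public static void main(String[] args) {
        int flags = le16(0x08, 0x00); // "08 00" from both header dumps
        System.out.println("made by (old): " + specVersion(le16(0x3F, 0x00)));
        System.out.println("made by (new): " + specVersion(le16(0x14, 0x00)));
        System.out.println("UTF-8 flag set: " + claimsUtf8(flags));
        System.out.println("data descriptor: " + hasDataDescriptor(flags));
    }
}
```

With the flag bytes 08 00, only bit 3 is set, so a reader has no indication that the name is anything other than the spec default.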

          (Comment added by Kalle Niemitalo.)

          When ZipStorage.java reads an existing archive.zip, it consistently uses the java.util.zip.ZipFile(java.io.File) constructor, which uses UTF-8 to decode file names.

          When the ZipStorage.archive static method creates archive.zip, it calls FilePath.archive(ArchiverFactory.ZIP, other arguments). I think that ends up using ArchiverFactory.create(OutputStream), which then uses Charset.defaultCharset().

          FilePath.writeToTar takes a Charset filenamesEncoding parameter but FilePath.archive doesn't seem to have the same feature.

          It's not obvious to me how this should be fixed. One option would be for the Compress Artifacts Plugin to implement its own ArchiverFactory that overrides public Archiver create(OutputStream out) to call ArchiverFactory.ZIP.create(out, StandardCharsets.UTF_8), though that would feel somewhat hacky.
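
The writer/reader mismatch described in this comment can be reproduced outside Jenkins with plain java.util.zip. The sketch below is an assumption-laden illustration, not plugin code: it writes entry names in ISO-8859-1 (standing in for a Windows-1252 default charset), then opens the archive once as UTF-8, which is what ZipFile(File) does, and once with the charset that was actually used. The class LegacyCharsetRead and its methods are hypothetical.

```java
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import java.util.zip.ZipOutputStream;

public class LegacyCharsetRead {
    /** Writes a temporary archive with one empty entry, encoding its name in nameCharset. */
    static File writeArchive(String entryName, Charset nameCharset) {
        try {
            File archive = File.createTempFile("archive", ".zip");
            archive.deleteOnExit();
            try (ZipOutputStream zip =
                    new ZipOutputStream(new FileOutputStream(archive), nameCharset)) {
                zip.putNextEntry(new ZipEntry(entryName));
                zip.closeEntry();
            }
            return archive;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /**
     * Returns the first entry name decoded with nameCharset, or null if the JDK
     * rejects the archive. Depending on the JDK version, a malformed name may
     * surface as a ZipException or as an IllegalArgumentException.
     */
    static String firstName(File archive, Charset nameCharset) {
        try (ZipFile zf = new ZipFile(archive, nameCharset)) {
            return zf.entries().nextElement().getName();
        } catch (IOException | IllegalArgumentException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        String name = "cite/k\u00E4ytt\u00F6tapakuvaus.html";
        File archive = writeArchive(name, StandardCharsets.ISO_8859_1);
        System.out.println("as UTF-8:   " + firstName(archive, StandardCharsets.UTF_8));
        System.out.println("as Latin-1: " + firstName(archive, StandardCharsets.ISO_8859_1));
    }
}
```

Reading with the explicit charset recovers the name, which suggests that archives already written with the old behavior could still be read if the reader were told which charset to use. That only helps when the original default charset is known.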

          (Comment added by Kalle Niemitalo.)

            Assignee: Unassigned
            Reporter: Kalle Niemitalo (kon)
            Votes: 1
            Watchers: 2