Repo rules fail to extract unicode archives due to latin-1 hack #12986

jvolkman · 2021-02-10T06:41:05Z

Description of the problem / feature request:

rules_go currently fails on some machines due to some unicode characters included in filenames within the Go source archive - specifically the character Ä. For Linux and macOS, Go archives are distributed as tar.gz files with pax headers in the tar files.

Affected systems include:

ZFS volumes with the utf8only option. This is Ubuntu's default when choosing ZFS at install time.
macOS (HFS+ and APFS require UTF-8 filenames)

Bazel uses Apache Commons Compress to extract tar archives. For most tar files, Commons Compress defers to the encoding specified by the JVM's -Dfile.encoding param, or the platform default. With ISO-8859-1 - Bazel's preference - UTF-8 encoded filename bytes in tar files basically pass through verbatim when extracted and everything works.

But when the tar entry has a pax path header, the path name is always decoded as UTF-8.
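The difference between the two decoding paths can be reproduced outside Bazel. The following is a minimal sketch that uses Python's `tarfile` module as a stand-in for Commons Compress (an assumption made for illustration; the helper names `make_tar` and `names_as_latin1_reader` are hypothetical). It writes the same file name in ustar and pax format, then reads both archives back with ISO-8859-1 as the reader's encoding.

```python
import io
import tarfile

NAME = "\u00c4foo.txt"  # 'Äfoo.txt' with the composed character U+00C4


def make_tar(fmt):
    """Build an in-memory tar archive holding one file called NAME."""
    payload = b"boo!\n"
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w", format=fmt, encoding="utf-8") as tar:
        info = tarfile.TarInfo(NAME)
        info.size = len(payload)
        tar.addfile(info, io.BytesIO(payload))
    return buf.getvalue()


def names_as_latin1_reader(data):
    """List entry names while decoding plain header bytes as ISO-8859-1."""
    with tarfile.open(fileobj=io.BytesIO(data), mode="r", encoding="iso-8859-1") as tar:
        return tar.getnames()


# ustar: the name bytes 0xC3 0x84 pass through as two Latin-1 characters.
ustar_names = names_as_latin1_reader(make_tar(tarfile.USTAR_FORMAT))

# pax: the path record is decoded as UTF-8 regardless of the reader's encoding,
# so the name comes back as the single character U+00C4.
pax_names = names_as_latin1_reader(make_tar(tarfile.PAX_FORMAT))

print([hex(ord(c)) for c in ustar_names[0][:2]])
print([hex(ord(c)) for c in pax_names[0][:1]])
```

Only the pax archive yields a string containing U+00C4; the ustar name still carries the original UTF-8 bytes, one byte per character.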

The character Ä in its composed form is unicode character U+00C4. In Java's internal UTF-16, it's simply represented as the single code unit 0x00C4. But encoded as UTF-8 it becomes the multi-byte sequence 0xC3 0x84, since a lone 0xC4 byte is not a valid UTF-8 sequence.

When Commons Compress parses a pax-formatted tar file with a filename containing Ä as the 0xC3 0x84 UTF-8 string, the resulting Java string contains the value 0xC4 after decoding. But this value is never re-encoded as UTF-8 when creating the file on the filesystem. Instead, Bazel uses the Java char values verbatim as long as they're < 0xff. An attempt to create a filename containing 0xC4 on a filesystem that requires UTF-8 filenames will fail.
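To make the failure concrete, here is a small sketch of the two paths a UTF-8 file name can take before it reaches the filesystem. It is not Bazel's code: `narrow_to_latin1` is a hypothetical helper that approximates the "use each char value as one byte" behavior described above.

```python
def narrow_to_latin1(path):
    """Approximate the Latin-1 narrowing: one byte per char, '?' if it does not fit."""
    return path.encode("iso-8859-1", errors="replace")


def is_valid_utf8(raw):
    """Return True if raw is a well-formed UTF-8 byte string."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# Bytes of the file name as stored in the archive (UTF-8).
utf8_bytes = "\u00c4foo.go".encode("utf-8")

# Non-pax entry: decoded as ISO-8859-1, then narrowed back. The bytes survive.
non_pax_on_disk = narrow_to_latin1(utf8_bytes.decode("iso-8859-1"))

# Pax entry: decoded as UTF-8, then narrowed. 0xC3 0x84 collapses to a lone 0xC4.
pax_on_disk = narrow_to_latin1(utf8_bytes.decode("utf-8"))

print(non_pax_on_disk, is_valid_utf8(non_pax_on_disk))
print(pax_on_disk, is_valid_utf8(pax_on_disk))
```

The pax path ends up as a byte string that a UTF-8-only filesystem has to reject, while the non-pax path reproduces the archive's bytes exactly.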

But rules_go doesn't currently fail on macOS systems despite them also requiring UTF-8 filenames. This is because the darwin archives use decomposed representations of unicode characters. OS X has a history of preferring the decomposed forms over composed.

So instead of Ä being U+00C4 ("Latin Capital Letter A with Diaeresis"), it's U+0041 (just capital A) followed by U+0308 ("Combining Diaeresis"). Encoded in UTF-8 as seen in the macOS golang tarballs, the byte string is 0x41 0xCC 0x88. Decoded to a Java string (16-bit chars) it's 0x0041 0x0308. Coincidentally I presume, Bazel is able to extract this decomposed form on UTF-8 filesystems because it ignores the diaeresis and replaces it with a literal '?' character. So instead of Äfoo.go as is contained in the Go source archive, Bazel writes A?foo.go on macOS.
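The composed and decomposed forms, and the `?` substitution, can be checked with a few lines of Python. This is only an illustration of the encoding arithmetic; the Latin-1 narrowing with replacement is an approximation of the behavior described above, not Bazel's implementation.

```python
import unicodedata

composed = "\u00c4foo.go"                             # U+00C4 + "foo.go"
decomposed = unicodedata.normalize("NFD", composed)   # "A" + U+0308 + "foo.go"

# Code points of the first two characters in the decomposed form.
code_points = [hex(ord(c)) for c in decomposed[:2]]

# UTF-8 bytes as they would appear in the darwin tarball.
decomposed_utf8 = decomposed.encode("utf-8")

# Narrowing to Latin-1: U+0308 does not fit in one byte and becomes '?'.
narrowed = decomposed.encode("iso-8859-1", errors="replace")

print(code_points)
print(decomposed_utf8)
print(narrowed)
```

The narrowed name is plain ASCII, which is why the write succeeds on a UTF-8 filesystem even though the name no longer matches the archive entry.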

As on Linux with ZFS, Bazel fails to extract the linux archive on macOS, as reported here.

Bugs: what's the simplest, easiest way to reproduce this bug? Please provide a minimal example if possible.

test.bzl

def _tar_round_trip_impl(ctx):
    ctx.file("Äfoo.txt", "boo!\n")
    ctx.execute(["tar", "--format=" + ctx.attr.format, "-czvf", "file.tar.gz", "Äfoo.txt"])
    ctx.extract("file.tar.gz", "out")
    ctx.file("BUILD.bazel", 'exports_files(["Äfoo.txt", "out/Äfoo.txt", "file.tar.gz"])', legacy_utf8=False)

tar_round_trip = repository_rule(
    implementation = _tar_round_trip_impl,
    attrs = {
        "format": attr.string(
            mandatory = True,
        ),
    },
)

WORKSPACE

load("//:test.bzl", "tar_round_trip")

tar_round_trip(
    name = "non_pax",
    format = "ustar",  # supported by both macos BSD tar and linux GNU tar
)

tar_round_trip(
    name = "pax",
    format = "pax",
)

BUILD

genrule(
    name = "non_pax_test",
    srcs = ["@non_pax//:out/Äfoo.txt"],
    outs = ["non_pax.txt"],
    cmd = """
        cp $(location @non_pax//:out/Äfoo.txt) "$@"
    """,
)

genrule(
    name = "pax_test",
    srcs = ["@pax//:out/Äfoo.txt"],
    outs = ["pax.txt"],
    cmd = """
        cp $(location @pax//:out/Äfoo.txt) "$@"
    """,
)

# Works
bazel build //:non_pax_test

# Fails, either due to not being able to write the file (utf8 filesystem),
# or because the written filename is mangled.
bazel build //:pax_test

What operating system are you running Bazel on?

Mac OS 10.15.7
Ubuntu 20.04 with a ZFS root partition with utf8only enabled (the default for Ubuntu's ZFS support).

What's the output of `bazel info release`?

release 4.0.0

Have you found anything relevant by searching the web?

bazel-contrib/rules_go#2771
#374 - pretty generic issue regarding filename characters.
#7055 - an issue with the same problematic file in the Go archive, but targeted at Darwin only.


illicitonion · 2021-02-10T11:24:12Z

I happened to do some debugging of this yesterday - the problematic call chain is that

bazel/src/main/java/com/google/devtools/build/lib/bazel/repository/CompressedTarFunction.java

Line 104 in 2910d3d

try (OutputStream out = filePath.getOutputStream()) {

ends up calling into

bazel/src/main/native/unix_jni.cc

Lines 1045 to 1058 in 2910d3d

    
extern "C" JNIEXPORT jint JNICALL
Java_com_google_devtools_build_lib_unix_NativePosixFiles_openWrite(
    JNIEnv *env, jclass clazz, jstring path, jboolean append) {
  const char *path_chars = GetStringLatin1Chars(env, path);
  int flags = (O_WRONLY | O_CREAT) | (append ? O_APPEND : O_TRUNC);
  int fd;
  while ((fd = open(path_chars, flags, 0666)) == -1 && errno == EINTR) {
  }
  if (fd == -1) {
    PostException(env, errno, path_chars);
  }
  ReleaseStringLatin1Chars(path_chars);
  return fd;
}

Ideally we could use the raw buffer of bytes from the tar header to pass to open, rather than round-tripping via a String, but I'm not sure how to holistically go about that...

aiuto · 2021-02-12T21:08:56Z

cc @tetromino This is a nice summary of the MacOS file handling strangeness I was talking about.

brandjon · 2021-02-15T18:06:15Z

Addressing the unicode handling situation for strings (replacing latin-1 with utf-8) is on the schedule for this year, perhaps in Q2.

QIvan · 2021-04-20T13:51:40Z

Hi! Here is a victim of this bug. I'm using Ubuntu 20.04 with ZFS, and going by the original description it seems to have been set up with the utf8only flag.
Yesterday I tried to investigate the problem and copied some parts of Bazel's CompressedTarFunction.java into a small snippet. I think I've found a workaround for this issue: replacing filePath.getOutputStream() with new FileOutputStream(filePath.getPathFile()).
This code works for me: https://gist.github.com/QIvan/bf88d152c31a35eccf162845ce05c455

jvolkman · 2021-04-20T14:54:12Z

@QIvan per my understanding this breaks other cases in which an archive contains a filename as a string of non-UTF-8 compatible bytes and the destination filesystem will accept it. #7757 has more context.

In the recent version of rules_go, the issue bazel-contrib/rules_go#2771 was fixed. It should address the bazel build issue on some Linux or MacOS (bazelbuild/bazel#12986) Signed-off-by: Ivan Zemlyanskiy <[email protected]>


brandjon · 2022-11-02T17:31:53Z

Update: We haven't had bandwidth to work on eliminating the latin-1 hack, but may reconsider after 2023 Q1.

aiuto · 2023-06-16T04:33:53Z

I started a playground to observe some actual behaviors. We'll need to know how macos behaves when it sees NFC style names.
https://github.com/aiuto/bazel_samples/tree/main/utf8

When creating a `PathFragment` from a ZIP or TAR entry file name, the raw bytes of the name are now wrapped into a Latin-1 encoded String, which is how Bazel internally represents file paths. Previously, ZIP entries as well as TAR entries with PAX headers would result in ordinary decoded Java strings, resulting in corrupted file names when passed to Bazel's file system operations.

Fixes bazelbuild#12986
Fixes bazel-contrib/rules_go#2771
Closes bazelbuild#18448.

PiperOrigin-RevId: 571857847
Change-Id: Ie578724e75ddbefbe05255601b0afab706835f89
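The idea of wrapping raw name bytes in a Latin-1 string can be illustrated with a short sketch. The helper names `wrap_raw_name` and `unwrap_to_bytes` are hypothetical and are not taken from the Bazel change; the point is only that ISO-8859-1 maps every byte value to exactly one character, so the round trip is lossless for any name, UTF-8 or not.

```python
def wrap_raw_name(raw):
    """Carry raw file name bytes in a string, one character per byte."""
    return raw.decode("iso-8859-1")


def unwrap_to_bytes(path):
    """Recover the original bytes just before calling into the filesystem."""
    return path.encode("iso-8859-1")


# Every possible byte value survives the round trip.
all_bytes = bytes(range(256))
round_tripped = unwrap_to_bytes(wrap_raw_name(all_bytes))

# Composed (NFC) and decomposed (NFD) UTF-8 names are both preserved as-is.
composed_raw = "\u00c4foo.go".encode("utf-8")     # 0xC3 0x84 ...
decomposed_raw = "A\u0308foo.go".encode("utf-8")  # 0x41 0xCC 0x88 ...
composed_out = unwrap_to_bytes(wrap_raw_name(composed_raw))
decomposed_out = unwrap_to_bytes(wrap_raw_name(decomposed_raw))

# A name that is not valid UTF-8 (a single Latin-1 0xC4 byte) also passes through.
legacy_out = unwrap_to_bytes(wrap_raw_name(b"\xc4foo.go"))

print(round_tripped == all_bytes, composed_out, decomposed_out, legacy_out)
```

Because no decoding as UTF-8 ever happens, this approach also keeps archives with non-UTF-8 file names working on filesystems that accept them.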

…mes (#19765)

When creating a `PathFragment` from a ZIP or TAR entry file name, the raw bytes of the name are now wrapped into a Latin-1 encoded String, which is how Bazel internally represents file paths. Previously, ZIP entries as well as TAR entries with PAX headers would result in ordinary decoded Java strings, resulting in corrupted file names when passed to Bazel's file system operations.

Fixes #12986
Fixes bazel-contrib/rules_go#2771
Closes #18448.

PiperOrigin-RevId: 571857847
Change-Id: Ie578724e75ddbefbe05255601b0afab706835f89

Fixes #19671

iancha1992 · 2023-10-10T21:35:39Z

A fix for this issue has been included in Bazel 6.4.0 RC2. Please test out the release candidate and report any issues as soon as possible. Thanks!

sventiffe added the area-Windows (Windows-specific issues and feature requests), untriaged, and platform: apple labels and removed the area-Windows label Feb 10, 2021

jvolkman mentioned this issue Feb 10, 2021

Cannot successfully extract go_sdk because of unicode filename bazel-contrib/rules_go#2771

Closed

sventiffe added the team-OSS (Issues for the Bazel OSS team: installation, release process, Bazel packaging, website) label Feb 12, 2021

brandjon mentioned this issue Feb 15, 2021

Allow any characters in filenames / labels #374

Open

brandjon changed the title ~~Bazel's extract and download_and_extract mangle some filenames in pax-formatted tar files~~ Repo rules fail to extract unicode archives due to latin-1 hack Feb 15, 2021

gibfahn mentioned this issue Mar 8, 2021

go_download_sdk: work around Bazel .tar.gz extraction bug bazel-contrib/rules_go#2836

Merged

asraa mentioned this issue Apr 20, 2021

Bazel build error “Invalid or incomplete multibyte or wide character” envoyproxy/envoy#16065

Closed

aiuto added this to the unicode milestone Apr 20, 2021

QIvan mentioned this issue Apr 20, 2021

upgrade rules_go to v0.27.0 (#16065) envoyproxy/envoy#16083

Merged

jvolkman mentioned this issue Jun 8, 2021

Go builds fail on filesystems that require UTF-8 filenames dropbox/dbx_build_tools#37

Open

y3llowcake mentioned this issue Dec 14, 2021

Bazel fails to unzip archives containing files with non-latin characters in their name #11670

Closed

brandjon added the P4 (This is either out of scope or we don't have bandwidth to review a PR. No assignee) label and removed the P2 (We'll consider working on this in future. Assignee optional) label Nov 2, 2022

meteorcloudy added the help wanted (Someone outside the Bazel team could own this) label Feb 27, 2023

aiuto added type: bug and removed type: feature request labels Feb 28, 2023

fmeum mentioned this issue May 19, 2023

Fix handling of non-ASCII characters in archive entry file names #18448

Closed

copybara-service bot closed this as completed in 10169bb Oct 9, 2023

fmeum mentioned this issue Oct 9, 2023

[6.4.0] Fix handling of non-ASCII characters in archive entry file names #19765

Merged
