Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Repo rules fail to extract unicode archives due to latin-1 hack #12986

Closed
jvolkman opened this issue Feb 10, 2021 · 8 comments
Closed

Repo rules fail to extract unicode archives due to latin-1 hack #12986

jvolkman opened this issue Feb 10, 2021 · 8 comments
Labels
help wanted Someone outside the Bazel team could own this P4 This is either out of scope or we don't have bandwidth to review a PR. (No assignee) platform: apple team-Starlark-Integration Issues involving Bazel's integration with Starlark, excluding builtin symbols type: bug
Milestone

Comments

@jvolkman
Copy link
Contributor

jvolkman commented Feb 10, 2021

Description of the problem / feature request:

rules_go currently fails on some machines due to some unicode characters included in filenames within the Go source archive - specifically the character Ä. For Linux and macOS, Go archives are distributed as tar.gz files with pax headers in the tar files.

Affected systems include:

  • ZFS volumes with the utf8only option. This is Ubuntu's default when choosing ZFS at install time.
  • macOS (HFS+ and APFS require UTF-8 filenames)

Bazel uses Apache Commons Compress to extract tar archives. For most tar files, Commons Compress defers to the encoding specified by the JVM's -Dfile.encoding param, or the platform default. With ISO-8859-1 - Bazel's preference - UTF-8 encoded filename bytes in tar files basically pass through verbatim when extracted and everything works.

But when the tar entry has a pax path header, the path name is always decoded as UTF-8.

The character Ä in its composed form is unicode character U+00C4. In Java's internal UTF-16, it's simply represented as 0xC4. But encoded as UTF-8 it becomes the multi-byte sequence 0xC3 0x84 since 0xC4 as a single byte is not a valid UTF-8 value.

When Commons Compress parses a pax-formatted tar file with a filename containing Ä as the 0xC3 0x84 UTF-8 string, the resulting Java string contains the value 0xC4 after decoding. But this value is never re-encoded as UTF-8 when creating the file on the filesystem. Instead, Bazel uses the Java char values verbatim as long as they're < 0xff. An attempt to create a filename containing 0xC4 on a filesystem that requires UTF-8 filenames will fail.

But rules_go doesn't currently fail on macOS systems despite them also requiring UTF-8 filenames. This is because the darwin archives use decomposed representations of unicode characters. OS X has a history of preferring the decomposed forms over composed.

So instead of Ä being U+00C4 ("Latin Capital Letter A with Diaeresis"), it's U+0041 (just capital A) followed by U+0308 ("Combining Diaeresis"). Encoded in UTF-8 as seen in the macOS golang tarballs, the byte string is 0x41 0xCC 0x88. Decoded to a Java string (16-bit chars) it's 0x0041 0x0308. Coincidentally I presume, Bazel is able to extract this decomposed form on UTF-8 filesystems because it ignores the diaeresis and replaces it with a literal '?' character. So instead of Äfoo.go as is contained in the Go source archive, Bazel writes A?foo.go on macOS.

Like Linux on ZFS, Bazel fails to extract the linux archive on macOS as reported here.

Bugs: what's the simplest, easiest way to reproduce this bug? Please provide a minimal example if possible.

test.bzl

def _tar_round_trip_impl(ctx):
    ctx.file("Äfoo.txt", "boo!\n")
    ctx.execute(["tar", "--format=" + ctx.attr.format, "-czvf", "file.tar.gz", "Äfoo.txt"])
    ctx.extract("file.tar.gz", "out")
    ctx.file("BUILD.bazel", 'exports_files(["Äfoo.txt", "out/Äfoo.txt", "file.tar.gz"])', legacy_utf8=False)

tar_round_trip = repository_rule(
    implementation = _tar_round_trip_impl,
    attrs = {
	"format": attr.string(
            mandatory = True,
        ),
    },
)

WORKSPACE

load("//:test.bzl", "tar_round_trip")

tar_round_trip(
    name = "non_pax",
    format = "ustar",  # supported by both macos BSD tar and linux GNU tar
)

tar_round_trip(
    name = "pax",
    format = "pax",
)

BUILD

genrule(
    name = "non_pax_test",
    srcs = ["@non_pax//:out/Äfoo.txt"],
    outs = ["non_pax.txt"],
    cmd = """
        cp $(location @non_pax//:out/Äfoo.txt) "$@"
    """,
)

genrule(
    name = "pax_test",
    srcs = ["@pax//:out/Äfoo.txt"],
    outs = ["pax.txt"],
    cmd = """
	cp $(location @pax//:out/Äfoo.txt) "$@"
    """,
)
# Works
bazel build //:non_pax_test

# Fails, either due to not being able to write the file (utf8 filesystem),
# or because the written filename is mangled.
bazel build //:pax_test

What operating system are you running Bazel on?

Mac OS 10.15.7
Ubuntu 20.04 with a ZFS root partition with utf8only enabled (the default for Ubuntu's ZFS support).

What's the output of bazel info release?

release 4.0.0

Have you found anything relevant by searching the web?

bazel-contrib/rules_go#2771
#374 - pretty generic issue regarding filename characters.
#7055 - an issue with the same problematic file in the Go archive, but targeted at Darwin only.

@sventiffe sventiffe added area-Windows Windows-specific issues and feature requests untriaged platform: apple and removed area-Windows Windows-specific issues and feature requests labels Feb 10, 2021
@illicitonion
Copy link
Contributor

I happened to do some debugging of this yesterday - the probematic call chain is that

try (OutputStream out = filePath.getOutputStream()) {
ends up calling into
extern "C" JNIEXPORT jint JNICALL
Java_com_google_devtools_build_lib_unix_NativePosixFiles_openWrite(
JNIEnv *env, jclass clazz, jstring path, jboolean append) {
const char *path_chars = GetStringLatin1Chars(env, path);
int flags = (O_WRONLY | O_CREAT) | (append ? O_APPEND : O_TRUNC);
int fd;
while ((fd = open(path_chars, flags, 0666)) == -1 && errno == EINTR) {
}
if (fd == -1) {
PostException(env, errno, path_chars);
}
ReleaseStringLatin1Chars(path_chars);
return fd;
}

Ideally we could use the raw buffer of bytes from the tar header to pass to open, rather than round-tripping via a String, but I'm not sure how to holistically go about that...

@sventiffe sventiffe added the team-OSS Issues for the Bazel OSS team: installation, release processBazel packaging, website label Feb 12, 2021
@aiuto
Copy link
Contributor

aiuto commented Feb 12, 2021

cc @tetromino This is a nice summary of the MacOS file handling strangeness I was talking about.

@brandjon
Copy link
Member

Addressing the unicode handling situation for strings (replacing latin-1 with utf-8) is on the schedule for this year, perhaps in Q2.

@brandjon brandjon added team-Starlark-Integration Issues involving Bazel's integration with Starlark, excluding builtin symbols P2 We'll consider working on this in future. (Assignee optional) type: feature request and removed team-OSS Issues for the Bazel OSS team: installation, release processBazel packaging, website untriaged labels Feb 15, 2021
@brandjon brandjon changed the title Bazel's extract and download_and_extract mangle some filenames in pax-formatted tar files Repo rules fail to extract unicode archives due to latin-1 hack Feb 15, 2021
@QIvan
Copy link

QIvan commented Apr 20, 2021

hi! Here is a victim of this bug (I'm using Ubuntu 20.04 with ZFS and, seems like, it was set up with an utf8only flag, according to the original description).
I tried to investigate the problem yesterday and copy some parts of Bazel's CompressedTarFunction.java to a small snippet. And, probably, I've found a workaround for this issue with just replacing filePath.getOutputStream() to new FileOutputStream(filePath.getPathFile())
So this code works for me https://gist.github.com/QIvan/bf88d152c31a35eccf162845ce05c455

@aiuto aiuto added this to the unicode milestone Apr 20, 2021
@jvolkman
Copy link
Contributor Author

@QIvan per my understanding this breaks other cases in which an archive contains a filename as a string of non-UTF-8 compatible bytes and the destination filesystem will accept it. #7757 has more context.

QIvan added a commit to QIvan/envoy that referenced this issue Apr 20, 2021
in the resent version of rules_go, the issue bazel-contrib/rules_go#2771 was fixed.
It should address the bazel build issue on some Linux or MacOS (bazelbuild/bazel#12986)

Signed-off-by: Ivan Zemlyanskiy <[email protected]>
lizan pushed a commit to envoyproxy/envoy that referenced this issue Apr 21, 2021
in the resent version of rules_go, the issue bazel-contrib/rules_go#2771 was fixed. 
It should address the bazel build issue on some Linux or MacOS (bazelbuild/bazel#12986)

Signed-off-by: izemlyanskiy <[email protected]>
gokulnair pushed a commit to gokulnair/envoy that referenced this issue May 6, 2021
in the resent version of rules_go, the issue bazel-contrib/rules_go#2771 was fixed.
It should address the bazel build issue on some Linux or MacOS (bazelbuild/bazel#12986)

Signed-off-by: izemlyanskiy <[email protected]>
Signed-off-by: Gokul Nair <[email protected]>
@brandjon
Copy link
Member

brandjon commented Nov 2, 2022

Update: We haven't had bandwidth to work on eliminating the latin-1 hack, but may reconsider after 2023 Q1.

@brandjon brandjon added P4 This is either out of scope or we don't have bandwidth to review a PR. (No assignee) and removed P2 We'll consider working on this in future. (Assignee optional) labels Nov 2, 2022
@meteorcloudy meteorcloudy added the help wanted Someone outside the Bazel team could own this label Feb 27, 2023
@aiuto
Copy link
Contributor

aiuto commented Jun 16, 2023

I started a playground to observe some actual behaviors. We'll need to know how macos behaves when it sees NFC style names.
https://github.com/aiuto/bazel_samples/tree/main/utf8

fmeum added a commit to fmeum/bazel that referenced this issue Oct 9, 2023
When creating a `PathFragment` from a ZIP or TAR entry file name, the raw bytes of the name are now wrapped into a Latin-1 encoded String, which is how Bazel internally represents file paths.

Previously, ZIP entries as well as TAR entries with PAX headers would result in ordinary decoded Java strings, resulting in corrupted file names when passed to Bazel's file system operations.

Fixes bazelbuild#12986

Fixes bazel-contrib/rules_go#2771

Closes bazelbuild#18448.

PiperOrigin-RevId: 571857847
Change-Id: Ie578724e75ddbefbe05255601b0afab706835f89
meteorcloudy pushed a commit that referenced this issue Oct 9, 2023
…mes (#19765)

When creating a `PathFragment` from a ZIP or TAR entry file name, the
raw bytes of the name are now wrapped into a Latin-1 encoded String,
which is how Bazel internally represents file paths.

Previously, ZIP entries as well as TAR entries with PAX headers would
result in ordinary decoded Java strings, resulting in corrupted file
names when passed to Bazel's file system operations.

Fixes #12986

Fixes bazel-contrib/rules_go#2771

Closes #18448.

PiperOrigin-RevId: 571857847
Change-Id: Ie578724e75ddbefbe05255601b0afab706835f89

Fixes #19671
@iancha1992
Copy link
Member

A fix for this issue has been included in Bazel 6.4.0 RC2. Please test out the release candidate and report any issues as soon as possible. Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
help wanted Someone outside the Bazel team could own this P4 This is either out of scope or we don't have bandwidth to review a PR. (No assignee) platform: apple team-Starlark-Integration Issues involving Bazel's integration with Starlark, excluding builtin symbols type: bug
Projects
None yet
Development

Successfully merging a pull request may close this issue.

8 participants