Repo rules fail to extract unicode archives due to latin-1 hack #12986
Comments
I happened to do some debugging of this yesterday - the problematic call chain is that bazel/src/main/java/com/google/devtools/build/lib/bazel/repository/CompressedTarFunction.java Line 104 in 2910d3d
bazel/src/main/native/unix_jni.cc Lines 1045 to 1058 in 2910d3d
Ideally we could use the raw buffer of bytes from the tar header to pass to |
cc @tetromino This is a nice summary of the macOS file handling strangeness I was talking about.
Addressing the unicode handling situation for strings (replacing latin-1 with utf-8) is on the schedule for this year, perhaps in Q2.
hi! Here is a victim of this bug (I'm using Ubuntu 20.04 with ZFS and, seems like, it was set up with an |
in the recent version of rules_go, the issue bazel-contrib/rules_go#2771 was fixed. It should address the bazel build issue on some Linux or macOS systems (bazelbuild/bazel#12986) Signed-off-by: Ivan Zemlyanskiy <[email protected]>
in the recent version of rules_go, the issue bazel-contrib/rules_go#2771 was fixed. It should address the bazel build issue on some Linux or macOS systems (bazelbuild/bazel#12986) Signed-off-by: izemlyanskiy <[email protected]>
in the recent version of rules_go, the issue bazel-contrib/rules_go#2771 was fixed. It should address the bazel build issue on some Linux or macOS systems (bazelbuild/bazel#12986) Signed-off-by: izemlyanskiy <[email protected]> Signed-off-by: Gokul Nair <[email protected]>
Update: We haven't had bandwidth to work on eliminating the latin-1 hack, but may reconsider after 2023 Q1.
I started a playground to observe some actual behaviors. We'll need to know how macOS behaves when it sees NFC style names.
When creating a `PathFragment` from a ZIP or TAR entry file name, the raw bytes of the name are now wrapped into a Latin-1 encoded String, which is how Bazel internally represents file paths. Previously, ZIP entries as well as TAR entries with PAX headers would result in ordinary decoded Java strings, resulting in corrupted file names when passed to Bazel's file system operations. Fixes bazelbuild#12986 Fixes bazel-contrib/rules_go#2771 Closes bazelbuild#18448. PiperOrigin-RevId: 571857847 Change-Id: Ie578724e75ddbefbe05255601b0afab706835f89
…mes (#19765) When creating a `PathFragment` from a ZIP or TAR entry file name, the raw bytes of the name are now wrapped into a Latin-1 encoded String, which is how Bazel internally represents file paths. Previously, ZIP entries as well as TAR entries with PAX headers would result in ordinary decoded Java strings, resulting in corrupted file names when passed to Bazel's file system operations. Fixes #12986 Fixes bazel-contrib/rules_go#2771 Closes #18448. PiperOrigin-RevId: 571857847 Change-Id: Ie578724e75ddbefbe05255601b0afab706835f89 Fixes #19671
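To illustrate the idea in that commit message, here is a minimal sketch in Python (not Bazel's Java code). The helper names `wrap_raw_name` and `to_filesystem_bytes` are hypothetical; the only assumption is that the internal path representation holds one char per raw byte, as the message describes.

```python
def wrap_raw_name(raw: bytes) -> str:
    """Wrap raw entry-name bytes in a string, one code point per byte (Latin-1)."""
    return raw.decode("latin-1")


def to_filesystem_bytes(internal: str) -> bytes:
    """Recover the original bytes from the Latin-1-wrapped internal form."""
    return internal.encode("latin-1")


# UTF-8 bytes as they would appear in an archive entry name.
composed_raw = "\u00c4foo.go".encode("utf-8")     # C3 84 ...
decomposed_raw = "A\u0308foo.go".encode("utf-8")  # 41 CC 88 ...

# Because no UTF-8 decoding happens, both normalization forms survive unchanged.
round_trips = [
    to_filesystem_bytes(wrap_raw_name(raw)) == raw
    for raw in (composed_raw, decomposed_raw)
]
```

Since the name is never decoded as UTF-8, the bytes that reach the filesystem are exactly the bytes stored in the archive, whichever normalization form the archive used.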
A fix for this issue has been included in Bazel 6.4.0 RC2. Please test out the release candidate and report any issues as soon as possible. Thanks!
Description of the problem / feature request:
`rules_go` currently fails on some machines due to some unicode characters included in filenames within the Go source archive - specifically the character Ä. For Linux and macOS, Go archives are distributed as `tar.gz` files with pax headers in the tar files.

Affected systems include:

… `utf8only` option. This is Ubuntu's default when choosing ZFS at install time.

Bazel uses Apache Commons Compress to extract tar archives. For most tar files, Commons Compress defers to the encoding specified by the JVM's `-Dfile.encoding` param, or the platform default. With ISO-8859-1 - Bazel's preference - UTF-8 encoded filename bytes in tar files basically pass through verbatim when extracted and everything works.

But when the tar entry has a pax `path` header, the path name is always decoded as UTF-8.

The character Ä in its composed form is unicode character `U+00C4`. In Java's internal UTF-16, it's simply represented as `0xC4`. But encoded as UTF-8 it becomes the multi-byte sequence `0xC3 0x84` since `0xC4` as a single byte is not a valid UTF-8 value.

When Commons Compress parses a pax-formatted tar file with a filename containing Ä as the `0xC3 0x84` UTF-8 string, the resulting Java string contains the value `0xC4` after decoding. But this value is never re-encoded as UTF-8 when creating the file on the filesystem. Instead, Bazel uses the Java char values verbatim as long as they're < 0xff. An attempt to create a filename containing `0xC4` on a filesystem that requires UTF-8 filenames will fail.

But `rules_go` doesn't currently fail on macOS systems despite them also requiring UTF-8 filenames. This is because the `darwin` archives use decomposed representations of unicode characters. OS X has a history of preferring the decomposed forms over composed.

So instead of Ä being `U+00C4` ("Latin Capital Letter A with Diaeresis"), it's `U+0041` (just capital A) followed by `U+0308` ("Combining Diaeresis"). Encoded in UTF-8 as seen in the macOS golang tarballs, the byte string is `0x41 0xCC 0x88`. Decoded to a Java string (16-bit chars) it's `0x0041 0x0308`. Coincidentally I presume, Bazel is able to extract this decomposed form on UTF-8 filesystems because it ignores the diaeresis and replaces it with a literal `'?'` character. So instead of `Äfoo.go` as is contained in the Go source archive, Bazel writes `A?foo.go` on macOS.

Like Linux on ZFS, Bazel fails to extract the linux archive on macOS as reported here.
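The byte-level behaviour described above can be reproduced with a small Python sketch. It is only a model of the description, not Bazel's actual code: `latin1_hack_encode` is a hypothetical stand-in for the native conversion, and treating every code point up to `0xFF` as a single byte (and anything larger as `'?'`) is an assumption based on the text above.

```python
import unicodedata


def latin1_hack_encode(path: str) -> bytes:
    """Model of the latin-1 hack: one byte per char, '?' for chars above 0xFF (assumed cutoff)."""
    out = bytearray()
    for ch in path:
        cp = ord(ch)
        out.append(cp if cp <= 0xFF else ord("?"))
    return bytes(out)


def is_valid_utf8(raw: bytes) -> bool:
    """Return True if raw decodes cleanly as UTF-8."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


composed = "\u00c4foo.go"                            # U+00C4 (linux archive)
decomposed = unicodedata.normalize("NFD", composed)  # U+0041 U+0308 (darwin archive)

tar_bytes = composed.encode("utf-8")                 # C3 84 66 6F 6F ...

# Plain tar entry: bytes decoded as ISO-8859-1 pass through verbatim.
passthrough = latin1_hack_encode(tar_bytes.decode("latin-1"))

# pax `path` header: bytes decoded as UTF-8, then written without re-encoding.
pax_written = latin1_hack_encode(tar_bytes.decode("utf-8"))

# Decomposed form via a pax header: the combining diaeresis becomes '?'.
decomposed_written = latin1_hack_encode(decomposed.encode("utf-8").decode("utf-8"))
```

In this model `passthrough` equals the original UTF-8 bytes, `pax_written` contains a lone `0xC4` byte that a UTF-8-only filesystem rejects, and `decomposed_written` is the plain ASCII name `A?foo.go`.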
Bugs: what's the simplest, easiest way to reproduce this bug? Please provide a minimal example if possible.
test.bzl
WORKSPACE
BUILD
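Independently of the repro files listed above (whose contents are not reproduced here), a test archive with the relevant property can be built with Python's standard `tarfile` module. This is a sketch rather than the reporter's setup: `make_pax_archive` is a hypothetical helper, and the archive is left uncompressed for simplicity. With `PAX_FORMAT`, a non-ASCII entry name is stored in a pax `path` record encoded as UTF-8.

```python
import io
import tarfile


def make_pax_archive(name: str, payload: bytes) -> bytes:
    """Build an in-memory, uncompressed pax-format tar holding one file."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w", format=tarfile.PAX_FORMAT) as tar:
        info = tarfile.TarInfo(name=name)
        info.size = len(payload)
        tar.addfile(info, io.BytesIO(payload))
    return buf.getvalue()


entry_name = "\u00c4foo.go"  # composed form, U+00C4
archive = make_pax_archive(entry_name, b"package main\n")

# The pax extended header carries the name as a UTF-8 `path=` record.
has_pax_path = (b"path=" + entry_name.encode("utf-8")) in archive

# Reading the archive back decodes the pax path as UTF-8.
with tarfile.open(fileobj=io.BytesIO(archive)) as tar:
    names = tar.getnames()
```

Writing the returned bytes to a file and pointing a repository rule at it should exercise the pax `path` code path described in this issue.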
What operating system are you running Bazel on?
Mac OS 10.15.7
Ubuntu 20.04 with a ZFS root partition with `utf8only` enabled (the default for Ubuntu's ZFS support).

What's the output of `bazel info release`?
release 4.0.0
Have you found anything relevant by searching the web?
bazel-contrib/rules_go#2771
#374 - pretty generic issue regarding filename characters.
#7055 - an issue with the same problematic file in the Go archive, but targeted at Darwin only.