Fix most Unicode encoding bugs #24010
Switching back to draft as there are some unexpected failures. |
The RBE environment doesn't have a Latin-1 locale installed. I am skipping the tests in that case, but maybe we could add it? |
I submitted the separate PR #24172 to force |
@tjgq, are you fine with the current state of this pull request? If so, I'll take a look at the internal references to ISO_8859_1 and UTF_8 to see if there are any obvious landmines in the code. The internal test battery looks fine -- aside from the aforementioned breakages that have all been fixed since then, everything seems to be green. |
Yes, I am happy with the current state. |
Ack, I'll put it on my TODO list then (I'm on duty this week, though, so my bandwidth is limited and there are no guarantees as to when I get around to doing this) |
I went on a hunt for remaining occurrences of ISO_8859_1 and I found these, which (I think) are in scope for this PR:
https://cs.opensource.google/bazel/bazel/+/master:src/main/java/com/google/devtools/build/lib/runtime/commands/RunCommand.java;l=529;drc=e60d88672cc4999e057f1c12bac2a1380d7e0a50 (and similar ones below)
https://cs.opensource.google/bazel/bazel/+/master:src/main/java/com/google/devtools/build/lib/runtime/mobileinstall/MobileInstallCommand.java;l=377;drc=73b0da986f5b18fbc208910c82e96a32e90d35db (and similar ones below)
Can we replace them with the appropriate StringEncoding methods?
@tjgq Note that these call sites don't convert I actually had a helper function for this in We could add an |
@bazel-io fork 8.0.0 |
I was hoping for a helper similar |
@lberki lmk if you'd rather have me do the import instead |
@tjgq I'll give it a stab now, then if I fail, I'll hand over the baton |
Update: I imported the change and somewhat surprisingly, https://buildkite.com/bazel/google-bazel-presubmit/builds/85996#01930156-97b2-41cc-a133-8e331e07dd42 @tjgq , mind taking a look? |
This was due to a bad import that was missing the changes in src/tools/remote/BUILD. |
I'm going to replace |
Sounds good, this avoids assumptions on what the JIT may inline. |
@fmeum thanks for this pull request and your persistence. It really is a monster-sized one that makes Bazel better in ways that I didn't think were possible incrementally. |
Thanks, it took me multiple failed attempts before this one that seems to work. I have two follow-up PRs in the pipeline, but they won't be monster-sized. :-) Thanks for the thorough reviews! |
Bazel aims to support arbitrary file system path encodings (even raw byte sequences) by attempting to force the JVM to use a Latin-1 locale for OS interactions. As a result, Bazel internally encodes `String`s as raw byte arrays with a Latin-1 coder and no encoding information. Whenever it interacts with encoding-aware APIs, this may require a reencoding of the `String` contents, depending on the OS and availability of a Latin-1 locale.

This PR introduces the concepts of *internal*, *Unicode*, and *platform* strings and adds dedicated optimized functions for converting between these three types (see the class comment on the new `StringEncoding` helper class for details). These functions are then used to standardize and fix conversion throughout the code base. As a result, a number of new end-to-end integration tests for the handling of Unicode in file paths, command-line arguments and environment variables now pass.

Full support for Unicode beyond the current active code page on Windows is left to a follow-up PR as it may require patching the embedded JDK.

* Replace ad-hoc conversion logic with the new consistent set of helper functions.
* Make more parts of the Bazel client's Windows implementation Unicode-aware. This also fixes the behavior of `SetEnv` on Windows, which previously would remove an environment variable if passed an empty value for it, which doesn't match the Unix behavior.
* Drop the `charset` parameter from all methods related to parameter files. The `ISO-8859-1` vs. `UTF-8` choice was flawed since Bazel's internal string representation doesn't maintain any encoding information - `ISO-8859-1` just meant "write out raw bytes", which is the only choice that matches what arguments would look like if passed on the command line.
* Convert server args to the internal string representation. The arguments for requests to the server were already converted to Bazel's internal string representation, which resulted in a mismatch between `--client_cwd` and `--workspace_directory` if the workspace path contains non-ASCII characters.
* Read the downloader config using Bazel's filesystem implementation.
* Make `MacOSXFsEventsDiffAwareness` UTF-8 aware. It previously used the `GetStringUTF` JNI method, which, despite its name, doesn't return the UTF-8 representation of a string, but modified CESU-8 (nobody ever wants this).
* Correctly reencode path strings for `LocalDiffAwareness`.
* Correctly reencode the value of `user.dir`.
* Correctly turn `ExecRequest` fields into strings for `ProcessBuilder` for `bazel --batch run`. This makes it possible to reenable the `test_consistent_command_line_encoding` test, fixing bazelbuild#1775.
* Fix encoding issues in `TargetCompleteEvents`.
* Fix encoding issues in `SubprocessFactory` implementations.
* Drop obsolete warning if `file.encoding` doesn't equal `ISO-8859-1` as file names are encoded with `sun.jnu.encoding` now.
* Consistently reencode internal strings passed into and out of `FileSystem` implementations, e.g. if reading a symlink target. Tests are added that verify the interaction between `FileSystem` implementations and the Java (N)IO APIs on Unicode file paths.

Fixes bazelbuild#1775.
Fixes bazelbuild#11602.
Fixes bazelbuild#18293.
Work towards #374.
Work towards bazelbuild#23859.

Closes bazelbuild#24010.

PiperOrigin-RevId: 694114597
Change-Id: I5bdcbc14a90dd1f0f34698aebcbd07cd2bde7a23
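To make the distinction between *internal* and *Unicode* strings concrete, here is a minimal Java sketch. It assumes the raw bytes of a path are UTF-8 (typical on Linux and macOS, but not guaranteed). The class and method names are hypothetical illustrations and are not the actual `StringEncoding` API, which is described as optimized and may work differently.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class InternalStringSketch {
  // Hypothetical helper: stores each UTF-8 byte of a regular Java string
  // as one Latin-1 char, i.e. "raw bytes with no encoding information".
  static String unicodeToInternalSketch(String unicode) {
    byte[] raw = unicode.getBytes(StandardCharsets.UTF_8);
    return new String(raw, StandardCharsets.ISO_8859_1);
  }

  // Hypothetical inverse: recovers the raw bytes and decodes them as UTF-8.
  // This is only lossless if the raw bytes really are valid UTF-8.
  static String internalToUnicodeSketch(String internal) {
    byte[] raw = internal.getBytes(StandardCharsets.ISO_8859_1);
    return new String(raw, StandardCharsets.UTF_8);
  }

  public static void main(String[] args) {
    String unicode = "\u00e4\u00f6\u00fc.txt"; // "äöü.txt", 7 chars
    String internal = unicodeToInternalSketch(unicode);

    // Each umlaut takes two UTF-8 bytes, so the internal form has 10 chars.
    System.out.println(unicode.length() + " -> " + internal.length());

    // Writing the internal string as Latin-1 reproduces the original bytes,
    // which is what "write out raw bytes" amounts to for parameter files.
    System.out.println(Arrays.equals(
        internal.getBytes(StandardCharsets.ISO_8859_1),
        unicode.getBytes(StandardCharsets.UTF_8)));

    System.out.println(internalToUnicodeSketch(internal).equals(unicode));
  }
}
```

Passing an internal string directly to an encoding-aware API without converting it back would show mojibake such as `Ã¤` instead of `ä`, which is the class of bug the description refers to.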
Bazel aims to support arbitrary file system path encodings (even raw byte sequences) by attempting to force the JVM to use a Latin-1 locale for OS interactions. As a result, Bazel internally encodes `String`s as raw byte arrays with a Latin-1 coder and no encoding information. Whenever it interacts with encoding-aware APIs, this may require a reencoding of the `String` contents, depending on the OS and availability of a Latin-1 locale.

This PR introduces the concepts of *internal*, *Unicode*, and *platform* strings and adds dedicated optimized functions for converting between these three types (see the class comment on the new `StringEncoding` helper class for details). These functions are then used to standardize and fix conversion throughout the code base. As a result, a number of new end-to-end integration tests for the handling of Unicode in file paths, command-line arguments and environment variables now pass.

Full support for Unicode beyond the current active code page on Windows is left to a follow-up PR as it may require patching the embedded JDK.

* Replace ad-hoc conversion logic with the new consistent set of helper functions.
* Make more parts of the Bazel client's Windows implementation Unicode-aware. This also fixes the behavior of `SetEnv` on Windows, which previously would remove an environment variable if passed an empty value for it, which doesn't match the Unix behavior.
* Drop the `charset` parameter from all methods related to parameter files. The `ISO-8859-1` vs. `UTF-8` choice was flawed since Bazel's internal string representation doesn't maintain any encoding information - `ISO-8859-1` just meant "write out raw bytes", which is the only choice that matches what arguments would look like if passed on the command line.
* Convert server args to the internal string representation. The arguments for requests to the server were already converted to Bazel's internal string representation, which resulted in a mismatch between `--client_cwd` and `--workspace_directory` if the workspace path contains non-ASCII characters.
* Read the downloader config using Bazel's filesystem implementation.
* Make `MacOSXFsEventsDiffAwareness` UTF-8 aware. It previously used the `GetStringUTF` JNI method, which, despite its name, doesn't return the UTF-8 representation of a string, but modified CESU-8 (nobody ever wants this).
* Correctly reencode path strings for `LocalDiffAwareness`.
* Correctly reencode the value of `user.dir`.
* Correctly turn `ExecRequest` fields into strings for `ProcessBuilder` for `bazel --batch run`. This makes it possible to reenable the `test_consistent_command_line_encoding` test, fixing #1775.
* Fix encoding issues in `TargetCompleteEvents`.
* Fix encoding issues in `SubprocessFactory` implementations.
* Drop obsolete warning if `file.encoding` doesn't equal `ISO-8859-1` as file names are encoded with `sun.jnu.encoding` now.
* Consistently reencode internal strings passed into and out of `FileSystem` implementations, e.g. if reading a symlink target. Tests are added that verify the interaction between `FileSystem` implementations and the Java (N)IO APIs on Unicode file paths.

Fixes #1775
Fixes #11602
Fixes #18293
Work towards #374
Work towards #23859
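
The `GetStringUTF` pitfall mentioned in the description can be observed from plain Java, without JNI. This is a sketch under one assumption: that the JNI string functions emit the same "modified UTF-8" format that `DataOutputStream.writeUTF` documents. The class and helper names are made up for illustration.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class ModifiedUtf8Sketch {
  // Hypothetical helper: encodes s as modified UTF-8 via writeUTF and
  // strips the 2-byte length prefix that writeUTF puts in front.
  static byte[] modifiedUtf8(String s) {
    ByteArrayOutputStream buffer = new ByteArrayOutputStream();
    try (DataOutputStream out = new DataOutputStream(buffer)) {
      out.writeUTF(s);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    byte[] withLength = buffer.toByteArray();
    return Arrays.copyOfRange(withLength, 2, withLength.length);
  }

  public static void main(String[] args) {
    String emoji = "\uD83D\uDE00"; // U+1F600, outside the Basic Multilingual Plane
    String nul = "\0";

    // Standard UTF-8 uses one 4-byte sequence for a supplementary character.
    System.out.println(emoji.getBytes(StandardCharsets.UTF_8).length); // 4
    // Modified UTF-8 encodes each surrogate separately as 3 bytes.
    System.out.println(modifiedUtf8(emoji).length); // 6

    // NUL is a single zero byte in UTF-8 but two bytes (0xC0 0x80) here.
    System.out.println(nul.getBytes(StandardCharsets.UTF_8).length); // 1
    System.out.println(modifiedUtf8(nul).length); // 2
  }
}
```

For strings limited to non-NUL characters in the Basic Multilingual Plane the two encodings agree, which is why a bug of this kind can go unnoticed until a path contains an emoji or another supplementary character.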