Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix most Unicode encoding bugs #24010

Closed
wants to merge 19 commits into from
Closed

Conversation

fmeum
Copy link
Collaborator

@fmeum fmeum commented Oct 16, 2024

Bazel aims to support arbitrary file system path encodings (even raw byte sequences) by attempting to force the JVM to use a Latin-1 locale for OS interactions. As a result, Bazel internally encodes Strings as raw byte arrays with a Latin-1 coder and no encoding information. Whenever it interacts with encoding-aware APIs, this may require a reencoding of the String contents, depending on the OS and availability of a Latin-1 locale.

This PR introduces the concepts of internal, Unicode, and platform strings and adds dedicated optimized functions for converting between these three types (see the class comment on the new StringEncoding helper class for details). These functions are then used to standardize and fix conversion throughout the code base. As a result, a number of new end-to-end integration tests for the handling of Unicode in file paths, command-line arguments and environment variables now pass.

Full support for Unicode beyond the current active code page on Windows is left to a follow-up PR as it may require patching the embedded JDK.

  • Replace ad-hoc conversion logic with the new consistent set of helper functions.
  • Make more parts of the Bazel client's Windows implementation Unicode-aware. This also fixes the behavior of SetEnv on Windows, which previously would remove an environment variable if passed an empty value for it, which doesn't match the Unix behavior.
  • Drop the charset parameter from all methods related to parameter files. The ISO-8859-1 vs. UTF-8 choice was flawed since Bazel's internal string representation doesn't maintain any encoding information - ISO-8859-1 just meant "write out raw bytes", which is the only choice that matches what arguments would look like if passed on the command line.
  • Convert server args to the internal string representation. The arguments for requests to the server were already converted to Bazel's internal string representation, which resulted in a mismatch between --client_cwd and --workspace_directory if the workspace path contains non-ASCII characters.
  • Read the downloader config using Bazel's filesystem implementation.
  • Make MacOSXFsEventsDiffAwareness UTF-8 aware. It previously used the GetStringUTF JNI method, which, despite its name, doesn't return the UTF-8 representation of a string, but modified CESU-8 (nobody ever wants this).
  • Correctly reencode path strings for LocalDiffAwareness.
  • Correctly reencode the value of user.dir.
  • Correctly turn ExecRequest fields into strings for ProcessBuilder for bazel --batch run. This makes it possible to reenable the test_consistent_command_line_encoding test, fixing Fix (and reenable) run_test.test_consistent_command_line_encoding #1775.
  • Fix encoding issues in TargetCompleteEvents.
  • Fix encoding issues in SubprocessFactory implementations.
  • Drop obsolete warning if file.encoding doesn't equal ISO-8859-1 as file names are encoded with sun.jnu.encoding now.
  • Consistently reencode internal strings passed into and out of FileSystem implementations, e.g. if reading a symlink target. Tests are added that verify the interaction between FileSystem implementations and the Java (N)IO APIs on Unicode file paths.

Fixes #1775
Fixes #11602
Fixes #18293
Work towards #374
Work towards #23859

@fmeum fmeum marked this pull request as ready for review October 16, 2024 17:54
@fmeum fmeum requested a review from a team as a code owner October 16, 2024 17:54
@github-actions github-actions bot added team-Performance Issues for Performance teams team-ExternalDeps External dependency handling, remote repositiories, WORKSPACE file. team-Remote-Exec Issues and PRs for the Execution (Remote) team awaiting-review PR is awaiting review from an assigned reviewer labels Oct 16, 2024
@fmeum fmeum marked this pull request as draft October 16, 2024 18:20
@fmeum
Copy link
Collaborator Author

fmeum commented Oct 16, 2024

Switching back to draft as there are some unexpected failures.

@fmeum fmeum force-pushed the 23859-unicode branch 3 times, most recently from c927001 to 793fd28 Compare October 16, 2024 20:48
@fmeum fmeum changed the title Add support for non-ASCII characters in workspace paths Add support for non-ASCII characters in workspace paths on Unix Oct 16, 2024
@fmeum fmeum force-pushed the 23859-unicode branch 14 times, most recently from 4a2476c to 531eaf0 Compare October 17, 2024 20:36
@fmeum
Copy link
Collaborator Author

fmeum commented Oct 31, 2024

The RBE environment doesn't have a Latin-1 locale installed. I am skipping the tests in that case, but maybe we could add it?

@fmeum
Copy link
Collaborator Author

fmeum commented Nov 4, 2024

I submitted the separate PR #24172 to force sun.jnu.encoding = "UTF-8" on Windows.

@lberki
Copy link
Contributor

lberki commented Nov 4, 2024

@tjgq, are you fine with the current state of this pull request?

If so, I'll take a look at the internal references to ISO_8859_1 and UTF_8 to see if there are any obvious landmines in the code. The internal test battery looks fine -- aside from the aforementioned breakages that have all been fixed since then, everything seems to be green.

@tjgq
Copy link
Contributor

tjgq commented Nov 4, 2024 via email

@lberki
Copy link
Contributor

lberki commented Nov 4, 2024

Ack, I'll put it on my TODO list then (I'm on duty this week, though, so my bandwidth is limited, so no guarantees as to when I get around doing this)

Copy link
Contributor

@tjgq tjgq left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@fmeum
Copy link
Collaborator Author

fmeum commented Nov 4, 2024

@tjgq Note that these call sites don't convert Strings to Strings, they turn Strings into ByteStrings. That never requires reencoding, it just means "give me the raw bytes of the internal encoding".

I actually had a helper function for this in StringEncoding, but it requires a dep on protobuf that is problematic for some consumers of StringEncoding (I recall some tricky issues with bootstrap_java_library). That's why I ended up removing it.

We could add an INTERNAL_CHARSET constant in StringEncoding that is set to ISO_8859_1 and may be flipped to UTF_8 in the future (but not necessarily, we may very well choose to continue to represent strings as UTF-8 in memory). What do you think?

@fmeum
Copy link
Collaborator Author

fmeum commented Nov 4, 2024

@bazel-io fork 8.0.0

@tjgq
Copy link
Contributor

tjgq commented Nov 4, 2024

@tjgq Note that these call sites don't convert Strings to Strings, they turn Strings into ByteStrings. That never requires reencoding, it just means "give me the raw bytes of the internal encoding".

I actually had a helper function for this in StringEncoding, but it requires a dep on protobuf that is problematic for some consumers of StringEncoding (I recall some tricky issues with bootstrap_java_library). That's why I ended up removing it.

We could add an INTERNAL_CHARSET constant in StringEncoding that is set to ISO_8859_1 and may be flipped to UTF_8 in the future (but not necessarily, we may very well choose to continue to represent strings as UTF-8 in memory). What do you think?

I was hoping for a helper similar STRING_UNSAFE.getInternalStringBytes which, in addition to being self-documenting, would also give us an additional opportunity to catch bugs. But if that creates a problematic dependency, I say let's leave it for a separate PR; I don't want to delay this further, since we still have an internal review to do and I'd very much like it to make it into the 8.0.0 release.

@fmeum fmeum mentioned this pull request Nov 5, 2024
@fmeum fmeum requested a review from tjgq November 5, 2024 15:57
@tjgq
Copy link
Contributor

tjgq commented Nov 5, 2024

@lberki lmk if you'd rather have me do the import instead

@lberki
Copy link
Contributor

lberki commented Nov 6, 2024

@tjgq I'll give it a stab now, then if I fail, I'll hand over the baton

@lberki
Copy link
Contributor

lberki commented Nov 7, 2024

Update: I imported the change and somewhat surprisingly, remote_execution_test.test_unicode_execution started failing, even though I only needed very minimal changes:

https://buildkite.com/bazel/google-bazel-presubmit/builds/85996#01930156-97b2-41cc-a133-8e331e07dd42

@tjgq , mind taking a look?

@tjgq
Copy link
Contributor

tjgq commented Nov 7, 2024

Update: I imported the change and somewhat surprisingly, remote_execution_test.test_unicode_execution started failing, even though I only needed very minimal changes:

https://buildkite.com/bazel/google-bazel-presubmit/builds/85996#01930156-97b2-41cc-a133-8e331e07dd42

@tjgq , mind taking a look?

This was due to a bad import that was missing the changes in src/tools/remote/BUILD.

@tjgq
Copy link
Contributor

tjgq commented Nov 7, 2024

I'm going to replace private static final Charset SUN_JNU_ENCODING in StringEncoding.java with private static final boolean SUN_JNU_ENCODING_IS_ISO_8859_1 because (1) reference equality isn't guaranteed to work for Charset and (2) we might as well cache the comparison, anyway.

@fmeum
Copy link
Collaborator Author

fmeum commented Nov 7, 2024

I'm going to replace private static final Charset SUN_JNU_ENCODING in StringEncoding.java with private static final boolean SUN_JNU_ENCODING_IS_ISO_8859_1 because (1) reference equality isn't guaranteed to work for Charset and (2) we might as well cache the comparison, anyway.

Sounds good, this avoids assumptions on what the JIT may inline.

@copybara-service copybara-service bot closed this in a58fe3f Nov 7, 2024
@github-actions github-actions bot removed the awaiting-review PR is awaiting review from an assigned reviewer label Nov 7, 2024
@lberki
Copy link
Contributor

lberki commented Nov 7, 2024

@fmeum thanks for this pull request and your persistence. It really is a monster-sized one that makes Bazel better in ways that I didn't think was possible incrementally.

@fmeum fmeum deleted the 23859-unicode branch November 7, 2024 15:53
@fmeum
Copy link
Collaborator Author

fmeum commented Nov 7, 2024

@fmeum thanks for this pull request and your persistence. It really is a monster-sized one that makes Bazel better in ways that I didn't think was possible incrementally.

Thanks, it took me multiple failed attempts before this one that seems to work. I have two follow-up PRs in the pipeline, but they won't be monster-sized. :⁠-⁠) Thanks for the thorough reviews!

iancha1992 pushed a commit to iancha1992/bazel that referenced this pull request Nov 8, 2024
Bazel aims to support arbitrary file system path encodings (even raw byte sequences) by attempting to force the JVM to use a Latin-1 locale for OS interactions. As a result, Bazel internally encodes `String`s as raw byte arrays with a Latin-1 coder and no encoding information. Whenever it interacts with encoding-aware APIs, this may require a reencoding of the `String` contents, depending on the OS and availability of a Latin-1 locale.

This PR introduces the concepts of *internal*, *Unicode*, and *platform* strings and adds dedicated optimized functions for converting between these three types (see the class comment on the new `StringEncoding` helper class for details). These functions are then used to standardize and fix conversion throughout the code base. As a result, a number of new end-to-end integration tests for the handling of Unicode in file paths, command-line arguments and environment variables now pass.

Full support for Unicode beyond the current active code page on Windows is left to a follow-up PR as it may require patching the embedded JDK.

* Replace ad-hoc conversion logic with the new consistent set of helper functions.
* Make more parts of the Bazel client's Windows implementation Unicode-aware. This also fixes the behavior of `SetEnv` on Windows, which previously would remove an environment variable if passed an empty value for it, which doesn't match the Unix behavior.
* Drop the `charset` parameter from all methods related to parameter files. The `ISO-8859-1` vs. `UTF-8` choice was flawed since Bazel's internal string representation doesn't maintain any encoding information - `ISO-8859-1` just meant "write out raw bytes", which is the only choice that matches what arguments would look like if passed on the command line.
* Convert server args to the internal string representation. The arguments for requests to the server were already converted to Bazel's internal string representation, which resulted in a mismatch between `--client_cwd` and `--workspace_directory` if the workspace path contains non-ASCII characters.
* Read the downloader config using Bazel's filesystem implementation.
* Make `MacOSXFsEventsDiffAwareness` UTF-8 aware. It previously used the `GetStringUTF` JNI method, which, despite its name, doesn't return the UTF-8 representation of a string, but modified CESU-8 (nobody ever wants this).
* Correctly reencode path strings for `LocalDiffAwareness`.
* Correctly reencode the value of `user.dir`.
* Correctly turn `ExecRequest` fields into strings for `ProcessBuilder` for `bazel --batch run`. This makes it possible to reenable the `test_consistent_command_line_encoding` test, fixing bazelbuild#1775.
* Fix encoding issues in `TargetCompleteEvents`.
* Fix encoding issues in `SubprocessFactory` implementations.
* Drop obsolete warning if `file.encoding` doesn't equal `ISO-8859-1` as file names are encoded with `sun.jnu.encoding` now.
* Consistently reencode internal strings passed into and out of `FileSystem` implementations, e.g. if reading a symlink target. Tests are added that verify the interaction between `FileSystem` implementations and the Java (N)IO APIs on Unicode file paths.

Fixes bazelbuild#1775.

Fixes bazelbuild#11602.

Fixes bazelbuild#18293.

Work towards #374.

Work towards bazelbuild#23859.

Closes bazelbuild#24010.

PiperOrigin-RevId: 694114597
Change-Id: I5bdcbc14a90dd1f0f34698aebcbd07cd2bde7a23
ramil-bitrise pushed a commit to bitrise-io/bazel that referenced this pull request Dec 18, 2024
Bazel aims to support arbitrary file system path encodings (even raw byte sequences) by attempting to force the JVM to use a Latin-1 locale for OS interactions. As a result, Bazel internally encodes `String`s as raw byte arrays with a Latin-1 coder and no encoding information. Whenever it interacts with encoding-aware APIs, this may require a reencoding of the `String` contents, depending on the OS and availability of a Latin-1 locale.

This PR introduces the concepts of *internal*, *Unicode*, and *platform* strings and adds dedicated optimized functions for converting between these three types (see the class comment on the new `StringEncoding` helper class for details). These functions are then used to standardize and fix conversion throughout the code base. As a result, a number of new end-to-end integration tests for the handling of Unicode in file paths, command-line arguments and environment variables now pass.

Full support for Unicode beyond the current active code page on Windows is left to a follow-up PR as it may require patching the embedded JDK.

* Replace ad-hoc conversion logic with the new consistent set of helper functions.
* Make more parts of the Bazel client's Windows implementation Unicode-aware. This also fixes the behavior of `SetEnv` on Windows, which previously would remove an environment variable if passed an empty value for it, which doesn't match the Unix behavior.
* Drop the `charset` parameter from all methods related to parameter files. The `ISO-8859-1` vs. `UTF-8` choice was flawed since Bazel's internal string representation doesn't maintain any encoding information - `ISO-8859-1` just meant "write out raw bytes", which is the only choice that matches what arguments would look like if passed on the command line.
* Convert server args to the internal string representation. The arguments for requests to the server were already converted to Bazel's internal string representation, which resulted in a mismatch between `--client_cwd` and `--workspace_directory` if the workspace path contains non-ASCII characters.
* Read the downloader config using Bazel's filesystem implementation.
* Make `MacOSXFsEventsDiffAwareness` UTF-8 aware. It previously used the `GetStringUTF` JNI method, which, despite its name, doesn't return the UTF-8 representation of a string, but modified CESU-8 (nobody ever wants this).
* Correctly reencode path strings for `LocalDiffAwareness`.
* Correctly reencode the value of `user.dir`.
* Correctly turn `ExecRequest` fields into strings for `ProcessBuilder` for `bazel --batch run`. This makes it possible to reenable the `test_consistent_command_line_encoding` test, fixing bazelbuild#1775.
* Fix encoding issues in `TargetCompleteEvents`.
* Fix encoding issues in `SubprocessFactory` implementations.
* Drop obsolete warning if `file.encoding` doesn't equal `ISO-8859-1` as file names are encoded with `sun.jnu.encoding` now.
* Consistently reencode internal strings passed into and out of `FileSystem` implementations, e.g. if reading a symlink target. Tests are added that verify the interaction between `FileSystem` implementations and the Java (N)IO APIs on Unicode file paths.

Fixes bazelbuild#1775.

Fixes bazelbuild#11602.

Fixes bazelbuild#18293.

Work towards bazelbuild#374.

Work towards bazelbuild#23859.

Closes bazelbuild#24010.

PiperOrigin-RevId: 694114597
Change-Id: I5bdcbc14a90dd1f0f34698aebcbd07cd2bde7a23
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
team-Configurability platforms, toolchains, cquery, select(), config transitions team-ExternalDeps External dependency handling, remote repositiories, WORKSPACE file. team-Performance Issues for Performance teams team-Remote-Exec Issues and PRs for the Execution (Remote) team team-Rules-CPP Issues for C++ rules team-Rules-Java Issues for Java rules
Projects
None yet
7 participants