Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[8.0.0] Fix most Unicode encoding bugs #24260

Closed
wants to merge 1 commit into from

Conversation

iancha1992
Copy link
Member

Bazel aims to support arbitrary file system path encodings (even raw byte sequences) by attempting to force the JVM to use a Latin-1 locale for OS interactions. As a result, Bazel internally encodes Strings as raw byte arrays with a Latin-1 coder and no encoding information. Whenever it interacts with encoding-aware APIs, this may require a reencoding of the String contents, depending on the OS and availability of a Latin-1 locale.

This PR introduces the concepts of internal, Unicode, and platform strings and adds dedicated optimized functions for converting between these three types (see the class comment on the new StringEncoding helper class for details). These functions are then used to standardize and fix conversion throughout the code base. As a result, a number of new end-to-end integration tests for the handling of Unicode in file paths, command-line arguments and environment variables now pass.

Full support for Unicode beyond the current active code page on Windows is left to a follow-up PR as it may require patching the embedded JDK.

  • Replace ad-hoc conversion logic with the new consistent set of helper functions.
  • Make more parts of the Bazel client's Windows implementation Unicode-aware. This also fixes the behavior of SetEnv on Windows, which previously would remove an environment variable if passed an empty value for it, which doesn't match the Unix behavior.
  • Drop the charset parameter from all methods related to parameter files. The ISO-8859-1 vs. UTF-8 choice was flawed since Bazel's internal string representation doesn't maintain any encoding information - ISO-8859-1 just meant "write out raw bytes", which is the only choice that matches what arguments would look like if passed on the command line.
  • Convert server args to the internal string representation. The arguments for requests to the server were already converted to Bazel's internal string representation, which resulted in a mismatch between --client_cwd and --workspace_directory if the workspace path contains non-ASCII characters.
  • Read the downloader config using Bazel's filesystem implementation.
  • Make MacOSXFsEventsDiffAwareness UTF-8 aware. It previously used the GetStringUTF JNI method, which, despite its name, doesn't return the UTF-8 representation of a string, but modified CESU-8 (nobody ever wants this).
  • Correctly reencode path strings for LocalDiffAwareness.
  • Correctly reencode the value of user.dir.
  • Correctly turn ExecRequest fields into strings for ProcessBuilder for bazel --batch run. This makes it possible to reenable the test_consistent_command_line_encoding test, fixing Fix (and reenable) run_test.test_consistent_command_line_encoding #1775.
  • Fix encoding issues in TargetCompleteEvents.
  • Fix encoding issues in SubprocessFactory implementations.
  • Drop obsolete warning if file.encoding doesn't equal ISO-8859-1 as file names are encoded with sun.jnu.encoding now.
  • Consistently reencode internal strings passed into and out of FileSystem implementations, e.g. if reading a symlink target. Tests are added that verify the interaction between FileSystem implementations and the Java (N)IO APIs on Unicode file paths.

Fixes #1775.

Fixes #11602.

Fixes #18293.

Work towards #374.

Work towards #23859.

Closes #24010.

PiperOrigin-RevId: 694114597
Change-Id: I5bdcbc14a90dd1f0f34698aebcbd07cd2bde7a23

Commit a58fe3f

Bazel aims to support arbitrary file system path encodings (even raw byte sequences) by attempting to force the JVM to use a Latin-1 locale for OS interactions. As a result, Bazel internally encodes `String`s as raw byte arrays with a Latin-1 coder and no encoding information. Whenever it interacts with encoding-aware APIs, this may require a reencoding of the `String` contents, depending on the OS and availability of a Latin-1 locale.

This PR introduces the concepts of *internal*, *Unicode*, and *platform* strings and adds dedicated optimized functions for converting between these three types (see the class comment on the new `StringEncoding` helper class for details). These functions are then used to standardize and fix conversion throughout the code base. As a result, a number of new end-to-end integration tests for the handling of Unicode in file paths, command-line arguments and environment variables now pass.

Full support for Unicode beyond the current active code page on Windows is left to a follow-up PR as it may require patching the embedded JDK.

* Replace ad-hoc conversion logic with the new consistent set of helper functions.
* Make more parts of the Bazel client's Windows implementation Unicode-aware. This also fixes the behavior of `SetEnv` on Windows, which previously would remove an environment variable if passed an empty value for it, which doesn't match the Unix behavior.
* Drop the `charset` parameter from all methods related to parameter files. The `ISO-8859-1` vs. `UTF-8` choice was flawed since Bazel's internal string representation doesn't maintain any encoding information - `ISO-8859-1` just meant "write out raw bytes", which is the only choice that matches what arguments would look like if passed on the command line.
* Convert server args to the internal string representation. The arguments for requests to the server were already converted to Bazel's internal string representation, which resulted in a mismatch between `--client_cwd` and `--workspace_directory` if the workspace path contains non-ASCII characters.
* Read the downloader config using Bazel's filesystem implementation.
* Make `MacOSXFsEventsDiffAwareness` UTF-8 aware. It previously used the `GetStringUTF` JNI method, which, despite its name, doesn't return the UTF-8 representation of a string, but modified CESU-8 (nobody ever wants this).
* Correctly reencode path strings for `LocalDiffAwareness`.
* Correctly reencode the value of `user.dir`.
* Correctly turn `ExecRequest` fields into strings for `ProcessBuilder` for `bazel --batch run`. This makes it possible to reenable the `test_consistent_command_line_encoding` test, fixing bazelbuild#1775.
* Fix encoding issues in `TargetCompleteEvents`.
* Fix encoding issues in `SubprocessFactory` implementations.
* Drop obsolete warning if `file.encoding` doesn't equal `ISO-8859-1` as file names are encoded with `sun.jnu.encoding` now.
* Consistently reencode internal strings passed into and out of `FileSystem` implementations, e.g. if reading a symlink target. Tests are added that verify the interaction between `FileSystem` implementations and the Java (N)IO APIs on Unicode file paths.

Fixes bazelbuild#1775.

Fixes bazelbuild#11602.

Fixes bazelbuild#18293.

Work towards #374.

Work towards bazelbuild#23859.

Closes bazelbuild#24010.

PiperOrigin-RevId: 694114597
Change-Id: I5bdcbc14a90dd1f0f34698aebcbd07cd2bde7a23
@iancha1992 iancha1992 added team-Performance Issues for Performance teams team-Configurability platforms, toolchains, cquery, select(), config transitions team-ExternalDeps External dependency handling, remote repositiories, WORKSPACE file. team-Rules-Java Issues for Java rules team-Rules-CPP Issues for C++ rules team-Remote-Exec Issues and PRs for the Execution (Remote) team labels Nov 8, 2024
@iancha1992 iancha1992 requested a review from a team as a code owner November 8, 2024 21:56
@github-actions github-actions bot added team-Local-Exec Issues and PRs for the Execution (Local) team awaiting-review PR is awaiting review from an assigned reviewer labels Nov 8, 2024
@meteorcloudy
Copy link
Member

FYI @fmeum please take a look at the test failures

@fmeum
Copy link
Collaborator

fmeum commented Nov 11, 2024

@iancha1992 This patch should fix the build, could you apply it?

diff --git a/src/main/java/com/google/devtools/build/lib/BUILD b/src/main/java/com/google/devtools/build/lib/BUILD
index 71deae4ac9..898ef5a81d 100644
--- a/src/main/java/com/google/devtools/build/lib/BUILD
+++ b/src/main/java/com/google/devtools/build/lib/BUILD
@@ -325,6 +325,7 @@ java_library(
         "//src/main/java/com/google/devtools/build/lib/actions:file_metadata",
         "//src/main/java/com/google/devtools/build/lib/actions:important_output_handler",
         "//src/main/java/com/google/devtools/build/lib/actions:localhost_capacity",
+        "//src/main/java/com/google/devtools/build/lib/actions:middleman_type",
         "//src/main/java/com/google/devtools/build/lib/actions:package_roots",
         "//src/main/java/com/google/devtools/build/lib/actions:resource_manager",
         "//src/main/java/com/google/devtools/build/lib/actions:runfiles_tree",

Copy link
Contributor

@tjgq tjgq left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The original (head) PR is getting rolled back - please don't submit this cherry pick until we figure out what to do

@iancha1992 iancha1992 disabled auto-merge November 11, 2024 19:01
@haxorz
Copy link
Contributor

haxorz commented Nov 11, 2024

The original (head) PR is getting rolled back - please don't submit this cherry pick until we figure out what to do

fc6b384

@tjgq tjgq closed this Nov 11, 2024
@github-actions github-actions bot removed the awaiting-review PR is awaiting review from an assigned reviewer label Nov 11, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
team-Configurability platforms, toolchains, cquery, select(), config transitions team-ExternalDeps External dependency handling, remote repositiories, WORKSPACE file. team-Local-Exec Issues and PRs for the Execution (Local) team team-Performance Issues for Performance teams team-Remote-Exec Issues and PRs for the Execution (Remote) team team-Rules-CPP Issues for C++ rules team-Rules-Java Issues for Java rules
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants