Unable to use Bazel in a folder that has Japanese characters in it #23859

PSmithsonp4 · 2024-10-03T08:34:44Z

Description of the bug:

If you work in a folder that has Japanese characters in it, running "bazel build" will fail (even "bazel info release" fails). Here's what happens when I try it on my Mint 21 install -

/tmp/ワーク:$ bazel build //...
Starting local Bazel server and connecting to it...
ERROR: Client cwd '/tmp/ワーク' is not inside workspace '/tmp/???'
/tmp/ワーク:$

That's an English installation of Mint 21. On Windows, I have a JP VM, i.e. the OS itself is in Japanese. There I get a slightly different error:

C:\tmp\ウェブ>bazel build //...
FATAL: changing directory into c:\tmp\ウェブ failed: (error: 123): t@CAfBNgA܂̓{[ x̍\Ԉ��Ă܂B


C:\tmp\ウェブ>

The contents of the BUILD and WORKSPACE file don't matter (as far as I can tell).

Which category does this issue belong to?

Core

What's the simplest, easiest way to reproduce this bug? Please provide a minimal example if possible.

On Linux, you could type -

mkdir /tmp/ワーク
cd /tmp/ワーク
touch BUILD
touch WORKSPACE
bazel build //...

Which operating system are you running Bazel on?

Mint 21, Rocky 9 and Windows 10

What is the output of `bazel info release`?

/tmp/ワーク:$ bazel info release
ERROR: Client cwd '/tmp/ワーク' is not inside workspace '/tmp/???'

If `bazel info release` returns `development version` or `(@non-git)`, tell us how you built Bazel.

I just installed release versions

What's the output of `git remote get-url origin; git rev-parse HEAD` ?

No response

If this is a regression, please try to identify the Bazel commit where the bug was introduced with bazelisk --bisect.

No response

Have you found anything relevant by searching the web?

I posted a question here - https://stackoverflow.com/questions/79047697/bazel-build-not-working-in-a-folder-with-non-ascii-japanese-characters
I found a similar issue reported here - #2550

Any other information, logs, or outputs that you want to share?

Since I can't get the "bazel info release" output, I can tell you that it's 7.1.0 on Linux and 7.3.2 on Windows. That's what --version shows. Same if I run bazel info release in an ASCII folder.


This change patches the app manifest of the `java.exe` launcher in the embedded JDK to always use the UTF-8 codepage on Windows 1903 and later. This is necessary because the launcher sets `sun.jnu.encoding` to the system code page, which by default is a legacy code page such as Cp1252 on Windows. This causes the JVM to be unable to interact with files whose paths contain Unicode characters not representable in the system code page, as well as command-line arguments and environment variables containing such characters.

The Windows VMs in CI are not running Windows 1903 or later yet, so this change can currently only be tested locally by running `bazel info character-encoding` and verifying that it prints `sun.jnu.encoding = UTF-8`.

Work towards #374
Work towards #18293
Work towards #23859

Closes #24172.

PiperOrigin-RevId: 693466466
Change-Id: I4914c21e846493a8880ac8c6f5e1afa9fae87366
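To illustrate what "not representable in the system code page" means, here is a minimal Python sketch. It is not Bazel or JDK code: it assumes the legacy code page is Cp1252, as named in the commit message, and reuses the folder name from the Windows report. The helper `representable` is hypothetical and exists only for this illustration.

```python
def representable(text: str, codepage: str) -> bool:
    """Return True if every character of `text` can be encoded in `codepage`."""
    try:
        text.encode(codepage)
        return True
    except UnicodeEncodeError:
        return False


folder = "ウェブ"  # folder name taken from the Windows report

# Cp1252 (Western European) has no Japanese characters; Cp932 (Japanese)
# and UTF-8 can both encode the name.
for codepage in ("cp1252", "cp932", "utf-8"):
    print(codepage, representable(folder, codepage))

# A lossy conversion substitutes a placeholder for each unmappable character,
# so the original name can no longer be recovered from the result.
print(folder.encode("cp1252", errors="replace"))
```

On a machine whose system code page is Cp932 the name would still be representable, which is consistent with the commit message describing the problem in terms of characters outside the active code page rather than non-ASCII characters in general.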


The same change was committed as 7bb8d2b. Co-authored-by: Fabian Meumertzheim <[email protected]>

Bazel aims to support arbitrary file system path encodings (even raw byte sequences) by attempting to force the JVM to use a Latin-1 locale for OS interactions. As a result, Bazel internally encodes `String`s as raw byte arrays with a Latin-1 coder and no encoding information. Whenever it interacts with encoding-aware APIs, this may require a reencoding of the `String` contents, depending on the OS and availability of a Latin-1 locale.

This PR introduces the concepts of *internal*, *Unicode*, and *platform* strings and adds dedicated optimized functions for converting between these three types (see the class comment on the new `StringEncoding` helper class for details). These functions are then used to standardize and fix conversion throughout the code base. As a result, a number of new end-to-end integration tests for the handling of Unicode in file paths, command-line arguments and environment variables now pass.

Full support for Unicode beyond the current active code page on Windows is left to a follow-up PR as it may require patching the embedded JDK.

* Replace ad-hoc conversion logic with the new consistent set of helper functions.
* Make more parts of the Bazel client's Windows implementation Unicode-aware. This also fixes the behavior of `SetEnv` on Windows, which previously would remove an environment variable if passed an empty value for it, which doesn't match the Unix behavior.
* Drop the `charset` parameter from all methods related to parameter files. The `ISO-8859-1` vs. `UTF-8` choice was flawed since Bazel's internal string representation doesn't maintain any encoding information - `ISO-8859-1` just meant "write out raw bytes", which is the only choice that matches what arguments would look like if passed on the command line.
* Convert server args to the internal string representation. The arguments for requests to the server were already converted to Bazel's internal string representation, which resulted in a mismatch between `--client_cwd` and `--workspace_directory` if the workspace path contains non-ASCII characters.
* Read the downloader config using Bazel's filesystem implementation.
* Make `MacOSXFsEventsDiffAwareness` UTF-8 aware. It previously used the `GetStringUTF` JNI method, which, despite its name, doesn't return the UTF-8 representation of a string, but modified CESU-8 (nobody ever wants this).
* Correctly reencode path strings for `LocalDiffAwareness`.
* Correctly reencode the value of `user.dir`.
* Correctly turn `ExecRequest` fields into strings for `ProcessBuilder` for `bazel --batch run`. This makes it possible to reenable the `test_consistent_command_line_encoding` test, fixing #1775.
* Fix encoding issues in `TargetCompleteEvents`.
* Fix encoding issues in `SubprocessFactory` implementations.
* Drop obsolete warning if `file.encoding` doesn't equal `ISO-8859-1` as file names are encoded with `sun.jnu.encoding` now.
* Consistently reencode internal strings passed into and out of `FileSystem` implementations, e.g. if reading a symlink target.

Tests are added that verify the interaction between `FileSystem` implementations and the Java (N)IO APIs on Unicode file paths.

Fixes #1775.
Fixes #11602.
Fixes #18293.
Work towards #374.
Work towards #23859.

Closes #24010.

PiperOrigin-RevId: 694114597
Change-Id: I5bdcbc14a90dd1f0f34698aebcbd07cd2bde7a23
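The difference between an *internal* and a *Unicode* string described above can be sketched in a few lines of Python. This is only an illustration, not Bazel's `StringEncoding` class: it assumes the raw bytes of the path are UTF-8 (typical on Linux), and the function names `unicode_to_internal` and `internal_to_unicode` are hypothetical.

```python
def unicode_to_internal(s: str) -> str:
    """Reinterpret the UTF-8 bytes of `s` as Latin-1, one character per byte."""
    return s.encode("utf-8").decode("latin-1")


def internal_to_unicode(s: str) -> str:
    """Inverse: turn each Latin-1 character back into a byte, decode as UTF-8."""
    return s.encode("latin-1").decode("utf-8")


workspace = "/tmp/ワーク"  # path from the Linux report
internal = unicode_to_internal(workspace)

# The round trip is lossless because Latin-1 maps all 256 byte values.
assert internal_to_unicode(internal) == workspace

# The two forms are different strings, so comparing a path in one form with
# the same path in the other form fails for non-ASCII names.
print(internal == workspace)  # False
print(len(workspace), len(internal))

# Pure-ASCII paths are identical in both forms, so such a mismatch would
# only show up once a path contains non-ASCII characters.
print(unicode_to_internal("/tmp/work") == "/tmp/work")  # True
```

Under these assumptions, a path passed in one representation and compared against the same path in the other representation is never equal, which is the kind of mismatch the commit message describes for `--client_cwd` and `--workspace_directory`.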




PSmithsonp4 added type: bug untriaged labels Oct 3, 2024

PSmithsonp4 assigned iancha1992, satyanandak and sgowroji Oct 3, 2024

github-actions bot added the team-Core Skyframe, bazel query, BEP, options parsing, bazelrc label Oct 3, 2024

satyanandak unassigned sgowroji, iancha1992 and satyanandak Oct 3, 2024

fmeum self-assigned this Oct 16, 2024

fmeum mentioned this issue Oct 16, 2024

Fix most Unicode encoding bugs #24010

Closed

fmeum mentioned this issue Nov 4, 2024

Force the JVM to use UTF-8 on Windows #24172

Closed

bazel-io mentioned this issue Nov 6, 2024

[8.0.0] Force the JVM to use UTF-8 on Windows #24231

Merged

iancha1992 mentioned this issue Nov 8, 2024

[8.0.0] Fix most Unicode encoding bugs #24260

Closed

fmeum mentioned this issue Nov 22, 2024

Make Bazel itself build under an output base with Unicode characters #24457

Draft
