Bazel aims to support arbitrary file system path encodings (even raw byte sequences) by attempting to force the JVM to use a Latin-1 locale for OS interactions. As a result, Bazel internally encodes `String`s as raw byte arrays with a Latin-1 coder and no encoding information. Whenever it interacts with encoding-aware APIs, this may require a reencoding of the `String` contents, depending on the OS and availability of a Latin-1 locale.

This PR introduces the concepts of *internal*, *Unicode*, and *platform* strings and adds dedicated optimized functions for converting between these three types (see the class comment on the new `StringEncoding` helper class for details). These functions are then used to standardize and fix conversion throughout the code base. As a result, a number of new end-to-end integration tests for the handling of Unicode in file paths, command-line arguments and environment variables now pass.

Full support for Unicode beyond the current active code page on Windows is left to a follow-up PR as it may require patching the embedded JDK.

* Replace ad-hoc conversion logic with the new consistent set of helper functions.
* Make more parts of the Bazel client's Windows implementation Unicode-aware. This also fixes the behavior of `SetEnv` on Windows, which previously would remove an environment variable if passed an empty value for it, which doesn't match the Unix behavior.
* Drop the `charset` parameter from all methods related to parameter files. The `ISO-8859-1` vs. `UTF-8` choice was flawed since Bazel's internal string representation doesn't maintain any encoding information - `ISO-8859-1` just meant "write out raw bytes", which is the only choice that matches what arguments would look like if passed on the command line.
* Convert server args to the internal string representation. The arguments for requests to the server were already converted to Bazel's internal string representation, which resulted in a mismatch between `--client_cwd` and `--workspace_directory` if the workspace path contains non-ASCII characters.
* Read the downloader config using Bazel's filesystem implementation.
* Make `MacOSXFsEventsDiffAwareness` UTF-8 aware. It previously used the `GetStringUTF` JNI method, which, despite its name, doesn't return the UTF-8 representation of a string, but modified CESU-8 (nobody ever wants this).
* Correctly reencode path strings for `LocalDiffAwareness`.
* Correctly reencode the value of `user.dir`.
* Correctly turn `ExecRequest` fields into strings for `ProcessBuilder` for `bazel --batch run`. This makes it possible to reenable the `test_consistent_command_line_encoding` test, fixing #1775.
* Fix encoding issues in `TargetCompleteEvents`.
* Fix encoding issues in `SubprocessFactory` implementations.
* Drop obsolete warning if `file.encoding` doesn't equal `ISO-8859-1` as file names are encoded with `sun.jnu.encoding` now.
* Consistently reencode internal strings passed into and out of `FileSystem` implementations, e.g. if reading a symlink target.

Tests are added that verify the interaction between `FileSystem` implementations and the Java (N)IO APIs on Unicode file paths.

Fixes #1775.
Fixes #11602.
Fixes #18293.
Work towards #374.
Work towards #23859.

Closes #24010.

PiperOrigin-RevId: 694114597
Change-Id: I5bdcbc14a90dd1f0f34698aebcbd07cd2bde7a23
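To make the distinction between *internal* and *Unicode* strings more concrete, here is a minimal C++ model of the idea. It is not the `StringEncoding` implementation (which is Java and is not shown in this excerpt). The names `InternalString`, `EncodeUtf8`, `UnicodeToInternal` and `InternalToUnicode` are hypothetical. The sketch assumes the raw bytes are UTF-8 and well-formed, and it does no error handling.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical model: an "internal" string holds one raw byte per code unit,
// the way a Latin-1 coded Java String would. Every unit is <= 0xFF.
using InternalString = std::u16string;

// Encodes Unicode code points as UTF-8 bytes.
std::string EncodeUtf8(const std::u32string& code_points) {
  std::string out;
  for (char32_t cp : code_points) {
    if (cp < 0x80) {
      out += static_cast<char>(cp);
    } else if (cp < 0x800) {
      out += static_cast<char>(0xC0 | (cp >> 6));
      out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
      out += static_cast<char>(0xE0 | (cp >> 12));
      out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
      out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
      out += static_cast<char>(0xF0 | (cp >> 18));
      out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
      out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
      out += static_cast<char>(0x80 | (cp & 0x3F));
    }
  }
  return out;
}

// Unicode -> internal: encode as UTF-8, then store each byte as one unit.
InternalString UnicodeToInternal(const std::u32string& unicode) {
  InternalString internal;
  for (unsigned char byte : EncodeUtf8(unicode)) {
    internal.push_back(static_cast<char16_t>(byte));
  }
  return internal;
}

// Internal -> Unicode: treat each unit as a byte and decode the bytes as
// UTF-8. Assumes well-formed input; malformed sequences are not detected.
std::u32string InternalToUnicode(const InternalString& internal) {
  std::u32string out;
  std::size_t i = 0;
  while (i < internal.size()) {
    std::uint32_t lead = internal[i] & 0xFF;
    int extra = lead < 0x80 ? 0 : lead < 0xE0 ? 1 : lead < 0xF0 ? 2 : 3;
    char32_t cp = extra == 0 ? lead : lead & (0x3F >> extra);
    ++i;
    for (int k = 0; k < extra && i < internal.size(); ++k, ++i) {
      cp = (cp << 6) | (internal[i] & 0x3F);
    }
    out.push_back(cp);
  }
  return out;
}
```

In this model, the Unicode string `ä` (U+00E4) becomes the two-unit internal string `0xC3 0xA4`, which would print as `Ã¤` if it were mistaken for a Unicode string. That is the kind of mismatch the conversion helpers are meant to prevent.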
1 parent f6585d4, commit a58fe3f
Showing 98 changed files with 1,398 additions and 819 deletions.
    @@ -0,0 +1,35 @@
    // Copyright 2024 The Bazel Authors. All rights reserved.
    //
    // Licensed under the Apache License, Version 2.0 (the "License");
    // you may not use this file except in compliance with the License.
    // You may obtain a copy of the License at
    //
    //     http://www.apache.org/licenses/LICENSE-2.0
    //
    // Unless required by applicable law or agreed to in writing, software
    // distributed under the License is distributed on an "AS IS" BASIS,
    // WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    // See the License for the specific language governing permissions and
    // limitations under the License.

    #include <string>
    #include <vector>

    #include "src/main/cpp/option_processor-internal.h"

    // On OSX, there apparently is no header that defines this.
    #ifndef environ
    extern char** environ;
    #endif

    namespace blaze::internal {

    std::vector<std::string> GetProcessedEnv() {
      std::vector<std::string> processed_env;
      for (char** env = environ; *env != nullptr; env++) {
        processed_env.emplace_back(*env);
      }
      return processed_env;
    }

    }  // namespace blaze::internal
    @@ -0,0 +1,94 @@
    // Copyright 2024 The Bazel Authors. All rights reserved.
    //
    // Licensed under the Apache License, Version 2.0 (the "License");
    // you may not use this file except in compliance with the License.
    // You may obtain a copy of the License at
    //
    //     http://www.apache.org/licenses/LICENSE-2.0
    //
    // Unless required by applicable law or agreed to in writing, software
    // distributed under the License is distributed on an "AS IS" BASIS,
    // WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    // See the License for the specific language governing permissions and
    // limitations under the License.

    #define WIN32_LEAN_AND_MEAN
    #include <windows.h>

    #include <algorithm>
    #include <string>
    #include <string_view>

    #include "absl/strings/ascii.h"
    #include "src/main/cpp/option_processor-internal.h"
    #include "src/main/cpp/util/strings.h"

    namespace blaze::internal {

    #if defined(__CYGWIN__)

    static void PreprocessEnvString(std::string* env_str) {
      std::size_t pos = env_str->find_first_of('=');
      if (pos == std::string::npos) {
        return;
      }
      std::string name = env_str->substr(0, pos);
      if (name == "PATH") {
        env_str->assign("PATH=" + env_str->substr(pos + 1));
      } else if (name == "TMP") {
        // A valid Windows path "c:/foo" is also a valid Unix path list of
        // ["c", "/foo"] so must use ConvertPath here. See GitHub issue #1684.
        env_str->assign("TMP=" + blaze_util::ConvertPath(env_str->substr(pos + 1)));
      }
    }

    #else  // not defined(__CYGWIN__)

    static void PreprocessEnvString(std::string* env_str) {
      static constexpr const char* vars_to_uppercase[] = {
          "PATH", "SYSTEMROOT", "SYSTEMDRIVE", "TEMP", "TEMPDIR", "TMP"};

      std::size_t pos = env_str->find_first_of('=');
      if (pos == std::string::npos) {
        return;
      }

      std::string name = absl::AsciiStrToUpper(env_str->substr(0, pos));
      if (std::find(std::begin(vars_to_uppercase), std::end(vars_to_uppercase),
                    name) != std::end(vars_to_uppercase)) {
        env_str->assign(name + "=" + env_str->substr(pos + 1));
      }
    }

    #endif  // defined(__CYGWIN__)

    static bool IsValidEnvName(std::string_view s) {
      std::string_view name = s.substr(0, s.find('='));
      return std::all_of(name.begin(), name.end(), [](char c) {
        return (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z') ||
               (c >= '0' && c <= '9') || c == '_' || c == '(' || c == ')';
      });
    }

    // Use GetEnvironmentStringsW to get the environment variables to support
    // Unicode regardless of the current code page.
    std::vector<std::string> GetProcessedEnv() {
      std::vector<std::string> processed_env;
      wchar_t* env = GetEnvironmentStringsW();
      if (env == nullptr) {
        return processed_env;
      }

      for (wchar_t* p = env; *p != L'\0'; p += wcslen(p) + 1) {
        std::string env_str = blaze_util::WstringToCstring(p);
        if (IsValidEnvName(env_str)) {
          PreprocessEnvString(&env_str);
          processed_env.push_back(std::move(env_str));
        }
      }

      FreeEnvironmentStringsW(env);
      return processed_env;
    }

    }  // namespace blaze::internal
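The filtering and uppercasing steps in the non-Cygwin branch can be tried without `windows.h` or Abseil. The following is a rough portable sketch, not the code from this commit. `AsciiUpper`, `HasValidName` and `NormalizeEnvEntry` are hypothetical names, and the ASCII-only uppercasing is assumed to be an adequate stand-in for `absl::AsciiStrToUpper`.

```cpp
#include <algorithm>
#include <array>
#include <cstddef>
#include <optional>
#include <string>
#include <string_view>

// Hypothetical helper: ASCII-only uppercasing, independent of the C locale.
std::string AsciiUpper(std::string_view s) {
  std::string out(s);
  for (char& c : out) {
    if (c >= 'a' && c <= 'z') c = static_cast<char>(c - 'a' + 'A');
  }
  return out;
}

// Mirrors the name check above: ASCII letters, digits, '_', '(' and ')'.
bool HasValidName(std::string_view entry) {
  std::string_view name = entry.substr(0, entry.find('='));
  return std::all_of(name.begin(), name.end(), [](char c) {
    return (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z') ||
           (c >= '0' && c <= '9') || c == '_' || c == '(' || c == ')';
  });
}

// Returns std::nullopt for entries that would be dropped. Otherwise returns
// the entry, with the name uppercased if it is one of the listed variables.
std::optional<std::string> NormalizeEnvEntry(std::string_view entry) {
  static constexpr std::array<std::string_view, 6> kUppercased = {
      "PATH", "SYSTEMROOT", "SYSTEMDRIVE", "TEMP", "TEMPDIR", "TMP"};
  if (!HasValidName(entry)) return std::nullopt;
  std::size_t pos = entry.find('=');
  if (pos == std::string_view::npos) return std::string(entry);
  std::string name = AsciiUpper(entry.substr(0, pos));
  if (std::find(kUppercased.begin(), kUppercased.end(),
                std::string_view(name)) == kUppercased.end()) {
    return std::string(entry);  // Not a listed variable: keep the case as is.
  }
  return name + std::string(entry.substr(pos));  // substr(pos) keeps the '='.
}
```

With this sketch, `Path=C:\Windows` is rewritten to `PATH=C:\Windows`, `MyVar=x` is returned unchanged, and `FOO-BAR=1` is dropped because `-` is not an accepted name character.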