Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Parse Starlark files as raw bytes for Bzlmod #24217

Closed
wants to merge 2 commits into from

Conversation

fmeum
Copy link
Collaborator

@fmeum fmeum commented Nov 5, 2024

As long as Bazel internally represents strings as raw bytes "encoded" in Latin-1, the same must be true for all Starlark files that may contain file system paths.

Also includes changes to the Python test setup:

  • ScratchFile now always writes files as UTF-8
  • RunProgram encodes and decodes stdin/stderr/stdout as UTF-8
  • download no longer leaks a file

@github-actions github-actions bot added team-ExternalDeps External dependency handling, remote repositiories, WORKSPACE file. awaiting-review PR is awaiting review from an assigned reviewer labels Nov 5, 2024
@fmeum
Copy link
Collaborator Author

fmeum commented Nov 5, 2024

This needs #24010 to get its test to pass.

@fmeum
Copy link
Collaborator Author

fmeum commented Nov 5, 2024

@bazel-io fork 8.0.0

@fmeum fmeum marked this pull request as draft November 7, 2024 16:26
@fmeum fmeum force-pushed the 23859-unicode-bzlmod branch from 5b5f9a5 to 31f0271 Compare November 7, 2024 18:22
@fmeum fmeum marked this pull request as ready for review November 7, 2024 18:22
@fmeum
Copy link
Collaborator Author

fmeum commented Nov 7, 2024

Ready for review now.

@tjgq FYI, this is the other Unicode-related PR I am planning to get into 8.0.0. If you see anything else that looks incompatible, please let me know.

@fmeum
Copy link
Collaborator Author

fmeum commented Nov 7, 2024

Turns out that there is an interesting failure, switching back to draft.

@fmeum fmeum marked this pull request as draft November 7, 2024 19:01
@fmeum fmeum force-pushed the 23859-unicode-bzlmod branch 8 times, most recently from 165f03b to a68c20e Compare November 7, 2024 22:46
@fmeum fmeum marked this pull request as ready for review November 7, 2024 22:47
@fmeum
Copy link
Collaborator Author

fmeum commented Nov 7, 2024

@Wyverald CI should be green now, I just had to fix some issues in the Python test setup code.

@Wyverald Wyverald added awaiting-PR-merge PR has been approved by a reviewer and is ready to be merge internally and removed awaiting-review PR is awaiting review from an assigned reviewer labels Nov 13, 2024
@Wyverald
Copy link
Member

hmm, actually -- does this require the "re-land" PR first? if so, we should hold back on the import.

@Wyverald Wyverald added awaiting-review PR is awaiting review from an assigned reviewer and removed awaiting-PR-merge PR has been approved by a reviewer and is ready to be merge internally labels Nov 13, 2024
@fmeum
Copy link
Collaborator Author

fmeum commented Nov 13, 2024

hmm, actually -- does this require the "re-land" PR first? if so, we should hold back on the import.

I think it does for the test to pass.

As long as Bazel internally represents strings as raw bytes "encoded" in Latin-1, the same must be true for all Starlark files that may contain e.g. file system paths.
@fmeum fmeum force-pushed the 23859-unicode-bzlmod branch from a68c20e to e2a9596 Compare November 14, 2024 18:03
@fmeum
Copy link
Collaborator Author

fmeum commented Nov 14, 2024

@Wyverald The reland has happened.

@fmeum fmeum requested a review from Wyverald November 14, 2024 18:10
@Wyverald Wyverald added awaiting-PR-merge PR has been approved by a reviewer and is ready to be merge internally and removed awaiting-review PR is awaiting review from an assigned reviewer labels Nov 15, 2024
@Wyverald
Copy link
Member

Do I understand correctly that, while this PR fixes the use case of non-ASCII characters in file paths, it "breaks" the case where non-ASCII characters are passed in e.g. string attrs in tags?

@fmeum
Copy link
Collaborator Author

fmeum commented Nov 15, 2024

Do I understand correctly that, while this PR fixes the use case of non-ASCII characters in file paths, it "breaks" the case where non-ASCII characters are passed in e.g. string attrs in tags?

It actually fixes both: The .bzl files containing the module extension source are read as raw bytes, module and repo context functions read and write raw bytes from files, ... With this change, strings attrs in tags will also line up with e.g. string literals in extension code.

@Wyverald
Copy link
Member

I see!

module and repo context functions read and write raw bytes from files

I just double-checked, and it seems that we try to write files in UTF-8 in mctx/rctx: code

It's been broken like this forever, so presumably nobody has been relying on it working...

@Wyverald
Copy link
Member

Also, presumably this means print(some_tag.some_attr) could yield gibberish, but that's nothing new. (To be clear, the import is already under way -- this is just me trying to understand the problem better)

@fmeum
Copy link
Collaborator Author

fmeum commented Nov 15, 2024

I just double-checked, and it seems that we try to write files in UTF-8 in mctx/rctx: code
It's been broken like this forever, so presumably nobody has been relying on it working...

That's being fixed in #24182 :-)

Also, presumably this means print(some_tag.some_attr) could yield gibberish, but that's nothing new.

It doesn't, since the Bazel client ultimately outputs raw bytes as well. That's being tested in #24243.

To be clear, the import is already under way -- this is just me trying to understand the problem better

In case it is not too late, I pushed a new test to demonstrate this end-to-end for tags.

@Wyverald
Copy link
Member

Wyverald commented Nov 15, 2024

Ahh, thanks for the explanation. It sounds like we're actually just treating Starlark strs as byte arrays everywhere now, right?

In case it is not too late, I pushed a new test to demonstrate this end-to-end for tags.

Ack -- I'll make sure that's imported too

@fmeum
Copy link
Collaborator Author

fmeum commented Nov 15, 2024

Ahh, thanks for the explanation. It sounds like we're actually just treating Starlark strs as byte arrays everywhere now, right?

Yes, that's correct. It's actually not the worst in a world where you increasingly deal with binary data and UTF-8 strings only.

@fmeum
Copy link
Collaborator Author

fmeum commented Nov 15, 2024

@bazel-io fork 8.0.0

@github-actions github-actions bot removed the awaiting-PR-merge PR has been approved by a reviewer and is ready to be merge internally label Nov 18, 2024
bazel-io pushed a commit to bazel-io/bazel that referenced this pull request Nov 18, 2024
As long as Bazel internally represents strings as raw bytes "encoded" in Latin-1, the same must be true for all Starlark files that may contain file system paths.

Also includes changes to the Python test setup:
* `ScratchFile` now always writes files as UTF-8
* `RunProgram` encodes and decodes stdin/stderr/stdout as UTF-8
* `download` no longer leaks a file

Closes bazelbuild#24217.

PiperOrigin-RevId: 697550082
Change-Id: If7f3fc7ddace2cda5e1f8e48a65406aa54f2a6d8
@fmeum fmeum deleted the 23859-unicode-bzlmod branch November 18, 2024 10:30
github-merge-queue bot pushed a commit that referenced this pull request Nov 18, 2024
As long as Bazel internally represents strings as raw bytes "encoded" in
Latin-1, the same must be true for all Starlark files that may contain
file system paths.

Also includes changes to the Python test setup:
* `ScratchFile` now always writes files as UTF-8
* `RunProgram` encodes and decodes stdin/stderr/stdout as UTF-8
* `download` no longer leaks a file

Closes #24217.

PiperOrigin-RevId: 697550082
Change-Id: If7f3fc7ddace2cda5e1f8e48a65406aa54f2a6d8

Commit
fba8603

Co-authored-by: Fabian Meumertzheim <[email protected]>
@iancha1992
Copy link
Member

The changes in this PR have been included in Bazel 8.0.0 RC3. Please test out the release candidate and report any issues as soon as possible.
If you're using Bazelisk, you can point to the latest RC by setting USE_BAZEL_VERSION=8.0.0rc3. Thanks!

ramil-bitrise pushed a commit to bitrise-io/bazel that referenced this pull request Dec 18, 2024
As long as Bazel internally represents strings as raw bytes "encoded" in Latin-1, the same must be true for all Starlark files that may contain file system paths.

Also includes changes to the Python test setup:
* `ScratchFile` now always writes files as UTF-8
* `RunProgram` encodes and decodes stdin/stderr/stdout as UTF-8
* `download` no longer leaks a file

Closes bazelbuild#24217.

PiperOrigin-RevId: 697550082
Change-Id: If7f3fc7ddace2cda5e1f8e48a65406aa54f2a6d8
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
team-ExternalDeps External dependency handling, remote repositiories, WORKSPACE file.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants