Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor mtree_spec to vis encode filenames #795

Open
wants to merge 5 commits into
base: main
Choose a base branch
from

Conversation

pcj
Copy link

@pcj pcj commented Mar 20, 2024

This PR refactors the mtree_spec rule:

Instead of emitting the .spec output file directly in it's final formatted form, it has been changed to emit a jsonl equivalent. This intermediate file is then processed line-by-line in a shell script with jq to assist with parsing. The filenames (and content) are encoded with vis prior to final mtree line formatting. A golden test has been added to e2e/smoke.

NOTE: this PR assumes the vis tool is present on the host machine (this is generally true for macos and linux). A future improvement could bundle the vis implementation from go-mtree.

Fixes #794

@pcj pcj changed the title WIP: vis encode mtree entries Refactor mtree_spec to vis encode filenames Mar 21, 2024
Copy link
Collaborator

@alexeagle alexeagle left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! I'm excited to get this improvement in.

We are trying hard to make our bsdtar more hermetic (likely something like bazelbuild/bazel-central-registry#1662 with @dzbarsky )
so I'm pretty sad to regress that by assuming vis is on the host machine.
We have a lot of users under nix or in a distroless/wolfi image that has a stripped down distro.

I'm also not clear why it's beneficial to encode into jsonl as a transport between starlark and the multi-line run_shell command - can't the latter read data from the struct directly?

Will discuss with @thesayyn at standup.

@pcj
Copy link
Author

pcj commented Mar 22, 2024

We are trying hard to make our bsdtar more hermetic (likely something like bazelbuild/bazel-central-registry#1662 with @dzbarsky )
so I'm pretty sad to regress that by assuming vis is on the host machine.
We have a lot of users under nix or in a distroless/wolfi image that has a stripped down distro.

Sure. I intended this PR as a workable example that allows me to move forward, and possibly as a base for someone from your team to continue with. Maybe include go-mtree as a dependency and build/distribute pre-built binaries? Alternatively, a simple but dedicated C/C++ program using libarchive that gets built for the target architecture?

I'm also not clear why it's beneficial to encode into jsonl as a transport between starlark and the multi-line run_shell command - can't the latter read data from the struct directly?

Yeah there certainly many possible ways to pass the information into a separate subprocess that generates the final mtree output. I also wasn't totally clear on the code that generates the lines (first iterates through ctx.files.srcs, then the DefaultInfo of each one in ctx.attr.srcs (it is not redundant?)). In any case, I didn't want to mess with that so I made a minimal change to generate something (JSON) that I had a parser for (jq) without including additional tool dependencies (and without slurping everything into memory at once since these files can be huge). As long as the final output is correct, I of course don't care how it's done, feel free to revamp it.

@alexeagle alexeagle self-assigned this Mar 24, 2024
@alexeagle
Copy link
Collaborator

Thanks Paul - I'm not sure when I'll get time to work on this, but I imagine you can already patch in this change in the spot you need it, so you're not blocked on us merging and releasing?

@alexeagle
Copy link
Collaborator

Tried this out locally on my linux machine (stock Ubuntu 22.04.4 LTS), and I don't have vis installed. So I think we need a hermetic binary we can fetch and run.
@dzbarsky do you have some clever ideas where we should get it?

alexeagle added a commit that referenced this pull request May 7, 2024
This is a partial solution to #794 that covers the most common case needing escaping.
See #795 for a principled but blocked fix that requires a BSD vis executable
alexeagle added a commit that referenced this pull request May 7, 2024
This is a partial solution to #794 that covers the most common case needing escaping.
See #795 for a principled but blocked fix that requires a BSD vis executable
alexeagle added a commit that referenced this pull request May 7, 2024
This is a partial solution to #794 that covers the most common case needing escaping.
See #795 for a principled but blocked fix that requires a BSD vis executable
@tkinz27
Copy link

tkinz27 commented Jul 17, 2024

I just ran into a nodejs dependency where the directory name included a space in it that is not covered by #835. I am using bazel_dep(name = "aspect_bazel_lib", version = "2.7.8")

The line in my mtree spec is

src/typescript/tours-api/tours-api.sh.runfiles/npm/node_modules/thread-stream/test/dir with spaces uid=0 gid=0 time=1672560000 mode=0755 type=dir

I had claude generate this genrule to cover it. So far tested on mac, need to test on linux still but its a simple awk script so i'm hopeful shelling to the host awk wont be a problem

    native.genrule(
        name = name + ".mf-sanitized",
        srcs = [name + ".mf"],
        outs = [name + ".mf-sanitized.spec"],
        cmd = """
awk '
    BEGIN {FS="uid="; OFS=""}
    {
        gsub(/ /, "\\\\040", $$1)
        sub(/\\\\040$$/, " ", $$1)
        print $$1 "uid=" $$2
    }
' "$(location """ + name + """.mf)" > "$@"
""",
    )

@tkinz27
Copy link

tkinz27 commented Jul 17, 2024

I just ran into a nodejs dependency where the directory name included a space in it that is not covered by #835

The line in my mtree spec is

src/typescript/tours-api/tours-api.sh.runfiles/npm/node_modules/thread-stream/test/dir with spaces uid=0 gid=0 time=1672560000 mode=0755 type=dir

I had claude generate this genrule to cover it. So far tested on mac, need to test on linux still but its a simple awk script so i'm hopeful shelling to the host awk wont be a problem

    native.genrule(
        name = name + ".mf-sanitized",
        srcs = [name + ".mf"],
        outs = [name + ".mf-sanitized.spec"],
        cmd = """
awk '
    BEGIN {FS="uid="; OFS=""}
    {
        gsub(/ /, "\\\\040", $$1)
        sub(/\\\\040$$/, " ", $$1)
        print $$1 "uid=" $$2
    }
' "$(location """ + name + """.mf)" > "$@"
""",
    )

@j-eid
Copy link

j-eid commented Nov 21, 2024

Ran into the same issue with spaces in a directory. Ended up overriding the module with a simple one line patch

Old

lines.append(_mtree_line(parent, "dir"))

New

lines.append(_mtree_line(_vis_encode(parent), "dir"))

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Bug]: mtree_spec rule does not encode filenames with special characters
4 participants