Implement a basic busybox for carbon/clang. #4406

jonmeow · 2024-10-11T23:25:40Z

For reference, we're going down the busyboxing route because Carbon depends on Clang, and we want both to be available as binaries. Busyboxing allows this while avoiding duplicating symbols between multiple binaries.

I'm removing the cc_binary for driver:carbon because I want to avoid a significant increase in binary outputs; bazel run //toolchain still works great.

This still doesn't have great test coverage (but non-zero: //examples:sieve still builds/runs, for example). The problem is that we want to avoid subprocessing for performance, but this mainly deals with subprocessing. I'm still thinking about good approaches for that, since we'll probably want more significant testing for clang interaction... the solution might involve busyboxing file_test too.

Note development on this ran into the argv issue being fixed in #4405

chandlerc

Some what of a high level question inline to start off with...

chandlerc · 2024-10-21T23:00:20Z

toolchain/install/busybox_main.cpp

+
+  Carbon::SetWorkingDirForBazel();
+
+  llvm::StringRef busybox_target = argv[1];


This is more of a high-level question, but currently this isn't doing the more common argv[0] based busybox, this is just intercepting subcommands and handling them separately if I understand correctly? And we use bash scripts to take a command by the name of foo (when != to carbon and turn it into a special kind of subcommand?

I actually like having subcommands available for all of this, but I'd rather directly use the argument libraries subcommand logic. WDYT about adding a clang subcommand to the carbon driver, and having it take any additional positional parameters and forward them to Clang, much like this code is doing? The shell script approach would then need to be exec path/to/carbon clang -- "$0" or some such, but otherwise would work roughly the same?

Alternatively, I do think there is an advantage to having this be usable directly with symlinks and just looking at argv[0] to figure out whether to synthesize a special subcommand or just forward to the driver. We could do that and still use bash scripts with Bazel (which I understand are likely to work better) by using exec -a clang path/to/carbon "$0" -- I think the -a argument to exec in Bash lets us simulate how the symlink should work?

Either way, it seems nice to have an actual subcommand for invoking Clang and making sure that doing carbon clang -- <stuff> works the same as path/to/carbons/clang <stuff>?

This is more of a high-level question, but currently this isn't doing the more common argv[0] based busybox, this is just intercepting subcommands and handling them separately if I understand correctly? And we use bash scripts to take a command by the name of foo (when != to carbon and turn it into a special kind of subcommand?

I think these are different from subcommands. carbon is not special-cased; note it has a script too. For example, carbon compile ... becomes carbon-busybox carbon compile ... in order to ensure we choose the right main. I think that using the correct main is important, particularly for non-carbon LLVM things that we want to throw in.

I actually like having subcommands available for all of this, but I'd rather directly use the argument libraries subcommand logic. WDYT about adding a clang subcommand to the carbon driver, and having it take any additional positional parameters and forward them to Clang, much like this code is doing? The shell script approach would then need to be exec path/to/carbon clang -- "$0" or some such, but otherwise would work roughly the same?

I'm wary of this. I don't think that carbon-busybox -v should work, nor should carbon-busybox carbon help should do anything other than what carbon help would do. Entering into additional flag parsing seems risky.

Alternatively, I do think there is an advantage to having this be usable directly with symlinks and just looking at argv[0] to figure out whether to synthesize a special subcommand or just forward to the driver. We could do that and still use bash scripts with Bazel (which I understand are likely to work better) by using exec -a clang path/to/carbon "$0" -- I think the -a argument to exec in Bash lets us simulate how the symlink should work?

This would mean we need to be extra careful about providing an argv[0] that can be used to get both correct behavior and a correct executable path, making install path resolution more complex. That's similarly what I was hoping to avoid by ducking symlinks, because recursive readlinking isn't trivial.

Either way, it seems nice to have an actual subcommand for invoking Clang and making sure that doing carbon clang -- <stuff> works the same as path/to/carbons/clang <stuff>?

This doesn't eliminate the carbon clang subcommand, but I think the nature of flag parsing means these will differ... and actually, to get good cc1 behavior, it's a ton easier if these can differ (see notes on different cc1 main functions in the PR)

A couple more things I think worth noting:

This same model is intended to work with other tools, such as lld, following https://github.com/llvm/llvm-project/blob/main/llvm/tools/llvm-driver/llvm-driver.cpp

I'm just not doing all the tools here because of the additional work to set up and verify.

With the command line flag library, note that we generally don't want flags handled (e.g., lld should dispatch without going through looking at flags) and clang's cc1 handling is similarly different but special (probably should not handle the flag in arbitrary positions, since clang does not).

For symlinks, the particular issue I'm concerned about with is:

We create clang -> carbon-busybox

A user aliases ~/cc -> clang

The user runs cc, which is passed in as argv[0]

Right now, InstallPaths requires a reliable path for the exe. Given that we'll have executables in multiple paths (lib and bin), we probably need to support both cases. The easiest way of doing that means readlink from cc to clang to carbon-busybox, then using the path of carbon-busybox for InstallPaths.

But also, we need to know which tool to run. Do we invoke carbon-busybox as cc, or do we do some smart logic on the way to remember that clang was the final symlink to carbon-busybox?

My rationale for a script is to avoid this complexity: we have both the path of carbon-busybox in argv[0], and clang as the tool being run.

Symlinks have another issue to consider, though: if somebody's tooling tries to help with things like this by resolving symlinks (e.g., os.path.realpath), we'd break that. Do we want to do changes at the level of breaking filesystem operations, basically telling the user they can't use those operations with Carbon?

I understand the canonical busybox uses symlinks, and I think that'd be one thing if LLVM was already bought in and had symlinks that would break these use-cases, but I don't think it does.

Note, I do think there's a trade-off here, I expect there's subprocess overhead.

And to be clear this is all front-loading what I had put in the TODO

Trying to combine responses to both posts. Also happy to discuss in person at the LLVM dev meeting if that's more useful, but figured I'd write down some of my thoughts.

Also, sorry I couldn't get these together earlier, for some reason the replies didn't show up when I looked at the end of the day yesterday, still not sure why...

This is more of a high-level question, but currently this isn't doing the more common argv[0] based busybox, this is just intercepting subcommands and handling them separately if I understand correctly? And we use bash scripts to take a command by the name of foo (when != to carbon and turn it into a special kind of subcommand?

I think these are different from subcommands. carbon is not special-cased; note it has a script too. For example, carbon compile ... becomes carbon-busybox carbon compile ... in order to ensure we choose the right main. I think that using the correct main is important, particularly for non-carbon LLVM things that we want to throw in.

Yeah, I did see that.

But in a way, it means that there is a special form of very restricted subcommands to the carbon-busybox program. While I understand the motivation, and thanks for some of the added details, I do still find that structure a bit awkward to understand.

Anyways, the main points seem the other ones...

I actually like having subcommands available for all of this, but I'd rather directly use the argument libraries subcommand logic. WDYT about adding a clang subcommand to the carbon driver, and having it take any additional positional parameters and forward them to Clang, much like this code is doing? The shell script approach would then need to be exec path/to/carbon clang -- "$0" or some such, but otherwise would work roughly the same?

I'm wary of this. I don't think that carbon-busybox -v should work, nor should carbon-busybox carbon help should do anything other than what carbon help would do. Entering into additional flag parsing seems risky.

I somewhat see it the other way I guess. It would give us flexibility to choose to expose things (or not) at different layers. For example if we'd like to have a friendly flag available when using the subcommand without patching Clang (or LLD or ...), this is now a layer where we can provide that.

I have a few ideas in this space that I'd like the flexibility to explore, so hoping to set up the architecture with that capability.

Either way, it seems nice to have an actual subcommand for invoking Clang and making sure that doing carbon clang -- <stuff> works the same as path/to/carbons/clang <stuff>?

This doesn't eliminate the carbon clang subcommand, but I think the nature of flag parsing means these will differ... and actually, to get good cc1 behavior, it's a ton easier if these can differ (see notes on different cc1 main functions in the PR)

I worry that with two independent paths for calling into Clang there's a risk of subtle divergence. It seems nice to keep them unified to avoid that. It also seems nice to directly support the -cc1 dispatch in the clang subcommand, using the same logic that Clang itself does and that you have in the busybox.

To be clear, I do really like giving Clang the busybox clang binary in the install to support subprocessing cleanly. =]

(Pulling the other CC1 comment up here)

With the command line flag library, note that we generally don't want flags handled (e.g., lld should dispatch without going through looking at flags) and clang's cc1 handling is similarly different but special (probably should not handle the flag in arbitrary positions, since clang does not).

Agreed, but this seems well handled by modeling this as the flags that shouldn't be processed as coming after a --, much as the current carbon clang -- ... does.

Alternatively, I do think there is an advantage to having this be usable directly with symlinks and just looking at argv[0] to figure out whether to synthesize a special subcommand or just forward to the driver. We could do that and still use bash scripts with Bazel (which I understand are likely to work better) by using exec -a clang path/to/carbon "$0" -- I think the -a argument to exec in Bash lets us simulate how the symlink should work?

This would mean we need to be extra careful about providing an argv[0] that can be used to get both correct behavior and a correct executable path, making install path resolution more complex. That's similarly what I was hoping to avoid by ducking symlinks, because recursive readlinking isn't trivial.

(and)

For symlinks, the particular issue I'm concerned about with is:

We create clang -> carbon-busybox

A user aliases ~/cc -> clang

The user runs cc, which is passed in as argv[0]

Right now, InstallPaths requires a reliable path for the exe. Given that we'll have executables in multiple paths (lib and bin), we probably need to support both cases. The easiest way of doing that means readlink from cc to clang to carbon-busybox, then using the path of carbon-busybox for InstallPaths.

But also, we need to know which tool to run. Do we invoke carbon-busybox as cc, or do we do some smart logic on the way to remember that clang was the final symlink to carbon-busybox?

My rationale for a script is to avoid this complexity: we have both the path of carbon-busybox in argv[0], and clang as the tool being run.

Symlinks have another issue to consider, though: if somebody's tooling tries to help with things like this by resolving symlinks (e.g., os.path.realpath), we'd break that. Do we want to do changes at the level of breaking filesystem operations, basically telling the user they can't use those operations with Carbon?

I understand the canonical busybox uses symlinks, and I think that'd be one thing if LLVM was already bought in and had symlinks that would break these use-cases, but I don't think it does.

Note, I do think there's a trade-off here, I expect there's subprocess overhead.

Understood all of this.

I'm much less worried about breaking complex symlinking situations here because I think all of the busybox entry points should really be internal implementation details used to allow subprocessing and such. If we want to support folks adding them to their PATH, setting CC to them or a cc symlink as you mention, I think it's not unreasonable that this could be a bit brittle, and I think it's reasonable to have some restrictions in the interest of avoiding some overhead.

I also think we could easily provide a set of scripts that could be really durably used, and have an explicit flag to override the install path (which seems useful generally come to think of it), so that they can very explicitly set argv[0] and the install path in a reliable way...

It'll be some extra logic to find the install, but that doesn't seem like a bad tradeoff to get the low-overhead subprocess with no shell, and a very familiar busybox architecture.

And to be clear this is all front-loading what I had put in the TODO

I do understand that, but it seems to really change the architecture of the busy box itself. It seemed worth having a discussion around the design for this, and I still lean towards one where the busybox layer translates specially recognized argv[0]s into subcommands of carbon, and where the subcommands can be directly used and support the same full range of functionality.

It seems like both approaches could certainly be made to work, and the tradeoffs are more of a judgement call, but currently this is still the direction my judgement call would go... At least, with the information I see. Maybe there are still factors I'm not seeing though?

Again, happy to discuss a bit in person at the dev meeting if that would be more helpful.

This same model is intended to work with other tools, such as lld, following llvm/llvm-project@main/llvm/tools/llvm-driver/llvm-driver.cpp

I'm just not doing all the tools here because of the additional work to set up and verify.

Yep, +1.

Switched per discussion. Let me know if you still see issues.

chandlerc

Some what of a high level question inline to start off with...

chandlerc

Had some questions, but Jon helpfully answered those live. Otherwise, all LGTM!

github-actions bot requested a review from chandlerc October 11, 2024 23:25

github-actions bot added infrastructure toolchain labels Oct 11, 2024

jonmeow force-pushed the busybox branch from f7a90be to 071df3b Compare October 14, 2024 15:33

Implement a basic busybox for carbon/clang.

7efb232

jonmeow force-pushed the busybox branch 2 times, most recently from bbf6f7a to 7efb232 Compare October 21, 2024 20:42

chandlerc reviewed Oct 21, 2024

View reviewed changes

symmlinks

673c692

jonmeow requested a review from chandlerc October 23, 2024 21:40

chandlerc approved these changes Oct 30, 2024

View reviewed changes

jonmeow added this pull request to the merge queue Oct 30, 2024

Merged via the queue into carbon-language:trunk with commit 957599b Oct 30, 2024
8 checks passed

jonmeow deleted the busybox branch October 30, 2024 17:25

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Implement a basic busybox for carbon/clang. #4406

Implement a basic busybox for carbon/clang. #4406

jonmeow commented Oct 11, 2024 •

edited

Loading

chandlerc left a comment

chandlerc Oct 21, 2024

jonmeow Oct 22, 2024 •

edited

Loading

jonmeow Oct 22, 2024 •

edited

Loading

chandlerc Oct 23, 2024

jonmeow Oct 23, 2024

chandlerc left a comment

chandlerc left a comment


		Carbon::SetWorkingDirForBazel();

		llvm::StringRef busybox_target = argv[1];

Implement a basic busybox for carbon/clang. #4406

Implement a basic busybox for carbon/clang. #4406

Conversation

jonmeow commented Oct 11, 2024 • edited Loading

chandlerc left a comment

Choose a reason for hiding this comment

chandlerc Oct 21, 2024

Choose a reason for hiding this comment

jonmeow Oct 22, 2024 • edited Loading

Choose a reason for hiding this comment

jonmeow Oct 22, 2024 • edited Loading

Choose a reason for hiding this comment

chandlerc Oct 23, 2024

Choose a reason for hiding this comment

jonmeow Oct 23, 2024

Choose a reason for hiding this comment

chandlerc left a comment

Choose a reason for hiding this comment

chandlerc left a comment

Choose a reason for hiding this comment

jonmeow commented Oct 11, 2024 •

edited

Loading

jonmeow Oct 22, 2024 •

edited

Loading

jonmeow Oct 22, 2024 •

edited

Loading