-
Notifications
You must be signed in to change notification settings - Fork 1.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Reduce the size of formatted SemIR. #4534
Reduce the size of formatted SemIR. #4534
Conversation
zygoloid
commented
Nov 15, 2024
- Do not include entities imported from files that we are not dumping.
- Do not include constants and import_refs that are not referenced by something that we are including in the formatted output.
- Do not include entities imported from files that we are not dumping. - Do not include constants and import_refs that are not referenced by something that we are including in the formatted output.
I'm in two minds about this. On the one hand, our |
FWIW, I think we should do this, and maybe even more. I think we should make our tests as small and focused as we can TBH. While I agree there is some utility in having back pressure on excessive SemIR verbosity, I think we're already struggling with test focus, and we should invest in that. We have lots of good ways to more cheaply track slow growth of SemIR. We have benchmarking and we have memory statistics. We can get more statistics like # of import refs, etc. At some point, we can invest in CI infrastructure that updates a dashboard with these statistics (or even benchmark data) over time to watch for regressions. That seems like a more effective way to track accidental growth. I do think we should have a flag to get the complete dump because I suspect it could be useful when debugging growth or just problematic interactions with things like imports once it happens. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I discussed this with zygoloid in person, so partly I feel like I'm documenting my thoughts here. :)
I think we could probably do better than just this exclusion, for example by doing more to support comments about which lines of code to include IR for. The implementation of that would possibly look different from this change.
We'd also discussed just "turning off" SemIR for a lot of files, maybe by default. I was okay with that too, actually, to force more thoughtful targeting of where we think the SemIR is significant. But I think if we go a longer-term direction of having more specific comments about where we want SemIR in output, it may be a better way to effect output trimming.
We will almost certainly miss issues due to having less SemIR. However, by having less, we could also end up focusing more on what is present and become more likely to notice issues as a consequence. Part of the issue with having too much information is that it can drown out useful signals.
That all said, I think there's a need to balance between unblocking work, having good tests, and git repository size (I was worried about having >10MB test churn PRs become the new normal). I think we can probably do better with more work, but I think this PR is doing a good balance of cost-benefit. I'm happy to take this change.
Before this change, #4532 is +214,973 −10,377
After this change, #4532 is +19,865 −5,758
FWIW, I think this is in the right direction and resolves my concerns for the related PRs. I do want to ask though, does the # of files affected change, or is it the same # of files with smaller deltas?
toolchain/sem_ir/formatter.h
Outdated
@@ -27,6 +36,7 @@ class Formatter { | |||
const File& sem_ir_; | |||
// Caches naming between Print calls. | |||
InstNamer inst_namer_; | |||
ShouldFormatEntityCallback should_format_entity_; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe either whitespace or its own comment to make clearer that this isn't part of the caching comment above? e.g.:
ShouldFormatEntityCallback should_format_entity_; | |
ShouldFormatEntityCallback should_format_entity_; |
ShouldFormatEntityCallback should_format_entity_; | |
// See `ShouldFormatEntityCallback`. | |
ShouldFormatEntityCallback should_format_entity_; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reordered before inst_namer_
; I think that's also better, to keep the values specified in the constructor call separate from the computed members.
toolchain/sem_ir/formatter.cpp
Outdated
if (block.empty()) { | ||
return; | ||
} | ||
|
||
llvm::SaveAndRestore scope(scope_, scope_id); | ||
out_ << inst_namer_->GetScopeName(scope_id) << " "; | ||
OpenBrace(); | ||
FormatCodeBlock(block); | ||
FormatCodeBlock(block, tentative); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the only spot that does tentative code block formatting, maybe the branched logic should be inlined here? i.e., per above comment about tentative=true
, this is the only call to FormatCodeBlock
with tentative
, so maybe:
FormatCodeBlock(block, tentative); | |
for (const InstId inst_id : block) { | |
TentativeOutputScope scope(*this); | |
tentative_inst_chunks_[inst_id.index] = scope.index; | |
FormatInst(inst_id); | |
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
toolchain/sem_ir/formatter.cpp
Outdated
auto FormatScope(InstNamer::ScopeId scope_id, llvm::ArrayRef<InstId> block, | ||
bool tentative = false) -> void { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think there are just two calls to this, and both pass tentative=true
. Had you considered just making it the standard behavior?
auto FormatScope(InstNamer::ScopeId scope_id, llvm::ArrayRef<InstId> block, | |
bool tentative = false) -> void { | |
auto FormatScope(InstNamer::ScopeId scope_id, llvm::ArrayRef<InstId> block) | |
-> void { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is a somewhat surprising default, but maybe with a rename of the function that can be addressed. I'll take a look.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
Before: 278 files changed. After: 263 files changed. These files no longer change:
I've done spot checks of a few of those and it looks like this is as we would expect: they have new constants, new |
Co-authored-by: Jon Ross-Perkins <[email protected]>
Co-authored-by: Carbon Infra Bot <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LG
bool tentative = false) -> void { | ||
// Formats a top-level scope, and any of the instructions in that scope that | ||
// are used. | ||
auto FormatScopeIfUsed(InstNamer::ScopeId scope_id, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FWIW, you might consider "FormatTentativeScope" or similar, to match that terminology. "IfUsed" makes me kind of wonder if it's saying the scope itself might not be used, when it's actually the instructions contained by the scope that are conditional.