Reduce the size of formatted SemIR. #4534

zygoloid · 2024-11-15T00:38:09Z

Do not include entities imported from files that we are not dumping.
Do not include constants and import_refs that are not referenced by something that we are including in the formatted output.

- Do not include entities imported from files that we are not dumping. - Do not include constants and import_refs that are not referenced by something that we are including in the formatted output.

zygoloid · 2024-11-15T00:39:53Z

I'm in two minds about this. On the one hand, our check/file_test tests are getting really big (eg, #4532 adds megabytes of output) and are rendered much less useful by all this extra content. But on the other hand, the size of the output SemIR does reflect the size of the representation we're building, which is something we want to minimize, and having this text all present puts pressure on us to make that representation smaller.

zygoloid · 2024-11-15T00:48:35Z

Before this change, #4532 is +214,973 −10,377
After this change, #4532 is +19,865 −5,758

So this really is cutting down a lot of the "noise" there. But I do worry that this is also hiding the real fact that that change is adding a lot to the size of the SemIR we're creating.

chandlerc · 2024-11-15T01:13:16Z

FWIW, I think we should do this, and maybe even more.

I think we should make our tests as small and focused as we can TBH. While I agree there is some utility in having back pressure on excessive SemIR verbosity, I think we're already struggling with test focus, and we should invest in that.

We have lots of good ways to more cheaply track slow growth of SemIR. We have benchmarking and we have memory statistics. We can get more statistics like # of import refs, etc. At some point, we can invest in CI infrastructure that updates a dashboard with these statistics (or even benchmark data) over time to watch for regressions. That seems like a more effective way to track accidental growth.

I do think we should have a flag to get the complete dump because I suspect it could be useful when debugging growth or just problematic interactions with things like imports once it happens.

jonmeow

I discussed this with zygoloid in person, so partly I feel like I'm documenting my thoughts here. :)

I think we could probably do better than just this exclusion, for example by doing more to support comments about which lines of code to include IR for. The implementation of that would possibly look different from this change.

We'd also discussed just "turning off" SemIR for a lot of files, maybe by default. I was okay with that too, actually, to force more thoughtful targeting of where we think the SemIR is significant. But I think if we go a longer-term direction of having more specific comments about where we want SemIR in output, it may be a better way to effect output trimming.

We will almost certainly miss issues due to having less SemIR. However, by having less, we could also end up focusing more on what is present and become more likely to notice issues as a consequence. Part of the issue with having too much information is that it can drown out useful signals.

That all said, I think there's a need to balance between unblocking work, having good tests, and git repository size (I was worried about having >10MB test churn PRs become the new normal). I think we can probably do better with more work, but I think this PR is doing a good balance of cost-benefit. I'm happy to take this change.

Before this change, #4532 is +214,973 −10,377
After this change, #4532 is +19,865 −5,758

FWIW, I think this is in the right direction and resolves my concerns for the related PRs. I do want to ask though, does the # of files affected change, or is it the same # of files with smaller deltas?

toolchain/driver/compile_subcommand.cpp

toolchain/sem_ir/formatter.h

jonmeow · 2024-11-15T15:43:41Z

toolchain/sem_ir/formatter.h

@@ -27,6 +36,7 @@ class Formatter {
  const File& sem_ir_;
  // Caches naming between Print calls.
  InstNamer inst_namer_;
+  ShouldFormatEntityCallback should_format_entity_;


Maybe either whitespace or its own comment to make clearer that this isn't part of the caching comment above? e.g.:

Suggested change

ShouldFormatEntityCallback should_format_entity_;

ShouldFormatEntityCallback should_format_entity_;

Suggested change

ShouldFormatEntityCallback should_format_entity_;

// See `ShouldFormatEntityCallback`.

ShouldFormatEntityCallback should_format_entity_;

Reordered before inst_namer_; I think that's also better, to keep the values specified in the constructor call separate from the computed members.

toolchain/sem_ir/formatter.cpp

jonmeow · 2024-11-15T16:20:23Z

toolchain/sem_ir/formatter.cpp

    if (block.empty()) {
      return;
    }

    llvm::SaveAndRestore scope(scope_, scope_id);
    out_ << inst_namer_->GetScopeName(scope_id) << " ";
    OpenBrace();
-    FormatCodeBlock(block);
+    FormatCodeBlock(block, tentative);


This is the only spot that does tentative code block formatting, maybe the branched logic should be inlined here? i.e., per above comment about tentative=true, this is the only call to FormatCodeBlock with tentative, so maybe:

Suggested change

FormatCodeBlock(block, tentative);

for (const InstId inst_id : block) {

TentativeOutputScope scope(*this);

tentative_inst_chunks_[inst_id.index] = scope.index;

FormatInst(inst_id);

}

jonmeow · 2024-11-15T16:21:33Z

toolchain/sem_ir/formatter.cpp

+  auto FormatScope(InstNamer::ScopeId scope_id, llvm::ArrayRef<InstId> block,
+                   bool tentative = false) -> void {


I think there are just two calls to this, and both pass tentative=true. Had you considered just making it the standard behavior?

Suggested change

auto FormatScope(InstNamer::ScopeId scope_id, llvm::ArrayRef<InstId> block,

bool tentative = false) -> void {

auto FormatScope(InstNamer::ScopeId scope_id, llvm::ArrayRef<InstId> block)

-> void {

I think this is a somewhat surprising default, but maybe with a rename of the function that can be addressed. I'll take a look.

toolchain/sem_ir/formatter.cpp

zygoloid · 2024-11-15T19:12:24Z

Before this change, #4532 is +214,973 −10,377
After this change, #4532 is +19,865 −5,758

FWIW, I think this is in the right direction and resolves my concerns for the related PRs. I do want to ask though, does the # of files affected change, or is it the same # of files with smaller deltas?

Before: 278 files changed. After: 263 files changed. These files no longer change:

toolchain/check/testdata/class/extend_adapt.carbon
toolchain/check/testdata/class/fail_compound_type_mismatch.carbon
toolchain/check/testdata/class/fail_self.carbon
toolchain/check/testdata/class/generic/base_is_generic.carbon
toolchain/check/testdata/class/generic/member_access.carbon
toolchain/check/testdata/class/generic/stringify.carbon
toolchain/check/testdata/const/fail_collapse.carbon
toolchain/check/testdata/function/call/fail_param_type.carbon
toolchain/check/testdata/function/call/fail_return_type_mismatch.carbon
toolchain/check/testdata/operators/overloaded/fail_no_impl_for_arg.carbon
toolchain/check/testdata/pointer/fail_type_mismatch.carbon
toolchain/check/testdata/return/fail_type_mismatch.carbon
toolchain/check/testdata/struct/fail_type_assign.carbon
toolchain/check/testdata/tuple/fail_type_assign.carbon
toolchain/check/testdata/while/fail_bad_condition.carbon

I've done spot checks of a few of those and it looks like this is as we would expect: they have new constants, new import_refs, newly imported interfaces, impls, and specifics, but no changes to the SemIR produced for the test itself.

Co-authored-by: Jon Ross-Perkins <[email protected]>

toolchain/sem_ir/formatter.cpp

Co-authored-by: Carbon Infra Bot <[email protected]>

jonmeow

LG

jonmeow · 2024-11-15T20:49:00Z

toolchain/sem_ir/formatter.cpp

-                   bool tentative = false) -> void {
+  // Formats a top-level scope, and any of the instructions in that scope that
+  // are used.
+  auto FormatScopeIfUsed(InstNamer::ScopeId scope_id,


FWIW, you might consider "FormatTentativeScope" or similar, to match that terminology. "IfUsed" makes me kind of wonder if it's saying the scope itself might not be used, when it's actually the instructions contained by the scope that are conditional.

Reduce the size of formatted SemIR.

59ff944

- Do not include entities imported from files that we are not dumping. - Do not include constants and import_refs that are not referenced by something that we are including in the formatted output.

github-actions bot added the toolchain label Nov 15, 2024

github-actions bot requested a review from geoffromer November 15, 2024 00:38

zygoloid requested a review from jonmeow November 15, 2024 00:40

zygoloid mentioned this pull request Nov 15, 2024

Add prelude support for implicit conversion between integer literals and sized integer types. #4525

Merged

josh11b approved these changes Nov 15, 2024

View reviewed changes

jonmeow approved these changes Nov 15, 2024

View reviewed changes

This was referenced Nov 15, 2024

Convert array bounds to IntLiteral. #4526

Merged

Represent integer literals as IntLiteral not as i32. #4532

Merged

Apply suggestions from code review

ee81866

Co-authored-by: Jon Ross-Perkins <[email protected]>

CarbonInfraBot reviewed Nov 15, 2024

View reviewed changes

toolchain/sem_ir/formatter.cpp Outdated Show resolved Hide resolved

zygoloid and others added 2 commits November 15, 2024 12:03

Fix indentation

916fc48

Co-authored-by: Carbon Infra Bot <[email protected]>

Address review comments.

582e6d0

zygoloid enabled auto-merge November 15, 2024 20:43

jonmeow approved these changes Nov 15, 2024

View reviewed changes

zygoloid added this pull request to the merge queue Nov 15, 2024

Merged via the queue into carbon-language:trunk with commit 4ee65ef Nov 15, 2024
8 checks passed

zygoloid deleted the toolchain-reduce-semir branch November 15, 2024 21:12

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Reduce the size of formatted SemIR. #4534

Reduce the size of formatted SemIR. #4534

zygoloid commented Nov 15, 2024

zygoloid commented Nov 15, 2024

zygoloid commented Nov 15, 2024

chandlerc commented Nov 15, 2024

jonmeow left a comment

jonmeow Nov 15, 2024

zygoloid Nov 15, 2024

jonmeow Nov 15, 2024

zygoloid Nov 15, 2024

jonmeow Nov 15, 2024

zygoloid Nov 15, 2024

zygoloid Nov 15, 2024

zygoloid commented Nov 15, 2024

jonmeow left a comment

jonmeow Nov 15, 2024

	ShouldFormatEntityCallback should_format_entity_;

	ShouldFormatEntityCallback should_format_entity_;

	ShouldFormatEntityCallback should_format_entity_;
	// See `ShouldFormatEntityCallback`.
	ShouldFormatEntityCallback should_format_entity_;

-    FormatCodeBlock(block, tentative);
+    for (const InstId inst_id : block) {
+      TentativeOutputScope scope(*this);
+      tentative_inst_chunks_[inst_id.index] = scope.index;
+      FormatInst(inst_id);
+    }

		auto FormatScope(InstNamer::ScopeId scope_id, llvm::ArrayRef<InstId> block,
		bool tentative = false) -> void {

Reduce the size of formatted SemIR. #4534

Reduce the size of formatted SemIR. #4534

Conversation

zygoloid commented Nov 15, 2024

zygoloid commented Nov 15, 2024

zygoloid commented Nov 15, 2024

chandlerc commented Nov 15, 2024

jonmeow left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

zygoloid commented Nov 15, 2024

jonmeow left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment