Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

JIT: very simple cloning heuristic #108771

Merged

Conversation

AndyAyersMS
Copy link
Member

Avoid cloning large loops.

We compute loop size by counting tree nodes of all statements of all blocks in the loop. If this is over a threshold, we inhibit cloning.

Threshold value was chosen based on distribution of unrestricted cloned loop sizes in the benchmark run_pgo collection.

Avoid cloning large loops.

We compute loop size by counting tree nodes of all statements of all
blocks in the loop. If this is over a threshold, we inhibit cloning.

Threshold value was chosen based on distribution of unrestricted cloned
loop sizes in the benchmark run_pgo collection.
@dotnet-issue-labeler dotnet-issue-labeler bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Oct 11, 2024
Copy link
Contributor

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

@AndyAyersMS
Copy link
Member Author

cc @dotnet/jit-contrib

Distribution of unrestricted loop clone sizes (from x64 benchmarks.run_pgo, units are tree node counts). There aren't many large loops and cloning them quite likely has a poor size/tp/perf tradeoff. So let's try not cloning them and see how it goes.

image

I played around with more "sophisticated" heuristics (some vestiges still there in this PR) where I also tried to compute the size reduction in the cloned loop and the performance impact using block weights. But without access to gtCostSz and gtCostEx (which themselves are questionable) I didn't have much confidence that such an approach would lead to better decisions.

My proposal is to merge this (or something like it) and adjust as needed after the fact based on perf lab and other data.

@AndyAyersMS AndyAyersMS marked this pull request as ready for review October 11, 2024 20:15
@AndyAyersMS
Copy link
Member Author

@BruceForstall PTAL
cc @dotnet/jit-contrib

Copy link
Member

@BruceForstall BruceForstall left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks reasonable. Have you tried running the benchi microbenchmarks, e.g., lloops?

src/coreclr/jit/jitconfigvalues.h Outdated Show resolved Hide resolved
src/coreclr/jit/loopcloning.cpp Outdated Show resolved Hide resolved
@AndyAyersMS
Copy link
Member Author

AndyAyersMS commented Oct 14, 2024

Looks reasonable. Have you tried running the benchi microbenchmarks, e.g., lloops?

No diffs in any benchi/benchf(w/pgo) ... there are some in Bytemark's ludcmp which is one place where cloning was getting carried away (#8558 (comment)).

Fixes from review

Co-authored-by: Bruce Forstall <[email protected]>
@BruceForstall
Copy link
Member

Ah, right, ludcmp is the case I was thinking about.

@AndyAyersMS
Copy link
Member Author

Diffs

@BruceForstall
Copy link
Member

Diffs

Some huge diffs. Especially in libraries_tests. And in ludcmp, as expected. Surprisingly, some size regressions? Maybe cloning was enabling some size-improving optimizations (and getting rid of the cloning overhead)?

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants