x86_64: align loop headers to 64 bytes #5004
base: main
Conversation
506aa59 to a0c63f8
// Unaligned loop headers can cause severe performance problems.
// See https://github.com/bytecodealliance/wasmtime/issues/4883.
// Here we use a conservative 64-byte alignment.
align_to(offset, 64)
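For reference, here is a minimal sketch of the rounding arithmetic an `align_to` helper like the one above typically performs; this is an illustration under that assumption, not necessarily Cranelift's exact implementation:

```rust
/// Round `offset` up to the next multiple of `align`, where `align` must be
/// a power of two. With `align = 64` the result lands on a cache-line boundary.
fn align_to(offset: u32, align: u32) -> u32 {
    debug_assert!(align.is_power_of_two());
    (offset + align - 1) & !(align - 1)
}

fn main() {
    assert_eq!(align_to(1, 64), 64);
    assert_eq!(align_to(64, 64), 64);
    assert_eq!(align_to(65, 64), 128);
}
```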
Could you disable this when optimizing for size? Also, won't this cause a large code-size blowup when a function contains a lot of tiny loops, multiple of which could fit in a single cache line?
Also maybe disable it for cold blocks?
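Taken together, the two review comments above suggest gating the padding on both the optimization level and block temperature. A minimal sketch of that idea, using hypothetical names that are not Cranelift's actual API:

```rust
/// Hypothetical helper deciding how much alignment to request for a block
/// that is about to be emitted: only hot loop headers get cache-line
/// alignment, while size-optimized builds and cold blocks skip the padding.
fn requested_block_alignment(
    is_loop_header: bool,
    is_cold: bool,
    optimize_for_size: bool,
) -> u32 {
    if is_loop_header && !is_cold && !optimize_for_size {
        64 // pad up to the next cache-line boundary
    } else {
        1 // no extra alignment requested
    }
}
```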
@@ -106,6 +106,8 @@ pub struct BlockLoweringOrder {
    /// which is used by VCode emission to sink the blocks at the last
    /// moment (when we actually emit bytes into the MachBuffer).
    cold_blocks: FxHashSet<BlockIndex>,
    /// These are loop headers. Used for alignment.
    loop_headers: FxHashSet<BlockIndex>,
Would it make sense to merge `cold_blocks` and `loop_headers` into a single `SecondaryMap` with a bitflag as element? Or are both still too infrequently used, such that it would cause a large increase in memory usage?
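A self-contained sketch of that suggestion, replacing the two hash sets with one dense per-block bitflag table; the names are illustrative and merely stand in for what a `SecondaryMap<BlockIndex, u8>` might look like in Cranelift itself:

```rust
/// Per-block property flags packed into one byte per block.
const COLD_BLOCK: u8 = 1 << 0;
const LOOP_HEADER: u8 = 1 << 1;

/// One byte per lowered block, indexed by the block's position in the
/// lowering order.
struct BlockFlags {
    flags: Vec<u8>,
}

impl BlockFlags {
    fn new(num_blocks: usize) -> Self {
        Self { flags: vec![0; num_blocks] }
    }

    fn mark_loop_header(&mut self, block: usize) {
        self.flags[block] |= LOOP_HEADER;
    }

    fn mark_cold(&mut self, block: usize) {
        self.flags[block] |= COLD_BLOCK;
    }

    fn is_loop_header(&self, block: usize) -> bool {
        self.flags[block] & LOOP_HEADER != 0
    }

    fn is_cold(&self, block: usize) -> bool {
        self.flags[block] & COLD_BLOCK != 0
    }
}
```

Whether this saves memory depends on sparsity: the dense table pays one byte per block unconditionally, while the hash sets only pay for blocks that are actually cold or loop headers.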
@pepyakin, thanks for this change. However, I'm not sure that we want to take it as-is; or at least, I'd like to see more data. The Sightglass runs you provide are all over the place -- in some cases this is better by 5-10%, in other cases […]. What's more concerning, the […]. Would you be willing to take a few of the larger benchmarks ([…])?

I share @bjorn3's concern about code size here as well. Aligning every loop to a 64-byte granularity is likely to bloat some cases nontrivially. Could you report the effect on code size (the […])?
That's a good catch. I must admit that I did not notice that it was about instantiation! Probably something was running on my test machine and ruined the measurement. I also agree with the points made by @bjorn3; in retrospect, that seems obvious 🤦 I had the impression that this PR would be a quick one, but it seems I was wrong = ) I will try to push it further, but it won't be my primary focus, so it will likely take some time.
Doing some necromancy here, but I found a nice article detailing how the .NET JIT pads its loops, and adopting a similar strategy might be useful. It's quite a bit more involved than this PR.
@lpereira how are the loop / basic-block weights computed in the .NET JIT? Are they based on profiling counters or some ahead-of-time estimate? It'd be helpful to see a summary of the technique given your background with the .NET JIT.

The heuristics seem to be the crux of things here: we can't align all loops because it would lead to (presumably) a significant increase in code size (how much? someone needs to do the experiments). We are also an AOT compiler (even in JIT mode, we are morally AOT because we compile once at load-time), so we can't pick based on profiling counters. So we need some heuristics for loops that are hot enough for this to matter, yet infrequent enough that the size impact (and fetch bandwidth impact on the entry to the loop) is acceptable.

(EDIT: to clarify, I'm hoping to learn more about e.g. "block-weight meets a certain weight threshold" in that article: is a block-weight static or dynamic, and how is it measured?)
I don't know! What I know, however, is that CoreCLR has a tiered JIT: AOT compilation is possible and usually performed to speed up startup, but the compiler in AOT mode does about as much work optimizing code as Winch, IIRC. Once it detects that a function needs a second look (and I don't know how it detects that; maybe a per-function counter and some tiny code in the epilogue that decrements it and calls into a "jit this harder" function, which then patches the caller to become a trampoline? I really don't know; it's probably something a whole lot more clever than that), it spends more time recompiling things. This is pretty complicated to implement, and wouldn't help with the pure AOT scenarios we have, of course (unless we had some kind of PGO thing). I'll spend some time looking through CoreCLR later to figure out exactly what they do and how they do it. I'll look at how GCC and Clang do it too.
Right. I've given this some thought, and we could experiment with some heuristics, based on: […]

It's all a bit fiddly, of course, and we'd need to empirically adjust these scores, but I think it's a doable strategy to identify hot loops. Then, when lowering, at least on x86, two things need to happen if a loop is considered hot: […]
Closes #4883
This PR introduces 64-byte alignment for loop headers on x86_64. I benchmarked this on an AMD Zen 3 5950X and it produced the following results: […]
I haven't figured out how to write a test though, since `test compile precise-output` seems to be ignoring `nop`s. I assume that's intentional. If so, what would be the best way to introduce one?