-
Notifications
You must be signed in to change notification settings - Fork 272
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add benchmarks for InitBlock/CopyBlock operations #2154
Add benchmarks for InitBlock/CopyBlock operations #2154
Conversation
@DrewScoggins @adamsitnik PTAL |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Benchmarks overall look good, but I am afraid that they are mostly testing the same code path. We should reduce the numbers of benchmarks. PTAL at my comments and links that I've provided.
@echesakovMSFT thank you!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@adamsitnik Thank you for your review! As I mentioned in the replies, at the moment, there is no difference between heap and localAddr code paths for InitBlock
and CopyBlock
. However, I am trying to implement an optimization to see if we can utilize 16-byte alignment of sp
(and known alignment of fp
) in the JIT on Arm64 and make a use of appropriate store operations.
For context: according to Arm Cortex-A76 Software Optimization Guide
(and similar guides for other microarchitectures) store operations that cross a 16-byte boundary can incur additional latency on Arm64. Hence, we should use these only when the JIT can proof that the store won't cross such boundary. The JIT won't be able to do this in a general case, but can do this for locals.
If you prefer, I can hold off on the benchmark PR until after dotnet/runtime#61030 is merged.
@echesakovMSFT thank you for detailed answers! In such case it would be better to merge the benchmarks before dotnet/runtime#61030 so our Reporting System can start gathering data and when the PR is merged we can see how it affects the performance for all configs. |
I am a little confused by the last line of the .tt file. We don't use it when generating the tests, we use the loop for that, so what is it there for? I assume the plan is to instead of walking by eight bytes in the for loop we instead iterate over the array and use the values there to decide on the different sizes for the copying. If not, we should reduce the number of testcases that we generate across the different sizes and only use sizes we believe will give us meaningful differences between them. |
You are right, I was using the array at the last line before - but then I wanted to do some extra testing locally and forgot to revert the change. Updated and reduced the number of tests. |
@adamsitnik @DrewScoggins PTAL one more time - I believe I addressed all the suggestions. |
…er limit used on x86/x64)
6e11a0e
to
e8e8f14
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, thank you @echesakovMSFT !
This adds benchmark for
GT_STORE_BLK
node in the JIT that responsibly for two operations in .NET - InitBlock and CopyBlock.