RyuJIT: Understand the idiomatic rotate bits #4519
Comments
Yes, a good suggestion, and rotate is supported efficiently on all modern ISAs. Most C++ compilers will do this, though supporting all sizes, combinations, and styles of writing the sequence requires care. For example, I've seen the sequence also written this way (albeit likely by an overly paranoid programmer) in the past: `(value << (64 - count)) | (value >> (count & 63))`. Good news: it doesn't take much compile time, so it would be suitable for a JIT. Byte swap is also a closely related pattern. The JIT doesn't support rotate operators internally, so this would require a bit more plumbing work than simply matching the trees in the morpher. But doable. It would be great if there were an urgent real-world perf scenario or the like that could bump the priority of this. Otherwise I'll take it as a perf suggestion. Thanks.
@cmckinsey Actually, this comes from a very real-world perf scenario. For each page we write to disk, we need at least a weak hash to ensure that when we read it back, the data has not been corrupted by things like rude stops, power loss, etc. Furthermore, we use hashing in a replicated file system implementation, where 128-bit hashes are to be used. We moved to XXHash for the 32- and 64-bit cases because it is far faster than CRC32 for the low- and fair-guarantee modes, and to Metro128 for stronger collision guarantees (we don't need crypto hashes for this). We also heavily use an LZ4 implementation for journal saving, which does lots of bit trickery internally (hashing included) that could be improved by this optimization. This is what we use (Metro128 is still in development, but it is quite heavy on the rotation side): https://github.com/Corvalius/ravendb/blob/master/Raven.Sparrow/Sparrow/Hashing.cs Most of our read and write workload is IO- and hashing-dependent, so this would improve performance and/or reliability :) Just to give you an idea, a 5% improvement in memory copy had an insane 1.3X throughput improvement. This is potentially in the same league.
By the way, as long as there is a "sanctioned" way to implement the idiom such that it can be picked up by the JIT, it is better than not having it at all... I don't mind writing JIT-aware code to get better performance (in fact, I already do :D).
You could also add an intrinsic function. Make a class whose static methods map to the rotate operations: if the instruction is available, make it fast; otherwise, emulate it. Also expose some way to detect hardware features. For example, each instruction group could be represented by a flags-enum member.
Although I do think that having intrinsic functions is the best solution, https://github.com/CosmosOS/ has some mechanisms for giving this kind of control. I know it really smells like `__asm{}`, but I think it may be worth bringing in their solution: https://github.com/CosmosOS/Cosmos/wiki/Intro%20to%20plugs#writing-x86-assembly-in-cosmos
Roslyn uses FNV-1a for most of its string hashing, both as used internally by the compiler and as the baked-in hasher when compiling switch statements over strings. One issue with that is that Murmur hashes are based on word rotations, and there is a concern that we would not realize all the benefits if rotations are shift-emulated.
@VSadov That would depend on your typical string length. If strings are typically fewer than 8 characters, you are more than OK with FNV-1a. The other hash functions (Murmur3, XXHash, etc.) start becoming way faster when they hit the bulk hashing loop (usually 16 to 32 bytes down the road). These are benchmarks of XXHash32 (faster than Murmur3, see [1]), XXHash64, and Metro128, run with BenchmarkDotNet v0.7.7.0 at 64, 32, 16, and 8 bytes (i.e. 32-, 16-, 8-, and 4-character strings). [Benchmark result tables not preserved.]

And this is with shift-emulated rotations. If you guys can convince the JIT guys this optimization is the real deal (poking @cmckinsey and @CarolEidt here :) ), these numbers should improve substantially.
I'll work on this. |
Awesome |
In some cases, intrinsics are acceptable for representing concepts that are difficult to express in source, or that require overly sophisticated pattern/semantic recognition. But otherwise, having the compiler match the source sequences is typically better, since any such improvement will apply to existing unmodified sources and yield a broader impact. So I think that's what makes the best sense for this case.
@cmckinsey agreed. Matching a bitcount/popcount is extremely hard, though. There are many ways to write that, some with loops and some without. As a motivating example, popcount is useful in hash array mapped tries, which are used to implement some persistent collections in JVM languages. A persistent O(1) set is among them, I think.
@GSPP that particular case is being discussed @ https://github.com/dotnet/corefx/issues/2209 |
I just ran a benchmark on HMACSHA512 and was surprised to discover that rotations consume a significant part of the time. Source code here: http://referencesource.microsoft.com/#mscorlib/system/security/cryptography/sha512managed.cs,6 The rotations should map to machine instructions as discussed in this thread. Also, I think the RotateRight helper is not being inlined. Is that not supposed to be inlined? Note that it is a trivially small method. Maybe the heuristics can be tweaked. Just throwing the idea out there: increase inlining probability when the candidate has no control flow (for some practical definition of control flow). Also increase it if there is at least one constant argument. Also increase it if there is no memory access.
@GSPP - The JIT inlining heuristics are indeed rather conservative. @AndyAyersMS is looking into improving them; perhaps this would be an interesting case to look at. @erozenfeld - I think you might already be looking at this case, but it would be good to make sure that you catch this one in your implementation. |
@redknightlois I created a PR for this: dotnet/coreclr#1830. The speed-up on a simple benchmark with 4 rotations in a tight loop was 2.8x. Rotations were found and optimized in several methods in mscorlib. @AndyAyersMS I saw that RotateRight was inlined into Sigma* on desktop but not on coreclr.
@erozenfeld I have been trying to do some benchmarks of this optimization using a custom-built CoreCLR (the process is definitely not as streamlined and foolproof as I would have expected), but I am not able to see it happen in the disassembly window of Visual Studio. Any idea when it is going to be accessible in an unstable release of CoreCLR or the Desktop CLR to try it out? EDIT: Just in case, before it is asked: yes, I have both options (Just My Code and Suppress JIT optimizations) unchecked so that I can see the raw optimized JIT assembly output.
@redknightlois You can ask the JIT to dump the disassembly for a method N.Test.Foo by setting the `COMPLUS_JitDisasm` environment variable to the method name.
@redknightlois setting COMPLUS_JitDisasm will only work with a Debug build of CoreCLR. |
@erozenfeld Ok, after a bit of work I was able to look over this. From my tests, only a single instance catches the optimization.

It works:

```cs
[MethodImpl(MethodImplOptions.NoInlining)]
public static ulong RotateRight64(ulong value, int count)
{
    return (value >> count) | (value << (64 - count));
}
```

It doesn't work:

```cs
[MethodImpl(MethodImplOptions.NoInlining)]
public static long RotateRight64(long value, int count)
{
    return (value >> count) | (value << (64 - count));
}
```

None of the cases in Main catches the optimization either. Repro: https://gist.github.com/redknightlois/e7b9934b6cd2af2aca96
Don't you need to use unsigned to make it a rotation? |
…oreclr/issues/1619 is included in the Desktop RyuJIT and the CoreCLR runtime works properly. Added alternative implementations of PopCount for reference:
- Naive if-based implementation
- Parallel PopCount (two operations simultaneously, avoiding data hazards)
@VSadov To be numerically consistent, yes; but at the assembly level (and it makes sense) there is no difference afaik (in fact, that property is used in hash functions whether signed or not). In effect, signed and unsigned should be equivalent for most practical purposes. I am more worried about the examples in the Main method, though, because it is not clear why they are not caught either.
Signed right shift treats the sign bit specially.
And can we match on both `|` and `^`? Does this match if the two sides of the `|` are swapped? I think there should also be a test case for when the shift argument is not constant and is not constrained by a mask such as `& 63`.
@GSPP Makes sense; they are not entirely equivalent because of the special treatment on the shift side (not on the ror side). That still leaves the question of why the Main() operations with ulong are not optimized, though.
@redknightlois As others pointed out only shifts on unsigned may be used for a rotation pattern. |
@erozenfeld Not in my very specific scenario, but it is a bit misleading that the optimization is not applied when accessing fields directly, while on the other hand it is able to optimize this:

```cs
public void Execute()
{
    a = RotateRight64(a, 16);
}

private static ulong a;

[MethodImpl(MethodImplOptions.AggressiveInlining)]
public static ulong RotateRight64(ulong value, int count)
{
    return (value >> count) | (value << (64 - count));
}
```

More misleading, though, is that it is able to optimize this:

```cs
public struct A { public ulong x; }

public A Execute()
{
    Random rnd = new Random(100);
    int count = 16;
    A a;
    a.x = (ulong)rnd.Next();
    a.x = RotateRight64(a.x, count);
    return a;
}
```

But not this:

```cs
public struct A { public ulong x; }

public A Execute()
{
    Random rnd = new Random(100);
    int count = 16;
    A a;
    a.x = (ulong)rnd.Next();
    a.x = (a.x >> count) | (a.x << (64 - count));
    return a;
}
```

For what it's worth, if you are able to push this to the Desktop CLR earlier without the extra field optimization, I am all in with deferring it. :D If that is the case, it would probably be worth at least documenting the behavior in a blog post.
Couldn't the rotate matching be moved after a compiler pass that eliminates duplicate memory reads? Such a pass surely exists already.
@redknightlois @GSPP I implemented some improvements to rotation matching based on your feedback: it will now work when instance fields, static fields, or ref params are involved. It will also work when `^` is used instead of `|`.
Great!!! Will give it a try during the week. Any idea when this optimization will hit the desktop CLR? |
Hi @redknightlois, just to clarify: we have two products, one open (this one) and one closed (the full framework on Windows). In general, requests about servicing and timeframes should be directed to the Microsoft CSS/Dev support channels. We typically try to avoid adding new optimization work such as this in servicing/minor releases of .NET (e.g. .NET 4.6.1) in order to avoid destabilizing the product. In some circumstances we have considered adding changes like this to the product on a case-by-case basis, depending upon customer urgency; again, that goes through the support channels. Otherwise you can expect this in the next major release of the full .NET Framework and, of course, in .NET Core RC2.
This kind of code is pretty common in the hot paths of hash-function implementations like XXHash, Metro128, FarmHash, etc.
Today the JIT (both the legacy JIT and RyuJIT) generates the following code:
while more efficient code could be generated if such an idiom were detected:
The same applies to the other sizes (8, 16, 32, and 64 bits) and to the RotateLeft direction. Needless to say, the throughput of such functions would improve by more than 2X.