RyuJIT: Understand the idiomatic rotate bits #4519

Closed

redknightlois opened this issue Sep 24, 2015 · 29 comments

Labels: area-CodeGen-coreclr (CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI), optimization

@redknightlois

This kind of code is common in the hot paths of hash function implementations like XXHash, Metro128, FarmHash, etc.

[MethodImpl(MethodImplOptions.AggressiveInlining)]
public static ulong RotateRight64(ulong value, int count)
{
    return (value >> count) | (value << (64 - count));
}

Currently the JIT (both legacy and RyuJIT) generates the following code:

mov         rdx,rax  
shr         rdx,1  
shl         rax,3Fh  
or          eax,edx  

A much more efficient sequence could be generated if the idiom were detected:

ror         rax,1  

The same applies to the other sizes (8, 16, 32, and 64 bits) and to the RotateLeft direction. Needless to say, the throughput of such functions would improve by more than 2x.
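For illustration, the left-rotate and 32-bit variants follow the same shape (a sketch extrapolated from the snippet above, not code from the issue):

[MethodImpl(MethodImplOptions.AggressiveInlining)]
public static ulong RotateLeft64(ulong value, int count)
{
    // Mirror image of RotateRight64; should map to a single rol.
    return (value << count) | (value >> (64 - count));
}

[MethodImpl(MethodImplOptions.AggressiveInlining)]
public static uint RotateRight32(uint value, int count)
{
    // 32-bit flavor: only the operand width and the constant 32 change.
    return (value >> count) | (value << (32 - count));
}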

@cmckinsey (Contributor)

Yes, a good suggestion, and rotate is supported efficiently on all modern ISAs. Most C++ compilers will do this, though supporting all sizes, combinations, and styles of writing the sequence requires care. For example, I've seen the sequence also written this way (albeit likely by an overly paranoid programmer) in the past:

(value << (64 - count)) | (value >> (count & 63))
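For reference, a fully masked variant might look like this (my sketch, not from the comment; masking both counts keeps the expression well-defined for any count, including 0):

public static ulong RotateRight64(ulong value, int count)
{
    count &= 63;  // normalize the rotation to 0..63
    return (value >> count) | (value << ((64 - count) & 63));
}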

The good news is that it doesn't take much compile time, so it would be suitable for a JIT. Byte swap is a closely related pattern.

The JIT doesn't support rotate operators internally, so this would require a bit more plumbing work than simply matching the trees in the morpher. But doable.

It would be great if there were an urgent real-world perf scenario or the like that could bump the priority of this. Otherwise I'll take it as a perf suggestion. Thanks.

@redknightlois (Author)

@cmckinsey Actually this comes from a very real-world perf scenario.

For each page we write to disk, we need at least a weak hash to ensure that the data has not been corrupted by things like rude stops, power loss, etc. when we read it back. Furthermore, we use hashing for a replicated file system implementation, where 128-bit hashes are to be used. We moved to XXHash for 32 and 64 bits because it is far faster than CRC32 for the low- and fair-guarantee modes, and to Metro128 for stronger collision guarantees (we don't need cryptographic hashes for this). We also heavily use an LZ4 implementation for journal saving, which does lots of bit trickery internally (hashing included) that could be improved by this optimization.

This is what we use (Metro128 is still in development but it is quite heavy on the rotation side): https://github.com/Corvalius/ravendb/blob/master/Raven.Sparrow/Sparrow/Hashing.cs

Most of our read and write workload is IO- and hashing-dependent, where hashing is used to improve performance and/or reliability :)

Just to give you an idea: a 5% improvement in memory copy gave us an insane 1.3x throughput improvement. This is potentially in the same league.

@redknightlois
Copy link
Author

By the way, as long as there is a "sanctioned" way to implement the idiom such that it can be picked up by the JIT, that is better than not having it at all... I don't mind writing JIT-aware code to get better performance (in fact I already do :D).

@GSPP commented Sep 27, 2015

You could also add intrinsic functions: make a class BitOperations and expose methods like:

  • Rotate
  • AddWithCarry
  • Bitcount
  • ByteSwap
  • Anything else that is supported on x86/ARM (the union of both)

If the instruction is available make it fast. Otherwise, emulate it.

Also expose some way to detect hardware features. For example each instruction group could be represented by a flags enum member.
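A sketch of what such a surface might look like (names are illustrative of the suggestion above, not an existing API at the time):

public static class BitOperations
{
    // Intended to compile down to a single ror/rol where the ISA has one;
    // otherwise the JIT would fall back to this shift emulation.
    public static ulong RotateRight(ulong value, int count)
        => (value >> count) | (value << (64 - count));

    public static ulong RotateLeft(ulong value, int count)
        => (value << count) | (value >> (64 - count));
}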

@xunilrj commented Sep 29, 2015

Although I do think that having intrinsic functions is the best solution, https://github.com/CosmosOS/ has some mechanisms that give this kind of control. I know it really smells like __asm{}, but it may be worth bringing in their solution.

https://github.com/CosmosOS/Cosmos/wiki/Intro%20to%20plugs#writing-x86-assembly-in-cosmos

@VSadov (Member) commented Sep 30, 2015

Roslyn uses FNV-1a for most string hashing, both internally in the compiler and as the baked-in hasher emitted when compiling switch statements over strings.
FNV is nice in its simplicity, but there could be some benefit in switching to Murmur3, as it is reported to be noticeably faster.

One issue with that is that Murmur hashes are based on word rotations, and there is a concern that we would not realize all the benefits if rotations are shift-emulated.
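For context, the per-block mix step of the 32-bit MurmurHash3 performs two rotations for every 4 bytes of input (constants from the public MurmurHash3 reference), so shift-emulated rotations cost on every iteration. A sketch:

static uint RotateLeft32(uint x, int r) => (x << r) | (x >> (32 - r));

// One 4-byte block of MurmurHash3 x86_32.
static uint MixBlock(uint h, uint k)
{
    k *= 0xcc9e2d51;
    k = RotateLeft32(k, 15);
    k *= 0x1b873593;
    h ^= k;
    h = RotateLeft32(h, 13);
    return h * 5 + 0xe6546b64;
}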

@redknightlois (Author)

@VSadov That depends on your typical string length. If strings are typically fewer than 8 characters, you are more than OK with FNV-1a. The other hash functions (Murmur3, XXHash, etc.) start becoming way faster when they hit the bulk hashing loop (usually 16 to 32 bytes down the road).
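For reference, FNV-1a has essentially zero setup cost and processes one byte per step, which is why it wins on short inputs. A minimal 32-bit sketch with the standard FNV constants:

static uint Fnv1a32(byte[] data)
{
    uint hash = 2166136261;   // FNV-1a 32-bit offset basis
    foreach (byte b in data)
    {
        hash ^= b;            // xor first: the "1a" ordering
        hash *= 16777619;     // FNV 32-bit prime
    }
    return hash;
}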

These are benchmarks of XXHash32 (faster than Murmur3; see [1]), XXHash64, and Metro128.

// BenchmarkDotNet=v0.7.7.0
// OS=Microsoft Windows NT 6.2.9200.0
// Processor=Intel(R) Core(TM) i5-2500K CPU @ 3.30GHz, ProcessorCount=4
// Host CLR=MS.NET 4.0.30319.42000, Arch=64-bit [RyuJIT]
Common: Type=HashingPerf Mode=Throughput Platform=X64 Jit=RyuJit .NET=HostFramework

64 bytes (aka a 32-character string)

Method     AvrTime      StdDev      op/s
FNV        295.29 ns    1.18 ns     3386537.85
Metro128   210.00 ns    0.842 ns    4761835.81
XXHash32   169.33 ns    0.859 ns    5905467.78
XXHash64   171.73 ns    2.20 ns     5823017.01

32 bytes (aka a 16-character string)

Method     AvrTime      StdDev      op/s
FNV        167.18 ns    0.736 ns    5981414.99
Metro128   189.66 ns    1.43 ns     5272616.49
XXHash32   116.62 ns    0.420 ns    8574868.82
XXHash64   144.56 ns    0.670 ns    6917313.64

16 bytes (aka an 8-character string)

Method     AvrTime      StdDev      op/s
FNV        85.50 ns     0.478 ns    11696251.04
Metro128   158.74 ns    0.793 ns    6299658.12
XXHash32   90.58 ns     0.498 ns    11039676.1
XXHash64   97.74 ns     0.538 ns    10231013.45

8 bytes (aka a 4-character string)

Method     AvrTime      StdDev      op/s
FNV        52.62 ns     2.15 ns     19003989.76
Metro128   149.03 ns    0.893 ns    6710168.28
XXHash32   59.24 ns     0.228 ns    16881179.41
XXHash64   82.36 ns     0.864 ns    12141168.59

And this is with shift-emulated rotations. If you guys can convince the JIT team this optimization is the real deal (poking @cmckinsey and @CarolEidt here :)), these numbers should improve substantially.

[1] https://code.google.com/p/xxhash/

@erozenfeld (Member)

I'll work on this.

@redknightlois (Author)

Awesome

@cmckinsey (Contributor)

In some cases, intrinsics are acceptable for representing concepts that are difficult to express in source, or that require overly sophisticated pattern/semantic recognition. But otherwise, having the compiler match the source sequences is typically better, since any such improvement applies to existing, unmodified sources and yields a broader impact. So I think that's what makes the most sense for this case.

@GSPP commented Oct 6, 2015

@cmckinsey agreed. Matching a bitcount/popcount is extremely hard, though. There are many ways to write it, some with loops and some without.

As a motivating example, popcount is useful in hash array mapped tries, which are used to implement some persistent collections in JVM languages. A persistent O(1) set is among them, I think.
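For instance, one common branch-free formulation is the SWAR popcount (shown only to illustrate how many shapes the pattern can take):

static int PopCount64(ulong v)
{
    // Sum bits in 2-, 4-, then 8-bit lanes, then fold the byte sums together.
    v = v - ((v >> 1) & 0x5555555555555555UL);
    v = (v & 0x3333333333333333UL) + ((v >> 2) & 0x3333333333333333UL);
    v = (v + (v >> 4)) & 0x0F0F0F0F0F0F0F0FUL;
    return (int)((v * 0x0101010101010101UL) >> 56);
}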

@redknightlois (Author)

@GSPP that particular case is being discussed at https://github.com/dotnet/corefx/issues/2209

@GSPP commented Oct 19, 2015

I just ran a benchmark on HMACSHA512 and was surprised to discover that rotations consume a significant part of the time:

[profiler screenshot: sha512]

Source code here: http://referencesource.microsoft.com/#mscorlib/system/security/cryptography/sha512managed.cs,6

The rotations should map to machine instructions as discussed in this thread.

Also, I think the sigma* functions should be inlined. And given that the rotations are currently not intrinsic, they should be inlined as well. Here's the definition:

return (((x) >> (n)) | ((x) << (64-(n))));

Is that not supposed to be inlined? Note that n is always a constant argument.

Maybe the heuristics can be tweaked. Just throwing ideas out there: increase the inlining probability when the candidate has no control flow (for some practical definition of control flow); also increase it if there is at least one constant argument, or if there is no memory access.
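For reference, the SHA-512 sigma functions under discussion each compose three of these rotations (rotation counts per FIPS 180-4); with the rotate idiom recognized and the helper inlined, each should reduce to three ror instructions plus two xors. A sketch:

static ulong RotateRight(ulong x, int n) => (x >> n) | (x << (64 - n));

// The two "big Sigma" functions from FIPS 180-4, SHA-512.
static ulong Sigma0(ulong x) => RotateRight(x, 28) ^ RotateRight(x, 34) ^ RotateRight(x, 39);
static ulong Sigma1(ulong x) => RotateRight(x, 14) ^ RotateRight(x, 18) ^ RotateRight(x, 41);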

@CarolEidt (Contributor)

@GSPP - The JIT inlining heuristics are indeed rather conservative. @AndyAyersMS is looking into improving them; perhaps this would be an interesting case to look at. @erozenfeld - I think you might already be looking at this case, but it would be good to make sure that you catch this one in your implementation.

@erozenfeld (Member)

@redknightlois I created a PR for this: dotnet/coreclr#1830.

The speed-up on a simple benchmark with 4 rotations in a tight loop was 2.8x.

Rotations were found and optimized in the following methods in mscorlib:
System.Security.Cryptography.SHA256Managed::RotateRight
System.Security.Cryptography.SHA384Managed::RotateRight
System.Security.Cryptography.SHA512Managed::RotateRight
System.Security.Cryptography.RIPEMD160Managed::MDTransform (320 instances!)
System.Diagnostics.Tracing.EventSource.Sha1ForNonSecretPurposes::Rol1
System.Diagnostics.Tracing.EventSource.Sha1ForNonSecretPurposes::Rol5
System.Diagnostics.Tracing.EventSource.Sha1ForNonSecretPurposes::Rol30
System.Diagnostics.Tracing.EventSource.Sha1ForNonSecretPurposes::Drain
(9 instances of Sha1ForNonSecretPurposes::Rol* inlined)

@AndyAyersMS I saw that RotateRight was inlined into Sigma* on desktop but not on coreclr.
@redknightlois Let us know if this change speeds up your scenarios.

@redknightlois (Author)

@erozenfeld I have been trying to run some benchmarks with this optimization using a custom-built CoreCLR (the process is definitely not as streamlined and foolproof as I would have expected), but I am not able to see it happen in the disassembly window of Visual Studio. Any idea when it is going to be accessible in an unstable release of CoreCLR or the desktop CLR to try it out?

EDIT: Just in case, before it is asked: yes, I have both options (Just My Code and Suppress JIT Optimization) unchecked so that I can see the raw optimized JIT assembly output.

@erozenfeld (Member)

@redknightlois You can ask the JIT to dump the disassembly for a method N.Test.Foo by setting

    set COMPLUS_JitDisasm=N.Test::Foo

and then running the app via corerun.exe (e.g., corerun.exe test.exe). If you don't see the expected ror or rol instructions, please share a repro.

@erozenfeld (Member)

@redknightlois setting COMPLUS_JitDisasm will only work with a Debug build of CoreCLR.

@redknightlois (Author)

@erozenfeld OK, after a bit of work I was able to look into this. From my tests, only a single instance catches the optimization.

It works:

        [MethodImpl(MethodImplOptions.NoInlining)]
        public static ulong RotateRight64(ulong value, int count)
        {
            return (value >> count) | (value << (64 - count));
        }

It doesn't work:

        [MethodImpl(MethodImplOptions.NoInlining)]
        public static long RotateRight64(long value, int count)
        {
            return (value >> count) | (value << (64 - count));
        }

None of the cases in Main catches the optimization either.

Repro: https://gist.github.com/redknightlois/e7b9934b6cd2af2aca96

@VSadov (Member) commented Oct 27, 2015

Don't you need to use unsigned to make it a rotation?

redknightlois referenced this issue in redknightlois/BenchmarkDotNet on Oct 27, 2015:

    …oreclr/issues/1619 is included in the Desktop RyuJIT and the CoreCLR runtime works properly.

    Added alternative implementations of PopCount for reference:
      • Naive if-based implementation
      • Parallel PopCount (two operations simultaneously, avoiding data hazards)
@redknightlois (Author)

@VSadov To be numerically consistent, yes; but at the assembly level there is no difference, AFAIK (in fact that property is exploited in hash functions whether signed or not). In effect, signed and unsigned should be equivalent for most practical purposes.

I am more worried about the examples in the Main method, though, because it is not clear why those are not caught either.

@GSPP commented Oct 27, 2015

Signed right shift treats the sign bit specially: int.MinValue >> 31 is -1, so 10000... becomes 11111.... This would be invalid to optimize.
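A quick demonstration of why the signed form must not be matched (my sketch; the hex comments show what each expression evaluates to):

long s = long.MinValue;            // 0x8000000000000000
ulong u = unchecked((ulong)s);

long bad   = (s >> 1) | (s << 63); // 0xC000000000000000: the sign bit is replicated, two high bits end up set
ulong good = (u >> 1) | (u << 63); // 0x4000000000000000: a true rotate right by 1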

a should not be caught but b should be.

And can we match on ^ as well as |? I have seen both.

Does this match if the two sides of the | are flipped? This should match no matter whether the left or right shift appears on the left-hand side.

I think there should be a test case for when the shift argument is not constant and is not constrained by a & 0x1f. The C# compiler never generates that pattern AFAIK, but it exists at the CLR level. The test would need to use IL or generate IL at runtime.

@redknightlois (Author)

@GSPP Makes sense; they are not entirely equivalent because of the special treatment on the shift side (not on the ror side). That still leaves the question of why the Main() operations with ulong are not optimized, though.

@erozenfeld (Member)

@redknightlois As others pointed out, only shifts on unsigned types may be used for a rotation pattern.
The reason the Main() operations with ulong aren't optimized is that the optimization is currently not enabled when fields (static or instance) are involved in the expression. If this is important for your scenarios, I can look into enabling it.
@GSPP Good point about ^. That should be easy to add.
Yes, the current implementation will match if the two sides of the | are flipped.
The case of the shift index being non-constant and having no mask will work. You're right that it can't be exercised from C# and would require IL.

@redknightlois (Author)

@erozenfeld Not in my specific scenario, but it is a bit misleading that the optimization is not applied when accessing fields directly while, on the other hand, it is able to optimize this:

public void Execute()
{
    a = RotateRight64(a, 16);
}

private static ulong a;

[MethodImpl(MethodImplOptions.AggressiveInlining)]
public static ulong RotateRight64(ulong value, int count)
{
    return (value >> count) | (value << (64 - count));
}

More misleading still is that it is able to optimize this:

public struct A { public ulong x; }

public A Execute()
{
    Random rnd = new Random(100);

    int count = 16;            

    A a;
    a.x = (ulong)rnd.Next();
    a.x = RotateRight64(a.x, count);

    return a;
}

But not this:

public struct A { public ulong x; }

public A Execute()
{
    Random rnd = new Random(100);

    int count = 16;            

    A a;
    a.x = (ulong)rnd.Next();            
    a.x = (a.x >> count) | (a.x << (64 - count));

    return a;
}    

For what it's worth, if you are able to push this into the desktop CLR earlier without the extra field optimization, I am all in favor of deferring it. :D

If that is the case, it would probably be worth at least documenting the behavior in a blog post.

@GSPP commented Oct 28, 2015

Couldn't the rotate matching be moved to after a compiler pass that eliminates duplicate memory reads? Such a pass surely exists already.

@erozenfeld (Member)

@redknightlois @GSPP I implemented some improvements to rotation matching based on your feedback: it now works when instance fields, static fields, or ref params are involved in the expression. It also works when ^ is used instead of |.
Pull request is here: dotnet/coreclr#2027

@redknightlois (Author)

Great!!! Will give it a try during the week. Any idea when this optimization will hit the desktop CLR?

@cmckinsey (Contributor)

Hi @redknightlois, just to clarify: we have two products, one open (this one) and one closed (the full framework on Windows). In general, requests about servicing and timeframes should be directed to Microsoft CSS/Dev support channels. We typically try to avoid adding new optimization work such as this in servicing/minor releases of .NET (e.g., .NET 4.6.1) in order to avoid destabilizing the product. In some circumstances we have considered adding changes like this on a case-by-case basis, depending on customer urgency; again, that requires the support channels. Otherwise, you can expect this in the next major release of the full .NET Framework and, of course, in .NET Core RC2.

@msftgits msftgits transferred this issue from dotnet/coreclr Jan 30, 2020
@ghost ghost locked as resolved and limited conversation to collaborators Jan 4, 2021