RyuJIT: Understand the idiomatic rotate bits #4519

Closed

redknightlois opened this issue Sep 24, 2015 · 29 comments

Labels: area-CodeGen-coreclr (CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI), optimization

@redknightlois

This kind of code is common in the hot paths of hash function implementations like XXHash, Metro128, FarmHash, etc.

[MethodImpl(MethodImplOptions.AggressiveInlining)]
public static ulong RotateRight64(ulong value, int count)
{
    return (value >> count) | (value << (64 - count));
}

Currently the JIT (both legacy and RyuJIT) generates the following code:

mov         rdx,rax  
shr         rdx,1  
shl         rax,3Fh  
or          eax,edx  

A much more efficient sequence could be generated if the idiom were detected:

ror         rax,1  

The same applies to the other sizes (8, 16, 32, and 64 bits) and to the RotateLeft direction. Needless to say, the throughput of such functions would improve by more than 2x.
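For illustration, the left-rotate and 32-bit variants follow the same shape (a sketch extrapolated from the snippet above, not code from the issue):

[MethodImpl(MethodImplOptions.AggressiveInlining)]
public static ulong RotateLeft64(ulong value, int count)
{
    // Mirror image of RotateRight64; should map to a single rol.
    return (value << count) | (value >> (64 - count));
}

[MethodImpl(MethodImplOptions.AggressiveInlining)]
public static uint RotateRight32(uint value, int count)
{
    // 32-bit flavor: only the operand width and the constant 32 change.
    return (value >> count) | (value << (32 - count));
}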

@cmckinsey (Contributor)

Yes, a good suggestion, and rotate is supported efficiently on all modern ISAs. Most C++ compilers will do this, though supporting all sizes, combinations, and styles of writing the sequence requires care. For example, I've seen the sequence also written this way (albeit likely by an overly paranoid programmer) in the past:

(value << (64 - count)) | (value >> (count & 63))
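For reference, a fully masked variant might look like this (my sketch, not from the comment; masking both counts keeps the expression well-defined for any count, including 0):

public static ulong RotateRight64(ulong value, int count)
{
    count &= 63;  // normalize the rotation to 0..63
    return (value >> count) | (value << ((64 - count) & 63));
}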

The good news is that it doesn't take much compile time, so it would be suitable for a JIT. Byte swap is a closely related pattern.

The JIT doesn't support rotate operators internally, so this would require a bit more plumbing work than simply matching the trees in the morpher. But doable.

It would be great if there were an urgent real-world perf scenario or the like that could bump the priority of this. Otherwise I'll take it as a perf suggestion. Thanks.

@redknightlois (Author)

@cmckinsey Actually this comes from a very real-world perf scenario.

For each page we write to disk, we need at least a weak hash to ensure that the data has not been corrupted by things like rude stops, power loss, etc. when we read it back. Furthermore, we use hashing for a replicated file system implementation, where 128-bit hashes are to be used. We moved to XXHash for 32 and 64 bits because it is far faster than CRC32 for the low- and fair-guarantee modes, and to Metro128 for stronger collision guarantees (we don't need cryptographic hashes for this). We also heavily use an LZ4 implementation for journal saving, which does lots of bit trickery internally (hashing included) that could be improved by this optimization.

This is what we use (Metro128 is still in development but it is quite heavy on the rotation side): https://github.com/Corvalius/ravendb/blob/master/Raven.Sparrow/Sparrow/Hashing.cs

Most of our read and write workload is IO- and hashing-dependent, where hashing is used to improve performance and/or reliability :)

Just to give you an idea: a 5% improvement in memory copy gave us an insane 1.3x throughput improvement. This is potentially in the same league.

@redknightlois
Copy link
Author

By the way, as long as there is a "sanctioned" way to implement the idiom such that it can be picked up by the JIT, that is better than not having it at all... I don't mind writing JIT-aware code to get better performance (in fact I already do :D).

@GSPP commented Sep 27, 2015

You could also add intrinsic functions: make a class BitOperations and expose methods like:

  • Rotate
  • AddWithCarry
  • Bitcount
  • ByteSwap
  • Anything else that is supported on x86/ARM (the union of both)

If the instruction is available make it fast. Otherwise, emulate it.

Also expose some way to detect hardware features. For example each instruction group could be represented by a flags enum member.
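A sketch of what such a surface might look like (names are illustrative of the suggestion above, not an existing API at the time):

public static class BitOperations
{
    // Intended to compile down to a single ror/rol where the ISA has one;
    // otherwise the JIT would fall back to this shift emulation.
    public static ulong RotateRight(ulong value, int count)
        => (value >> count) | (value << (64 - count));

    public static ulong RotateLeft(ulong value, int count)
        => (value << count) | (value >> (64 - count));
}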

@xunilrj commented Sep 29, 2015

Although I do think that having intrinsic functions is the best solution, https://github.com/CosmosOS/ has some mechanisms that give this kind of control. I know it really smells like __asm{}, but it may be worth bringing in their solution.

https://github.com/CosmosOS/Cosmos/wiki/Intro%20to%20plugs#writing-x86-assembly-in-cosmos

@VSadov (Member) commented Sep 30, 2015

Roslyn uses FNV-1a for most string hashing, both internally in the compiler and as the baked-in hasher emitted when compiling switch statements over strings.
FNV is nice in its simplicity, but there could be some benefit in switching to Murmur3, as it is reported to be noticeably faster.

One issue with that is that Murmur hashes are based on word rotations, and there is a concern that we would not realize all the benefits if rotations are shift-emulated.
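For context, the per-block mix step of the 32-bit MurmurHash3 performs two rotations for every 4 bytes of input (constants from the public MurmurHash3 reference), so shift-emulated rotations cost on every iteration. A sketch:

static uint RotateLeft32(uint x, int r) => (x << r) | (x >> (32 - r));

// One 4-byte block of MurmurHash3 x86_32.
static uint MixBlock(uint h, uint k)
{
    k *= 0xcc9e2d51;
    k = RotateLeft32(k, 15);
    k *= 0x1b873593;
    h ^= k;
    h = RotateLeft32(h, 13);
    return h * 5 + 0xe6546b64;
}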

@redknightlois (Author)

@VSadov That depends on your typical string length. If strings are typically fewer than 8 characters, you are more than OK with FNV-1a. The other hash functions (Murmur3, XXHash, etc.) start becoming way faster when they hit the bulk hashing loop (usually 16 to 32 bytes down the road).
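For reference, FNV-1a has essentially zero setup cost and processes one byte per step, which is why it wins on short inputs. A minimal 32-bit sketch with the standard FNV constants:

static uint Fnv1a32(byte[] data)
{
    uint hash = 2166136261;   // FNV-1a 32-bit offset basis
    foreach (byte b in data)
    {
        hash ^= b;            // xor first: the "1a" ordering
        hash *= 16777619;     // FNV 32-bit prime
    }
    return hash;
}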

These are benchmarks of XXHash32 (faster than Murmur3; see [1]), XXHash64, and Metro128.

// BenchmarkDotNet=v0.7.7.0
// OS=Microsoft Windows NT 6.2.9200.0
// Processor=Intel(R) Core(TM) i5-2500K CPU @ 3.30GHz, ProcessorCount=4
// Host CLR=MS.NET 4.0.30319.42000, Arch=64-bit [RyuJIT]
Common: Type=HashingPerf Mode=Throughput Platform=X64 Jit=RyuJit .NET=HostFramework

64 bytes (aka a 32-character string)

Method     AvrTime      StdDev      op/s
FNV        295.29 ns    1.18 ns     3386537.85
Metro128   210.00 ns    0.842 ns    4761835.81
XXHash32   169.33 ns    0.859 ns    5905467.78
XXHash64   171.73 ns    2.20 ns     5823017.01

32 bytes (aka a 16-character string)

Method     AvrTime      StdDev      op/s
FNV        167.18 ns    0.736 ns    5981414.99
Metro128   189.66 ns    1.43 ns     5272616.49
XXHash32   116.62 ns    0.420 ns    8574868.82
XXHash64   144.56 ns    0.670 ns    6917313.64

16 bytes (aka an 8-character string)

Method     AvrTime      StdDev      op/s
FNV        85.50 ns     0.478 ns    11696251.04
Metro128   158.74 ns    0.793 ns    6299658.12
XXHash32   90.58 ns     0.498 ns    11039676.1
XXHash64   97.74 ns     0.538 ns    10231013.45

8 bytes (aka a 4-character string)

Method     AvrTime      StdDev      op/s
FNV        52.62 ns     2.15 ns     19003989.76
Metro128   149.03 ns    0.893 ns    6710168.28
XXHash32   59.24 ns     0.228 ns    16881179.41
XXHash64   82.36 ns     0.864 ns    12141168.59

And this is with shift-emulated rotations. If you guys can convince the JIT team this optimization is the real deal (poking @cmckinsey and @CarolEidt here :)), these numbers should improve substantially.

[1] https://code.google.com/p/xxhash/

@erozenfeld (Member)

I'll work on this.

@redknightlois (Author)

Awesome

@cmckinsey (Contributor)

In some cases, intrinsics are acceptable for representing concepts that are difficult to express in source, or that require overly sophisticated pattern/semantic recognition. But otherwise, having the compiler match the source sequences is typically better, since any such improvement applies to existing, unmodified sources and yields a broader impact. So I think that's what makes the most sense for this case.

@GSPP commented Oct 6, 2015

@cmckinsey agreed. Matching a bitcount/popcount is extremely hard, though. There are many ways to write it, some with loops and some without.

As a motivating example, popcount is useful in hash array mapped tries, which are used to implement some persistent collections in JVM languages. A persistent O(1) set is among them, I think.
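For instance, one common branch-free formulation is the SWAR popcount (shown only to illustrate how many shapes the pattern can take):

static int PopCount64(ulong v)
{
    // Sum bits in 2-, 4-, then 8-bit lanes, then fold the byte sums together.
    v = v - ((v >> 1) & 0x5555555555555555UL);
    v = (v & 0x3333333333333333UL) + ((v >> 2) & 0x3333333333333333UL);
    v = (v + (v >> 4)) & 0x0F0F0F0F0F0F0F0FUL;
    return (int)((v * 0x0101010101010101UL) >> 56);
}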

@redknightlois (Author)

@GSPP that particular case is being discussed at https://github.com/dotnet/corefx/issues/2209

@GSPP commented Oct 19, 2015

I just ran a benchmark on HMACSHA512 and was surprised to discover that rotations consume a significant part of the time:

[profiler screenshot: sha512]

Source code here: http://referencesource.microsoft.com/#mscorlib/system/security/cryptography/sha512managed.cs,6

The rotations should map to machine instructions as discussed in this thread.

Also, I think the sigma* functions should be inlined. And given that the rotations are currently not intrinsic, they should be inlined as well. Here's the definition:

return (((x) >> (n)) | ((x) << (64-(n))));

Is that not supposed to be inlined? Note that n is always a constant argument.

Maybe the heuristics can be tweaked. Just throwing ideas out there: increase the inlining probability when the candidate has no control flow (for some practical definition of control flow); also increase it if there is at least one constant argument, or if there is no memory access.
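For reference, the SHA-512 sigma functions under discussion each compose three of these rotations (rotation counts per FIPS 180-4); with the rotate idiom recognized and the helper inlined, each should reduce to three ror instructions plus two xors. A sketch:

static ulong RotateRight(ulong x, int n) => (x >> n) | (x << (64 - n));

// The two "big Sigma" functions from FIPS 180-4, SHA-512.
static ulong Sigma0(ulong x) => RotateRight(x, 28) ^ RotateRight(x, 34) ^ RotateRight(x, 39);
static ulong Sigma1(ulong x) => RotateRight(x, 14) ^ RotateRight(x, 18) ^ RotateRight(x, 41);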

@CarolEidt (Contributor)

@GSPP - The JIT inlining heuristics are indeed rather conservative. @AndyAyersMS is looking into improving them; perhaps this would be an interesting case to look at. @erozenfeld - I think you might already be looking at this case, but it would be good to make sure that you catch this one in your implementation.

@erozenfeld (Member)

@redknightlois I created a PR for this: dotnet/coreclr#1830.

The speed-up on a simple benchmark with 4 rotations in a tight loop was 2.8x.

Rotations were found and optimized in the following methods in mscorlib:
System.Security.Cryptography.SHA256Managed::RotateRight
System.Security.Cryptography.SHA384Managed::RotateRight
System.Security.Cryptography.SHA512Managed::RotateRight
System.Security.Cryptography.RIPEMD160Managed::MDTransform (320 instances!)
System.Diagnostics.Tracing.EventSource.Sha1ForNonSecretPurposes::Rol1
System.Diagnostics.Tracing.EventSource.Sha1ForNonSecretPurposes::Rol5
System.Diagnostics.Tracing.EventSource.Sha1ForNonSecretPurposes::Rol30
System.Diagnostics.Tracing.EventSource.Sha1ForNonSecretPurposes::Drain
(9 instances of Sha1ForNonSecretPurposes::Rol* inlined)

@AndyAyersMS I saw that RotateRight was inlined into Sigma* on desktop but not on coreclr.
@redknightlois Let us know if this change speeds up your scenarios.

@redknightlois (Author)

@erozenfeld I have been trying to run some benchmarks with this optimization using a custom-built CoreCLR (the process is definitely not as streamlined and foolproof as I would have expected), but I am not able to see it happen in the disassembly window of Visual Studio. Any idea when it is going to be accessible in an unstable release of CoreCLR or the desktop CLR to try it out?

EDIT: Just in case, before it is asked: yes, I have both options (Just My Code and Suppress JIT Optimization) unchecked so that I can see the raw optimized JIT assembly output.

@erozenfeld (Member)

@redknightlois You can ask the JIT to dump the disassembly for a method N.Test.Foo by setting

    set COMPLUS_JitDisasm=N.Test::Foo

and then running the app via corerun.exe (e.g., corerun.exe test.exe). If you don't see the expected ror or rol instructions, please share a repro.

@erozenfeld (Member)

@redknightlois setting COMPLUS_JitDisasm will only work with a Debug build of CoreCLR.

@redknightlois (Author)

@erozenfeld OK, after a bit of work I was able to look into this. From my tests, only a single instance catches the optimization.

It works:

        [MethodImpl(MethodImplOptions.NoInlining)]
        public static ulong RotateRight64(ulong value, int count)
        {
            return (value >> count) | (value << (64 - count));
        }

It doesn't work:

        [MethodImpl(MethodImplOptions.NoInlining)]
        public static long RotateRight64(long value, int count)
        {
            return (value >> count) | (value << (64 - count));
        }

None of the cases in Main catches the optimization either.

Repro: https://gist.github.com/redknightlois/e7b9934b6cd2af2aca96

@VSadov (Member) commented Oct 27, 2015

Don't you need to use unsigned to make it a rotation?

redknightlois referenced this issue in redknightlois/BenchmarkDotNet on Oct 27, 2015:

    …oreclr/issues/1619 is included in the Desktop RyuJIT and the CoreCLR runtime works properly.

    Added alternative implementations of PopCount for reference:
      • Naive if-based implementation
      • Parallel PopCount (two operations simultaneously, avoiding data hazards)
@redknightlois (Author)

@VSadov To be numerically consistent, yes; but at the assembly level there is no difference, AFAIK (in fact that property is exploited in hash functions whether signed or not). In effect, signed and unsigned should be equivalent for most practical purposes.

I am more worried about the examples in the Main method, though, because it is not clear why those are not caught either.

@GSPP commented Oct 27, 2015

Signed right shift treats the sign bit specially: int.MinValue >> 31 is -1, so 10000... becomes 11111.... This would be invalid to optimize.
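A quick demonstration of why the signed form must not be matched (my sketch; the hex comments show what each expression evaluates to):

long s = long.MinValue;            // 0x8000000000000000
ulong u = unchecked((ulong)s);

long bad   = (s >> 1) | (s << 63); // 0xC000000000000000: the sign bit is replicated, two high bits end up set
ulong good = (u >> 1) | (u << 63); // 0x4000000000000000: a true rotate right by 1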

a should not be caught but b should be.

And can we match on ^ as well as |? I have seen both.

Does this match if the two sides of the | are flipped? This should match no matter whether the left or right shift appears on the left-hand side.

I think there should be a test case for when the shift argument is not constant and is not constrained by a & 0x1f. The C# compiler never generates that pattern AFAIK, but it exists at the CLR level. The test would need to use IL or generate IL at runtime.

@redknightlois (Author)

@GSPP Makes sense; they are not entirely equivalent because of the special treatment on the shift side (not on the ror side). That still leaves the question of why the Main() operations with ulong are not optimized, though.

@erozenfeld (Member)

@redknightlois As others pointed out, only shifts on unsigned types may be used for a rotation pattern.
The reason the Main() operations with ulong aren't optimized is that the optimization is currently not enabled when fields (static or instance) are involved in the expression. If this is important for your scenarios, I can look into enabling it.
@GSPP Good point about ^. That should be easy to add.
Yes, the current implementation will match if the two sides of the | are flipped.
The case of the shift index being non-constant and having no mask will work. You're right that it can't be exercised from C# and would require IL.

@redknightlois (Author)

@erozenfeld Not in my specific scenario, but it is a bit misleading that the optimization is not applied when accessing fields directly while, on the other hand, it is able to optimize this:

public void Execute()
{
    a = RotateRight64(a, 16);
}

private static ulong a;

[MethodImpl(MethodImplOptions.AggressiveInlining)]
public static ulong RotateRight64(ulong value, int count)
{
    return (value >> count) | (value << (64 - count));
}

More misleading still is that it is able to optimize this:

public struct A { public ulong x; }

public A Execute()
{
    Random rnd = new Random(100);

    int count = 16;            

    A a;
    a.x = (ulong)rnd.Next();
    a.x = RotateRight64(a.x, count);

    return a;
}

But not this:

public struct A { public ulong x; }

public A Execute()
{
    Random rnd = new Random(100);

    int count = 16;            

    A a;
    a.x = (ulong)rnd.Next();            
    a.x = (a.x >> count) | (a.x << (64 - count));

    return a;
}    

For what it's worth, if you are able to push this into the desktop CLR earlier without the extra field optimization, I am all in favor of deferring it. :D

If that is the case, it would probably be worth at least documenting the behavior in a blog post.

@GSPP commented Oct 28, 2015

Couldn't the rotate matching be moved to after a compiler pass that eliminates duplicate memory reads? Such a pass surely exists already.

@erozenfeld (Member)

@redknightlois @GSPP I implemented some improvements to rotation matching based on your feedback: it now works when instance fields, static fields, or ref params are involved in the expression. It also works when ^ is used instead of |.
Pull request is here: dotnet/coreclr#2027

@redknightlois (Author)

Great!!! Will give it a try during the week. Any idea when this optimization will hit the desktop CLR?

@cmckinsey (Contributor)

Hi @redknightlois, just to clarify: we have two products, one open (this one) and one closed (the full framework on Windows). In general, requests about servicing and timeframes should be directed to Microsoft CSS/Dev support channels. We typically try to avoid adding new optimization work such as this in servicing/minor releases of .NET (e.g., .NET 4.6.1) in order to avoid destabilizing the product. In some circumstances we have considered adding changes like this on a case-by-case basis, depending on customer urgency; again, that requires the support channels. Otherwise, you can expect this in the next major release of the full .NET Framework and, of course, in .NET Core RC2.

@msftgits msftgits transferred this issue from dotnet/coreclr Jan 30, 2020
@ghost ghost locked as resolved and limited conversation to collaborators Jan 4, 2021