
Take advantage of LLVM funnel shift intrinsics (llvm.fshl/llvm.fshr) for fast rotation of integer types #11841

Closed
ssvb opened this issue Feb 20, 2022 · 8 comments · Fixed by #12307

Comments

@ssvb

ssvb commented Feb 20, 2022

First, please consider the following C code:

#include <stdint.h>
uint64_t rotl(uint64_t v, uint64_t rot) { return (v << rot) | (v >> (64 - rot)); }

It gets compiled by Clang with -O3 optimizations on an x86-64 system into pretty much a single ROL instruction:

0000000000000000 <rotl>:
   0:	48 89 f1             	mov    %rsi,%rcx
   3:	48 89 f8             	mov    %rdi,%rax
   6:	48 d3 c0             	rol    %cl,%rax
   9:	c3                   	ret    

Now let's look at what happens to the equivalent Crystal code:

@[NoInline]
def rotl(v : UInt64, rot : UInt64) : UInt64
  (v << rot) | (v >> (64 - rot))
end
p rotl(123, 4)
It gets compiled by crystal build --release --emit obj test.cr (and disassembled with objdump -d test.o) into something really horrible:
000000000002e7a0 <*rotl<UInt64, UInt64>:UInt64>:
 2e7a0:       41 56                   push   %r14
 2e7a2:       53                      push   %rbx
 2e7a3:       50                      push   %rax
 2e7a4:       b0 01                   mov    $0x1,%al
 2e7a6:       84 c0                   test   %al,%al
 2e7a8:       74 77                   je     2e821 <*rotl<UInt64, UInt64>:UInt64+0x81>
 2e7aa:       48 89 f3                mov    %rsi,%rbx
 2e7ad:       31 c0                   xor    %eax,%eax
 2e7af:       b9 40 00 00 80          mov    $0x80000040,%ecx
 2e7b4:       48 29 f1                sub    %rsi,%rcx
 2e7b7:       48 19 c0                sbb    %rax,%rax
 2e7ba:       83 e0 01                and    $0x1,%eax
 2e7bd:       48 0f a4 c8 20          shld   $0x20,%rcx,%rax
 2e7c2:       75 5d                   jne    2e821 <*rotl<UInt64, UInt64>:UInt64+0x81>
 2e7c4:       49 89 fe                mov    %rdi,%r14
 2e7c7:       b9 40 00 00 00          mov    $0x40,%ecx
 2e7cc:       29 d9                   sub    %ebx,%ecx
 2e7ce:       78 0d                   js     2e7dd <*rotl<UInt64, UInt64>:UInt64+0x3d>
 2e7d0:       83 f9 40                cmp    $0x40,%ecx
 2e7d3:       7d 1d                   jge    2e7f2 <*rotl<UInt64, UInt64>:UInt64+0x52>
 2e7d5:       4c 89 f0                mov    %r14,%rax
 2e7d8:       48 d3 e8                shr    %cl,%rax
 2e7db:       eb 27                   jmp    2e804 <*rotl<UInt64, UInt64>:UInt64+0x64>
 2e7dd:       f7 d9                   neg    %ecx
 2e7df:       70 40                   jo     2e821 <*rotl<UInt64, UInt64>:UInt64+0x81>
 2e7e1:       85 c9                   test   %ecx,%ecx
 2e7e3:       78 11                   js     2e7f6 <*rotl<UInt64, UInt64>:UInt64+0x56>
 2e7e5:       83 f9 40                cmp    $0x40,%ecx
 2e7e8:       7d 08                   jge    2e7f2 <*rotl<UInt64, UInt64>:UInt64+0x52>
 2e7ea:       4c 89 f0                mov    %r14,%rax
 2e7ed:       48 d3 e0                shl    %cl,%rax
 2e7f0:       eb 12                   jmp    2e804 <*rotl<UInt64, UInt64>:UInt64+0x64>
 2e7f2:       31 c0                   xor    %eax,%eax
 2e7f4:       eb 0e                   jmp    2e804 <*rotl<UInt64, UInt64>:UInt64+0x64>
 2e7f6:       f7 d9                   neg    %ecx
 2e7f8:       70 27                   jo     2e821 <*rotl<UInt64, UInt64>:UInt64+0x81>
 2e7fa:       4c 89 f7                mov    %r14,%rdi
 2e7fd:       89 ce                   mov    %ecx,%esi
 2e7ff:       e8 5c ff fd ff          call   e760 <*UInt64@Int#>><Int32>:UInt64>
 2e804:       89 d9                   mov    %ebx,%ecx
 2e806:       49 d3 e6                shl    %cl,%r14
 2e809:       31 c9                   xor    %ecx,%ecx
 2e80b:       48 83 fb 40             cmp    $0x40,%rbx
 2e80f:       49 0f 42 ce             cmovb  %r14,%rcx
 2e813:       48 09 c1                or     %rax,%rcx
 2e816:       48 89 c8                mov    %rcx,%rax
 2e819:       48 83 c4 08             add    $0x8,%rsp
 2e81d:       5b                      pop    %rbx
 2e81e:       41 5e                   pop    %r14
 2e820:       c3                      ret    
 2e821:       e8 00 00 00 00          call   2e826 <*rotl<UInt64, UInt64>:UInt64+0x86>
 2e826:       66 2e 0f 1f 84 00 00    cs nopw 0x0(%rax,%rax,1)
 2e82d:       00 00 00 

I'm not sure what's going on or why Crystal can't do the same optimization as Clang, but right now using the llvm.fshl intrinsic appears to be a possible workaround/solution:

# https://llvm.org/docs/LangRef.html#llvm-fshl-intrinsic
lib LibIntrinsics
  fun fshl64 = "llvm.fshl.i64"(a : UInt64, b : UInt64, c : UInt64) : UInt64
end
@[NoInline]
def rotl_fast(v : UInt64, rot : UInt64) : UInt64
  LibIntrinsics.fshl64(v, v, rot)
end
p rotl_fast(123, 4)

This compiles into:

000000000002e8a0 <*rotl_fast<UInt64, UInt64>:UInt64>:
   2e8a0:       48 89 f1                mov    %rsi,%rcx
   2e8a3:       48 89 f8                mov    %rdi,%rax
   2e8a6:       48 d3 c0                rol    %cl,%rax
   2e8a9:       c3                      ret    

I've been experimenting with this simple benchmark (with and without NoInline): https://gist.github.com/ssvb/49ee46ed5c9e481dc6c6ecf8b05b4d6b
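
For instance, a minimal Benchmark.ips comparison along those lines could look like this (a sketch, not the gist itself; it reuses the rotl and rotl_fast definitions from this comment):

require "benchmark"

x = 0xDEADBEEF_u64
Benchmark.ips do |b|
  b.report("rotl (shift/or)") { x = rotl(x, 13_u64) }
  b.report("rotl_fast (llvm.fshl)") { x = rotl_fast(x, 13_u64) }
end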

As for practical usage, I stumbled upon performance problems when trying to experiment with changing Crystal's hash function to AHash. Incidentally, @funny-falcon (the author of the current hash function) mentioned the same performance problem too.

@ssvb
Author

ssvb commented Feb 20, 2022

To sum it up, my suggestion is to implement some kind of "rotate left" and "rotate right" methods for the basic integer types, which may take advantage of LLVM's funnel shift intrinsics under the hood. Then even the existing hash function could use them instead of https://github.com/crystal-lang/crystal/blob/1.3.2/src/crystal/hasher.cr#L91-L93 (which the current Crystal 1.3.2 compiler doesn't seem to turn into efficient code either).
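
For illustration, a minimal sketch (not a concrete stdlib proposal) of what such methods could look like on UInt64, reusing the fshl binding from the issue description and adding an analogous binding for llvm.fshr.i64:

lib LibIntrinsics
  fun fshl64 = "llvm.fshl.i64"(a : UInt64, b : UInt64, c : UInt64) : UInt64
  fun fshr64 = "llvm.fshr.i64"(a : UInt64, b : UInt64, c : UInt64) : UInt64
end

struct UInt64
  def rotate_left(count : UInt64) : UInt64
    # a rotate is a funnel shift with both inputs equal; the intrinsic takes the count modulo 64
    LibIntrinsics.fshl64(self, self, count)
  end

  def rotate_right(count : UInt64) : UInt64
    LibIntrinsics.fshr64(self, self, count)
  end
end

Both methods should then lower to single rol/ror instructions on x86-64, just like the rotl_fast example above.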

@asterite
Member

What happens in C if you right shift a value too much, like shift by 65? And what happens in Crystal?

@HertzDevil
Contributor

This is because the C and Crystal versions aren't the same; #<< handles out-of-bound shift amounts, and #- raises on out-of-bound arguments. The unsafe counterpart lowers to rol:

@[NoInline]
def unsafe_rotl(x : UInt64, rot : Int32) : UInt64
  x.unsafe_shl(rot) | x.unsafe_shr(64 &- rot)
end
"*unsafe_rotl<UInt64, Int32>:UInt64":
        movl    %esi, %ecx
        movq    %rdi, %rax
        rolq    %cl, %rax
        retq
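
As a quick usage example of the unsafe variant (a sketch; it assumes 0 < rot < 64, because rot == 0 would make the unsafe_shr amount 64, which is outside the defined range for a 64-bit shift):

p unsafe_rotl(0x0123456789ABCDEF_u64, 8).to_s(16)  # => "23456789abcdef01"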

@ssvb
Author

ssvb commented Feb 20, 2022

This is because the C and Crystal versions aren't the same; #<< handles out-of-bound shift amounts, and #- raises on out-of-bound arguments

Oh, thank you very much. This seems to be a "missing documentation" bug then, because I can't see the #<< and #>> operators documented at https://crystal-lang.org/reference/1.3/syntax_and_semantics/operators.html

PS. I did try &<< and &>> before reporting this issue.

@HertzDevil
Contributor

HertzDevil commented Feb 20, 2022

A faster "safe" version, i.e. defined over all shift counts, is:

struct UInt64
  # instance-method counterpart of the top-level unsafe_rotl above
  def unsafe_rotl(count : Int32) : self
    unsafe_shl(count) | unsafe_shr(64 &- count)
  end

  def rotate_left(count : Int) : self
    unsafe_rotl(count.to_i! & 63)
  end
end

struct Int64
  def unsafe_rotl(count : Int32) : self
    to_u64!.unsafe_rotl(count).to_i64!
  end

  def rotate_left(count : Int) : self
    to_u64!.rotate_left(count).to_i64!
  end
end

Ditto for the other primitive integer types. BigInt has no true upper bound, so rotations don't make sense there.
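
For instance, a hedged sketch of the 32-bit variants following the same pattern (mask the count with 31 instead of 63):

struct UInt32
  def unsafe_rotl(count : Int32) : self
    unsafe_shl(count) | unsafe_shr(32 &- count)
  end

  def rotate_left(count : Int) : self
    unsafe_rotl(count.to_i! & 31)
  end
end

struct Int32
  def rotate_left(count : Int) : self
    to_u32!.rotate_left(count).to_i32!
  end
end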

What happens in C if you right shift a value too much, like shift by 65?

It is undefined behaviour (6.5.7):

If the value of the right operand is negative or is greater than or equal to the width of the promoted left operand, the behavior is undefined.

@konovod
Contributor

konovod commented Feb 20, 2022

The intrinsic could still be useful, because without it:

  • there is no guarantee that LLVM will always optimize this pattern; it could change in future LLVM versions
  • it won't be optimized in debug mode

Funnel shift intrinsics are supported starting from LLVM 7, which seems to be the minimal actually supported version anyway (according to LLVM supported versions #8725 (comment)).

@ssvb
Author

ssvb commented Feb 20, 2022

What happens in C if you right shift a value too much, like shift by 65? And what happens in Crystal?

That's a very good question. What C does doesn't matter here, but it would be great if the Crystal language designers could document how Crystal's shift operators are supposed to behave. Another issue can probably be opened about this.

As for the proposed new rotate methods for the integer types, I suggest just following the LLVM documentation, which says:

The ‘llvm.fshl’ family of intrinsic functions performs a funnel shift left: the first two values are concatenated as { %a : %b } (%a is the most significant bits of the wide value), the combined value is shifted left, and the most significant bits are extracted to produce a result that is the same size as the original arguments. If the first 2 arguments are identical, this is equivalent to a rotate left operation. For vector types, the operation occurs for each element of the vector. The shift argument is treated as an unsigned amount modulo the element size of the arguments.
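
For instance, with the LibIntrinsics.fshl64 binding from the issue description, the "identical arguments" and "modulo" behaviour can be checked directly (a small example):

v = 0x0123456789ABCDEF_u64
p LibIntrinsics.fshl64(v, v, 8_u64).to_s(16)  # => "23456789abcdef01" (rotate left by 8)
p LibIntrinsics.fshl64(v, v, 72_u64).to_s(16) # => same result, since the count is taken modulo 64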

The Intel documentation also says the following about the ROL/ROR instructions:

Shifts (rotates) the bits of the first operand (destination operand) the number of bit positions specified in the second operand (count operand) and stores the result in the destination operand. The destination operand can be a register or a memory location; the count operand is an unsigned integer that can be an immediate or a value in the CL register. The count is masked to 5 bits (or 6 bits if in 64-bit mode and REX.W = 1).

I'm trying to convert the following Rust code to Crystal:

    fn finish(&self) -> u64 {
        let rot = (self.pad & 63) as u32;
        self.buffer.rotate_left(rot)
    }

Ideally all of this should be translated into just a single x86-64 ROL instruction, without relying on any kind of undefined behaviour in the Crystal source code. Is .unsafe_shl/.unsafe_shr really the preferred method, or can something better be done?
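
For reference, a rough sketch of such a translation using the llvm.fshl binding from the issue description (FunnyHasher and its @buffer/@pad fields are hypothetical stand-ins for the Rust struct):

struct FunnyHasher
  # hypothetical fields mirroring the Rust struct
  def initialize(@buffer : UInt64, @pad : UInt64)
  end

  def finish : UInt64
    rot = @pad & 63_u64
    # llvm.fshl with identical inputs is a rotate left, so this is @buffer rotated left by rot
    LibIntrinsics.fshl64(@buffer, @buffer, rot)
  end
end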

I'm currently using the llvm.fshl intrinsic and so far it seems to work great. Here's sample code for a benchmark comparing FunnyHash with FxHash and AHash: https://gist.github.com/ssvb/a9f0fc9c8ff9d01a11829dcc7356fb90

@asterite
Member

but it would be great if Crystal language designers could provide the answers in the documentation about how Crystal shift operators are supposed to behave

Here are the docs:
