
Take advantage of LLVM funnel shift intrinsics (llvm.fshl/llvm.fshr) for fast rotation of integer types #11841

Closed
ssvb opened this issue Feb 20, 2022 · 8 comments · Fixed by #12307

Comments

@ssvb

ssvb commented Feb 20, 2022

First, please consider the following C code:

#include <stdint.h>
uint64_t rotl(uint64_t v, uint64_t rot) { return (v << rot) | (v >> (64 - rot)); }

It gets compiled by Clang with -O3 optimizations on an x86-64 system into pretty much a single ROL instruction:

0000000000000000 <rotl>:
   0:	48 89 f1             	mov    %rsi,%rcx
   3:	48 89 f8             	mov    %rdi,%rax
   6:	48 d3 c0             	rol    %cl,%rax
   9:	c3                   	ret    

Now let's look at what happens to the equivalent Crystal code:

@[NoInline]
def rotl(v : UInt64, rot : UInt64) : UInt64
  (v << rot) | (v >> (64 - rot))
end
p rotl(123, 4)
It gets compiled by crystal build --release --emit obj test.cr (and disassembled with objdump -d test.o) into something really horrible:
000000000002e7a0 <*rotl<UInt64, UInt64>:UInt64>:
 2e7a0:       41 56                   push   %r14
 2e7a2:       53                      push   %rbx
 2e7a3:       50                      push   %rax
 2e7a4:       b0 01                   mov    $0x1,%al
 2e7a6:       84 c0                   test   %al,%al
 2e7a8:       74 77                   je     2e821 <*rotl<UInt64, UInt64>:UInt64+0x81>
 2e7aa:       48 89 f3                mov    %rsi,%rbx
 2e7ad:       31 c0                   xor    %eax,%eax
 2e7af:       b9 40 00 00 80          mov    $0x80000040,%ecx
 2e7b4:       48 29 f1                sub    %rsi,%rcx
 2e7b7:       48 19 c0                sbb    %rax,%rax
 2e7ba:       83 e0 01                and    $0x1,%eax
 2e7bd:       48 0f a4 c8 20          shld   $0x20,%rcx,%rax
 2e7c2:       75 5d                   jne    2e821 <*rotl<UInt64, UInt64>:UInt64+0x81>
 2e7c4:       49 89 fe                mov    %rdi,%r14
 2e7c7:       b9 40 00 00 00          mov    $0x40,%ecx
 2e7cc:       29 d9                   sub    %ebx,%ecx
 2e7ce:       78 0d                   js     2e7dd <*rotl<UInt64, UInt64>:UInt64+0x3d>
 2e7d0:       83 f9 40                cmp    $0x40,%ecx
 2e7d3:       7d 1d                   jge    2e7f2 <*rotl<UInt64, UInt64>:UInt64+0x52>
 2e7d5:       4c 89 f0                mov    %r14,%rax
 2e7d8:       48 d3 e8                shr    %cl,%rax
 2e7db:       eb 27                   jmp    2e804 <*rotl<UInt64, UInt64>:UInt64+0x64>
 2e7dd:       f7 d9                   neg    %ecx
 2e7df:       70 40                   jo     2e821 <*rotl<UInt64, UInt64>:UInt64+0x81>
 2e7e1:       85 c9                   test   %ecx,%ecx
 2e7e3:       78 11                   js     2e7f6 <*rotl<UInt64, UInt64>:UInt64+0x56>
 2e7e5:       83 f9 40                cmp    $0x40,%ecx
 2e7e8:       7d 08                   jge    2e7f2 <*rotl<UInt64, UInt64>:UInt64+0x52>
 2e7ea:       4c 89 f0                mov    %r14,%rax
 2e7ed:       48 d3 e0                shl    %cl,%rax
 2e7f0:       eb 12                   jmp    2e804 <*rotl<UInt64, UInt64>:UInt64+0x64>
 2e7f2:       31 c0                   xor    %eax,%eax
 2e7f4:       eb 0e                   jmp    2e804 <*rotl<UInt64, UInt64>:UInt64+0x64>
 2e7f6:       f7 d9                   neg    %ecx
 2e7f8:       70 27                   jo     2e821 <*rotl<UInt64, UInt64>:UInt64+0x81>
 2e7fa:       4c 89 f7                mov    %r14,%rdi
 2e7fd:       89 ce                   mov    %ecx,%esi
 2e7ff:       e8 5c ff fd ff          call   e760 <*UInt64@Int#>><Int32>:UInt64>
 2e804:       89 d9                   mov    %ebx,%ecx
 2e806:       49 d3 e6                shl    %cl,%r14
 2e809:       31 c9                   xor    %ecx,%ecx
 2e80b:       48 83 fb 40             cmp    $0x40,%rbx
 2e80f:       49 0f 42 ce             cmovb  %r14,%rcx
 2e813:       48 09 c1                or     %rax,%rcx
 2e816:       48 89 c8                mov    %rcx,%rax
 2e819:       48 83 c4 08             add    $0x8,%rsp
 2e81d:       5b                      pop    %rbx
 2e81e:       41 5e                   pop    %r14
 2e820:       c3                      ret    
 2e821:       e8 00 00 00 00          call   2e826 <*rotl<UInt64, UInt64>:UInt64+0x86>
 2e826:       66 2e 0f 1f 84 00 00    cs nopw 0x0(%rax,%rax,1)
 2e82d:       00 00 00 

I'm not sure what's going on or why Crystal can't do the same optimization as Clang, but right now using the llvm.fshl intrinsic appears to be a possible workaround/solution:

# https://llvm.org/docs/LangRef.html#llvm-fshl-intrinsic
lib LibIntrinsics
  fun fshl64 = "llvm.fshl.i64"(a : UInt64, b : UInt64, c : UInt64) : UInt64
end
@[NoInline]
def rotl_fast(v : UInt64, rot : UInt64) : UInt64
  LibIntrinsics.fshl64(v, v, rot)
end
p rotl_fast(123, 4)

This compiles into:

000000000002e8a0 <*rotl_fast<UInt64, UInt64>:UInt64>:
   2e8a0:       48 89 f1                mov    %rsi,%rcx
   2e8a3:       48 89 f8                mov    %rdi,%rax
   2e8a6:       48 d3 c0                rol    %cl,%rax
   2e8a9:       c3                      ret    

I've been experimenting with this simple benchmark (with and without NoInline): https://gist.github.com/ssvb/49ee46ed5c9e481dc6c6ecf8b05b4d6b
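
For instance, a minimal Benchmark.ips comparison along those lines could look like this (a sketch, not the gist itself; it reuses the rotl and rotl_fast definitions from this comment):

require "benchmark"

x = 0xDEADBEEF_u64
Benchmark.ips do |b|
  b.report("rotl (shift/or)") { x = rotl(x, 13_u64) }
  b.report("rotl_fast (llvm.fshl)") { x = rotl_fast(x, 13_u64) }
end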

As for practical usage, I stumbled upon performance problems when trying to experiment with changing Crystal's hash function to AHash. Incidentally, @funny-falcon (the author of the current hash function) mentioned the same performance problem too.

@ssvb
Author

ssvb commented Feb 20, 2022

To sum it up, my suggestion is to implement some kind of "rotate left" and "rotate right" methods for the basic integer types, which may take advantage of LLVM's funnel shift intrinsics under the hood. Then even the existing hash function could use them instead of https://github.com/crystal-lang/crystal/blob/1.3.2/src/crystal/hasher.cr#L91-L93 (which the current Crystal 1.3.2 compiler doesn't seem to turn into efficient code either).
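
For illustration, a minimal sketch (not a concrete stdlib proposal) of what such methods could look like on UInt64, reusing the fshl binding from the issue description and adding an analogous binding for llvm.fshr.i64:

lib LibIntrinsics
  fun fshl64 = "llvm.fshl.i64"(a : UInt64, b : UInt64, c : UInt64) : UInt64
  fun fshr64 = "llvm.fshr.i64"(a : UInt64, b : UInt64, c : UInt64) : UInt64
end

struct UInt64
  def rotate_left(count : UInt64) : UInt64
    # a rotate is a funnel shift with both inputs equal; the intrinsic takes the count modulo 64
    LibIntrinsics.fshl64(self, self, count)
  end

  def rotate_right(count : UInt64) : UInt64
    LibIntrinsics.fshr64(self, self, count)
  end
end

Both methods should then lower to single rol/ror instructions on x86-64, just like the rotl_fast example above.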

@asterite
Member

What happens in C if you right shift a value too much, like shift by 65? And what happens in Crystal?

@HertzDevil
Contributor

This is because the C and Crystal versions aren't the same; #<< handles out-of-bound shift amounts, and #- raises on out-of-bound arguments. The unsafe counterpart lowers to rol:

@[NoInline]
def unsafe_rotl(x : UInt64, rot : Int32) : UInt64
  x.unsafe_shl(rot) | x.unsafe_shr(64 &- rot)
end
"*unsafe_rotl<UInt64, Int32>:UInt64":
        movl    %esi, %ecx
        movq    %rdi, %rax
        rolq    %cl, %rax
        retq
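
As a quick usage example of the unsafe variant (a sketch; it assumes 0 < rot < 64, because rot == 0 would make the unsafe_shr amount 64, which is outside the defined range for a 64-bit shift):

p unsafe_rotl(0x0123456789ABCDEF_u64, 8).to_s(16)  # => "23456789abcdef01"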

@ssvb
Author

ssvb commented Feb 20, 2022

This is because the C and Crystal versions aren't the same; #<< handles out-of-bound shift amounts, and #- raises on out-of-bound arguments

Oh, thank you very much. This seems to be a "missing documentation" bug then, because I can't see the #<< and #>> operators documented at https://crystal-lang.org/reference/1.3/syntax_and_semantics/operators.html

PS. I did try &<< and &>> before reporting this issue.

@HertzDevil
Contributor

HertzDevil commented Feb 20, 2022

A faster "safe" version, i.e. defined over all shift counts, is:

struct UInt64
  # instance-method counterpart of the top-level unsafe_rotl above
  def unsafe_rotl(count : Int32) : self
    unsafe_shl(count) | unsafe_shr(64 &- count)
  end

  def rotate_left(count : Int) : self
    unsafe_rotl(count.to_i! & 63)
  end
end

struct Int64
  def unsafe_rotl(count : Int32) : self
    to_u64!.unsafe_rotl(count).to_i64!
  end

  def rotate_left(count : Int) : self
    to_u64!.rotate_left(count).to_i64!
  end
end

Ditto for the other primitive integer types. BigInt has no true upper bound, so rotations don't make sense there.
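
For instance, a hedged sketch of the 32-bit variants following the same pattern (mask the count with 31 instead of 63):

struct UInt32
  def unsafe_rotl(count : Int32) : self
    unsafe_shl(count) | unsafe_shr(32 &- count)
  end

  def rotate_left(count : Int) : self
    unsafe_rotl(count.to_i! & 31)
  end
end

struct Int32
  def rotate_left(count : Int) : self
    to_u32!.rotate_left(count).to_i32!
  end
end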

What happens in C if you right shift a value too much, like shift by 65?

It is undefined behaviour (6.5.7):

If the value of the right operand is negative or is greater than or equal to the width of the promoted left operand, the behavior is undefined.

@konovod
Contributor

konovod commented Feb 20, 2022

The intrinsic could still be useful, because without it:

  • there is no guarantee that LLVM will always optimize this pattern; it could change in future LLVM versions
  • it won't be optimized in debug mode

Funnel shift intrinsics are supported starting from LLVM 7, which seems to be the minimal actually supported version anyway (according to LLVM supported versions #8725 (comment)).

@ssvb
Author

ssvb commented Feb 20, 2022

What happens in C if you right shift a value too much, like shift by 65? And what happens in Crystal?

That's a very good question. What C does doesn't matter here, but it would be great if the Crystal language designers could document how Crystal's shift operators are supposed to behave. Another issue can probably be opened about this.

As for the proposed new rotate methods for the integer types, I suggest just following the LLVM documentation, which says:

The ‘llvm.fshl’ family of intrinsic functions performs a funnel shift left: the first two values are concatenated as { %a : %b } (%a is the most significant bits of the wide value), the combined value is shifted left, and the most significant bits are extracted to produce a result that is the same size as the original arguments. If the first 2 arguments are identical, this is equivalent to a rotate left operation. For vector types, the operation occurs for each element of the vector. The shift argument is treated as an unsigned amount modulo the element size of the arguments.
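
For instance, with the LibIntrinsics.fshl64 binding from the issue description, the "identical arguments" and "modulo" behaviour can be checked directly (a small example):

v = 0x0123456789ABCDEF_u64
p LibIntrinsics.fshl64(v, v, 8_u64).to_s(16)  # => "23456789abcdef01" (rotate left by 8)
p LibIntrinsics.fshl64(v, v, 72_u64).to_s(16) # => same result, since the count is taken modulo 64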

The Intel documentation also says the following about the ROL/ROR instructions:

Shifts (rotates) the bits of the first operand (destination operand) the number of bit positions specified in the second operand (count operand) and stores the result in the destination operand. The destination operand can be a register or a memory location; the count operand is an unsigned integer that can be an immediate or a value in the CL register. The count is masked to 5 bits (or 6 bits if in 64-bit mode and REX.W = 1).

I'm trying to convert the following Rust code to Crystal:

    fn finish(&self) -> u64 {
        let rot = (self.pad & 63) as u32;
        self.buffer.rotate_left(rot)
    }

Ideally all of this should be translated into just a single x86-64 ROL instruction, without relying on any kind of undefined behaviour in the Crystal source code. Is .unsafe_shl/.unsafe_shr really the preferred method, or can something better be done?
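
For reference, a rough sketch of such a translation using the llvm.fshl binding from the issue description (FunnyHasher and its @buffer/@pad fields are hypothetical stand-ins for the Rust struct):

struct FunnyHasher
  # hypothetical fields mirroring the Rust struct
  def initialize(@buffer : UInt64, @pad : UInt64)
  end

  def finish : UInt64
    rot = @pad & 63_u64
    # llvm.fshl with identical inputs is a rotate left, so this is @buffer rotated left by rot
    LibIntrinsics.fshl64(@buffer, @buffer, rot)
  end
end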

I'm currently using the llvm.fshl intrinsic and so far it seems to work great. Here's sample code for a benchmark comparing FunnyHash with FxHash and AHash: https://gist.github.com/ssvb/a9f0fc9c8ff9d01a11829dcc7356fb90

@asterite
Member

but it would be great if Crystal language designers could provide the answers in the documentation about how Crystal shift operators are supposed to behave

Here are the docs:
