
Implement CLMUL instruction set #318

Closed
newpavlov opened this issue Feb 9, 2018 · 5 comments

@newpavlov
Contributor

Carry-less Multiplication (CLMUL) can improve the speed of applications doing block cipher encryption in Galois/Counter Mode, which depends on multiplication in the finite field GF(2^k). Another application is the fast calculation of CRC values, including those used to implement the LZ77 sliding-window DEFLATE algorithm in zlib and pngcrush. (wiki)
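To make the idea concrete: carry-less multiplication is ordinary long multiplication with the partial products combined by XOR instead of addition, i.e. polynomial multiplication over GF(2). The following is a minimal portable Rust sketch of a 64-bit by 64-bit carry-less product with a 128-bit result; `clmul64` is a hypothetical helper name, and a bit-by-bit loop like this is far slower than the hardware instruction.

```rust
/// Carry-less multiplication of two 64-bit values (hypothetical helper).
/// Each set bit `i` of `b` contributes `a << i`, and the partial products
/// are combined with XOR, so no carries propagate between bit positions.
fn clmul64(a: u64, b: u64) -> u128 {
    let a = a as u128;
    let mut product: u128 = 0;
    for i in 0..64 {
        if (b >> i) & 1 == 1 {
            product ^= a << i;
        }
    }
    product
}
```

For example, `clmul64(0b11, 0b11)` is `0b101` (x^2 + 1), whereas the ordinary integer product 3 * 3 is `0b1001`.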

@gnzlbg
Contributor

gnzlbg commented Feb 9, 2018

Would _mm_clmulepi64_si128 be enough?

From the clang docs:

/// \brief Multiplies two 64-bit integer values, which are selected from source
/// operands using the immediate-value operand. The multiplication is a
/// carry-less multiplication, and the 128-bit integer product is stored in
/// the destination.
///
/// \headerfile <x86intrin.h>
///
/// \code
/// __m128i _mm_clmulepi64_si128(__m128i __X, __m128i __Y, const int __I);
/// \endcode
///
/// This intrinsic corresponds to the VPCLMULQDQ instruction.
///
/// \param __X
/// A 128-bit vector of [2 x i64] containing one of the source operands.
/// \param __Y
/// A 128-bit vector of [2 x i64] containing one of the source operands.
/// \param __I
/// An immediate value specifying which 64-bit values to select from the
/// operands. Bit 0 is used to select a value from operand \a __X, and bit
/// 4 is used to select a value from operand \a __Y: \n
/// Bit[0]=0 indicates that bits[63:0] of operand \a __X are used. \n
/// Bit[0]=1 indicates that bits[127:64] of operand \a __X are used. \n
/// Bit[4]=0 indicates that bits[63:0] of operand \a __Y are used. \n
/// Bit[4]=1 indicates that bits[127:64] of operand \a __Y are used.
/// \returns The 128-bit integer vector containing the result of the carry-less
/// multiplication of the selected 64-bit values.
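For reference, here is a minimal sketch of how the intrinsic described above can be called from Rust, assuming a recent toolchain where `std::arch::x86_64::_mm_clmulepi64_si128` takes the immediate as a const generic (this is not the 2018 coresimd draft). `clmul_lo_soft`, `clmul_lo_hw` and `clmul_lo` are hypothetical helper names; the sketch only uses immediate 0x00 (low qword of both operands).

```rust
#[cfg(target_arch = "x86_64")]
use std::arch::x86_64::{__m128i, _mm_clmulepi64_si128, _mm_set_epi64x, _mm_storeu_si128};

/// Portable fallback: shift-and-XOR carry-less multiply (hypothetical helper).
fn clmul_lo_soft(a: u64, b: u64) -> u128 {
    let mut product: u128 = 0;
    for i in 0..64 {
        if (b >> i) & 1 == 1 {
            product ^= (a as u128) << i;
        }
    }
    product
}

/// Hardware path (hypothetical helper).
/// Safety: the caller must ensure the CPU supports `pclmulqdq`.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "pclmulqdq")]
unsafe fn clmul_lo_hw(a: u64, b: u64) -> u128 {
    unsafe {
        // _mm_set_epi64x takes the high lane first, so a and b land in bits [63:0].
        let va = _mm_set_epi64x(0, a as i64);
        let vb = _mm_set_epi64x(0, b as i64);
        // imm = 0x00: bit 0 = 0 and bit 4 = 0 select bits [63:0] of both operands.
        let prod = _mm_clmulepi64_si128::<0x00>(va, vb);
        // Store both 64-bit lanes; lane 0 holds the low half (x86 is little-endian).
        let mut lanes = [0u64; 2];
        _mm_storeu_si128(lanes.as_mut_ptr() as *mut __m128i, prod);
        (lanes[0] as u128) | ((lanes[1] as u128) << 64)
    }
}

/// Uses PCLMULQDQ when it is detected at run time, otherwise the fallback.
fn clmul_lo(a: u64, b: u64) -> u128 {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("pclmulqdq") {
            // SAFETY: the required CPU feature was just detected.
            return unsafe { clmul_lo_hw(a, b) };
        }
    }
    clmul_lo_soft(a, b)
}
```

Keeping the `#[target_feature]` function separate from the safe dispatcher means the run-time check happens once per call site, and the compiler is only allowed to emit PCLMULQDQ inside the function that is guarded by that check.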

This would need:

  • adding run-time feature detection for pclmulqdq and vpclmulqdq in coresimd:
    • pclmulqdq: EAX=1 => ECX:1 (after sse3)
    • vpclmulqdq: EAX=7 => ECX:10 (after vaes), requires OS support for saving AVX registers
  • whitelist both features in rustc
  • add the intrinsic and tests to coresimd, probably in its own module: i{586,686} or x86_64?
  • resolve the inconsistency between the clang docs and the Intel Intrinsics Guide.
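The CPUID checks listed above can be sketched as follows, assuming the `__cpuid`, `__cpuid_count` and `_xgetbv` intrinsics from `std::arch::x86_64`. `ClmulFeatures`, `detect` and `read_xcr0` are hypothetical names, and this is an illustration of the bit positions rather than how coresimd itself implements detection. "OS support for saving AVX registers" is modelled here as OSXSAVE plus XCR0 bits 1 and 2.

```rust
#[cfg(target_arch = "x86_64")]
use std::arch::x86_64::{__cpuid, __cpuid_count, _xgetbv};

/// CLMUL-related CPU/OS capabilities (hypothetical type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct ClmulFeatures {
    pclmulqdq: bool,
    vpclmulqdq: bool,
    os_saves_avx: bool,
}

/// Reads XCR0. Safety: only call when CPUID reports OSXSAVE.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "xsave")]
unsafe fn read_xcr0() -> u64 {
    unsafe { _xgetbv(0) }
}

#[cfg(target_arch = "x86_64")]
fn detect() -> ClmulFeatures {
    // Leaf 0: EAX is the highest supported standard leaf.
    let max_leaf = unsafe { __cpuid(0) }.eax;
    // Leaf 1, ECX: bit 1 = PCLMULQDQ, bit 27 = OSXSAVE (XGETBV usable).
    let leaf1_ecx = unsafe { __cpuid(1) }.ecx;
    let pclmulqdq = (leaf1_ecx & (1 << 1)) != 0;
    let osxsave = (leaf1_ecx & (1 << 27)) != 0;
    // Leaf 7, sub-leaf 0, ECX: bit 10 = VPCLMULQDQ (bit 9 = VAES).
    let vpclmulqdq = max_leaf >= 7 && (unsafe { __cpuid_count(7, 0) }.ecx & (1 << 10)) != 0;
    // OS saves AVX state only if XCR0 bits 1 (XMM) and 2 (YMM) are both set.
    let os_saves_avx = osxsave && (unsafe { read_xcr0() } & 0b110) == 0b110;
    ClmulFeatures { pclmulqdq, vpclmulqdq, os_saves_avx }
}

#[cfg(not(target_arch = "x86_64"))]
fn detect() -> ClmulFeatures {
    ClmulFeatures { pclmulqdq: false, vpclmulqdq: false, os_saves_avx: false }
}
```

A caller would typically only dispatch to the VEX-encoded instruction when both `vpclmulqdq` and `os_saves_avx` are true, while plain `pclmulqdq` needs no OS check.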

Inconsistency: the Intel spec states that this intrinsic lowers to pclmulqdq, but the clang docs state that the LLVM intrinsic lowers to vpclmulqdq. LLVM supports both as independent features, and llvm/hosts.cpp (used by target-cpu=native) also detects them as independent features. So the following question must be answered:

  • To which instruction does _mm_clmulepi64_si128 lower and when?

@gnzlbg
Contributor

gnzlbg commented Feb 9, 2018

From the PCLMULQDQ reference, it looks like this might lower to vpclmulqdq if AVX is enabled and the target OS supports it, and to pclmulqdq otherwise.

@newpavlov
Contributor Author

newpavlov commented Feb 9, 2018

Would _mm_clmulepi64_si128 be enough?

If I understand everything correctly, yes, since the other four mnemonics are equivalent to using PCLMULQDQ with an immediate of 0x00, 0x01, 0x10 and 0x11.
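The equivalence can be checked against a small software model of the immediate decoding described in the clang docs (bit 0 selects the qword of the first operand, bit 4 the qword of the second). This is a sketch, not Intel's pseudocode: `clmul64` and `pclmulqdq_model` are hypothetical names, and operands are modelled as `[low qword, high qword]` arrays.

```rust
/// Bit-by-bit carry-less multiply of two 64-bit values (polynomials over GF(2)).
fn clmul64(a: u64, b: u64) -> u128 {
    (0..64u32)
        .filter(|&i| (b >> i) & 1 == 1)
        .fold(0u128, |acc, i| acc ^ ((a as u128) << i))
}

/// Software model of PCLMULQDQ operand selection (hypothetical helper):
/// imm bit 0 picks the qword of `x`, imm bit 4 picks the qword of `y`,
/// and all other immediate bits are ignored.
fn pclmulqdq_model(x: [u64; 2], y: [u64; 2], imm: u8) -> u128 {
    let x_qword = x[(imm & 0x01) as usize];
    let y_qword = y[((imm >> 4) & 0x01) as usize];
    clmul64(x_qword, y_qword)
}
```

In this model the immediates 0x00, 0x01, 0x10 and 0x11 cover every low/high pairing of the two operands, which is why one intrinsic taking the immediate is enough to express all four mnemonics.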

@gnzlbg
Contributor

gnzlbg commented Feb 9, 2018

So if somebody wants to give this a try, I can mentor. It gives a pretty good overview of stdsimd and is an easy issue to solve.

I am bad at making time estimates, but 90% of the work can probably be done in 1-2 hours (maybe less, depending on how much Rust experience one has). The last 10% involves letting the Travis build bots run (they take a while) and fixing any issues that come up on the i586/i686 targets.

newpavlov added a commit to newpavlov/rust that referenced this issue Feb 10, 2018
@newpavlov
Contributor Author

newpavlov commented Feb 10, 2018

I've written draft PRs (linked above). I used aes as an example, so they do not handle the AVX-specific parts, and _mm_clmulepi64_si128 simply links to llvm.x86.pclmulqdq. If I understand correctly, LLVM uses the feature name pclmul, not pclmulqdq or clmul. So there is currently a small naming inconsistency: the intrinsic is named _mm_clmulepi64_si128 while the feature is named pclmul.

kennytm added a commit to kennytm/rust that referenced this issue Feb 12, 2018
kennytm added a commit to kennytm/rust that referenced this issue Feb 13, 2018
Whitelist pclmulqdq x86 feature flag

Relevant `stdsimd` [issue](rust-lang/stdarch#318).
kennytm added a commit to kennytm/rust that referenced this issue Feb 14, 2018