Merge branch 'master' of github.com:scarv/riscv-crypto

riscv · May 11, 2020 · fe64ad7 · fe64ad7
2 parents 1b8825b + 510353a
commit fe64ad7

Showing 6 changed files with 328 additions and 27 deletions.
diff --git a/doc/README.md b/doc/README.md
@@ -22,8 +22,8 @@ This directory contains two types of information:
   page.
 
 
-- [Supplementary information](supplementary-info.adoc),
-  in ascii doc form.
+- [Supplementary information](supp/supplementary-info.adoc),
+  in [AsciiDoc](https://asciidoctor.org/) format.
   This contains various recommendations, discussions and design
   rationale which we have developed in conjunction to the specification.
 
diff --git a/doc/supp/gcm-mode-cmul.adoc b/doc/supp/gcm-mode-cmul.adoc
@@ -0,0 +1,105 @@
+==  Galois/Counter Mode (GCM) with Carryless Multiply
+
+The Galois/Counter Mode (GCM) specified in
+https://doi.org/10.6028/NIST.SP.800-38D[NIST SP 800-38D] is a prominent
+Authenticated Encryption with Associated Data (AEAD) mechanism. It is
+the only cipher mode mandated as *MUST* for all
+https://www.rfc-editor.org/rfc/rfc8446.html[TLS 1.3] implementations
+and also frequently used in government and military applications.
+
+Here we'll briefly discuss implementation aspects of AES-GCM using the
+https://github.com/riscv/riscv-bitmanip[bitmanip] extension `B`.
+The instructions relevant to GCM are the Carry-Less Multiply instructions
+`CLMUL[H][W]` and also the Generalized Reverse `GREV[W]`.
+The `[W]` suffix indicates a 32-bit word size variant on RV64.
+Where possible, a `CLMULH` should be immediately followed by the matching
+`CLMUL`, as is done with `MULH`/`MUL`, although there is less of a
+performance advantage in this case.
+
+=== Finite Field Arithmetic in GF(2^128^)
+
+While message confidentiality in GCM is provided by a block cipher (AES)
+in counter mode (a CTR variant), authentication is based on a GHASH, a
+universal hash defined over the binary field GF(2^128^).
+Without custom instruction support GCM, just like AES itself, is either
+very slow or susceptible to cache timing attacks.
+
+Whether authenticating ciphertext or associated data, the main
+operation of GCM is the GHASH multiplication between a block of
+authentication data and a secret generator `H`. The addition in the
+field is trivial; just four or two XORs, depending on whether an RV32 or
+RV64 implementation is used.
+
+The finite field is defined to be the ring of binary polynomials modulo
+the primitive pentanomial
+latexmath:[$R(x) = x^{128} + x^7 + x^2 + x + 1.$]
+The field encoding is slightly unusual, with the multiplicative identity
+(i.e. one -- "1") being encoded as a byte sequence `0x80, 0x00, .., 0x00`.
+Converting to little-endian encoding involves reversing the bit order within
+each byte; the `GREV[W]` instruction with constant 7 (pseudo-instruction
+`rev.b`) accomplishes this.
+
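As a rough illustration of the bit reversal described above, the following C sketch models what `GREV` with constant 7 does to a 64-bit register: it reverses the bit order inside every byte and leaves the byte order alone. The function name `grev7` is made up for this example; with the `B` extension this would be a single instruction.

```c
#include <stdint.h>

/* Software model of GREV with constant 7 on a 64-bit word (hypothetical
   helper name): the three butterfly stages swap adjacent bits, bit pairs
   and nibbles, which reverses the bits inside each byte. */
static uint64_t grev7(uint64_t x)
{
    x = ((x & 0x5555555555555555ULL) << 1) | ((x & 0xAAAAAAAAAAAAAAAAULL) >> 1);
    x = ((x & 0x3333333333333333ULL) << 2) | ((x & 0xCCCCCCCCCCCCCCCCULL) >> 2);
    x = ((x & 0x0F0F0F0F0F0F0F0FULL) << 4) | ((x & 0xF0F0F0F0F0F0F0F0ULL) >> 4);
    return x;
}
```

Applied to the GCM encoding of "1", the leading byte `0x80` becomes `0x01`, which is the conventional little-endian bit numbering.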
+The multiplication itself can be asymptotically sped up with the Karatsuba
+method, which works even better in binary fields than it does with integers.
+This reduces the number of `CLMUL`/`CLMULH` pairs on RV64 from 4 to 3 and
+on RV32 from 16 to 9, at the cost of many additional XORs.
+
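The Karatsuba saving on RV64 can be sketched in portable C. The `clmul` and `clmulh` functions below are bit-serial software stand-ins for the hardware instructions (they are not constant-time and exist only for illustration), and `poly_mul_school` / `poly_mul_kara` are hypothetical names. Both compute the unreduced 256-bit product of two 128-bit binary polynomials; the first uses four multiply pairs, the second three.

```c
#include <stdint.h>

/* Low 64 bits of the carry-less product (model of the instruction). */
static uint64_t clmul(uint64_t a, uint64_t b)
{
    uint64_t r = 0;
    for (int i = 0; i < 64; i++)
        if ((b >> i) & 1)
            r ^= a << i;
    return r;
}

/* High 64 bits of the carry-less product (model of the instruction). */
static uint64_t clmulh(uint64_t a, uint64_t b)
{
    uint64_t r = 0;
    for (int i = 1; i < 64; i++)
        if ((b >> i) & 1)
            r ^= a >> (64 - i);
    return r;
}

/* Schoolbook: four CLMUL/CLMULH pairs. Words are little-endian. */
static void poly_mul_school(uint64_t z[4], const uint64_t a[2], const uint64_t b[2])
{
    z[0] = clmul(a[0], b[0]);
    z[1] = clmulh(a[0], b[0]) ^ clmul(a[0], b[1]) ^ clmul(a[1], b[0]);
    z[2] = clmul(a[1], b[1]) ^ clmulh(a[0], b[1]) ^ clmulh(a[1], b[0]);
    z[3] = clmulh(a[1], b[1]);
}

/* Karatsuba: three pairs, with the middle term recovered by XORs. */
static void poly_mul_kara(uint64_t z[4], const uint64_t a[2], const uint64_t b[2])
{
    uint64_t l0 = clmul(a[0], b[0]), l1 = clmulh(a[0], b[0]);
    uint64_t h0 = clmul(a[1], b[1]), h1 = clmulh(a[1], b[1]);
    uint64_t sa = a[0] ^ a[1], sb = b[0] ^ b[1];
    uint64_t m0 = clmul(sa, sb) ^ l0 ^ h0;   /* low half of a0*b1 ^ a1*b0 */
    uint64_t m1 = clmulh(sa, sb) ^ l1 ^ h1;  /* high half of the same */
    z[0] = l0;
    z[1] = l1 ^ m0;
    z[2] = h0 ^ m1;
    z[3] = h1;
}
```

Because addition and subtraction are both XOR in characteristic 2, the middle term needs no sign handling, which is one reason the method fits binary fields well.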
+
+=== Reduction via Shifts or with Multiplication
+
+The second arithmetic step to consider is the polynomial reduction of the
+255-bit ring product down to 128 bits (the field) again. The best way of
+doing reduction depends on _how_ _fast_ the carry-less multiplication
+instructions `CLMUL[H][W]` are in relation to shifts and XORs.
+
+We consider two alternative reduction methods:
+
+1. **Shift reduction**: Based on the low Hamming weight of the
+polynomial R.
+2. **Multiplication reduction**: Analogous to Montgomery and Barrett
+methods -- albeit simpler because we're working in characteristic 2.
+
+
+=== Determining the Fastest Method
+
+Examining the multiplication implementations in microbenchmarks,
+we obtain the following instruction counts:
+
+[cols="1,1,1,1,1,1,1,1", options="header"]
+.Instruction Counts
+|===
+| **Arch** | **Karatsuba**  | **Reduce**    | `GREV` | `XOR` | `S[L/R]L` | `CLMUL` | `CLMULH`
+| RV32B |   no  |   mul |   4   |   36  |   0   |   20  |   20
+| RV32B |   no  | shift |   4   |   56  |   24  |   16  |   16
+| RV32B |   yes |   mul |   4   |   52  |   0   |   13  |   13
+| RV32B |   yes | shift |   4   |   72  |   24  |   9   |   9
+| RV64B |   no  |   mul |   2   |   10  |   0   |   6   |   6
+| RV64B |   no  | shift |   2   |   20  |   12  |   4   |   4
+| RV64B |   yes |   mul |   2   |   14  |   0   |   5   |   5
+| RV64B |   yes | shift |   2   |   24  |   12  |   3   |   3
+|===
+
+
+We can see that the best selection of algorithms depends on the relative
+cost of multiplication. Assuming that other instructions have unit cost
+and multiply instructions require a multiple of it, and ignoring loops etc,
+we have:
+
+[cols="1,1,1,1,1,1,1", options="header"]
+.Clock Counts
+|===
+| **Arch** | **Karatsuba**  | **Reduce**    | **MUL=1** | **MUL=2** | **MUL=3** | **MUL=6**
+| RV32B |   no  |   mul | **80**    |   120     |   160     | 280
+| RV32B |   no  | shift |   116     |   148     |   180     | 276
+| RV32B |   yes |   mul |   82      |   **108** | **134**   | 212
+| RV32B |   yes | shift |   118     |   136     |   154     | **208**
+| RV64B |   no  |   mul | **24**    |   **36**  |   48      | 84
+| RV64B |   no  | shift |   42      |   50      |   58      | 82
+| RV64B |   yes |   mul |   26      |   **36**  | **46**    | 76
+| RV64B |   yes | shift |   44      |   50      |   56      | **74**
+|===
+
+We see that if `CLMUL[H][W]` takes at least twice the time of XORs and
+shifts, then Karatsuba is worthwhile. If these multiplication instructions
+are at least six times slower, then it is worthwhile to convert the
+reduction multiplications to shifts and XORs.
+
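The clock counts follow directly from the instruction counts under the stated cost model: every non-multiply instruction costs one unit and every `CLMUL` or `CLMULH` costs `mul` units. A minimal C sketch of that arithmetic is below; the struct and function names are invented for this example, and the data is copied from the Instruction Counts table.

```c
/* One row of the "Instruction Counts" table (names are illustrative). */
struct gcm_counts {
    const char *arch;
    int karatsuba;   /* 1 = yes, 0 = no */
    int mul_reduce;  /* 1 = reduce by multiplication, 0 = by shifts */
    int grev, xors, shifts, clmul, clmulh;
};

static const struct gcm_counts gcm_rows[8] = {
    { "RV32B", 0, 1, 4, 36,  0, 20, 20 },
    { "RV32B", 0, 0, 4, 56, 24, 16, 16 },
    { "RV32B", 1, 1, 4, 52,  0, 13, 13 },
    { "RV32B", 1, 0, 4, 72, 24,  9,  9 },
    { "RV64B", 0, 1, 2, 10,  0,  6,  6 },
    { "RV64B", 0, 0, 2, 20, 12,  4,  4 },
    { "RV64B", 1, 1, 2, 14,  0,  5,  5 },
    { "RV64B", 1, 0, 2, 24, 12,  3,  3 },
};

/* Estimated cost when a multiply costs `mul` units and everything else 1. */
static int gcm_cost(const struct gcm_counts *r, int mul)
{
    return r->grev + r->xors + r->shifts + mul * (r->clmul + r->clmulh);
}
```

For example, the first row gives 4 + 36 + 0 + 1 * 40 = 80 at `MUL=1`. Setting the two Karatsuba cost lines equal (56 + 26M against 100 + 18M on RV32, 16 + 10M against 38 + 6M on RV64) puts the crossover to shift reduction at M = 5.5 in both cases, consistent with the `MUL=6` column.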
diff --git a/doc/supp/indexed-loads-stores.adoc b/doc/supp/indexed-loads-stores.adoc
@@ -0,0 +1,32 @@
+== On Indexed Loads and Stores
+
+The RISC-V architecture is a pure load-and-store architecture. To perform
+an _indexed_ load or store, the index must generally be scaled and added
+to a base pointer, and only then can a separate load or store be performed.
+
+**Try to avoid table lookups with secret-derived indexes.**
+One prominent use of indexed loads in cryptography is to implement S-Boxes
+found in various secret-key algorithms. Secret-dependent loads (table lookups)
+are problematic in cryptography because of their high potential for cache
+timing attacks. We hope that our instruction extensions remove this problem
+for AES and other relevant cryptographic algorithms; they often allow
+lookup-free implementation without resorting to "bit slicing" the entire
+algorithm.
+
+**In RISC-V pointers are generally used for loop control.**
+For (arithmetic) loops the lack of indexing becomes less of a problem when
+one understands that RISC-V also makes no distinction between index and
+address registers and hence one can use pointers as indexes and loop
+counters. This is a transformation that a compiler usually does
+automatically; when your code is iterating through some vector `v[i]`
+with `i=0..n-1` there is no trace of an "index" in the compiled code.
+A single register is both a pointer and index and iterates through
+`*v++` until it reaches a pre-computed v + n point.
+
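The transformation described above can be written out by hand in C. This is only a sketch with made-up function names; a compiler would typically turn the first form into something like the second on its own.

```c
#include <stddef.h>
#include <stdint.h>

/* Source-level view: an explicit index i. */
static uint32_t xor_indexed(const uint32_t *v, size_t n)
{
    uint32_t acc = 0;
    for (size_t i = 0; i < n; i++)
        acc ^= v[i];
    return acc;
}

/* What the loop tends to look like after compilation: one register is
   both pointer and loop counter, compared against a pre-computed end. */
static uint32_t xor_pointer(const uint32_t *v, size_t n)
{
    const uint32_t *end = v + n;
    uint32_t acc = 0;
    while (v != end)
        acc ^= *v++;
    return acc;
}
```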
+**Offset loads and stores are "free".**
+All loads and stores in RISC-V have a 12-bit signed byte offset built into
+the instruction word itself. The offset loads and stores allow an
+implementor to often easily unroll a few steps to reduce the relative
+time spent on loop control and to potentially parallelize the operation
+in superscalar architectures without relying on branch prediction.
+
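A small C sketch of the unrolling idea follows. The function name is hypothetical and, to keep the example short, it assumes the element count is a multiple of four. Each of the four accesses inside the loop body can be encoded as a load with a constant byte offset (0, 4, 8, 12) from the same pointer register, so only one pointer update and one branch are needed per four elements.

```c
#include <stddef.h>
#include <stdint.h>

/* XOR-fold a vector, unrolled by four. Assumes n % 4 == 0. */
static uint32_t xor_unrolled4(const uint32_t *v, size_t n)
{
    const uint32_t *end = v + n;
    uint32_t a0 = 0, a1 = 0, a2 = 0, a3 = 0;
    while (v != end) {
        a0 ^= v[0];   /* offset  0 */
        a1 ^= v[1];   /* offset  4 */
        a2 ^= v[2];   /* offset  8 */
        a3 ^= v[3];   /* offset 12 */
        v += 4;       /* single pointer update per iteration */
    }
    return a0 ^ a1 ^ a2 ^ a3;
}
```

Using four independent accumulators is a design choice for the sketch: it removes the dependency chain between the XORs, which is what lets a superscalar core overlap them.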
diff --git a/doc/supp/rbr-arithmetic.adoc b/doc/supp/rbr-arithmetic.adoc
@@ -0,0 +1,159 @@
+==  Notes on RISC-V Cryptographic Arithmetic
+
+Implementors of cryptographic large-integer arithmetic on RISC-V are
+initially faced with the biggest single issue that differentiates this
+architecture from many others: the lack of carry bits and overflow detection.
+This section discusses typical implementation techniques used for
+constant-time implementation of large-integer arithmetic for cryptography.
+
+=== Redundant Binary Representation (RBR)
+
+A natural RISC-V approach is to use https://en.wikipedia.org/wiki/Redundant_binary_representation[Redundant Binary Representation] (RBR)
+for cryptographic big-integer arithmetic.
+
+Each XLEN-wide word carries d significant bits and r=XLEN-d
+additional redundancy bits. The numerical value X represented by a
+little-endian vector of n words `x[n]` is therefore:
+
+[latexmath]
+++++
+X = \sum_{i=0}^{n-1} 2^{id} x[i]
+++++
+
+This representation is redundant (not unique) since each word `x[i]` may
+still have numerical values up to 2^XLEN^-1 if unsigned. Also note
+that sometimes it is preferable to use signed `x[i]`.
+
+RBR algorithms are often used even when carry is available, since they (a) allow effective parallelization (even in SIMD and Vector architectures)
+and (b) allow easier implementation of constant-time arithmetic as there
+are no variable-length carry chains. Constant-time implementation is
+very important in cryptographic applications.
+
+
+=== Redundancy bits
+
+It is easy to see that addition and subtraction become fully parallel vector operations up to saturation. When implementing a sequence of arithmetic operations one can analyze where an overflow becomes possible and carry reduction is potentially required. Fortunately, carry reduction can usually be parallelized as well.
+
+For the convenience of serialization and deserialization, we often choose redundancy of r = 8 bits, leaving d=24 or d=56 non-redundant bits for each word.
+
+One should try to complete a larger cryptographic operation such as elliptic curve scalar multiplication or RSA exponentiation entirely in the RBR domain, apart from quantities that benefit from canonical or other representation -- such as exponents.
+
+Often these algorithms require additional representation tricks such as Montgomery form (to avoid modular remaindering) or projective or Jacobian coordinates (to avoid division in elliptic curve arithmetic). Most such techniques apply to RBR as well as they do to non-redundant representations.
+
+
+=== Parallel carry and Semi-Redundant Form (SRBR)
+
+There are two kinds of carry-reduction operations, one which is
+parallelizable and another which is not.
+
+In the following I'll use `DMASK` to denote the bit mask 2^d^-1
+that cuts a number to `d` bits, e.g. `0x00FFFFFF` or `0x00FFFFFFFFFFFFFF`.
+
+In parallel carry we simultaneously replace all `x[i]` with `x'[i]`:
+
+----
+    x'[0] = x[0] & DMASK
+    for all i, 0 < i < n:
+        x'[i] = (x[i] & DMASK) + (x[i-1] >> d)
+    end
+----
+
+
+In the unsigned case this puts each word `w=x[i]` in range
+latexmath:[$0 \le w < 2^d + 2^r$], with bits `w[XLEN-1:d+1]=0`
+and a relatively small probability that bit `w[d]` is nonzero.
+Vectors satisfying this condition are considered to have
+Semi-Redundant Binary Representation (SRBR). In vector format such
+a semi-reduction involves only pairs of vector elements.
+
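The parallel carry step translates almost line for line into C. The sketch below assumes an RV32-style layout with `d = 24` and unsigned words; the function name is made up, and the carry out of the top word is dropped exactly as in the pseudocode above.

```c
#include <stddef.h>
#include <stdint.h>

#define D     24u                 /* significant bits per word */
#define DMASK ((1u << D) - 1u)    /* 0x00FFFFFF */

/* One parallel carry step: y[i] depends only on x[i] and x[i-1], so all
   iterations are independent and could run as a vector operation. */
static void rbr_carry_parallel(uint32_t *y, const uint32_t *x, size_t n)
{
    if (n == 0)
        return;
    y[0] = x[0] & DMASK;
    for (size_t i = 1; i < n; i++)
        y[i] = (x[i] & DMASK) + (x[i - 1] >> D);
}
```

With input `{0x01FFFFFF, 0x00FFFFFF, 5}` the output is `{0x00FFFFFF, 0x01000000, 5}`: the middle word now has bit `d` set, which is the semi-redundant case the text describes.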
+
+=== Full carry and Non-Redundant Form (NRBR)
+
+Full carry is usually only required for serialization and numeric
+comparisons. In non-redundant NRBR form, each word `w=x[i]` is in range
+latexmath:[$0 \le w < 2^d$].
+Note that the r redundancy bits are still there but they're zeroes;
+`w[XLEN-1:d] = 0`.
+For negative numbers an NRBR convention may be adopted where the
+highest-order word (only) has redundancy equivalent to -1 (all 1 bits):
+`w[XLEN-1:d] = 111..1`.
+
+NRBR reduction can be implemented as a loop that proceeds word-by-word
+from least significant towards more significant words:
+
+----
+    c = 0                               //  (or carry-in)
+    for i = 0, 1, .. n-1 in sequence:
+        c = c + x[i]
+        x'[i] = c & DMASK
+        c = c >> d
+    end
+----
+
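A C sketch of the sequential loop above, again assuming `d = 24` with unsigned 32-bit words. The function name and the choice to return the final carry are assumptions of this example. The accumulator is 64 bits wide so that adding a fully redundant word to a pending carry cannot overflow.

```c
#include <stddef.h>
#include <stdint.h>

#define D     24u                 /* significant bits per word */
#define DMASK ((1u << D) - 1u)    /* 0x00FFFFFF */

/* Full carry propagation, in place, least significant word first.
   Returns whatever carry is left after the top word. */
static uint32_t rbr_carry_full(uint32_t *x, size_t n)
{
    uint64_t c = 0;               /* carry-in */
    for (size_t i = 0; i < n; i++) {
        c += x[i];
        x[i] = (uint32_t)(c & DMASK);
        c >>= D;
    }
    return (uint32_t)c;
}
```

The loop has a fixed trip count and no data-dependent branches, so it stays constant-time even though it cannot be parallelized.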
+Serialization to some fully canonical little- or big-endian wire formatting
+is an application matter and not discussed here.
+
+
+=== (Parallel) Multiplication and Input Prepping
+
+When multiplying two RBR numbers `x[0..n-1]` and `y[0..m-1]` the
+product `xy[0..n+m-1]` can be formed by starting with
+`xy[0..n+m-1] = 0` and computing (in parallel or in any order!) the sums:
+
+----
+    for all (i,j), 0 <= i < n, 0 <= j < m:
+        t = x[i] * y[j]
+        k = i + j
+        xy[k] = xy[k] + (t & DMASK)         //  <1>
+        xy[k + 1] = xy[k + 1] + (t >> d)    //  <2>
+    end
+----
+<1> The first addition uses bits `t[d-1:0]` of the product.
+<2> Second addition uses bits `t[XLEN+d-1:d]` of the product.
+
+The standard RISC-V instructions `MUL` and `MULH[[S]U]` return bits
+`t[XLEN-1:0]` and `t[2*XLEN-1:XLEN]` of the product, which would seem
+to necessitate a few additional shifts and an XOR for each
+step.
+
+An easy input-prep trick is to shift both (SRBR-format) inputs left
+by `r/2` bits before starting the operation (typically 4 positions).
+As a result the product is shifted left by `r` bits; hence `MULH[[S]U]`
+directly returns the desired high part, and only the lower word needs to be
+right-shifted by `r` bits.
+
+We see that the high `r` bits `t[2*XLEN-1:XLEN+d]` of the sub-product
+are discarded -- this style of implementation assumes that the
+inputs are SRBR (or similar) rather than general RBR.
+An alternative approach would be to apply `DMASK` to `xy[k + 1]` too, and
+add `(t >> (2*d))` to `xy[k + 2]`. However, it would seem to be
+always easier to SRBR-reduce inputs first. As a general strategy any easy
+O(n) or one-step parallel input prepping is worthwhile since the main body
+of multiplication is superlinear, often up to O(n^2^).
+
+
+Together with input prepping we have:
+----
+    for all i, 0 <= i < n:
+        x'[i] = x[i] << (r / 2)
+    end
+
+    for all j, 0 <= j < m:
+        y'[j] = y[j] << (r / 2)
+    end
+
+    for all (i,j) with 0 <= i < n, 0 <= j < m:
+        t = x'[i] * y'[j]
+        xy[i + j] = xy[i + j] + (t[XLEN-1:0] >> r)
+        xy[i + j + 1] = xy[i + j + 1] + t[2*XLEN-1:XLEN]
+    end
+----
+
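The prepped multiplication can be modelled in portable C for the RV32 case (`XLEN = 32`, `d = 24`, `r = 8`). This is a sketch under the assumption that the inputs are unsigned and already in SRBR form; the function name is invented. The 64-bit product stands in for a `MUL`/`MULHU` pair: its low 32 bits are what `MUL` would return and its high 32 bits are what `MULHU` would return.

```c
#include <stddef.h>
#include <stdint.h>

#define XLEN 32u
#define D    24u            /* significant bits per word */
#define R    (XLEN - D)     /* redundancy bits */

/* xy[0..n+m-1] = x[0..n-1] * y[0..m-1], inputs assumed SRBR, unsigned. */
static void rbr_mul(uint32_t *xy, const uint32_t *x, size_t n,
                    const uint32_t *y, size_t m)
{
    for (size_t k = 0; k < n + m; k++)
        xy[k] = 0;
    for (size_t i = 0; i < n; i++) {
        uint32_t xs = x[i] << (R / 2);             /* input prep */
        for (size_t j = 0; j < m; j++) {
            uint32_t ys = y[j] << (R / 2);         /* input prep */
            uint64_t t = (uint64_t)xs * ys;        /* product shifted by r */
            xy[i + j]     += (uint32_t)t >> R;     /* MUL, then shift right */
            xy[i + j + 1] += (uint32_t)(t >> XLEN);/* MULHU, used directly */
        }
    }
}
```

In real code the input shifts would be done once per operand rather than inside the inner loop; they are written inline here only to keep the sketch short. The result is again redundant and would normally be followed by a carry step.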
+For RV32 the choice of unsigned redundancy r=8 allows multiplication of
+24*256 = 6144-bit numbers (12288-bit product) without a carry reduction
+step, and 56*256 = 14336 for RV64, which is sufficient for most current
+cryptographic applications. However, one may easily introduce intermediate
+reduction steps. One may also use a signed representation, which makes
+https://en.wikipedia.org/wiki/Karatsuba_algorithm[Karatsuba]-style
+multiplication formulas easier to implement and asymptotically reduces
+the overall number of multiplication instructions for very large numbers.
+
diff --git a/doc/supp/supplementary-info.adoc b/doc/supp/supplementary-info.adoc
@@ -0,0 +1,30 @@
+= RISC-V Crypto Extension: Supplementary Information
+:doctype: article
+:encoding: utf-8
+:lang: en
+:toc: left
+:numbered:
+:stem: latexmath
+:le: &#8804;
+:ge: &#8805;
+:ne: &#8800;
+:inf: &#8734;
+
+_Work In Progress_
+
+== Introduction
+
+This document contains additional information for engineers implementing
+or using the RISC-V cryptography extension.
+
+NOTE:   The following discussion is not a _requirement_. We are simply
+providing supplementary information that is intended to be helpful for
+implementation and optimization of certain tasks, and provides partial
+rationale for certain ISA features.
+
+include::indexed-loads-stores.adoc[]
+
+include::rbr-arithmetic.adoc[]
+
+include::gcm-mode-cmul.adoc[]
+
diff --git a/doc/supplementary-info.adoc b/doc/supplementary-info.adoc