From 29e6172746cac70564739576ee15ceb92b8dc9b5 Mon Sep 17 00:00:00 2001
From: "Markku-Juhani O. Saarinen" <mjos@iki.fi>
Date: Sat, 9 May 2020 18:26:09 +0100
Subject: [PATCH 1/4] starting supplementary info document

---
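Reviewer note (this area is ignored by `git am`): a minimal Python sketch of
the parallel-carry and full-carry steps that doc/supp/rbr-arithmetic.adoc
gives in pseudocode. The parameters XLEN=32 and r=8 and every function name
below are assumptions chosen for illustration; the patch defines none of them.

```python
# Illustrative model only: XLEN, R, D and all function names are assumptions.
XLEN = 32                 # register width
R = 8                     # redundancy bits per word
D = XLEN - R              # significant bits per word (d = 24)
DMASK = (1 << D) - 1      # mask that cuts a word to d bits


def rbr_value(x):
    """Value X = sum(2^(i*d) * x[i]) of a little-endian RBR vector."""
    return sum(w << (i * D) for i, w in enumerate(x))


def parallel_carry(x):
    """One parallel carry step; each output word depends on two inputs at most.

    The carry out of the top word (x[n-1] >> d) is dropped, as in the
    pseudocode, so the caller must keep the top word below 2^d.
    """
    out = [x[0] & DMASK]
    for i in range(1, len(x)):
        out.append((x[i] & DMASK) + (x[i - 1] >> D))
    return out


def full_carry(x, c=0):
    """Sequential carry propagation to non-redundant form; returns (words, carry)."""
    out = []
    for w in x:
        c += w
        out.append(c & DMASK)
        c >>= D
    return out, c


x = [0xFFFFFFFF, 0x01FFFFFF, 0x00000003]
srbr = parallel_carry(x)
nrbr, carry_out = full_carry(srbr)
```

In this example `srbr` keeps the same numerical value as `x` with every word
below 2^d + 2^r, and `nrbr` has every word below 2^d.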
 doc/supp/indexed-loads-stores.adoc     |  35 ++++++
 doc/supp/rbr-arithmetic.adoc           | 156 +++++++++++++++++++++++++
 doc/{ => supp}/supplementary-info.adoc |  10 +-
 3 files changed, 197 insertions(+), 4 deletions(-)
 create mode 100644 doc/supp/indexed-loads-stores.adoc
 create mode 100644 doc/supp/rbr-arithmetic.adoc
 rename doc/{ => supp}/supplementary-info.adoc (52%)
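
Reviewer note: a hedged Python model of the input-prepped multiplication
loop from doc/supp/rbr-arithmetic.adoc. It assumes XLEN=32 and r=8, models
`MUL` and `MULHU` with plain integer arithmetic, and pairs `x'[i]` with
`y'[j]`. The helper names are hypothetical and not part of the patch.

```python
# Illustrative model only: XLEN, R, D and all function names are assumptions.
XLEN = 32
R = 8
D = XLEN - R
WMASK = (1 << XLEN) - 1


def mul(a, b):
    """Model of MUL: bits [XLEN-1:0] of the unsigned product."""
    return (a * b) & WMASK


def mulhu(a, b):
    """Model of MULHU: bits [2*XLEN-1:XLEN] of the unsigned product."""
    return ((a * b) >> XLEN) & WMASK


def rbr_value(x):
    """Value X = sum(2^(i*d) * x[i]) of a little-endian RBR vector."""
    return sum(w << (i * D) for i, w in enumerate(x))


def rbr_mul_prepped(x, y):
    """Multiply two SRBR vectors after shifting each input left by r/2 bits."""
    xs = [(w << (R // 2)) & WMASK for w in x]
    ys = [(w << (R // 2)) & WMASK for w in y]
    xy = [0] * (len(x) + len(y))
    for i, a in enumerate(xs):
        for j, b in enumerate(ys):
            # The product is shifted left by r bits, so the low word needs a
            # logical right shift by r and the high word is used directly.
            xy[i + j] += mul(a, b) >> R
            xy[i + j + 1] += mulhu(a, b)
    return xy


x = [0x123456, 0xABCDEF, 0x10000FE]   # last word is SRBR, not fully reduced
y = [0xFFFFFF, 0x000001, 0x800000]
xy = rbr_mul_prepped(x, y)
```

The result `xy` is again a redundant vector: its words exceed d bits but
still fit in XLEN bits for operands this short.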

diff --git a/doc/supp/indexed-loads-stores.adoc b/doc/supp/indexed-loads-stores.adoc
new file mode 100644
index 00000000..49e1502e
--- /dev/null
+++ b/doc/supp/indexed-loads-stores.adoc
@@ -0,0 +1,35 @@
+== On Indexed Loads and Stores
+
+The RISC-V architecture is a pure load-and-store architecture. To perform
+a an _indexed_ load or store, the index must generally be scaled and added
+to a base pointer, and only then a separate load operation can be performed.
+
+TIP:    Try to avoid table lookups with secret-derived indexes.
+
+One prominent use of indexed loads in cryptography is to implement S-Boxes
+found in various secret-key algorithms. Secret-dependant loads (table lookups)
+are problematic in cryptography because of their high potential for cache
+timing attacks. We hope that our instruction extensions remove this problem
+for AES and other relevant cryptographic algorithms; they often allow
+lookup-free implementation without resorting to "bit slicing" the entire
+algorithm.
+
+TIP:    In RISC-V pointers are generally used for loop control.
+
+For (arithmetic) loops the lack of indexing becomes less of a problem when
+one understands that RISC-V also makes no distinction between index and
+address registers and hence one can use pointers as indexes and loop
+counters. This is a transformation that a compiler usually does
+automatically; when your code is iterating through some vector `v[i]`
+with `i=0..n` there is no trace of an "index" in the compiled code.
+A single register is both a pointer and index and iterates through
+`*v++` until it reaches a pre-computed v + n point.
+
+TIP:    Offset loads and stores are "free".
+
+All loads and stores in RISC-V have a 12-bit signed byte offset built into
+the instruction word itself. The offset loads and stores allow an
+implementor to often easily unroll a few steps to reduce the relative
+time spent on loop control and to potentially parallelize the operation
+in superscalar architectures without relying on branch prediction.
+
diff --git a/doc/supp/rbr-arithmetic.adoc b/doc/supp/rbr-arithmetic.adoc
new file mode 100644
index 00000000..24520aae
--- /dev/null
+++ b/doc/supp/rbr-arithmetic.adoc
@@ -0,0 +1,156 @@
+==  Notes on RISC-V Cryptographic Arithmetic
+
+Implementors of cryptographic large-integer arithmetic on RISC-V are
+initially faced with the biggest single issue that differentiates this architecture
+from many others; Lack of carry bits and overflow detection and lack of
+indexed load and store. This section discusses typical implementation
+techniques used for constant-time implementation of cryptographic
+large-integer arithmetic.
+
+=== Redundant Binary Representation
+
+A natural RISC-V approach is to use https://en.wikipedia.org/wiki/Redundant_binary_representation[Redundant Binary Representation] (RBR)
+for cryptographic big-integer arithmetic.
+
+Each XLEN-wide word carries d significant bits and r=XLEN-d
+additional redundancy bits. The numerical value X represented by
+little-endian vector of n words `x[n]` is therefore:
+
+[asciimath]
+++++
+X = sum_(i=0)^(n-1) 2^(i*d) * x[i]
+++++
+
+This representation is redundant (not unique) since each `x[i]` may still
+have numerical values up to 2^XLEN^-1 if unsigned. Also note
+that sometimes it is preferable to use signed `x[i]`.
+
+RBR algorithms are often used even when carry is available, since it (a) allows effective parallelization (even in SIMD and Vector architectures)
+and (b) allows easier implementation of constant-time arithmetic as there
+are no variable-length carry chains. Constant-time implementation is
+very important in cryptographic applications.
+
+
+==== Usage of RBR
+
+It is easy to see that addition and subtraction become fully parallel vector operations up to saturation; when implementing a sequence of arithmetic operations one can analyze where an overflow becomes possible and carry reduction is potentially required. Fortunately carry reduction can also be usually parallelized.
+
+For the convenience of serialization and deserialization, we often choose redundancy of r = 8 bits, leaving d=24 or d=56 non-redundant bits for each word.
+
+One should try to complete a larger cryptographic operation such as elliptic curve scalar multiplication or RSA exponentiation entirely in the RBR domain, apart from quantities that benefit from canonical or other representation -- such as exponents.
+
+Often these algorithms require additional representation tricks such as Montgomery form (to avoid modular remaindering) or Projective, Jacobian,... coordinates (to avoid division with Elliptic Curves). Most of such techniques apply to RBR equally well as they do to non-redundant representations.
+
+
+==== Parallel carry and Semi-Redundant Binary Representation SRBR
+
+There are two kinds of carry-reduction operations, one which is
+parallelizable and another which is not.
+
+In the following I'll use DMASK to denote the bit mask 2^d^-1
+that cuts a number to `d` bits, e.g. `0x00FFFFFF` or `0x00FFFFFFFFFFFFFF`.
+
+In parallel carry  we simultaneously replace all `x[i]` with `x'[i]`:
+
+----
+    x'[0] = x[0] & DMASK
+    for all i, 0 < i < n:
+        x'[i] = (x[i] & DMASK) + (x[i-1] >> d)
+    end
+----
+
+
+In unsigned case this puts each word `w=x[i]` in range
+asciimath:[0 le w < 2^d + 2^r], with bits `w[XLEN-1:d+1]=0`
+and a relatively small probability that bit `w[d]` is nonzero.
+Vectors satisfying this condition are considered to have
+Semi-Redundant Binary Representation (SRBR). In vector format such
+a semi-reduction involves only pairs of vector elements.
+
+
+====  Full carry and Non-Redundant Binary Representation (NRBR)
+
+Full carry is usually only required for serialization and numeric
+comparisons. In non-redundant NRBR form, each word `w=x[i]` is in range
+asciimath:[0 le w < 2^d].
+Note that the r redundancy bits are still there but they're zeroes;
+`w[XLEN-1:d] = 0`.
+For negative numbers a convention may be adopted where the highest-order
+word has redundancy -1.
+
+NRBR reduction can be implemented as a loop that proceeds word-by-word
+from least significant towards more significant words:
+
+----
+    c = 0  (or carry-in)
+    for i = 0, 1, .. n-1 in sequence:
+        c = c + x[i]
+        x'[i] = c & DMASK
+        c = c >> d
+    end
+----
+
+Serialization to some fully canonical little- or big-endian wire formatting
+is an application matter and not discussed here.
+
+
+====  (Parallel) Multiplication and Input Prepping
+
+When multiplying two RBR numbers `x[0..n-1]` and `y[0..m-1]` the
+product `xy[0..n + m -1]` can be formed by starting with
+`xy[0..n+m-1] = 0` and computing (in parallel or in any order!) the sums:
+
+----
+    for all (i,j), 0 <= i < n, 0 <= j < m:
+        t = x[i] * y[i]
+        xy[i + j] = xy[i + j] + (t & DMASK)
+        xy[i + j + 1] = xy[i + j + 1] + (t >> d)
+    end
+----
+
+We see that the high r bits `t[2 * XLEN-1 : r + 2 * d]` of the sub-product
+are discarded -- this style of implementation assumes that the
+inputs are SRBR (or similar) rather than general RBR. An alternative
+approach would be to apply DMASK to `xy[i + j + 1]` too, and
+add `(t >> (2*d))` to `xy[i + j + 2]`. However, it would seem to be
+always easier to SBRB-reduce inputs first. As a general strategy any easy
+O(n) or one-step parallel input prepping is worthwhile since the main body
+of multiplication is superlinear, often up to O(n^2^).
+
+The first addition requires bits `t[d-1 : 0]` of the product and the
+second addition requires `t[r + 2*d - 1 : d]`. The standard RISC-V
+instructions MUL and `MULH[[S]]U]` return bits `t[XLEN-1:0]`
+and `t[2*XLEN-1:XLEN]` of the product, which would seem not necessitate
+a couple of additional few shifts and an XOR for each step.
+
+An easy input-prep trick is to left-shift both (SRBR-format) inputs left
+by r/2 bits before starting the operation (typically 4 positioins).
+As result the product is shifted left  by r bits and hence `MULH[[S]U]`
+directly returns the desired value and the lower word needs to be
+shifted right by r steps logically (no masking):
+
+----
+    for all i, 0 <= i < n:
+        x'[i] = x[i] << (r / 2)
+    end
+
+    for all j, 0 <= j < m:
+        y'[j] = y[j] << (r / 2)
+    end
+
+    for all (i,j) with 0 <= i < n, 0 <= j < m:
+        t = x'[i] * y'[j]
+        xy[i + j] = xy[i + j] + (t[XLEN-1:0] >> r)
+        xy[i + j + 1] = xy[i + j + 1] + t[2*XLEN-1:XLEN]
+    end
+----
+
+For RV32 the choice of unsigned redundancy r=8 allows multiplication of
+24*256 = 6144-bit numbers (12288-bit product) without a carry reduction
+step, and 56*256 = 14336 for RV64, which is sufficient for most current
+cryptographic applications. However, one may easily introduce intermediate
+reduction steps. One may also use a signed representation, which makes
+https://en.wikipedia.org/wiki/Karatsuba_algorithm[Karatsuba]-style
+multiplication formulas easier to implement, asymptotically reducing
+the overall number of multiplication instructions for very large numbers.
+
diff --git a/doc/supplementary-info.adoc b/doc/supp/supplementary-info.adoc
similarity index 52%
rename from doc/supplementary-info.adoc
rename to doc/supp/supplementary-info.adoc
index 8e306429..caf17a86 100644
--- a/doc/supplementary-info.adoc
+++ b/doc/supp/supplementary-info.adoc
@@ -16,10 +16,12 @@ _Work In Progress_
 
 This document contains additional information for engineers implementing
 or using the RISC-V cryptography extension.
-None of this document is a specification _requirement_.
-Rather, it represents recommendations or guides which the designers
-of the extension have created in the course of developing the
-specification.
 
+NOTE:   The following discussion is not a _requirement_. We are simply
+providing supplementary information that is intended to be helpful for
+implementation and optimization of certain tasks, and provides partial
+rationale for certain ISA features.
 
+include::rbr-arithmetic.adoc[]
+include::indexed-loads-stores.adoc[]
 

From fe943ed1c6745310f36b790a7841e2b33ee915f3 Mon Sep 17 00:00:00 2001
From: "Markku-Juhani O. Saarinen" <mjos@iki.fi>
Date: Sat, 9 May 2020 18:29:13 +0100
Subject: [PATCH 2/4] asciidoc moved out of way to doc/supp

---
 doc/README.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/doc/README.md b/doc/README.md
index 24ec9633..2bceb144 100644
--- a/doc/README.md
+++ b/doc/README.md
@@ -22,8 +22,8 @@ This directory contains two types of information:
   page.
 
 
-- [Supplementary information](supplementary-info.adoc),
-  in ascii doc form.
+- [Supplementary information](supp/supplementary-info.adoc),
+  in [AsciiDoc](https://asciidoctor.org/) format.
   This contains various recommendations, discussions and design
   rationale which we have developed in conjunction to the specification.
 

From d88af100c4e5eda7360b3535e4475f8b1cdfa197 Mon Sep 17 00:00:00 2001
From: "Markku-Juhani O. Saarinen" <mjos@iki.fi>
Date: Sun, 10 May 2020 22:39:37 +0100
Subject: [PATCH 3/4] cleanup, add gcm

---
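Reviewer note (ignored by `git am`): a small Python sketch showing how the
"Clock Counts" table in doc/supp/gcm-mode-cmul.adoc appears to follow from
the "Instruction Counts" table, assuming unit cost for `GREV`, `XOR` and
shifts and a cost of `mul_cost` units for each `CLMUL` or `CLMULH`. The
tuple layout and function names are assumptions made for this sketch.

```python
# Rows copied from the "Instruction Counts" table:
# (arch, karatsuba, reduce, grev, xor, shift, clmul, clmulh)
COUNTS = [
    ("RV32B", False, "mul",   4, 36,  0, 20, 20),
    ("RV32B", False, "shift", 4, 56, 24, 16, 16),
    ("RV32B", True,  "mul",   4, 52,  0, 13, 13),
    ("RV32B", True,  "shift", 4, 72, 24,  9,  9),
    ("RV64B", False, "mul",   2, 10,  0,  6,  6),
    ("RV64B", False, "shift", 2, 20, 12,  4,  4),
    ("RV64B", True,  "mul",   2, 14,  0,  5,  5),
    ("RV64B", True,  "shift", 2, 24, 12,  3,  3),
]


def clock_count(row, mul_cost):
    """Unit cost for GREV/XOR/shift plus mul_cost per carry-less multiply."""
    _, _, _, grev, xor, shift, clmul, clmulh = row
    return grev + xor + shift + mul_cost * (clmul + clmulh)


def best(arch, mul_cost):
    """Cheapest (karatsuba, reduce) choice for an architecture; first row wins ties."""
    rows = [r for r in COUNTS if r[0] == arch]
    row = min(rows, key=lambda r: clock_count(r, mul_cost))
    return row[1], row[2]


table = [[clock_count(r, k) for k in (1, 2, 3, 6)] for r in COUNTS]
```

Each row of `table` corresponds to one row of the "Clock Counts" table, with
columns for a multiply cost of 1, 2, 3 and 6 units.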
 doc/supp/gcm-mode-cmul.adoc        | 105 +++++++++++++++++++++++++++++
 doc/supp/indexed-loads-stores.adoc |   9 +--
 doc/supp/rbr-arithmetic.adoc       |  77 +++++++++++----------
 doc/supp/supplementary-info.adoc   |   5 +-
 4 files changed, 152 insertions(+), 44 deletions(-)
 create mode 100644 doc/supp/gcm-mode-cmul.adoc
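
Reviewer note: a hedged Python model of the arithmetic described in
doc/supp/gcm-mode-cmul.adoc: a carry-less multiply, a Karatsuba 128x128-bit
product built from three 64x64-bit products (the RV64 case), and shift
reduction modulo R(x) = x^128 + x^7 + x^2 + x + 1. It assumes the plain
convention that bit i is the coefficient of x^i, so the GCM bit reversal
done with `GREV` is not modelled. The function names are hypothetical, and
the bit-serial `clmul` here is a functional model, not constant-time code.

```python
M64 = (1 << 64) - 1
M128 = (1 << 128) - 1


def clmul(a, b):
    """Carry-less product of two non-negative integers (polynomials over GF(2))."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        b >>= 1
    return r


def clmul128_karatsuba(a, b):
    """255-bit carry-less product of two 128-bit values using three 64-bit products."""
    a0, a1 = a & M64, a >> 64
    b0, b1 = b & M64, b >> 64
    lo = clmul(a0, b0)
    hi = clmul(a1, b1)
    mid = clmul(a0 ^ a1, b0 ^ b1) ^ lo ^ hi   # equals a0*b1 + a1*b0
    return lo ^ (mid << 64) ^ (hi << 128)


def reduce_shift(p):
    """Reduce a product of up to 255 bits modulo R(x) with shifts and XORs.

    Uses x^128 = x^7 + x^2 + x + 1 (mod R). The first pass can spill a few
    bits above bit 127 again, so a second pass finishes the reduction.
    """
    for _ in range(2):
        h = p >> 128
        p = (p & M128) ^ h ^ (h << 1) ^ (h << 2) ^ (h << 7)
    return p


def gf128_mul(a, b):
    """Field multiplication in GF(2^128) under the assumptions stated above."""
    return reduce_shift(clmul128_karatsuba(a, b))


a = 0x0123456789ABCDEFFEDCBA9876543210
b = 0xDEADBEEFCAFEBABE0F1E2D3C4B5A6978
c = gf128_mul(a, b)
```

The three `clmul` calls in `clmul128_karatsuba` correspond to the three
`CLMUL`/`CLMULH` pairs counted for RV64 with Karatsuba.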

diff --git a/doc/supp/gcm-mode-cmul.adoc b/doc/supp/gcm-mode-cmul.adoc
new file mode 100644
index 00000000..b12cc685
--- /dev/null
+++ b/doc/supp/gcm-mode-cmul.adoc
@@ -0,0 +1,105 @@
+==  Galois/Counter Mode (GCM) with Carryless Multiply
+
+The Galois/Counter Mode (GCM) specified in
+https://doi.org/10.6028/NIST.SP.800-38D[NIST SP 800-38D] is a prominent
+Authenticated Encryption with Associated Data (AEAD) mechanism. It is
+the only cipher mode mandated as *MUST* for all
+https://www.rfc-editor.org/rfc/rfc8446.html[TLS 1.3] implementations
+and also frequently used in government and military applications.
+
+Here we'll briefly discuss implementation aspects of AES-GCM using the
+https://github.com/riscv/riscv-bitmanip[bitmanip] extension `B`.
+The instructions relevant to GCM are the Carry-Less Multiply instructions
+`CLMUL[H][W]` and also the Generalized Reverse `GREV[W]`.
+The `[W]` suffix indicates a 32-bit word size variant on RV64.
+Where possible, `CLMULH` should be immediately followed by `CLMUL`,
+as is done with `MULH`/`MUL`, although there is less of a performance
+advantage in this case.
+
+=== Finite Field Arithmetic in GF(2^128^)
+
+While message confidentiality in GCM is provided by a block cipher (AES)
+in counter mode (a CTR variant), authentication is based on GHASH, a
+universal hash defined over the binary field GF(2^128^).
+Without custom instruction support GCM, just like AES itself, is either
+very slow or susceptible to cache timing attacks.
+
+Whether authenticating ciphertext or associated data, the main
+operation of GCM is the GHASH multiplication between a block of
+authentication data and a secret generator `H`. The addition in the
+field is trivial: an XOR of two 128-bit blocks, which takes four XOR
+instructions on RV32 or two on RV64.
+
+The finite field is defined to be the ring of binary polynomials modulo
+the primitive pentanomial
+latexmath:[$R(x) = x^{128} + x^7 + x^2 + x + 1.$]
+The field encoding is slightly unusual, with the multiplicative identity
+(i.e. one -- "1") being encoded as a byte sequence `0x80, 0x00, .., 0x00`.
+Converting to little-endian encoding involves reversing the bit order in each byte;
+the `GREV[W]` instruction with constant 7 (pseudo-instruction `rev.b`)
+accomplishes this.
+
+The multiplication itself can be asymptotically sped up with the Karatsuba
+method, which works even better in binary fields than it does with integers.
+This reduces the number of `CLMUL`/`CLMULH` pairs on RV64 from 4 to 3 and
+on RV32 from 16 to 9, at the cost of additional XORs.
+
+
+=== Reduction via Shifts or with Multiplication
+
+The second arithmetic step to consider is the polynomial reduction of the
+255-bit ring product down to 128 bits (the field) again. The best way of
+doing reduction depends on _how fast_ the carry-less multiplication
+instructions `CLMUL[H][W]` are in relation to shifts and XORs.
+
+We consider two alternative reduction methods:
+
+1. **Shift reduction**: Based on the low Hamming weight of the
+polynomial R.
+2. **Multiplication reduction**: Analogous to Montgomery and Barrett
+methods -- albeit simpler because we're working in characteristic 2.
+
+
+=== Determining the Fastest Method
+
+Examining the multiplication implementations in microbenchmarks
+we obtain the following instruction counts:
+
+[cols="1,1,1,1,1,1,1,1", options="header"]
+.Instruction Counts
+|===
+| **Arch** | **Karatsuba**  | **Reduce**    | `GREV` | `XOR` | `S[L/R]L` | `CLMUL` | `CLMULH`
+| RV32B |   no  |   mul |   4   |   36  |   0   |   20  |   20
+| RV32B |   no  | shift |   4   |   56  |   24  |   16  |   16
+| RV32B |   yes |   mul |   4   |   52  |   0   |   13  |   13
+| RV32B |   yes | shift |   4   |   72  |   24  |   9   |   9
+| RV64B |   no  |   mul |   2   |   10  |   0   |   6   |   6
+| RV64B |   no  | shift |   2   |   20  |   12  |   4   |   4
+| RV64B |   yes |   mul |   2   |   14  |   0   |   5   |   5
+| RV64B |   yes | shift |   2   |   24  |   12  |   3   |   3
+|===
+
+
+We can see that the best selection of algorithms depends on the relative
+cost of multiplication. Assuming that other instructions have unit cost
+and each carry-less multiply instruction costs MUL units, and ignoring loops etc,
+we have:
+
+[cols="1,1,1,1,1,1,1", options="header"]
+.Clock Counts
+|===
+| **Arch** | **Karatsuba**  | **Reduce**    | **MUL=1** | **MUL=2** | **MUL=3** | **MUL=6**
+| RV32B |   no  |   mul | **80**    |   120     |   160     | 280
+| RV32B |   no  | shift |   116     |   148     |   180     | 276
+| RV32B |   yes |   mul |   82      |   **108** | **134**   | 212
+| RV32B |   yes | shift |   118     |   136     |   154     | **208**
+| RV64B |   no  |   mul | **24**    |   **36**  |   48      | 84
+| RV64B |   no  | shift |   42      |   50      |   58      | 82
+| RV64B |   yes |   mul |   26      |   **36**  | **46**    | 76
+| RV64B |   yes | shift |   44      |   50      |   56      | **74**
+|===
+
+We see that if `CLMUL[H][W]` takes twice the time of XOR and shifts,
+or more, then Karatsuba is worthwhile. If these multiplication instructions
+are six times slower, or more, then it is worthwhile to convert the reduction multiplications to shifts and XORs.
+
diff --git a/doc/supp/indexed-loads-stores.adoc b/doc/supp/indexed-loads-stores.adoc
index 49e1502e..2e406c14 100644
--- a/doc/supp/indexed-loads-stores.adoc
+++ b/doc/supp/indexed-loads-stores.adoc
@@ -4,8 +4,7 @@ The RISC-V architecture is a pure load-and-store achitecture. The perform
 a an _indexed_ load or store, the index must generally be scaled and added
 to a base pointer, and only then a separate load operation can be performed.
 
-TIP:    Try to avoid table lookups with secret-derived indexes.
-
+**Try to avoid table lookups with secret-derived indexes.**
 One prominent use of indexed loads in cryptography is to implement S-Boxes
 found in various secret-key algorithms. Secret-dependant loads (table lookups)
 are problematic in cryptography because of their high potential for cache
@@ -14,8 +13,7 @@ for AES and other relevant cryptographic algorithms; they often allow
 lookup-free implementation without resorting to "bit slicing" the entire
 algorithm.
 
-TIP:    In RISC-V pointers are generally used for loop control.
-
+**In RISC-V pointers are generally used for loop control.**
 For (arithmetic) loops the lack of indexing becomes less of a problem when
 one understands that RISC-V also makes no distinction between index and
 address registers and hence one can use pointers as indexes and loop
@@ -25,8 +23,7 @@ with `i=0..n` there is no trace of an "index" in the compiled code.
 A single register is both a pointer and index and iterates through
 `*v++` until it reaches a pre-computed v + n point.
 
-TIP:    Offset loads and stores are "free".
-
+**Offset loads and stores are "free".**
 All loads and stores in RISC-V have a 12-bit signed byte offset built into
 the instruction word itself. The offset loads and stores allow an
 implementor to often easily unroll a few steps to reduce the relative
diff --git a/doc/supp/rbr-arithmetic.adoc b/doc/supp/rbr-arithmetic.adoc
index 24520aae..4a0bb253 100644
--- a/doc/supp/rbr-arithmetic.adoc
+++ b/doc/supp/rbr-arithmetic.adoc
@@ -1,13 +1,11 @@
 ==  Notes on RISC-V Cryptographic Arithmetic
 
 Implementors of cryptographic large-integer arithmetic on RISC-V are
-initially faced with the biggest single issue that differentiates this architecture
-from many others; Lack of carry bits and overflow detection and lack of
-indexed load and store. This section discusses typical implementation
-techniques used for constant-time implementation of cryptographic
-large-integer arithmetic.
+initially faced with the single biggest difference between this architecture and many others: the lack of carry bits and overflow detection.
+This section discusses typical implementation techniques used for
+constant-time implementation of large-integer arithmetic for cryptography.
 
-=== Redundant Binary Representation
+=== Redundant Binary Representation (RBR)
 
 A natural RISC-V approach is to use https://en.wikipedia.org/wiki/Redundant_binary_representation[Redundant Binary Representation] (RBR)
 for cryptographic big-integer arithmetic.
@@ -16,13 +14,13 @@ Each XLEN-wide word carries d significant bits and r=XLEN-d
 additional redundancy bits. The numerical value X represented by
 little-endian vector of n words `x[n]` is therefore:
 
-[asciimath]
+[latexmath]
 ++++
-X = sum_(i=0)^(n-1) 2^(i*d) * x[i]
+X = \sum_{i=0}^{n-1} 2^{id} x[i]
 ++++
 
-This representation is redundant (not unique) since each `x[i]` may still
-have numerical values up to 2^XLEN^-1 if unsigned. Also note
+This representation is redundant (not unique) since each word `x[i]` may
+still have numerical values up to 2^XLEN^-1 if unsigned. Also note
 that sometimes it is preferable to use signed `x[i]`.
 
 RBR algorithms are often used even when carry is available, since it (a) allows effective parallelization (even in SIMD and Vector architectures)
@@ -31,7 +29,7 @@ are no variable-length carry chains. Constant-time implementation is
 very important in cryptographic applications.
 
 
-==== Usage of RBR
+=== Usage of RBR
 
 It is easy to see that addition and subtraction become fully parallel vector operations up to saturation; when implementing a sequence of arithmetic operations one can analyze where an overflow becomes possible and carry reduction is potentially required. Fortunately carry reduction can also be usually parallelized.
 
@@ -42,12 +40,12 @@ One should try to complete a larger cryptographic operation such as elliptic cur
 Often these algorithms require additional representation tricks such as Montgomery form (to avoid modular remaindering) or Projective, Jacobian,... coordinates (to avoid division with Elliptic Curves). Most of such techniques apply to RBR equally well as they do to non-redundant representations.
 
 
-==== Parallel carry and Semi-Redundant Binary Representation SRBR
+=== Parallel carry and Semi-Redundant Binary Representation SRBR
 
 There are two kinds of carry-reduction operations, one which is
 parallelizable and another which is not.
 
-In the following I'll use DMASK to denote the bit mask 2^d^-1
+In the following I'll use `DMASK` to denote the bit mask 2^d^-1
 that cuts a number to `d` bits, e.g. `0x00FFFFFF` or `0x00FFFFFFFFFFFFFF`.
 
 In parallel carry  we simultaneously replace all `x[i]` with `x'[i]`:
@@ -61,28 +59,29 @@ In parallel carry  we simultaneously replace all `x[i]` with `x'[i]`:
 
 
 In unsigned case this puts each word `w=x[i]` in range
-asciimath:[0 le w < 2^d + 2^r], with bits `w[XLEN-1:d+1]=0`
+latexmath:[$0 \le w < 2^d + 2^r$], with bits `w[XLEN-1:d+1]=0`
 and a relatively small probability that bit `w[d]` is nonzero.
 Vectors satisfying this condition are considered to have
 Semi-Redundant Binary Representation (SRBR). In vector format such
 a semi-reduction involves only pairs of vector elements.
 
 
-====  Full carry and Non-Redundant Binary Representation (NRBR)
+=== Full carry and Non-Redundant Binary Representation (NRBR)
 
 Full carry is usually only required for serialization and numeric
 comparisons. In non-redundant NRBR form, each word `w=x[i]` is in range
-asciimath:[0 le w < 2^d].
+latexmath:[$0 \le w < 2^d$].
 Note that the r redundancy bits are still there but they're zeroes;
 `w[XLEN-1:d] = 0`.
-For negative numbers a convention may be adopted where the highest-order
-word has redundancy -1.
+For negative numbers an NRBR convention may be adopted where the
+highest-order word (only) has redundancy equivalent to -1 (all 1 bits):
+`w[XLEN-1:d] = 111..1`.
 
 NRBR reduction can be implemented as a loop that proceeds word-by-word
 from least significant towards more significant words:
 
 ----
-    c = 0  (or carry-in)
+    c = 0                               //  (or carry-in)
     for i = 0, 1, .. n-1 in sequence:
         c = c + x[i]
         x'[i] = c & DMASK
@@ -94,7 +93,7 @@ Serialization to some fully canonical little- or big-endian wire formatting
 is an application matter and not discussed here.
 
 
-====  (Parallel) Multiplication and Input Prepping
+=== (Parallel) Multiplication and Input Prepping
 
 When multiplying two RBR numbers `x[0..n-1]` and `y[0..m-1]` the
 product `xy[0..n + m -1]` can be formed by starting with
@@ -103,32 +102,36 @@ product `xy[0..n + m -1]` can be formed by starting with
 ----
     for all (i,j), 0 <= i < n, 0 <= j < m:
         t = x[i] * y[i]
-        xy[i + j] = xy[i + j] + (t & DMASK)
-        xy[i + j + 1] = xy[i + j + 1] + (t >> d)
+        k = i + j
+        xy[k] = xy[k] + (t & DMASK)         //  <1>
+        xy[k + 1] = xy[k + 1] + (t >> d)    //  <2>
     end
 ----
+<1> The first addition uses bits `t[d-1:0]` of the product.
+<2> Second addition uses bits `t[XLEN+d-1:d]` of the product.
 
-We see that the high r bits `t[2 * XLEN-1 : r + 2 * d]` of the sub-product
+The standard RISC-V instructions `MUL` and `MULH[[S]U]` return bits
+`t[XLEN-1:0]` and `t[2*XLEN-1:XLEN]` of the product, which would seem
+to necessitate a few additional shifts and an XOR for each
+step.
+
+An easy input-prep trick is to shift both (SRBR-format) inputs left
+by `r/2` bits before starting the operation (typically 4 positions).
+As a result the product is shifted left by `r` bits, so `MULH[[S]U]`
+directly returns the desired high part, and the lower word only needs to be
+right shifted by `r` bits.
+
+We see that the high `r` bits `t[2*XLEN-1:XLEN+d]` of the sub-product
 are discarded -- this style of implementation assumes that the
-inputs are SRBR (or similar) rather than general RBR. An alternative
-approach would be to apply DMASK to `xy[i + j + 1]` too, and
-add `(t >> (2*d))` to `xy[i + j + 2]`. However, it would seem to be
+inputs are SRBR (or similar) rather than general RBR.
+An alternative approach would be to apply `DMASK` to `xy[k + 1]` too, and
+add `(t >> (2*d))` to `xy[k + 2]`. However, it would seem to be
 always easier to SBRB-reduce inputs first. As a general strategy any easy
 O(n) or one-step parallel input prepping is worthwhile since the main body
 of multiplication is superlinear, often up to O(n^2^).
 
-The first addition requires bits `t[d-1 : 0]` of the product and the
-second addition requires `t[r + 2*d - 1 : d]`. The standard RISC-V
-instructions MUL and `MULH[[S]]U]` return bits `t[XLEN-1:0]`
-and `t[2*XLEN-1:XLEN]` of the product, which would seem not necessitate
-a couple of additional few shifts and an XOR for each step.
-
-An easy input-prep trick is to left-shift both (SRBR-format) inputs left
-by r/2 bits before starting the operation (typically 4 positioins).
-As result the product is shifted left  by r bits and hence `MULH[[S]U]`
-directly returns the desired value and the lower word needs to be
-shifted right by r steps logically (no masking):
 
+Together with input prepping we have:
 ----
     for all i, 0 <= i < n:
         x'[i] = x[i] << (r / 2)
diff --git a/doc/supp/supplementary-info.adoc b/doc/supp/supplementary-info.adoc
index caf17a86..aead0348 100644
--- a/doc/supp/supplementary-info.adoc
+++ b/doc/supp/supplementary-info.adoc
@@ -22,6 +22,9 @@ providing supplementary information that is intended to be helpful for
 implementation and optimization of certain tasks, and provides partial
 rationale for certain ISA features.
 
-include::rbr-arithmetic.adoc[]
 include::indexed-loads-stores.adoc[]
 
+include::rbr-arithmetic.adoc[]
+
+include::gcm-mode-cmul.adoc[]
+

From 1fb525311e93b39e51f7564380c5a9a9931bc79d Mon Sep 17 00:00:00 2001
From: "Markku-Juhani O. Saarinen" <mjos@iki.fi>
Date: Sun, 10 May 2020 22:48:32 +0100
Subject: [PATCH 4/4] section headers

---
 doc/supp/rbr-arithmetic.adoc | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/doc/supp/rbr-arithmetic.adoc b/doc/supp/rbr-arithmetic.adoc
index 4a0bb253..27f70f9c 100644
--- a/doc/supp/rbr-arithmetic.adoc
+++ b/doc/supp/rbr-arithmetic.adoc
@@ -29,7 +29,7 @@ are no variable-length carry chains. Constant-time implementation is
 very important in cryptographic applications.
 
 
-=== Usage of RBR
+=== Redundancy bits
 
 It is easy to see that addition and subtraction become fully parallel vector operations up to saturation; when implementing a sequence of arithmetic operations one can analyze where an overflow becomes possible and carry reduction is potentially required. Fortunately carry reduction can also be usually parallelized.
 
@@ -40,7 +40,7 @@ One should try to complete a larger cryptographic operation such as elliptic cur
 Often these algorithms require additional representation tricks such as Montgomery form (to avoid modular remaindering) or Projective, Jacobian,... coordinates (to avoid division with Elliptic Curves). Most of such techniques apply to RBR equally well as they do to non-redundant representations.
 
 
-=== Parallel carry and Semi-Redundant Binary Representation SRBR
+=== Parallel carry and Semi-Redundant Form (SRBR)
 
 There are two kinds of carry-reduction operations, one which is
 parallelizable and another which is not.
@@ -66,7 +66,7 @@ Semi-Redundant Binary Representation (SRBR). In vector format such
 a semi-reduction involves only pairs of vector elements.
 
 
-=== Full carry and Non-Redundant Binary Representation (NRBR)
+=== Full carry and Non-Redundant Form (NRBR)
 
 Full carry is usually only required for serialization and numeric
 comparisons. In non-redundant NRBR form, each word `w=x[i]` is in range