Commit

Merge pull request #11 from scarv/dev/supp
dev/supp -- supplementary docs
ben-marshall authored May 11, 2020
2 parents 4d1f200 + 1fb5253 commit 510353a
Showing 6 changed files with 328 additions and 27 deletions.
4 changes: 2 additions & 2 deletions doc/README.md
Original file line number Diff line number Diff line change
Expand Up @@ -22,8 +22,8 @@ This directory contains two types of information:
page.


- [Supplementary information](supplementary-info.adoc),
in ascii doc form.
- [Supplementary information](supp/supplementary-info.adoc),
in [AsciiDoc](https://asciidoctor.org/) format.
This contains various recommendations, discussions and design
rationale which we have developed in conjunction with the specification.

105 changes: 105 additions & 0 deletions doc/supp/gcm-mode-cmul.adoc
@@ -0,0 +1,105 @@
== Galois/Counter Mode (GCM) with Carryless Multiply

The Galois/Counter Mode (GCM) specified in
https://doi.org/10.6028/NIST.SP.800-38D[NIST SP 800-38D] is a prominent
Authenticated Encryption with Associated Data (AEAD) mechanism. It is
the only cipher mode mandated as *MUST* for all
https://www.rfc-editor.org/rfc/rfc8446.html[TLS 1.3] implementations
and is also frequently used in government and military applications.

Here we'll briefly discuss implementation aspects of AES-GCM using the
https://github.com/riscv/riscv-bitmanip[bitmanip] extension `B`.
The instructions relevant to GCM are the Carry-Less Multiply instructions
`CLMUL[H][W]` and the Generalized Reverse instruction `GREV[W]`.
The `[W]` suffix indicates a 32-bit word size variant on RV64.
Where possible, `CLMULH` should be immediately followed by the matching
`CLMUL`, as is done with `MULH`/`MUL`, although there is less of a
performance advantage in this case.

=== Finite Field Arithmetic in GF(2^128^)

While message confidentiality in GCM is provided by a block cipher (AES)
in counter mode (a CTR variant), authentication is based on a GHASH, a
universal hash defined over the binary field GF(2^128^).
Without custom instruction support, GCM, just like AES itself, is either
very slow or susceptible to cache timing attacks.

Whether authenticating ciphertext or associated data, the main
operation of GCM is the GHASH multiplication between a block of
authentication data and a secret generator `H`. The addition in the
field is trivial: just four XORs on RV32 or two on RV64.

The finite field is defined to be the ring of binary polynomials modulo
the primitive pentanomial
latexmath:[$R(x) = x^{128} + x^7 + x^2 + x + 1.$]
The field encoding is slightly unusual, with the multiplicative identity
(i.e. one -- "1") being encoded as the byte sequence `0x80, 0x00, .., 0x00`.
Converting to little-endian encoding involves reversing the bit order within
each byte; the `GREV[W]` instruction with constant 7 (pseudo-instruction
`rev.b`) accomplishes this.
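
As a rough illustration of this encoding, the following Python sketch models the effect of `GREV` with constant 7 on one 32-bit word, i.e. a bit reversal inside each byte. The function name `rev_bits_in_bytes` is hypothetical, and this is a software model of the behaviour rather than the instruction itself.

```python
def rev_bits_in_bytes(word: int, xlen: int = 32) -> int:
    """Software model of GREV with constant 7: reverse bits inside each byte."""
    out = 0
    for byte_index in range(xlen // 8):
        byte = (word >> (8 * byte_index)) & 0xFF
        reversed_byte = int(f"{byte:08b}"[::-1], 2)  # reverse the 8-bit string
        out |= reversed_byte << (8 * byte_index)
    return out


# The GCM encoding of "1" starts with the byte 0x80. Loaded as the low byte
# of a little-endian word, it becomes the integer 1 after the reversal.
ghash_one_first_word = int.from_bytes(bytes([0x80, 0x00, 0x00, 0x00]), "little")
```

Applying the function twice returns the original word, so the same operation converts in both directions.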

The multiplication itself can be asymptotically sped up with the Karatsuba
method, which works even better in binary fields than it does with integers.
This reduces the number of `CLMUL`/`CLMULH` pairs on RV64 from 4 to 3 and
on RV32 from 16 to 9, at the cost of additional XORs.
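
The RV64 case can be sketched in Python as follows. Here `clmul_full` is a hypothetical software stand-in for one `CLMUL`/`CLMULH` pair (it returns both halves as a single integer); the schoolbook version calls it four times and the Karatsuba version three times. This models only the arithmetic, not instruction scheduling.

```python
MASK64 = (1 << 64) - 1


def clmul_full(a: int, b: int) -> int:
    """Carry-less product of two 64-bit values (up to 127 bits).

    The low 64 bits model CLMUL and the high 64 bits model CLMULH.
    """
    result = 0
    for bit in range(64):
        if (b >> bit) & 1:
            result ^= a << bit
    return result


def clmul128_schoolbook(a: int, b: int) -> int:
    """128x128 carry-less multiply using four 64x64 products."""
    a0, a1 = a & MASK64, a >> 64
    b0, b1 = b & MASK64, b >> 64
    cross = clmul_full(a0, b1) ^ clmul_full(a1, b0)
    return clmul_full(a0, b0) ^ (cross << 64) ^ (clmul_full(a1, b1) << 128)


def clmul128_karatsuba(a: int, b: int) -> int:
    """128x128 carry-less multiply using three 64x64 products."""
    a0, a1 = a & MASK64, a >> 64
    b0, b1 = b & MASK64, b >> 64
    lo = clmul_full(a0, b0)
    hi = clmul_full(a1, b1)
    # (a0 + a1)(b0 + b1) + lo + hi leaves exactly the two cross terms,
    # because addition and subtraction are both XOR in characteristic 2.
    mid = clmul_full(a0 ^ a1, b0 ^ b1) ^ lo ^ hi
    return lo ^ (mid << 64) ^ (hi << 128)
```

The absence of carries is what makes Karatsuba simpler here than with integers: the sums `a0 ^ a1` and `b0 ^ b1` never exceed 64 bits.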


=== Reduction via Shifts or with Multiplication

The second arithmetic step to consider is the polynomial reduction of the
255-bit ring product down to 128 bits (the field) again. The best way of
doing reduction depends on _how fast_ the carry-less multiplication
instructions `CLMUL[H][W]` are in relation to shifts and XORs.

We consider two alternative reduction methods:

1. **Shift reduction**: Based on the low Hamming weight of the
polynomial R.
2. **Multiplication reduction**: Analogous to Montgomery and Barrett
methods -- albeit simpler because we're working in characteristic 2.
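
Both reduction methods can be sketched in Python on arbitrary-precision integers. The sketch assumes the product is held in natural bit order (bit `i` is the coefficient of `x^i`, i.e. after the bit reversal discussed earlier) and uses the hypothetical helper names `reduce_shift`, `reduce_mul` and `reduce_reference`. A real implementation would work on XLEN-bit words; this only models the mathematics.

```python
MASK128 = (1 << 128) - 1
R_LOW = 0x87  # x^7 + x^2 + x + 1, the low terms of R(x)


def clmul(a: int, b: int) -> int:
    """Carry-less (polynomial) multiplication of two non-negative integers."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        b >>= 1
    return result


def reduce_shift(p: int) -> int:
    """Shift reduction: fold the high part using x^128 = x^7 + x^2 + x + 1."""
    for _ in range(2):  # two folds suffice for a product of at most 255 bits
        hi, lo = p >> 128, p & MASK128
        p = lo ^ hi ^ (hi << 1) ^ (hi << 2) ^ (hi << 7)
    return p


def reduce_mul(p: int) -> int:
    """Multiplication reduction: fold the high part with a carry-less multiply."""
    for _ in range(2):
        hi, lo = p >> 128, p & MASK128
        p = lo ^ clmul(hi, R_LOW)
    return p


def reduce_reference(p: int) -> int:
    """Bit-by-bit polynomial long division by R(x), used only as a cross-check."""
    r_full = (1 << 128) | R_LOW
    for bit in range(p.bit_length() - 1, 127, -1):
        if (p >> bit) & 1:
            p ^= r_full << (bit - 128)
    return p
```

The two fast variants perform the same fold; they differ only in whether the multiplication by the four-term constant is spelled out as shifts and XORs or handed to the multiplier.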


=== Determining the Fastest Method

Examining the multiplication implementations in microbenchmarks,
we obtain the following instruction counts:

[cols="1,1,1,1,1,1,1,1", options="header"]
.Instruction Counts
|===
| **Arch** | **Karatsuba** | **Reduce** | `GREV` | `XOR` | `S[L/R]L` | `CLMUL` | `CLMULH`
| RV32B | no | mul | 4 | 36 | 0 | 20 | 20
| RV32B | no | shift | 4 | 56 | 24 | 16 | 16
| RV32B | yes | mul | 4 | 52 | 0 | 13 | 13
| RV32B | yes | shift | 4 | 72 | 24 | 9 | 9
| RV64B | no | mul | 2 | 10 | 0 | 6 | 6
| RV64B | no | shift | 2 | 20 | 12 | 4 | 4
| RV64B | yes | mul | 2 | 14 | 0 | 5 | 5
| RV64B | yes | shift | 2 | 24 | 12 | 3 | 3
|===


We can see that the best selection of algorithms depends on the relative
cost of multiplication. Assuming that all other instructions have unit cost
and each multiply instruction costs a fixed multiple of it (the `MUL=`
columns), and ignoring loop overhead, we have:

[cols="1,1,1,1,1,1,1", options="header"]
.Clock Counts
|===
| **Arch** | **Karatsuba** | **Reduce** | **MUL=1** | **MUL=2** | **MUL=3** | **MUL=6**
| RV32B | no | mul | **80** | 120 | 160 | 280
| RV32B | no | shift | 116 | 148 | 180 | 276
| RV32B | yes | mul | 82 | **108** | **134** | 212
| RV32B | yes | shift | 118 | 136 | 154 | **208**
| RV64B | no | mul | **24** | **36** | 48 | 84
| RV64B | no | shift | 42 | 50 | 58 | 82
| RV64B | yes | mul | 26 | **36** | **46** | 76
| RV64B | yes | shift | 44 | 50 | 56 | **74**
|===

We see that if `CLMUL[H][W]` takes twice the time of XOR and shifts,
or more, then Karatsuba is worthwhile (on RV64 the two variants tie at a
factor of two). If these multiplication instructions are six times slower,
or more, then it is worthwhile to convert the reduction multiplications to
shifts and XORs.
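
The clock counts follow directly from the instruction counts under the stated cost model. The short Python sketch below recomputes them; the dictionary simply transcribes the instruction count table, and the names `clock_count` and `best_variant` are hypothetical.

```python
# (arch, karatsuba, reduce) -> (GREV, XOR, shifts, CLMUL, CLMULH)
INSTRUCTION_COUNTS = {
    ("RV32B", "no", "mul"): (4, 36, 0, 20, 20),
    ("RV32B", "no", "shift"): (4, 56, 24, 16, 16),
    ("RV32B", "yes", "mul"): (4, 52, 0, 13, 13),
    ("RV32B", "yes", "shift"): (4, 72, 24, 9, 9),
    ("RV64B", "no", "mul"): (2, 10, 0, 6, 6),
    ("RV64B", "no", "shift"): (2, 20, 12, 4, 4),
    ("RV64B", "yes", "mul"): (2, 14, 0, 5, 5),
    ("RV64B", "yes", "shift"): (2, 24, 12, 3, 3),
}


def clock_count(counts, mul_cost):
    """Unit cost for GREV/XOR/shift, mul_cost for each CLMUL and CLMULH."""
    grev, xor, shifts, clmul, clmulh = counts
    return grev + xor + shifts + mul_cost * (clmul + clmulh)


def best_variant(arch, mul_cost):
    """Return the (karatsuba, reduce) choices with the lowest clock count."""
    costs = {
        key[1:]: clock_count(counts, mul_cost)
        for key, counts in INSTRUCTION_COUNTS.items()
        if key[0] == arch
    }
    lowest = min(costs.values())
    return sorted(variant for variant, cost in costs.items() if cost == lowest)
```

For example, `best_variant("RV32B", 6)` selects Karatsuba with shift reduction, matching the bold entry in the last column.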

32 changes: 32 additions & 0 deletions doc/supp/indexed-loads-stores.adoc
@@ -0,0 +1,32 @@
== On Indexed Loads and Stores

The RISC-V architecture is a pure load-store architecture. To perform
an _indexed_ load or store, the index must generally be scaled and added
to a base pointer, and only then can a separate load or store operation be
performed.

**Try to avoid table lookups with secret-derived indexes.**
One prominent use of indexed loads in cryptography is to implement S-Boxes
found in various secret-key algorithms. Secret-dependent loads (table lookups)
are problematic in cryptography because of their high potential for cache
timing attacks. We hope that our instruction extensions remove this problem
for AES and other relevant cryptographic algorithms; they often allow
lookup-free implementation without resorting to "bit slicing" the entire
algorithm.

**In RISC-V pointers are generally used for loop control.**
For (arithmetic) loops the lack of indexing becomes less of a problem when
one understands that RISC-V also makes no distinction between index and
address registers and hence one can use pointers as indexes and loop
counters. This is a transformation that a compiler usually does
automatically; when your code is iterating through some vector `v[i]`
with `i=0..n-1` there is no trace of an "index" in the compiled code.
A single register is both a pointer and index and iterates through
`*v++` until it reaches a pre-computed end point `v + n`.

**Offset loads and stores are "free".**
All loads and stores in RISC-V have a 12-bit signed byte offset built into
the instruction word itself. Offset loads and stores often allow an
implementor to unroll a few steps easily, which reduces the relative
time spent on loop control and can parallelize the operation
on superscalar architectures without relying on branch prediction.

159 changes: 159 additions & 0 deletions doc/supp/rbr-arithmetic.adoc
@@ -0,0 +1,159 @@
== Notes on RISC-V Cryptographic Arithmetic

Implementors of cryptographic large-integer arithmetic on RISC-V are
immediately faced with the biggest single issue that differentiates this
architecture from many others: the lack of carry bits and overflow detection.
This section discusses typical implementation techniques used for
constant-time implementation of large-integer arithmetic for cryptography.

=== Redundant Binary Representation (RBR)

A natural RISC-V approach is to use https://en.wikipedia.org/wiki/Redundant_binary_representation[Redundant Binary Representation] (RBR)
for cryptographic big-integer arithmetic.

Each XLEN-wide word carries d significant bits and r=XLEN-d
additional redundancy bits. The numerical value X represented by
a little-endian vector of n words `x[n]` is therefore:

[latexmath]
++++
X = \sum_{i=0}^{n-1} 2^{id} x[i]
++++

This representation is redundant (not unique) since each word `x[i]` may
still have numerical values up to 2^XLEN^-1 if unsigned. Also note
that sometimes it is preferable to use signed `x[i]`.
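
To make the redundancy concrete, here is a minimal Python sketch of the value formula. The helper name `rbr_value` and the example vectors are illustrative assumptions; the example uses d=24, as on RV32 with r=8.

```python
def rbr_value(words, d):
    """Numerical value X = sum(2^(i*d) * x[i]) of a little-endian RBR vector."""
    return sum(word << (i * d) for i, word in enumerate(words))


D = 24  # significant bits per word (RV32 with r = 8 redundancy bits)

# Two different word vectors that represent the same number, 1 + 2^24:
canonical = [0x000001, 0x000001]      # no redundancy bits in use
redundant = [0x000001 + (1 << D), 0]  # the carry is still held in word 0
```

Both vectors evaluate to the same integer under `rbr_value`, which is exactly what "redundant (not unique)" means here.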

RBR algorithms are often used even when carry is available, since they
(a) allow effective parallelization (even in SIMD and Vector architectures)
and (b) allow easier implementation of constant-time arithmetic as there
are no variable-length carry chains. Constant-time implementation is
very important in cryptographic applications.


=== Redundancy bits

It is easy to see that addition and subtraction become fully parallel vector operations as long as the redundancy bits do not overflow. When implementing a sequence of arithmetic operations, one can analyze where an overflow becomes possible and a carry reduction is potentially required. Fortunately, carry reduction can usually be parallelized as well.

For the convenience of serialization and deserialization, we often choose redundancy of r = 8 bits, leaving d=24 or d=56 non-redundant bits for each word.

One should try to complete a larger cryptographic operation such as elliptic curve scalar multiplication or RSA exponentiation entirely in the RBR domain, apart from quantities that benefit from canonical or other representation -- such as exponents.

Often these algorithms require additional representation tricks such as Montgomery form (to avoid modular remaindering) or projective or Jacobian coordinates (to avoid division with elliptic curves). Most such techniques apply to RBR as well as they do to non-redundant representations.


=== Parallel carry and Semi-Redundant Form (SRBR)

There are two kinds of carry-reduction operations, one which is
parallelizable and another which is not.

In the following we use `DMASK` to denote the bit mask 2^d^-1
that cuts a number to `d` bits, e.g. `0x00FFFFFF` or `0x00FFFFFFFFFFFFFF`.

In parallel carry we simultaneously replace all `x[i]` with `x'[i]`:

----
x'[0] = x[0] & DMASK
for all i, 0 < i < n:
    x'[i] = (x[i] & DMASK) + (x[i-1] >> d)
end
----


In the unsigned case this puts each word `w=x[i]` in range
latexmath:[$0 \le w < 2^d + 2^r$], with bits `w[XLEN-1:d+1]=0`
and a relatively small probability that bit `w[d]` is nonzero.
Vectors satisfying this condition are considered to have
Semi-Redundant Binary Representation (SRBR). In vector format such
a semi-reduction involves only pairs of vector elements.
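
The parallel carry step can be sketched in Python as below. One detail is an assumption of this sketch rather than part of the pseudocode: the carry out of the top word is appended as an extra word so that the numerical value is preserved. The name `parallel_carry` is hypothetical.

```python
def parallel_carry(x, d):
    """One parallel carry step on an unsigned little-endian RBR vector.

    Every output word depends only on x[i] and x[i-1], so all words can be
    computed independently of each other (for example in vector lanes).
    """
    dmask = (1 << d) - 1
    out = [x[0] & dmask]
    out += [(x[i] & dmask) + (x[i - 1] >> d) for i in range(1, len(x))]
    # Assumption of this sketch: keep the top carry as an additional word.
    out.append(x[-1] >> d)
    return out
```

With XLEN=32 and d=24, the input `[0xFFFFFFFF, 0xFFFFFFFF]` gives words no larger than `0xFFFFFF + 0xFF`, which is inside the SRBR range stated above.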


=== Full carry and Non-Redundant Form (NRBR)

Full carry is usually only required for serialization and numeric
comparisons. In non-redundant NRBR form, each word `w=x[i]` is in range
latexmath:[$0 \le w < 2^d$].
Note that the r redundancy bits are still there but they're zeroes;
`w[XLEN-1:d] = 0`.
For negative numbers an NRBR convention may be adopted where the
highest-order word (only) has redundancy equivalent to -1 (all 1 bits):
`w[XLEN-1:d] = 111..1`.

NRBR reduction can be implemented as a loop that proceeds word-by-word
from least significant towards more significant words:

----
c = 0                       // (or carry-in)
for i = 0, 1, .. n-1 in sequence:
    c     = c + x[i]
    x'[i] = c & DMASK
    c     = c >> d
end
----
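
A direct Python transcription of this loop is given below as a sketch. The function name `full_carry` and the choice to return the final carry alongside the words are assumptions made for illustration.

```python
def full_carry(x, d, carry_in=0):
    """Sequential full carry: returns (NRBR words, carry-out).

    Each step depends on the carry from the previous word, so unlike the
    parallel carry this loop cannot be vectorized across words.
    """
    dmask = (1 << d) - 1
    c = carry_in
    out = []
    for word in x:
        c += word
        out.append(c & dmask)  # low d bits stay in this word
        c >>= d                # the rest moves on to the next word
    return out, c
```

The loop always runs over all `n` words regardless of the data, which keeps the operation constant-time with respect to the values involved.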

Serialization to some fully canonical little- or big-endian wire formatting
is an application matter and not discussed here.


=== (Parallel) Multiplication and Input Prepping

When multiplying two RBR numbers `x[0..n-1]` and `y[0..m-1]` the
product `xy[0..n + m -1]` can be formed by starting with
`xy[0..n+m-1] = 0` and computing (in parallel or in any order!) the sums:

----
for all (i,j), 0 <= i < n, 0 <= j < m:
    t = x[i] * y[j]
    k = i + j
    xy[k]     = xy[k] + (t & DMASK)     // <1>
    xy[k + 1] = xy[k + 1] + (t >> d)    // <2>
end
----
<1> The first addition uses bits `t[d-1:0]` of the product.
<2> Second addition uses bits `t[XLEN+d-1:d]` of the product.
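
The accumulation scheme can be checked with a small Python model. The names `rbr_value` and `rbr_mul` are hypothetical, and Python integers are unbounded, so this sketch does not model the XLEN-bit limits of the accumulators or of `t >> d`; it only shows that the two partial additions reconstruct the product.

```python
def rbr_value(words, d):
    """Numerical value of a little-endian RBR vector."""
    return sum(word << (i * d) for i, word in enumerate(words))


def rbr_mul(x, y, d):
    """Schoolbook RBR multiplication; the (i, j) order is irrelevant."""
    dmask = (1 << d) - 1
    xy = [0] * (len(x) + len(y))
    for i, xi in enumerate(x):
        for j, yj in enumerate(y):
            t = xi * yj
            xy[i + j] += t & dmask    # bits t[d-1:0]
            xy[i + j + 1] += t >> d   # remaining high bits of t
    return xy
```

The result is itself a redundant vector: its words may exceed `d` bits, and a carry step is needed before it is used as an SRBR input again.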

The standard RISC-V instructions `MUL` and `MULH[[S]U]` return bits
`t[XLEN-1:0]` and `t[2*XLEN-1:XLEN]` of the product, which would seem
to necessitate a few additional shifts and an XOR for each step.

An easy input-prep trick is to shift both (SRBR-format) inputs left
by `r/2` bits before starting the operation (typically 4 positions).
As a result the product is shifted left by `r` bits, and hence `MULH[[S]U]`
directly returns the desired high part while the lower word only needs to be
shifted right by `r` bits.

We see that the high `r` bits `t[2*XLEN-1:XLEN+d]` of the sub-product
are discarded -- this style of implementation assumes that the
inputs are SRBR (or similar) rather than general RBR.
An alternative approach would be to apply `DMASK` to `xy[k + 1]` too, and
add `(t >> (2*d))` to `xy[k + 2]`. However, it would seem to be
always easier to SRBR-reduce inputs first. As a general strategy any easy
O(n) or one-step parallel input prepping is worthwhile since the main body
of multiplication is superlinear, often up to O(n^2^).


Together with input prepping we have:
----
for all i, 0 <= i < n:
    x'[i] = x[i] << (r / 2)
end
for all j, 0 <= j < m:
    y'[j] = y[j] << (r / 2)
end
for all (i,j) with 0 <= i < n, 0 <= j < m:
    t = x'[i] * y'[j]
    xy[i + j]     = xy[i + j] + (t[XLEN-1:0] >> r)
    xy[i + j + 1] = xy[i + j + 1] + t[2*XLEN-1:XLEN]
end
----
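
The following Python sketch models the prepped multiplication for RV32 with r=8. The functions `mul` and `mulhu` are hypothetical software models of the `MUL` and `MULHU` instructions, and the inputs are assumed to be unsigned SRBR words (below 2^25), so that the left shift by `r/2` cannot overflow a 32-bit register. The accumulators are unbounded Python integers, so accumulator overflow is not modelled.

```python
XLEN = 32
R = 8               # redundancy bits
D = XLEN - R        # significant bits per word
WORD_MASK = (1 << XLEN) - 1


def mul(a, b):
    """Model of MUL: low XLEN bits of the unsigned product."""
    return (a * b) & WORD_MASK


def mulhu(a, b):
    """Model of MULHU: high XLEN bits of the unsigned product."""
    return (a * b) >> XLEN


def rbr_value(words, d=D):
    """Numerical value of a little-endian RBR vector."""
    return sum(word << (i * d) for i, word in enumerate(words))


def rbr_mul_prepped(x, y):
    """RBR multiplication with both inputs pre-shifted left by R/2 bits."""
    xp = [(w << (R // 2)) & WORD_MASK for w in x]
    yp = [(w << (R // 2)) & WORD_MASK for w in y]
    xy = [0] * (len(x) + len(y))
    for i, a in enumerate(xp):
        for j, b in enumerate(yp):
            xy[i + j] += mul(a, b) >> R   # t[d-1:0] of the unshifted product
            xy[i + j + 1] += mulhu(a, b)  # t[XLEN+d-1:d], no shift needed
    return xy
```

Compared with the unprepped loop, the high part comes straight out of `MULHU` and only the low part needs one shift, at the price of a linear-time pass over the inputs.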

For RV32 the choice of unsigned redundancy r=8 allows multiplication of
24*256 = 6144-bit numbers (12288-bit product) without a carry reduction
step, and of 56*256 = 14336-bit numbers for RV64, which is sufficient for
most current cryptographic applications. However, one may easily introduce
intermediate reduction steps. One may also use signed representation, which
makes https://en.wikipedia.org/wiki/Karatsuba_algorithm[Karatsuba]-style
multiplication formulas easier to implement, asymptotically reducing
the overall number of multiplication instructions for very large numbers.

30 changes: 30 additions & 0 deletions doc/supp/supplementary-info.adoc
@@ -0,0 +1,30 @@
= RISC-V Crypto Extension: Supplementary Information
:doctype: article
:encoding: utf-8
:lang: en
:toc: left
:numbered:
:stem: latexmath
:le: &#8804;
:ge: &#8805;
:ne: &#8800;
:inf: &#8734;

_Work In Progress_

== Introduction

This document contains additional information for engineers implementing
or using the RISC-V cryptography extension.

NOTE: The following discussion is not a _requirement_. We are simply
providing supplementary information that is intended to be helpful for
implementation and optimization of certain tasks, and provides partial
rationale for certain ISA features.

include::indexed-loads-stores.adoc[]

include::rbr-arithmetic.adoc[]

include::gcm-mode-cmul.adoc[]

25 changes: 0 additions & 25 deletions doc/supplementary-info.adoc

This file was deleted.
