Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bernstein yang modular multiplicative inverter #2

Merged
merged 11 commits into from
Aug 29, 2023

Conversation

AlekseiVambol
Copy link

Replacing the multiplicative inversion based on the Little Fermat Theorem with Bernstein yang multiplicative inversion. The latter one turned out to be about 8.5 times faster on average.

@johntaiko
Copy link

Hello, aleksei, I found your code has not been formatted.
cargo fmt

Copy link

@mratsim mratsim left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only comments to improve in general.

A good addition would be an inversion benchmark here https://github.com/taikoxyz/halo2curves/blob/main/benches/bn256_field.rs

struct ChunkInt<const B:usize, const L:usize>(pub [u64; L]);

impl<const B:usize, const L:usize> ChunkInt<B,L> {
/// Mask, in which the B lowest bits are 1 and only they
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"and only they"

Seems like the sentence is cut.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, it means "they and only they are 1".

/// The ordering of the chunks in these arrays is little-endian.
/// The arithmetic operations for this type are wrapping ones.
#[derive(Clone)]
struct ChunkInt<const B:usize, const L:usize>(pub [u64; L]);
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's usually called unsaturated integers see:

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems that this term is rarely used. On the other hand, "saturated integer" is usually used to describe the type, for which the saturation arithmetic is implemented: https://en.wikipedia.org/wiki/Saturation_arithmetic A least, my first thought was about the saturation arithmetic, when I saw this term somewhere before. Thus, I will leave this comment "as is" to avoid the ambiguation.

data[i - 1] = self.0[i];
}
if self.is_negative() {
data[L - 1] = Self::MASK;
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does the mask in data[L - 2] need to be cancelled?

A comment is needed, at least to attract attention for a potential bug during review/audit and also test input design.

Edit: Ah I understand the representation now. All limbs are 2-complement and negated, not just the most significant word.

Copy link
Author

@AlekseiVambol AlekseiVambol Aug 23, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To dispel any doubts, I perform this:
image

let (mut data, mut carry) = ([0; L], 0);
for i in 0..L {
let sum = self.0[i] + other.0[i] + carry;
data[i] = sum & ChunkInt::<B,L>::MASK;
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What if the sum is negative? You remove the negative tag there.

Copy link
Author

@AlekseiVambol AlekseiVambol Aug 23, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The value of "sum" is never negative, since actually I operate on 2 non-negative integers in [0 .. 2 ^ (B * L) - 1] splited into B-bit non-negative chunks, but any x in [2 ^ (B * L - 1) .. 2 ^ (B * L) - 1] is considered to be the representation of -|2 ^ (B * L) - x|. The Mul, Add and Sub algorithms for the two's complement code do not care about the sign, and this is the main reason for them being used by the majority of processors. Since ChunkInt has a fixed size, it may afford to use this convenient code.

impl<const B:usize, const L:usize> Sub for &ChunkInt<B,L> {
type Output = ChunkInt<B,L>;
fn sub(self, other: Self) -> Self::Output {
let (mut data, mut carry) = ([0; L], 1);
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This needs a comment to explain that

-x = flip the bits of x and add 1.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree. This part of code has not been commented yet.

}

let mask = (1 << steps.min(1 - delta).min(4)) - 1;
let w = (g as i64).wrapping_mul(f.wrapping_mul(3) ^ 12) & mask;
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This needs a comment in the vein of

Find the multiple of f to add to cancel the bottom min(steps, 4) bits of g

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree. This part of code has not been commented yet; "f.wrapping_mul(3) ^ 12" will also been explained a bit.

(cd.shift(), ce.shift())
}

fn norm(&self, mut value: ChunkInt<B,L>, negate: bool) -> ChunkInt<B,L> {
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should indicate the input range and output range

AFAIK:

Compute a = sign*a (mod M)

with a in range (-2*M, M)
result in range [0, M)

also why is the value mutated and returned as well?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree. I was going to comment this part of the code.

also why is the value mutated and returned as well?

It is mutable in order to mutate it within the function without introducing a new temporary variable.
It is returned, because the function takes the ownership of the argument instead of borrowing it mutably.
It is returned instead of being changed using a mutable reference because I want all function to be pure: https://en.wikipedia.org/wiki/Pure_function

}

/// Returns the multiplicative inverse of the argument modulo 2^B. The implementation is based
/// on the Hurchalla's method for computing the multiplicative inverse modulo a power of two
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing link to paper

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Usually I try to avoid overwhelming my code with direct links to papers easy to find, since I try to achieve better readability and aesthetics) However, I can add this one.

@@ -24,6 +24,11 @@ macro_rules! field_common {
$r2:ident,
$r3:ident
) => {
/// Bernstein-Yang modular multiplicative inverter created for the modulus equal to
/// the characteristic of the field to invert positive integers in the Montgomery form.
const BYINVERTOR: crate::bernsteinyang::BYInverter<62,6> =
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for BN254 and secp256k1, BYInverter<62,5> is sufficient.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No. I have checked this both experimentally (running the code for the bn254 Fq modulus with BYInverter<62,5>) and theoretically (deriving the formula 2 ^ (B * (L - 1) - 2) for the threshold for the values of the modulus and the argument).

let mut matrix;
while g != ChunkInt::ZERO {
(delta, matrix) = Self::step(&f, &g, delta);
(f, g) = Self::fg(f, g, matrix);
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment: fg updates can take a parameter "limbsLeft" to avoid iterating on the full L all the time.

This might be left as a future optimization.

See: https://github.com/mratsim/constantine/blob/f57d071f1192a4039979a3baf6c835b89841bcfa/constantine/math/arithmetic/limbs_exgcd.nim#L836-L839

Copy link
Author

@AlekseiVambol AlekseiVambol Aug 23, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought about this during planing the implementation, but decided not to do this, because:

  • It will require some extra time to determine the number of unused limbs after each arithmetic operation;
  • Only the f,g-update can benefit from it;
  • For the two's complement code working with fixed-length register-like structure is much more convenient.

For much larger modulus this optimization makes sense, but I am not sure that it is needed here.

"Programmers waste enormous amounts of time thinking about, or worrying about, the speed of noncritical parts of their programs, and these attempts at efficiency actually have a strong negative impact when debugging and maintenance are considered. We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil. Yet we should not pass up our opportunities in that critical 3%." - Donald Knuth

@AlekseiVambol
Copy link
Author

AlekseiVambol commented Aug 28, 2023

Update summary:

  1. I have read that the Montgomery's formula (3 * x) xor 2 = 1 / x for an odd integer x in the two's complement code holds true not only (mod 16), but also (mod 32). Thus, I have used it to derive the formula (3 * x) xor 28 = -1 / x (mod 32) and integrated it into the method for computing the transition matrix in order to nullify the 5 lowest bits instead of the 4 ones. This increased the overall performance by 2-3%.
  2. The number of the Bernstein-Yang method's basic steps, which the transition matrix is computed for, has been hard coded as 62.
  3. The test bench for multiplicative inversion in the base field of the bn256 curve has been added.
  4. Commenting has been accomplished.
  5. Some minor changes proposed by the Clippy tool has been made.
  6. The standard code formatting has been done.

The current implementation of finite field multiplicative inversion is approximately 9.4 times faster than the initial one based on the Fermat's Little Theorem:
image

Copy link

@mratsim mratsim left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome, I think the src filename should be "invmod.rs" or "modinv.rs" instead of Bernstein-Yang (i.e. what it does instead of how it does it), but the PR is in good shape and it will likely take a very long-time to review for the PSE team so better merge now and address their concerns.

@mratsim mratsim merged commit ba112ff into main Aug 29, 2023
mratsim added a commit that referenced this pull request Oct 19, 2023
…s#83)

* Bernstein yang modular multiplicative inverter (#2)

* rename similar to privacy-scaling-explorations#95

---------

Co-authored-by: Aleksei Vambol <[email protected]>
mratsim added a commit that referenced this pull request Oct 25, 2023
…s#83)

* Bernstein yang modular multiplicative inverter (#2)

* rename similar to privacy-scaling-explorations#95

---------

Co-authored-by: Aleksei Vambol <[email protected]>
@mratsim mratsim deleted the Bernstein-Yang-modular-multiplicative-inverter branch November 2, 2023 16:24
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants