Implement batch inversion of field elements #16
Conversation
Code looks good, but is there a use case for this? I'd like to avoid code bloat if not necessary.
It's useful to convert a batch of points from jacobian to affine coordinates with a single field inversion, e.g. the common case of precomputing points for a scalar multiplication in jacobian, then converting the entire table to affine (the scalar point multiplication of course being faster when one of the points is given in affine coordinates). Once we have a batch gej->ge method, secp256k1_ecmult_table_precomp_ge and secp256k1_ecmult_table_precomp_gej become somewhat redundant, since a single precomp method can have the best of both worlds.

Okeya and Sakurai discuss a more intricate variation in their paper "Fast Multi-scalar Multiplication Methods on Elliptic Curves with Precomputation Strategy Using Montgomery Trick" (couldn't immediately find a link, but I can forward you a copy if interested). They take advantage of batch inversion to perform multiple affine point additions in parallel (using the basic affine addition formulae directly, but with the inversion shared).

Less significantly (depending how slow inversion actually is), one could batch multiple ECDSA operations so as to share the inversion required by the final jacobian -> affine transformation. This would just use the same batch gej->ge method from above.

Just leave this pull request open for the moment if you like, and I'll add some of that higher-level functionality to this branch; then you can weigh up the benefits once there's something useful to show.
Interesting. That may actually provide a measurable speedup, if we do both the A and G precomputed multiples in affine coordinates. It may or may not outweigh the cost of an extra inversion (which, IIRC, is around 10% of the total signature validation time when using GMP + x86_64_asm field code).

Thinking a bit further: one reason for using GMP is exactly because its modular inversion is so fast. If we can reduce the importance of inversions (for batch validation), perhaps it becomes doable to use a more naive self-implemented inversion.
The changes are all in this branch now, so we can do some performance analysis. The G precomputation is an obvious win. 1000 * start/stop fe/ge/ecmult (--with-field=64bit):

A very rough stab at the cost of an inversion then: 12M + 4S + I ~= 5 * (12M + 4S + 3M), since each I in building the table is replaced by 3M. Using S = 0.8M, then I ~= 76M; if S = 0.6M, then I ~= 73M. I should probably write a little benchmark to get an accurate value.

To get some idea of the A precomputation, I give results for "bench". Note that to give a fair comparison I'm comparing to the branch from #20, since this multi-inv branch has the equivalent optimisation.

bench (./configure --enable-endomorphism=no --with-field=64bit):

bench (./configure --enable-endomorphism=yes --with-field=64bit):

Unfortunately, I am not able to test the 64bit_asm field implementation (on OSX, for some reason the asm methods don't link correctly for me, perhaps something to do with the ABI), so I would certainly appreciate someone checking the 64bit_asm field, although it seems predictable that it will make the inversion even more costly, relatively speaking.

So, it's not there yet for A, which rather surprises me. Back of the envelope: with WINDOW_A == 5, there are 8 precomputed points. Converting the jacobian points to affine costs an extra 7 * 3M + I. Using affine values in the table should save 1S + 4M for each actual point addition in the main loop, so with I ~= 75M, the conversion should pay for itself after about 20 point additions - but there are over 40 point additions (for the A part) in a typical secp256k1_ecmult! I think my estimate of I based on the G tests above must be much too low; there ought to be a measurable saving whenever I < ~160M.

I think I will next look at a modular inversion implementation, since there appears to be interest in removing the use of GMP anyway?
In BouncyCastle we use Algorithm 12 from the paper "Software Implementation of the NIST Elliptic Curves Over Prime Fields" (Brown, Hankerson, Lopez, Menezes), called there "Binary inversion", so I'll take a stab at writing a version of that to work with the secp256k1_num_t type and we'll see how it stacks up. In BC, I ~= 100M for this curve.
Correction: the cost for converting the table is actually 7 * (6M + S) + I (I was ignoring the cost of actually adjusting x/y after the batch inversion). So a batch inversion for the A precomp should pay off around I <~ 140M (but apparently doesn't).

Revisiting the G calculation: 12M + 4S + I ~= 5 * (12M + 4S + 6M + S). So I is ~90M to 95M depending on the cost of S. There's still a bit of a gap between theory and results!
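For the record, solving that revised relation for I explicitly (same M/S/I cost notation as above):

12M + 4S + I ~= 5 * (12M + 4S + 6M + S) = 90M + 25S
=> I ~= 78M + 21S

With S = 0.8M this gives I ~= 94.8M, and with S = 0.6M it gives I ~= 90.6M, i.e. the ~90 to 95 range quoted above.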
Urgh, the G calculation once again (accounting for the fact that the existing version uses mixed additions):

I have been measuring the A case some more, and I think the problem is actually that secp256k1_fe_gej_add_ge isn't as fast (relative to ...gej_add) as it should be. The first thing that jumped out is a redundant normalize call for 'u1'. With that removed, and the latest normalize method changes, this branch is now faster (i.e. with a batch inversion to get all affine precomputed points), even with the endomorphism enabled.
Can you rebase? I benchmarked --disable-endomorphism --with-bignum=gmp --with-field=64bit_asm:
When benchmarking --with-bignum=openssl, however, it seems slower. That seems reasonable, as the field inversion we use with OpenSSL's bignum is significantly slower than GMP's, so paying one extra inversion in the batch code for slightly faster additions may not be worth it.
Perhaps we should be looking for an internal inversion; we might be able to make one faster than GMP's by specializing for our field?
We have an internal one, which is faster than OpenSSL's, but slower than GMP's (even with the overhead of converting between formats). It's only used when GMP is not available.
Oh. I ... somehow missed that.
To clarify, the internal inversion code is a simple static (and constant time) exponentiation ladder (the current version of which was even contributed by Peter). GMP uses some extgcd-based algorithm, I believe (which is variable time, and has dynamic memory overhead). Implementing our own variable-time inversion that competes with GMP's would be nice.
@gmaxwell A new inversion method is on my todo list. I have in mind to port across the basic alg we use in BouncyCastle which is an extgcd variant, using only shifts and adds. The special prime form for secp256k1 could be helpful, see e.g. http://pdf.aminer.org/000/087/131/low_power_elliptic_curve_cryptography_using_scaled_modular_arithmetic.pdf for ideas. @sipa Note that it is a simple matter to blind an inversion with a random fe multiplication as an alternative to constant time.
Consuming randomness is a little unfortunate from an interface/portability perspective (e.g. using the library on a small embedded device which may lack strong sources of randomness). If the performance was close I'd prefer not to... obviously we don't need to blind anything that isn't operating on secret data, however.
@gmaxwell Somewhat off-topic... but unfortunately it's not only secret data that warrants blinding! Differential side-channel analysis can make use of the fact that known data is being operated on; notably this includes the basepoint G. One possible countermeasure is point-blinding, which on secp256k1 (with curve parameter a=0) is very straightforward to implement and has low runtime cost, and because there are no points with x=0 or y=0, safe. I do appreciate there are issues around consuming randomness, but ironically it's those small embedded devices that have the most need for DPA countermeasures. I have some prototype code for point blinding already (via random curve isomorphism: see e.g. http://joye.site88.net/papers/TJ10coordblind.pdf); perhaps we can pick up the randomness discussion once I clean that up into a pull request?
@peterdettman I'm not sure if you've tried actually measuring the power side-channel on small devices, but it seems hopeless. Even code that is constant time can have drastically different power usage; even copying memory will leak data in some cases. I think we'd mostly be hoping to achieve timing side-channel freeness.

I'm not sure if we're on the same page wrt side-channels, though. For example, verification is not a secret operation in the applications we've cared about so far, so I do not care about side-channel immunity in verification. I do, however, care about it in multiplication with the generator in pubkey generation and in signing, and in the secret addition in signing, i.e. any process which handles secret data in some manner. Though there are perhaps some nice non-bitcoin applications for this curve that might want better secrecy for other operations, so I suppose it would be nice if it were cheap.

Blinding for the multiplies is straightforward, though it has a performance penalty (at least implemented in the simple way), and in signing we can both take a slowdown and can use deterministic randomness, as we already need to do for the nonce. So I'm not too concerned there, but in verification it's another matter.

Sipa and I had already talked about using point blinding in pubkey generation for robustness reasons, e.g. computing blinded and non-blinded and making sure we get the same key. (This is a concern because a miscalculation of a public key in bitcoin can cause devastating, irreversible losses.)
@sipa Writing a modular inversion competitive with GMP turns out to be fairly challenging (read: it will take me a while longer). If I remove the part of this PR that uses batch inversion in the A precomp (and just use it for the G precomp), do you think it could be merged in that state?
@peterdettman that sounds fine to me.
Patch reduced to only use the batch conversion in the G precomp, and rebased. Just to summarise the situation:
Code changes look good. Please use CHECK() instead of assert() in tests.
Changed to use CHECK(). |
f16be77 Use batch inversion in G precomputation (Peter Dettman)
Adds secp256k1_fe_inv_all and secp256k1_fe_inv_all_var methods for batch inversion of field elements, implemented using "Montgomery's Trick", which uses only one actual field inversion, trading off the others for 3 field multiplications each.
Includes randomised tests covering the new methods and also coverage for the existing inv/inv_var methods.