Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Simplify structure of bignum inline assembly #5706

Merged
merged 4 commits into from
Apr 19, 2022

Conversation

hanno-becker
Copy link

@hanno-becker hanno-becker commented Apr 7, 2022

Dependencies: Based on #5701 merged

This PR follows @mpg's patch in #5360 to simplify organization and use of our bignum inline assembly routines:

  • MULADDC_X1_[INIT|CORE|STOP] must be defined
  • MULADDC_X[2,4,8]_[INIT|CORE|STOP] can be defined if it offers speedup potential. If it is not defined, it is auto-derived from the next lower MULADDC macros.
  • Instances of MULADDC_HUIT are thus renamed to MULADDC_X8_CORE.
  • The x8-variants did not have their own STOP-macros previously, which made the code harder to read. This gets simpler now that every MULADDC_Xi_xxx family has their own INIT and STOP.
  • The core multiplication routine mpi_mul_hlp() is simplified to always call the 8-fold code and then the 1-fold code. Since MULADDC_X8_xxx is always defined (even if it's just auto-derived from MULADDC_X1_xxx), this makes the core more readable.

As a concrete performance improvement based on those changes, MULADDC_X2_xxx is introduced for M-profile MCUs implementing UMAAL.

@hanno-becker hanno-becker force-pushed the bn_mul_cleanup branch 2 times, most recently from 8b5cfc5 to be109c0 Compare April 7, 2022 09:11
@hanno-becker hanno-becker added component-crypto Crypto primitives and low-level interfaces needs-ci Needs to pass CI tests labels Apr 7, 2022
@hanno-becker
Copy link
Author

hanno-becker commented Apr 11, 2022

(Removed comment which was actually about parent PR)

@hanno-becker hanno-becker force-pushed the bn_mul_cleanup branch 2 times, most recently from ba71c38 to 85bb975 Compare April 11, 2022 12:47
@tom-cosgrove-arm tom-cosgrove-arm added the needs-preceding-pr Requires another PR to be merged first label Apr 12, 2022
@mpg mpg changed the base branch from development to coverity_scan April 15, 2022 08:27
@mpg mpg changed the base branch from coverity_scan to development April 15, 2022 08:27
@mpg
Copy link
Contributor

mpg commented Apr 15, 2022

(Changed the base branch twice in order to force-refresh github's commit list, now that #5701 has been merged.)

@mpg mpg removed the needs-preceding-pr Requires another PR to be merged first label Apr 15, 2022
mpg
mpg previously approved these changes Apr 15, 2022
Copy link
Contributor

@mpg mpg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, however the CI doesn't agree. Also, did you test the changes to the UMAAL-based assembly? (Including with an operand size that doesn't just exercise the X2 part but also X1.)

Hanno Becker added 3 commits April 17, 2022 06:16
Compilers are likely to generate shorter assembly for loops of the
form `while( cnt-- ) { ... }` rather than
`for( ; count >= X; count -= X ) { ... }`. (E.g. the latter needs
a subtract+compare+branch after each loop, while the former only
needs decrement+branch).

Signed-off-by: Hanno Becker <[email protected]>
@hanno-becker
Copy link
Author

@mpg CI is happy now, I assume it failed because it was still based on an old version of the base PR #5701

@hanno-becker hanno-becker requested a review from mpg April 19, 2022 05:09
@mpg mpg added needs-review Every commit must be reviewed by at least two team members, and removed needs-ci Needs to pass CI tests labels Apr 19, 2022
@mpg
Copy link
Contributor

mpg commented Apr 19, 2022

Have you manually tested the changes to the UMAAL-based assembly? Because the CI didn't. I'm happy with the code but holding my approval until this has been clarified.

@hanno-becker
Copy link
Author

@mpg Yes, I manually tested the UMAAL assembly, both the x2 and the x1 form.

Copy link
Contributor

@mpg mpg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@hanno-becker
Copy link
Author

The x2-form is about 15% faster than x1 on Cortex-M4, which is in the expected range since we are saving 1/2 memory stall per UMAAL pair.

Copy link
Contributor

@tom-cosgrove-arm tom-cosgrove-arm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@mpg mpg added approved Design and code approved - may be waiting for CI or backports and removed needs-review Every commit must be reviewed by at least two team members, labels Apr 19, 2022
@mpg mpg merged commit 46435f0 into Mbed-TLS:development Apr 19, 2022

for( ; s_len > 0; s_len-- )
while( steps_x8-- )
{
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this resolve #1717, or are there further savings to be had?

Copy link
Contributor

@mpg mpg May 16, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, we could always save more code size by doing no unrolling at all, but that would come at a performance cost, at least for some platforms and some applications. I measured (code size, perf) 3 options on 3 platforms and my conclusion was that keeping 8-1 was a good compromise between code size, perf, and simplicity (there are platforms with optimized asm for 8 steps at once, so we need 8-1 for them, and it's simpler to have the same steps on all platforms).

We can always fine-tune things later, but my personal opinion is that most of the savings were had by removing the extra 16x loop, and we should leave it at that for now.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Design and code approved - may be waiting for CI or backports component-crypto Crypto primitives and low-level interfaces
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants