Improve inline assembly for Cortex-M + DSP #5360

Closed
hanno-becker opened this issue Dec 22, 2021 · 9 comments
Labels: component-crypto (Crypto primitives and low-level interfaces), enhancement

Comments


hanno-becker commented Dec 22, 2021

Relates to: #4943

Suggested enhancement

This is about the inline assembly for MULADDC_CORE on MCUs with the DSP extension:

#define MULADDC_CORE                            \
            "ldr    r0, [%0], #4        \n\t"   \
            "ldr    r1, [%1]            \n\t"   \
            "umaal  r1, %2, %3, r0      \n\t"   \
            "str    r1, [%1], #4        \n\t"

MULADDC_CORE is called many times in a row, which allows the use of ldm/stm instructions so that the CPU can merge consecutive loads into a single cycle, provided the data path is wide enough (e.g. on Cortex-M55).
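For context, a simplified sketch of how these macros are consumed; the real consumer is mpi_mul_hlp() in bignum.c, whose 16- and 8-times unrolled variants are visible in the patch quoted later in this thread:

/* Simplified sketch of the consumer loop in mpi_mul_hlp(). MULADDC_INIT
 * opens the asm statement and binds the operands (%0 = source pointer,
 * %1 = destination pointer, %2 = carry, %3 = multiplier), MULADDC_STOP
 * closes it, and each MULADDC_CORE in between multiplies and accumulates
 * one limb. */
for( ; i > 0; i-- )
{
    MULADDC_INIT
    MULADDC_CORE
    MULADDC_STOP
}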

hanno-becker added the enhancement and component-crypto (Crypto primitives and low-level interfaces) labels on Dec 22, 2021
hanno-becker (Author)

On Cortex-M55, for example, it seems one can replace a pair MULADDC_CORE; MULADDC_CORE with a single invocation of DOUBLE_MULADDC_CORE, defined via

#define DOUBLE_MULADDC_CORE                     \
            "ldm    %0!, {r0, r2}       \n\t"   \
            "ldm    %1, {r1, r3}        \n\t"   \
            "umaal  r1, %2, %3, r0      \n\t"   \
            "umaal  r3, %2, %3, r2      \n\t"   \
            "stm    %1!, {r1, r3}       \n\t"

which leads to a cycle count reduction for mpi_montmul() from 37 kCycles to 25 kCycles for 2048-bit Montgomery multiplications.

This is likely also going to be faster on Cortex-M4 where, if I remember correctly, an LDR following another LDR has single-cycle latency, while an LDR following a non-load instruction has a latency of 2 cycles.

hanno-becker changed the title from "Improve inline assembly for Cortex-M" to "Improve inline assembly for Cortex-M + DSP" on Dec 22, 2021

mpg commented Dec 27, 2021

I see that mpi_mul_hlp() already does some loop unrolling, so it shouldn't be too hard to integrate that. We just need to use MULADDC_CORE_2 wherever there are two consecutive MULADDC_CORE invocations, and provide the trivial fallback definition when we don't have an optimised one.
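A minimal sketch of that fallback pattern (MULADDC_CORE_2 is the hypothetical name from this comment; the patch further down ends up spelling it MULADDC_FAST2/MULADDC_2):

#if !defined(MULADDC_CORE_2)
/* Trivial fallback: expand to two single-limb steps. Ports that have a
 * faster two-limb sequence (like DOUBLE_MULADDC_CORE above) define
 * MULADDC_CORE_2 themselves in bn_mul.h. */
#define MULADDC_CORE_2   MULADDC_CORE MULADDC_CORE
#endif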

Btw, now I'm wondering how the "unrolling steps" were determined: 16, then 8, then down to 1. Clearly there's a trade-off between code size and performance here: when MULADDC_CORE expands to dozens of instructions (currently 29 on non-DSP Cortex-M cores), having 25 copies of it around is not insignificant. I'm not sure 16-8-1 is optimal for this trade-off.


mpg commented Dec 27, 2021

Btw, did you investigate whether doing 4 steps is doable (or would we start running out of usable registers?) and whether it would also lead to a performance improvement? According to the M4 TRM, the cost of an LDM of N registers is 1 + N cycles, while the cost of N consecutive LDRs is at worst 2N (a footnote implies it's lower in some cases, perhaps even down to 1 + N in cases where an LDM could have been used), so it's clearly in our interest to group loads as much as we can.
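For illustration, a hypothetical (untested) 4-step variant following the same pattern as DOUBLE_MULADDC_CORE might look like the sketch below; it needs r0-r7 as scratch, so all of them would have to go on the clobber list in MULADDC_STOP, leaving only high registers for the %0-%3 operands, which may be exactly where we run out:

/* Hypothetical, untested 4-step sketch: one ldm/stm per four limbs.
 * r0/r2/r4/r6 receive source limbs, r1/r3/r5/r7 destination limbs;
 * all eight must be declared as clobbered. */
#define QUAD_MULADDC_CORE                           \
            "ldm    %0!, {r0, r2, r4, r6}   \n\t"   \
            "ldm    %1, {r1, r3, r5, r7}    \n\t"   \
            "umaal  r1, %2, %3, r0          \n\t"   \
            "umaal  r3, %2, %3, r2          \n\t"   \
            "umaal  r5, %2, %3, r4          \n\t"   \
            "umaal  r7, %2, %3, r6          \n\t"   \
            "stm    %1!, {r1, r3, r5, r7}   \n\t"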

Also, I'll note that one DOUBLE_MULADDC_CORE is smaller, in terms of code size, than two copies of MULADDC_CORE (five instructions instead of eight), so this is one of those nice cases where we can win on both fronts :)

hanno-becker (Author)

> I see that mpi_mul_hlp() already does some loop unrolling, so it shouldn't be too hard to integrate that. We just need to use MULADDC_CORE_2 wherever there are two consecutive MULADDC_CORE invocations, and provide the trivial fallback definition when we don't have an optimised one.

Yes, exactly: that's what I also did for the measurements. While there are certainly more optimizations possible, this one would be easy to integrate without major changes to how we organize the inline assembly.


mpg commented Dec 28, 2021

Ok, I got curious about this unrolling business in mpi_mul_hlp() and decided to try some measurements. In addition to the existing 16-8-1 version, I tried two variants: 8-1 and 4-1. The reasoning: 8-1 is the closest to the existing version while still getting a serious size reduction; 4-1 reduces the size further, and I was curious about the performance impact on 384-bit numbers (12 limbs on 32-bit platforms, so 1×8 + 4×1 iterations with 8-1, and 3×4 + 0×1 with 4-1).

For benchmarking, I used a full ECDH key exchange with P-256 and P-384 (the most common and second most common curve sizes), and a full FFDH key exchange with 2048- and 3072-bit groups (same reasoning), except when memory was too low (it turns out the only M0 boards I have can't run FFDH-2048, and I even had to tune the config for P-384 to fit).

Cortex-M0 - times in ms/operation (less is better); the size row is bytes of bignum.o compiled with -Os.

         16-8-1    8-1    4-1
size      12551  11563  11331
P-256      3259   3233   3282
P-384      5935   5849   5615

The performance impact is negligible for P-256. As hypothesized, 4-1 is slightly better than 8-1 for P-384, but the difference remains small (about 4%).

The size impact is substantial, with 988 bytes saved by moving to 8-1 and another 232 (1220 in total) by moving to 4-1.

Cortex-M4 - times in ms/operation (less is better); the size row is bytes of bignum.o compiled with -Os.

           16-8-1    8-1    4-1
size        11415  11187  11127
P-256        1082   1079   1084
P-384        1524   1518   1500
FFDH 2048    3680   3833   4394

The performance impact on ECC is negligible; for FFDH, however, we lose 4% by moving to 8-1 and a further 14% (19% in total) by moving to 4-1. (However, this can probably be more than compensated for by moving to DOUBLE_MULADDC_CORE; also, it is debatable whether FFDH really matters on M-class cores; for example, it's disabled by default in Mbed OS.)

The code size impact is not as large as on M0, since the M4 asm for MULADDC_CORE is considerably smaller than the M0 asm, but 228 to 288 bytes can still be saved.

x86-64 - times in operations/second (more is better); the size row is bytes of bignum.o compiled with -Os.

Measurements were repeated 3 times, and the min-max range is given in order to convey the uncertainty.

            16-8-1      8-1      4-1
size         21087    20574    20450
P-256      324-398  412-429  407-430
P-384      214-233  243-252  242-253
FFDH 2048    77-79    77-80    77-80
FFDH 3072       25       26    25-26

The performance impact on FFDH is negligible. Removing the 16-times loop actually improves ECC performance, probably because the code fits better in the cache; there's no measurable difference between 8-1 and 4-1.

Code size probably matters less on this platform, but the numbers behave as expected.

32-bit A-class: I didn't test on this platform, but I expect the results to fall somewhere between the M4 (which uses the same asm) and x86-64 (depending on cache).


mpg commented Dec 29, 2021

See #5373.


mpg commented Dec 30, 2021

OK, this is weird: I can't reproduce the performance improvement. I created a simple mbed-os project with the following main.cpp:

#include "mbed.h"

#include "mbedtls/ecdh.h"
#include "mbedtls/dhm.h"

Timer t;

#define TIMEIT(NAME, CODE)  \
    t.reset();              \
    t.start();              \
    CODE;                   \
    t.stop();               \
    printf("%10s: %5d ms\n", NAME, t.read_ms())

/* TEST only! */
int test_prng(void *ctx, unsigned char *output, size_t output_size)
{
    (void) ctx;
    for (unsigned i = 0; i < output_size; i++)
        output[i] = (uint8_t) rand();
    return 0;
}

#if 1
void ecdh(const char *name, mbedtls_ecp_group_id id)
{
    mbedtls_ecdh_context ecdh;
    unsigned char buf[100];
    size_t olen;

    mbedtls_ecdh_init( &ecdh );

    mbedtls_ecp_group_load( &ecdh.grp, id );

    TIMEIT( name,
        int ret = mbedtls_ecdh_make_public( &ecdh, &olen, buf, sizeof( buf ), test_prng, NULL );
        if (ret != 0 ) printf("ret = %d = -0x%04x\n", ret, -ret);
        ret = mbedtls_ecp_copy( &ecdh.Qp, &ecdh.Q );
        if (ret != 0 ) printf("ret = %d = -0x%04x\n", ret, -ret);
        ret = mbedtls_ecdh_calc_secret( &ecdh, &olen, buf, sizeof( buf ), test_prng, NULL );
        if (ret != 0 ) printf("ret = %d = -0x%04x\n", ret, -ret);
    );

    mbedtls_ecdh_free( &ecdh );
}
#endif

#if 1
void ffdh(const char *name,
          const unsigned char *p, size_t p_len,
          const unsigned char *g, size_t g_len)
{
    mbedtls_dhm_context dhm;
    unsigned char buf[400];
    size_t n, olen;
    int ret;

    mbedtls_dhm_init( &dhm );

    ret = mbedtls_mpi_read_binary( &dhm.P, p, p_len );
    if( ret != 0 ) printf("p: %d\n", ret);
    ret = mbedtls_mpi_read_binary( &dhm.G, g, g_len );
    if( ret != 0 ) printf("g: %d\n", ret);
    dhm.len = n = mbedtls_mpi_size( &dhm.P );

    TIMEIT( name,
        ret = mbedtls_dhm_make_public( &dhm, (int) n, buf, n, test_prng, NULL );
        if( ret != 0 ) printf("mp: %d = -0x%04x\n", ret, -ret);
        ret = mbedtls_mpi_copy( &dhm.GY, &dhm.GX );
        if( ret != 0 ) printf("cp: %d\n", ret);
        ret = mbedtls_dhm_calc_secret( &dhm, buf, sizeof( buf ), &olen, test_prng, NULL );
        if( ret != 0 ) printf("cs: %d = -0x%04x\n", ret, -ret);
    );

    mbedtls_dhm_free( &dhm );
}
#endif

int main()
{
#if 1
    ecdh("ECDH P-256", MBEDTLS_ECP_DP_SECP256R1);
    ecdh("ECDH P-384", MBEDTLS_ECP_DP_SECP384R1);
#endif

#if 1
    const unsigned char p[] = MBEDTLS_DHM_RFC7919_FFDHE2048_P_BIN;
    const unsigned char g[] = MBEDTLS_DHM_RFC7919_FFDHE2048_G_BIN;
    ffdh("FFDH 2048", p, sizeof p, g, sizeof g);
#endif
}

and the following patch applied to the mbed-os directory:

diff --git a/connectivity/mbedtls/include/mbedtls/bn_mul.h b/connectivity/mbedtls/include/mbedtls/bn_mul.h
index 17d057f3abe9..ff9b003b7350 100644
--- a/connectivity/mbedtls/include/mbedtls/bn_mul.h
+++ b/connectivity/mbedtls/include/mbedtls/bn_mul.h
@@ -676,10 +676,17 @@
             "umaal  r1, %2, %3, r0      \n\t"   \
             "str    r1, [%1], #4        \n\t"
 
+#define MULADDC_FAST2                           \
+            "ldm    %0!, {r0, r2}       \n\t"   \
+            "ldm    %1, {r1, r3}        \n\t"   \
+            "umaal  r1, %2, %3, r0      \n\t"   \
+            "umaal  r3, %2, %3, r2      \n\t"   \
+            "stm    %1!, {r1,r3}        \n\t"
+
 #define MULADDC_STOP                            \
          : "=r" (s),  "=r" (d), "=r" (c)        \
          : "r" (b), "0" (s), "1" (d), "2" (c)   \
-         : "r0", "r1", "memory"                 \
+         : "r0", "r1", "r2", "r3", "memory"     \
          );
 
 #else
diff --git a/connectivity/mbedtls/include/mbedtls/config.h b/connectivity/mbedtls/include/mbedtls/config.h
index 6201d9910c49..a5ca5eb4303e 100644
--- a/connectivity/mbedtls/include/mbedtls/config.h
+++ b/connectivity/mbedtls/include/mbedtls/config.h
@@ -2651,7 +2651,7 @@
  *             See dhm.h for more details.
  *
  */
-//#define MBEDTLS_DHM_C
+#define MBEDTLS_DHM_C
 
 /**
  * \def MBEDTLS_ECDH_C
diff --git a/connectivity/mbedtls/source/bignum.c b/connectivity/mbedtls/source/bignum.c
index 9cc5d66e3abf..c5092164fca3 100644
--- a/connectivity/mbedtls/source/bignum.c
+++ b/connectivity/mbedtls/source/bignum.c
@@ -1546,11 +1546,42 @@ void mpi_mul_hlp( size_t i, mbedtls_mpi_uint *s, mbedtls_mpi_uint *d, mbedtls_mp
 {
     mbedtls_mpi_uint c = 0, t = 0;
 
-#if defined(MULADDC_HUIT)
+    /*
+     * Unroll 8 times; this provides a reasonable compromise across platforms.
+     *
+     * Unrolling less hurts performance of FFDH/RSA on some platforms (for
+     * example, unrolling 4 rather than 8 times decreases performance by around
+     * 12% on Cortex-M4 cores). Unrolling more increases the code size linearly
+     * (for example, unrolling 16 rather than 8 times would increase the code
+     * size by around 250 bytes on Cortex-M0).
+     *
+     * Also, on 32-bit platforms, 256-bit numbers are 8 limbs, and this is a
+     * common size for ECC, widely used on constrained platforms.
+     *
+     * Use optimized 8/4/2-times version if available.
+     */
+#if 0 && defined(MULADDC_FAST2)
+#define MULADDC_2   MULADDC_FAST2
+#else
+#define MULADDC_2   MULADDC_CORE MULADDC_CORE
+#endif
+
+#if defined(MULADDC_FAST4)
+#define MULADDC_4   MULADDC_FAST4
+#else
+#define MULADDC_4   MULADDC_2 MULADDC_2
+#endif
+
+#if defined(MULADDC_FAST8)
+#define MULADDC_8   MULADDC_FAST8
+#else
+#define MULADDC_8   MULADDC_4 MULADDC_4
+#endif
+
     for( ; i >= 8; i -= 8 )
     {
         MULADDC_INIT
-        MULADDC_HUIT
+        MULADDC_8
         MULADDC_STOP
     }
 
@@ -1560,40 +1591,6 @@ void mpi_mul_hlp( size_t i, mbedtls_mpi_uint *s, mbedtls_mpi_uint *d, mbedtls_mp
         MULADDC_CORE
         MULADDC_STOP
     }
-#else /* MULADDC_HUIT */
-    for( ; i >= 16; i -= 16 )
-    {
-        MULADDC_INIT
-        MULADDC_CORE   MULADDC_CORE
-        MULADDC_CORE   MULADDC_CORE
-        MULADDC_CORE   MULADDC_CORE
-        MULADDC_CORE   MULADDC_CORE
-
-        MULADDC_CORE   MULADDC_CORE
-        MULADDC_CORE   MULADDC_CORE
-        MULADDC_CORE   MULADDC_CORE
-        MULADDC_CORE   MULADDC_CORE
-        MULADDC_STOP
-    }
-
-    for( ; i >= 8; i -= 8 )
-    {
-        MULADDC_INIT
-        MULADDC_CORE   MULADDC_CORE
-        MULADDC_CORE   MULADDC_CORE
-
-        MULADDC_CORE   MULADDC_CORE
-        MULADDC_CORE   MULADDC_CORE
-        MULADDC_STOP
-    }
-
-    for( ; i > 0; i-- )
-    {
-        MULADDC_INIT
-        MULADDC_CORE
-        MULADDC_STOP
-    }
-#endif /* MULADDC_HUIT */
 
     t++;
 

Then I compiled and ran with mbed compile -t GCC_ARM -m DISCO_L475VG_IOT01A --flash --sterm, and got the following outputs before (with the #if 0 && disabling the optimisation) and after (with the optimisation enabled):

Before:

ECDH P-256:  1079 ms
ECDH P-384:  1520 ms
FFDH 2048:  3851 ms

After:

ECDH P-256:  1079 ms
ECDH P-384:  1522 ms
FFDH 2048:  3903 ms

Notice how the optimisation changes almost nothing for small bignums but seems to slightly decrease FFDH performance.

I'm not sure what to think of this. I double-checked the generated code with arm-none-eabi-objdump -d BUILD/DISCO_L475VG_IOT01A/GCC_ARM/bn_mul.elf | grep -A2 -B2 umaal to make sure the code is compiled as I expect, and it is. (Also, I checked that the size figure printed by the mbed tool on the connectivity line decreases as expected when the optimisation is enabled.)

Perhaps the improvements depend on the actual memory characteristics of the chip?

hanno-becker (Author)

This may be a code alignment issue: we should try aligning the size of MULADDC_CORE to 4 bytes (e.g. by forcing 32-bit instruction variants) and emitting an initial alignment directive in MULADDC_INIT.
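A rough (untested) sketch of that idea, assuming the GCC inline-assembly form of these macros: force 32-bit encodings with the .w suffix so each expansion's size is a multiple of 4 bytes, and emit a .p2align at the start of the block. (umaal and the post-indexed ldr/str only have 32-bit encodings anyway, so only the plain ldr actually changes.)

/* Untested sketch: .p2align 2 aligns the start of the asm block, and the
 * .w suffixes force wide encodings so the block size stays a multiple of
 * 4 bytes. */
#define MULADDC_INIT                            \
    asm(                                        \
            ".p2align 2                 \n\t"

#define MULADDC_CORE                            \
            "ldr.w  r0, [%0], #4        \n\t"   \
            "ldr.w  r1, [%1]            \n\t"   \
            "umaal  r1, %2, %3, r0      \n\t"   \
            "str.w  r1, [%1], #4        \n\t"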

tom-cosgrove-arm (Contributor)

@hanno-arm @mpg can we close this now that #5701 and #5706 have been merged?
