Improve inline assembly for Cortex-M + DSP #5360

Closed
hanno-becker opened this issue Dec 22, 2021 · 9 comments
Labels: component-crypto (Crypto primitives and low-level interfaces), enhancement

Comments


hanno-becker commented Dec 22, 2021

Relates to: #4943

Suggested enhancement

This is about the inline assembly for MULADDC_CORE on MCUs with the DSP extension:

#define MULADDC_CORE                            \
            "ldr    r0, [%0], #4        \n\t"   \
            "ldr    r1, [%1]            \n\t"   \
            "umaal  r1, %2, %3, r0      \n\t"   \
            "str    r1, [%1], #4        \n\t"

MULADDC_CORE is called many times in a row, which allows the use of ldm/stm instructions so that the CPU can merge consecutive loads into a single cycle, provided the data path is wide enough (e.g. on Cortex-M55).
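For context, a simplified sketch of how these macros are consumed; the real consumer is mpi_mul_hlp() in bignum.c, whose 16- and 8-times unrolled variants are visible in the patch quoted later in this thread:

/* Simplified sketch of the consumer loop in mpi_mul_hlp(). MULADDC_INIT
 * opens the asm statement and binds the operands (%0 = source pointer,
 * %1 = destination pointer, %2 = carry, %3 = multiplier), MULADDC_STOP
 * closes it, and each MULADDC_CORE in between multiplies and accumulates
 * one limb. */
for( ; i > 0; i-- )
{
    MULADDC_INIT
    MULADDC_CORE
    MULADDC_STOP
}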

hanno-becker added the enhancement and component-crypto (Crypto primitives and low-level interfaces) labels on Dec 22, 2021
hanno-becker (Author)

On Cortex-M55, for example, it seems one can replace a pair MULADDC_CORE; MULADDC_CORE with a single invocation of DOUBLE_MULADDC_CORE, defined via

#define DOUBLE_MULADDC_CORE                     \
            "ldm    %0!, {r0, r2}       \n\t"   \
            "ldm    %1, {r1, r3}        \n\t"   \
            "umaal  r1, %2, %3, r0      \n\t"   \
            "umaal  r3, %2, %3, r2      \n\t"   \
            "stm    %1!, {r1, r3}       \n\t"

which leads to a cycle count reduction for mpi_montmul() from 37 kCycles to 25 kCycles for 2048-bit Montgomery multiplications.

This is likely also going to be faster on Cortex-M4 where, if I remember correctly, an LDR following another LDR has single-cycle latency, while an LDR following a non-load instruction has a latency of 2 cycles.

hanno-becker changed the title from "Improve inline assembly for Cortex-M" to "Improve inline assembly for Cortex-M + DSP" on Dec 22, 2021

mpg commented Dec 27, 2021

I see that mpi_mul_hlp() already does some loop unrolling, so it shouldn't be too hard to integrate that. We just need to use MULADDC_CORE_2 wherever there are two consecutive MULADDC_CORE invocations, and provide the trivial fallback definition when we don't have an optimised one.
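A minimal sketch of that fallback pattern (MULADDC_CORE_2 is the hypothetical name from this comment; the patch further down ends up spelling it MULADDC_FAST2/MULADDC_2):

#if !defined(MULADDC_CORE_2)
/* Trivial fallback: expand to two single-limb steps. Ports that have a
 * faster two-limb sequence (like DOUBLE_MULADDC_CORE above) define
 * MULADDC_CORE_2 themselves in bn_mul.h. */
#define MULADDC_CORE_2   MULADDC_CORE MULADDC_CORE
#endif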

Btw, now I'm wondering how the "unrolling steps" were determined: 16, then 8, then down to 1. Clearly there's a trade-off between code size and performance here: when MULADDC_CORE expands to dozens of instructions (currently 29 on non-DSP Cortex-M cores), having 25 copies of it around is not insignificant. I'm not sure 16-8-1 is optimal for this trade-off.


mpg commented Dec 27, 2021

Btw, did you investigate whether doing 4 steps is doable (or would we start running out of usable registers?) and whether it would also lead to a performance improvement? According to the M4 TRM, the cost of an LDM of N registers is 1 + N cycles, while the cost of N consecutive LDRs is at worst 2N (a footnote implies it's lower in some cases, perhaps even down to 1 + N in cases where an LDM could have been used), so it's clearly in our interest to group loads as much as we can.
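For illustration, a hypothetical (untested) 4-step variant following the same pattern as DOUBLE_MULADDC_CORE might look like the sketch below; it needs r0-r7 as scratch, so all of them would have to go on the clobber list in MULADDC_STOP, leaving only high registers for the %0-%3 operands, which may be exactly where we run out:

/* Hypothetical, untested 4-step sketch: one ldm/stm per four limbs.
 * r0/r2/r4/r6 receive source limbs, r1/r3/r5/r7 destination limbs;
 * all eight must be declared as clobbered. */
#define QUAD_MULADDC_CORE                           \
            "ldm    %0!, {r0, r2, r4, r6}   \n\t"   \
            "ldm    %1, {r1, r3, r5, r7}    \n\t"   \
            "umaal  r1, %2, %3, r0          \n\t"   \
            "umaal  r3, %2, %3, r2          \n\t"   \
            "umaal  r5, %2, %3, r4          \n\t"   \
            "umaal  r7, %2, %3, r6          \n\t"   \
            "stm    %1!, {r1, r3, r5, r7}   \n\t"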

Also, I'll note that one DOUBLE_MULADDC_CORE is smaller, in terms of code size, than two copies of MULADDC_CORE (five instructions instead of eight), so this is one of those nice cases where we can win on both fronts :)

hanno-becker (Author)

> I see that mpi_mul_hlp() already does some loop unrolling, so it shouldn't be too hard to integrate that. We just need to use MULADDC_CORE_2 wherever there are two consecutive MULADDC_CORE invocations, and provide the trivial fallback definition when we don't have an optimised one.

Yes, exactly: that's what I also did for the measurements. While there are certainly more optimizations possible, this one would be easy to integrate without major changes to how we organize the inline assembly.


mpg commented Dec 28, 2021

Ok, I got curious about this unrolling business in mpi_mul_hlp() and decided to try some measurements. In addition to the existing 16-8-1 version, I tried two variants: 8-1 and 4-1. The reasoning: 8-1 is the closest to the existing version while still getting a serious size reduction; 4-1 reduces the size further, and I was curious about the performance impact on 384-bit numbers (12 limbs on 32-bit platforms, so 1×8 + 4×1 iterations with 8-1, and 3×4 + 0×1 with 4-1).

For benchmarking, I used a full ECDH key exchange with P-256 and P-384 (the most common and second most common curve sizes), and a full FFDH key exchange with 2048- and 3072-bit groups (same reasoning), except when memory was too low (it turns out the only M0 boards I have can't run FFDH-2048, and I even had to tune the config for P-384 to fit).

Cortex-M0 - times in ms/operation (less is better); the size row is bytes of bignum.o compiled with -Os.

         16-8-1    8-1    4-1
size      12551  11563  11331
P-256      3259   3233   3282
P-384      5935   5849   5615

The performance impact is negligible for P-256. As hypothesized, 4-1 is slightly better than 8-1 for P-384, but the difference remains small (about 4%).

The size impact is substantial, with 988 bytes saved by moving to 8-1 and another 232 (1220 in total) by moving to 4-1.

Cortex-M4 - times in ms/operation (less is better); the size row is bytes of bignum.o compiled with -Os.

           16-8-1    8-1    4-1
size        11415  11187  11127
P-256        1082   1079   1084
P-384        1524   1518   1500
FFDH 2048    3680   3833   4394

The performance impact on ECC is negligible; for FFDH, however, we lose 4% by moving to 8-1 and a further 14% (19% in total) by moving to 4-1. (However, this can probably be more than compensated for by moving to DOUBLE_MULADDC_CORE; also, it is debatable whether FFDH really matters on M-class cores; for example, it's disabled by default in Mbed OS.)

The code size impact is not as large as on M0, since the M4 asm for MULADDC_CORE is considerably smaller than the M0 asm, but 228 to 288 bytes can still be saved.

x86-64 - times in operations/second (more is better); the size row is bytes of bignum.o compiled with -Os.

Measurements were repeated 3 times, and the min-max range is given in order to convey the uncertainty.

            16-8-1      8-1      4-1
size         21087    20574    20450
P-256      324-398  412-429  407-430
P-384      214-233  243-252  242-253
FFDH 2048    77-79    77-80    77-80
FFDH 3072       25       26    25-26

The performance impact on FFDH is negligible. Removing the 16-times loop actually improves ECC performance, probably because the code fits better in the cache; there's no measurable difference between 8-1 and 4-1.

Code size probably matters less on this platform, but the numbers behave as expected.

32-bit A-class: I didn't test on this platform, but I expect the results to fall somewhere between the M4 (which uses the same asm) and x86-64 (depending on cache).


mpg commented Dec 29, 2021

See #5373.


mpg commented Dec 30, 2021

OK, this is weird: I can't reproduce the performance improvement. I created a simple mbed-os project with the following main.cpp:

#include "mbed.h"

#include "mbedtls/ecdh.h"
#include "mbedtls/dhm.h"

Timer t;

#define TIMEIT(NAME, CODE)  \
    t.reset();              \
    t.start();              \
    CODE;                   \
    t.stop();               \
    printf("%10s: %5d ms\n", NAME, t.read_ms())

/* TEST only! */
int test_prng(void *ctx, unsigned char *output, size_t output_size)
{
    (void) ctx;
    for (unsigned i = 0; i < output_size; i++)
        output[i] = (uint8_t) rand();
    return 0;
}

#if 1
void ecdh(const char *name, mbedtls_ecp_group_id id)
{
    mbedtls_ecdh_context ecdh;
    unsigned char buf[100];
    size_t olen;

    mbedtls_ecdh_init( &ecdh );

    mbedtls_ecp_group_load( &ecdh.grp, id );

    TIMEIT( name,
        int ret = mbedtls_ecdh_make_public( &ecdh, &olen, buf, sizeof( buf ), test_prng, NULL );
        if (ret != 0 ) printf("ret = %d = -0x%04x\n", ret, -ret);
        ret = mbedtls_ecp_copy( &ecdh.Qp, &ecdh.Q );
        if (ret != 0 ) printf("ret = %d = -0x%04x\n", ret, -ret);
        ret = mbedtls_ecdh_calc_secret( &ecdh, &olen, buf, sizeof( buf ), test_prng, NULL );
        if (ret != 0 ) printf("ret = %d = -0x%04x\n", ret, -ret);
    );

    mbedtls_ecdh_free( &ecdh );
}
#endif

#if 1
void ffdh(const char *name,
          const unsigned char *p, size_t p_len,
          const unsigned char *g, size_t g_len)
{
    mbedtls_dhm_context dhm;
    unsigned char buf[400];
    size_t n, olen;
    int ret;

    mbedtls_dhm_init( &dhm );

    ret = mbedtls_mpi_read_binary( &dhm.P, p, p_len );
    if( ret != 0 ) printf("p: %d\n", ret);
    ret = mbedtls_mpi_read_binary( &dhm.G, g, g_len );
    if( ret != 0 ) printf("g: %d\n", ret);
    dhm.len = n = mbedtls_mpi_size( &dhm.P );

    TIMEIT( name,
        ret = mbedtls_dhm_make_public( &dhm, (int) n, buf, n, test_prng, NULL );
        if( ret != 0 ) printf("mp: %d = -0x%04x\n", ret, -ret);
        ret = mbedtls_mpi_copy( &dhm.GY, &dhm.GX );
        if( ret != 0 ) printf("cp: %d\n", ret);
        ret = mbedtls_dhm_calc_secret( &dhm, buf, sizeof( buf ), &olen, test_prng, NULL );
        if( ret != 0 ) printf("cs: %d = -0x%04x\n", ret, -ret);
    );

    mbedtls_dhm_free( &dhm );
}
#endif

int main()
{
#if 1
    ecdh("ECDH P-256", MBEDTLS_ECP_DP_SECP256R1);
    ecdh("ECDH P-384", MBEDTLS_ECP_DP_SECP384R1);
#endif

#if 1
    const unsigned char p[] = MBEDTLS_DHM_RFC7919_FFDHE2048_P_BIN;
    const unsigned char g[] = MBEDTLS_DHM_RFC7919_FFDHE2048_G_BIN;
    ffdh("FFDH 2048", p, sizeof p, g, sizeof g);
#endif
}

and the following patch applied to the mbed-os directory:

diff --git a/connectivity/mbedtls/include/mbedtls/bn_mul.h b/connectivity/mbedtls/include/mbedtls/bn_mul.h
index 17d057f3abe9..ff9b003b7350 100644
--- a/connectivity/mbedtls/include/mbedtls/bn_mul.h
+++ b/connectivity/mbedtls/include/mbedtls/bn_mul.h
@@ -676,10 +676,17 @@
             "umaal  r1, %2, %3, r0      \n\t"   \
             "str    r1, [%1], #4        \n\t"
 
+#define MULADDC_FAST2                           \
+            "ldm    %0!, {r0, r2}       \n\t"   \
+            "ldm    %1, {r1, r3}        \n\t"   \
+            "umaal  r1, %2, %3, r0      \n\t"   \
+            "umaal  r3, %2, %3, r2      \n\t"   \
+            "stm    %1!, {r1,r3}        \n\t"
+
 #define MULADDC_STOP                            \
          : "=r" (s),  "=r" (d), "=r" (c)        \
          : "r" (b), "0" (s), "1" (d), "2" (c)   \
-         : "r0", "r1", "memory"                 \
+         : "r0", "r1", "r2", "r3", "memory"     \
          );
 
 #else
diff --git a/connectivity/mbedtls/include/mbedtls/config.h b/connectivity/mbedtls/include/mbedtls/config.h
index 6201d9910c49..a5ca5eb4303e 100644
--- a/connectivity/mbedtls/include/mbedtls/config.h
+++ b/connectivity/mbedtls/include/mbedtls/config.h
@@ -2651,7 +2651,7 @@
  *             See dhm.h for more details.
  *
  */
-//#define MBEDTLS_DHM_C
+#define MBEDTLS_DHM_C
 
 /**
  * \def MBEDTLS_ECDH_C
diff --git a/connectivity/mbedtls/source/bignum.c b/connectivity/mbedtls/source/bignum.c
index 9cc5d66e3abf..c5092164fca3 100644
--- a/connectivity/mbedtls/source/bignum.c
+++ b/connectivity/mbedtls/source/bignum.c
@@ -1546,11 +1546,42 @@ void mpi_mul_hlp( size_t i, mbedtls_mpi_uint *s, mbedtls_mpi_uint *d, mbedtls_mp
 {
     mbedtls_mpi_uint c = 0, t = 0;
 
-#if defined(MULADDC_HUIT)
+    /*
+     * Unroll 8 times; this provides a reasonable compromise across platforms.
+     *
+     * Unrolling less hurts performance of FFDH/RSA on some platforms (for
+     * example, unrolling 4 rather than 8 times decreases performance by around
+     * 12% on Cortex-M4 cores). Unrolling more increases the code size linearly
+     * (for example, unrolling 16 rather than 8 times would increase the code
+     * size by around 250 bytes on Cortex-M0).
+     *
+     * Also, on 32-bit platforms, 256-bit numbers are 8 limbs, and this is a
+     * common size for ECC, widely used on constrained platforms.
+     *
+     * Use optimized 8/4/2-times version if available.
+     */
+#if 0 && defined(MULADDC_FAST2)
+#define MULADDC_2   MULADDC_FAST2
+#else
+#define MULADDC_2   MULADDC_CORE MULADDC_CORE
+#endif
+
+#if defined(MULADDC_FAST4)
+#define MULADDC_4   MULADDC_FAST4
+#else
+#define MULADDC_4   MULADDC_2 MULADDC_2
+#endif
+
+#if defined(MULADDC_FAST8)
+#define MULADDC_8   MULADDC_FAST8
+#else
+#define MULADDC_8   MULADDC_4 MULADDC_4
+#endif
+
     for( ; i >= 8; i -= 8 )
     {
         MULADDC_INIT
-        MULADDC_HUIT
+        MULADDC_8
         MULADDC_STOP
     }
 
@@ -1560,40 +1591,6 @@ void mpi_mul_hlp( size_t i, mbedtls_mpi_uint *s, mbedtls_mpi_uint *d, mbedtls_mp
         MULADDC_CORE
         MULADDC_STOP
     }
-#else /* MULADDC_HUIT */
-    for( ; i >= 16; i -= 16 )
-    {
-        MULADDC_INIT
-        MULADDC_CORE   MULADDC_CORE
-        MULADDC_CORE   MULADDC_CORE
-        MULADDC_CORE   MULADDC_CORE
-        MULADDC_CORE   MULADDC_CORE
-
-        MULADDC_CORE   MULADDC_CORE
-        MULADDC_CORE   MULADDC_CORE
-        MULADDC_CORE   MULADDC_CORE
-        MULADDC_CORE   MULADDC_CORE
-        MULADDC_STOP
-    }
-
-    for( ; i >= 8; i -= 8 )
-    {
-        MULADDC_INIT
-        MULADDC_CORE   MULADDC_CORE
-        MULADDC_CORE   MULADDC_CORE
-
-        MULADDC_CORE   MULADDC_CORE
-        MULADDC_CORE   MULADDC_CORE
-        MULADDC_STOP
-    }
-
-    for( ; i > 0; i-- )
-    {
-        MULADDC_INIT
-        MULADDC_CORE
-        MULADDC_STOP
-    }
-#endif /* MULADDC_HUIT */
 
     t++;
 

Then I compiled and ran with mbed compile -t GCC_ARM -m DISCO_L475VG_IOT01A --flash --sterm, and got the following outputs before (with the #if 0 && disabling the optimisation) and after (with the optimisation enabled):

Before:

ECDH P-256:  1079 ms
ECDH P-384:  1520 ms
FFDH 2048:  3851 ms

After:

ECDH P-256:  1079 ms
ECDH P-384:  1522 ms
FFDH 2048:  3903 ms

Notice how the optimisation changes almost nothing for small bignums but seems to slightly decrease FFDH performance.

I'm not sure what to think of this. I double-checked the generated code with arm-none-eabi-objdump -d BUILD/DISCO_L475VG_IOT01A/GCC_ARM/bn_mul.elf | grep -A2 -B2 umaal to make sure the code is compiled as I expect, and it is. (Also, I checked that the size figure printed by the mbed tool on the connectivity line decreases as expected when the optimisation is enabled.)

Perhaps the improvements depend on the actual memory characteristics of the chip?

hanno-becker (Author)

This may be a code alignment issue: we should try aligning the size of MULADDC_CORE to 4 bytes (e.g. by forcing 32-bit instruction variants) and emitting an initial alignment directive in MULADDC_INIT.
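A rough (untested) sketch of that idea, assuming the GCC inline-assembly form of these macros: force 32-bit encodings with the .w suffix so each expansion's size is a multiple of 4 bytes, and emit a .p2align at the start of the block. (umaal and the post-indexed ldr/str only have 32-bit encodings anyway, so only the plain ldr actually changes.)

/* Untested sketch: .p2align 2 aligns the start of the asm block, and the
 * .w suffixes force wide encodings so the block size stays a multiple of
 * 4 bytes. */
#define MULADDC_INIT                            \
    asm(                                        \
            ".p2align 2                 \n\t"

#define MULADDC_CORE                            \
            "ldr.w  r0, [%0], #4        \n\t"   \
            "ldr.w  r1, [%1]            \n\t"   \
            "umaal  r1, %2, %3, r0      \n\t"   \
            "str.w  r1, [%1], #4        \n\t"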

tom-cosgrove-arm (Contributor)

@hanno-arm @mpg can we close this now that #5701 and #5706 have been merged?
