Improve inline assembly for Cortex-M + DSP #5360
On Cortex-M55, for example, it seems one can replace a pair of `MULADDC_CORE` steps with

```c
#define DOUBLE_MULADDC_CORE \
    "ldm %0!, {r0, r2}    \n\t" \
    "ldm %1, {r1, r3}     \n\t" \
    "umaal r1, %2, %3, r0 \n\t" \
    "umaal r3, %2, %3, r2 \n\t" \
    "stm %1!, {r1,r3}     \n\t"
```

which leads to a cycle count reduction. This is likely also going to be faster on Cortex-M4 where, if I remember correctly, grouped loads and stores issued as `ldm`/`stm` are cheaper than the equivalent sequence of single `ldr`/`str`.
I see. Btw, now I'm wondering how the "unrolling steps" were determined: 16, then 8, then down to 1. Clearly there's a trade-off between code size and performance here.
Btw, did you investigate whether doing 4 steps is doable (or perhaps we'd start running out of usable registers?) and whether it would also lead to a performance improvement? According to the M4 TRM, the cost of an LDM of N registers is 1 + N (and the cost of N consecutive LDRs is at worst 2N, but the footnote implies it's lower than that in some cases, perhaps even down to 1 + N in cases where an LDM could be used), so it's clearly in our interest to group loads as much as we can.
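For illustration, here is what a hypothetical 4-step variant could look like if the 2-step pattern above is simply extended (a sketch, not tested code; `r4`–`r7` would have to be added to the clobber list, which is exactly where the register-pressure concern comes in):

```c
/* Hypothetical 4-step variant (sketch only, not from the patch): load four
 * limbs of s and four of d at once, chain four umaal steps, then store the
 * four results. Needs r4-r7 in addition to r0-r3, so the clobber list grows
 * and the compiler may run short of registers for the %0-%3 operands. */
#define MULADDC_FAST4 \
    "ldm %0!, {r0, r2, r4, r6}  \n\t" \
    "ldm %1, {r1, r3, r5, r7}   \n\t" \
    "umaal r1, %2, %3, r0       \n\t" \
    "umaal r3, %2, %3, r2       \n\t" \
    "umaal r5, %2, %3, r4       \n\t" \
    "umaal r7, %2, %3, r6       \n\t" \
    "stm %1!, {r1, r3, r5, r7}  \n\t"
```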
Yes, exactly, that's what I also did for the measurements. While there are certainly more optimizations possible, it would be easy to integrate this one without major changes to our organization of the inline assembly.
Ok, I got curious about this unrolling thing in `mpi_mul_hlp()`.

For benchmarking, I used full ECDH key exchange with P-256 and P-384 (the most common and second most common sizes), and full FFDH key exchange with 2048 and 3072 bits (same reasoning), except when memory was too low (turns out the only M0 boards I have can't run FFDH-2048, and I even had to tune the config for P-384 to fit).

Cortex-M0 - units: ms/operation (less is better) / bytes of code

The performance impact is negligible for P-256. As hypothesized, 4-1 is slightly better than 8-1 for P-384, but the difference remains low (about 4%). The size impact is significant, with 988 bytes saved by moving to 8-1 and another 232 (for a total of 1220) with 4-1.

Cortex-M4 - units: ms/operation (less is better) / bytes of code

The performance impact on ECC is negligible; for FFDH, however, we lose 4% by moving to 8-1 and a further 14% (for a total of 19%) by moving to 4-1. (However, this can probably be more than compensated for by moving to the `umaal`-based double step.) The code size impact is not as large as for M0, since the M4 asm for a single `MULADDC_CORE` step is much more compact.

x86-64 - units: operations/second (more is better) / bytes of code

Measurements were repeated 3 times and the min-max range is given in order to convey the uncertainty.

The performance impact on FFDH is negligible. Removing the 16-step loop actually improves performance for ECC, probably because the code fits better in cache; there's no measurable difference between 8-1 and 4-1. Code size probably matters less on this platform, but the numbers behave as expected.

32-bit A-class: I didn't test on this platform, but I expect the results to be somewhere between M4 (using the same asm) and x86-64 (depending on cache).
See #5373.
Ok, this is weird: I can't reproduce the performance improvement. I created a simple mbed-os project with the following `main.cpp`:

```cpp
#include "mbed.h"
#include "mbedtls/ecdh.h"
#include "mbedtls/dhm.h"
Timer t;
#define TIMEIT(NAME, CODE) \
t.reset(); \
t.start(); \
CODE; \
t.stop(); \
printf("%10s: %5d ms\n", NAME, t.read_ms())
/* TEST only! */
int test_prng(void *ctx, unsigned char *output, size_t output_size)
{
(void) ctx;
for (unsigned i = 0; i < output_size; i++)
output[i] = (uint8_t) rand();
return 0;
}
#if 1
void ecdh(const char *name, mbedtls_ecp_group_id id)
{
mbedtls_ecdh_context ecdh;
unsigned char buf[100];
size_t olen;
mbedtls_ecdh_init( &ecdh );
mbedtls_ecp_group_load( &ecdh.grp, id );
TIMEIT( name,
int ret = mbedtls_ecdh_make_public( &ecdh, &olen, buf, sizeof( buf), test_prng, NULL );
if (ret != 0 ) printf("ret = %d = -0x%04x\n", ret, -ret);
ret = mbedtls_ecp_copy( &ecdh.Qp, &ecdh.Q );
if (ret != 0 ) printf("ret = %d = -0x%04x\n", ret, -ret);
ret = mbedtls_ecdh_calc_secret( &ecdh, &olen, buf, sizeof( buf ), test_prng, NULL );
if (ret != 0 ) printf("ret = %d = -0x%04x\n", ret, -ret);
);
mbedtls_ecdh_free( &ecdh );
}
#endif
#if 1
void ffdh(const char *name,
const unsigned char *p, size_t p_len,
const unsigned char *g, size_t g_len)
{
mbedtls_dhm_context dhm;
unsigned char buf[400];
size_t n, olen;
int ret;
mbedtls_dhm_init( &dhm );
ret = mbedtls_mpi_read_binary( &dhm.P, p, p_len );
if( ret != 0 ) printf("p: %d\n", ret);
ret = mbedtls_mpi_read_binary( &dhm.G, g, g_len );
if( ret != 0 ) printf("g: %d\n", ret);
dhm.len = n = mbedtls_mpi_size( &dhm.P );
TIMEIT( name,
ret = mbedtls_dhm_make_public( &dhm, (int) n, buf, n, test_prng, NULL );
if( ret != 0 ) printf("mp: %d = -0x%04x\n", ret, -ret);
ret = mbedtls_mpi_copy( &dhm.GY, &dhm.GX );
if( ret != 0 ) printf("cp: %d\n", ret);
ret = mbedtls_dhm_calc_secret( &dhm, buf, sizeof( buf ), &olen, test_prng, NULL );
if( ret != 0 ) printf("cs: %d = -0x%04x\n", ret, -ret);
);
mbedtls_dhm_free( &dhm );
}
#endif
int main()
{
#if 1
ecdh("ECDH P-256", MBEDTLS_ECP_DP_SECP256R1);
ecdh("ECDH P-384", MBEDTLS_ECP_DP_SECP384R1);
#endif
#if 1
const unsigned char p[] = MBEDTLS_DHM_RFC7919_FFDHE2048_P_BIN;
const unsigned char g[] = MBEDTLS_DHM_RFC7919_FFDHE2048_G_BIN;
ffdh("FFDH 2048", p, sizeof p, g, sizeof g);
#endif
}
```

and the following patch applied to mbed-os's copy of Mbed TLS:

```diff
diff --git a/connectivity/mbedtls/include/mbedtls/bn_mul.h b/connectivity/mbedtls/include/mbedtls/bn_mul.h
index 17d057f3abe9..ff9b003b7350 100644
--- a/connectivity/mbedtls/include/mbedtls/bn_mul.h
+++ b/connectivity/mbedtls/include/mbedtls/bn_mul.h
@@ -676,10 +676,17 @@
"umaal r1, %2, %3, r0 \n\t" \
"str r1, [%1], #4 \n\t"
+#define MULADDC_FAST2 \
+ "ldm %0!, {r0, r2} \n\t" \
+ "ldm %1, {r1, r3} \n\t" \
+ "umaal r1, %2, %3, r0 \n\t" \
+ "umaal r3, %2, %3, r2 \n\t" \
+ "stm %1!, {r1,r3} \n\t"
+
#define MULADDC_STOP \
: "=r" (s), "=r" (d), "=r" (c) \
: "r" (b), "0" (s), "1" (d), "2" (c) \
- : "r0", "r1", "memory" \
+ : "r0", "r1", "r2", "r3", "memory" \
);
#else
diff --git a/connectivity/mbedtls/include/mbedtls/config.h b/connectivity/mbedtls/include/mbedtls/config.h
index 6201d9910c49..a5ca5eb4303e 100644
--- a/connectivity/mbedtls/include/mbedtls/config.h
+++ b/connectivity/mbedtls/include/mbedtls/config.h
@@ -2651,7 +2651,7 @@
* See dhm.h for more details.
*
*/
-//#define MBEDTLS_DHM_C
+#define MBEDTLS_DHM_C
/**
* \def MBEDTLS_ECDH_C
diff --git a/connectivity/mbedtls/source/bignum.c b/connectivity/mbedtls/source/bignum.c
index 9cc5d66e3abf..c5092164fca3 100644
--- a/connectivity/mbedtls/source/bignum.c
+++ b/connectivity/mbedtls/source/bignum.c
@@ -1546,11 +1546,42 @@ void mpi_mul_hlp( size_t i, mbedtls_mpi_uint *s, mbedtls_mpi_uint *d, mbedtls_mpi_uint b )
{
mbedtls_mpi_uint c = 0, t = 0;
-#if defined(MULADDC_HUIT)
+ /*
+ * Unroll 8 times; this provides a reasonable compromise across platforms.
+ *
+ * Unrolling less hurts performance of FFDH/RSA on some platforms (for
+ * example, unrolling 4 rather than 8 times decreases performance by around
+ * 12% on Cortex-M4 cores). Unrolling more increases the code size linearly
+ * (for example, unrolling 16 rather than 8 times would increase the code
+ * size by around 250 bytes on Cortex-M0).
+ *
+ * Also, on 32-bit platforms, 256-bit numbers are 8 limbs, and this is a
+ * common size for ECC, widely used on constrained platforms.
+ *
+ * Use optimized 8/4/2-times version if available.
+ */
+#if 0 && defined(MULADDC_FAST2)
+#define MULADDC_2 MULADDC_FAST2
+#else
+#define MULADDC_2 MULADDC_CORE MULADDC_CORE
+#endif
+
+#if defined(MULADDC_FAST4)
+#define MULADDC_4 MULADDC_FAST4
+#else
+#define MULADDC_4 MULADDC_2 MULADDC_2
+#endif
+
+#if defined(MULADDC_FAST8)
+#define MULADDC_8 MULADDC_FAST8
+#else
+#define MULADDC_8 MULADDC_4 MULADDC_4
+#endif
+
for( ; i >= 8; i -= 8 )
{
MULADDC_INIT
- MULADDC_HUIT
+ MULADDC_8
MULADDC_STOP
}
@@ -1560,40 +1591,6 @@ void mpi_mul_hlp( size_t i, mbedtls_mpi_uint *s, mbedtls_mpi_uint *d, mbedtls_mp
MULADDC_CORE
MULADDC_STOP
}
-#else /* MULADDC_HUIT */
- for( ; i >= 16; i -= 16 )
- {
- MULADDC_INIT
- MULADDC_CORE MULADDC_CORE
- MULADDC_CORE MULADDC_CORE
- MULADDC_CORE MULADDC_CORE
- MULADDC_CORE MULADDC_CORE
-
- MULADDC_CORE MULADDC_CORE
- MULADDC_CORE MULADDC_CORE
- MULADDC_CORE MULADDC_CORE
- MULADDC_CORE MULADDC_CORE
- MULADDC_STOP
- }
-
- for( ; i >= 8; i -= 8 )
- {
- MULADDC_INIT
- MULADDC_CORE MULADDC_CORE
- MULADDC_CORE MULADDC_CORE
-
- MULADDC_CORE MULADDC_CORE
- MULADDC_CORE MULADDC_CORE
- MULADDC_STOP
- }
-
- for( ; i > 0; i-- )
- {
- MULADDC_INIT
- MULADDC_CORE
- MULADDC_STOP
- }
-#endif /* MULADDC_HUIT */
t++;
```

Then I compiled and ran it.
Notice how the optimisation changes almost nothing for small bignums but seems to slightly decrease FFDH performance. I'm not sure what to think of this. I double-checked the generated code with a disassembler. Perhaps the improvements depend on the actual memory characteristics of the chip?
This may be a code alignment issue: we should try aligning the size of the two code sequences and re-measuring.
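One cheap way to test the alignment hypothesis (a sketch based on my assumption that `MULADDC_INIT` for this target simply opens the `asm` statement; not something that was actually tried in the thread) would be to force the start of each multiply block to a fixed boundary, so the layout no longer depends on which variant was compiled in:

```c
/* Hypothetical alignment experiment (assumption, not from the thread's patch):
 * ".p2align 2" pads the start of the inline-assembly block with NOPs up to a
 * 4-byte boundary, so switching between the single-step and the ldm/stm
 * variants cannot shift the multiply code across a fetch boundary.
 * (Use ".p2align 3" for an 8-byte boundary instead.) */
#define MULADDC_INIT          \
    asm(                      \
        ".p2align 2   \n\t"
```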
Relates to: #4943

Suggested enhancement

This is about the inline assembly for `MULADDC_CORE` on MCUs with the DSP extension: `MULADDC_CORE` is called many times in a row, which allows the use of `ldm`/`stm` instructions, letting the CPU merge consecutive loads into a single cycle provided the data path is wide enough (e.g. Cortex-M55).