-
Notifications
You must be signed in to change notification settings - Fork 16
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
arm: reorganize files, optimize memcpy, memset #347
Conversation
6f36284
to
012e135
Compare
Maybe we can benchmark it against naive C-only implementation with loop unrolling - just for reference? - eg. newlib one Did you try benchmarking it against eg. uClibc-ng arm asm optimized version? (https://github.com/wbx-github/uclibc-ng/blob/master/libc/string/arm/_memcpy.S) |
arch/armv7a/memcpy.S
Outdated
cmp LEN, #64 | ||
mov DST_RET, DST /* preserve return value */ | ||
|
||
bhs .LblkCopy | ||
|
||
/* less than 64 bytes - always copy as if block was always unaligned */ | ||
|
||
.Ltail63Unaligned: | ||
/* unaligned copy, 0-63 bytes */ | ||
|
||
/* r3 = LEN / 4 */ | ||
movs r3, LEN, lsr #2 | ||
beq .Ltail63Un0 | ||
|
||
.Ltail63Un4: | ||
ldr r2, [SRC], #4 | ||
str r2, [DST], #4 | ||
subs r3, #1 | ||
bne .Ltail63Un4 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is unaligned memory access allowed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, ldr
and str
instructions allow for unaligned addresses (but with some performance penalty - which doesn't really matter here as we're copying only up to 60 bytes that way). ARM documentation
edit: ok, it may not always work, I'll change that
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I meant your project, not in general.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After some discussion I think it will be beneficial to allow it at least in this case. It will need small CPU initialization change, though. @nalajcie What do you think? Enabling unaligned access will make memcpy simpler and faster (as unaligned 4 byte access should still be faster than 4 separate 1 byte accesses)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure but unaligned access is supported only in arm
mode (https://developer.arm.com/documentation/ddi0308/d/Programmers--Model/Unaligned-access-support/Load-and-store-alignment-checks), so the resulting bytecode would actually be larger than the thumb version with correct alignment. Not sure about the performance implications THO.
If this would be the only blocker against switching to thumb
then IMHO we can copy it byte-by-byte (as it's only up to 60 bytes)?
For short (up to 64b?) aligned (not sure if only?) memcpy
gcc would probably provide alternative inline implementation (https://gcc.gnu.org/git/?p=gcc.git;a=blob;f=gcc/expr.cc;h=8d34d024c9c1248cb36bbfd78f90e9514cee513e;hb=refs/heads/master#l1978 is responsible for heuristics but it might be easier to test it experimentally than unrestand the code), so we should assume this code would be called mostly for unaligned pointers anyway (each str
would be split into byte/halfword access?).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@nalajcie Hmm, what makes you think it doesn't work in thumb mode?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably some invalid info I've found on internet (https://s-o-c.org/does-arm-allow-unaligned-access/). The reference page I've linked in previous comment indeed says that U=1 on all modern ARMs so STR/LDR
should just produce unaligned access - please just scratch out that part of the comment :).
Sorry for providing invalid info. If we choose to keep the unaligned access, maybe we should explicitly say it in comment to ensure nobody would think in the future that it's an error on our part?
a714c76
to
0e88eac
Compare
@nalajcie regarding your comment - len divisibility (by 2, 4, whatever) does not meaningfully impact performance, as it's always handled in the tail. Edit: I've managed to simplify the code a little bit and squeeze better performance, see next comment. Some more benchmarks: New
|
1ecd512
to
5ce00c1
Compare
Upon further investigation, I've noticed that only 64 byte alignment allows the loops to constantly provide the highest performance. src/dst mutually aligned
dst misaligned by 1
src misaligned by 1
|
83b4346
to
1e40a40
Compare
Thanks for the comprehensive benchmarks, I see that:
Note - for Cortex-A we could probably use NEON extensions to provide even faster memcpy, but let's not dwell on that for now. I've found some NEON implementations of libc functions if you would be up to experiment in the future: https://github.com/genesi/imx-libc-neon/blob/master/memcpy-neon.S 👀 |
bce99b7
to
41b2472
Compare
New implementation
Old implementation
uClibc
|
90d27b0
to
47e0823
Compare
JIRA: RTOS-789
JIRA: RTOS-789
47e0823
to
860f1ae
Compare
I've also tested and benchmarked the implementations on Nucleo STM32L4A6. Iterations were reduced to 1000.
|
Description
Reorganize ARM files:
arm
directoryv7a
,v7m
subdirectories with files specific for that archmemcpy
optimization:len
len
memset
optimization:len
len
Memcpy benchmarks:
Old implementation
off
is offset from aligned address, first column contains number of bytes copied.Times are a sum of 10000 iterations in us.
New implementation
About 2 - 2,5x speed up with aligned addresses and up to 8,5x with unaligned access.
Motivation and Context
Types of changes
How Has This Been Tested?
armv7a9-zynq7000-qemu
Checklist:
Special treatment