arm: reorganize files, optimize memcpy, memset #347

lukileczo · 2024-02-27T12:24:36Z

Description

Reorganize ARM files:

create common arm directory
extract common arm routines
add v7a, v7m subdirectories with files specific for that arch

memcpy optimization:

handle misaligned buffers,
copy in parts - basic cases cover:
- len < 64
- 64 <= len
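
A rough C sketch of the size split listed above may help when reading the benchmarks. It is not the assembly from this PR: the name `sketch_memcpy`, the byte-wise head that aligns `dst` to a word, and the plain inner loop standing in for wide loads and stores are all assumptions made for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical name; illustrates only the len < 64 / 64 <= len split. */
void *sketch_memcpy(void *dst, const void *src, size_t len)
{
	unsigned char *d = dst;
	const unsigned char *s = src;

	if (len >= 64) {
		/* Head (assumed): byte copies until dst is word aligned. */
		while (((uintptr_t)d & 3u) != 0) {
			*d++ = *s++;
			len--;
		}
		/* Body: whole 64-byte blocks. */
		while (len >= 64) {
			for (size_t i = 0; i < 64; i++) {
				d[i] = s[i];
			}
			d += 64;
			s += 64;
			len -= 64;
		}
	}

	/* Tail, and every copy shorter than 64 bytes. */
	while (len > 0) {
		*d++ = *s++;
		len--;
	}

	return dst;
}
```

Handling the remainder in one shared tail means the length does not need to be a multiple of the block size.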

memset optimization:

code divided into 2 cases:
- len < 64
- 64 <= len
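
The same two cases can be sketched for memset in C. This is an illustration under assumptions, not the routine added by this PR: `sketch_memset` and `fill_word_t` are hypothetical names, and the `may_alias` attribute is a GCC/Clang extension used so that word stores into a byte buffer stay well defined.

```c
#include <stddef.h>
#include <stdint.h>

/* GCC/Clang extension: word stores through this type may alias any object. */
typedef uint32_t __attribute__((may_alias)) fill_word_t;

/* Hypothetical name; illustrates only the len < 64 / 64 <= len split. */
void *sketch_memset(void *dst, int c, size_t len)
{
	unsigned char *d = dst;
	unsigned char b = (unsigned char)c;

	if (len >= 64) {
		/* Replicate the fill byte into all four bytes of a word. */
		uint32_t pattern = b * 0x01010101u;

		/* Head (assumed): byte stores until dst is word aligned. */
		while (((uintptr_t)d & 3u) != 0) {
			*d++ = b;
			len--;
		}
		/* Body: 64-byte blocks written as 16 word stores. */
		while (len >= 64) {
			fill_word_t *w = (fill_word_t *)d;
			for (int i = 0; i < 16; i++) {
				w[i] = pattern;
			}
			d += 64;
			len -= 64;
		}
	}

	/* Tail, and every fill shorter than 64 bytes. */
	while (len > 0) {
		*d++ = b;
		len--;
	}

	return dst;
}
```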

Memcpy benchmarks:

Old implementation

off          0 |        1 |        2 |        3 |        4 |        5 |        6 |        7 |        8 | 
--------------------------------------------------------------------------------------------------------
b8:        235 |      256 |      222 |      223 |      108 |      254 |      222 |      249 |      108 | 
b16:       167 |      412 |      442 |      435 |      164 |      438 |      439 |      409 |      167 | 
b32:       288 |      833 |      824 |      822 |      291 |      794 |      794 |      818 |      310 | 
b64:       588 |     1630 |     1594 |     1580 |      571 |     1604 |     1606 |     1586 |      547 | 
b128:     1040 |     3164 |     3178 |     3187 |     1013 |     3152 |     3168 |     3186 |     1039 | 
b256:     1928 |     6343 |     6355 |     6353 |     1916 |     6387 |     6381 |     6375 |     1909 | 
b512:     3685 |    12666 |    12681 |    12660 |     3668 |    12645 |    12664 |    12724 |     3709 | 
b1024:    7671 |    26407 |    26490 |    26427 |     7662 |    26550 |    26528 |    26549 |     7648 | 
b2048:   14812 |    51817 |    51801 |    51852 |    14779 |    51865 |    51663 |    51932 |    14777 | 
b4096:   29007 |   102689 |   102538 |   102299 |    29026 |   102677 |   102962 |   102688 |    29013 | 
b8192:   57764 |   204702 |   205158 |   204743 |    57673 |   205559 |   205738 |   204963 |    57732 | 
b16384  115163 |   414076 |   413271 |   412504 |   114849 |   413162 |   413003 |   413334 |   114911 | 
b32768  229685 |   830554 |   828351 |   828070 |   229596 |   831787 |   829131 |   828910 |   229745 |

off is the offset from an aligned address; the first column gives the number of bytes copied.
Times are the sum of 10000 iterations, in µs.
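
The benchmark program itself is not part of this conversation, so the following is only a guess at its shape. `bench_memcpy_us`, the buffer names and the use of standard `clock()` as a timer are assumptions; on the target a platform timer would likely be used instead.

```c
#include <stddef.h>
#include <string.h>
#include <time.h>

#define BENCH_MAX_LEN 32768u
#define BENCH_MAX_OFF 8u

/* Aligned buffers, so that off really is the distance from an aligned address. */
static _Alignas(64) unsigned char bench_src[BENCH_MAX_LEN + BENCH_MAX_OFF];
static _Alignas(64) unsigned char bench_dst[BENCH_MAX_LEN + BENCH_MAX_OFF];

/*
 * Total time in microseconds of `iters` copies of `len` bytes, with the same
 * offset applied to both buffers. Returns -1 for out-of-range arguments.
 */
double bench_memcpy_us(size_t len, size_t off, unsigned iters)
{
	/* Volatile pointer keeps the compiler from inlining or dropping the call. */
	void *(*volatile copy)(void *, const void *, size_t) = memcpy;
	clock_t start, end;

	if (len > BENCH_MAX_LEN || off > BENCH_MAX_OFF) {
		return -1.0;
	}

	start = clock();
	for (unsigned i = 0; i < iters; i++) {
		copy(bench_dst + off, bench_src + off, len);
	}
	end = clock();

	return (double)(end - start) * 1e6 / CLOCKS_PER_SEC;
}
```

One cell of a table would then correspond to a call such as `bench_memcpy_us(4096, 1, 10000)`.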

New implementation

off          0 |        1 |        2 |        3 |        4 |        5 |        6 |        7 |        8 | 
--------------------------------------------------------------------------------------------------------
b8:        251 |      213 |      238 |      212 |      211 |      214 |      211 |      231 |      240 | 
b16:       255 |      252 |      251 |      281 |      283 |      254 |      260 |      255 |      251 | 
b32:       341 |      369 |      337 |      338 |      341 |      341 |      359 |      343 |      330 | 
b64:       544 |      606 |      585 |      582 |      551 |      522 |      501 |      524 |      476 | 
b128:      680 |      790 |      770 |      742 |      716 |      712 |      697 |      672 |      626 | 
b256:     1065 |     1138 |     1119 |     1150 |     1089 |     1063 |     1073 |     1075 |      997 | 
b512:     2321 |     1853 |     1839 |     1808 |     1812 |     2217 |     2182 |     2162 |     2163 | 
b1024:    4298 |     4555 |     4542 |     4615 |     4613 |     4586 |     4587 |     4503 |     4512 | 
b2048:    7757 |     7376 |     7367 |     7578 |     7436 |     7427 |     7488 |     7428 |     7446 | 
b4096:   14355 |    14003 |    13998 |    14031 |    14089 |    13805 |    13730 |    13956 |    13695 | 
b8192:   26073 |    25767 |    25790 |    25832 |    25984 |    25158 |    25239 |    25148 |    25266 | 
b16384   50478 |    49692 |    49739 |    49582 |    50499 |    48158 |    48262 |    48830 |    48462 | 
b32768   99926 |    98652 |    98188 |   100087 |    97771 |    94726 |    94758 |    95663 |    96211 |

About 2-2.5x speedup with aligned addresses and up to 8.5x with unaligned access.

Motivation and Context

Types of changes

Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to change)

How Has This Been Tested?

Already covered by automatic testing.
New test added: (add PR link here).
Tested by hand on: armv7a9-zynq7000-qemu

Checklist:

My change requires a change to the documentation.
I have updated the documentation accordingly.
I have added tests to cover my changes.
All new and existing linter checks and tests passed.
My changes generate no new compilation warnings for any of the targets.

Special treatment

This PR needs additional PRs to work (list the PRs, preferably in merge-order).
I will merge this PR by myself when appropriate.

arch/armv7a/memcpy.S

github-actions · 2024-02-27T12:39:39Z

Unit Test Results

7 254 tests (+99): 6 543 ✅ (+95), 711 💤 (+4), 0 ❌ (±0)
408 suites (+11), 1 file (±0), 38m 13s ⏱️ (+4m 3s)

Results for commit 860f1ae. ± Comparison against base commit 8e7daf8.

♻️ This comment has been updated with latest results.

nalajcie · 2024-02-27T13:37:37Z

Maybe also check len not divisible by 2? (Probably a more common case than an unaligned beginning of the buffer.)
Does off from the aligned address apply to src/dst, or to the difference between them?

Maybe we can benchmark it against naive C-only implementation with loop unrolling - just for reference? - eg. newlib one
(https://github.com/bminor/newlib/blob/master/newlib/libc/string/memcpy.c)?

Did you try benchmarking it against eg. uClibc-ng arm asm optimized version? (https://github.com/wbx-github/uclibc-ng/blob/master/libc/string/arm/_memcpy.S)

kemonats · 2024-02-29T09:06:35Z

arch/armv7a/memcpy.S

+	cmp LEN, #64
+	mov DST_RET, DST /* preserve return value */
+
+	bhs .LblkCopy
+
+	/* less than 64 bytes - always copy as if block was always unaligned */
+
+.Ltail63Unaligned:
+	/* unaligned copy, 0-63 bytes */
+
+	/* r3 = LEN / 4 */
+	movs r3, LEN, lsr #2
+	beq .Ltail63Un0
+
+.Ltail63Un4:
+	ldr r2, [SRC], #4
+	str r2, [DST], #4
+	subs r3, #1
+	bne .Ltail63Un4


Is unaligned memory access allowed?

lukileczo · 2024-02-29

Yes, ldr and str instructions allow unaligned addresses (with some performance penalty, which doesn't really matter here as we're copying only up to 60 bytes that way); see the ARM documentation.

Edit: OK, it may not always work, I'll change that.

kemonats · 2024-02-29

I meant in your project, not in general.

agkaminski · 2024-02-29

After some discussion I think it will be beneficial to allow it, at least in this case. It will need a small CPU initialization change, though. @nalajcie What do you think? Enabling unaligned access will make memcpy simpler and faster (an unaligned 4-byte access should still be faster than 4 separate 1-byte accesses).

nalajcie · 2024-02-29

I'm not sure, but I think unaligned access is supported only in ARM mode (https://developer.arm.com/documentation/ddi0308/d/Programmers--Model/Unaligned-access-support/Load-and-store-alignment-checks), so the resulting code would actually be larger than a Thumb version with correct alignment. Not sure about the performance implications, though.

If this were the only blocker against switching to Thumb, then IMHO we can copy it byte-by-byte (as it's only up to 60 bytes)?

For short (up to 64 B?) aligned (not sure if only aligned?) memcpy calls, gcc would probably provide an alternative inline implementation (https://gcc.gnu.org/git/?p=gcc.git;a=blob;f=gcc/expr.cc;h=8d34d024c9c1248cb36bbfd78f90e9514cee513e;hb=refs/heads/master#l1978 is responsible for the heuristics, but it might be easier to test it experimentally than to understand the code), so we should assume this code would be called mostly for unaligned pointers anyway (each str would be split into byte/halfword accesses?).

agkaminski · 2024-02-29

@nalajcie Hmm, what makes you think it doesn't work in Thumb mode?

nalajcie · 2024-02-29

Probably some invalid info I found on the internet (https://s-o-c.org/does-arm-allow-unaligned-access/). The reference page I linked in the previous comment indeed says that U=1 on all modern ARMs, so STR/LDR should just produce an unaligned access - please disregard that part of the comment :).

Sorry for providing invalid info. If we choose to keep the unaligned access, maybe we should say so explicitly in a comment, to ensure nobody thinks in the future that it's an error on our part?
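
For readers following this thread, the quoted tail loop can be approximated in C. This is a sketch, not a translation of the final code: `sketch_tail63` is a hypothetical name, and the byte loop for the remaining `len % 4` bytes is an assumption about what follows at `.Ltail63Un0`, which the quoted diff does not show. Expressing each word copy through a fixed-size `memcpy` leaves the alignment decision to the compiler: where unaligned access is permitted it can typically emit a single `ldr`/`str` pair, otherwise it falls back to narrower accesses.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: copies 0-63 bytes without assuming any alignment. */
void sketch_tail63(unsigned char *dst, const unsigned char *src, size_t len)
{
	size_t words = len >> 2; /* same as r3 = LEN / 4 in the quoted code */

	while (words > 0) {
		uint32_t w;

		memcpy(&w, src, sizeof(w)); /* candidate for one unaligned ldr */
		memcpy(dst, &w, sizeof(w)); /* candidate for one unaligned str */
		src += 4;
		dst += 4;
		words--;
	}

	/* Remaining 0-3 bytes (assumed handling, not shown in the diff). */
	for (len &= 3u; len > 0; len--) {
		*dst++ = *src++;
	}
}
```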

arch/armv7a/memcpy.S

lukileczo · 2024-03-01T17:12:29Z

@nalajcie regarding your comment - len divisibility (by 2, 4, whatever) does not meaningfully impact performance, as it's always handled in the tail. off in the benchmarks is the same value for both src and dst: memcpy(src + off, dst + off, len). Below I attached benchmarks where src and dst aren't mutually aligned (however, there is no visible performance impact).
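
"Mutually aligned" here means that both pointers sit at the same offset within a word, so aligning one also aligns the other. A small C sketch of that check, assuming 4-byte words (`mutually_aligned` is a hypothetical helper, not a function from this PR):

```c
#include <stdbool.h>
#include <stdint.h>

/* True when dst and src have the same offset within a 4-byte word. */
bool mutually_aligned(const void *dst, const void *src)
{
	return (((uintptr_t)dst ^ (uintptr_t)src) & 3u) == 0;
}
```

With the same off added to two aligned buffers the check holds for every off; adding 1 to only one of them, as in the "misaligned by 1" tables, makes it fail.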

Edit: I've managed to simplify the code a little bit and squeeze better performance, see next comment.

Some more benchmarks:

New `memcpy`

`.arm`

src/dst mutually aligned

off          0 |        1 | 
------------------------------
b8:        247 |      208 | 
b16:       248 |      248 | 
b32:       368 |      358 | 
b64:       414 |      551 | 
b128:      594 |      737 | 
b256:      929 |     1087 | 
b512:     2185 |     1935 | 
b1024:    3510 |     3754 | 
b2048:    6284 |     7016 | 
b4096:   12579 |    13195 | 
b8192:   25140 |    25711 | 
b16384   50216 |    51356 | 
b32768   98060 |   101563 |

dst misaligned by 1 - `memcpy(src + off, dst + off + 1, len)`

off          0 |        1 | 
------------------------------
b8:        231 |      207 | 
b16:       320 |      263 | 
b32:       321 |      330 | 
b64:       653 |      617 | 
b128:      774 |      774 | 
b256:     1202 |     1184 | 
b512:     1967 |     1984 | 
b1024:    3367 |     3425 | 
b2048:    6320 |     6674 | 
b4096:   12178 |    12107 | 
b8192:   24066 |    24157 | 
b16384   47823 |    48275 | 
b32768   94652 |    99616 |

src misaligned by 1 - `memcpy(src + off + 1, dst + off, len)`

off          0 |        1 | 
------------------------------
b8:        327 |      237 | 
b16:       248 |      247 | 
b32:       327 |      323 | 
b64:       402 |      613 | 
b128:      542 |      805 | 
b256:      920 |     1149 | 
b512:     1813 |     1957 | 
b1024:    3450 |     3367 | 
b2048:    6027 |     6387 | 
b4096:   11918 |    12246 | 
b8192:   23796 |    24361 | 
b16384   47536 |    47897 | 
b32768   94394 |    94629 |

`.thumb`

src/dst mutually aligned

This is a little concerning to me, as the performance of the aligned copy is lower than that of the unaligned one, but manual loop alignment stays the same (32 bytes, equal to the icache line), so I don't really know why it's happening.

off          0 |        1 | 
------------------------------
b8:        248 |      207 | 
b16:       246 |      246 | 
b32:       325 |      361 | 
b64:       441 |      537 | 
b128:      604 |      724 | 
b256:      890 |     1101 | 
b512:     2515 |     1891 | 
b1024:    4386 |     4455 | 
b2048:    8489 |     8667 | 
b4096:   16865 |    16790 | 
b8192:   33021 |    32957 | 
b16384   64768 |    64599 | 
b32768  129154 |   128443 |

dst misaligned by 1

off          0 |        1 | 
------------------------------
b8:        224 |      203 | 
b16:       246 |      263 | 
b32:       406 |      321 | 
b64:       611 |      606 | 
b128:      817 |      770 | 
b256:     1174 |     1174 | 
b512:     1996 |     1945 | 
b1024:    3485 |     3370 | 
b2048:    6345 |     6316 | 
b4096:   12284 |    12151 | 
b8192:   24209 |    23919 | 
b16384   48020 |    47358 | 
b32768   94502 |    94557 |

src misaligned by 1

off          0 |        1 | 
------------------------------
b8:        224 |      208 | 
b16:       242 |      247 | 
b32:       396 |      326 | 
b64:       522 |      798 | 
b128:      749 |      818 | 
b256:     1094 |     1368 | 
b512:     1906 |     2128 | 
b1024:    3308 |     3572 | 
b2048:    6256 |     6883 | 
b4096:   12146 |    13084 | 
b8192:   25058 |    24164 | 
b16384   53383 |    49236 | 
b32768  105324 |    96130 |

C memcpy

dst/src mutually aligned

(https://github.com/bminor/newlib/blob/master/newlib/libc/string/memcpy.c)

off          0 |        1 | 
------------------------------
b8:        445 |      414 | 
b16:       346 |      545 | 
b32:       415 |     1024 | 
b64:       488 |     1493 | 
b128:      761 |     2770 | 
b256:     1257 |     5344 | 
b512:     2318 |    10380 | 
b1024:    4242 |    20532 | 
b2048:    8271 |    40605 | 
b4096:   16320 |    81429 | 
b8192:   32325 |   160967 | 
b16384   64264 |   320014 | 
b32768  130048 |   640968 |

uClibc - `.thumb`

(https://github.com/wbx-github/uclibc-ng/blob/master/libc/string/arm/_memcpy.S)

src/dst mutually aligned

off          0 |        1 | 
------------------------------
b8:        388 |      392 | 
b16:       305 |      349 | 
b32:       383 |      431 | 
b64:       481 |      528 | 
b128:      651 |      821 | 
b256:     1031 |     1099 | 
b512:     1767 |     1848 | 
b1024:    3336 |     3359 | 
b2048:    6431 |     6442 | 
b4096:   12597 |    12598 | 
b8192:   24757 |    24930 | 
b16384   49676 |    49730 | 
b32768  100916 |   101239 |

dst misaligned by 1

off          0 |        1 | 
------------------------------
b8:        436 |      322 | 
b16:       369 |      370 | 
b32:       478 |      488 | 
b64:       593 |      592 | 
b128:      844 |      841 | 
b256:     1368 |     1348 | 
b512:     2506 |     2299 | 
b1024:    4489 |     4292 | 
b2048:    8738 |     8458 | 
b4096:   16536 |    16385 | 
b8192:   31989 |    32126 | 
b16384   63783 |    63841 | 
b32768  129096 |   128766 |

src misaligned by 1

off          0 |        1 | 
------------------------------
b8:        321 |      326 | 
b16:       413 |      394 | 
b32:       381 |      462 | 
b64:       552 |      608 | 
b128:      767 |      850 | 
b256:     1235 |     1471 | 
b512:     2273 |     2346 | 
b1024:    4153 |     4440 | 
b2048:    8363 |     8351 | 
b4096:   16380 |    16209 | 
b8192:   31899 |    32024 | 
b16384   63710 |    63781 | 
b32768  128189 |   128997 |

lukileczo · 2024-03-04T09:19:52Z

Upon further investigation, I've noticed that only 64-byte alignment lets the loops consistently reach the highest performance. .arm code doesn't provide any advantage over .thumb, so the resulting code is written in .thumb. I've also dropped the 128-byte block copy case, which didn't provide any speedup compared to the 64-byte block copy, possibly due to caching. Dropping this case also makes the resulting code a little smaller and simpler. Benchmarks below.

src/dst mutually aligned

off          0 |        1 | 
-------------------------------
b8:        399 |      259 | 
b16:       300 |      300 | 
b32:       387 |      407 | 
b64:       504 |      568 | 
b128:      634 |      741 | 
b256:      964 |     1095 | 
b512:     1774 |     1800 | 
b1024:    3276 |     3208 | 
b2048:    8593 |     7918 | 
b4096:   14299 |    13416 | 
b8192:   25874 |    25052 | 
b16384   48811 |    47742 | 
b32768   94385 |    93635 |

dst misaligned by 1

off          0 |        1 | 
-------------------------------
b8:        288 |      291 | 
b16:       319 |      303 | 
b32:       417 |      380 | 
b64:       711 |      770 | 
b128:     1018 |      985 | 
b256:     1365 |     1317 | 
b512:     2060 |     2014 | 
b1024:    3514 |     3524 | 
b2048:    7666 |     7874 | 
b4096:   13610 |    13610 | 
b8192:   25248 |    25396 | 
b16384   48967 |    48811 | 
b32768   95953 |    95718 |

src misaligned by 1

off          0 |        1 | 
-------------------------------
b8:        295 |      260 | 
b16:       387 |      300 | 
b32:       382 |      411 | 
b64:       688 |      684 | 
b128:      961 |      978 | 
b256:     1315 |     1365 | 
b512:     2023 |     2154 | 
b1024:    3544 |     3647 | 
b2048:    7899 |     7899 | 
b4096:   14479 |    13891 | 
b8192:   25419 |    25641 | 
b16384   48872 |    49038 | 
b32768   96160 |    95916 |

nalajcie · 2024-03-04T11:14:14Z

Thanks for the comprehensive benchmarks. I see that:

- our new implementation is much faster, especially for aligned buffers (IMHO the most common use case for bigger buffers), which I'm very happy about :)
- the newlib C implementation is faster for aligned buffers than our original ASM implementation; maybe we should introduce a similar "fallback" C implementation for new targets to benefit from that
- loop alignment/unrolling is especially useful for Cortex-M4, which doesn't have a branch predictor (unaligned jumps would take more cycles to refill the pipeline, but aligning to a word should be enough); on M7 the branch predictor will guess correctly most of the time but will sometimes fail (that's why times might be inconsistent), and the ICache probably also has an impact
- I don't understand why len=4096 is faster for off=1 (13416) than for off=0 (14299), and similarly why dst not mutually aligned with src is faster (13610 vs 14299). It seems there is something we don't understand about the aligned use case (cache invalidation issues?), as the uClibc timings seem more logical (memcpy that is not mutually aligned is slower, off=1 is the same as or slower than off=0)
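
The "fallback" C implementation mentioned above could look roughly like the sketch below, which follows the general structure of the linked newlib memcpy (word copies only when both pointers are aligned, bytes otherwise). It is an illustration, not code from newlib or from this repository: `fallback_memcpy` and `copy_word_t` are hypothetical names, the word size is fixed at 4 bytes, and `may_alias` is a GCC/Clang extension.

```c
#include <stddef.h>
#include <stdint.h>

/* GCC/Clang extension: word accesses through this type may alias any object. */
typedef uint32_t __attribute__((may_alias)) copy_word_t;

/* Hypothetical name; a portable-ish fallback for targets without an asm memcpy. */
void *fallback_memcpy(void *dst, const void *src, size_t len)
{
	unsigned char *d = dst;
	const unsigned char *s = src;

	/* Word path only when both pointers are word aligned. */
	if (len >= sizeof(copy_word_t) && (((uintptr_t)d | (uintptr_t)s) & 3u) == 0) {
		copy_word_t *dw = (copy_word_t *)d;
		const copy_word_t *sw = (const copy_word_t *)s;

		/* Unrolled: four words per iteration. */
		while (len >= 4 * sizeof(copy_word_t)) {
			dw[0] = sw[0];
			dw[1] = sw[1];
			dw[2] = sw[2];
			dw[3] = sw[3];
			dw += 4;
			sw += 4;
			len -= 4 * sizeof(copy_word_t);
		}
		/* Then one word at a time. */
		while (len >= sizeof(copy_word_t)) {
			*dw++ = *sw++;
			len -= sizeof(copy_word_t);
		}
		d = (unsigned char *)dw;
		s = (const unsigned char *)sw;
	}

	/* Unaligned pointers and leftovers are copied byte by byte. */
	while (len > 0) {
		*d++ = *s++;
		len--;
	}

	return dst;
}
```

The byte-only path for unaligned pointers is consistent with the C memcpy table earlier in the thread, where off=1 is several times slower than off=0.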

Note - for Cortex-A we could probably use NEON extensions to provide even faster memcpy, but let's not dwell on that for now. I've found some NEON implementations of libc functions if you would be up to experiment in the future: https://github.com/genesi/imx-libc-neon/blob/master/memcpy-neon.S 👀

lukileczo · 2024-03-04T14:06:34Z

memset benchmarks:

New implementation

off          0 |        1 | 
----------------------------------------------
b8:        216 |      369 | 
b16:       223 |      434 | 
b32:       314 |      446 | 
b64:       415 |      461 | 
b128:      475 |      477 | 
b256:      630 |      676 | 
b512:      996 |     1029 | 
b1024:    1741 |     1761 | 
b2048:    3168 |     3202 | 
b4096:    6077 |     6084 | 
b8192:   12422 |    12433 | 
b16384   24003 |    24086 | 
b32768   47230 |    47339 |

Old implementation

off          0 |        1 | 
----------------------------------------------
b8:        124 |      157 | 
b16:       126 |      291 | 
b32:       218 |      585 | 
b64:       470 |     1124 | 
b128:      780 |     2224 | 
b256:     1461 |     4508 | 
b512:     2792 |     9000 | 
b1024:    5378 |    17972 | 
b2048:   10628 |    35711 | 
b4096:   21316 |    71442 | 
b8192:   42512 |   147078 | 
b16384   85006 |   286835 | 
b32768  170727 |   574138 |

uClibc

off          0 |        1 | 
----------------------------------------------
b8:        199 |      285 | 
b16:       206 |      318 | 
b32:       241 |      329 | 
b64:       349 |      398 | 
b128:      469 |      578 | 
b256:      858 |      890 | 
b512:     1394 |     1471 | 
b1024:    2708 |     2834 | 
b2048:    5153 |     5227 | 
b4096:   10109 |    10150 | 
b8192:   19936 |    20028 | 
b16384   43101 |    42934 | 
b32768   82452 |    82244 |

JIRA: RTOS-789

lukileczo · 2024-03-07T09:47:37Z

I've also tested and benchmarked the implementations on Nucleo STM32L4A6. Iterations were reduced to 1000.
Unaligned src and dst have very similar performance, so I'm posting only dst.
Unaligned ldr/str instructions work without any change in the CPU registers.
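
Because the iteration count differs from the earlier runs (1000 here, 10000 before), the raw totals are not directly comparable across tables. A small C sketch for normalizing them, assuming the totals below are also in µs (only the first table states its unit); `per_call_us` and `throughput_mb_s` are hypothetical helper names:

```c
#include <stddef.h>

/* Average time of one call, in microseconds. */
double per_call_us(double total_us, unsigned iters)
{
	return total_us / (double)iters;
}

/* Bytes per microsecond, numerically equal to MB/s (10^6 bytes per second). */
double throughput_mb_s(double total_us, unsigned iters, size_t len)
{
	return (double)len / per_call_us(total_us, iters);
}
```

For the mutually aligned b32768, off=0 rows below, this gives about 17.8 MB/s for the new memcpy (1844483 total) and about 8.4 MB/s for the old one (3892334 total).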

`memcpy`

New implementation

mutually aligned

off          0 |        1 | 
----------------------------------------------
b8:       1953 |     2930 | 
b16:      2930 |     4150 | 
b32:      4151 |     6836 | 
b64:      6835 |     9277 | 
b128:    10254 |    13916 | 
b256:    17578 |    20996 | 
b512:    31739 |    35157 | 
b1024:   60547 |    64209 | 
b2048:  118164 |   121826 | 
b4096:  233398 |   236816 | 
b8192:  463623 |   467041 | 
b16384  923828 |   927491 | 
b32768 1844483 |  1848388 |

unaligned `dst`

off          0 |        1 | 
----------------------------------------------
b8:       3174 |     2930 | 
b16:      3662 |     3906 | 
b32:      5859 |     6348 | 
b64:     12207 |    12451 | 
b128:    19287 |    19775 | 
b256:    30518 |    30762 | 
b512:    52979 |    53223 | 
b1024:   98144 |    98632 | 
b2048:  188477 |   188721 | 
b4096:  368408 |   368897 | 
b8192:  729248 |   729248 | 
b16384 1449951 |  1450439 | 
b32768 2892090 |  2892334 |

Old implementation

Mutually aligned

off          0 |        1 | 
----------------------------------------------
b8:       2686 |     3906 | 
b16:      3174 |     7080 | 
b32:      5127 |    13672 | 
b64:      9033 |    26611 | 
b128:    16602 |    52735 | 
b256:    31738 |   104736 | 
b512:    62256 |   208985 | 
b1024:  122802 |   417480 | 
b2048:  244629 |   834229 | 
b4096:  487793 |  1668213 | 
b8192:  974121 |  3335937 | 
b16384 1946778 |  6670410 | 
b32768 3892334 | 13341553 |

Unaligned dst

off           0 |        1 | 
----------------------------------------------
b8:        4883 |     4639 | 
b16:       8300 |     8301 | 
b32:      15870 |    15869 | 
b64:      31005 |    31006 | 
b128:     61524 |    61523 | 
b256:    122314 |   122315 | 
b512:    243897 |   243896 | 
b1024:   487060 |   487061 | 
b2048:   973389 |   973633 | 
b4096:  1946777 |  1946045 | 
b8192:  3892090 |  3891845 | 
b16384  7784180 |  7782959 | 
b32768 15562988 | 15563721 |

`memset`

New implementation

off          0 |        1 | 
----------------------------------------
b8:       2197 |     2685 | 
b16:      2686 |     3418 | 
b32:      3906 |     5371 | 
b64:      8057 |     9766 | 
b128:     9766 |    12207 | 
b256:    13916 |    16113 | 
b512:    21240 |    23682 | 
b1024:   36621 |    38818 | 
b2048:   66894 |    69336 | 
b4096:  127686 |   129883 | 
b8192:  249268 |   251709 | 
b16384  492431 |   495117 | 
b32768  979493 |   981690 |

Old implementation

off          0 |        1 | 
----------------------------------------
b8:       2685 |     3418 | 
b16:      2686 |     6347 | 
b32:      4394 |    11475 | 
b64:      7813 |    22461 | 
b128:    14404 |    44189 | 
b256:    27344 |    87403 | 
b512:    53467 |   174316 | 
b1024:  105468 |   348389 | 
b2048:  209716 |   696045 | 
b4096:  418457 |  1390869 | 
b8192:  835450 |  2781250 | 
b16384 1669433 |  5560303 | 
b32768 3337647 | 11120361 |

arch/arm/memset.S

github-actions bot reviewed Feb 27, 2024

arch/armv7a/memcpy.S (outdated, resolved)

lukileczo force-pushed the lukileczo/memcpy branch from 6f36284 to 012e135 Compare February 27, 2024 13:08

kemonats reviewed Feb 29, 2024

agkaminski reviewed Feb 29, 2024

lukileczo force-pushed the lukileczo/memcpy branch 4 times, most recently from a714c76 to 0e88eac Compare March 1, 2024 15:31

lukileczo force-pushed the lukileczo/memcpy branch from 1ecd512 to 5ce00c1 Compare March 1, 2024 17:18

lukileczo force-pushed the lukileczo/memcpy branch 6 times, most recently from 83b4346 to 1e40a40 Compare March 4, 2024 10:58

lukileczo force-pushed the lukileczo/memcpy branch from bce99b7 to 41b2472 Compare March 4, 2024 13:56

lukileczo force-pushed the lukileczo/memcpy branch 2 times, most recently from 90d27b0 to 47e0823 Compare March 5, 2024 09:46

lukileczo marked this pull request as ready for review March 5, 2024 10:00

lukileczo added 2 commits March 7, 2024 10:22

arch/arm: reorganize common arm files

bb28b02

JIRA: RTOS-789

arch/arm: add optimized memcpy and memset implementation

860f1ae

JIRA: RTOS-789

lukileczo force-pushed the lukileczo/memcpy branch from 47e0823 to 860f1ae Compare March 7, 2024 09:26

lukileczo changed the title ~~armv7-a: optimize memcpy, memset~~ arm: reorganize files, optimize memcpy, memset Mar 7, 2024

lukileczo requested a review from agkaminski March 8, 2024 15:05

agkaminski reviewed Mar 8, 2024

arch/arm/memset.S (resolved)

agkaminski approved these changes Mar 8, 2024

agkaminski merged commit 5e649dc into master Mar 8, 2024
30 checks passed

agkaminski deleted the lukileczo/memcpy branch March 8, 2024 15:53

lukileczo mentioned this pull request Mar 8, 2024

arch: add armv8m #342

Draft

13 tasks
