
arm: reorganize files, optimize memcpy, memset #347

Merged

merged 2 commits into master from lukileczo/memcpy on Mar 8, 2024

Conversation

lukileczo
Member

@lukileczo lukileczo commented Feb 27, 2024

Description

Reorganize ARM files:

  • create a common arm directory
  • extract common ARM routines
  • add v7a and v7m subdirectories with files specific to each architecture

memcpy optimization:

  • handle misaligned buffers
  • copy in parts - the basic cases cover:
    • len < 64
    • 64 ≤ len

memset optimization:

  • code divided into 2 cases (see the sketch below):
    • len < 64
    • 64 ≤ len
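
For illustration, a minimal sketch of how such a two-case memset can be structured (hypothetical code assuming the usual AAPCS arguments r0 = dst, r1 = fill byte, r2 = len; this is not the PR's actual implementation):

    /* Sketch only: two-case memset dispatch; all labels hypothetical */
    memset_sketch:
        mov     r3, r0                  /* preserve return value */
        and     r1, r1, #0xff
        orr     r1, r1, r1, lsl #8      /* replicate fill byte -> halfword */
        orr     r1, r1, r1, lsl #16     /* halfword -> full word */
        cmp     r2, #64
        blo     .Lbytes                 /* len < 64: plain byte loop */
    .Lalign:                            /* 64 <= len: word-align dst first */
        tst     r0, #3
        beq     .Lblocks
        strb    r1, [r0], #1
        sub     r2, r2, #1
        b       .Lalign
    .Lblocks:                           /* fill 16 bytes per iteration */
        subs    r2, r2, #16
        blo     .Ltail
    .Lloop16:
        str     r1, [r0], #4
        str     r1, [r0], #4
        str     r1, [r0], #4
        str     r1, [r0], #4
        subs    r2, r2, #16
        bhs     .Lloop16
    .Ltail:
        adds    r2, r2, #16             /* undo the final bias */
    .Lbytes:
        subs    r2, r2, #1
        blo     .Ldone
        strb    r1, [r0], #1
        b       .Lbytes
    .Ldone:
        mov     r0, r3                  /* return original dst */
        bx      lr

A real implementation would likely store larger blocks with stmia over several registers; the str-based loop just keeps the sketch register-light.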

Memcpy benchmarks:

Old implementation

off          0 |        1 |        2 |        3 |        4 |        5 |        6 |        7 |        8 | 
--------------------------------------------------------------------------------------------------------
b8:        235 |      256 |      222 |      223 |      108 |      254 |      222 |      249 |      108 | 
b16:       167 |      412 |      442 |      435 |      164 |      438 |      439 |      409 |      167 | 
b32:       288 |      833 |      824 |      822 |      291 |      794 |      794 |      818 |      310 | 
b64:       588 |     1630 |     1594 |     1580 |      571 |     1604 |     1606 |     1586 |      547 | 
b128:     1040 |     3164 |     3178 |     3187 |     1013 |     3152 |     3168 |     3186 |     1039 | 
b256:     1928 |     6343 |     6355 |     6353 |     1916 |     6387 |     6381 |     6375 |     1909 | 
b512:     3685 |    12666 |    12681 |    12660 |     3668 |    12645 |    12664 |    12724 |     3709 | 
b1024:    7671 |    26407 |    26490 |    26427 |     7662 |    26550 |    26528 |    26549 |     7648 | 
b2048:   14812 |    51817 |    51801 |    51852 |    14779 |    51865 |    51663 |    51932 |    14777 | 
b4096:   29007 |   102689 |   102538 |   102299 |    29026 |   102677 |   102962 |   102688 |    29013 | 
b8192:   57764 |   204702 |   205158 |   204743 |    57673 |   205559 |   205738 |   204963 |    57732 | 
b16384  115163 |   414076 |   413271 |   412504 |   114849 |   413162 |   413003 |   413334 |   114911 | 
b32768  229685 |   830554 |   828351 |   828070 |   229596 |   831787 |   829131 |   828910 |   229745 | 

off is the offset from an aligned address; the first column contains the number of bytes copied.
Times are the sum of 10000 iterations, in µs.

New implementation

off          0 |        1 |        2 |        3 |        4 |        5 |        6 |        7 |        8 | 
--------------------------------------------------------------------------------------------------------
b8:        251 |      213 |      238 |      212 |      211 |      214 |      211 |      231 |      240 | 
b16:       255 |      252 |      251 |      281 |      283 |      254 |      260 |      255 |      251 | 
b32:       341 |      369 |      337 |      338 |      341 |      341 |      359 |      343 |      330 | 
b64:       544 |      606 |      585 |      582 |      551 |      522 |      501 |      524 |      476 | 
b128:      680 |      790 |      770 |      742 |      716 |      712 |      697 |      672 |      626 | 
b256:     1065 |     1138 |     1119 |     1150 |     1089 |     1063 |     1073 |     1075 |      997 | 
b512:     2321 |     1853 |     1839 |     1808 |     1812 |     2217 |     2182 |     2162 |     2163 | 
b1024:    4298 |     4555 |     4542 |     4615 |     4613 |     4586 |     4587 |     4503 |     4512 | 
b2048:    7757 |     7376 |     7367 |     7578 |     7436 |     7427 |     7488 |     7428 |     7446 | 
b4096:   14355 |    14003 |    13998 |    14031 |    14089 |    13805 |    13730 |    13956 |    13695 | 
b8192:   26073 |    25767 |    25790 |    25832 |    25984 |    25158 |    25239 |    25148 |    25266 | 
b16384   50478 |    49692 |    49739 |    49582 |    50499 |    48158 |    48262 |    48830 |    48462 | 
b32768   99926 |    98652 |    98188 |   100087 |    97771 |    94726 |    94758 |    95663 |    96211 | 

About a 2-2.5x speedup with aligned addresses and up to 8.5x with unaligned access.

Motivation and Context

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)

How Has This Been Tested?

  • Already covered by automatic testing.
  • New test added: (add PR link here).
  • Tested by hand on: armv7a9-zynq7000-qemu

Checklist:

  • My change requires a change to the documentation.
  • I have updated the documentation accordingly.
  • I have added tests to cover my changes.
  • All new and existing linter checks and tests passed.
  • My changes generate no new compilation warnings for any of the targets.

Special treatment

  • This PR needs additional PRs to work (list the PRs, preferably in merge-order).
  • I will merge this PR by myself when appropriate.


github-actions bot commented Feb 27, 2024

Unit Test Results

7 254 tests +99 · 6 543 ✅ +95 · 711 💤 +4 · 0 ❌ ±0
408 suites +11 · 1 files ±0
38m 13s ⏱️ +4m 3s

Results for commit 860f1ae. ± Comparison against base commit 8e7daf8.

♻️ This comment has been updated with latest results.

@nalajcie
Member

  • maybe check len not divisible by 2? (probably a more common case than an unaligned beginning of the buffer)
  • does off from the aligned address apply to src/dst, or is it the diff between them?

Maybe we can benchmark it against a naive C-only implementation with loop unrolling - just for reference? - e.g. the newlib one
(https://github.com/bminor/newlib/blob/master/newlib/libc/string/memcpy.c)?

Did you try benchmarking it against e.g. the uClibc-ng ARM asm-optimized version? (https://github.com/wbx-github/uclibc-ng/blob/master/libc/string/arm/_memcpy.S)

Comment on lines 18 to 64
cmp LEN, #64                 /* dispatch on copy length */
mov DST_RET, DST             /* preserve return value */

bhs .LblkCopy                /* 64 <= len: take the block copy path */

/* less than 64 bytes - always copy as if the block was unaligned */

.Ltail63Unaligned:
/* unaligned copy, 0-63 bytes */

/* r3 = LEN / 4 */
movs r3, LEN, lsr #2
beq .Ltail63Un0              /* no whole words to copy */

.Ltail63Un4:                 /* copy one word per iteration */
ldr r2, [SRC], #4
str r2, [DST], #4
subs r3, #1
bne .Ltail63Un4
Contributor

Is unaligned memory access allowed?

Member Author

@lukileczo lukileczo Feb 29, 2024

Yes, the ldr and str instructions allow unaligned addresses (with some performance penalty - which doesn't really matter here, as we're copying only up to 60 bytes that way). ARM documentation

Edit: OK, it may not always work, I'll change that

Contributor

I meant in your project, not in general.

Member

After some discussion I think it will be beneficial to allow it, at least in this case. It will need a small CPU initialization change, though. @nalajcie What do you think? Enabling unaligned access will make memcpy simpler and faster (an unaligned 4-byte access should still be faster than 4 separate 1-byte accesses).
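
For context, on ARMv7-A that initialization change would typically amount to clearing the alignment-check enable bit in SCTLR - a hedged sketch, not necessarily how this project does it:

    mrc     p15, 0, r0, c1, c0, 0   /* read SCTLR */
    bic     r0, r0, #(1 << 1)       /* clear the A bit (alignment checking) */
    mcr     p15, 0, r0, c1, c0, 0   /* write SCTLR back */
    isb                             /* make the change visible to later code */

With U=1 and A=0, unaligned LDR/STR are handled in hardware instead of faulting.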

Member

I'm not sure, but unaligned access is supported only in ARM mode (https://developer.arm.com/documentation/ddi0308/d/Programmers--Model/Unaligned-access-support/Load-and-store-alignment-checks), so the resulting bytecode would actually be larger than the Thumb version with correct alignment. Not sure about the performance implications, though.

If this would be the only blocker against switching to Thumb then IMHO we can copy it byte-by-byte (as it's only up to 60 bytes)?

For short (up to 64 bytes?) aligned (not sure if only?) memcpy, GCC would probably provide an alternative inline implementation (https://gcc.gnu.org/git/?p=gcc.git;a=blob;f=gcc/expr.cc;h=8d34d024c9c1248cb36bbfd78f90e9514cee513e;hb=refs/heads/master#l1978 is responsible for the heuristics, but it might be easier to test it experimentally than to understand the code), so we should assume this code would be called mostly for unaligned pointers anyway (each str would be split into byte/halfword accesses?).

Member

@nalajcie Hmm, what makes you think it doesn't work in thumb mode?

Member

Probably some invalid info I found on the internet (https://s-o-c.org/does-arm-allow-unaligned-access/). The reference page I linked in the previous comment indeed says that U=1 on all modern ARMs, so STR/LDR should just produce unaligned accesses - please just scratch out that part of my comment :).

Sorry for providing invalid info. If we choose to keep the unaligned access, maybe we should state it explicitly in a comment, to ensure nobody thinks in the future that it's an error on our part?
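
Such a comment next to the unaligned loads/stores might read something like (illustrative wording only):

    /* NOTE: the LDR/STR below may be unaligned on purpose - ARMv7 with
     * SCTLR.U=1, A=0 handles this in hardware; it is not a bug. */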

@lukileczo lukileczo force-pushed the lukileczo/memcpy branch 4 times, most recently from a714c76 to 0e88eac on March 1, 2024 15:31
@lukileczo
Member Author

lukileczo commented Mar 1, 2024

@nalajcie regarding your comment - len divisibility (by 2, 4, whatever) does not meaningfully impact performance, as it's always handled in the tail. off in the benchmarks is the same value for both src and dst: memcpy(dst + off, src + off, len). Below I've attached benchmarks where src and dst aren't mutually aligned (there is no visible performance impact, however).

Edit: I've managed to simplify the code a little and squeeze out better performance; see the next comment.

Some more benchmarks:

New memcpy

.arm

src/dst mutually aligned

off          0 |        1 | 
------------------------------
b8:        247 |      208 | 
b16:       248 |      248 | 
b32:       368 |      358 | 
b64:       414 |      551 | 
b128:      594 |      737 | 
b256:      929 |     1087 | 
b512:     2185 |     1935 | 
b1024:    3510 |     3754 | 
b2048:    6284 |     7016 | 
b4096:   12579 |    13195 | 
b8192:   25140 |    25711 | 
b16384   50216 |    51356 | 
b32768   98060 |   101563 | 

dst misaligned by 1 - memcpy(dst + off + 1, src + off, len)

off          0 |        1 | 
------------------------------
b8:        231 |      207 | 
b16:       320 |      263 | 
b32:       321 |      330 | 
b64:       653 |      617 | 
b128:      774 |      774 | 
b256:     1202 |     1184 | 
b512:     1967 |     1984 | 
b1024:    3367 |     3425 | 
b2048:    6320 |     6674 | 
b4096:   12178 |    12107 | 
b8192:   24066 |    24157 | 
b16384   47823 |    48275 | 
b32768   94652 |    99616 |

src misaligned by 1 - memcpy(dst + off, src + off + 1, len)

off          0 |        1 | 
------------------------------
b8:        327 |      237 | 
b16:       248 |      247 | 
b32:       327 |      323 | 
b64:       402 |      613 | 
b128:      542 |      805 | 
b256:      920 |     1149 | 
b512:     1813 |     1957 | 
b1024:    3450 |     3367 | 
b2048:    6027 |     6387 | 
b4096:   11918 |    12246 | 
b8192:   23796 |    24361 | 
b16384   47536 |    47897 | 
b32768   94394 |    94629 | 

.thumb

src/dst mutually aligned

This is a little concerning to me, as the performance of the aligned copy is lower than the unaligned one - yet the manual loop alignment stays the same (32 bytes, equal to an icache line), so I don't really know why it's happening.

off          0 |        1 | 
------------------------------
b8:        248 |      207 | 
b16:       246 |      246 | 
b32:       325 |      361 | 
b64:       441 |      537 | 
b128:      604 |      724 | 
b256:      890 |     1101 | 
b512:     2515 |     1891 | 
b1024:    4386 |     4455 | 
b2048:    8489 |     8667 | 
b4096:   16865 |    16790 | 
b8192:   33021 |    32957 | 
b16384   64768 |    64599 | 
b32768  129154 |   128443 | 

dst misaligned by 1

off          0 |        1 | 
------------------------------
b8:        224 |      203 | 
b16:       246 |      263 | 
b32:       406 |      321 | 
b64:       611 |      606 | 
b128:      817 |      770 | 
b256:     1174 |     1174 | 
b512:     1996 |     1945 | 
b1024:    3485 |     3370 | 
b2048:    6345 |     6316 | 
b4096:   12284 |    12151 | 
b8192:   24209 |    23919 | 
b16384   48020 |    47358 | 
b32768   94502 |    94557 | 

src misaligned by 1

off          0 |        1 | 
------------------------------
b8:        224 |      208 | 
b16:       242 |      247 | 
b32:       396 |      326 | 
b64:       522 |      798 | 
b128:      749 |      818 | 
b256:     1094 |     1368 | 
b512:     1906 |     2128 | 
b1024:    3308 |     3572 | 
b2048:    6256 |     6883 | 
b4096:   12146 |    13084 | 
b8192:   25058 |    24164 | 
b16384   53383 |    49236 | 
b32768  105324 |    96130 | 

C memcpy

dst/src mutually aligned

(https://github.com/bminor/newlib/blob/master/newlib/libc/string/memcpy.c)

off          0 |        1 | 
------------------------------
b8:        445 |      414 | 
b16:       346 |      545 | 
b32:       415 |     1024 | 
b64:       488 |     1493 | 
b128:      761 |     2770 | 
b256:     1257 |     5344 | 
b512:     2318 |    10380 | 
b1024:    4242 |    20532 | 
b2048:    8271 |    40605 | 
b4096:   16320 |    81429 | 
b8192:   32325 |   160967 | 
b16384   64264 |   320014 | 
b32768  130048 |   640968 | 

uClibc - .thumb

(https://github.com/wbx-github/uclibc-ng/blob/master/libc/string/arm/_memcpy.S)

src/dst mutually aligned

off          0 |        1 | 
------------------------------
b8:        388 |      392 | 
b16:       305 |      349 | 
b32:       383 |      431 | 
b64:       481 |      528 | 
b128:      651 |      821 | 
b256:     1031 |     1099 | 
b512:     1767 |     1848 | 
b1024:    3336 |     3359 | 
b2048:    6431 |     6442 | 
b4096:   12597 |    12598 | 
b8192:   24757 |    24930 | 
b16384   49676 |    49730 | 
b32768  100916 |   101239 | 

dst misaligned by 1

off          0 |        1 | 
------------------------------
b8:        436 |      322 | 
b16:       369 |      370 | 
b32:       478 |      488 | 
b64:       593 |      592 | 
b128:      844 |      841 | 
b256:     1368 |     1348 | 
b512:     2506 |     2299 | 
b1024:    4489 |     4292 | 
b2048:    8738 |     8458 | 
b4096:   16536 |    16385 | 
b8192:   31989 |    32126 | 
b16384   63783 |    63841 | 
b32768  129096 |   128766 | 

src misaligned by 1

off          0 |        1 | 
------------------------------
b8:        321 |      326 | 
b16:       413 |      394 | 
b32:       381 |      462 | 
b64:       552 |      608 | 
b128:      767 |      850 | 
b256:     1235 |     1471 | 
b512:     2273 |     2346 | 
b1024:    4153 |     4440 | 
b2048:    8363 |     8351 | 
b4096:   16380 |    16209 | 
b8192:   31899 |    32024 | 
b16384   63710 |    63781 | 
b32768  128189 |   128997 | 

@lukileczo
Member Author

lukileczo commented Mar 4, 2024

Upon further investigation, I've noticed that only 64-byte loop alignment allows the loops to consistently deliver the highest performance. .arm code doesn't provide any advantage over .thumb - so the resulting code is written in .thumb. I've also dropped the 128-byte block copy case, which didn't provide any speedup compared to the 64-byte block copy - possibly due to caching. Dropping this case also makes the resulting code a little smaller and simpler. Benchmarks below, after a sketch of the loop shape.
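
For reference, the shape of such a 64-byte block copy loop - a hypothetical Thumb-2 sketch, not the PR's exact code; it assumes dst in r0, src in r1, len >= 64 in r2, and both pointers word-aligned:

        push    {r4-r10, lr}
        subs    r2, r2, #64             /* pre-bias the count */
        .p2align 5                      /* align the loop to a 32-byte icache line */
    .Lcopy64:
        ldmia   r1!, {r3-r10}           /* load 32 bytes */
        stmia   r0!, {r3-r10}           /* store 32 bytes */
        ldmia   r1!, {r3-r10}           /* second half of the 64-byte block */
        stmia   r0!, {r3-r10}
        subs    r2, r2, #64
        bhs     .Lcopy64
        adds    r2, r2, #64             /* 0..63 bytes remain for the tail */
        pop     {r4-r10, lr}            /* then fall through to the tail copy */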

src/dst mutually aligned

off          0 |        1 | 
-------------------------------
b8:        399 |      259 | 
b16:       300 |      300 | 
b32:       387 |      407 | 
b64:       504 |      568 | 
b128:      634 |      741 | 
b256:      964 |     1095 | 
b512:     1774 |     1800 | 
b1024:    3276 |     3208 | 
b2048:    8593 |     7918 | 
b4096:   14299 |    13416 | 
b8192:   25874 |    25052 | 
b16384   48811 |    47742 | 
b32768   94385 |    93635 | 

dst misaligned by 1

off          0 |        1 | 
-------------------------------
b8:        288 |      291 | 
b16:       319 |      303 | 
b32:       417 |      380 | 
b64:       711 |      770 | 
b128:     1018 |      985 | 
b256:     1365 |     1317 | 
b512:     2060 |     2014 | 
b1024:    3514 |     3524 | 
b2048:    7666 |     7874 | 
b4096:   13610 |    13610 | 
b8192:   25248 |    25396 | 
b16384   48967 |    48811 | 
b32768   95953 |    95718 | 

src misaligned by 1

off          0 |        1 | 
-------------------------------
b8:        295 |      260 | 
b16:       387 |      300 | 
b32:       382 |      411 | 
b64:       688 |      684 | 
b128:      961 |      978 | 
b256:     1315 |     1365 | 
b512:     2023 |     2154 | 
b1024:    3544 |     3647 | 
b2048:    7899 |     7899 | 
b4096:   14479 |    13891 | 
b8192:   25419 |    25641 | 
b16384   48872 |    49038 | 
b32768   96160 |    95916 | 

@lukileczo lukileczo force-pushed the lukileczo/memcpy branch 6 times, most recently from 83b4346 to 1e40a40 on March 4, 2024 10:58
@nalajcie
Member

nalajcie commented Mar 4, 2024

Thanks for the comprehensive benchmarks, I see that:

  • our new implementation is much faster, especially for aligned buffers (IMHO the most common use case for bigger buffers), which I'm very happy about :)
  • the newlib C implementation for aligned buffers is faster than our original ASM implementation; maybe we should introduce a similar "fallback" C implementation for new targets to benefit from that
  • loop alignment/unrolling is especially useful for Cortex-M4, which doesn't have a branch predictor (unaligned jumps would take more cycles to refill the pipeline - but aligning to a word should be enough); for M7 the branch predictor will guess correctly most of the time but will sometimes fail (that's why times might be inconsistent), and the ICache might also have an impact
  • I don't understand why len=4096 is faster for off=1 (13416) than for off=0 (14299), and similarly why dst not mutually aligned with src is faster (13610 vs 14299) - it seems there is something we don't understand here in the aligned use case (cache invalidation issues?), as the uClibc timings seem more logical (mutually aligned memcpy is slower, off=1 the same as/slower than off=0)

Note - for Cortex-A we could probably use NEON extensions to provide an even faster memcpy, but let's not dwell on that for now. I've found some NEON implementations of libc functions if you'd be up for experimenting in the future: https://github.com/genesi/imx-libc-neon/blob/master/memcpy-neon.S 👀
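
As a taste of what that could look like, here is a hypothetical NEON inner loop (not part of this PR) that moves 32 bytes per iteration; it assumes r0 = dst, r1 = src, and r2 = len, a nonzero multiple of 32:

    .Lneon32:
        vld1.8  {d0-d3}, [r1]!          /* load 32 bytes into d0-d3 (q0/q1) */
        vst1.8  {d0-d3}, [r0]!          /* store them */
        subs    r2, r2, #32
        bne     .Lneon32

Real implementations, like the linked memcpy-neon.S, additionally use alignment hints and pld prefetching to keep the memory system busy.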

@lukileczo
Member Author

memset benchmarks:

New implementation

off          0 |        1 | 
----------------------------------------------
b8:        216 |      369 | 
b16:       223 |      434 | 
b32:       314 |      446 | 
b64:       415 |      461 | 
b128:      475 |      477 | 
b256:      630 |      676 | 
b512:      996 |     1029 | 
b1024:    1741 |     1761 | 
b2048:    3168 |     3202 | 
b4096:    6077 |     6084 | 
b8192:   12422 |    12433 | 
b16384   24003 |    24086 | 
b32768   47230 |    47339 |

Old implementation

off          0 |        1 | 
----------------------------------------------
b8:        124 |      157 | 
b16:       126 |      291 | 
b32:       218 |      585 | 
b64:       470 |     1124 | 
b128:      780 |     2224 | 
b256:     1461 |     4508 | 
b512:     2792 |     9000 | 
b1024:    5378 |    17972 | 
b2048:   10628 |    35711 | 
b4096:   21316 |    71442 | 
b8192:   42512 |   147078 | 
b16384   85006 |   286835 | 
b32768  170727 |   574138 | 

uClibc

off          0 |        1 | 
----------------------------------------------
b8:        199 |      285 | 
b16:       206 |      318 | 
b32:       241 |      329 | 
b64:       349 |      398 | 
b128:      469 |      578 | 
b256:      858 |      890 | 
b512:     1394 |     1471 | 
b1024:    2708 |     2834 | 
b2048:    5153 |     5227 | 
b4096:   10109 |    10150 | 
b8192:   19936 |    20028 | 
b16384   43101 |    42934 | 
b32768   82452 |    82244 |

@lukileczo lukileczo force-pushed the lukileczo/memcpy branch 2 times, most recently from 90d27b0 to 47e0823 on March 5, 2024 09:46
@lukileczo lukileczo marked this pull request as ready for review March 5, 2024 10:00
@lukileczo lukileczo changed the title armv7-a: optimize memcpy, memset arm: reorganize files, optimize memcpy, memset Mar 7, 2024
@lukileczo
Member Author

I've also tested and benchmarked the implementations on a Nucleo STM32L4A6. Iterations were reduced to 1000.
Unaligned src and dst have very similar performance, so I'm posting only dst.
Unaligned ldr/str instructions work without any change to the CPU registers (see the note below).
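
That matches the architecture: on ARMv7-M, unaligned LDR/STR fault only if the UNALIGN_TRP bit is set in the SCB's Configuration and Control Register, and it resets to 0. A hedged fragment showing the check (not code from this PR; the target label is hypothetical):

    ldr     r0, =0xE000ED14         /* SCB->CCR address on ARMv7-M */
    ldr     r1, [r0]
    tst     r1, #(1 << 3)           /* bit 3 = UNALIGN_TRP */
    bne     .Lunaligned_traps       /* set: unaligned accesses would fault */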

memcpy

New implementation

mutually aligned

off          0 |        1 | 
----------------------------------------------
b8:       1953 |     2930 | 
b16:      2930 |     4150 | 
b32:      4151 |     6836 | 
b64:      6835 |     9277 | 
b128:    10254 |    13916 | 
b256:    17578 |    20996 | 
b512:    31739 |    35157 | 
b1024:   60547 |    64209 | 
b2048:  118164 |   121826 | 
b4096:  233398 |   236816 | 
b8192:  463623 |   467041 | 
b16384  923828 |   927491 | 
b32768 1844483 |  1848388 | 

unaligned dst

off          0 |        1 | 
----------------------------------------------
b8:       3174 |     2930 | 
b16:      3662 |     3906 | 
b32:      5859 |     6348 | 
b64:     12207 |    12451 | 
b128:    19287 |    19775 | 
b256:    30518 |    30762 | 
b512:    52979 |    53223 | 
b1024:   98144 |    98632 | 
b2048:  188477 |   188721 | 
b4096:  368408 |   368897 | 
b8192:  729248 |   729248 | 
b16384 1449951 |  1450439 | 
b32768 2892090 |  2892334 |

Old implementation

Mutually aligned

off          0 |        1 | 
----------------------------------------------
b8:       2686 |     3906 | 
b16:      3174 |     7080 | 
b32:      5127 |    13672 | 
b64:      9033 |    26611 | 
b128:    16602 |    52735 | 
b256:    31738 |   104736 | 
b512:    62256 |   208985 | 
b1024:  122802 |   417480 | 
b2048:  244629 |   834229 | 
b4096:  487793 |  1668213 | 
b8192:  974121 |  3335937 | 
b16384 1946778 |  6670410 | 
b32768 3892334 | 13341553 | 

Unaligned dst

off           0 |        1 | 
----------------------------------------------
b8:        4883 |     4639 | 
b16:       8300 |     8301 | 
b32:      15870 |    15869 | 
b64:      31005 |    31006 | 
b128:     61524 |    61523 | 
b256:    122314 |   122315 | 
b512:    243897 |   243896 | 
b1024:   487060 |   487061 | 
b2048:   973389 |   973633 | 
b4096:  1946777 |  1946045 | 
b8192:  3892090 |  3891845 | 
b16384  7784180 |  7782959 | 
b32768 15562988 | 15563721 | 

memset

New implementation

off          0 |        1 | 
----------------------------------------
b8:       2197 |     2685 | 
b16:      2686 |     3418 | 
b32:      3906 |     5371 | 
b64:      8057 |     9766 | 
b128:     9766 |    12207 | 
b256:    13916 |    16113 | 
b512:    21240 |    23682 | 
b1024:   36621 |    38818 | 
b2048:   66894 |    69336 | 
b4096:  127686 |   129883 | 
b8192:  249268 |   251709 | 
b16384  492431 |   495117 | 
b32768  979493 |   981690 | 

Old implementation

off          0 |        1 | 
----------------------------------------
b8:       2685 |     3418 | 
b16:      2686 |     6347 | 
b32:      4394 |    11475 | 
b64:      7813 |    22461 | 
b128:    14404 |    44189 | 
b256:    27344 |    87403 | 
b512:    53467 |   174316 | 
b1024:  105468 |   348389 | 
b2048:  209716 |   696045 | 
b4096:  418457 |  1390869 | 
b8192:  835450 |  2781250 | 
b16384 1669433 |  5560303 | 
b32768 3337647 | 11120361 | 

@agkaminski agkaminski merged commit 5e649dc into master Mar 8, 2024
30 checks passed
@agkaminski agkaminski deleted the lukileczo/memcpy branch March 8, 2024 15:53
@lukileczo lukileczo mentioned this pull request Mar 8, 2024