bug fix to encode_arm64.s: some registers overwritten in memmove call #56

Merged Nov 3, 2020 (1 commit)

Commits on Oct 2, 2020

  1. bug fix to encode_arm64.s: some registers overwritten in memmove call

    In encode_arm64.s, encodeBlock, two of the registers added during the port from
    amd64 were not saved or restored for the memmove call. Instead of saving them,
    just recalculate their values. Additionally, I made a few small changes to
    improve things since I've learned a bit more about ARMv8 assembly.
     - Pass an immediate as the first argument to CMP, which the instruction accepts
     - Use LDP/STP instead of SIMD instructions
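
To illustrate the second item: a load-pair/store-pair sequence moves data through two 64-bit general-purpose registers instead of one vector register. The Go program below is only a portable sketch of that idea, assuming the copy in question is a fixed 16-byte block; `copy16` is a hypothetical helper, not a function from this repository, and it does not reproduce the assembly in encode_arm64.s.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// copy16 copies exactly 16 bytes from src to dst as two 64-bit words,
// a portable stand-in for a load-pair followed by a store-pair.
// Hypothetical helper for illustration only.
func copy16(dst, src []byte) {
	_ = src[15] // panics early unless both slices hold at least 16 bytes
	_ = dst[15]
	lo := binary.LittleEndian.Uint64(src[0:8])  // first word of the pair
	hi := binary.LittleEndian.Uint64(src[8:16]) // second word of the pair
	binary.LittleEndian.PutUint64(dst[0:8], lo)
	binary.LittleEndian.PutUint64(dst[8:16], hi)
}

func main() {
	src := []byte("0123456789abcdefXYZ")
	dst := make([]byte, 16)
	copy16(dst, src)
	fmt.Printf("%s\n", dst) // prints "0123456789abcdef"
}
```

Both words are loaded before either is stored, so the sketch behaves like a single 16-byte move even if dst and src overlap.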
    
    The change to use the load-pair and store-pair instructions instead of the SIMD
    instructions results in some modest performance improvements as measured on
    Neoverse N1 (Graviton 2).
    
    name              old time/op    new time/op    delta
    WordsDecode1e1-2    25.9ns ± 1%    26.1ns ± 1%  +0.66%  (p=0.005 n=10+10)
    WordsDecode1e2-2     107ns ± 0%     105ns ± 0%  -1.87%  (p=0.000 n=10+10)
    WordsDecode1e3-2     953ns ± 0%     901ns ± 0%  -5.50%  (p=0.000 n=10+10)
    WordsDecode1e4-2    10.6µs ± 0%     9.9µs ± 2%  -6.60%  (p=0.000 n=7+10)
    WordsDecode1e5-2     170µs ± 1%     164µs ± 1%  -3.12%  (p=0.000 n=10+9)
    WordsDecode1e6-2    1.71ms ± 0%    1.66ms ± 0%  -2.98%  (p=0.000 n=10+10)
    WordsEncode1e1-2    22.0ns ± 1%    21.9ns ± 1%  -0.67%  (p=0.006 n=8+10)
    WordsEncode1e2-2     248ns ± 0%     245ns ± 0%  -1.21%  (p=0.002 n=8+10)
    WordsEncode1e3-2    2.50µs ± 0%    2.49µs ± 0%    ~     (p=0.103 n=10+9)
    WordsEncode1e4-2    27.8µs ± 3%    28.0µs ± 2%    ~     (p=0.075 n=10+10)
    WordsEncode1e5-2     339µs ± 0%     343µs ± 0%  +1.18%  (p=0.000 n=9+10)
    WordsEncode1e6-2    3.39ms ± 0%    3.42ms ± 0%  +0.94%  (p=0.000 n=10+10)
    RandomEncode-2      74.8µs ± 1%    77.1µs ± 1%  +3.16%  (p=0.000 n=10+10)
    _UFlat0-2           68.8µs ± 1%    66.4µs ± 2%  -3.54%  (p=0.000 n=10+10)
    _UFlat1-2            770µs ± 0%     740µs ± 1%  -3.93%  (p=0.000 n=10+10)
    _UFlat2-2           6.57µs ± 0%    6.55µs ± 0%  -0.25%  (p=0.000 n=8+10)
    _UFlat3-2            183ns ± 0%     178ns ± 1%  -2.84%  (p=0.000 n=9+10)
    _UFlat4-2           9.76µs ± 1%    9.56µs ± 0%  -2.07%  (p=0.000 n=10+9)
    _UFlat5-2            301µs ± 0%     293µs ± 0%  -2.67%  (p=0.000 n=9+10)
    _UFlat6-2            280µs ± 1%     267µs ± 1%  -4.63%  (p=0.000 n=10+10)
    _UFlat7-2            241µs ± 0%     230µs ± 1%  -4.68%  (p=0.000 n=9+10)
    _UFlat8-2            745µs ± 0%     715µs ± 1%  -4.11%  (p=0.000 n=10+10)
    _UFlat9-2           1.01ms ± 0%    0.96ms ± 0%  -4.60%  (p=0.000 n=10+10)
    _UFlat10-2          62.3µs ± 1%    59.3µs ± 1%  -4.72%  (p=0.000 n=10+9)
    _UFlat11-2           258µs ± 0%     252µs ± 1%  -2.56%  (p=0.000 n=10+10)
    _ZFlat0-2            135µs ± 1%     132µs ± 1%  -1.88%  (p=0.000 n=10+8)
    _ZFlat1-2           1.76ms ± 0%    1.74ms ± 0%  -1.00%  (p=0.000 n=9+9)
    _ZFlat2-2           9.54µs ± 0%    9.84µs ± 5%  +3.18%  (p=0.000 n=10+10)
    _ZFlat3-2            449ns ± 0%     447ns ± 0%  -0.38%  (p=0.000 n=10+9)
    _ZFlat4-2           15.6µs ± 0%    16.0µs ± 4%    ~     (p=0.118 n=9+10)
    _ZFlat5-2            560µs ± 1%     555µs ± 1%  -0.89%  (p=0.000 n=9+9)
    _ZFlat6-2            531µs ± 0%     534µs ± 0%  +0.64%  (p=0.000 n=10+10)
    _ZFlat7-2            466µs ± 0%     468µs ± 0%  +0.32%  (p=0.003 n=10+10)
    _ZFlat8-2           1.42ms ± 0%    1.42ms ± 0%  +0.43%  (p=0.000 n=10+10)
    _ZFlat9-2           1.93ms ± 0%    1.94ms ± 0%  +0.44%  (p=0.000 n=10+10)
    _ZFlat10-2           120µs ± 0%     121µs ± 3%    ~     (p=0.436 n=9+9)
    _ZFlat11-2           433µs ± 0%     437µs ± 0%  +1.03%  (p=0.000 n=10+10)
    ExtendMatch-2       9.77µs ± 0%    9.76µs ± 0%  -0.13%  (p=0.050 n=10+10)
    
    As measured on Cortex-A53 (Raspberry Pi 3)
    
    name              old time/op    new time/op    delta
    WordsDecode1e1-4     152ns ± 2%     151ns ± 0%    ~     (p=0.536 n=10+8)
    WordsDecode1e2-4     639ns ± 0%     617ns ± 0%  -3.54%  (p=0.000 n=9+8)
    WordsDecode1e3-4    6.74µs ± 2%    6.35µs ± 0%  -5.75%  (p=0.000 n=10+9)
    WordsDecode1e4-4    66.7µs ± 0%    63.5µs ± 0%  -4.69%  (p=0.000 n=9+9)
    WordsDecode1e5-4     715µs ± 0%     684µs ± 0%  -4.38%  (p=0.000 n=8+8)
    WordsDecode1e6-4    6.87ms ± 2%    6.53ms ± 1%  -4.99%  (p=0.000 n=10+9)
    WordsEncode1e1-4     127ns ± 2%     126ns ± 0%    ~     (p=0.065 n=10+9)
    WordsEncode1e2-4    1.58µs ± 0%    1.57µs ± 0%  -0.99%  (p=0.000 n=8+8)
    WordsEncode1e3-4    15.1µs ± 0%    14.9µs ± 0%  -1.46%  (p=0.000 n=9+8)
    WordsEncode1e4-4     148µs ± 0%     148µs ± 4%    ~     (p=0.497 n=9+10)
    WordsEncode1e5-4    1.54ms ± 0%    1.54ms ± 0%  +0.12%  (p=0.012 n=10+8)
    WordsEncode1e6-4    14.4ms ± 0%    14.4ms ± 1%  -0.47%  (p=0.015 n=9+8)
    RandomEncode-4      1.13ms ± 1%    1.13ms ± 1%    ~     (p=0.529 n=10+10)
    _UFlat0-4            294µs ± 0%     288µs ± 1%  -2.08%  (p=0.000 n=9+9)
    _UFlat1-4           3.05ms ± 1%    2.98ms ± 1%  -2.22%  (p=0.000 n=9+9)
    _UFlat2-4           37.3µs ± 0%    37.4µs ± 1%    ~     (p=0.093 n=8+9)
    _UFlat3-4            909ns ± 0%     914ns ± 2%    ~     (p=0.526 n=8+10)
    _UFlat4-4           58.7µs ± 0%    58.1µs ± 0%  -1.09%  (p=0.000 n=8+10)
    _UFlat5-4           1.22ms ± 0%    1.19ms ± 1%  -2.14%  (p=0.000 n=8+8)
    _UFlat6-4           1.03ms ± 0%    0.99ms ± 0%  -3.28%  (p=0.000 n=9+8)
    _UFlat7-4            895µs ± 0%     861µs ± 0%  -3.79%  (p=0.000 n=8+8)
    _UFlat8-4           2.83ms ± 0%    2.75ms ± 0%  -2.88%  (p=0.000 n=7+8)
    _UFlat9-4           3.85ms ± 1%    3.73ms ± 1%  -3.03%  (p=0.000 n=8+9)
    _UFlat10-4           286µs ± 0%     282µs ± 0%  -1.59%  (p=0.000 n=9+9)
    _UFlat11-4          1.06ms ± 0%    1.02ms ± 0%  -3.58%  (p=0.000 n=8+9)
    _ZFlat0-4            620µs ± 0%     620µs ± 1%    ~     (p=0.963 n=9+8)
    _ZFlat1-4           9.49ms ± 1%    9.67ms ± 3%  +1.87%  (p=0.000 n=9+10)
    _ZFlat2-4           61.8µs ± 0%    62.3µs ± 3%    ~     (p=0.829 n=8+10)
    _ZFlat3-4           2.80µs ± 1%    2.79µs ± 0%  -0.55%  (p=0.000 n=8+8)
    _ZFlat4-4            108µs ± 0%     109µs ± 0%  +0.55%  (p=0.000 n=10+8)
    _ZFlat5-4           2.59ms ± 2%    2.58ms ± 1%    ~     (p=0.274 n=10+8)
    _ZFlat6-4           2.39ms ± 3%    2.40ms ± 1%    ~     (p=0.631 n=10+10)
    _ZFlat7-4           2.11ms ± 0%    2.08ms ± 1%  -1.23%  (p=0.000 n=10+9)
    _ZFlat8-4           6.86ms ± 0%    6.92ms ± 1%  +0.78%  (p=0.000 n=9+8)
    _ZFlat9-4           9.42ms ± 0%    9.40ms ± 1%    ~     (p=0.606 n=8+9)
    _ZFlat10-4           620µs ± 1%     621µs ± 4%    ~     (p=0.173 n=8+10)
    _ZFlat11-4          1.94ms ± 0%    1.93ms ± 0%  -0.52%  (p=0.001 n=9+8)
    ExtendMatch-4       69.3µs ± 2%    69.2µs ± 0%    ~     (p=0.515 n=10+8)
    AWSjswinney committed Oct 2, 2020
    Commit f81760e