-
Notifications
You must be signed in to change notification settings - Fork 164
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
bug fix to encode_arm64.s: some registers overwritten in memmove call #56
Conversation
In encode_arm64.s, encodeBlock, two of the registers added during the port from amd64 were not saved or restored for the memmove call. Instead of saving them, just recalculate their values. Additionally, I made a few small changes to improve things since I've learned a bit more about ARMv8 assembly. - The CMP instruction accepts an immediate as the first argument - use LDP/STP instead of SIMD instructions The change to use the load-pair and store-pair instructions instead of the SIMD instructions results in some modest performance improvements as meastured on Neoverse N1 (Graviton 2). name old time/op new time/op delta WordsDecode1e1-2 25.9ns ± 1% 26.1ns ± 1% +0.66% (p=0.005 n=10+10) WordsDecode1e2-2 107ns ± 0% 105ns ± 0% -1.87% (p=0.000 n=10+10) WordsDecode1e3-2 953ns ± 0% 901ns ± 0% -5.50% (p=0.000 n=10+10) WordsDecode1e4-2 10.6µs ± 0% 9.9µs ± 2% -6.60% (p=0.000 n=7+10) WordsDecode1e5-2 170µs ± 1% 164µs ± 1% -3.12% (p=0.000 n=10+9) WordsDecode1e6-2 1.71ms ± 0% 1.66ms ± 0% -2.98% (p=0.000 n=10+10) WordsEncode1e1-2 22.0ns ± 1% 21.9ns ± 1% -0.67% (p=0.006 n=8+10) WordsEncode1e2-2 248ns ± 0% 245ns ± 0% -1.21% (p=0.002 n=8+10) WordsEncode1e3-2 2.50µs ± 0% 2.49µs ± 0% ~ (p=0.103 n=10+9) WordsEncode1e4-2 27.8µs ± 3% 28.0µs ± 2% ~ (p=0.075 n=10+10) WordsEncode1e5-2 339µs ± 0% 343µs ± 0% +1.18% (p=0.000 n=9+10) WordsEncode1e6-2 3.39ms ± 0% 3.42ms ± 0% +0.94% (p=0.000 n=10+10) RandomEncode-2 74.8µs ± 1% 77.1µs ± 1% +3.16% (p=0.000 n=10+10) _UFlat0-2 68.8µs ± 1% 66.4µs ± 2% -3.54% (p=0.000 n=10+10) _UFlat1-2 770µs ± 0% 740µs ± 1% -3.93% (p=0.000 n=10+10) _UFlat2-2 6.57µs ± 0% 6.55µs ± 0% -0.25% (p=0.000 n=8+10) _UFlat3-2 183ns ± 0% 178ns ± 1% -2.84% (p=0.000 n=9+10) _UFlat4-2 9.76µs ± 1% 9.56µs ± 0% -2.07% (p=0.000 n=10+9) _UFlat5-2 301µs ± 0% 293µs ± 0% -2.67% (p=0.000 n=9+10) _UFlat6-2 280µs ± 1% 267µs ± 1% -4.63% (p=0.000 n=10+10) _UFlat7-2 241µs ± 0% 230µs ± 1% -4.68% (p=0.000 n=9+10) _UFlat8-2 745µs ± 0% 715µs ± 1% -4.11% (p=0.000 n=10+10) _UFlat9-2 1.01ms ± 0% 0.96ms ± 0% -4.60% (p=0.000 n=10+10) _UFlat10-2 62.3µs ± 1% 59.3µs ± 1% -4.72% (p=0.000 n=10+9) _UFlat11-2 258µs ± 0% 252µs ± 1% -2.56% (p=0.000 n=10+10) _ZFlat0-2 135µs ± 1% 132µs ± 1% -1.88% (p=0.000 n=10+8) _ZFlat1-2 1.76ms ± 0% 1.74ms ± 0% -1.00% (p=0.000 n=9+9) _ZFlat2-2 9.54µs ± 0% 9.84µs ± 5% +3.18% (p=0.000 n=10+10) _ZFlat3-2 449ns ± 0% 447ns ± 0% -0.38% (p=0.000 n=10+9) _ZFlat4-2 15.6µs ± 0% 16.0µs ± 4% ~ (p=0.118 n=9+10) _ZFlat5-2 560µs ± 1% 555µs ± 1% -0.89% (p=0.000 n=9+9) _ZFlat6-2 531µs ± 0% 534µs ± 0% +0.64% (p=0.000 n=10+10) _ZFlat7-2 466µs ± 0% 468µs ± 0% +0.32% (p=0.003 n=10+10) _ZFlat8-2 1.42ms ± 0% 1.42ms ± 0% +0.43% (p=0.000 n=10+10) _ZFlat9-2 1.93ms ± 0% 1.94ms ± 0% +0.44% (p=0.000 n=10+10) _ZFlat10-2 120µs ± 0% 121µs ± 3% ~ (p=0.436 n=9+9) _ZFlat11-2 433µs ± 0% 437µs ± 0% +1.03% (p=0.000 n=10+10) ExtendMatch-2 9.77µs ± 0% 9.76µs ± 0% -0.13% (p=0.050 n=10+10) As measured on Cortex-A53 (Raspberry Pi 3) name old time/op new time/op delta WordsDecode1e1-4 152ns ± 2% 151ns ± 0% ~ (p=0.536 n=10+8) WordsDecode1e2-4 639ns ± 0% 617ns ± 0% -3.54% (p=0.000 n=9+8) WordsDecode1e3-4 6.74µs ± 2% 6.35µs ± 0% -5.75% (p=0.000 n=10+9) WordsDecode1e4-4 66.7µs ± 0% 63.5µs ± 0% -4.69% (p=0.000 n=9+9) WordsDecode1e5-4 715µs ± 0% 684µs ± 0% -4.38% (p=0.000 n=8+8) WordsDecode1e6-4 6.87ms ± 2% 6.53ms ± 1% -4.99% (p=0.000 n=10+9) WordsEncode1e1-4 127ns ± 2% 126ns ± 0% ~ (p=0.065 n=10+9) WordsEncode1e2-4 1.58µs ± 0% 1.57µs ± 0% -0.99% (p=0.000 n=8+8) WordsEncode1e3-4 15.1µs ± 0% 14.9µs ± 0% -1.46% (p=0.000 n=9+8) WordsEncode1e4-4 148µs ± 0% 148µs ± 4% ~ (p=0.497 n=9+10) WordsEncode1e5-4 1.54ms ± 0% 1.54ms ± 0% +0.12% (p=0.012 n=10+8) WordsEncode1e6-4 14.4ms ± 0% 14.4ms ± 1% -0.47% (p=0.015 n=9+8) RandomEncode-4 1.13ms ± 1% 1.13ms ± 1% ~ (p=0.529 n=10+10) _UFlat0-4 294µs ± 0% 288µs ± 1% -2.08% (p=0.000 n=9+9) _UFlat1-4 3.05ms ± 1% 2.98ms ± 1% -2.22% (p=0.000 n=9+9) _UFlat2-4 37.3µs ± 0% 37.4µs ± 1% ~ (p=0.093 n=8+9) _UFlat3-4 909ns ± 0% 914ns ± 2% ~ (p=0.526 n=8+10) _UFlat4-4 58.7µs ± 0% 58.1µs ± 0% -1.09% (p=0.000 n=8+10) _UFlat5-4 1.22ms ± 0% 1.19ms ± 1% -2.14% (p=0.000 n=8+8) _UFlat6-4 1.03ms ± 0% 0.99ms ± 0% -3.28% (p=0.000 n=9+8) _UFlat7-4 895µs ± 0% 861µs ± 0% -3.79% (p=0.000 n=8+8) _UFlat8-4 2.83ms ± 0% 2.75ms ± 0% -2.88% (p=0.000 n=7+8) _UFlat9-4 3.85ms ± 1% 3.73ms ± 1% -3.03% (p=0.000 n=8+9) _UFlat10-4 286µs ± 0% 282µs ± 0% -1.59% (p=0.000 n=9+9) _UFlat11-4 1.06ms ± 0% 1.02ms ± 0% -3.58% (p=0.000 n=8+9) _ZFlat0-4 620µs ± 0% 620µs ± 1% ~ (p=0.963 n=9+8) _ZFlat1-4 9.49ms ± 1% 9.67ms ± 3% +1.87% (p=0.000 n=9+10) _ZFlat2-4 61.8µs ± 0% 62.3µs ± 3% ~ (p=0.829 n=8+10) _ZFlat3-4 2.80µs ± 1% 2.79µs ± 0% -0.55% (p=0.000 n=8+8) _ZFlat4-4 108µs ± 0% 109µs ± 0% +0.55% (p=0.000 n=10+8) _ZFlat5-4 2.59ms ± 2% 2.58ms ± 1% ~ (p=0.274 n=10+8) _ZFlat6-4 2.39ms ± 3% 2.40ms ± 1% ~ (p=0.631 n=10+10) _ZFlat7-4 2.11ms ± 0% 2.08ms ± 1% -1.23% (p=0.000 n=10+9) _ZFlat8-4 6.86ms ± 0% 6.92ms ± 1% +0.78% (p=0.000 n=9+8) _ZFlat9-4 9.42ms ± 0% 9.40ms ± 1% ~ (p=0.606 n=8+9) _ZFlat10-4 620µs ± 1% 621µs ± 4% ~ (p=0.173 n=8+10) _ZFlat11-4 1.94ms ± 0% 1.93ms ± 0% -0.52% (p=0.001 n=9+8) ExtendMatch-4 69.3µs ± 2% 69.2µs ± 0% ~ (p=0.515 n=10+8)
@nigeltao The change to the Go compiler has now been merged and will expose this bug in snappy if we don't merge this fix. Can you take a look or refer someone who can? Thanks! |
See golang/snappy#56 Signed-off-by: Matthew Sykes <[email protected]>
See golang/snappy#56 Signed-off-by: Matthew Sykes <[email protected]>
See golang/snappy#56 Signed-off-by: Matthew Sykes <[email protected]>
Should there be a test for this? The following does the trick, but there's probably a smaller input: snappy.Encode(nil, []byte{0xa, 0x5c, 0xa, 0x17, 0xa, 0x8, 0x5f, 0x5f, 0x6e, 0x61, 0x6d, 0x65, 0x5f, 0x5f, 0x12, 0xb, 0x4d, 0x61, 0x74, 0x63, 0x68, 0x53, 0x65, 0x72, 0x69, 0x65, 0x73, 0xa, 0xf, 0xa, 0x2, 0x64, 0x62, 0x12, 0x9, 0x54, 0x65, 0x6e, 0x4d, 0x69, 0x6e, 0x75, 0x74, 0x65, 0xa, 0x13, 0xa, 0xb, 0x54, 0x75, 0x72, 0x62, 0x69, 0x6e, 0x65, 0x54, 0x79, 0x70, 0x65, 0x12, 0x4, 0x56, 0x31, 0x31, 0x32, 0xa, 0x9, 0xa, 0x4, 0x50, 0x61, 0x72, 0x6b, 0x12, 0x1, 0x31, 0x12, 0x10, 0x9, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0xf0, 0x3f, 0x10, 0xb5, 0xaf, 0xdc, 0xf6, 0xfa, 0x2e, 0xa, 0x5c, 0xa, 0x17, 0xa, 0x8, 0x5f, 0x5f, 0x6e, 0x61, 0x6d, 0x65, 0x5f, 0x5f, 0x12, 0xb, 0x4d, 0x61, 0x74, 0x63, 0x68, 0x53, 0x65, 0x72, 0x69, 0x65, 0x73, 0xa, 0xf, 0xa, 0x2, 0x64, 0x62, 0x12, 0x9, 0x54, 0x65, 0x6e, 0x4d, 0x69, 0x6e, 0x75, 0x74, 0x65, 0xa, 0x13, 0xa, 0xb, 0x54, 0x75, 0x72, 0x62, 0x69, 0x6e, 0x65, 0x54, 0x79, 0x70, 0x65, 0x12, 0x4, 0x56, 0x31, 0x31, 0x32, 0xa, 0x9, 0xa, 0x4, 0x50, 0x61, 0x72, 0x6b, 0x12, 0x1, 0x32, 0x12, 0x10, 0x9, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0xf0, 0x3f, 0x10, 0xb5, 0xaf, 0xdc, 0xf6, 0xfa, 0x2e, 0xa, 0x5c, 0xa, 0x17, 0xa, 0x8, 0x5f, 0x5f, 0x6e, 0x61, 0x6d, 0x65, 0x5f, 0x5f, 0x12, 0xb, 0x4d, 0x61, 0x74, 0x63, 0x68, 0x53, 0x65, 0x72, 0x69, 0x65, 0x73, 0xa, 0xf, 0xa, 0x2, 0x64, 0x62, 0x12, 0x9, 0x54, 0x65, 0x6e, 0x4d, 0x69, 0x6e, 0x75, 0x74, 0x65, 0xa, 0x13, 0xa, 0xb, 0x54, 0x75, 0x72, 0x62, 0x69, 0x6e, 0x65, 0x54, 0x79, 0x70, 0x65, 0x12, 0x4, 0x56, 0x31, 0x31, 0x32, 0xa, 0x9, 0xa, 0x4, 0x50, 0x61, 0x72, 0x6b, 0x12, 0x1, 0x33, 0x12, 0x10, 0x9, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0xf0, 0x3f, 0x10, 0xb5, 0xaf, 0xdc, 0xf6, 0xfa, 0x2e, 0xa, 0x5c, 0xa, 0x17, 0xa, 0x8, 0x5f, 0x5f, 0x6e, 0x61, 0x6d, 0x65, 0x5f, 0x5f, 0x12, 0xb, 0x4d, 0x61, 0x74, 0x63, 0x68, 0x53, 0x65, 0x72, 0x69, 0x65, 0x73, 0xa, 0xf, 0xa, 0x2, 0x64, 0x62, 0x12, 0x9, 0x54, 0x65, 0x6e, 0x4d, 0x69, 0x6e, 0x75, 0x74, 0x65, 0xa, 0x13, 0xa, 0xb, 0x54, 0x75, 0x72, 0x62, 0x69, 0x6e, 0x65, 0x54, 0x79, 0x70, 0x65, 0x12, 0x4, 0x56, 0x31, 0x31, 0x32, 0xa, 0x9, 0xa, 0x4, 0x50, 0x61, 0x72, 0x6b, 0x12, 0x1, 0x34, 0x12, 0x10, 0x9, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0xf0, 0x3f, 0x10, 0xb5, 0xaf, 0xdc, 0xf6, 0xfa, 0x2e}) It may be also good to create a new tag now that Go 1.16 is out. |
Done. Sorry for the delay. |
…11396) Pick up golang/snappy#56 to handle arm64 architectures to fix panics. tldr; Golang 1.16 changed `memmove` implementation for arm64 requiring additional cpu registers that snappy wasn't preserving in its assembly implementation. Other projects have experienced this issue as well, searching for `encode_arm64.s:666` on your favorite search engine will reveal some. Vault updated the dependency earlier this August: hashicorp/vault#12371 . I believe this issue affects Nomad 1.2.x and 1.1.x. Nomad 1.0.x use Golang 1.15 and isn't affected. However, backporting the change to 1.0.x should be harmless. Fixed #11385 .
…11396) Pick up golang/snappy#56 to handle arm64 architectures to fix panics. tldr; Golang 1.16 changed `memmove` implementation for arm64 requiring additional cpu registers that snappy wasn't preserving in its assembly implementation. Other projects have experienced this issue as well, searching for `encode_arm64.s:666` on your favorite search engine will reveal some. Vault updated the dependency earlier this August: hashicorp/vault#12371 . I believe this issue affects Nomad 1.2.x and 1.1.x. Nomad 1.0.x use Golang 1.15 and isn't affected. However, backporting the change to 1.0.x should be harmless. Fixed #11385 .
In an upcoming change to the Go runtime, the memmove function makes use of new registers that exposed this bug. If that change is merged, it will be more important to also merge this change.
This change has passed go-fuzz testing:
In encode_arm64.s, encodeBlock, two of the registers added during the port from
amd64 were not saved or restored for the memmove call. Instead of saving them,
just recalculate their values. Additionally, I made a few small changes to
improve things since I've learned a bit more about ARMv8 assembly.
The change to use the load-pair and store-pair instructions instead of the SIMD
instructions results in some modest performance improvements as meastured on
Neoverse N1 (Graviton 2).