Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

storage: performance regression in EncodeMVCCKey and EncodeMVCCValue #88818

Closed
nicktrav opened this issue Sep 27, 2022 · 8 comments · Fixed by #88911 or #88989
Closed

storage: performance regression in EncodeMVCCKey and EncodeMVCCValue #88818

nicktrav opened this issue Sep 27, 2022 · 8 comments · Fixed by #88911 or #88989
Assignees
Labels
A-storage Relating to our storage engine (Pebble) on-disk storage. branch-release-22.2 Used to mark GA and release blockers, technical advisories, and bugs for 22.2 C-performance Perf of queries or internals. Solution not expected to change functional behavior. GA-blocker T-storage Storage Team

Comments

@nicktrav
Copy link
Collaborator

nicktrav commented Sep 27, 2022

Describe the problem

Currently, on the 22.2 release branch there is a sizable performance regression in EncodeMVCCKey, relative to 22.1:

$ benchstat bench-22.1 bench-22.2
name                                            old time/op  new time/op   delta
EncodeMVCCKey/key=empty/ts=empty-16             9.85ns ± 3%  14.73ns ± 1%  +49.49%  (p=0.000 n=9+7)
EncodeMVCCKey/key=empty/ts=walltime-16          11.7ns ± 2%   12.3ns ± 4%   +5.10%  (p=0.000 n=10+10)
EncodeMVCCKey/key=empty/ts=walltime+logical-16  12.3ns ± 3%   12.6ns ± 4%     ~     (p=0.072 n=10+10)
EncodeMVCCKey/key=empty/ts=all-16               12.5ns ± 2%   12.7ns ± 4%     ~     (p=0.315 n=8+10)
EncodeMVCCKey/key=short/ts=all-16               13.0ns ± 2%   12.8ns ± 4%     ~     (p=0.247 n=10+10)
EncodeMVCCKey/key=short/ts=empty-16             10.3ns ± 4%   15.0ns ± 3%  +45.29%  (p=0.000 n=10+10)
EncodeMVCCKey/key=short/ts=walltime-16          11.9ns ± 1%   12.7ns ± 4%   +6.61%  (p=0.000 n=9+10)
EncodeMVCCKey/key=short/ts=walltime+logical-16  12.7ns ± 4%   12.8ns ± 4%     ~     (p=0.579 n=10+10)
EncodeMVCCKey/key=long/ts=all-16                36.7ns ± 2%   37.8ns ± 4%   +3.08%  (p=0.013 n=9+10)
EncodeMVCCKey/key=long/ts=empty-16              36.1ns ± 4%   39.2ns ± 2%   +8.63%  (p=0.000 n=10+10)
EncodeMVCCKey/key=long/ts=walltime-16           37.1ns ± 4%   38.0ns ± 2%   +2.41%  (p=0.015 n=10+10)
EncodeMVCCKey/key=long/ts=walltime+logical-16   37.8ns ± 3%   37.8ns ± 3%     ~     (p=0.353 n=10+10)

The above was generated by comparing benchmark results between 22.1 (9eb4da2) to 22.2 (054f32b), via the following:

$ ./dev bench ./pkg/storage --filter EncodeMVCCKey --count 10 -v --stream-output

Jira issue: CRDB-19971

Epic CRDB-2624

@nicktrav nicktrav added C-performance Perf of queries or internals. Solution not expected to change functional behavior. A-storage Relating to our storage engine (Pebble) on-disk storage. T-storage Storage Team labels Sep 27, 2022
@nicktrav nicktrav self-assigned this Sep 27, 2022
@nicktrav
Copy link
Collaborator Author

I did some digging on this - I started with the empty/empty case, as that was the one showing the most regression. Below are CPU profiles for this benchmark from both release-22.1 and release-22.2:

Screen Shot 2022-09-27 at 7 43 39 AM

Screen Shot 2022-09-27 at 7 43 44 AM

We're spending more time in memeqbody, which is a Go runtime method.

If I remove the Go 1.19 patch (2675c7c), I see the following delta, which shows gains in the opposite direction (i.e. performance improved):

$ benchstat bench-22.1 bench-22.2-nogo
name                                            old time/op  new time/op  delta
EncodeMVCCKey/key=empty/ts=empty-16             9.85ns ± 3%  9.82ns ± 3%    ~     (p=0.887 n=9+10)
EncodeMVCCKey/key=empty/ts=walltime-16          11.7ns ± 2%  10.8ns ± 3%  -7.39%  (p=0.000 n=10+10)
EncodeMVCCKey/key=empty/ts=walltime+logical-16  12.3ns ± 3%  11.5ns ± 5%  -6.05%  (p=0.000 n=10+10)
EncodeMVCCKey/key=empty/ts=all-16               12.5ns ± 2%  11.5ns ± 3%  -8.09%  (p=0.000 n=8+10)
EncodeMVCCKey/key=short/ts=all-16               13.0ns ± 2%  11.8ns ± 1%  -9.01%  (p=0.000 n=10+9)
EncodeMVCCKey/key=short/ts=empty-16             10.3ns ± 4%  10.2ns ± 2%    ~     (p=0.233 n=10+9)
EncodeMVCCKey/key=short/ts=walltime-16          11.9ns ± 1%  11.3ns ± 1%  -5.12%  (p=0.000 n=9+7)
EncodeMVCCKey/key=short/ts=walltime+logical-16  12.7ns ± 4%  11.9ns ± 2%  -6.38%  (p=0.000 n=10+8)
EncodeMVCCKey/key=long/ts=all-16                36.7ns ± 2%  36.2ns ± 3%    ~     (p=0.161 n=9+10)
EncodeMVCCKey/key=long/ts=empty-16              36.1ns ± 4%  34.0ns ± 4%  -5.65%  (p=0.000 n=10+10)
EncodeMVCCKey/key=long/ts=walltime-16           37.1ns ± 4%  35.3ns ± 3%  -4.80%  (p=0.000 n=10+9)
EncodeMVCCKey/key=long/ts=walltime+logical-16   37.8ns ± 3%  36.2ns ± 4%  -4.32%  (p=0.000 n=10+10)

Probably worth digging into this further, despite the empty/empty case being uncommon in production.

cc: @erikgrinaker

@erikgrinaker
Copy link
Contributor

Working on a fix, which may give a nice boost over 22.1.

@blathers-crl
Copy link

blathers-crl bot commented Sep 28, 2022

Hi @erikgrinaker, please add branch-* labels to identify which branch(es) this release-blocker affects.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan.

@nicktrav
Copy link
Collaborator Author

Thanks for digging into this @erikgrinaker!

Even though this is closed, I was curious as to why this was slower on go1.19, so I kept digging. Here's how Go is compiling ts == Timestamp{}, in go1.18

TEXT test.Timestamp.IsEmpty(SB) /Users/nickt/Downloads/test/main.go
        return t == Timestamp{}
  0x10ef080             4885c0                  TESTQ AX, AX
  0x10ef083             7509                    JNE 0x10ef08e
  0x10ef085             85db                    TESTL BX, BX
  0x10ef087             7505                    JNE 0x10ef08e
  0x10ef089             83f101                  XORL $0x1, CX
  0x10ef08c             eb02                    JMP 0x10ef090
  0x10ef08e             31c9                    XORL CX, CX
  0x10ef090             89c8                    MOVL CX, AX
  0x10ef092             c3                      RET

  0x10ef093             cc                      INT $0x3
  0x10ef094             cc                      INT $0x3
  0x10ef095             cc                      INT $0x3
  0x10ef096             cc                      INT $0x3
  0x10ef097             cc                      INT $0x3
  0x10ef098             cc                      INT $0x3
  0x10ef099             cc                      INT $0x3
  0x10ef09a             cc                      INT $0x3
  0x10ef09b             cc                      INT $0x3
  0x10ef09c             cc                      INT $0x3
  0x10ef09d             cc                      INT $0x3
  0x10ef09e             cc                      INT $0x3
  0x10ef09f             cc                      INT $0x3

TEXT test.makeTS(SB) /Users/nickt/Downloads/test/main.go
        return Timestamp{WallTime: walltime, Logical: logical}
  0x10ef0a0             31c9                    XORL CX, CX
  0x10ef0a2             c3                      RET

And in go1.19:

TEXT test.Timestamp.IsEmpty(SB) /Users/nickt/Downloads/test/main.go
func (t Timestamp) IsEmpty() bool {
  0x10f4a60             493b6610                CMPQ 0x10(R14), SP
  0x10f4a64             764f                    JBE 0x10f4ab5
  0x10f4a66             4883ec40                SUBQ $0x40, SP
  0x10f4a6a             48896c2438              MOVQ BP, 0x38(SP)
  0x10f4a6f             488d6c2438              LEAQ 0x38(SP), BP
        return t == Timestamp{}
  0x10f4a74             4889442428              MOVQ AX, 0x28(SP)
  0x10f4a79             895c2430                MOVL BX, 0x30(SP)
  0x10f4a7d             884c2434                MOVB CL, 0x34(SP)
  0x10f4a81             48c744241800000000      MOVQ $0x0, 0x18(SP)
  0x10f4a8a             c744242000000000        MOVL $0x0, 0x20(SP)
  0x10f4a92             c644242400              MOVB $0x0, 0x24(SP)
  0x10f4a97             488d442428              LEAQ 0x28(SP), AX
  0x10f4a9c             488d5c2418              LEAQ 0x18(SP), BX
  0x10f4aa1             b90d000000              MOVL $0xd, CX
  0x10f4aa6             e8f5e4f0ff              CALL runtime.memequal(SB)
  0x10f4aab             488b6c2438              MOVQ 0x38(SP), BP
  0x10f4ab0             4883c440                ADDQ $0x40, SP
  0x10f4ab4             c3                      RET
func (t Timestamp) IsEmpty() bool {
  0x10f4ab5             4889442408              MOVQ AX, 0x8(SP)
  0x10f4aba             895c2410                MOVL BX, 0x10(SP)
  0x10f4abe             884c2414                MOVB CL, 0x14(SP)
  0x10f4ac2             e839eff6ff              CALL runtime.morestack_noctxt.abi0(SB)
  0x10f4ac7             488b442408              MOVQ 0x8(SP), AX
  0x10f4acc             8b5c2410                MOVL 0x10(SP), BX
  0x10f4ad0             0fb64c2414              MOVZX 0x14(SP), CX
  0x10f4ad5             eb89                    JMP test.Timestamp.IsEmpty(SB)

  0x10f4ad7             cc                      INT $0x3
  0x10f4ad8             cc                      INT $0x3
  0x10f4ad9             cc                      INT $0x3
  0x10f4ada             cc                      INT $0x3
  0x10f4adb             cc                      INT $0x3
  0x10f4adc             cc                      INT $0x3
  0x10f4add             cc                      INT $0x3
  0x10f4ade             cc                      INT $0x3
  0x10f4adf             cc                      INT $0x3

TEXT test.makeTS(SB) /Users/nickt/Downloads/test/main.go
        return Timestamp{WallTime: walltime, Logical: logical}
  0x10f4ae0             31c9                    XORL CX, CX
  0x10f4ae2             c3                      RET

The latter calls runtime.memequal(SB), which will then call _memeqbody(SB), which is what was showing up in the profiles. I assume the ASM from the former would be similar to the ASM from what we have now, after #88911 (essentially testing each field, as opposed to moving registers around and calling runtime.memequal).

Curious as to why the compiler wants to take this approach in newer versions. It's significantly slower.

Are we worried that this will crop up elsewhere?

@nicktrav
Copy link
Collaborator Author

Looks like this is no longer an issue upstream. I ran with the tip of the Go compiler and the ASM is back to what it was.

Upstream CL that introduced the regression (present in go1.19) is this one.

Upstream CL that fixed the regression (not present on the go1.19 release branch) is this one.

@erikgrinaker
Copy link
Contributor

Thanks for getting to the bottom of this @nicktrav, this makes sense.

Seems like the upstream fix won't make it into the 1.19 line. I'm not super-worried about this, since roachperf benchmarks aren't showing any significant regressions over 22.1. I also did a simple regex-based search for empty struct comparisons, and none of them appeared to be in particularly hot paths where this would matter. However, it might be worth doing an AST search if there are any decent tools for it.

I happen to know of one site that might matter though, in EncodeMVCCValue -- I'll run a quick benchmark and see.

if v.MVCCValueHeader == emptyValueHeader {

@erikgrinaker
Copy link
Contributor

Same effect on EncodeMVCCValue, as expected. This one's a bit more annoying to fix due to some test metamorphism and inlining constraints, but I'll see what I can do.

name                                                              old time/op  new time/op  delta
EncodeMVCCValue/header=empty/value=tombstone-24                   5.93ns ± 0%  4.73ns ± 0%  -20.31%  (p=0.000 n=10+10)
EncodeMVCCValue/header=empty/value=short-24                       5.93ns ± 0%  4.83ns ± 3%  -18.58%  (p=0.000 n=9+9)
EncodeMVCCValue/header=empty/value=long-24                        5.93ns ± 0%  4.72ns ± 0%  -20.40%  (p=0.000 n=10+10)
EncodeMVCCValue/header=local_walltime/value=short-24              54.5ns ± 0%  55.3ns ± 1%   +1.44%  (p=0.000 n=9+10)
EncodeMVCCValue/header=local_walltime/value=long-24               1.36µs ± 2%  1.35µs ± 1%     ~     (p=0.239 n=10+10)
EncodeMVCCValue/header=local_walltime/value=tombstone-24          52.2ns ± 1%  52.9ns ± 1%   +1.22%  (p=0.000 n=10+10)
EncodeMVCCValue/header=local_walltime+logical/value=tombstone-24  57.0ns ± 1%  56.7ns ± 1%   -0.57%  (p=0.003 n=10+10)
EncodeMVCCValue/header=local_walltime+logical/value=short-24      59.7ns ± 2%  60.2ns ± 2%     ~     (p=0.085 n=10+10)
EncodeMVCCValue/header=local_walltime+logical/value=long-24       1.34µs ± 2%  1.34µs ± 4%     ~     (p=0.924 n=10+10)

@erikgrinaker
Copy link
Contributor

Reopening this to track EncodeMVCCValue as well.

@erikgrinaker erikgrinaker reopened this Oct 3, 2022
@erikgrinaker erikgrinaker changed the title storage: performance regression in EncodeMVCCKey storage: performance regression in EncodeMVCCKey and EncodeMVCCValue Oct 3, 2022
@craig craig bot closed this as completed in d36efd1 Oct 3, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-storage Relating to our storage engine (Pebble) on-disk storage. branch-release-22.2 Used to mark GA and release blockers, technical advisories, and bugs for 22.2 C-performance Perf of queries or internals. Solution not expected to change functional behavior. GA-blocker T-storage Storage Team
Projects
None yet
2 participants