kv95/enc=false/nodes=3/cpu=96 regression on July 17, 2023 #109443

Closed
rafiss opened this issue Aug 24, 2023 · 12 comments
Labels

- A-roachperf-investigation: Issue opened as a result of roachperf triage
- branch-master: Failures and bugs on the master branch.
- C-investigation: Further steps needed to qualify. C-label will change.
- release-blocker: Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked.
- T-testeng: TestEng Team

Comments

rafiss (Collaborator) commented Aug 24, 2023

Describe the problem

kv95/enc=false/nodes=3/cpu=96 had a regression in July. QPS dropped from ~185k down to ~150k.

To Reproduce

See https://roachperf.crdb.dev/?filter=&view=kv95%2Fenc%3Dfalse%2Fnodes%3D3%2Fcpu%3D96&tab=gce

Jira issue: CRDB-30926

rafiss added the C-investigation, A-roachperf-investigation, and T-testeng labels on Aug 24, 2023
blathers-crl bot commented Aug 24, 2023

cc @cockroachdb/test-eng

ajstorm added the release-blocker label on Aug 24, 2023
blathers-crl bot commented Aug 24, 2023

Hi @ajstorm, please add branch-* labels to identify which branch(es) this release-blocker affects.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

rafiss added the branch-master label on Aug 24, 2023
smg260 (Contributor) commented Aug 24, 2023

Bisection results:

2023-08-24T19:39:22Z   Bisecting regression in [kv95/enc=false/nodes=3/cpu=96] using commit range [c89d5868978 (known good),06b5ba78883 (known bad)]
2023-08-24T19:39:22Z   Thresholds [good >= 179410, bad <= 138725]
2023-08-24T20:00:12Z   [9a4f139c000] avg ops/s: [180089]. Auto marked as good.
2023-08-24T20:20:16Z   [8aebb2993e9] avg ops/s: [180375]. Auto marked as good.
2023-08-24T20:36:58Z   [6f420390c1f] avg ops/s: [137051]. Auto marked as bad.
2023-08-24T20:52:30Z   [e0b83fdd096] avg ops/s: [181482]. Auto marked as good.
2023-08-24T21:07:19Z   [25855acc268] avg ops/s: [182082]. Auto marked as good.
2023-08-24T21:25:45Z   [5cc456bb4c4] avg ops/s: [182696]. Auto marked as good.
2023-08-24T21:41:51Z   [dc6b2dc923a] avg ops/s: [180852]. Auto marked as good.
2023-08-24T21:59:40Z   [60e8bcec612] avg ops/s: [137160]. Auto marked as bad.
2023-08-24T21:59:40Z   Bisection complete. Suspect commit:
commit 60e8bcec61269c6648dc40c8094fbf303b8a02aa
Author: Ricky Stewart <[email protected]>
Date:   Mon Jul 17 15:09:52 2023 -0500

    bazel,dev: in `dev`, handle different `--compilation_mode`s correctly...

    ... and switch our default compilation mode to `dbg`.

    Under `fastbuild`, built binaries are stripped. This is not the case
    with `dbg`. Have `dev doctor` recommend using `dbg`, and either way
    make `dev` resilient to whatever kind of `--compilation_mode` you're
    using.

    In principle `dbg` is slower than `fastbuild`, but for the Go
    compiler it generates debuggable binaries by default and you have to
    opt in to stripping, so it should make no real difference.

    Now, we think of the `--compilation_mode` simply as follows: `dbg` is
    the default that everyone is opted into by default unless they
    explicitly set `-c opt` (we use this for release). For now I don't see
    a reason anyone would need `fastbuild`.

    Epic: CRDB-17171
    Release note: None
    Closes: #106820
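For readers skimming the log: each intermediate revision is auto-marked against the good/bad thresholds printed at the start of the run. A minimal sketch of that classification step (a hypothetical helper, not the actual bisect tooling):

```go
package main

import "fmt"

// classify mirrors the auto-marking step in the log above: a revision is
// auto-marked only when its average throughput clears one of the thresholds;
// anything in between would need a manual good/bad decision.
func classify(avgOpsPerSec, goodMin, badMax float64) string {
	switch {
	case avgOpsPerSec >= goodMin:
		return "good"
	case avgOpsPerSec <= badMax:
		return "bad"
	default:
		return "inconclusive"
	}
}

func main() {
	// Thresholds from the first bisection run: good >= 179410, bad <= 138725.
	fmt.Println(classify(180089, 179410, 138725)) // good (9a4f139c000)
	fmt.Println(classify(137051, 179410, 138725)) // bad  (6f420390c1f)
}
```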

smg260 (Contributor) commented Aug 24, 2023

I also ran this test against latest master with both the above commit (#106944) and its subsequent "partial revert" (#107207) reverted.

Result: 179597.8 ops/s, in line with the higher pre-regression throughput.

smg260 (Contributor) commented Aug 25, 2023

Not so fast...

Same code:

via gce-worker: [throughput screenshot]

via CI: [throughput screenshot]

smg260 (Contributor) commented Aug 26, 2023

The regression start date led the bisect to identify the above commit, which did cause a regression, but that commit was subsequently fixed, which means something else must have been introduced.

So we bisect again, with a new known-good starting commit (the fix for the above: f52fe97ad8) and a known-bad commit (c6f4c0ed6e3).

With --first-parent specified, the bisect follows only the first parent of merge commits, which surfaces the offending merge commit below.

2023-08-25T20:07:04Z   Bisecting regression in [kv95/enc=false/nodes=3/cpu=96] using commit range [f52fe97ad8b (known good),c6f4c0ed6e3 (known bad)]
2023-08-25T20:07:04Z   Thresholds [good >= 179696, bad <= 144555]
2023-08-25T20:24:11Z   [e23993ff117] avg ops/s: [143712]. Auto marked as bad.
2023-08-25T20:52:30Z   [b005e97c4ba] avg ops/s: [178173]. User marked as good. Threshold updated.
2023-08-25T20:52:30Z   Bisection complete. Suspect commit:

commit e23993ff117521c5bf72390c46f589b275af9747
Merge: b005e97c4ba a792cd30a86 466c4e9b709 8e4578b0e20 db4be30fa87 fb97251da55 a05d100dfae f52fe97ad8b
Author: craig[bot] <[email protected]>
Date:   Wed Jul 19 22:12:33 2023 +0000

    Merge #106860 #106912 #107099 #107170 #107189 #107197 #107207

<... truncated >

This is a relatively large batch, so we run the roachtest against each hash in it until we find the one with decreased throughput. In this case, that is a792cd30a86, at ~150k.

As a sanity check, we also ran the other hashes in the batch. The first "hump" is a792cd30a86, clearly lower than the others:

[throughput chart screenshot]

commit a792cd30a864cae879f8f91040b7394304ee3572
Author: j82w <[email protected]>
Date:   Fri Jul 14 15:03:47 2023 -0400

    ssmemstorage: improve lock contention on RecordStatement

    1. The stats object lock was being held even during the insights event
    creation and insert, which has its own lock. The insights logic is now
    moved above the stats lock.
    2. Moved latency calculation outside the stats lock.
    3. Moved error code calculation outside the stats lock.
    4. ss_mem_storage converted to use a RWMutex. Each object in the map
    has its own lock. The read lock allows multiple threads to read from the
    map at the same time, which is the common case. Writes are only done the
    first time a statement is seen.

    Fixes: #106789, Fixes: #55285

    Release note (performance improvement): Reduced lock contention on
    `ssmemstorage.RecordStatement`. This is useful for workloads that
    execute the same statement concurrently on the same SQL instance.

cc @cockroachdb/sql-foundations
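For context on what the suspect commit changes mechanically: it is the classic read-mostly pattern, where lookups take a shared read lock and the exclusive write lock is only taken the first time a statement fingerprint is seen. A minimal sketch with hypothetical names (not the actual ssmemstorage code):

```go
package ssmemsketch

import "sync"

// stmtStats is a per-statement entry with its own lock, as described in
// point 4 of the commit message above.
type stmtStats struct {
	mu    sync.Mutex
	count int64
}

// statsContainer sketches the read-mostly pattern: lookups take the shared
// read lock (the common case), and the exclusive write lock is only taken
// the first time a statement fingerprint is seen.
type statsContainer struct {
	mu    sync.RWMutex
	stats map[string]*stmtStats
}

func newStatsContainer() *statsContainer {
	return &statsContainer{stats: make(map[string]*stmtStats)}
}

func (c *statsContainer) getOrCreate(key string) *stmtStats {
	// Fast path: many goroutines can hold the read lock concurrently.
	c.mu.RLock()
	s, ok := c.stats[key]
	c.mu.RUnlock()
	if ok {
		return s
	}
	// Slow path: exclusive lock, re-checking after acquiring it because
	// another goroutine may have inserted the entry in the meantime.
	c.mu.Lock()
	defer c.mu.Unlock()
	if s, ok := c.stats[key]; ok {
		return s
	}
	s = &stmtStats{}
	c.stats[key] = s
	return s
}

// recordStatement updates the entry under its own (per-entry) lock, so the
// container-level lock is not held while recording.
func (c *statsContainer) recordStatement(key string) {
	s := c.getOrCreate(key)
	s.mu.Lock()
	s.count++
	s.mu.Unlock()
}
```

The appeal is that the common case (the same statement executed concurrently) only ever touches the shared read lock plus the per-entry mutex.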

rafiss (Collaborator, Author) commented Aug 26, 2023

Very nice investigation! a792cd3 is work from the @cockroachdb/cluster-observability team.

maryliag (Contributor) commented:
That's curious, especially because the benchmark done on that commit shows performance improvements:
#106860

cc @j82w

srosenberg added a commit to srosenberg/cockroach that referenced this issue Aug 27, 2023
An experiment using drwmutex [1] to reduce read-lock contention
on 96 vCPUs, as observed in [2]. The final run of
`kv95/enc=false/nodes=3/cpu=96` exhibited an average
throughput of 173413 ops/sec. That's worse than the implementation
without RWMutex. It appears that the read lock, as implemented by
Go's runtime, scales poorly to a high number of vCPUs [3].
On the other hand, the write lock under drwmutex requires
acquiring 96 locks in this case, which appears to be the only
bottleneck; the sharded read lock is cheap enough that it
doesn't show up on the cpu profile. The only slowdown
appears to be the write lock inside getStatsForStmtWithKeySlow,
which is unavoidable. Although inconclusive, it appears that
drwmutex doesn't scale well above a certain number of vCPUs
when the write mutex is on a critical path.

[1] https://github.com/jonhoo/drwmutex
[2] cockroachdb#109443
[3] golang/go#17973

Epic: none

Release note: None
srosenberg (Member) commented Aug 27, 2023

> That's curious, especially because the benchmark done on that commit shows performance improvements

Unfortunately, this is a great example of where microbenchmarks are insufficient. They are great for establishing a relative change (w.r.t. a baseline), but we can't rely on them to tell us about absolute performance. Below is a summary of my investigation.

Investigation

Below, we refer to a792cd3 as the bad run; the good run is f52fe97a. (I've simply cherry-picked the bad commit on top of the good one.)

CPU Profile

First, the CPU profiles didn't yield any obvious insights (e.g., that the system is overloaded). However, we can clearly see that the bad run burns ~10% more CPU. Given the context of the change in that commit, that's a hint that the additional CPU cycles are spent acquiring mutexes.

Good run (grafana link):

[cpu_good_run screenshot]

Bad run (grafana link):

[cpu_bad_run screenshot]

Mutex Profile

Indeed, the mutex profiles are very telling.

Good run:

[mutex_good_run_flame screenshot]

Bad run:

[mutex_bad_run_flame screenshot]
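As an aside for anyone reproducing this outside a cluster: Go only records mutex contention when the mutex profile fraction is set. A minimal standalone sketch of collecting such a profile (this is not how the cluster profiles above were gathered):

```go
package main

import (
	"os"
	"runtime"
	"runtime/pprof"
	"sync"
)

func main() {
	// Sample every mutex contention event (rate 1).
	runtime.SetMutexProfileFraction(1)

	// Generate some contention on a single mutex.
	var mu sync.Mutex
	var wg sync.WaitGroup
	for i := 0; i < 64; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for j := 0; j < 1000; j++ {
				mu.Lock()
				mu.Unlock()
			}
		}()
	}
	wg.Wait()

	// Write the accumulated contention profile; inspect it with `go tool pprof`.
	f, err := os.Create("mutex.pb.gz")
	if err != nil {
		panic(err)
	}
	defer f.Close()
	if err := pprof.Lookup("mutex").WriteTo(f, 0); err != nil {
		panic(err)
	}
}
```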

RWMutex Scales Poorly with CPU Count

The fact that sync.RWMutex scales poorly with CPU count is a known issue [1]. It also explains why only kv95/enc=false/nodes=3/cpu=96 and kv95/enc=false/nodes=3/cpu=32 regressed, and not kv95/enc=false/nodes=3. (The latter uses 8 vCPUs.)
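A quick way to see the scaling effect locally is a throwaway benchmark along these lines (hypothetical, not part of the repo); every RLock/RUnlock still performs an atomic update on a shared reader count, so the contended per-op cost climbs with parallelism:

```go
package rwscale

import (
	"sync"
	"testing"
)

// BenchmarkRWMutexReadLock measures contended RLock/RUnlock throughput.
// Running it with increasing -cpu values (e.g. `go test -bench . -cpu 8,32,96`)
// shows the per-op cost climbing with parallelism, even though the critical
// section is a single read-only map lookup.
func BenchmarkRWMutexReadLock(b *testing.B) {
	var mu sync.RWMutex
	m := map[string]int{"stmt": 1}
	b.RunParallel(func(pb *testing.PB) {
		for pb.Next() {
			mu.RLock()
			_ = m["stmt"]
			mu.RUnlock()
		}
	})
}
```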

DRWMutex

I've experimented with drwmutex [2], which allegedly scales better with CPU count. Roughly speaking, it shards the read lock across all available vCPUs. It does so by using x86 assembly that invokes CPUID to obtain the index of the vCPU where the current goroutine is running. Unfortunately, that doesn't come free in terms of overhead. I ended up replacing it with an atomic counter that effectively round-robins and only invokes CPUID 1/num_cpus of the time. While that reduces the overhead of the read lock, the write lock is also expensive: it must acquire a mutex for each vCPU. I was able to get a better average throughput (173413 ops/sec) than the bad run, but it's still worse than the good run. You can examine the changes in [3].
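Roughly, the shape of that experiment looks like the following sketch (heavily simplified, with hypothetical names; the actual branch is in [3], and unlike this sketch it still falls back to CPUID occasionally):

```go
package drwsketch

import (
	"runtime"
	"sync"
	"sync/atomic"
)

// shardedRWMutex approximates the variant described above: readers are
// spread across per-vCPU shards (picked round-robin via an atomic counter
// instead of CPUID), while a writer must acquire every shard.
type shardedRWMutex struct {
	next   atomic.Uint64
	shards []sync.RWMutex
}

func newShardedRWMutex() *shardedRWMutex {
	return &shardedRWMutex{shards: make([]sync.RWMutex, runtime.GOMAXPROCS(0))}
}

// RLock picks a shard and returns its index so the caller can RUnlock it.
func (m *shardedRWMutex) RLock() int {
	i := int(m.next.Add(1) % uint64(len(m.shards)))
	m.shards[i].RLock()
	return i
}

func (m *shardedRWMutex) RUnlock(i int) { m.shards[i].RUnlock() }

// Lock acquires all shards; with 96 shards this is exactly the write-side
// cost that shows up in getStatsForStmtWithKeySlow.
func (m *shardedRWMutex) Lock() {
	for i := range m.shards {
		m.shards[i].Lock()
	}
}

func (m *shardedRWMutex) Unlock() {
	for i := range m.shards {
		m.shards[i].Unlock()
	}
}
```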

It appears we've hit the wall with the optimization of splitting the mutex into read/write locks in cases where the vCPU count is greater than 16. (tpccbench/nodes=3/cpu=16 didn't regress either, so I am assuming we're good with 16 or fewer cores.) The mutex profile with drwmutex shows the overhead is down to a single acquire in getStatsForStmtWithKeySlow, which is unavoidable.

[drwmutex_lock_flame screenshot]

(Note, I've renamed getStatsForStmtWithKeyLocked to getStatsForStmtWithKeySlow in [3].)

Next Steps

I think we need to determine how much performance is gained by the read/write lock split for <= 16 vCPUs. If it's significant, an adaptive approach could work, i.e., fall back to sync.Mutex when the vCPU count is > 16 (a sketch of that fallback follows the references below). Ultimately, we need a better atomic map implementation [4], because locks are not without overhead, especially as core density increases.

[1] golang/go#17973
[2] https://github.com/jonhoo/drwmutex
[3] https://github.com/srosenberg/cockroach/tree/sr/drwmutex_109443
[4] golang/go#54720
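A minimal sketch of that adaptive fallback (hypothetical names; the 16-vCPU cutoff is an assumption that would have to be validated by the benchmarking mentioned above):

```go
package adaptivesketch

import (
	"runtime"
	"sync"
)

// rwLocker is the minimal surface both strategies share.
type rwLocker interface {
	Lock()
	Unlock()
	RLock()
	RUnlock()
}

// plainMutex degrades read locks to exclusive locks on top of sync.Mutex.
type plainMutex struct{ sync.Mutex }

func (m *plainMutex) RLock()   { m.Lock() }
func (m *plainMutex) RUnlock() { m.Unlock() }

// rwMutexCutoff is an assumed crossover point, not a measured one.
const rwMutexCutoff = 16

// newLocker picks the strategy once, at construction time: keep sync.RWMutex
// where the read/write split helps, and fall back to a plain mutex on
// high-core-count machines where RWMutex's reader count becomes a hotspot.
func newLocker() rwLocker {
	if runtime.NumCPU() > rwMutexCutoff {
		return &plainMutex{}
	}
	return &sync.RWMutex{}
}
```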

j82w (Contributor) commented Aug 28, 2023

@srosenberg should I revert the PR?

srosenberg (Member) commented:

> @srosenberg should I revert the PR?

Yep, let's revert and add an annotation to roachperf. Could you also open a follow-up issue, so we can close this one out?

srosenberg (Member) commented:

I've created a new issue ^^^ as a follow-up.
