kv95/enc=false/nodes=3/cpu=96 regression on July 17, 2023 #109443
cc @cockroachdb/test-eng
Hi @ajstorm, please add branch-* labels to identify which branch(es) this release-blocker affects. 🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.
Bisection results: …
The start regression date led the bisect to identify the above commit, which did cause a regression, but that commit was subsequently fixed, which means something else was introduced. So we bisect again, but with a new starting good commit (the fix for the above). Specifying …
This is a relatively large batch, and we run the roachtest against each hash until we find the one with decreased throughput. In this case, … For sanity, we check other hashes in the batch. The first "hump" is …
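(Aside: this kind of search can be automated with `git bisect run`. Below is a minimal sketch in Go; the `run-kv95.sh` wrapper and the 170k ops/sec threshold are hypothetical stand-ins for whatever harness actually drove the roachtest here.)

```go
// bisect_check.go: a minimal sketch of a "git bisect run" predicate.
// Assumptions (not from this issue): a wrapper script ./run-kv95.sh that
// builds the current checkout, runs kv95/enc=false/nodes=3/cpu=96 once,
// and prints average ops/sec as its last output field; 170000 as the
// threshold separating "good" (~185k) from "bad" (~150k) throughput.
package main

import (
	"fmt"
	"os"
	"os/exec"
	"strconv"
	"strings"
)

func main() {
	out, err := exec.Command("./run-kv95.sh").Output()
	if err != nil {
		os.Exit(125) // exit 125 tells git bisect to skip commits that fail to build/run
	}
	fields := strings.Fields(string(out))
	if len(fields) == 0 {
		os.Exit(125)
	}
	opsPerSec, err := strconv.ParseFloat(fields[len(fields)-1], 64)
	if err != nil {
		os.Exit(125)
	}
	fmt.Printf("measured %.0f ops/sec\n", opsPerSec)
	if opsPerSec < 170000 {
		os.Exit(1) // below threshold: mark this commit "bad"
	}
	os.Exit(0) // at or above threshold: mark it "good"
}
```

Usage would be along the lines of `git bisect start <bad-sha> <good-sha>` followed by `git bisect run go run ./bisect_check.go`.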
cc @cockroachdb/sql-foundations
Very nice investigation! a792cd3 is work from the @cockroachdb/cluster-observability team.
An experiment using drwmutex [1] to speed up read-lock contention on 96 vCPUs, as observed in [2]. The final run of `kv95/enc=false/nodes=3/cpu=96` exhibited an average throughput of 173413 ops/sec. That's worse than the implementation without RWMutex. It appears that the read lock, as implemented by Go's runtime, scales poorly to a high number of vCPUs [3]. On the other hand, the write lock under drwmutex requires acquiring 96 locks in this case, which appears to be the only bottleneck; the sharded read lock is optimal enough that it doesn't show up on the CPU profile. The only slowdown appears to be the write lock inside `getStatsForStmtWithKeySlow`, which is unavoidable. Although inconclusive, it appears that drwmutex doesn't scale well above a certain number of vCPUs when the write mutex is on a critical path.

[1] https://github.com/jonhoo/drwmutex
[2] cockroachdb#109443
[3] golang/go#17973

Epic: none
Release note: None
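For context, here is a minimal sketch of the drwmutex idea (not the actual library code): shard the read side of an RWMutex so readers don't all contend on one mutex word, at the price of a write lock that must take every shard. This sketch assumes round-robin shard selection rather than the CPUID-based pinning the real library uses.

```go
// Package drwmutex is a minimal sketch of a distributed/sharded RWMutex.
package drwmutex

import (
	"runtime"
	"sync"
	"sync/atomic"
)

// paddedRWMutex pads each shard out toward its own cache line so that
// neighboring shards don't false-share.
type paddedRWMutex struct {
	sync.RWMutex
	_ [40]byte
}

// DRWMutex distributes the read side of an RWMutex across per-CPU shards.
type DRWMutex struct {
	shards []paddedRWMutex
	next   atomic.Uint64
}

func New() *DRWMutex {
	return &DRWMutex{shards: make([]paddedRWMutex, runtime.GOMAXPROCS(0))}
}

// RLocker hands the caller one shard; a reader takes only that shard's
// read lock, so readers spread out instead of hammering a single mutex.
// (Round-robin keeps the sketch simple at the cost of cache locality.)
func (m *DRWMutex) RLocker() *sync.RWMutex {
	i := m.next.Add(1) % uint64(len(m.shards))
	return &m.shards[i].RWMutex
}

// Lock acquires the write lock on every shard. On a 96-vCPU machine that
// is 96 mutex acquisitions per write -- the write-side cost described above.
func (m *DRWMutex) Lock() {
	for i := range m.shards {
		m.shards[i].Lock()
	}
}

func (m *DRWMutex) Unlock() {
	for i := len(m.shards) - 1; i >= 0; i-- {
		m.shards[i].Unlock()
	}
}
```

A reader must unlock the same shard it locked, e.g. `l := m.RLocker(); l.RLock(); defer l.RUnlock()`, which is why the sketch returns the shard rather than exposing bare RLock/RUnlock methods.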
Unfortunately, this is a great example where microbenchmarks are insufficient. They are great for establishing a relative change (wrt baseline), but we can't rely on them to tell us about absolute performance. Below is a summary of my investigation.

Investigation

Below, we refer to a792cd3 as the … commit.

CPU Profile

First, the CPU profiles didn't yield any obvious insights (e.g., system is overloaded). However, we can clearly see that the … Good run (grafana link), bad run (grafana link).

Mutex Profile

Indeed, the mutex profiles are very telling. Good run, … Bad run, …

RWMutex Scales Poorly with CPU Count

The fact that …

DRWMutex

I've experimented with … It appears we've hit the wall with the optimization to split the mutex into read/write locks, in the cases where vCPU count > 16 cores. (Note, I've renamed ….)

Next Steps

I think we need to determine how much performance is gained by read/write locks for <= 16 vCPUs. If it's significant, an adaptive approach could work, i.e., fall back to …

[1] golang/go#17973
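(The mutex profiles referenced above come from Go's built-in contention profiler. A minimal, self-contained sketch of turning it on and dumping a profile follows; nothing in it is CockroachDB-specific, and the workload is just a stand-in: many goroutines taking read locks on a single `sync.RWMutex`, as in the bad run.)

```go
// Sketch: enable Go's mutex contention profiling and write a profile.
package main

import (
	"os"
	"runtime"
	"runtime/pprof"
	"sync"
	"time"
)

func main() {
	// Sample roughly 1 in 5 contention events (0 disables the profiler).
	runtime.SetMutexProfileFraction(5)

	var mu sync.RWMutex
	var wg sync.WaitGroup
	for i := 0; i < runtime.GOMAXPROCS(0); i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			// Hammer the shared read lock for one second per goroutine.
			for start := time.Now(); time.Since(start) < time.Second; {
				mu.RLock()
				mu.RUnlock()
			}
		}()
	}
	wg.Wait()

	// Dump the accumulated profile; inspect with `go tool pprof mutex.pb.gz`.
	f, err := os.Create("mutex.pb.gz")
	if err != nil {
		panic(err)
	}
	defer f.Close()
	if err := pprof.Lookup("mutex").WriteTo(f, 0); err != nil {
		panic(err)
	}
}
```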
@srosenberg should I revert the PR?
Yep, let's revert and add an annotation to …
I've created a new issue ^^^ as a follow-up.
Describe the problem
kv95/enc=false/nodes=3/cpu=96 had a regression in July. QPS dropped from ~185k down to ~150k.

To Reproduce
See https://roachperf.crdb.dev/?filter=&view=kv95%2Fenc%3Dfalse%2Fnodes%3D3%2Fcpu%3D96&tab=gce
Jira issue: CRDB-30926