roachperf: kv95 and kv0 regression around October 28th #72499

Some versions of `kv95` and `kv0` saw a ~10% performance regression around October 28th, e.g. `kv95/enc=false/nodes=1/cpu=32`. This is even more pronounced with 64KB payloads, e.g. `kv95/enc=false/nodes=3/cpu=32/size=64kb`. Admission control was enabled by default in #68535, which merged on October 28th, so that seems like a likely culprit. Running a couple of benchmarks to confirm.

Comments
Confirmed, comparing 3207354 (admission control disabled) with 7e5741b (admission control enabled) shows a ~20% ops/sec regression on
@sumeerbhola Is this regression expected and accepted with admission control?

cc @sumeerbhola
The regression should only occur in cases where the LSM was degrading prior to admission control; admission control is not expected to reduce throughput if the bottleneck was CPU. If one clicks through to the details of a run and sees something like the following, where the rate fluctuates, it is because under LSM overload the tokens get adjusted every 15s: the store is unhealthy for 15s, so fewer tokens are granted, which lets it transition back to healthy with unconstrained tokens, and then the cycle repeats. Admission control only affects writes in this case, but since we are issuing a sequence of reads and writes from the same workers, reducing write throughput also reduces read throughput. The corresponding no-admission roachtest has a higher mean rate, but note the decline in rate over time as the LSM becomes more and more unhealthy. That is not a sustainable rate, but since we only run for 10min we have overlooked this problem. You should see the same thing with the kv0 slowdowns. But this is not the case with …
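To illustrate the oscillation described above, here is a toy sketch of a 15-second token re-adjustment loop. This is not the actual `admission` package code; the health check and write-rate inputs are stand-in placeholders.

```go
package main

import (
	"fmt"
	"math"
	"math/rand"
	"time"
)

func main() {
	// Stand-ins for real store metrics; in a real system these would be
	// derived from LSM health signals and observed write rates.
	lsmUnhealthy := func() bool { return rand.Intn(2) == 0 }
	recentWriteRate := func() float64 { return 10000 }

	ticker := time.NewTicker(15 * time.Second)
	defer ticker.Stop()

	for range ticker.C {
		var tokensPerSec float64
		if lsmUnhealthy() {
			// Unhealthy interval: grant fewer tokens than the recent write
			// rate so the LSM can compact and recover.
			tokensPerSec = 0.5 * recentWriteRate()
		} else {
			// Healthy interval: effectively unconstrained tokens, so writes
			// ramp back up until the LSM degrades again -- hence the
			// oscillating throughput graph.
			tokensPerSec = math.MaxFloat64
		}
		fmt.Printf("granting %.0f write tokens/sec for the next 15s\n", tokensPerSec)
	}
}
```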
Interesting find. This seems fine then, since presumably running for another 10 minutes could have shown an overall improvement when admission control is enabled.
Interesting, I'll run another bisection and see if something turns up. Thanks!

In light of this, should we consider running these perf tests for longer than ten minutes (say thirty minutes)? This will completely shift the graphs we see (since they use total qps or something like that), but it may give us better data in the long run.

Yeah, 30 minutes makes sense. That probably shouldn't be the default though, since people will run these from the command line as well, but for the nightlies that seems reasonable. I'll submit a PR.

Thanks! You don't think that approach will be confusing to people trying to repro the numbers? I would've thought 30m would be the default, but it should be "easy" to lower that to 10 minutes when desired, perhaps through an env var (I know you love those!) or a small code change (though code changes suck for bisecting).
OK, we can make 30 the default. |
wowza, ack, okay |
I've been benchmarking and tuning this. Indeed #71936 caused a good bit of the pain. However, I don't think it was all 10%; I think it was closer to 7%, and another bit was caused by admission control. I went and tuned the catalog layer more in #72740 such that I think it makes up for both, but I see that the admission control code now shows up prominently in the mutex profile. Turning off admission control seems to show about a 3-4% improvement in that benchmark. My guess is that we should be running that benchmark at a higher concurrency, at which point I suspect it'd show more. I've run the benchmark with admission control off and on with the patch in #72740 and the results are around …
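For anyone trying to reproduce the mutex-profile observation, here is a minimal standard-library sketch of enabling and dumping Go's mutex profile from a standalone binary. The sampling rate and the contended workload are arbitrary placeholders, not what was used above.

```go
package main

import (
	"log"
	"os"
	"runtime"
	"runtime/pprof"
	"sync"
	"time"
)

func main() {
	// Sample roughly 1 in 5 mutex contention events (0 disables collection).
	// The rate here is an arbitrary choice for illustration.
	runtime.SetMutexProfileFraction(5)

	// Generate some contention so the profile has something to show.
	var mu sync.Mutex
	var wg sync.WaitGroup
	for i := 0; i < 8; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for j := 0; j < 1000; j++ {
				mu.Lock()
				time.Sleep(100 * time.Microsecond)
				mu.Unlock()
			}
		}()
	}
	wg.Wait()

	// Write the accumulated mutex profile; inspect it with `go tool pprof`.
	f, err := os.Create("mutex.pb.gz")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()
	if err := pprof.Lookup("mutex").WriteTo(f, 0); err != nil {
		log.Fatal(err)
	}
}
```

Note that the profile accumulates contention from the point profiling was enabled, which is relevant to the discussion below about what time window the profile covers.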
72669: bazel: properly generate `.eg.go` code in `pkg/sql/colconv` via bazel r=rail a=rickystewart

Release note: None

72714: sql/catalog/lease: permit gaps in descriptor history r=ajwerner a=ajwerner

In #71239, we added a new mechanism to look up historical descriptors. I erroneously informed @jameswsj10 that we would never have gaps in the descriptor history and thus, when looking up historical descriptors, we could always use the earliest descriptor's modification time as the bounds for the relevant query. This turns out not to be true. Consider the case where version 3 is a historical version and then version 4 pops up and gets leased. Version 3 will get removed if it is not referenced. In the meantime, version 3 existed when we went to find version 2. At that point, we'll inject version 2 and have version 4 leased. We need to make sure we can handle the case where we need to go fetch version 3.

This change also removes some logic added to support the eventual resurrection of #59606, whereby we'll use the export request to fetch descriptor history to power historical queries even in the face of descriptors having been deleted.

Fixes #72706.

Release note: None

72740: sql/catalog/descs: fix perf regression r=ajwerner a=ajwerner

The commit in #71936 had the unfortunate side effect of allocating and forcing reads on the `uncommittedDescriptors` set even when we aren't looking for the system database. This has an outsized impact on the performance of the single-node, high-core-count KV runs. Instead of always initializing the system database, just do it when we access it.

```
name             old ops/s   new ops/s   delta
KV95-throughput  88.6k ± 0%  94.8k ± 1%  +7.00%  (p=0.008 n=5+5)

name      old ms/s   new ms/s   delta
KV95-P50  1.60 ± 0%  1.40 ± 0%  -12.50%  (p=0.008 n=5+5)
KV95-Avg  0.60 ± 0%  0.50 ± 0%  -16.67%  (p=0.008 n=5+5)
```

The second commit is more speculative and came from looking at a profile where 1.6% of the allocated garbage was due to that `NameInfo`, even though we'll never, ever hit it.

![Allocation profile screenshot (Screen Shot 2021-11-15 at 12 57 31 AM)](https://user-images.githubusercontent.com/1839234/141729924-d00eebab-b35c-42bd-8d0b-ee39f3ac7d46.png)

Fixes #72499.

Co-authored-by: Ricky Stewart <[email protected]>
Co-authored-by: Andrew Werner <[email protected]>
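The #72740 fix described above amounts to initializing the cached system database entry lazily, on first access, instead of in the collection constructor. Below is a rough sketch of that pattern; the type and field names are illustrative, not the actual `descs.Collection` internals.

```go
package main

import "sync"

// Descriptor is a stand-in for a catalog descriptor.
type Descriptor struct {
	ID   int64
	Name string
}

// Collection is a toy version of a per-transaction descriptor collection.
// Eagerly populating systemDB on construction showed up on the hot path, so
// it is only built the first time it is actually requested.
type Collection struct {
	once     sync.Once
	systemDB *Descriptor
}

// SystemDatabase returns the cached system database descriptor, constructing
// it on first use rather than when the Collection is created.
func (c *Collection) SystemDatabase() *Descriptor {
	c.once.Do(func() {
		c.systemDB = &Descriptor{ID: 1, Name: "system"}
	})
	return c.systemDB
}

func main() {
	c := &Collection{}
	_ = c.SystemDatabase() // initialized here, not when c was constructed
}
```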
@ajwerner thanks for investigating.

I think the mutex profile covers time since the process started, though I could be wrong. It was probably around a minute into the benchmark run. I wasn't running it for that long. I'm setting it back up now. Turning off admission control had only a very modest impact on CPU utilization. There was an initial throughput bump, but interestingly, it didn't last long. Tail latency went down and stayed down, but from a small number. I turned off admission control at 23:29.
With that low number of runnable goroutines, there should be no queueing in admission control (the admission control page in the DB Console should confirm that), so any effect would be contention from acquiring the mutexes.
Just a drive-by to mention that, to the best of my knowledge, on recent Go versions the mutex profile supports the …

@ajwerner if admission control is not decreasing CPU utilization, and we are seeing a 3-4% decline in throughput, the admission control code should be showing up as significant in the CPU profile -- do you have one?
profile_without.pb.gz is without admission control; https://share.polarsignals.com/c7ecb30/ is with. I have the binary if you want it to get line info. Looks like admission control is ~1% of CPU samples. It does look like some of the samples are in …
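For reference, a CPU profile like the ones linked above can be pulled from any process that exposes the standard `net/http/pprof` endpoints. The sketch below makes that assumption; the URL, port, and output filename are placeholders, and a secure cluster may additionally require authentication.

```go
package main

import (
	"io"
	"log"
	"net/http"
	"os"
)

func main() {
	// Placeholder URL: adjust host/port to the target process's HTTP address.
	// The seconds parameter controls how long the profile is collected for.
	const url = "http://localhost:8080/debug/pprof/profile?seconds=30"

	resp, err := http.Get(url)
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	f, err := os.Create("cpu.pb.gz")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	// The response body is a gzipped pprof protobuf; analyze it with
	// `go tool pprof cpu.pb.gz` (pass the binary as well to get line info).
	if _, err := io.Copy(f, resp.Body); err != nil {
		log.Fatal(err)
	}
}
```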
72646: roachtest: bump KV and YCSB benchmark durations r=stevendanna,srosenberg a=erikgrinaker

roachtest benchmark performance often deteriorates over time, e.g. due to garbage buildup in the LSM and MVCC. We previously ran benchmarks for 10 minutes, but this is often not long enough for this buildup to significantly affect the benchmark. To better take this performance deterioration into account, this patch changes the default duration for KV and YCSB benchmarks from 10 minutes to 30 minutes. It also adds the environment variables `ROACHTEST_KV_FLAGS` and `ROACHTEST_YCSB_FLAGS` to pass additional flags to the workloads in order to override the duration (e.g. `ROACHTEST_KV_FLAGS="--duration 5m"`).

Touches #72499.

Release note: None

A follow-up PR will annotate this change on roachperf graphs.

Co-authored-by: Erik Grinaker <[email protected]>
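As a sketch of how an environment-variable override like `ROACHTEST_KV_FLAGS` can be applied to a workload invocation (illustrative only, not the exact roachtest implementation): read the variable, split it into flags, and append them after the defaults so the override takes effect.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// workloadArgs builds the argument list for a `workload run kv` invocation,
// appending any extra flags from ROACHTEST_KV_FLAGS so that, for example,
// ROACHTEST_KV_FLAGS="--duration 5m" overrides the 30m default.
func workloadArgs() []string {
	args := []string{"workload", "run", "kv", "--duration", "30m"}
	if extra := os.Getenv("ROACHTEST_KV_FLAGS"); extra != "" {
		// strings.Fields is a simplification; it does not handle quoted
		// arguments that contain spaces.
		args = append(args, strings.Fields(extra)...)
	}
	return args
}

func main() {
	fmt.Println(strings.Join(workloadArgs(), " "))
}
```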