Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sql: lower default sampling rate to 1% #62998

Merged
merged 1 commit into from
Apr 2, 2021

Conversation

irfansharif
Copy link
Contributor

@irfansharif irfansharif commented Apr 1, 2021

We arrived at the previous default rate of 10% back in #59379. This was
back when we were creating real tracing spans for all statements, and
for sampled statements, we were propagating additional stats payloads.
Consequently what #59379 ended up measuring (and finding the overhead
acceptable) for was the performance hit we would incur for propagating
stats payloads for statements already using real tracing spans.

Since then, the landscape has changed. Notably we introduced #61777,
which made it so that we were only using real tracing spans for sampled
statements. This was done after performance analysis in #59424 showed
that the use of real tracing spans in all statements resulted in
tremendous overhead, for no real benefit.

What this now leaves us with is a sampling rate that was tuned by only
considering the stats payload overhead. What we want now is to also
consider the overhead of using real tracing spans for sampled
statements, vs. not. Doing this analysis gives us a very different
picture for what the default rate should be.


To find out what the overhead for sampled statements are currently, we
experimented with kv95/enc=false/nodes=1/cpu=32. It's a simple
benchmark that does little more than one-off statements, so should give
us a concise picture of the sampling overhead. We ran six experiments
in total (each corresponding to a pair of read+write rows), done in
groups of three (each group corresponding to a table below). Each
run in turn is comprised of 10 iterations of kv95, and what's varied
between each run is the default sampling rate. We pin a sampling rate of
0.0 as the baseline that effectively switches off sampling entirely (and
tracing), and measure the throughput degradation as we vary the sampling
rate.

                      ops/sec            ops/sec
--------------------|------------------|------------------
rate   op      grp  | median    diff   | mean      diff
--------------------|------------------|------------------
0.00 / read  / #1   | 69817.90         | 69406.37
0.01 / read  / #1   | 69300.35  -0.74% | 68717.23  -0.99%
0.10 / read  / #1   | 67743.35  -2.97% | 67601.81  -2.60%
0.00 / write / #1   |  3672.55         |  3653.63
0.01 / write / #1   |  3647.65  -0.68% |  3615.90  -1.03%
0.10 / write / #1   |  3567.20  -2.87% |  3558.90  -2.59%

                      ops/sec            ops/sec
--------------------|------------------|------------------
rate   op      grp  | median    diff   | mean      diff
--------------------|------------------|------------------
0.00 / read  / #2   | 69440.80          68893.24
0.01 / read  / #2   | 69481.55  +0.06%  69463.13  +0.82% (probably in the noise margin)
0.10 / read  / #2   | 67841.80  -2.30%  66992.55  -2.76%
0.00 / write / #2   |  3652.45           3625.24
0.01 / write / #2   |  3657.55  -0.14%   3654.34  +0.80%
0.10 / write / #2   |  3570.75  -2.24%   3526.04  -2.74%

The results above suggest that the current default rate of 10% is too
high, and a 1% rate is much more acceptable.


The fact that the cost of sampling is largely dominated by tracing is
extremely unfortunate. We have ideas for how that can be improved
(prototyped in #62227), but they're much too invasive to backport to
21.1.

Release note: None

@irfansharif irfansharif requested review from tbg, yuzefovich and a team April 1, 2021 20:57
@cockroach-teamcity
Copy link
Member

This change is Reviewable

@irfansharif
Copy link
Contributor Author

irfansharif commented Apr 1, 2021

Raw results for the curious.


Group 1

0.0 runs:

                                                                          _elapsed___errors__ops/sec(inst)___ops/sec(cum)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)
kv95/enc=false/nodes=1/cpu=32/run_6/run_191441.427_n2_workload_run_kv.log:  600.0s        0       39724458        66207.2      0.8      0.7      1.5      3.3     32.5  read
kv95/enc=false/nodes=1/cpu=32/run_9/run_191429.806_n2_workload_run_kv.log:  600.0s        0       41903894        69839.7      0.8      0.7      1.4      3.0     28.3  read
kv95/enc=false/nodes=1/cpu=32/run_3/run_191459.027_n2_workload_run_kv.log:  600.0s        0       41809098        69681.5      0.8      0.7      1.4      3.0     28.3  read
kv95/enc=false/nodes=1/cpu=32/run_8/run_191433.527_n2_workload_run_kv.log:  600.0s        0       41333863        68889.5      0.8      0.7      1.4      3.0     56.6  read
kv95/enc=false/nodes=1/cpu=32/run_7/run_191450.662_n2_workload_run_kv.log:  600.0s        0       42290475        70484.0      0.8      0.7      1.4      3.0     26.2  read
kv95/enc=false/nodes=1/cpu=32/run_2/run_191436.031_n2_workload_run_kv.log:  600.0s        0       41340231        68900.3      0.8      0.7      1.4      3.0     31.5  read
kv95/enc=false/nodes=1/cpu=32/run_10/run_19143.350_n2_workload_run_kv.log:  600.0s        0       41910278        69850.4      0.8      0.7      1.4      3.0     27.3  read
kv95/enc=false/nodes=1/cpu=32/run_4/run_191429.397_n2_workload_run_kv.log:  600.0s        0       41877715        69796.1      0.8      0.7      1.4      3.0     29.4  read
kv95/enc=false/nodes=1/cpu=32/run_1/run_191436.586_n2_workload_run_kv.log:  600.0s        0       42194838        70324.5      0.8      0.7      1.4      3.0     26.2  read
kv95/enc=false/nodes=1/cpu=32/run_5/run_191430.916_n2_workload_run_kv.log:  600.0s        0       42054339        70090.5      0.8      0.7      1.4      3.0     26.2  read
kv95/enc=false/nodes=1/cpu=32/run_6/run_191441.427_n2_workload_run_kv.log:  600.0s        0        2091739         3486.2      2.6      2.0      6.6     10.0     33.6  write
kv95/enc=false/nodes=1/cpu=32/run_8/run_191433.527_n2_workload_run_kv.log:  600.0s        0        2176836         3628.0      2.8      2.0      7.1     11.5     96.5  write
kv95/enc=false/nodes=1/cpu=32/run_2/run_191436.031_n2_workload_run_kv.log:  600.0s        0        2175776         3626.3      2.8      2.1      7.1     10.5     39.8  write
kv95/enc=false/nodes=1/cpu=32/run_9/run_191429.806_n2_workload_run_kv.log:  600.0s        0        2206819         3678.0      2.3      1.7      6.0      8.9     35.7  write
kv95/enc=false/nodes=1/cpu=32/run_3/run_191459.027_n2_workload_run_kv.log:  600.0s        0        2201661         3669.4      2.4      1.9      6.0      8.9     30.4  write
kv95/enc=false/nodes=1/cpu=32/run_4/run_191429.397_n2_workload_run_kv.log:  600.0s        0        2205410         3675.7      2.6      1.9      6.3     11.5    100.7  write
kv95/enc=false/nodes=1/cpu=32/run_10/run_19144.350_n2_workload_run_kv.log:  600.0s        0        2204345         3673.9      2.4      1.9      6.0      8.9     28.3  write
kv95/enc=false/nodes=1/cpu=32/run_7/run_191450.662_n2_workload_run_kv.log:  600.0s        0        2224893         3708.2      2.3      1.8      5.5      8.4     29.4  write
kv95/enc=false/nodes=1/cpu=32/run_1/run_191436.586_n2_workload_run_kv.log:  600.0s        0        2220265         3700.4      2.2      1.6      5.8      8.4     24.1  write
kv95/enc=false/nodes=1/cpu=32/run_5/run_191430.916_n2_workload_run_kv.log:  600.0s        0        2214152         3690.2      2.4      1.9      6.0      8.9     35.7  write

0.01 runs:

                                                                          _elapsed___errors__ops/sec(inst)___ops/sec(cum)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)
kv95/enc=false/nodes=1/cpu=32/run_6/run_161406.533_n2_workload_run_kv.log:  600.0s        0       38793226        64655.3      0.8      0.7      1.4      3.3     28.3  read
kv95/enc=false/nodes=1/cpu=32/run_9/run_161322.886_n2_workload_run_kv.log:  600.0s        0       41586102        69310.1      0.8      0.7      1.4      3.0     35.7  read
kv95/enc=false/nodes=1/cpu=32/run_3/run_161316.328_n2_workload_run_kv.log:  600.0s        0       41168904        68614.8      0.8      0.7      1.4      3.0     26.2  read
kv95/enc=false/nodes=1/cpu=32/run_8/run_161324.100_n2_workload_run_kv.log:  600.0s        0       41616740        69361.2      0.8      0.7      1.4      3.0     29.4  read
kv95/enc=false/nodes=1/cpu=32/run_2/run_161348.911_n2_workload_run_kv.log:  600.0s        0       41985094        69975.1      0.8      0.7      1.4      3.0     29.4  read
kv95/enc=false/nodes=1/cpu=32/run_7/run_161329.964_n2_workload_run_kv.log:  600.0s        0       41574483        69290.6      0.8      0.7      1.4      3.0     26.2  read
kv95/enc=false/nodes=1/cpu=32/run_10/run_16124.701_n2_workload_run_kv.log:  600.0s        0       41418902        69031.3      0.8      0.7      1.4      3.0     30.4  read
kv95/enc=false/nodes=1/cpu=32/run_4/run_161324.328_n2_workload_run_kv.log:  600.0s        0       41937343        69895.5      0.8      0.7      1.4      3.0     33.6  read
kv95/enc=false/nodes=1/cpu=32/run_5/run_161357.044_n2_workload_run_kv.log:  600.0s        0       39756352        66260.5      0.8      0.7      1.4      3.1     27.3  read
kv95/enc=false/nodes=1/cpu=32/run_1/run_161339.375_n2_workload_run_kv.log:  600.0s        0       42466769        70777.9      0.8      0.7      1.4      3.0     27.3  read
kv95/enc=false/nodes=1/cpu=32/run_9/run_161322.886_n2_workload_run_kv.log:  600.0s        0        2190241         3650.4      2.8      2.1      7.1     10.5     31.5  write
kv95/enc=false/nodes=1/cpu=32/run_8/run_161324.100_n2_workload_run_kv.log:  600.0s        0        2191090         3651.8      2.4      1.9      5.8      8.9     24.1  write
kv95/enc=false/nodes=1/cpu=32/run_6/run_161406.533_n2_workload_run_kv.log:  600.0s        0        2038990         3398.3      3.8      3.0      9.4     13.6     39.8  write
kv95/enc=false/nodes=1/cpu=32/run_3/run_161316.328_n2_workload_run_kv.log:  600.0s        0        2163330         3605.5      3.4      2.8      7.6     11.0     37.7  write
kv95/enc=false/nodes=1/cpu=32/run_10/run_16124.701_n2_workload_run_kv.log:  600.0s        0        2179010         3631.7      2.9      2.1      7.1     10.5     28.3  write
kv95/enc=false/nodes=1/cpu=32/run_2/run_161348.911_n2_workload_run_kv.log:  600.0s        0        2210650         3684.4      2.4      1.8      5.8      8.9     25.2  write
kv95/enc=false/nodes=1/cpu=32/run_7/run_161329.964_n2_workload_run_kv.log:  600.0s        0        2186953         3644.9      2.9      2.1      7.1     10.5     29.4  write
kv95/enc=false/nodes=1/cpu=32/run_5/run_161357.044_n2_workload_run_kv.log:  600.0s        0        2093366         3488.9      3.7      3.0      8.9     13.6     46.1  write
kv95/enc=false/nodes=1/cpu=32/run_4/run_161324.328_n2_workload_run_kv.log:  600.0s        0        2206659         3677.8      2.4      1.9      6.3      9.4     35.7  write
kv95/enc=false/nodes=1/cpu=32/run_1/run_161339.375_n2_workload_run_kv.log:  600.0s        0        2235196         3725.3      2.4      1.9      5.8      8.9     27.3  write

0.1 runs:

                                                                          _elapsed___errors__ops/sec(inst)___ops/sec(cum)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)
kv95/enc=false/nodes=1/cpu=32/run_9/run_165945.744_n2_workload_run_kv.log:  600.0s        0       40836089        68059.9      0.8      0.7      1.4      3.3     31.5  read
kv95/enc=false/nodes=1/cpu=32/run_6/run_165936.133_n2_workload_run_kv.log:  600.0s        0       40443198        67405.3      0.8      0.7      1.4      3.1     26.2  read
kv95/enc=false/nodes=1/cpu=32/run_8/run_165903.539_n2_workload_run_kv.log:  600.0s        0       40322723        67204.4      0.8      0.7      1.5      3.3     27.3  read
kv95/enc=false/nodes=1/cpu=32/run_3/run_165907.376_n2_workload_run_kv.log:  600.0s        0       40801445        68002.3      0.8      0.7      1.5      3.3     26.2  read
kv95/enc=false/nodes=1/cpu=32/run_2/run_165955.395_n2_workload_run_kv.log:  600.0s        0       40062649        66771.0      0.8      0.7      1.4      3.3     32.5  read
kv95/enc=false/nodes=1/cpu=32/run_7/run_165903.578_n2_workload_run_kv.log:  600.0s        0       41321513        68869.1      0.8      0.7      1.4      3.1     23.1  read
kv95/enc=false/nodes=1/cpu=32/run_4/run_165907.637_n2_workload_run_kv.log:  600.0s        0       41120013        68533.3      0.8      0.7      1.4      3.1     28.3  read
kv95/enc=false/nodes=1/cpu=32/run_5/run_165928.418_n2_workload_run_kv.log:  600.0s        0       39411833        65686.1      0.8      0.7      1.4      3.1     29.4  read
kv95/enc=false/nodes=1/cpu=32/run_1/run_165916.844_n2_workload_run_kv.log:  600.0s        0       40538760        67564.5      0.8      0.7      1.5      3.3     27.3  read
kv95/enc=false/nodes=1/cpu=32/run_10/run_15908.026_n2_workload_run_kv.log:  600.0s        0       40753333        67922.2      0.8      0.7      1.4      3.1     32.5  read
kv95/enc=false/nodes=1/cpu=32/run_8/run_165903.539_n2_workload_run_kv.log:  600.0s        0        2122375         3537.3      2.8      2.1      7.3     11.0     29.4  write
kv95/enc=false/nodes=1/cpu=32/run_9/run_165945.744_n2_workload_run_kv.log:  600.0s        0        2148981         3581.6      2.5      2.0      6.3      9.4     33.6  write
kv95/enc=false/nodes=1/cpu=32/run_6/run_165936.133_n2_workload_run_kv.log:  600.0s        0        2127696         3546.2      3.2      2.5      7.6     11.0     35.7  write
kv95/enc=false/nodes=1/cpu=32/run_3/run_165907.376_n2_workload_run_kv.log:  600.0s        0        2147007         3578.3      2.5      1.9      6.0      9.4     32.5  write
kv95/enc=false/nodes=1/cpu=32/run_2/run_165955.395_n2_workload_run_kv.log:  600.0s        0        2108114         3513.5      2.9      2.2      7.3     11.0     35.7  write
kv95/enc=false/nodes=1/cpu=32/run_7/run_165903.578_n2_workload_run_kv.log:  600.0s        0        2175956         3626.6      2.4      1.8      5.8      8.9     25.2  write
kv95/enc=false/nodes=1/cpu=32/run_4/run_165907.637_n2_workload_run_kv.log:  600.0s        0        2166718         3611.2      2.6      2.0      6.3      9.4     28.3  write
kv95/enc=false/nodes=1/cpu=32/run_1/run_165916.844_n2_workload_run_kv.log:  600.0s        0        2134224         3557.0      2.5      2.0      6.0      9.4     28.3  write
kv95/enc=false/nodes=1/cpu=32/run_10/run_16598.026_n2_workload_run_kv.log:  600.0s        0        2146427         3577.4      2.6      2.0      6.3      9.4     31.5  write
kv95/enc=false/nodes=1/cpu=32/run_5/run_165928.418_n2_workload_run_kv.log:  600.0s        0        2075959         3459.9      3.7      2.9      8.9     13.6     46.1  write

Group 2

0.00 runs

                                                                          _elapsed___errors__ops/sec(inst)___ops/sec(cum)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)
enc=false/nodes=1/cpu=32/run_1/run_192353.068_n2_workload_run_kv.log:  600.0s        0       40789788        67982.9      0.8      0.7      1.4      3.1     28.3  read
enc=false/nodes=1/cpu=32/run_7/run_192341.641_n2_workload_run_kv.log:  600.0s        0       40092323        66820.5      0.8      0.7      1.4      3.0     26.2  read
enc=false/nodes=1/cpu=32/run_3/run_192351.493_n2_workload_run_kv.log:  600.0s        0       40574626        67624.3      0.8      0.7      1.5      3.1     62.9  read
enc=false/nodes=1/cpu=32/run_9/run_192218.006_n2_workload_run_kv.log:  600.0s        0       42408401        70680.6      0.8      0.7      1.4      2.9     35.7  read
enc=false/nodes=1/cpu=32/run_8/run_191934.644_n2_workload_run_kv.log:  600.0s        0       42217273        70362.1      0.8      0.7      1.4      3.0     26.2  read
enc=false/nodes=1/cpu=32/run_6/run_192314.459_n2_workload_run_kv.log:  600.0s        0       41544914        69241.5      0.8      0.7      1.4      3.0     67.1  read
enc=false/nodes=1/cpu=32/run_4/run_192343.195_n2_workload_run_kv.log:  600.0s        0       42083198        70138.6      0.8      0.7      1.4      3.0     31.5  read
enc=false/nodes=1/cpu=32/run_10/run_19232.771_n2_workload_run_kv.log:  600.0s        0       41851082        69751.7      0.8      0.7      1.4      3.0     28.3  read
enc=false/nodes=1/cpu=32/run_5/run_192205.771_n2_workload_run_kv.log:  600.0s        0       41784078        69640.1      0.8      0.7      1.4      3.0     28.3  read
enc=false/nodes=1/cpu=32/run_2/run_191611.234_n2_workload_run_kv.log:  600.0s        0       40014080        66690.1      0.8      0.7      1.5      3.3     46.1  read
enc=false/nodes=1/cpu=32/run_7/run_192341.641_n2_workload_run_kv.log:  600.0s        0        2109896         3516.5      3.7      2.6      9.4     14.7     71.3  write
enc=false/nodes=1/cpu=32/run_9/run_192218.006_n2_workload_run_kv.log:  600.0s        0        2230695         3717.8      2.4      1.8      5.8     10.0     37.7  write
enc=false/nodes=1/cpu=32/run_3/run_192351.493_n2_workload_run_kv.log:  600.0s        0        2136822         3561.4      2.5      1.9      6.0     10.0    104.9  write
enc=false/nodes=1/cpu=32/run_1/run_192353.068_n2_workload_run_kv.log:  600.0s        0        2146687         3577.8      2.7      2.1      6.8     10.0     39.8  write
enc=false/nodes=1/cpu=32/run_10/run_19232.771_n2_workload_run_kv.log:  600.0s        0        2201763         3669.6      2.6      1.9      6.0     11.0    104.9  write
enc=false/nodes=1/cpu=32/run_6/run_192314.459_n2_workload_run_kv.log:  600.0s        0        2183711         3639.5      2.6      2.0      6.6     10.0    176.2  write
enc=false/nodes=1/cpu=32/run_8/run_191934.644_n2_workload_run_kv.log:  600.0s        0        2222878         3704.8      2.3      1.8      5.5      8.4     28.3  write
enc=false/nodes=1/cpu=32/run_2/run_191611.234_n2_workload_run_kv.log:  600.0s        0        2105696         3509.5      2.7      2.0      6.6     10.5    104.9  write
enc=false/nodes=1/cpu=32/run_4/run_192343.195_n2_workload_run_kv.log:  600.0s        0        2214087         3690.1      2.4      1.9      5.8      8.9     26.2  write
enc=false/nodes=1/cpu=32/run_5/run_192205.771_n2_workload_run_kv.log:  600.0s        0        2199235         3665.4      2.2      1.6      5.8      8.9     30.4  write

0.01 runs:

                                                                          _elapsed___errors__ops/sec(inst)___ops/sec(cum)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)
kv95/enc=false/nodes=1/cpu=32/run_7/run_170641.591_n2_workload_run_kv.log:  600.0s        0       41871591        69785.9      0.8      0.7      1.5      3.0     25.2  read
kv95/enc=false/nodes=1/cpu=32/run_8/run_171006.330_n2_workload_run_kv.log:  600.0s        0       41149099        68581.8      0.8      0.7      1.4      3.0    125.8  read
kv95/enc=false/nodes=1/cpu=32/run_10/run_17956.541_n2_workload_run_kv.log:  600.0s        0       42537259        70895.4      0.8      0.7      1.4      3.0     27.3  read
kv95/enc=false/nodes=1/cpu=32/run_6/run_170721.221_n2_workload_run_kv.log:  600.0s        0       41680248        69467.0      0.8      0.7      1.4      3.0     23.1  read
kv95/enc=false/nodes=1/cpu=32/run_4/run_170909.475_n2_workload_run_kv.log:  600.0s        0       41184430        68640.7      0.8      0.7      1.4      3.0     28.3  read
kv95/enc=false/nodes=1/cpu=32/run_3/run_170906.233_n2_workload_run_kv.log:  600.0s        0       41697838        69496.1      0.8      0.7      1.4      3.0     26.2  read
kv95/enc=false/nodes=1/cpu=32/run_1/run_171018.571_n2_workload_run_kv.log:  600.0s        0       42476207        70793.4      0.8      0.7      1.4      3.0     29.4  read
kv95/enc=false/nodes=1/cpu=32/run_9/run_171008.467_n2_workload_run_kv.log:  600.0s        0       41932542        69887.5      0.8      0.7      1.4      3.0     26.2  read
kv95/enc=false/nodes=1/cpu=32/run_2/run_170921.705_n2_workload_run_kv.log:  600.0s        0       40999597        68332.6      0.8      0.7      1.4      3.0     28.3  read
kv95/enc=false/nodes=1/cpu=32/run_5/run_170539.346_n2_workload_run_kv.log:  600.0s        0       41250636        68750.9      0.8      0.7      1.4      3.0     28.3  read
kv95/enc=false/nodes=1/cpu=32/run_8/run_171006.330_n2_workload_run_kv.log:  600.0s        0        2165360         3608.9      2.8      2.0      6.8     11.5    234.9  write
kv95/enc=false/nodes=1/cpu=32/run_1/run_171018.571_n2_workload_run_kv.log:  600.0s        0        2235457         3725.7      2.3      1.8      5.5      8.4     28.3  write
kv95/enc=false/nodes=1/cpu=32/run_9/run_171008.467_n2_workload_run_kv.log:  600.0s        0        2205802         3676.3      2.4      1.9      5.8      8.9     37.7  write
kv95/enc=false/nodes=1/cpu=32/run_7/run_170641.591_n2_workload_run_kv.log:  600.0s        0        2199251         3665.4      2.2      1.7      5.8      8.9     28.3  write
kv95/enc=false/nodes=1/cpu=32/run_4/run_170909.475_n2_workload_run_kv.log:  600.0s        0        2163926         3606.5      3.2      2.5      7.6     11.0     37.7  write
kv95/enc=false/nodes=1/cpu=32/run_5/run_170539.346_n2_workload_run_kv.log:  600.0s        0        2170589         3617.6      2.9      2.0      7.6     11.0     28.3  write
kv95/enc=false/nodes=1/cpu=32/run_2/run_170921.705_n2_workload_run_kv.log:  600.0s        0        2157541         3595.9      3.2      2.5      7.3     11.0     35.7  write
kv95/enc=false/nodes=1/cpu=32/run_10/run_70956.541_n2_workload_run_kv.log:  600.0s        0        2239195         3732.0      2.3      1.8      5.5      8.4     25.2  write
kv95/enc=false/nodes=1/cpu=32/run_3/run_170906.233_n2_workload_run_kv.log:  600.0s        0        2194448         3657.4      2.4      1.9      5.8      8.9     50.3  write
kv95/enc=false/nodes=1/cpu=32/run_6/run_170721.221_n2_workload_run_kv.log:  600.0s        0        2194600         3657.7      2.8      2.0      7.3     10.5     32.5  write

0.1 runs:

                                                                          _elapsed___errors__ops/sec(inst)___ops/sec(cum)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)
kv95/enc=false/nodes=1/cpu=32/run_7/run_162024.152_n2_workload_run_kv.log:  600.0s        0       35171119        58618.5      0.9      0.8      1.6      4.1     31.5  read
kv95/enc=false/nodes=1/cpu=32/run_8/run_161718.608_n2_workload_run_kv.log:  600.0s        0       40806139        68010.0      0.8      0.7      1.5      3.3     28.3  read
kv95/enc=false/nodes=1/cpu=32/run_1/run_162202.588_n2_workload_run_kv.log:  600.0s        0       40857078        68095.1      0.8      0.7      1.5      3.1     28.3  read
kv95/enc=false/nodes=1/cpu=32/run_9/run_162156.727_n2_workload_run_kv.log:  600.0s        0       41034067        68390.0      0.8      0.7      1.4      3.1     35.7  read
kv95/enc=false/nodes=1/cpu=32/run_4/run_162148.274_n2_workload_run_kv.log:  600.0s        0       40864177        68106.7      0.8      0.7      1.4      3.1     26.2  read
kv95/enc=false/nodes=1/cpu=32/run_10/run_16220.228_n2_workload_run_kv.log:  600.0s        0       40528882        67547.9      0.8      0.7      1.5      3.1    117.4  read
kv95/enc=false/nodes=1/cpu=32/run_3/run_162058.015_n2_workload_run_kv.log:  600.0s        0       40604267        67673.6      0.8      0.7      1.5      3.1     31.5  read
kv95/enc=false/nodes=1/cpu=32/run_6/run_161720.722_n2_workload_run_kv.log:  600.0s        0       40475028        67458.3      0.8      0.7      1.4      3.3     31.5  read
kv95/enc=false/nodes=1/cpu=32/run_5/run_161927.518_n2_workload_run_kv.log:  600.0s        0       40421725        67369.3      0.8      0.7      1.4      3.1     26.2  read
kv95/enc=false/nodes=1/cpu=32/run_2/run_162130.521_n2_workload_run_kv.log:  600.0s        0       41193811        68656.1      0.8      0.7      1.4      3.1     27.3  read
kv95/enc=false/nodes=1/cpu=32/run_7/run_162024.152_n2_workload_run_kv.log:  600.0s        0        1849040         3081.7      3.6      2.9      8.4     12.6     37.7  write
kv95/enc=false/nodes=1/cpu=32/run_9/run_162156.727_n2_workload_run_kv.log:  600.0s        0        2159650         3599.4      2.5      2.0      6.3      9.4     32.5  write
kv95/enc=false/nodes=1/cpu=32/run_8/run_161718.608_n2_workload_run_kv.log:  600.0s        0        2147545         3579.2      2.4      1.9      5.8      8.9     29.4  write
kv95/enc=false/nodes=1/cpu=32/run_3/run_162058.015_n2_workload_run_kv.log:  600.0s        0        2137402         3562.3      3.0      2.2      7.9     12.1     58.7  write
kv95/enc=false/nodes=1/cpu=32/run_4/run_162148.274_n2_workload_run_kv.log:  600.0s        0        2151489         3585.8      2.6      2.0      6.3      9.4     35.7  write
kv95/enc=false/nodes=1/cpu=32/run_10/run_16207.228_n2_workload_run_kv.log:  600.0s        0        2133327         3555.5      2.7      2.0      6.6     10.5    285.2  write
kv95/enc=false/nodes=1/cpu=32/run_6/run_161720.722_n2_workload_run_kv.log:  600.0s        0        2129437         3549.1      2.7      2.0      6.6     10.0     30.4  write
kv95/enc=false/nodes=1/cpu=32/run_1/run_162202.588_n2_workload_run_kv.log:  600.0s        0        2150665         3584.4      2.4      1.9      5.8      8.9     26.2  write
kv95/enc=false/nodes=1/cpu=32/run_2/run_162130.521_n2_workload_run_kv.log:  600.0s        0        2169128         3615.2      2.4      1.9      5.8      8.9     27.3  write
kv95/enc=false/nodes=1/cpu=32/run_5/run_161927.518_n2_workload_run_kv.log:  600.0s        0        2128674         3547.8      3.4      2.6      7.9     12.6     75.5  write

We arrived at the previous default rate of 10% back in cockroachdb#59379. This was
back when we were creating real tracing spans for all statements, and
for sampled statements, we were propagating additional stats payloads.
Consequently what cockroachdb#59379 ended up measuring (and finding the overhead
acceptable) for was the performance hit we would incur for propagating
stats payloads for statements already using real tracing spans.

Since then, the landscape has changed. Notably we introduced cockroachdb#61777,
which made it so that we were only using real tracing spans for sampled
statements. This was done after performance analysis in cockroachdb#59424 showed
that the use of real tracing spans in all statements resulted in
tremendous overhead, for no real benefit.

What this now leaves us with is a sampling rate that was tuned by only
considering the stats payload overhead. What we want now is to also
consider the overhead of using real tracing spans for sampled
statements, vs. not. Doing this analysis gives us a very different
picture for what the default rate should be.

---

To find out what the overhead for sampled statements are currently, we
experimented with kv95/enc=false/nodes=1/cpu=32. It's a simple
benchmark that does little more than one-off statements, so should give
us a concise picture of the sampling overhead. We ran six experiments
in total (each corresponding to a pair of read+write rows), done in
groups of three (each group corresponding to a table below). Each
run in turn is comprised of 10 iterations of kv95, and what's varied
between each run is the default sampling rate. We pin a sampling rate of
0.0 as the baseline that effectively switches off sampling entirely (and
tracing), and measure the throughput degradation as we vary the sampling
rate.

                          ops/sec            ops/sec
    --------------------|------------------|------------------
    rate   op      grp  | median    diff   | mean      diff
    --------------------|------------------|------------------
    0.00 / read  / #1   | 69817.90         | 69406.37
    0.01 / read  / #1   | 69300.35  -0.74% | 68717.23  -0.99%
    0.10 / read  / #1   | 67743.35  -2.97% | 67601.81  -2.60%
    0.00 / write / #1   |  3672.55         |  3653.63
    0.01 / write / #1   |  3647.65  -0.68% |  3615.90  -1.03%
    0.10 / write / #1   |  3567.20  -2.87% |  3558.90  -2.59%

                          ops/sec            ops/sec
    --------------------|------------------|------------------
    rate   op      grp  | median    diff   | mean      diff
    --------------------|------------------|------------------
    0.00 / read  / #2   | 69440.80          68893.24
    0.01 / read  / #2   | 69481.55  +0.06%  69463.13  +0.82% (probably in the noise margin)
    0.10 / read  / #2   | 67841.80  -2.30%  66992.55  -2.76%
    0.00 / write / #2   |  3652.45           3625.24
    0.01 / write / #2   |  3657.55  -0.14%   3654.34  +0.80%
    0.10 / write / #2   |  3570.75  -2.24%   3526.04  -2.74%

The results above suggest that the current default rate of 10% is too
high, and a 1% rate is much more acceptable.

---

The fact that the cost of sampling is largely dominated by tracing is
extremely unfortunate. We have ideas for how that can be improved
(prototyped in cockroachdb#62227), but they're much too invasive to backport to
21.1. It's unfortunate that we only discovered the overhead this late in
the development cycle. It was due to two major reasons:
- cockroachdb#59992 landed late in the cycle, and enabled tracing for realsies (by
  propagating real tracing spans across rpc boundaries). We had done
  sanity checking for the tracing overhead before this point, but failed
  to realize that cockroachdb#59992 would merit re-analysis.
- The test that alerted us to the degradation (tpccbench) had be
  persistently failing for a myriad of other reasons, so we didn't learn
  until too late that tracing was the latest offendor. tpccbench also
  doesn't deal with VM overload well (something cockroachdb#62361 hopes to
  address), and after tracing was enabled for realsies, this was the
  dominant failure mode. This resulted in perf data not making it's way
  to roachperf, which further hid possible indicators we had a major
  regression on our hands. We also didn't have a healthy process looking
  at roachperf on a continual basis, something we're looking to rectify
  going forward. We would've picked up on this regression had we been
  closely monitoring the kv95 charts.

Release note: None
@irfansharif irfansharif force-pushed the 210401.lower-sample-rate branch from 47f586c to f4ec58f Compare April 1, 2021 21:05
@irfansharif
Copy link
Contributor Author

Separately I also tried grabbing tpccbench numbers, but it's pretty hard to make heads or tails from it the numbers tend to be nonsensical once the VMs plunge into overloaded territory (see #62039 (comment)).

@jordanlewis
Copy link
Member

So the short summary here, if I'm understanding correctly, is that:

1% sampling gives about 1% degradation over no sampling
10% sampling gives about 3% degradation over no sampling

right?

@irfansharif
Copy link
Contributor Author

Yup.

@irfansharif
Copy link
Contributor Author

bors r+

@irfansharif
Copy link
Contributor Author

irfansharif commented Apr 1, 2021

Woops, wrong PR tab.

bors r-

@craig
Copy link
Contributor

craig bot commented Apr 1, 2021

Canceled.

Copy link
Member

@yuzefovich yuzefovich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

What's the difference between group 1 and 2? Simply two different clusters? Or two different machine types?

Reviewed 1 of 1 files at r1.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @tbg)

@irfansharif
Copy link
Contributor Author

No real difference, just wanted to do the whole thing over again to see that I wasn't deluding myself. TFTR!

bors r+

@craig
Copy link
Contributor

craig bot commented Apr 2, 2021

Build succeeded:

@craig craig bot merged commit 2e044b5 into cockroachdb:master Apr 2, 2021
@irfansharif irfansharif deleted the 210401.lower-sample-rate branch April 2, 2021 02:29
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants