storage: don't bump tscache on successful CPut/InitPut requests #38122

Merged

Conversation

@nvanbenschoten (Member)
A transaction needs to insert into the timestamp cache on conditional put requests that fail so that other transactions can't rewrite its history and invalidate the read that decided whether the condition was met. However, when a conditional put request succeeds, it lays down an intent, which provides the same degree of protection against other writers invalidating the conditional put's read. No other writer will be able to write to the same key until the original transaction either commits and resolves its intent or aborts and removes its intent. In this way, the intent acts like a partial read timestamp cache for the key, so we don't need to also insert into the in-memory timestamp cache.

This commit uses this reasoning to stop bumping the timestamp cache on successful `ConditionalPutRequests` and `InitPutRequests`. This avoids an interesting performance degradation I observed under heavy CPut-only load, where each successful write was adding to the timestamp cache and causing it to bloat, which in turn slowed down subsequent lookups into the cache. By skipping the insertion on successful CPuts, we avoid both the cost of inserting into the cache and the cost imposed on future writers when they need to look in it.
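
To make the rule concrete, here is a minimal sketch of the decision this commit implements. It is not the actual CockroachDB code: the flag names and plumbing here are illustrative assumptions, though the logic follows the reasoning above.

```go
package main

import "fmt"

type flag int

const (
	// updatesTSCache: the request always bumps the timestamp cache.
	updatesTSCache flag = 1 << iota
	// updatesTSCacheOnError: the request bumps the cache only when it
	// fails, i.e. when no intent was written to protect its read.
	updatesTSCacheOnError
)

// requestFlags maps request types to their timestamp-cache behavior.
// Before this change, ConditionalPut and InitPut would have carried
// updatesTSCache unconditionally.
var requestFlags = map[string]flag{
	"Get":            updatesTSCache,
	"Scan":           updatesTSCache,
	"ConditionalPut": updatesTSCacheOnError,
	"InitPut":        updatesTSCacheOnError,
}

// shouldUpdateTSCache reports whether evaluating req must insert into
// the timestamp cache. A successful CPut lays down an intent, which
// already prevents other writers from invalidating its read.
func shouldUpdateTSCache(req string, failed bool) bool {
	f := requestFlags[req]
	if f&updatesTSCache != 0 {
		return true
	}
	return f&updatesTSCacheOnError != 0 && failed
}

func main() {
	fmt.Println(shouldUpdateTSCache("ConditionalPut", false)) // false: the intent protects the read
	fmt.Println(shouldUpdateTSCache("ConditionalPut", true))  // true: no intent was written
}
```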

### Benchmark Results

#### Microbenchmarks

I initially saw the slowdown when running the `KV/Insert/SQL` microbenchmark suite from `pkg/sql/tests/kv_test.go`. These tests demonstrate that under heavy `INSERT` load (which translates to `ConditionalPutRequests` under the hood), avoiding the timestamp cache is a major win.
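
A comparison like the one below can be produced with standard Go tooling (`go test` plus `benchstat` from golang.org/x/perf); the commands here are the usual ones for this kind of A/B run, not commands taken from this PR:

```
go test ./pkg/sql/tests -run '^$' -bench 'KV/Insert/SQL' -count 10 | tee old.txt
# ...apply this change and rebuild...
go test ./pkg/sql/tests -run '^$' -bench 'KV/Insert/SQL' -count 10 | tee new.txt
benchstat old.txt new.txt
```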

```
name                         old time/op    new time/op    delta
KV/Insert/SQL/rows=1-16         256µs ± 3%     254µs ± 2%     ~     (p=0.182 n=9+10)
KV/Insert/SQL/rows=10-16        404µs ± 2%     344µs ± 0%  -14.89%  (p=0.000 n=10+8)
KV/Insert/SQL/rows=100-16      3.27ms ± 6%    1.01ms ± 1%  -69.12%  (p=0.000 n=8+9)
KV/Insert/SQL/rows=1000-16     16.8ms ± 2%     7.1ms ± 1%  -57.76%  (p=0.000 n=10+10)
KV/Insert/SQL/rows=10000-16     189ms ± 2%     103ms ± 2%  -45.45%  (p=0.000 n=9+10)

name                         old alloc/op   new alloc/op   delta
KV/Insert/SQL/rows=1-16        36.6kB ± 0%    36.5kB ± 0%   -0.27%  (p=0.001 n=10+10)
KV/Insert/SQL/rows=10-16       77.4kB ± 0%    76.4kB ± 0%   -1.18%  (p=0.000 n=9+10)
KV/Insert/SQL/rows=100-16       468kB ± 0%     458kB ± 0%   -2.08%  (p=0.000 n=9+10)
KV/Insert/SQL/rows=1000-16     4.93MB ± 0%    4.84MB ± 0%   -1.75%  (p=0.000 n=10+9)
KV/Insert/SQL/rows=10000-16    87.7MB ± 0%    86.8MB ± 0%   -1.01%  (p=0.000 n=10+9)

name                         old allocs/op  new allocs/op  delta
KV/Insert/SQL/rows=1-16           337 ± 0%       337 ± 0%     ~     (all equal)
KV/Insert/SQL/rows=10-16          600 ± 0%       599 ± 0%   -0.12%  (p=0.011 n=10+10)
KV/Insert/SQL/rows=100-16       2.99k ± 0%     2.98k ± 0%   -0.13%  (p=0.000 n=10+9)
KV/Insert/SQL/rows=1000-16      26.8k ± 0%     26.7k ± 0%   -0.36%  (p=0.000 n=9+10)
KV/Insert/SQL/rows=10000-16      384k ± 0%      384k ± 0%     ~     (p=0.984 n=10+10)
```

CPU profiles demonstrate why this is the case. Here are the before and after CPU profiles for a run of `KV/Insert/SQL/rows=100`:

![Screenshot_2019-06-09 unknown cpu](https://user-images.githubusercontent.com/5438456/59165000-d5757f00-8ae2-11e9-9549-084ac631380c.png)

_CPU profile flamegraph before change, tscache accesses highlighted_

![Screenshot_2019-06-09 unknown cpu(1)](https://user-images.githubusercontent.com/5438456/59165001-e0301400-8ae2-11e9-906c-f88a5fcdc2eb.png)

_CPU profile flamegraph after change, tscache accesses highlighted_

As the profiles show, before this change timestamp cache access accounted for **45%** of the CPU profile (ignoring threading overhead in this benchmark, which has been cut out). After this change, timestamp cache access accounts for only **0.3%** of the profile.
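
As an aside, flame graphs like these can be collected and browsed with stock Go tooling; the commands below are the standard ones, not taken from this PR:

```
go test ./pkg/sql/tests -run '^$' -bench 'KV/Insert/SQL/rows=100' -cpuprofile cpu.out
go tool pprof -http=:8080 cpu.out  # open the "Flame Graph" view in the web UI
```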

The trend across the different tests in the benchmark suite is also interesting. This change has a much larger impact on workloads that insert rows in bulk batches than on workloads that insert a single row at a time. This makes some sense, because most work in the system is O(num stmts/batches), while timestamp cache lookup and insertion are both O(num individual requests): a 100-row `INSERT` is a single statement, but it issues 100 `ConditionalPutRequests`, each of which touches the cache. In other words, batching doesn't reduce the load incident on the timestamp cache, but it does reduce the load incident on most other components, which explains why the cost of the timestamp cache became more pronounced with more batching.

#### Full-System Benchmarks

A question worth asking at this point is why we hadn't already picked this up in any of our full-system benchmarks. I believe the answer is that the workloads most likely to catch it are `kv0` and `ycsb f` (our broken version). The former is blind to this issue because it issues `UPSERT` statements instead of `INSERT` statements, which use `PutRequests` under the hood instead of `ConditionalPutRequests`. The latter is blind to it because it never performs any batching; it's also fairly broken and until recently created large write hotspots.

To test this change with a full-system benchmark on a real cluster, I modified `kv` to use `INSERT` statements instead of `UPSERT` statements (a sketch of the idea follows below). I then ran it on clusters of 1 and 3 c5d.4xlarge VMs with varying batch sizes. The results are fairly consistent with the microbenchmark results listed above. The impact of the change is more pronounced as the batch size grows, and more pronounced on the single-node cluster, where replication cost is negligible, though there is still some improvement on the multi-node cluster.
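
For illustration, here is a rough sketch of the statement-building change in the `kv` workload. The real `pkg/workload/kv` code differs in detail, and the `useInsert` knob is a hypothetical flag added for this experiment:

```go
package main

import (
	"fmt"
	"strings"
)

// buildWriteStmt loosely mirrors how the kv workload constructs its
// batched write statement. Since INSERT maps to ConditionalPutRequests
// and UPSERT to PutRequests, swapping the verb is enough to exercise
// the CPut path under the same workload shape.
func buildWriteStmt(batchSize int, useInsert bool) string {
	verb := "UPSERT"
	if useInsert {
		verb = "INSERT"
	}
	var b strings.Builder
	fmt.Fprintf(&b, "%s INTO kv (k, v) VALUES", verb)
	for i := 0; i < batchSize; i++ {
		if i > 0 {
			b.WriteString(",")
		}
		// Two placeholders per row: key and value.
		fmt.Fprintf(&b, " ($%d, $%d)", 2*i+1, 2*i+2)
	}
	return b.String()
}

func main() {
	fmt.Println(buildWriteStmt(2, true))
	// INSERT INTO kv (k, v) VALUES ($1, $2), ($3, $4)
}
```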

```
name                         old ops/sec  new ops/sec  delta
kv0-b1/cores=16/nodes=1       31.3k ± 0%   32.0k ± 0%   +2.14%  (p=0.029 n=4+4)
kv0-b10/cores=16/nodes=1      4.39k ± 0%   4.55k ± 0%   +3.47%  (p=0.008 n=5+5)
kv0-b100/cores=16/nodes=1       938 ± 0%    1016 ± 0%   +8.38%  (p=0.008 n=5+5)
kv0-b1000/cores=16/nodes=1      126 ± 0%     141 ± 0%  +11.13%  (p=0.008 n=5+5)
kv0-b10000/cores=16/nodes=1    12.7 ± 1%    13.9 ± 2%   +9.78%  (p=0.008 n=5+5)
kv0-b1/cores=16/nodes=3       25.1k ± 0%   25.3k ± 1%   +1.04%  (p=0.016 n=5+5)
kv0-b10/cores=16/nodes=3      4.49k ± 3%   4.56k ± 3%     ~     (p=0.222 n=5+5)
kv0-b100/cores=16/nodes=3     1.13k ± 4%   1.19k ± 1%   +4.95%  (p=0.032 n=5+5)
kv0-b1000/cores=16/nodes=3      177 ± 3%     184 ± 3%   +3.98%  (p=0.056 n=5+5)
kv0-b10000/cores=16/nodes=3    19.4 ± 4%    20.1 ± 6%   +3.71%  (p=0.103 n=5+5)

name                         old p50(ms)  new p50(ms)  delta
kv0-b1/cores=16/nodes=1        1.00 ± 0%    1.00 ± 0%     ~     (all equal)
kv0-b10/cores=16/nodes=1       7.30 ± 0%    6.80 ± 0%   -6.85%  (p=0.079 n=4+5)
kv0-b100/cores=16/nodes=1      35.7 ± 0%    31.5 ± 0%  -11.76%  (p=0.008 n=5+5)
kv0-b1000/cores=16/nodes=1      255 ± 2%     226 ± 0%  -11.18%  (p=0.008 n=5+5)
kv0-b10000/cores=16/nodes=1   2.42k ± 0%   2.19k ± 5%   -9.44%  (p=0.000 n=4+5)
kv0-b1/cores=16/nodes=3        1.20 ± 0%    1.20 ± 0%     ~     (all equal)
kv0-b10/cores=16/nodes=3       6.92 ± 3%    6.82 ± 4%     ~     (p=0.643 n=5+5)
kv0-b100/cores=16/nodes=3      28.5 ± 4%    27.3 ± 0%   -4.34%  (p=0.048 n=5+5)
kv0-b1000/cores=16/nodes=3      185 ± 5%     180 ± 3%     ~     (p=0.365 n=5+5)
kv0-b10000/cores=16/nodes=3   1.41k ± 0%   1.38k ± 3%     ~     (p=0.333 n=4+5)

name                         old p99(ms)  new p99(ms)  delta
kv0-b1/cores=16/nodes=1        2.50 ± 0%    2.50 ± 0%     ~     (all equal)
kv0-b10/cores=16/nodes=1       14.2 ± 0%    14.2 ± 0%     ~     (all equal)
kv0-b100/cores=16/nodes=1      46.1 ± 0%    44.0 ± 0%   -4.56%  (p=0.000 n=5+4)
kv0-b1000/cores=16/nodes=1      319 ± 5%     285 ± 0%  -10.53%  (p=0.008 n=5+5)
kv0-b10000/cores=16/nodes=1   4.72k ± 3%   4.54k ± 8%     ~     (p=0.365 n=5+5)
kv0-b1/cores=16/nodes=3        2.80 ± 0%    2.90 ± 0%   +3.57%  (p=0.008 n=5+5)
kv0-b10/cores=16/nodes=3       12.6 ± 0%    12.8 ± 2%     ~     (p=0.556 n=4+5)
kv0-b100/cores=16/nodes=3      38.1 ± 6%    35.7 ± 0%   -6.40%  (p=0.016 n=5+4)
kv0-b1000/cores=16/nodes=3      225 ± 4%     215 ± 2%   -4.48%  (p=0.079 n=5+5)
kv0-b10000/cores=16/nodes=3   3.54k ± 6%   3.19k ± 5%   -9.85%  (p=0.016 n=5+5)
```

One difference between the microbenchmark and the full-system benchmark that's worth mentioning is that the microbenchmark never splits ranges, so even when writes are batched, they're still executed as 1PC txns. In the full-system benchmark, ranges are split, so aside from the single-write batch case, all other cases perform distributed transactions.

Release note: None

@nvanbenschoten nvanbenschoten requested review from ajwerner and a team June 9, 2019 22:18
@cockroach-teamcity (Member)

This change is Reviewable

@tbg (Member) left a comment

Nice! For the allocation numbers in the microbenchmarks, why isn't the allocation size going down more significantly? Was the benchmark simply not running long enough to cycle through a number of 64MB pages, or am I missing some reason that makes them matter less than I think?

Reviewed 3 of 3 files at r1.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner)

@nvanbenschoten (Member, Author) commented Jun 17, 2019

> For the allocation numbers in the microbenchmarks, why isn't the allocation size going down more significantly? Was the benchmark simply not running long enough to cycle through a number of 64MB pages, or am I missing some reason that makes them matter less than I think?

My guess is that these are allocation counts per operation, so we don't expect to see a huge amount of movement there. We see a reduction of about 100 bytes per write (change in alloc/op divided by row count; e.g. for rows=100, (468kB − 458kB)/100 ≈ 100B), which is a reasonable estimate for the size of a tscache entry (len(key) + sizeof(ts) + sizeof(txnid) + sizeof(arenaskl.node)).

@nvanbenschoten (Member, Author)

bors r=tbg

craig bot pushed a commit that referenced this pull request Jun 17, 2019
38122: storage: don't bump tscache on successful CPut/InitPut requests r=tbg a=nvanbenschoten

Co-authored-by: Nathan VanBenschoten <[email protected]>
@craig (Contributor) commented Jun 17, 2019

Build succeeded
