storage: don't bump tscache on successful CPut/InitPut requests #38122
Conversation
A transaction needs to insert into the timestamp cache on conditional put requests that fail so that other transactions can't rewrite its history and invalidate the read that decided whether the condition was met or not. However, when a conditional put request does not fail, it lays down an intent, which provides the same degree of protection against other writers invalidating the conditional put's read. No other writer will be able to write to the same key until the original transaction either commits and resolves its intent or aborts and removes its intent. In this way, the intent acts like a partial read timestamp cache for the key, so we don't need to also insert into the in-memory timestamp cache.

This commit uses this reasoning to stop bumping the timestamp cache on successful `ConditionalPutRequests` and `InitPutRequests`. This avoids an interesting performance degradation I observed under heavy CPut-only load, where each successful write was adding to the timestamp cache and causing it to bloat, which in turn slowed down subsequent lookups into the timestamp cache. By skipping the insertion into the cache on successful CPuts, we avoid both the cost of inserting into the cache and the cost of slowing down future writers when they need to look in the cache.
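To make the new rule concrete, here is a minimal sketch in Go of the decision this change adopts. The type and function names are hypothetical stand-ins, not the actual request protos or replica code; it only illustrates when a request should bump the timestamp cache under this reasoning.

```go
package main

import "fmt"

// Request and the concrete types below are hypothetical stand-ins for
// the real request protos, used only to illustrate the rule.
type Request interface{ Method() string }

type ConditionalPutRequest struct{}
type InitPutRequest struct{}
type GetRequest struct{}

func (ConditionalPutRequest) Method() string { return "ConditionalPut" }
func (InitPutRequest) Method() string        { return "InitPut" }
func (GetRequest) Method() string            { return "Get" }

// updatesTimestampCache sketches the post-change behavior: CPut and
// InitPut bump the timestamp cache only when they fail their condition
// check, because a successful CPut/InitPut lays down an intent that
// already blocks other writers from invalidating its read. Plain reads
// still always bump the cache.
func updatesTimestampCache(req Request, failed bool) bool {
	switch req.(type) {
	case ConditionalPutRequest, InitPutRequest:
		// On failure there is no intent to protect the read, so the
		// timestamp cache must remember it. On success, the intent does.
		return failed
	case GetRequest:
		return true
	default:
		return false
	}
}

func main() {
	fmt.Println(updatesTimestampCache(ConditionalPutRequest{}, false)) // false: intent protects the read
	fmt.Println(updatesTimestampCache(ConditionalPutRequest{}, true))  // true: failed CPut's read needs protection
	fmt.Println(updatesTimestampCache(GetRequest{}, false))            // true: reads always bump the cache
}
```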
### Benchmark Results

#### Microbenchmarks

I initially saw the slowdown when running the `KV/Insert/SQL` microbenchmark suite from `pkg/sql/tests/kv_tests.go`. These tests demonstrate that under heavy `INSERT` load (which translates to `ConditionalPutRequests` under the hood), avoiding the timestamp cache is a major win.

```
name                         old time/op    new time/op    delta
KV/Insert/SQL/rows=1-16       256µs ± 3%     254µs ± 2%      ~     (p=0.182 n=9+10)
KV/Insert/SQL/rows=10-16      404µs ± 2%     344µs ± 0%   -14.89%  (p=0.000 n=10+8)
KV/Insert/SQL/rows=100-16    3.27ms ± 6%    1.01ms ± 1%   -69.12%  (p=0.000 n=8+9)
KV/Insert/SQL/rows=1000-16   16.8ms ± 2%     7.1ms ± 1%   -57.76%  (p=0.000 n=10+10)
KV/Insert/SQL/rows=10000-16   189ms ± 2%     103ms ± 2%   -45.45%  (p=0.000 n=9+10)

name                         old alloc/op   new alloc/op   delta
KV/Insert/SQL/rows=1-16      36.6kB ± 0%    36.5kB ± 0%    -0.27%  (p=0.001 n=10+10)
KV/Insert/SQL/rows=10-16     77.4kB ± 0%    76.4kB ± 0%    -1.18%  (p=0.000 n=9+10)
KV/Insert/SQL/rows=100-16     468kB ± 0%     458kB ± 0%    -2.08%  (p=0.000 n=9+10)
KV/Insert/SQL/rows=1000-16   4.93MB ± 0%    4.84MB ± 0%    -1.75%  (p=0.000 n=10+9)
KV/Insert/SQL/rows=10000-16  87.7MB ± 0%    86.8MB ± 0%    -1.01%  (p=0.000 n=10+9)

name                         old allocs/op  new allocs/op  delta
KV/Insert/SQL/rows=1-16         337 ± 0%       337 ± 0%      ~     (all equal)
KV/Insert/SQL/rows=10-16        600 ± 0%       599 ± 0%    -0.12%  (p=0.011 n=10+10)
KV/Insert/SQL/rows=100-16     2.99k ± 0%     2.98k ± 0%    -0.13%  (p=0.000 n=10+9)
KV/Insert/SQL/rows=1000-16    26.8k ± 0%     26.7k ± 0%    -0.36%  (p=0.000 n=9+10)
KV/Insert/SQL/rows=10000-16    384k ± 0%      384k ± 0%      ~     (p=0.984 n=10+10)
```

CPU profiles demonstrate why this is the case. Here is the before and after CPU profile for a run of `KV/Insert/SQL/rows=100`:

![Screenshot_2019-06-09 unknown cpu](https://user-images.githubusercontent.com/5438456/59165000-d5757f00-8ae2-11e9-9549-084ac631380c.png)
_CPU profile flamegraph before change, tscache accesses highlighted_

![Screenshot_2019-06-09 unknown cpu(1)](https://user-images.githubusercontent.com/5438456/59165001-e0301400-8ae2-11e9-906c-f88a5fcdc2eb.png)
_CPU profile flamegraph after change, tscache accesses highlighted_

As the profiles show, before this change, timestamp cache access accounts for **45%** of the CPU profile (ignoring threading overhead in this benchmark, which has been cut out). After this change, timestamp cache access accounts for only **0.3%** of the CPU profile.

The trend across the different tests in the benchmark suite is also interesting. This change has a much larger impact on workloads that insert rows in bulk batches than on workloads that insert a single row at a time. This makes some sense because most work in the system is O(num stmts/batches), but timestamp cache lookup and insertion are both O(num individual requests). In other words, batching doesn't reduce the load incident on the timestamp cache, but it does reduce the load incident on most other components. This explains why the cost of the timestamp cache became more pronounced with more batching, as the sketch below illustrates.
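As a hedged illustration of that point (the `kv` table name and the statement shape here are assumptions modeled on the workload, not the actual workload code): one batched `INSERT` is a single statement's worth of parse/plan/txn overhead, but it fans out into one `ConditionalPutRequest` per row, and before this change each of those requests also inserted into the timestamp cache.

```go
package main

import (
	"fmt"
	"strings"
)

// buildBatchedInsert builds a single multi-row INSERT for a hypothetical
// kv(k, v) table. One statement like this is one unit of SQL-level work,
// but it becomes one ConditionalPutRequest per row under the hood.
func buildBatchedInsert(rows int) (stmt string, cputs int) {
	var b strings.Builder
	b.WriteString("INSERT INTO kv (k, v) VALUES ")
	for i := 0; i < rows; i++ {
		if i > 0 {
			b.WriteString(", ")
		}
		fmt.Fprintf(&b, "($%d, $%d)", 2*i+1, 2*i+2)
	}
	// Each inserted row is its own ConditionalPutRequest, so the
	// timestamp cache load scales with rows, not with statements.
	return b.String(), rows
}

func main() {
	stmt, cputs := buildBatchedInsert(100)
	fmt.Printf("1 SQL statement -> %d ConditionalPutRequests\n", cputs)
	fmt.Println(stmt[:45] + " ...")
}
```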
#### Full-System Benchmarks

A question worth asking at this point is why we haven't picked up on this in any of our full-system benchmarks. I believe the answer is that the workloads most likely to catch it are `kv0` and `ycsb f` (our broken version). The former is blind to this issue because it issues `UPSERT` statements instead of `INSERT` statements, which use `PutRequests` under the hood instead of `ConditionalPutRequests`. The latter is blind to it because it never performs any batching; it's also fairly broken and until recently created large write hotspots.

To test this change with a full-system benchmark on a real cluster, I modified `kv` to use `INSERT` statements instead of `UPSERT` statements. I then ran it on clusters of 1 and 3 c5d.4xlarge VMs with varying batch sizes. The results are fairly consistent with the microbenchmark results listed above. We can see that the impact of the change grows as the batch size grows. We can also see that the impact is more pronounced for single-node clusters, where replication cost is negligible, but that there is still some improvement for multi-node clusters.

```
name                         old ops/sec  new ops/sec  delta
kv0-b1/cores=16/nodes=1       31.3k ± 0%   32.0k ± 0%   +2.14%  (p=0.029 n=4+4)
kv0-b10/cores=16/nodes=1      4.39k ± 0%   4.55k ± 0%   +3.47%  (p=0.008 n=5+5)
kv0-b100/cores=16/nodes=1       938 ± 0%    1016 ± 0%   +8.38%  (p=0.008 n=5+5)
kv0-b1000/cores=16/nodes=1      126 ± 0%     141 ± 0%  +11.13%  (p=0.008 n=5+5)
kv0-b10000/cores=16/nodes=1    12.7 ± 1%    13.9 ± 2%   +9.78%  (p=0.008 n=5+5)
kv0-b1/cores=16/nodes=3       25.1k ± 0%   25.3k ± 1%   +1.04%  (p=0.016 n=5+5)
kv0-b10/cores=16/nodes=3      4.49k ± 3%   4.56k ± 3%     ~     (p=0.222 n=5+5)
kv0-b100/cores=16/nodes=3     1.13k ± 4%   1.19k ± 1%   +4.95%  (p=0.032 n=5+5)
kv0-b1000/cores=16/nodes=3      177 ± 3%     184 ± 3%   +3.98%  (p=0.056 n=5+5)
kv0-b10000/cores=16/nodes=3    19.4 ± 4%    20.1 ± 6%   +3.71%  (p=0.103 n=5+5)

name                         old p50(ms)  new p50(ms)  delta
kv0-b1/cores=16/nodes=1        1.00 ± 0%    1.00 ± 0%     ~     (all equal)
kv0-b10/cores=16/nodes=1       7.30 ± 0%    6.80 ± 0%   -6.85%  (p=0.079 n=4+5)
kv0-b100/cores=16/nodes=1      35.7 ± 0%    31.5 ± 0%  -11.76%  (p=0.008 n=5+5)
kv0-b1000/cores=16/nodes=1      255 ± 2%     226 ± 0%  -11.18%  (p=0.008 n=5+5)
kv0-b10000/cores=16/nodes=1   2.42k ± 0%   2.19k ± 5%   -9.44%  (p=0.000 n=4+5)
kv0-b1/cores=16/nodes=3        1.20 ± 0%    1.20 ± 0%     ~     (all equal)
kv0-b10/cores=16/nodes=3       6.92 ± 3%    6.82 ± 4%     ~     (p=0.643 n=5+5)
kv0-b100/cores=16/nodes=3      28.5 ± 4%    27.3 ± 0%   -4.34%  (p=0.048 n=5+5)
kv0-b1000/cores=16/nodes=3      185 ± 5%     180 ± 3%     ~     (p=0.365 n=5+5)
kv0-b10000/cores=16/nodes=3   1.41k ± 0%   1.38k ± 3%     ~     (p=0.333 n=4+5)

name                         old p99(ms)  new p99(ms)  delta
kv0-b1/cores=16/nodes=1        2.50 ± 0%    2.50 ± 0%     ~     (all equal)
kv0-b10/cores=16/nodes=1       14.2 ± 0%    14.2 ± 0%     ~     (all equal)
kv0-b100/cores=16/nodes=1      46.1 ± 0%    44.0 ± 0%   -4.56%  (p=0.000 n=5+4)
kv0-b1000/cores=16/nodes=1      319 ± 5%     285 ± 0%  -10.53%  (p=0.008 n=5+5)
kv0-b10000/cores=16/nodes=1   4.72k ± 3%   4.54k ± 8%     ~     (p=0.365 n=5+5)
kv0-b1/cores=16/nodes=3        2.80 ± 0%    2.90 ± 0%   +3.57%  (p=0.008 n=5+5)
kv0-b10/cores=16/nodes=3       12.6 ± 0%    12.8 ± 2%     ~     (p=0.556 n=4+5)
kv0-b100/cores=16/nodes=3      38.1 ± 6%    35.7 ± 0%   -6.40%  (p=0.016 n=5+4)
kv0-b1000/cores=16/nodes=3      225 ± 4%     215 ± 2%   -4.48%  (p=0.079 n=5+5)
kv0-b10000/cores=16/nodes=3   3.54k ± 6%   3.19k ± 5%   -9.85%  (p=0.016 n=5+5)
```

One difference between the microbenchmark and the full-system benchmark worth mentioning here is that the microbenchmark never splits ranges, so even batched writes execute as 1PC txns. In the full-system benchmark, ranges are split, so aside from the single-write batch case, all other cases perform distributed transactions.

Release note: None
Nice! For the allocation numbers in the microbenchmarks, why isn't the allocation size going down more significantly? Was the benchmark simply not running long enough to cycle through a number of 64mb pages or am I missing some reason that makes them matter less than I think?
Reviewed 3 of 3 files at r1.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner)
My guess is that these are allocation counts per operation, so we don't expect to see a huge amount of movement there. We see a reduction of about 100 bytes per write.
bors r=tbg
38122: storage: don't bump tscache on successful CPut/InitPut requests r=tbg a=nvanbenschoten

Co-authored-by: Nathan VanBenschoten <[email protected]>
Build succeeded