storage: don't bump tscache on successful CPut/InitPut requests #38122

Merged

Conversation

@nvanbenschoten (Member)
A transaction needs to insert into the timestamp cache on conditional put requests that fail so that other transactions can't rewrite its history and invalidate the read that decided whether the condition was met. However, when a conditional put request succeeds, it lays down an intent, which provides the same degree of protection against other writers invalidating the conditional put's read. No other writer will be able to write to the same key until the original transaction either commits and resolves its intent or aborts and removes its intent. In this way, the intent acts like a partial read timestamp cache for the key, so we don't need to also insert into the in-memory timestamp cache.

This commit uses this reasoning to stop bumping the timestamp cache on successful `ConditionalPutRequests` and `InitPutRequests`. This avoids an interesting performance degradation I observed under heavy CPut-only load, where each successful write was adding to the timestamp cache and causing it to bloat, which in turn slowed down subsequent lookups into the cache. By skipping the insertion on successful CPuts, we avoid both the cost of inserting into the cache and the cost imposed on future writers when they need to look in it.
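
To make the rule concrete, here is a minimal sketch of the decision this commit implements. It is not the actual CockroachDB code: the flag names and plumbing here are illustrative assumptions, though the logic follows the reasoning above.

```go
package main

import "fmt"

type flag int

const (
	// updatesTSCache: the request always bumps the timestamp cache.
	updatesTSCache flag = 1 << iota
	// updatesTSCacheOnError: the request bumps the cache only when it
	// fails, i.e. when no intent was written to protect its read.
	updatesTSCacheOnError
)

// requestFlags maps request types to their timestamp-cache behavior.
// Before this change, ConditionalPut and InitPut would have carried
// updatesTSCache unconditionally.
var requestFlags = map[string]flag{
	"Get":            updatesTSCache,
	"Scan":           updatesTSCache,
	"ConditionalPut": updatesTSCacheOnError,
	"InitPut":        updatesTSCacheOnError,
}

// shouldUpdateTSCache reports whether evaluating req must insert into
// the timestamp cache. A successful CPut lays down an intent, which
// already prevents other writers from invalidating its read.
func shouldUpdateTSCache(req string, failed bool) bool {
	f := requestFlags[req]
	if f&updatesTSCache != 0 {
		return true
	}
	return f&updatesTSCacheOnError != 0 && failed
}

func main() {
	fmt.Println(shouldUpdateTSCache("ConditionalPut", false)) // false: the intent protects the read
	fmt.Println(shouldUpdateTSCache("ConditionalPut", true))  // true: no intent was written
}
```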

### Benchmark Results

#### Microbenchmarks

I initially saw the slowdown when running the `KV/Insert/SQL` microbenchmark suite from `pkg/sql/tests/kv_test.go`. These tests demonstrate that under heavy `INSERT` load (which translates to `ConditionalPutRequests` under the hood), avoiding the timestamp cache is a major win.
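
A comparison like the one below can be produced with standard Go tooling (`go test` plus `benchstat` from golang.org/x/perf); the commands here are the usual ones for this kind of A/B run, not commands taken from this PR:

```
go test ./pkg/sql/tests -run '^$' -bench 'KV/Insert/SQL' -count 10 | tee old.txt
# ...apply this change and rebuild...
go test ./pkg/sql/tests -run '^$' -bench 'KV/Insert/SQL' -count 10 | tee new.txt
benchstat old.txt new.txt
```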

```
name                         old time/op    new time/op    delta
KV/Insert/SQL/rows=1-16         256µs ± 3%     254µs ± 2%     ~     (p=0.182 n=9+10)
KV/Insert/SQL/rows=10-16        404µs ± 2%     344µs ± 0%  -14.89%  (p=0.000 n=10+8)
KV/Insert/SQL/rows=100-16      3.27ms ± 6%    1.01ms ± 1%  -69.12%  (p=0.000 n=8+9)
KV/Insert/SQL/rows=1000-16     16.8ms ± 2%     7.1ms ± 1%  -57.76%  (p=0.000 n=10+10)
KV/Insert/SQL/rows=10000-16     189ms ± 2%     103ms ± 2%  -45.45%  (p=0.000 n=9+10)

name                         old alloc/op   new alloc/op   delta
KV/Insert/SQL/rows=1-16        36.6kB ± 0%    36.5kB ± 0%   -0.27%  (p=0.001 n=10+10)
KV/Insert/SQL/rows=10-16       77.4kB ± 0%    76.4kB ± 0%   -1.18%  (p=0.000 n=9+10)
KV/Insert/SQL/rows=100-16       468kB ± 0%     458kB ± 0%   -2.08%  (p=0.000 n=9+10)
KV/Insert/SQL/rows=1000-16     4.93MB ± 0%    4.84MB ± 0%   -1.75%  (p=0.000 n=10+9)
KV/Insert/SQL/rows=10000-16    87.7MB ± 0%    86.8MB ± 0%   -1.01%  (p=0.000 n=10+9)

name                         old allocs/op  new allocs/op  delta
KV/Insert/SQL/rows=1-16           337 ± 0%       337 ± 0%     ~     (all equal)
KV/Insert/SQL/rows=10-16          600 ± 0%       599 ± 0%   -0.12%  (p=0.011 n=10+10)
KV/Insert/SQL/rows=100-16       2.99k ± 0%     2.98k ± 0%   -0.13%  (p=0.000 n=10+9)
KV/Insert/SQL/rows=1000-16      26.8k ± 0%     26.7k ± 0%   -0.36%  (p=0.000 n=9+10)
KV/Insert/SQL/rows=10000-16      384k ± 0%      384k ± 0%     ~     (p=0.984 n=10+10)
```

CPU profiles demonstrate why this is the case. Here are the before and after CPU profiles for a run of `KV/Insert/SQL/rows=100`:

![Screenshot_2019-06-09 unknown cpu](https://user-images.githubusercontent.com/5438456/59165000-d5757f00-8ae2-11e9-9549-084ac631380c.png)

_CPU profile flamegraph before change, tscache accesses highlighted_

![Screenshot_2019-06-09 unknown cpu(1)](https://user-images.githubusercontent.com/5438456/59165001-e0301400-8ae2-11e9-906c-f88a5fcdc2eb.png)

_CPU profile flamegraph after change, tscache accesses highlighted_

As the profiles show, before this change timestamp cache access accounted for **45%** of the CPU profile (ignoring threading overhead in this benchmark, which has been cut out). After this change, timestamp cache access accounts for only **0.3%** of the profile.
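
As an aside, flame graphs like these can be collected and browsed with stock Go tooling; the commands below are the standard ones, not taken from this PR:

```
go test ./pkg/sql/tests -run '^$' -bench 'KV/Insert/SQL/rows=100' -cpuprofile cpu.out
go tool pprof -http=:8080 cpu.out  # open the "Flame Graph" view in the web UI
```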

The trend across the different tests in the benchmark suite is also interesting. This change has a much larger impact on workloads that insert rows in bulk batches than on workloads that insert a single row at a time. This makes some sense, because most work in the system is O(num stmts/batches), while timestamp cache lookup and insertion are both O(num individual requests): a 100-row `INSERT` is a single statement, but it issues 100 `ConditionalPutRequests`, each of which touches the cache. In other words, batching doesn't reduce the load incident on the timestamp cache, but it does reduce the load incident on most other components, which explains why the cost of the timestamp cache became more pronounced with more batching.

#### Full-System Benchmarks

A question worth asking at this point is why we hadn't already picked this up in any of our full-system benchmarks. I believe the answer is that the workloads most likely to catch it are `kv0` and `ycsb f` (our broken version). The former is blind to this issue because it issues `UPSERT` statements instead of `INSERT` statements, which use `PutRequests` under the hood instead of `ConditionalPutRequests`. The latter is blind to it because it never performs any batching; it's also fairly broken and until recently created large write hotspots.

To test this change with a full-system benchmark on a real cluster, I modified `kv` to use `INSERT` statements instead of `UPSERT` statements (a sketch of the idea follows below). I then ran it on clusters of 1 and 3 c5d.4xlarge VMs with varying batch sizes. The results are fairly consistent with the microbenchmark results listed above. The impact of the change is more pronounced as the batch size grows, and more pronounced on the single-node cluster, where replication cost is negligible, though there is still some improvement on the multi-node cluster.
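
For illustration, here is a rough sketch of the statement-building change in the `kv` workload. The real `pkg/workload/kv` code differs in detail, and the `useInsert` knob is a hypothetical flag added for this experiment:

```go
package main

import (
	"fmt"
	"strings"
)

// buildWriteStmt loosely mirrors how the kv workload constructs its
// batched write statement. Since INSERT maps to ConditionalPutRequests
// and UPSERT to PutRequests, swapping the verb is enough to exercise
// the CPut path under the same workload shape.
func buildWriteStmt(batchSize int, useInsert bool) string {
	verb := "UPSERT"
	if useInsert {
		verb = "INSERT"
	}
	var b strings.Builder
	fmt.Fprintf(&b, "%s INTO kv (k, v) VALUES", verb)
	for i := 0; i < batchSize; i++ {
		if i > 0 {
			b.WriteString(",")
		}
		// Two placeholders per row: key and value.
		fmt.Fprintf(&b, " ($%d, $%d)", 2*i+1, 2*i+2)
	}
	return b.String()
}

func main() {
	fmt.Println(buildWriteStmt(2, true))
	// INSERT INTO kv (k, v) VALUES ($1, $2), ($3, $4)
}
```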

```
name                         old ops/sec  new ops/sec  delta
kv0-b1/cores=16/nodes=1       31.3k ± 0%   32.0k ± 0%   +2.14%  (p=0.029 n=4+4)
kv0-b10/cores=16/nodes=1      4.39k ± 0%   4.55k ± 0%   +3.47%  (p=0.008 n=5+5)
kv0-b100/cores=16/nodes=1       938 ± 0%    1016 ± 0%   +8.38%  (p=0.008 n=5+5)
kv0-b1000/cores=16/nodes=1      126 ± 0%     141 ± 0%  +11.13%  (p=0.008 n=5+5)
kv0-b10000/cores=16/nodes=1    12.7 ± 1%    13.9 ± 2%   +9.78%  (p=0.008 n=5+5)
kv0-b1/cores=16/nodes=3       25.1k ± 0%   25.3k ± 1%   +1.04%  (p=0.016 n=5+5)
kv0-b10/cores=16/nodes=3      4.49k ± 3%   4.56k ± 3%     ~     (p=0.222 n=5+5)
kv0-b100/cores=16/nodes=3     1.13k ± 4%   1.19k ± 1%   +4.95%  (p=0.032 n=5+5)
kv0-b1000/cores=16/nodes=3      177 ± 3%     184 ± 3%   +3.98%  (p=0.056 n=5+5)
kv0-b10000/cores=16/nodes=3    19.4 ± 4%    20.1 ± 6%   +3.71%  (p=0.103 n=5+5)

name                         old p50(ms)  new p50(ms)  delta
kv0-b1/cores=16/nodes=1        1.00 ± 0%    1.00 ± 0%     ~     (all equal)
kv0-b10/cores=16/nodes=1       7.30 ± 0%    6.80 ± 0%   -6.85%  (p=0.079 n=4+5)
kv0-b100/cores=16/nodes=1      35.7 ± 0%    31.5 ± 0%  -11.76%  (p=0.008 n=5+5)
kv0-b1000/cores=16/nodes=1      255 ± 2%     226 ± 0%  -11.18%  (p=0.008 n=5+5)
kv0-b10000/cores=16/nodes=1   2.42k ± 0%   2.19k ± 5%   -9.44%  (p=0.000 n=4+5)
kv0-b1/cores=16/nodes=3        1.20 ± 0%    1.20 ± 0%     ~     (all equal)
kv0-b10/cores=16/nodes=3       6.92 ± 3%    6.82 ± 4%     ~     (p=0.643 n=5+5)
kv0-b100/cores=16/nodes=3      28.5 ± 4%    27.3 ± 0%   -4.34%  (p=0.048 n=5+5)
kv0-b1000/cores=16/nodes=3      185 ± 5%     180 ± 3%     ~     (p=0.365 n=5+5)
kv0-b10000/cores=16/nodes=3   1.41k ± 0%   1.38k ± 3%     ~     (p=0.333 n=4+5)

name                         old p99(ms)  new p99(ms)  delta
kv0-b1/cores=16/nodes=1        2.50 ± 0%    2.50 ± 0%     ~     (all equal)
kv0-b10/cores=16/nodes=1       14.2 ± 0%    14.2 ± 0%     ~     (all equal)
kv0-b100/cores=16/nodes=1      46.1 ± 0%    44.0 ± 0%   -4.56%  (p=0.000 n=5+4)
kv0-b1000/cores=16/nodes=1      319 ± 5%     285 ± 0%  -10.53%  (p=0.008 n=5+5)
kv0-b10000/cores=16/nodes=1   4.72k ± 3%   4.54k ± 8%     ~     (p=0.365 n=5+5)
kv0-b1/cores=16/nodes=3        2.80 ± 0%    2.90 ± 0%   +3.57%  (p=0.008 n=5+5)
kv0-b10/cores=16/nodes=3       12.6 ± 0%    12.8 ± 2%     ~     (p=0.556 n=4+5)
kv0-b100/cores=16/nodes=3      38.1 ± 6%    35.7 ± 0%   -6.40%  (p=0.016 n=5+4)
kv0-b1000/cores=16/nodes=3      225 ± 4%     215 ± 2%   -4.48%  (p=0.079 n=5+5)
kv0-b10000/cores=16/nodes=3   3.54k ± 6%   3.19k ± 5%   -9.85%  (p=0.016 n=5+5)
```

One difference between the microbenchmark and the full-system benchmark that's worth mentioning is that the microbenchmark never splits ranges, so even when writes are batched, they're still executed as 1PC txns. In the full-system benchmark, ranges are split, so aside from the single-write batch case, all other cases perform distributed transactions.

Release note: None

@nvanbenschoten nvanbenschoten requested review from ajwerner and a team June 9, 2019 22:18
@cockroach-teamcity (Member)

This change is Reviewable

@tbg (Member) left a comment

Nice! For the allocation numbers in the microbenchmarks, why isn't the allocation size going down more significantly? Was the benchmark simply not running long enough to cycle through a number of 64MB pages, or am I missing some reason that makes them matter less than I think?

Reviewed 3 of 3 files at r1.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner)

@nvanbenschoten (Member, Author) commented Jun 17, 2019

> For the allocation numbers in the microbenchmarks, why isn't the allocation size going down more significantly? Was the benchmark simply not running long enough to cycle through a number of 64MB pages, or am I missing some reason that makes them matter less than I think?

My guess is that these are allocation counts per operation, so we don't expect to see a huge amount of movement there. We see a reduction of about 100 bytes per write (change in alloc/op divided by row count; e.g. for rows=100, (468kB − 458kB)/100 ≈ 100B), which is a reasonable estimate for the size of a tscache entry (len(key) + sizeof(ts) + sizeof(txnid) + sizeof(arenaskl.node)).

@nvanbenschoten (Member, Author)

bors r=tbg

craig bot pushed a commit that referenced this pull request Jun 17, 2019
38122: storage: don't bump tscache on successful CPut/InitPut requests r=tbg a=nvanbenschoten

Co-authored-by: Nathan VanBenschoten <[email protected]>
@craig (Contributor) commented Jun 17, 2019

Build succeeded
