
kvevent: implement chunked blocking buffer #86421

Merged (3 commits) on Aug 19, 2022

Conversation

@jayshrivastava (Contributor) commented Aug 18, 2022

kvevent: refactor blocking buffer benchmark

This change updates the blocking buffer micro benchmark
in several ways:

  • it uses different types of events
  • it uses more producers than consumers to keep the
    buffer full
  • it makes b.N correspond to the total number of events,
    so the benchmark can analyze allocs per event

Release note: None

Release justification: This change updates a benchmark only.
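For readers less familiar with Go benchmarks, here is a minimal sketch of the shape described above. The channel-backed buffer, the producer/consumer counts, and the names are illustrative assumptions, not the actual kvevent benchmark:

```go
// Sketch of a benchmark where b.N is the total number of events and more
// producers than consumers keep the buffer full, so allocs/op is per event.
// This would live in a _test.go file; "events chan int" stands in for the buffer.
package sketch

import (
	"sync"
	"testing"
)

func BenchmarkBufferSketch(b *testing.B) {
	const producers, consumers = 8, 2 // more producers than consumers
	events := make(chan int, 1024)

	b.ReportAllocs()
	b.ResetTimer()

	var prod sync.WaitGroup
	perProducer := b.N / producers // b.N corresponds to the total number of events
	for p := 0; p < producers; p++ {
		prod.Add(1)
		go func() {
			defer prod.Done()
			for i := 0; i < perProducer; i++ {
				events <- i
			}
		}()
	}

	var cons sync.WaitGroup
	for c := 0; c < consumers; c++ {
		cons.Add(1)
		go func() {
			defer cons.Done()
			for range events { // drain until the channel is closed
			}
		}()
	}

	prod.Wait()
	close(events)
	cons.Wait()
}
```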

kvevent: implement chunked buffer event queue

This change implements a simple chunked event queue.
The purpose of this queue is to be used by
kvevent.blockingBuffer in subsequent commits.

Release note: None

Release justification: This change does not affect
any production code. It adds files which are not
called by any packages.
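As a rough illustration of what such a chunked event queue looks like, here is a minimal sketch with assumed names (eventSketch, chunkSize), not the actual kvevent types or sizes:

```go
// A FIFO built from fixed-size array chunks linked by a single next pointer.
// Most pushes/pops write into a preallocated array slot instead of allocating
// a per-event linked-list node.
package sketch

const chunkSize = 128

type eventSketch struct {
	key, value []byte
}

type chunk struct {
	events     [chunkSize]eventSketch
	head, tail int // head: next slot to pop; tail: next slot to push
	next       *chunk
}

type chunkedQueue struct {
	head, tail *chunk
}

func (q *chunkedQueue) enqueue(e eventSketch) {
	if q.tail == nil || q.tail.tail == chunkSize {
		// Current tail chunk is missing or full: link a fresh chunk.
		c := &chunk{}
		if q.tail == nil {
			q.head, q.tail = c, c
		} else {
			q.tail.next = c
			q.tail = c
		}
	}
	q.tail.events[q.tail.tail] = e
	q.tail.tail++
}

func (q *chunkedQueue) dequeue() (eventSketch, bool) {
	c := q.head
	if c == nil || c.head == c.tail {
		return eventSketch{}, false // queue is empty
	}
	e := c.events[c.head]
	c.events[c.head] = eventSketch{} // release references held by the slot
	c.head++
	if c.head == chunkSize {
		// Chunk fully consumed: unlink it (a real implementation could pool it).
		q.head = c.next
		if q.head == nil {
			q.tail = nil
		}
	}
	return e, true
}
```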

kvevent: refactor memory buffer to chunked linked list

This change refactors kvevent/blocking_buffer.go to use
a chunked linked list instead of a regular linked list to
reduce pointer usage. Note that the underlying sync.Pool,
which is also a linked list, will use fewer pointers because
we pool chunks instead of events.

Release note: None

Release justification: This change significantly
improves performance by reducing pressure on GC.
Consequently, it improves foreground SQL p99 latency.
GC has been causing severe issues in production
changefeeds. Merging this change in this release is
worth it for its potential to reduce incidents.
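To make the sync.Pool point concrete, here is a minimal sketch (again with assumed names, not the actual kvevent code) of pooling whole chunks rather than individual events, so the pool tracks one object per 128 events instead of one per event:

```go
package main

import (
	"fmt"
	"sync"
)

const chunkSize = 128

type eventSketch struct {
	key, value []byte
}

type chunk struct {
	events     [chunkSize]eventSketch
	head, tail int
	next       *chunk
}

// chunkPool hands out whole chunks; its internal free list holds one entry
// per chunk rather than one per event.
var chunkPool = sync.Pool{
	New: func() interface{} { return new(chunk) },
}

func getChunk() *chunk { return chunkPool.Get().(*chunk) }

func releaseChunk(c *chunk) {
	*c = chunk{} // clear indices and any lingering event references before reuse
	chunkPool.Put(c)
}

func main() {
	c := getChunk()
	c.events[c.tail] = eventSketch{key: []byte("k"), value: []byte("v")}
	c.tail++
	// ... hand the chunk to a consumer, pop its events ...
	releaseChunk(c)
	fmt.Println("chunk returned to the pool for reuse")
}
```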

Results (micro)

These are the results of running the microbenchmark.
./dev bench pkg/ccl/changefeedccl/kvevent --filter=BenchmarkMemBuffer --count=10 --bench-mem --stream-output --test-args="--test.benchtime=45s" -- --nocache_test_results --test_verbose_timeout_warnings |& tee bench.txt

name          old time/op    new time/op    delta
MemBuffer-10    1.22µs ± 2%    0.85µs ± 3%  -30.04%  (p=0.000 n=8+10)

name          old alloc/op   new alloc/op   delta
MemBuffer-10     0.00B          0.00B          ~     (all equal)

name          old allocs/op  new allocs/op  delta
MemBuffer-10      0.00           0.00          ~     (all equal)
  • Memory usage is 0 due to pooling in both implementations.
  • We can achieve a higher throughput with the chunked implementation - about 50-60M events in 45 seconds as opposed to ~40M with the old implementation.

Results (Macro)

Full results are published here. In summary:

I analyzed performance by running TPC-C for 30 minutes on a 15-node cluster with 10k warehouses. Before starting the workload, I started a changefeed on the order_line table (~200GB). I also set the following cluster settings to stress the buffer and pressure GC:
changefeed.backfill.concurrent_scan_requests = 100;
changefeed.memory.per_changefeed_limit = '1073741824'; (~1GB)

Then, I analyzed SQL latency from the Admin UI and GC performance using the output of GODEBUG=gctrace=1. These are the outcomes:

  • The p99 SQL latency during the workload was reduced from approx. 1.75s -> 0.150s (91%)
  • CPU time spent doing GC was reduced from 37.86 mins -> 20.75 mins (45%)
  • The p99 spike at the beginning of the workload was reduced from approx. 15s -> 12s (20%)

Relevant Issues

Addresses: #84582
(for now...)

@cockroach-teamcity (Member)

This change is Reviewable

@jayshrivastava force-pushed the chunked-ll-blocking-buf branch from 9a6052f to ced8985 on August 18, 2022 21:22
Yevgeniy Miretskiy and others added 2 commits August 18, 2022 17:24
@jayshrivastava force-pushed the chunked-ll-blocking-buf branch 3 times, most recently from bc95e15 to 5b4b969 on August 19, 2022 13:46
@jayshrivastava marked this pull request as ready for review August 19, 2022 13:46
@jayshrivastava requested a review from a team as a code owner August 19, 2022 13:46
@jayshrivastava requested review from gh-casper and miretskiy and removed request for a team and gh-casper August 19, 2022 13:46
@miretskiy (Contributor) left a comment


Mostly nits; giving LGTM, but please remove the unneeded interface and revert the unnecessary refactors from this PR.

Inline review comments (resolved):
  • pkg/ccl/changefeedccl/kvevent/blocking_buffer.go
  • pkg/ccl/changefeedccl/kvevent/alloc.go
  • pkg/ccl/changefeedccl/kvevent/blocking_buffer.go
  • pkg/ccl/changefeedccl/kvevent/event.go
@miretskiy (Contributor)

@nvanbenschoten, FYI: thanks to your investigations, @jayshrivastava made these changes that show significant improvements.

@jayshrivastava force-pushed the chunked-ll-blocking-buf branch from 5b4b969 to 5734c3d on August 19, 2022 14:40
@jayshrivastava (Contributor, Author)

bors r+

@craig (craig bot) commented Aug 19, 2022

Build failed (retrying...):

@nvanbenschoten (Member)

This is great! Nice experimentation @jayshrivastava.

Fixes: #84709

I don't think I see where we're addressing #84709 in this PR. Doesn't the Event struct still contain 10 pointers? So doesn't a bufferEventChunk contain 128x10+1=1281 pointers?

@craig (craig bot) commented Aug 19, 2022

Build failed (retrying...):

@shermanCRL changed the title from "kvevent: implement chunked blocking buff" to "kvevent: implement chunked blocking buffer" on Aug 19, 2022
@miretskiy (Contributor)

I don't think I see where we're addressing #84709 in this PR. Doesn't the Event struct still contain 10 pointers? So doesn't a bufferEventChunk contain 128x10+1=1281 pointers?

I think previous changes (e8e664c#diff-a2e21a39cea12e1823c4c6f7ce7e1513214a575ff049f204cf5633284cf8c6c9) replaced resolved events, which were allocating a pointer.
Event does contain various slices (key/value/resolved span), but those should be coming from the rangefeed, so I don't think we are allocating them.

@nvanbenschoten (Member)

I think there might be a bit of confusion here. The concern isn't reducing heap allocations; it's reducing the cost of GC by eliminating pointers that need to be traversed during the GC mark phase. So e8e664c#diff-a2e21a39cea12e1823c4c6f7ce7e1513214a575ff049f204cf5633284cf8c6c9 might have actually hurt, as the inlining replaced 1 pointer (*ResolvedSpan) with 2 (ResolvedSpan.Span.Key and ResolvedSpan.Span.EndKey).
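To illustrate the arithmetic being discussed, here is a rough sketch; the field names and the per-event pointer count are hypothetical, not the actual kvevent.Event layout:

```go
package main

import "fmt"

// Each []byte field contributes one pointer the GC mark phase must consider,
// whether or not the slice is non-nil for a given event.
type eventLike struct {
	key, value, prevValue []byte // 3 pointer words in this hypothetical layout
}

type chunkLike struct {
	events [128]eventLike // 128 * 3 pointer words
	next   *chunkLike     // + 1 for the link between chunks
}

func main() {
	const pointersPerEvent = 3
	fmt.Println("pointers per chunk in this sketch:", 128*pointersPerEvent+1) // 385
	// With 10 pointers per Event, as discussed above, the same arithmetic
	// gives 128*10+1 = 1281 pointers that the GC may need to traverse per chunk.
}
```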

@craig (craig bot) commented Aug 19, 2022

Build succeeded:

@craig (craig bot) merged commit 6a51183 into cockroachdb:master on Aug 19, 2022
@blathers-crl (bot) commented Aug 19, 2022

Encountered an error creating backports. Some common things that can go wrong:

  1. The backport branch might have already existed.
  2. There was a merge conflict.
  3. The backport branch contained merge commits.

You might need to create your backport manually using the backport tool.


error creating merge commit from 78b995a to blathers/backport-release-21.2-86421: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict []

you may need to manually resolve merge conflicts with the backport tool.

Backport to branch 21.2.x failed. See errors above.


error creating merge commit from 78b995a to blathers/backport-release-22.1-86421: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict []

you may need to manually resolve merge conflicts with the backport tool.

Backport to branch 22.1.x failed. See errors above.


🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan.

@miretskiy (Contributor)

replaced 1 pointer (*ResolvedSpan) with 2 (ResolvedSpan.Span.Key and ResolvedSpan.Span.EndKey).

This is what's confusing to me too: those 2 pointers were there anyway, because they were allocated
by the rangefeed RPC, so I don't know that we're making the situation worse. I think this is an improvement, because instead of 3 pointers that need to be traversed, it's 2 (which were allocated anyway).

@nvanbenschoten (Member)

Does every Event have a non-empty resolved span? My claim that it might make things worse is based on the assumption that the vast majority of Events aren't carrying a resolved span, but if that's not true then I agree that e8e664c#diff-a2e21a39cea12e1823c4c6f7ce7e1513214a575ff049f204cf5633284cf8c6c9 should help.

@miretskiy (Contributor) commented Aug 19, 2022

No, not every event has a non-empty resolved span, but I can't really claim that the majority of events aren't carrying a resolved
span (you could have a pretty low-traffic table where the majority of events will be resolved span events).
I don't know that just having 16 bytes in the structure (to hold resolved spans) is necessarily bad -- those arrays are allocated contiguously.

We could try to break up the event further:
for a "regular" event, we need 3 []byte slices (key, value, prev value); for Resolved, we only need 2.
So, have just 3 slices, plus the type of the event (we already have a "flush" boolean -- which could be replaced
with some sort of uint8). I don't know if it's worth the complexity, but we should chat.
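A minimal sketch of the shape being floated here; the names and layout are guesses at one possible design, not anything committed in kvevent:

```go
package sketch

// eventType replaces the existing "flush" boolean with a small tag.
type eventType uint8

const (
	eventTypeKV       eventType = iota // uses all three slices: key, value, prev value
	eventTypeResolved                  // uses two slices: span start and end key
	eventTypeFlush                     // uses no slices
)

type compactEvent struct {
	a, b, c []byte // meaning depends on typ, as noted above
	typ     eventType
}

func (e compactEvent) isFlush() bool { return e.typ == eventTypeFlush }
```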

@shermanCRL (Contributor)

Looks like we need a manual backport? Let’s get that going if we haven’t already.

@jayshrivastava (Contributor, Author)

@miretskiy and I are looking to test out the changes against TPC-E before backporting. It would be nice to see the impact of these changes on the same workload Nathan ran originally.

@jayshrivastava (Contributor, Author)

@shermanCRL Just finished testing with TPC-E, with and without this change. Please see the appendix in this doc for more info.

Summary (with both a massive TPC-E load and a massive changefeed backfill running):

  • SQL latency is reduced from 560ms -> 400ms (29%)
  • SQL statement throughput goes up from 5k -> 6k (20%)

@miretskiy and I have more ideas regarding #84709. I think it makes the most sense to carry them out and backport everything at the same time.

@shermanCRL (Contributor)

Nice! What’s the baseline SQL latency without changefeeds?

@jayshrivastava (Contributor, Author)

40ms with only TPC-E running
