
rangefeed: verify time-bound iterator usage #35122

Closed · danhhz opened this issue Feb 21, 2019 · 9 comments
Assignees: danhhz
Labels: A-cdc (Change Data Capture), C-bug, C-investigation, C-performance

Comments

danhhz (Contributor) commented Feb 21, 2019

We need to verify whether RangeFeeds need time-bound iterators to make the catchup scan performant and, if so, make sure they get the same workaround introduced in #32909 for ExportRequest.
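For context, a minimal sketch of the idea behind time-bound iterators (TBIs), assuming each SSTable records the minimum and maximum MVCC timestamp of the keys it contains. Names are illustrative, not the actual storage API:

```go
package tbisketch

// sstable stands in for an on-disk file whose metadata records the
// timestamp bounds of the MVCC versions stored inside it.
type sstable struct {
	minTS, maxTS int64
}

// tablesToScan keeps only the files that can contain versions in the
// window (from, to]; everything else is skipped outright. Skipping whole
// files is what makes a catchup scan over a narrow time window cheap.
func tablesToScan(files []sstable, from, to int64) []sstable {
	var out []sstable
	for _, f := range files {
		if f.maxTS <= from || f.minTS > to {
			continue // no versions in (from, to]: safe to skip
		}
		out = append(out, f)
	}
	return out
}
```

The flip side is that a file skipped incorrectly means silently missed keys, which is the class of correctness issue the #32909 workaround addresses.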

danhhz added the C-bug, C-investigation, C-performance, and A-cdc labels Feb 21, 2019
danhhz (Contributor, Author) commented Feb 21, 2019

cc @tbg @nvanbenschoten

danhhz self-assigned this Mar 6, 2019
danhhz (Contributor, Author) commented Mar 6, 2019

Summary: unlike with poller-based changefeeds, the difference between tbi and no-tbi in rangefeed-based changefeeds seems small enough that the peace of mind from not using them is worth it. I recommend we not use tbis for rangefeed catchup scans.

There's also been discussion of running a test that lets a changefeed reach steady state and then uses zone configs to move all watched data from one set of nodes to a disjoint set of nodes. Based on the results below, I fully expect this will work fine, but it's probably worth doing anyway.

@tbg @nvanbenschoten @rolandcrosby, anything else y'all would like to see before we commit to this?

Details

I ran some tests with the cdc/tpcc-1000 roachtest, modified to run for 30m and with background stats collection off. This runs a changefeed into kafka over tpcc with warehouses=1000 and no initial scan, on a 3-node n1-standard-16 cluster with kafka on a 4th node.

Latency percentiles in ms; multiple runs are comma-separated:

| Configuration           | Runs | Efficiency | p50      | p90      | p95      | p99            |
|-------------------------|------|------------|----------|----------|----------|----------------|
| Control (no changefeed) | 1    | 99.9%      | 28       | 52       | 67       | 4831           |
| Rangefeed+tbi           | 2    | 99.9%      | 29,32    | 61,65    | 80,84    | 3623,4026      |
| Rangefeed+notbi         | 3    | 99.9%      | 30,30,29 | 71,63,61 | 91,84,80 | 2952,3489,3489 |

- The latency data is noisy, but there doesn't seem to be a real difference.
- CPU usage seemed unchanged (see below).
- Disk read bytes seem mostly unchanged, which is a little surprising (see below).
- The changefeed commit-to-emit p50/p90/p95/p99 seems unchanged.
- I added a metric to track time spent in catchup scans, which went up dramatically: 40s -> 4853s (see the sketch just below).
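For reference, a minimal sketch of the kind of cumulative timer meant by that last bullet; the names are hypothetical (the real metric landed later in #35470):

```go
package metricsketch

import (
	"sync/atomic"
	"time"
)

// catchupNanos accumulates the total time spent in catchup scans; this
// is the quantity graphed in the screenshots below.
var catchupNanos int64

// timeCatchupScan runs one catchup scan and adds its duration to the
// cumulative counter.
func timeCatchupScan(scan func()) {
	start := time.Now()
	scan()
	atomic.AddInt64(&catchupNanos, int64(time.Since(start)))
}
```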

[Screenshots from the original issue: integral of catchup nanos (tbi, no-tbi); catchup nanos during steady-state rebalances/splits (no-tbi); CPU (tbi, no-tbi); disk read bytes (tbi, no-tbi).]

tbg (Member) commented Mar 6, 2019

Thanks for running these! I'm probably being thick -- you say that these are without catchup scan, so for what would the TBIs really be used? Just for reconnecting a changefeed after, say, a rebalance or split? What does the catchup nanos metric tell you? It seems that it had one giant spike at the beginning of the test (where I assume all of the rangefeeds were established). Is it correct to say that that process has become ~100x slower thanks to the absence of TBIs? Is the steady state even reconnecting any rangefeeds (the metric seems flat)?

BTW, is there a metric for how often rangefeeds need to be reconnected? It would be interesting to have numbers on that.

danhhz (Contributor, Author) commented Mar 6, 2019

> you say that these are without catchup scan, so for what would the TBIs really be used?

Totally understandable confusion here. They were run without an initial scan, which is a changefeed concept meaning we output the current rows in each range before starting the rangefeed. The rangefeed catchup scan runs when a RangeFeed connects and outputs everything between the timestamp of the RangeFeed request and wherever raft happens to be when the request starts processing. We need the catchup scans for correctness.
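To make that concrete, here's a rough sketch of the connect-time flow. All names are hypothetical, and the real implementation registers with raft before scanning (buffering events in between) so nothing is missed; this toy version only notes that in a comment:

```go
package rangefeedsketch

// event is a simplified MVCC version: a key, a value, and the timestamp
// at which the value was written.
type event struct {
	key, val string
	ts       int64
}

// connect emits every committed version newer than startTS (the catchup
// scan), then hands off to the stream of raft-applied events.
func connect(startTS int64, committed []event, raftEvents <-chan event, emit func(event)) {
	// Catchup scan: without a TBI this walks every key in the range; a
	// TBI would let the engine skip files whose versions all fall at or
	// before startTS.
	for _, ev := range committed {
		if ev.ts > startTS {
			emit(ev)
		}
	}
	// Steady state: stream raft-applied events from here on.
	for ev := range raftEvents {
		emit(ev)
	}
}
```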

> Just for reconnecting a changefeed after, say, a rebalance or split? What does the catchup nanos metric tell you? It seems that it had one giant spike at the beginning of the test (where I assume all of the rangefeeds were established). Is it correct to say that that process has become ~100x slower thanks to the absence of TBIs? Is the steady state even reconnecting any rangefeeds (the metric seems flat)?

Yes, it's expected that there is a giant spike at the beginning: that's the initial connection of all the RangeFeed requests. After that it should only happen when a RangeFeed disconnects and reconnects, which happens for a number of reasons, most commonly rebalances and splits. The catchup_nanos metric is the total time we spend doing the iteration I mention above; I've cleaned it up and pushed it in #35470. It's not totally flat, just infrequent during steady state, which is expected. This is why we're probably okay not using tbis in rangefeed: the iteration happens at the beginning and then not often in steady state.

You could probably say connecting a RangeFeed is ~100x slower, yes. But everything else costs the same.

> BTW, is there a metric for how often rangefeeds need to be reconnected? It would be interesting to have numbers on that.

I considered that in #35470 but figured it was too redundant with catchup_nanos to be worth the timeseries. That said, it's super easy to add in that PR if you think it's worth it.

tbg (Member) commented Mar 11, 2019

TBIs off seems preferable to me for 19.1. Are you anticipating any problems running the initial scan, which presumably will also be 100x slower?
I don't know if we've made any kind of promise on the catchup phase, but if the initial load were so slow that usability of the feed as a whole were not guaranteed, that would be a problem. Curious what the goal there is.

> That said, it's super easy to add in that PR if you think it's worth it.

No, you have a point.

danhhz (Contributor, Author) commented Mar 11, 2019

The initial scan is still done separately using ExportRequests, so will be unaffected.

I don't know if we've made any kind of promise on the catchup phase, but if the inital load were so slow that usability of the feed as a whole were not guaranteed, that would be a problem. Curious what the goal there is.

I don't quite follow. Can you expand on this?

tbg (Member) commented Mar 11, 2019

Sorry, I always mix these two terms up -- I meant "initial scan".

> The initial scan is still done separately using ExportRequests, so will be unaffected.

Ah, I was missing that. Great.

nvanbenschoten (Member) commented Mar 11, 2019
Thanks for running these tests, Dan.

The 100x catch-up scan time is concerning, but it doesn't appear to have had much of an effect on baseline TPC-C performance. I'm a little surprised by this. If my math is correct, the catch-up scan was running for about 2 minutes continuously on all cores in the cluster. Is it possible that the effect of this was hidden by the ramp period of the workload generator? Could we try delaying the changefeed initialization until this ramp period is over?
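For reference, the arithmetic presumably behind that estimate, assuming the 4853s figure above and the cluster's 3 × 16 vCPUs: 4853s of cumulative catchup time spread over 48 cores is roughly 100s of wall-clock time if perfectly parallelized, i.e. on the order of the two minutes mentioned.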

danhhz (Contributor, Author) commented Mar 11, 2019

I was also surprised at how small the effect was. I based these tests on the cdc/tpcc-1000 roachtest, which doesn't use tpcc ramp. Definitely possible that I messed something else up, though.

danhhz added a commit to danhhz/cockroach that referenced this issue Mar 11, 2019, with the same commit message as the PR below.
craig bot pushed a commit that referenced this issue Mar 12, 2019

35470: rangefeed: stop using time-bound iterator for catchup scan r=tbg a=danhhz

RangeFeed was originally intended to use the time-bound iterator
performance optimization. However, TBIs have had correctness issues in
the past (#28358, #34819) and no one has the time for the due diligence
necessary to be confident in their correctness going forward. Not using
them causes the total time spent in RangeFeed catchup on a changefeed
over tpcc-1000 to go from 40s to 4853s, which is quite large but still
workable.

Closes #35122

Release note (enterprise change): In exchange for increased correctness
confidence, `CHANGEFEED`s using `changefeed.push.enabled` (the default)
now take slightly more resources on startup and range
rebalancing/splits.

Co-authored-by: Daniel Harrison <[email protected]>
craig bot closed this as completed in #35470 Mar 12, 2019