
rangefeed: verify time-bound iterator usage #35122

Closed · danhhz opened this issue Feb 21, 2019 · 9 comments
Assignees: danhhz
Labels: A-cdc (Change Data Capture), C-bug, C-investigation, C-performance

Comments

danhhz (Contributor) commented Feb 21, 2019

We need to verify whether RangeFeeds need time-bound iterators to make the catchup scan performant and, if so, make sure they get the same workaround introduced in #32909 for ExportRequest.
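For context, a minimal sketch of the idea behind time-bound iterators (TBIs), assuming each SSTable records the minimum and maximum MVCC timestamp of the keys it contains. Names are illustrative, not the actual storage API:

```go
package tbisketch

// sstable stands in for an on-disk file whose metadata records the
// timestamp bounds of the MVCC versions stored inside it.
type sstable struct {
	minTS, maxTS int64
}

// tablesToScan keeps only the files that can contain versions in the
// window (from, to]; everything else is skipped outright. Skipping whole
// files is what makes a catchup scan over a narrow time window cheap.
func tablesToScan(files []sstable, from, to int64) []sstable {
	var out []sstable
	for _, f := range files {
		if f.maxTS <= from || f.minTS > to {
			continue // no versions in (from, to]: safe to skip
		}
		out = append(out, f)
	}
	return out
}
```

The flip side is that a file skipped incorrectly means silently missed keys, which is the class of correctness issue the #32909 workaround addresses.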

danhhz added the C-bug, C-investigation, C-performance, and A-cdc labels Feb 21, 2019
danhhz (Contributor, Author) commented Feb 21, 2019

cc @tbg @nvanbenschoten

danhhz self-assigned this Mar 6, 2019
danhhz (Contributor, Author) commented Mar 6, 2019

Summary: unlike with poller-based changefeeds, the difference between tbi and no-tbi in rangefeed-based changefeeds seems small enough that the peace of mind from not using them is worth it. I recommend we not use tbis for rangefeed catchup scans.

There's also been discussion of running a test that lets a changefeed reach steady state and then uses zone configs to move all watched data from one set of nodes to a disjoint set of nodes. Based on the results below, I fully expect this will work fine, but it's probably worth doing anyway.

@tbg @nvanbenschoten @rolandcrosby, anything else y'all would like to see before we commit to this?

Details

I ran some tests with the cdc/tpcc-1000 roachtest, modified to run for 30m and with background stats collection off. This runs a changefeed into kafka over tpcc with warehouses=1000 and no initial scan, on a 3-node n1-standard-16 cluster with kafka on a 4th node.

Latency percentiles in ms; multiple runs are comma-separated:

| Configuration           | Runs | Efficiency | p50      | p90      | p95      | p99            |
|-------------------------|------|------------|----------|----------|----------|----------------|
| Control (no changefeed) | 1    | 99.9%      | 28       | 52       | 67       | 4831           |
| Rangefeed+tbi           | 2    | 99.9%      | 29,32    | 61,65    | 80,84    | 3623,4026      |
| Rangefeed+notbi         | 3    | 99.9%      | 30,30,29 | 71,63,61 | 91,84,80 | 2952,3489,3489 |

- The latency data is noisy, but there doesn't seem to be a real difference.
- CPU usage seemed unchanged (see below).
- Disk read bytes seem mostly unchanged, which is a little surprising (see below).
- The changefeed commit-to-emit p50/p90/p95/p99 seems unchanged.
- I added a metric to track time spent in catchup scans, which went up dramatically: 40s -> 4853s (see the sketch just below).
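For reference, a minimal sketch of the kind of cumulative timer meant by that last bullet; the names are hypothetical (the real metric landed later in #35470):

```go
package metricsketch

import (
	"sync/atomic"
	"time"
)

// catchupNanos accumulates the total time spent in catchup scans; this
// is the quantity graphed in the screenshots below.
var catchupNanos int64

// timeCatchupScan runs one catchup scan and adds its duration to the
// cumulative counter.
func timeCatchupScan(scan func()) {
	start := time.Now()
	scan()
	atomic.AddInt64(&catchupNanos, int64(time.Since(start)))
}
```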

[Screenshots from the original issue: integral of catchup nanos (tbi, no-tbi); catchup nanos during steady-state rebalances/splits (no-tbi); CPU (tbi, no-tbi); disk read bytes (tbi, no-tbi).]

tbg (Member) commented Mar 6, 2019

Thanks for running these! I'm probably being thick -- you say that these are without catchup scan, so for what would the TBIs really be used? Just for reconnecting a changefeed after, say, a rebalance or split? What does the catchup nanos metric tell you? It seems that it had one giant spike at the beginning of the test (where I assume all of the rangefeeds were established). Is it correct to say that that process has become ~100x slower thanks to the absence of TBIs? Is the steady state even reconnecting any rangefeeds (the metric seems flat)?

BTW, is there a metric for how often rangefeeds need to be reconnected? It would be interesting to have numbers on that.

danhhz (Contributor, Author) commented Mar 6, 2019

> you say that these are without catchup scan, so for what would the TBIs really be used?

Totally understandable confusion here. They were run without an initial scan, which is a changefeed concept meaning we output the current rows in each range before starting the rangefeed. The rangefeed catchup scan runs when a RangeFeed connects and outputs everything between the timestamp of the RangeFeed request and wherever raft happens to be when the request starts processing. We need the catchup scans for correctness.
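To make that concrete, here's a rough sketch of the connect-time flow. All names are hypothetical, and the real implementation registers with raft before scanning (buffering events in between) so nothing is missed; this toy version only notes that in a comment:

```go
package rangefeedsketch

// event is a simplified MVCC version: a key, a value, and the timestamp
// at which the value was written.
type event struct {
	key, val string
	ts       int64
}

// connect emits every committed version newer than startTS (the catchup
// scan), then hands off to the stream of raft-applied events.
func connect(startTS int64, committed []event, raftEvents <-chan event, emit func(event)) {
	// Catchup scan: without a TBI this walks every key in the range; a
	// TBI would let the engine skip files whose versions all fall at or
	// before startTS.
	for _, ev := range committed {
		if ev.ts > startTS {
			emit(ev)
		}
	}
	// Steady state: stream raft-applied events from here on.
	for ev := range raftEvents {
		emit(ev)
	}
}
```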

> Just for reconnecting a changefeed after, say, a rebalance or split? What does the catchup nanos metric tell you? It seems that it had one giant spike at the beginning of the test (where I assume all of the rangefeeds were established). Is it correct to say that that process has become ~100x slower thanks to the absence of TBIs? Is the steady state even reconnecting any rangefeeds (the metric seems flat)?

Yes, it's expected that there is a giant spike at the beginning: that's the initial connection of all the RangeFeed requests. After that it should only happen when a RangeFeed disconnects and reconnects, which happens for a number of reasons, most commonly rebalances and splits. The catchup_nanos metric is the total time we spend doing the iteration I mention above; I've cleaned it up and pushed it in #35470. It's not totally flat, just infrequent during steady state, which is expected. This is why we're probably okay not using tbis in rangefeed: the iteration happens at the beginning and then not often in steady state.

You could probably say connecting a RangeFeed is ~100x slower, yes. But everything else costs the same.

> BTW, is there a metric for how often rangefeeds need to be reconnected? It would be interesting to have numbers on that.

I considered that in #35470 but figured it was too redundant with catchup_nanos to be worth the timeseries. That said, it's super easy to add in that PR if you think it's worth it.

tbg (Member) commented Mar 11, 2019

TBIs off seems preferable to me for 19.1. Are you anticipating any problems running the initial scan, which presumably will also be 100x slower?
I don't know if we've made any kind of promise on the catchup phase, but if the initial load were so slow that usability of the feed as a whole were not guaranteed, that would be a problem. Curious what the goal there is.

> That said, it's super easy to add in that PR if you think it's worth it.

No, you have a point.

danhhz (Contributor, Author) commented Mar 11, 2019

The initial scan is still done separately using ExportRequests, so will be unaffected.

I don't know if we've made any kind of promise on the catchup phase, but if the inital load were so slow that usability of the feed as a whole were not guaranteed, that would be a problem. Curious what the goal there is.

I don't quite follow. Can you expand on this?

tbg (Member) commented Mar 11, 2019

Sorry, I always mix these two terms up -- I meant "initial scan".

> The initial scan is still done separately using ExportRequests, so will be unaffected.

Ah, I was missing that. Great.

nvanbenschoten (Member) commented Mar 11, 2019
Thanks for running these tests, Dan.

The 100x catch-up scan time is concerning, but it doesn't appear to have had much of an effect on baseline TPC-C performance. I'm a little surprised by this. If my math is correct, the catch-up scan was running for about 2 minutes continuously on all cores in the cluster. Is it possible that the effect of this was hidden by the ramp period of the workload generator? Could we try delaying the changefeed initialization until this ramp period is over?
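For reference, the arithmetic presumably behind that estimate, assuming the 4853s figure above and the cluster's 3 × 16 vCPUs: 4853s of cumulative catchup time spread over 48 cores is roughly 100s of wall-clock time if perfectly parallelized, i.e. on the order of the two minutes mentioned.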

danhhz (Contributor, Author) commented Mar 11, 2019

I was also surprised at how small the effect was. I based these tests on the cdc/tpcc-1000 roachtest, which doesn't use tpcc ramp. Definitely possible that I messed something else up, though.

danhhz added a commit to danhhz/cockroach that referenced this issue Mar 11, 2019, with the same commit message as the PR below.
craig bot pushed a commit that referenced this issue Mar 12, 2019

35470: rangefeed: stop using time-bound iterator for catchup scan r=tbg a=danhhz

RangeFeed was originally intended to use the time-bound iterator
performance optimization. However, TBIs have had correctness issues in
the past (#28358, #34819) and no one has the time for the due diligence
necessary to be confident in their correctness going forward. Not using
them causes the total time spent in RangeFeed catchup on a changefeed
over tpcc-1000 to go from 40s to 4853s, which is quite large but still
workable.

Closes #35122

Release note (enterprise change): In exchange for increased correctness
confidence, `CHANGEFEED`s using `changefeed.push.enabled` (the default)
now take slightly more resources on startup and range
rebalancing/splits.

Co-authored-by: Daniel Harrison <[email protected]>
craig bot closed this as completed in #35470 Mar 12, 2019