rangefeed: verify time-bound iterator usage #35122
Summary: Unlike poller-based changefeeds, I think the difference between tbi and no-tbi in rangefeed-based changefeeds is small enough that it's worth the peace of mind of not using them. I recommend we don't use tbis for rangefeed catchup scans. There's also been discussion of running a test which lets a changefeed get to steady state and uses zone configs to move all watched data from one set of nodes to a disjoint set of nodes. Based on the below, I fully expect this will work fine, but it's probably worth doing anyway. @tbg @nvanbenschoten @rolandcrosby, anything else y'all would like to see before we commit to this?

Details: I ran some tests with the cdc/tpcc-1000 roachtest, changed to run for 30m and with background stats collection off. This runs a changefeed->kafka over tpcc warehouses=1000 with no initial scan, on a 3-node n1-standard-16 cluster with kafka on a 4th node.

- Control (no changefeed), 1 run:
- Rangefeed+tbi, 2 runs:
- Rangefeed+notbi, 3 runs:
- Integral of catchup nanos (tbi):
- Integral of catchup nanos (no-tbi):
- Catchup nanos during steady state rebalances/splits (no-tbi):
Thanks for running these! I'm probably being thick -- you say that these are without catchup scan, so for what would the TBIs really be used? Just for reconnecting a changefeed after, say, a rebalance or split? What does the catchup nanos metric tell you? It seems that it had one giant spike at the beginning of the test (where I assume all of the rangefeeds were established). Is it correct to say that that process has become ~100x slower thanks to the absence of TBIs? Is the steady state even reconnecting any rangefeeds (the metric seems flat)? BTW, is there a metric for how often rangefeeds need to be reconnected? It would be interesting to have numbers on that.
Totally understandable confusion here. They were run without an initial scan, which is a changefeed concept that means outputting the current rows in each range before starting the rangefeed. The rangefeed catchup scan is run when a RangeFeed connects and outputs everything between the timestamp of the RangeFeed request and wherever raft happens to be when the request starts processing. We need the catchup scan for correctness.
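To make that concrete, here's a minimal sketch of what the catchup scan conceptually does (made-up Go types for illustration, not CockroachDB's actual storage API): walk the range's MVCC versions and emit every version written after the RangeFeed request's timestamp.

```go
// A minimal sketch, not CockroachDB's actual API: the catchup scan walks the
// range's MVCC versions and emits every version written after the RangeFeed
// request's timestamp (up to whatever raft has applied when the scan runs).
package main

import "fmt"

// mvccVersion is a hypothetical key/value version used only for illustration.
type mvccVersion struct {
	key       string
	value     string
	timestamp int64 // wall time in nanos, simplified
}

// catchupScan emits every version newer than startTS. A time-bound iterator
// would let the storage engine skip data whose timestamps all fall at or
// before startTS; without one, every version is visited and filtered here,
// which is the ~100x cost discussed above.
func catchupScan(versions []mvccVersion, startTS int64, emit func(mvccVersion)) {
	for _, v := range versions {
		if v.timestamp > startTS {
			emit(v)
		}
	}
}

func main() {
	data := []mvccVersion{
		{"a", "v1", 100}, {"a", "v2", 250}, {"b", "v1", 50}, {"b", "v2", 300},
	}
	catchupScan(data, 200, func(v mvccVersion) {
		fmt.Printf("catchup event: %s=%s @ %d\n", v.key, v.value, v.timestamp)
	})
}
```

The time-bound iterator question is only about how cheaply the storage engine can find those newer versions; the scan itself is required either way.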
Yes, it's expected that there is a giant spike at the beginning. This is the initial connection of all the RangeFeed requests. After that it should only happen if a RangeFeed disconnects and reconnects, which happens for a number of reasons, most commonly rebalances and splits. The catchup_nanos metric is the total time we spend doing the iteration I mention above. I've cleaned it up and pushed it in #35470. It's not totally flat, just infrequent during steady state, which is expected. This is why we're probably okay not using tbis in rangefeed: the iteration happens at the beginning and then not often in the steady state. You could probably say connecting a RangeFeed is ~100x slower, yes. But everything else costs the same.
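As a rough illustration of how a cumulative catchup-nanos counter behaves (hypothetical names, not the real metric plumbing in #35470): each RangeFeed (re)connection times its catchup iteration and adds the duration to a shared counter, so the graph spikes at startup and then only bumps when a rangefeed reconnects after a split or rebalance.

```go
// A rough sketch with made-up names (not the real metric plumbing in #35470):
// each RangeFeed (re)connection times its catchup iteration and adds the
// duration to a cumulative counter, so the counter spikes at startup and then
// only moves when a rangefeed reconnects after a split or rebalance.
package main

import (
	"fmt"
	"sync/atomic"
	"time"
)

var rangefeedCatchupNanos int64 // cumulative counter, timeseries-style

func runCatchupScan() {
	start := time.Now()
	defer func() {
		atomic.AddInt64(&rangefeedCatchupNanos, time.Since(start).Nanoseconds())
	}()
	// Stand-in for iterating the MVCC versions newer than the request timestamp.
	time.Sleep(10 * time.Millisecond)
}

func main() {
	runCatchupScan() // initial connection: the big spike at the start of the test
	runCatchupScan() // a later reconnect after a split/rebalance: a small bump
	fmt.Printf("catchup nanos so far: %d\n", atomic.LoadInt64(&rangefeedCatchupNanos))
}
```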
I considered that in #35470 but figured it was too redundant with catchup_nanos to be worth the timeseries; it's super easy to add in that PR if you think it's worth it.
TBIs off seems preferable to me for 19.1. Are you anticipating any problems running the initial scan, which presumably will also be 100x slower?
No, you have a point.
The initial scan is still done separately using ExportRequests, so will be unaffected.
I don't quite follow. Can you expand on this?
Sorry, I always mix the two up. I meant "initial scan".
Ah, I was missing that. Great.
Thanks for running these tests, Dan. The 100x catch-up scan time is concerning, but it doesn't appear to have had much of an effect on baseline TPC-C performance. I'm a little surprised about this. If my math is correct then the catch-up scan was running for about 2 minutes continuously on all cores in the cluster. Is it possible that the effect of this was hidden by the ramp period of the workload generator? Could we try delaying the changefeed initialization until this ramp period is over?
I was also surprised by how small the effect was. I based these tests on the cdc/tpcc-1000 roachtest, which doesn't use a tpcc ramp period. Definitely possible that I messed something else up, though.
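As a back-of-the-envelope check of the "about 2 minutes continuously on all cores" estimate, assuming the ~4853s cumulative no-tbi catchup time quoted in the PR below is spread evenly across the 3 n1-standard-16 nodes (48 vCPUs):

```go
// Back-of-the-envelope only: assumes the ~4853s cumulative no-tbi catchup time
// quoted in the PR below and that the work spreads evenly over the cluster's
// 3 n1-standard-16 nodes (48 vCPUs).
package main

import "fmt"

func main() {
	const catchupSeconds = 4853.0 // cumulative catchup time without tbis (see #35470)
	const vCPUs = 3 * 16          // 3 n1-standard-16 nodes
	perCore := catchupSeconds / vCPUs
	fmt.Printf("~%.0f seconds (~%.1f minutes) of catchup work per core\n",
		perCore, perCore/60) // ~101 seconds, ~1.7 minutes
}
```

That works out to roughly 100 seconds of catchup work per core, which is in the same ballpark as the 2-minute estimate.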
35470: rangefeed: stop using time-bound iterator for catchup scan r=tbg a=danhhz

RangeFeed originally intended to use the time-bound iterator performance optimization. However, time-bound iterators have had correctness issues in the past (#28358, #34819) and no one has the time for the due diligence necessary to be confident in their correctness going forward. Not using them causes the total time spent in RangeFeed catchup on a changefeed over tpcc-1000 to go from 40s to 4853s, which is quite large but still workable.

Closes #35122

Release note (enterprise change): In exchange for increased correctness confidence, `CHANGEFEED`s using `changefeed.push.enabled` (the default) now take slightly more resources on startup and during range rebalancing/splits.

Co-authored-by: Daniel Harrison <[email protected]>
We need to verify whether RangeFeeds need to use time-bound iterators to make the catchup scan performant and, if so, we need to make sure they have the same workaround introduced in #32909 for ExportRequest.
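For context on the tradeoff, here's an illustrative sketch (made-up types, not the storage engine's API) of why time-bound iterators matter for the catchup scan: the engine keeps min/max MVCC timestamp metadata per SSTable, so a scan bounded to times after the RangeFeed timestamp can skip whole files, while an unbounded scan must visit and filter every version.

```go
// Illustrative only (made-up types, not the storage engine's API): time-bound
// iterators rely on per-SSTable min/max MVCC timestamp metadata, so a scan
// bounded to times after the RangeFeed timestamp can skip whole files, while
// an unbounded scan must visit and filter every version.
package main

import "fmt"

type sstable struct {
	name         string
	minTS, maxTS int64 // per-file timestamp bounds
	numVersions  int
}

// versionsVisited estimates how many versions a catchup scan from startTS
// touches, with and without time-bound file skipping.
func versionsVisited(files []sstable, startTS int64, timeBound bool) int {
	visited := 0
	for _, f := range files {
		if timeBound && f.maxTS <= startTS {
			continue // the time-bound iterator skips files entirely below the window
		}
		visited += f.numVersions
	}
	return visited
}

func main() {
	files := []sstable{
		{"old-1", 0, 100, 1_000_000},
		{"old-2", 100, 200, 1_000_000},
		{"recent", 200, 300, 1_000},
	}
	fmt.Println("with tbi:   ", versionsVisited(files, 200, true))  // 1000
	fmt.Println("without tbi:", versionsVisited(files, 200, false)) // 2001000
}
```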