Is your feature request related to a problem? Please describe.
When starting a changefeed on a table that already contains data, it is difficult to know how much of the initial data has been streamed out of the cluster and how much longer it will take to finish "catching up" to the existing data and start streaming updates.
This recently caused an issue with the telemetry cluster: when setting up a changefeed to stream data out of the cluster, the initial catch-up scan took ~3 days because we were writing out uncompressed NDJSON files. During that window, the default GC TTL of 25 hrs elapsed, so once the changefeed finally caught up to the existing data in the table, it wasn't able to proceed, since the MVCC keys needed to stream updates from the job's start time had already been garbage collected.
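For context, the mitigation would have been a zone-config change along these lines before kicking off the changefeed (the table name and TTL values below are placeholders for illustration, not what we actually ran):

```sql
-- Raise the GC TTL on the watched table so MVCC history survives a multi-day
-- catch-up scan. 259200s (72h) is an arbitrary example; pick a value
-- comfortably above the expected catch-up time.
ALTER TABLE telemetry.reports CONFIGURE ZONE USING gc.ttlseconds = 259200;

-- Once the changefeed has caught up, drop it back toward the 25h default.
ALTER TABLE telemetry.reports CONFIGURE ZONE USING gc.ttlseconds = 90000;
```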
Describe the solution you'd like
Being able to see the progress of this initial catch-up scan both in the jobs table (% complete) and the UI (% complete and maybe an ETA) would help with planning. For example, if we had seen that it was going to take 2 days, I could have updated the TTL window ahead of time.
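For reference, the closest signal available today (as far as I can tell) is the job's high-water timestamp; a query along these lines is roughly where a catch-up "% complete" figure would be surfaced:

```sql
-- Sketch: check changefeed progress from the jobs table. Today
-- fraction_completed isn't meaningful during the initial catch-up scan, and
-- high_water_timestamp only starts advancing once the changefeed is streaming
-- updates, which is the gap this issue is asking to close.
SELECT job_id,
       status,
       fraction_completed,
       high_water_timestamp
  FROM crdb_internal.jobs
 WHERE job_type = 'CHANGEFEED';
```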
As Nathan and Andrew have mentioned, compression will help here (we suspect the catch-up scan's bottleneck is actually sending all this data over the wire uncompressed; see also #43103), and protected timestamps might prevent users from hitting these kinds of errors in the future. One nice thing to note is that almost everything in the registration cluster is append-only (no updates), so using protected timestamps in this case wouldn't incur much, if any, overhead.
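Purely to illustrate the compression point (not the actual syntax proposed in #43103), a cloud-storage changefeed with a compression option might look something like this; the table name, bucket URI, and option name are all assumptions:

```sql
-- Hypothetical sketch: a cloud-storage changefeed that gzips the NDJSON files
-- it writes, assuming #43103 exposes a compression option roughly like this.
CREATE CHANGEFEED FOR TABLE telemetry.reports
  INTO 'gs://telemetry-export/reports?AUTH=implicit'
  WITH format = 'json', compression = 'gzip';
```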
cc @ajwerner
Epic CRDB-2365
Jira issue: CRDB-5285