Is your feature request related to a problem? Please describe.
When starting a changefeed on a table that already contains data, it is difficult to know how much of the initial data has been streamed out of the cluster and how much longer it will take to finish "catching up" to the existing data and start streaming updates.
This recently caused an issue with the telemetry cluster: when setting up a changefeed to stream data out of the cluster, the initial catch-up scan took ~3 days because we were writing out uncompressed NDJSON files. During that window, the default GC TTL of 25 hrs elapsed, so once the changefeed finally caught up to the existing data in the table, it wasn't able to proceed, since the MVCC keys needed to stream updates from the job's start time had already been garbage collected.
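For context, the mitigation would have been a zone-config change along these lines before kicking off the changefeed (the table name and TTL values below are placeholders for illustration, not what we actually ran):

```sql
-- Raise the GC TTL on the watched table so MVCC history survives a multi-day
-- catch-up scan. 259200s (72h) is an arbitrary example; pick a value
-- comfortably above the expected catch-up time.
ALTER TABLE telemetry.reports CONFIGURE ZONE USING gc.ttlseconds = 259200;

-- Once the changefeed has caught up, drop it back toward the 25h default.
ALTER TABLE telemetry.reports CONFIGURE ZONE USING gc.ttlseconds = 90000;
```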
Describe the solution you'd like
Being able to see the progress of this initial catch-up scan both in the jobs table (% complete) and the UI (% complete and maybe an ETA) would help with planning. For example, if we had seen that it was going to take 2 days, I could have updated the TTL window ahead of time.
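For reference, the closest signal available today (as far as I can tell) is the job's high-water timestamp; a query along these lines is roughly where a catch-up "% complete" figure would be surfaced:

```sql
-- Sketch: check changefeed progress from the jobs table. Today
-- fraction_completed isn't meaningful during the initial catch-up scan, and
-- high_water_timestamp only starts advancing once the changefeed is streaming
-- updates, which is the gap this issue is asking to close.
SELECT job_id,
       status,
       fraction_completed,
       high_water_timestamp
  FROM crdb_internal.jobs
 WHERE job_type = 'CHANGEFEED';
```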
As Nathan and Andrew have mentioned, compression will help here (we suspect the catch-up scan's bottleneck is actually sending all this data over the wire uncompressed; see also #43103), and protected timestamps might prevent users from hitting these kinds of errors in the future. One nice thing to note is that almost everything in the registration cluster is append-only (no updates), so using protected timestamps in this case wouldn't incur much, if any, overhead.
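Purely to illustrate the compression point (not the actual syntax proposed in #43103), a cloud-storage changefeed with a compression option might look something like this; the table name, bucket URI, and option name are all assumptions:

```sql
-- Hypothetical sketch: a cloud-storage changefeed that gzips the NDJSON files
-- it writes, assuming #43103 exposes a compression option roughly like this.
CREATE CHANGEFEED FOR TABLE telemetry.reports
  INTO 'gs://telemetry-export/reports?AUTH=implicit'
  WITH format = 'json', compression = 'gzip';
```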
cc @ajwerner
Epic CRDB-2365
Jira issue: CRDB-5285