changefeedccl: distinguish between schema change backfills and regular updates/inserts #35738
Comments
We don't really have this information at the kv level, so it may not be possible. We could certainly plumb something for the scans that the changefeed does after a backfill finishes, but before that there's a period where normal kv traffic is happening concurrent with the backfill itself and we have no way of distinguishing these two. |
cc @mwang1026 This is the "nice to have" from our discussion on CDC to upserts as a DR strategy. |
It turns out this is annoyingly difficult in today's world. Our column backfiller performs its operations in batches using a transaction. It does this seemingly to deal with foreign keys. Why we have foreign key checks in these cases is a whole separate discussion but alas. Anyways, doing anything clever with omitting logical ops for writes due to one of these backfillers seems out the window if they're writing transactionally. There's too much risk of mucking up the rangefeed bookkeeping if we somehow didn't issue a logical op for creating an intent but then did emit the op when resolving it. Perhaps we could do something fancy like marking the intent as logically a no-op, but I'm not trying to shove that sort of thing in late in a release given its general smell. More realistically, we could just stop doing these column backfills altogether and instead create new indexes and swap after they've been constructed. That's the approach taken for primary key changes as well as for column type changes. It would simplify a bunch. The idea of doing these sorts of column backfills on a live table also seems crazy in the context of making schema changes more transactional (#42061). The downside of the separate index approach, at least in its simplest conceptualization, is that it could disable 1PC. |
If we change all backfills to use the protocol proposed by #36850, this problem would be effectively solved. Sure, the catch-up scan writes would get duplicated, but that would likely be a very small number of writes. |
Another option which isn't as good as just omitting these writes but might appeal to some users would be to omit entries which do not correspond to any changes. Given we now support the |
The clearest solution here is to side-step the problem by not touching the primary index when changing columns but rather creating a new one and swapping over: #47989 |
Hi @ajwerner, is there any update on this ticket? Thanks! |
In 21.1 we plan on pursuing #47989, which should resolve this issue, though it will bring some new complexity. I'm hopeful we can get this resolved in the 21.1 timeframe. |
We have both the updated and mvcc timestamps, which let users distinguish this. |
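For consumers who want to apply this on their side, here is a rough sketch of the comparison. It assumes the changefeed was created with both the `updated` and `mvcc_timestamp` options, that each message carries both values as decimal HLC strings, and that a row re-emitted by a backfill scan has an `updated` timestamp later than its MVCC timestamp while an ordinary write has the two equal. The function names are made up for illustration, and the heuristic should be checked against the behavior of the version in use.

```python
from decimal import Decimal


def parse_hlc(ts: str) -> Decimal:
    """Parse a decimal HLC timestamp string, e.g. '1600000000000000000.0000000000'.

    Assumes the logical part is zero-padded to a fixed width, so that plain
    decimal comparison orders timestamps correctly.
    """
    return Decimal(ts)


def looks_like_backfill(message: dict) -> bool:
    """Heuristic (assumption): the row was re-emitted without a new write
    if its 'updated' timestamp is later than its 'mvcc_timestamp'."""
    updated = parse_hlc(message["updated"])
    mvcc = parse_hlc(message["mvcc_timestamp"])
    return updated > mvcc
```

A pub/sub consumer could drop messages for which `looks_like_backfill` returns true, while an OLAP loader would keep everything.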
CDC feature request: Add a metadata field to the 'wrapped' envelope format to indicate whether a write was caused by a schema change backfill.
This came up in discussion with a user interested in using Kafka-based CDC for both analytics DB and pub/sub use cases. When loading data into an OLAP system, a full-table scan with backfilled schema changes is exactly what is needed: the full state of the table needs to be re-populated to match the new schema. Event-driven pub/sub consumers, however, may not want to receive these updates: they are more likely to be watching a changefeed because they care about taking some action on each record as updates happen.
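To make the request concrete, here is a minimal sketch of how a consumer might use such a field. The `backfill` key is purely hypothetical (it is the field being requested, not something the wrapped envelope provides), and the two sinks are plain lists standing in for an OLAP loader and an event handler.

```python
import json


def route(raw: bytes, olap_sink: list, pubsub_sink: list) -> None:
    """Send every message to the OLAP sink, but only non-backfill
    messages to the pub/sub sink.

    'backfill' is a hypothetical metadata field in the wrapped envelope;
    messages without it are treated as regular updates.
    """
    msg = json.loads(raw)
    olap_sink.append(msg)  # analytics wants the full re-populated table
    if not msg.get("backfill", False):
        pubsub_sink.append(msg)  # event consumers only want real updates
```

With this shape, the analytics pipeline still sees the full table rewritten under the new schema, and event-driven consumers act only on real updates.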
gz#3930
Jira issue: CRDB-4559