changefeedccl: distinguish between schema change backfills and regular updates/inserts #35738

rolandcrosby · 2019-03-14T16:59:40Z

CDC feature request: Add a metadata field to the 'wrapped' envelope format to indicate whether a write was caused by a schema change backfill.

This came up in discussion with a user interested in using Kafka-based CDC for both analytics DB and pub/sub use cases. When loading data into an OLAP system, a full-table scan with backfilled schema changes is exactly what is needed: the full state of the table needs to be re-populated to match the new schema. Event-driven pub/sub consumers, however, may not want to receive these updates: they are more likely to be watching a changefeed because they care about taking some action on each record as updates happen.

gz#3930

Jira issue: CRDB-4559

danhhz · 2019-03-14T20:08:09Z

We don't really have this information at the kv level, so it may not be possible. We could certainly plumb something for the scans that the changefeed does after a backfill finishes, but before that there's a period where normal kv traffic is happening concurrent with the backfill itself and we have no way of distinguishing these two.

ajwerner · 2020-01-22T21:47:54Z

cc @mwang1026

This is the "nice to have" from our discussion on CDC to upserts as a DR strategy.

ajwerner · 2020-03-06T01:43:38Z

It turns out this is annoyingly difficult in today's world. Our column backfiller performs its operations in batches using a transaction. It seemingly does this to deal with foreign keys. Why we have foreign key checks in these cases is a whole separate discussion, but alas.

Anyways, doing anything clever with omitting logical ops for writes due to one of these backfillers seems out the window if they're writing transactionally. There's too much risk of mucking up the rangefeed bookkeeping if we somehow didn't issue a logical op for creating an intent but then did emit the op when resolving it. Perhaps we could do something fancy like marking the intent as logically a no-op but I'm not trying to shove that sort of thing in late in a release given its general smell.

More realistically, we could just stop doing these column backfills altogether and instead create new indexes and swap them in after they've been constructed. That's the approach taken for primary key changes as well as for column type changes. It would simplify a bunch. The idea of doing this sort of column backfill on a live table also seems crazy in the context of making schema changes more transactional (#42061).

The downside of the separate index approach, at least in its simplest conceptualization, is that it could disable 1PC.

ajwerner · 2020-03-12T20:32:24Z

If we change all backfills to use the protocol proposed by #36850, this problem would be effectively solved. Sure, the catch-up scan writes would get duplicated, but that would likely be a very small number of writes.

ajwerner · 2020-04-01T13:58:55Z

Another option, which isn't as good as just omitting these writes but might appeal to some users, would be to omit entries which do not correspond to any changes. Given we now support the diff option, we could detect when the before and after values are identical.
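To make the idea concrete, here is a minimal consumer-side sketch rather than anything the changefeed does itself. It assumes wrapped-envelope JSON messages from a changefeed created with the diff option, so each message has top-level `before` and `after` keys. The function name `is_noop_update` is hypothetical.

```python
import json


def is_noop_update(message: str) -> bool:
    """Return True when a wrapped-envelope message has identical before/after values.

    Assumption: the changefeed was created with the diff option, so the
    message carries a top-level "before" key next to "after".
    """
    payload = json.loads(message)
    # Without both keys we cannot compare, so treat the message as a real change.
    if "before" not in payload or "after" not in payload:
        return False
    before, after = payload["before"], payload["after"]
    # A null "before" is an insert and a null "after" is a delete: both are real changes.
    if before is None or after is None:
        return False
    # Dict equality ignores key order, so only the column values matter.
    return before == after
```

A consumer would call this on each message and skip those for which it returns True. By construction it only drops re-emitted rows whose contents did not change; a backfill that actually alters a row still passes through.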

ajwerner · 2020-07-02T02:59:18Z

The clearest solution here is to side-step the problem by not touching the primary index when changing columns but rather creating a new one and swapping over: #47989

cjireland · 2020-10-01T08:26:08Z

Hi @ajwerner, is there any update on this ticket? Thanks!

ajwerner · 2020-10-05T20:07:15Z

In 21.1 we plan on pursuing #47989, which should resolve this issue, though it will bring some new complexity. I'm hopeful we can get this resolved in the 21.1 timeframe.

miretskiy · 2023-01-09T17:39:51Z

We have both the updated and mvcc timestamps, which let users distinguish this.
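For readers wondering how a consumer might use the two timestamps, here is a rough sketch under stated assumptions, not a description of documented guarantees. It assumes the changefeed was created with both the updated and mvcc_timestamp options, that both appear as top-level string fields of the form `<wall nanos>.<logical>`, and that a row re-emitted by a backfill carries an `updated` value later than its `mvcc_timestamp` while an ordinary write carries equal values. Check these against the changefeed documentation for your version. The helper names are hypothetical.

```python
import json
from typing import Tuple


def parse_hlc(ts: str) -> Tuple[int, int]:
    """Split a '<wall nanos>.<logical>' timestamp string into two ints."""
    wall, _, logical = ts.partition(".")
    return int(wall), int(logical or 0)


def looks_like_backfill(message: str) -> bool:
    """Guess whether a wrapped-envelope message was re-emitted by a backfill.

    Assumption: "updated" and "mvcc_timestamp" are top-level keys, and a
    backfilled row has updated > mvcc_timestamp.
    """
    payload = json.loads(message)
    updated = parse_hlc(payload["updated"])
    mvcc = parse_hlc(payload["mvcc_timestamp"])
    # Tuples compare wall time first, then the logical counter.
    return updated > mvcc
```

A pub/sub consumer could skip messages for which `looks_like_backfill` returns True, while an OLAP loader would process every message.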

rolandcrosby added the C-enhancement and A-cdc labels Mar 14, 2019

ajwerner mentioned this issue Jan 13, 2020

cdc: only backfill tables which experience schema changes #43896

Closed

ajwerner mentioned this issue Jan 23, 2020

changefeedccl: provide an option to finish changefeed upon schema change #44265

Closed

ajwerner mentioned this issue Apr 23, 2020

sql: column backfills should build a new index instead of mutating the existing one #47989

Closed

elinorgarcia added the T-cdc label Dec 7, 2020

ajwerner mentioned this issue Apr 5, 2021

cdc: backfill consistently on schema change #56098

Closed

ajwerner mentioned this issue Apr 19, 2022

sql: require CASCADE to drop physical shard column of hash index #80181

Closed

miretskiy closed this as completed Jan 9, 2023
