*(ticdc): split old update kv entry after restarting changefeed (#10919) #11029
Signed-off-by: ti-chi-bot <[email protected]>
/test cdc-integration-kafka-test
/test all
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: lidezhu
/hold
/unhold
This is an automated cherry-pick of #10919
What problem does this PR solve?
Issue Number: close #10918
What is changed and how it works?
When the downstream is MySQL-compatible, we change the logic for handling update events at restart:
- [...] thresholdTs [...] later);
- [...] thresholdTs, it is split into a delete event and a replace event, which are then sent downstream;
- [...] thresholdTs, it is split into a delete kv entry and an insert kv entry, which are then written into the sorter;
- [...] thresholdTs, all delete events are sent downstream before insert events.

When the downstream is MySQL-compatible, we also change the logic for handling update events after the previous restart stage finishes.
Previously, when the sink module met a transaction with multiple update events that change the primary key or a not-null unique key, we always split them into delete events and replace events. This may cause data inconsistency, as in the following example:
Suppose a table t has the schema create table t(a int primary key), and it has two rows, a=1 and a=2.
If a transaction contains two update events:
In the ideal scenario, we expect these two events to be split into the following events:
After the transaction, table t has two rows, a=2 and a=3.
But inside cdc, we cannot get the original order of these two update events, so they may be split into the following events:
After the transaction, table t has only one row, a=3. (Data inconsistency happens!)
So we do not split any update events inside the sink module when the downstream is MySQL. This may cause a duplicate key entry error when the update events inside a transaction are executed in the wrong order.
This error causes the changefeed to restart and enter the restart stage described earlier, where the update events are split inside the puller and the delete events are sent before the insert events.
When applying the redo log
When applying the redo log, update events that change the handle key are split into delete events and insert events, and the insert events are cached until all delete events in the same transaction have been emitted. If there are too many insert events (more than 50), they are written to a temporary local file.
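A rough sketch of the redo-log behaviour described above, under stated assumptions. It is not the TiCDC implementation: the InsertCache class, the event dictionaries, and emit_transaction are hypothetical names, and only the threshold of 50 comes from the description. Keeping the first 50 inserts in memory and spilling the rest to the file is also an assumption about how the limit is applied, as is passing other events through unchanged.

```python
import json
import tempfile

SPILL_THRESHOLD = 50  # limit taken from the description above


class InsertCache:
    """Holds insert events; spills to a temp file once memory is full."""

    def __init__(self, threshold=SPILL_THRESHOLD):
        self.threshold = threshold
        self.memory = []
        self.spill = None  # temporary file, created only when needed

    def add(self, event):
        if self.spill is None and len(self.memory) < self.threshold:
            self.memory.append(event)
            return
        if self.spill is None:
            self.spill = tempfile.TemporaryFile(mode="w+")
        self.spill.write(json.dumps(event) + "\n")

    def drain(self):
        """Yield cached events in insertion order, then reset the cache."""
        yield from self.memory
        self.memory = []
        if self.spill is not None:
            self.spill.seek(0)
            for line in self.spill:
                yield json.loads(line)
            self.spill.close()
            self.spill = None


def emit_transaction(events):
    """Return one transaction's events with split inserts held back."""
    cache = InsertCache()
    emitted = []
    for event in events:
        is_key_update = (
            event["type"] == "update" and event["old_key"] != event["new_key"]
        )
        if is_key_update:
            emitted.append({"type": "delete", "key": event["old_key"]})
            cache.add({"type": "insert", "key": event["new_key"]})
        else:
            emitted.append(event)  # assumption: other events pass through
    emitted.extend(cache.drain())  # inserts go out after every delete
    return emitted


updates = [
    {"type": "update", "old_key": i, "new_key": i + 1000} for i in range(60)
]
result = emit_transaction(updates)
print(len(result), result[0]["type"], result[-1]["type"])
```

With 60 key-changing updates, the first 50 inserts stay in memory and the remaining 10 go through the temporary file, yet all 60 deletes are still emitted before any insert.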
Check List
Tests
Questions
Will it cause performance regression or break compatibility?
Do you need to update user documentation, design documentation or monitoring documentation?
Release note