puller(ticdc): should split all update kv entries in puller when sink module is in safe mode #11231
Labels

- affects-6.5: This bug affects the 6.5.x (LTS) versions.
- affects-7.1: This bug affects the 7.1.x (LTS) versions.
- affects-7.5: This bug affects the 7.5.x (LTS) versions.
- affects-8.1: This bug affects the 8.1.x (LTS) versions.
- area/ticdc: Issues or PRs related to TiCDC.
- type/enhancement: The issue or PR belongs to an enhancement.
- type/regression
Previous Problem
In #10919, we addressed the issue of downstream data inconsistencies caused by the potentially incorrect order of `UPDATE` events received by TiCDC. Take the following transaction as an example: the two `UPDATE` statements within it have a sequential dependency on execution. The primary key `a` is first changed from `2` to `3`, and then the primary key `a` is changed from `1` to `2`. After this transaction is executed, the records in the upstream database are `(2, 1)` and `(3, 2)`
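Assuming a table with an integer primary key `a` and a second column `b`, initially containing rows `(1, 1)` and `(2, 2)` (the initial rows are inferred from the final state described above, not stated explicitly), the upstream execution can be modeled as:

```python
# Model the table as a dict mapping primary key `a` to column `b`.
# Initial rows (1, 1) and (2, 2) are an inferred assumption.
table = {1: 1, 2: 2}

def update_pk(table, old_a, new_a):
    """Apply `UPDATE t SET a = new_a WHERE a = old_a` (a primary-key change)."""
    table[new_a] = table.pop(old_a)

update_pk(table, 2, 3)  # first statement:  a changes from 2 to 3
update_pk(table, 1, 2)  # second statement: a changes from 1 to 2

print(sorted(table.items()))  # rows (2, 1) and (3, 2), as described
```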
. However, the order of `UPDATE` events received by TiCDC might differ from the actual execution order of the upstream transaction; for example, TiCDC might receive the `UPDATE` that changes `a` from `1` to `2` before the one that changes `a` from `2` to `3`. Previously, TiCDC simply split these `UPDATE` events into `DELETE` and `INSERT` events before sending them to the downstream, preserving the received order. After the downstream executes the transaction in that order, the only record left in the database is `(3, 2)`, which differs from the records in the upstream database (`(2, 1)` and `(3, 2)`), indicating a data inconsistency issue.

To fix this problem, we need a mechanism to reorder the `DELETE` and `INSERT` events after the split. For example, if we can guarantee that all `DELETE` events are executed before the `INSERT` events in the same transaction, the data inconsistency problem can be avoided: after the downstream executes the reordered transaction, the records in the downstream database are the same as those in the upstream database, `(2, 1)` and `(3, 2)`, ensuring data consistency.

Previous Solution
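The two executions described above can be checked with a minimal simulation (a sketch, not TiCDC code; for simplicity, `INSERT` is modeled with `REPLACE` semantics and `DELETE` matches rows by primary key):

```python
def apply(events, table):
    """Apply split events to a downstream table (dict: primary key a -> b)."""
    for op, (a, b) in events:
        if op == "DELETE":
            table.pop(a, None)   # delete by primary key
        else:                    # INSERT, modeled with REPLACE semantics
            table[a] = b
    return table

# Split events in the (incorrect) order TiCDC received the UPDATEs:
received = [("DELETE", (1, 1)), ("INSERT", (2, 1)),
            ("DELETE", (2, 2)), ("INSERT", (3, 2))]
print(apply(received, {1: 1, 2: 2}))       # {3: 2} only: inconsistent

# Same events, but with all DELETEs moved before all INSERTs:
deletes_first = sorted(received, key=lambda e: e[0] != "DELETE")
print(apply(deletes_first, {1: 1, 2: 2}))  # {2: 1, 3: 2}: matches upstream
```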
But the goal of reordering the `DELETE` and `INSERT` events after the split is not easy to accomplish. The whole process of TiCDC processing a row of data and sending it to a MySQL-compatible downstream is roughly as follows:

1. The puller module receives KV entries from the upstream and writes them to the local disk;
2. The sorter module sorts the events by `CommitTS` and event type (`DELETE` > `UPDATE` > `INSERT`);
3. The sink module loads the sorted events into memory and writes them to the downstream.
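The sorter's ordering rule can be sketched as follows (hypothetical names, not TiCDC code); this is why splitting before the sorter guarantees that `DELETE` events precede `INSERT` events within a transaction:

```python
# Sort events by CommitTS first, then by event type, with DELETE before
# UPDATE before INSERT, matching the DELETE > UPDATE > INSERT priority.
TYPE_PRIORITY = {"DELETE": 0, "UPDATE": 1, "INSERT": 2}

def sort_key(event):
    commit_ts, event_type, row = event
    return (commit_ts, TYPE_PRIORITY[event_type])

events = [
    (100, "INSERT", (3, 2)),
    (100, "DELETE", (1, 1)),
    (90,  "UPDATE", (5, 5)),
]
# The UPDATE at CommitTS 90 sorts first; within CommitTS 100,
# the DELETE precedes the INSERT.
print(sorted(events, key=sort_key))
```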
Splitting the `UPDATE` event in the sink module can ensure that only the events that update the primary key or a non-null unique key are split. But in the sink module the event data has already been fully loaded into memory, so in large-transaction scenarios it is impossible to load all the data into memory, split it, and then re-sort it.

If the `UPDATE` event is instead split before TiCDC writes the KV data to the local disk, it is hard to know whether the event updates the primary key or a non-null unique key, so we can only choose to split all `UPDATE` events. However, splitting all `UPDATE` events leads to performance degradation in some scenarios (about a 41% degradation in sysbench oltp_update_non_index).

Therefore, we adopted a compromise solution in #10919: when a new table sink starts to write data to the downstream, we fetch the current timestamp from PD as `replicateTS`, and we split only the update KV entries whose `commitTS` is smaller than `replicateTS`.

After the fix, when using the MySQL sink, TiCDC does not split the `UPDATE` event in most cases. Consequently, there might be primary key or unique key conflicts during changefeed runtime, causing the changefeed to restart automatically. After the restart, TiCDC splits the conflicting `UPDATE` events into `DELETE` and `INSERT` events before writing them to the local disk. This ensures that all events within the same transaction are correctly ordered, with all `DELETE` events preceding `INSERT` events, thus correctly completing data replication.

Current Issue
In the previous solution, we rely on automatically restarting the changefeed to recover from data conflict errors and to prevent the data inconsistency problem caused by the incorrect order of `UPDATE` events within the same transaction received by TiCDC.

Although after a restart the puller can split the conflicting `UPDATE` events and continue to run correctly, and tests show that the restart has no noticeable impact on latency, some users may be unhappy with this behavior if their workload contains many conflicting rows that cause the changefeed to restart occasionally. So we need a workaround to avoid the restart when needed.

TiCDC has a config option called `safe-mode`, which currently defaults to `false`. If `safe-mode` is `true`:

- `INSERT` events are translated to `REPLACE` statements;
- `UPDATE` events are split into `DELETE` and `INSERT` events in the sink module (this behavior was removed in #10919, as described in the Previous Problem section above).

Because performance degradation is already expected in `safe-mode` (`REPLACE` is much slower than `INSERT`), we suggest splitting all `UPDATE` KV entries before writing them to disk when `safe-mode` is `true`. Users can then enable `safe-mode` to prevent the changefeed from restarting due to data conflicts while also preventing the data inconsistency problem.
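The proposed puller-side decision, combining the existing `replicateTS` rule from #10919 with the new `safe-mode` rule, might look like the following (a hypothetical sketch, not TiCDC's actual code):

```python
def should_split_update(commit_ts: int, replicate_ts: int, safe_mode: bool) -> bool:
    """Decide in the puller whether to split an UPDATE kv entry into
    DELETE + INSERT before writing it to the local disk (sketch only)."""
    if safe_mode:
        # Proposed behavior: in safe mode, split every UPDATE kv entry,
        # so the sorter always places DELETEs before INSERTs.
        return True
    # Behavior from #10919: split only entries committed before the
    # table sink started replicating.
    return commit_ts < replicate_ts

print(should_split_update(100, 200, safe_mode=False))  # True: old entry
print(should_split_update(300, 200, safe_mode=False))  # False: not split
print(should_split_update(300, 200, safe_mode=True))   # True: safe mode
```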