Potential data loss after changing the number of TiFlash replicas #9438

Closed
JaySon-Huang opened this issue Sep 19, 2024 · 2 comments · Fixed by #9440
Labels
affects-8.1 (This bug affects the 8.1.x (LTS) versions.) · component/storage · severity/major · type/bug (The issue is confirmed as a bug.)

Comments

@JaySon-Huang
Contributor

Bug Report

Please answer these questions before submitting your issue. Thanks!

1. Minimal reproduce step (Required)

  1. Create a table with a TiFlash replica and load some data
  2. Alter the TiFlash replica count to 0
  3. Wait for gc_lifetime to pass, then set the TiFlash replica count back to 1

2. What did you expect to see? (Required)

The TiFlash replica is built correctly

3. What did you see instead? (Required)

There is a chance of data loss after step 3.

4. What is your TiFlash version? (Required)

v8.1.1

@JaySon-Huang
Contributor Author

JaySon-Huang commented Sep 19, 2024

In v8.1.0 and v8.1.1, if the TiFlash replica count is set to 0, applyDropTable(database_id, table_id, "SetTiFlashReplica-0") is executed, which adds a tombstone_ts to the IStorage instance.
https://github.com/pingcap/tiflash/blob/v8.1.1/dbms/src/TiDB/Schema/SchemaBuilder.cpp#L392-L407

If all the regions have been removed from the TiFlash instance and the tombstone_ts exceeds the gc_safepoint, an InterpreterDropQuery is generated to physically drop the IStorage instance.
https://github.com/pingcap/tiflash/blob/v8.1.1/dbms/src/TiDB/Schema/SchemaSyncService.cpp#L304-L354
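For illustration, here is a minimal, self-contained sketch of the GC decision described above. The names (TableInfo, shouldPhysicallyDrop, region_peer_count) are simplified stand-ins for this sketch, not the actual TiFlash API:

```cpp
#include <cstddef>
#include <cstdint>
#include <optional>

// Simplified stand-in for the per-table state that gcImpl inspects.
struct TableInfo
{
    std::optional<uint64_t> tombstone_ts; // set by applyDropTable(..., "SetTiFlashReplica-0")
    std::size_t region_peer_count = 0;    // raft-layer region peers on this TiFlash store
};

// Mirrors the decision described above: physically drop the IStorage only if
// the tombstone is older than the GC safepoint AND no region peer remains.
bool shouldPhysicallyDrop(const TableInfo & table, uint64_t gc_safepoint)
{
    return table.tombstone_ts.has_value()
        && *table.tombstone_ts < gc_safepoint
        && table.region_peer_count == 0;
}
```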

However, there is a chance of data loss due to a concurrency issue (see the sketch after this list):

  1. In SchemaSyncService::gcImpl, a table is judged to satisfy both "tombstone_ts exceeds the gc_safepoint" and "no region peer exists", so an InterpreterDropQuery is generated.
  2. The user sets the TiFlash replica count to K, where K > 0, and a new region snapshot is applied to TiFlash before the InterpreterDropQuery is executed.
  3. The InterpreterDropQuery is executed and all the data in the StorageDeltaMerge is physically removed, but the region still exists in the raft layer. Query results after that point suffer data loss.

Note: The mechanism of "if the TiFlash replica count is set to 0, applyDropTable(database_id, table_id, "SetTiFlashReplica-0") is executed" is meant to remove the empty segments and the .sql schema file from the TiFlash instance after the replica count is set to 0. However, there seems to be no easy way to fix this concurrency issue.
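The interleaving above is essentially a check-then-act race. The following hypothetical, single-threaded replay (reusing the simplified TableInfo / shouldPhysicallyDrop definitions from the sketch in the previous comment) shows how the drop decision goes stale before it is executed; it is an illustration, not TiFlash code:

```cpp
#include <cassert>
#include <cstdint>

int main()
{
    TableInfo table;
    table.tombstone_ts = 100;    // written when the replica count was altered to 0
    uint64_t gc_safepoint = 200; // the tombstone has already passed the GC safepoint

    // Step 1: gcImpl's check passes, so an InterpreterDropQuery is scheduled.
    bool drop_decided = shouldPhysicallyDrop(table, gc_safepoint);
    assert(drop_decided);

    // Step 2: the user sets the TiFlash replica count to K > 0 and a new
    // region snapshot is applied before the scheduled drop runs.
    table.region_peer_count = 1;

    // Step 3: the drop still executes based on the stale decision, physically
    // removing StorageDeltaMerge data that the new region now depends on.
    if (drop_decided)
    {
        // physicallyDropStorage(table); // hypothetical; this is where data is lost
    }
    return 0;
}
```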

@JaySon-Huang
Contributor Author

For LTS versions, only v8.1.0 and v8.1.1 are affected.

release-7.5 is not affected because the "drop table" action is commented out when the TiFlash replica count is altered to 0.
https://github.com/pingcap/tiflash/blob/release-7.5/dbms/src/TiDB/Schema/SchemaBuilder.cpp#L391-L408

Older versions are also not affected because they do not have this mechanism.

ti-chi-bot bot pushed a commit that referenced this issue Sep 19, 2024

close #9438

ddl: Do not physical drop table after tiflash replica is set to 0
To avoid a potential data loss issue when altering tiflash replica
ti-chi-bot bot pushed a commit that referenced this issue Sep 19, 2024

close #9438

ddl: Do not physical drop table after tiflash replica is set to 0 (#9441)
To avoid a potential data loss issue when altering tiflash replica

Co-authored-by: JaySon-Huang <[email protected]>
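Judging only from the commit title above ("Do not physical drop table after tiflash replica is set to 0"), the fix changes the GC decision so that a replica-count-0 tombstone no longer reaches the physical-drop path. A rough, hypothetical sketch of that behaviour in terms of the simplified predicate from the first comment (not the actual patch; TombstoneReason and eligibleForPhysicalDrop are invented names):

```cpp
// Hypothetical tag for why a tombstone was added; the real code passes the
// string "SetTiFlashReplica-0" to applyDropTable instead.
enum class TombstoneReason { DropDDL, SetTiFlashReplicaZero };

// Under this sketch, only tombstones that came from a real DROP TABLE /
// DROP DATABASE remain eligible for the physical-drop path; setting the
// TiFlash replica count to 0 no longer leads to a physical drop.
bool eligibleForPhysicalDrop(const TableInfo & table, uint64_t gc_safepoint, TombstoneReason reason)
{
    return reason == TombstoneReason::DropDDL && shouldPhysicallyDrop(table, gc_safepoint);
}
```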