Potential data loss after changing the number of TiFlash replicas #9438

Closed
JaySon-Huang opened this issue Sep 19, 2024 · 2 comments · Fixed by #9440
Labels
affects-8.1 (This bug affects the 8.1.x (LTS) versions.) · component/storage · severity/major · type/bug (The issue is confirmed as a bug.)

Comments

@JaySon-Huang
Contributor

Bug Report

Please answer these questions before submitting your issue. Thanks!

1. Minimal reproduce step (Required)

  1. Create a table with a TiFlash replica and load some data
  2. Alter the TiFlash replica count to 0
  3. Wait for gc_lifetime to pass, then set the TiFlash replica count back to 1

2. What did you expect to see? (Required)

The TiFlash replica is built correctly

3. What did you see instead? (Required)

There is a chance of data loss after step 3.

4. What is your TiFlash version? (Required)

v8.1.1

@JaySon-Huang
Contributor Author

JaySon-Huang commented Sep 19, 2024

In v8.1.0 and v8.1.1, if the TiFlash replica count is set to 0, applyDropTable(database_id, table_id, "SetTiFlashReplica-0") is executed, which adds a tombstone_ts to the IStorage instance.
https://github.com/pingcap/tiflash/blob/v8.1.1/dbms/src/TiDB/Schema/SchemaBuilder.cpp#L392-L407

If all the regions have been removed from the TiFlash instance and the tombstone_ts exceeds the gc_safepoint, an InterpreterDropQuery is generated to physically drop the IStorage instance.
https://github.com/pingcap/tiflash/blob/v8.1.1/dbms/src/TiDB/Schema/SchemaSyncService.cpp#L304-L354
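For illustration, here is a minimal, self-contained sketch of the GC decision described above. The names (TableInfo, shouldPhysicallyDrop, region_peer_count) are simplified stand-ins for this sketch, not the actual TiFlash API:

```cpp
#include <cstddef>
#include <cstdint>
#include <optional>

// Simplified stand-in for the per-table state that gcImpl inspects.
struct TableInfo
{
    std::optional<uint64_t> tombstone_ts; // set by applyDropTable(..., "SetTiFlashReplica-0")
    std::size_t region_peer_count = 0;    // raft-layer region peers on this TiFlash store
};

// Mirrors the decision described above: physically drop the IStorage only if
// the tombstone is older than the GC safepoint AND no region peer remains.
bool shouldPhysicallyDrop(const TableInfo & table, uint64_t gc_safepoint)
{
    return table.tombstone_ts.has_value()
        && *table.tombstone_ts < gc_safepoint
        && table.region_peer_count == 0;
}
```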

However, there is a chance of data loss due to a concurrency issue (see the sketch after this list):

  1. In SchemaSyncService::gcImpl, a table is judged to satisfy both "tombstone_ts exceeds the gc_safepoint" and "no region peer exists", so an InterpreterDropQuery is generated.
  2. The user sets the TiFlash replica count to K, where K > 0, and a new region snapshot is applied to TiFlash before the InterpreterDropQuery is executed.
  3. The InterpreterDropQuery is executed and all the data in the StorageDeltaMerge is physically removed, but the region still exists in the raft layer. Query results after that point suffer data loss.

Note: The mechanism of "if the TiFlash replica count is set to 0, applyDropTable(database_id, table_id, "SetTiFlashReplica-0") is executed" is meant to remove the empty segments and the .sql schema file from the TiFlash instance after the replica count is set to 0. However, there seems to be no easy way to fix this concurrency issue.
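The interleaving above is essentially a check-then-act race. The following hypothetical, single-threaded replay (reusing the simplified TableInfo / shouldPhysicallyDrop definitions from the sketch in the previous comment) shows how the drop decision goes stale before it is executed; it is an illustration, not TiFlash code:

```cpp
#include <cassert>
#include <cstdint>

int main()
{
    TableInfo table;
    table.tombstone_ts = 100;    // written when the replica count was altered to 0
    uint64_t gc_safepoint = 200; // the tombstone has already passed the GC safepoint

    // Step 1: gcImpl's check passes, so an InterpreterDropQuery is scheduled.
    bool drop_decided = shouldPhysicallyDrop(table, gc_safepoint);
    assert(drop_decided);

    // Step 2: the user sets the TiFlash replica count to K > 0 and a new
    // region snapshot is applied before the scheduled drop runs.
    table.region_peer_count = 1;

    // Step 3: the drop still executes based on the stale decision, physically
    // removing StorageDeltaMerge data that the new region now depends on.
    if (drop_decided)
    {
        // physicallyDropStorage(table); // hypothetical; this is where data is lost
    }
    return 0;
}
```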

@JaySon-Huang
Contributor Author

For LTS versions, only v8.1.0 and v8.1.1 are affected.

release-7.5 is not affected because the "drop table" action is commented out when the TiFlash replica count is altered to 0.
https://github.com/pingcap/tiflash/blob/release-7.5/dbms/src/TiDB/Schema/SchemaBuilder.cpp#L391-L408

Older versions are also not affected because they do not have this mechanism.

ti-chi-bot bot pushed a commit that referenced this issue Sep 19, 2024

close #9438

ddl: Do not physical drop table after tiflash replica is set to 0
To avoid a potential data loss issue when altering tiflash replica
ti-chi-bot bot pushed a commit that referenced this issue Sep 19, 2024

close #9438

ddl: Do not physical drop table after tiflash replica is set to 0 (#9441)
To avoid a potential data loss issue when altering tiflash replica

Co-authored-by: JaySon-Huang <[email protected]>
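Judging only from the commit title above ("Do not physical drop table after tiflash replica is set to 0"), the fix changes the GC decision so that a replica-count-0 tombstone no longer reaches the physical-drop path. A rough, hypothetical sketch of that behaviour in terms of the simplified predicate from the first comment (not the actual patch; TombstoneReason and eligibleForPhysicalDrop are invented names):

```cpp
// Hypothetical tag for why a tombstone was added; the real code passes the
// string "SetTiFlashReplica-0" to applyDropTable instead.
enum class TombstoneReason { DropDDL, SetTiFlashReplicaZero };

// Under this sketch, only tombstones that came from a real DROP TABLE /
// DROP DATABASE remain eligible for the physical-drop path; setting the
// TiFlash replica count to 0 no longer leads to a physical drop.
bool eligibleForPhysicalDrop(const TableInfo & table, uint64_t gc_safepoint, TombstoneReason reason)
{
    return reason == TombstoneReason::DropDDL && shouldPhysicallyDrop(table, gc_safepoint);
}
```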