Removing regions may be blocked for a long time when dropping a table with large amount of data #9437

JaySon-Huang · 2024-09-19T01:08:34Z

Bug Report

Please answer these questions before submitting your issue. Thanks!

1. Minimal reproduce step (Required)

Create a table with tiflash replica
Load data into the table until it contains thousands of Regions in the raft level, and thousands of Segments in the DeltaTree storage engine level
Drop the table

2. What did you expect to see? (Required)

The dropped table gets dropped smoothly, without affecting TiFlash's raft-log syncing and causing failed queries

3. What did you see instead (Required)

The raft-log syncing is blocked for tens of minutes, and coming raft messages make tiflash OOM.
And raft-log syncing is blocked also makes failed queries because the learner read timeout by waiting raft-log syncing index.

4. What is your TiFlash version? (Required)

v7.1.3

JaySon-Huang · 2024-09-19T01:20:48Z

When a table is dropped in tidb, and exceeds the gc_safepoint, tiflash will generate an InterpreterDropQuery to physically remove the data from the tiflash instance. The "DropQuery" will call DeltaMergeStore::dropAllSegments to remove the Segment at DeltaTree storage one by one. Because there are many segments for the table, running DeltaMergeStore::dropAllSegments is slow.
https://github.com/pingcap/tiflash/blob/v7.1.3/dbms/src/TiDB/Schema/SchemaSyncService.cpp#L216-L227
https://github.com/pingcap/tiflash/blob/v7.1.3/dbms/src/Storages/DeltaMerge/DeltaMergeStore.cpp#L342-L349

[2024/09/16 21:29:29.604 +00:00] [INFO] [SchemaSyncService.cpp:170] ["Performing GC using safe point 452597098965106688"] [source="keyspace=4294967295"] [thread_id=4314]
[2024/09/16 21:29:29.610 +00:00] [INFO] [SchemaSyncService.cpp:216] ["Physically dropping table test_db(694).test_tbl(563080)"] [source="keyspace=4294967295"] [thread_id=4314]
[2024/09/16 21:29:30.446 +00:00] [INFO] [DeltaMergeStore.cpp:413] ["Clear DeltaMerge segments data"] [source="keyspace_id=4294967295 table_id=563080"] [thread_id=4314]
[2024/09/16 21:29:30.465 +00:00] [INFO] [Segment.cpp:2004] ["Finish segment drop its next segment, segment=<segment_id=292469 epoch=2 range=[?,?) next_segment_id=292472 delta_rows=0 delta_bytes=0 delta_deletes=0 stable_file=dmf_1174749 stable_rows=98304 stable_bytes=576748236 dmf_rows=98304 dmf_bytes=576748236 dmf_packs=12>"] [source="keyspace_id=4294967295 table_id=563080 segment_id=292469 epoch=2"] [thread_id=4314]
[2024/09/16 21:29:30.553 +00:00] [INFO] [Segment.cpp:2004] ["Finish segment drop its next segment, segment=<segment_id=292223 epoch=3 range=[?,?) next_segment_id=292469 delta_rows=0 delta_bytes=0 delta_deletes=0 stable_file=dmf_1205953 stable_rows=86440 stable_bytes=507435226 dmf_rows=86440 dmf_bytes=507435226 dmf_packs=11>"] [source="keyspace_id=4294967295 table_id=563080 segment_id=292223 epoch=3"] [thread_id=4314]
...

However, DeltaMergeStore::dropAllSegments does not check that all Regions have been removed from TiFlash before the physical drop action is performed. So at the same time, there are some Region removed actions are performed.
All the raftstore threads that try to remove Region will run into removeObsoleteDataInStorage, calling DeltaMergeStore::flushCache and try to acquire a read latch on table_id=563080. At last, all the raftsotre threads are blocked by the thread that executing DeltaMergeStore::dropAllSegments.

But more raft-message comes into the tiflash instance, the memory usage grows and cause OOM kills. After restarts, the tiflash instance runs into the same blocking again. And at last, all the segments (around 30,000 in total) are removed from tiflash. And tiflash begins to catch-up the raft message.

[2024/09/16 23:25:37.681 +00:00] [INFO] [DeltaMergeStore.cpp:413] ["Clear DeltaMerge segments data"] [source="keyspace_id=4294967295 table_id=563080"] [thread_id=4143]
...
[2024/09/16 23:46:11.570 +00:00] [INFO] [Segment.cpp:2004] ["Finish segment drop its next segment, segment=<segment_id=281216 epoch=2 range=[?,?) next_segment_id=281738 delta_rows=0 delta_bytes=0 delta_deletes=1 stable_file=dmf_1205957 stable_rows=59514 stable_bytes=348855913 dmf_rows=59514 dmf_bytes=348855913 dmf_packs=8>"] [source="keyspace_id=4294967295 table_id=563080 segment_id=281216 epoch=2"] [thread_id=4143]
[2024/09/16 23:46:11.714 +00:00] [INFO] [DeltaMergeStore.cpp:419] ["Clear DeltaMerge segments data done"] [source="keyspace_id=4294967295 table_id=563080"] [thread_id=4143]

JaySon-Huang · 2024-09-19T01:25:03Z

Affected versions
v4.0.x
v5.0.x ~ v5.4.x
v6.1.0 ~ v6.1.7
v6.5.0 ~ v6.5.11
v7.1.0 ~ v7.1.5
v7.5.0 (fixed since v7.5.1 when fixing another issue #8710)

For the old affected versions before v7.5.x, we can pick the same logic as fixing #8710. That is to ensure all the regions are removed before physically dropping the data from TiFlash instance. In this way, the DeltaMergeStore::dropAllSegments does not cause slow raft-log syncing and failed queries.

JaySon-Huang · 2024-09-19T08:05:48Z

Mark it as major because it may block the raft threads and cause OOM when dropping a large volume table, but it can self-recover.

…ons (#9442) close #9437 ddl: Fix the physical drop storage instance may block removing regions Make sure physical drop storage instance only happen after all related regions are removed

JaySon-Huang · 2024-09-20T09:46:56Z

Close as #9442 is fixed in the release-7.1 branch.
And the master branch is not affected.

JaySon-Huang added type/bug The issue is confirmed as a bug. component/storage labels Sep 19, 2024

JaySon-Huang added affects-5.4 This bug affects the 5.4.x(LTS) versions. affects-6.1 This bug affects the 6.1.x(LTS) versions. affects-6.5 This bug affects the 6.5.x(LTS) versions. affects-7.1 This bug affects the 7.1.x(LTS) versions. severity/major labels Sep 19, 2024

ti-chi-bot bot added may-affects-7.5 may-affects-8.1 labels Sep 19, 2024

JaySon-Huang removed may-affects-7.5 may-affects-8.1 labels Sep 19, 2024

JaySon-Huang mentioned this issue Sep 19, 2024

ddl: Fix the physically drop storage instance may block removing regions #9442

Merged

12 tasks

JaySon-Huang closed this as completed Sep 20, 2024

JaySon-Huang linked a pull request Sep 20, 2024 that will close this issue

ddl: Fix the physically drop storage instance may block removing regions #9442

Merged

12 tasks

ti-chi-bot bot added the report/customer Customers have encountered this bug. label Oct 11, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Removing regions may be blocked for a long time when dropping a table with large amount of data #9437

Removing regions may be blocked for a long time when dropping a table with large amount of data #9437

JaySon-Huang commented Sep 19, 2024

JaySon-Huang commented Sep 19, 2024

JaySon-Huang commented Sep 19, 2024 •

edited

Loading

JaySon-Huang commented Sep 19, 2024

JaySon-Huang commented Sep 20, 2024

Removing regions may be blocked for a long time when dropping a table with large amount of data #9437

Removing regions may be blocked for a long time when dropping a table with large amount of data #9437

Comments

JaySon-Huang commented Sep 19, 2024

Bug Report

1. Minimal reproduce step (Required)

2. What did you expect to see? (Required)

3. What did you see instead (Required)

4. What is your TiFlash version? (Required)

JaySon-Huang commented Sep 19, 2024

JaySon-Huang commented Sep 19, 2024 • edited Loading

JaySon-Huang commented Sep 19, 2024

JaySon-Huang commented Sep 20, 2024

JaySon-Huang commented Sep 19, 2024 •

edited

Loading