fix(backup): delay clearing obsoleted backup when it's still checkpointing #327

vagetablechicken · 2019-10-10T03:35:47Z

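In the existing logic, when a clean request (backup_id == 0) is received, the replica first looks up the backup_context:

If the backup_context does not exist, it calls clear_backup_checkpoint to make sure the corresponding directories are deleted. If the replica is the primary, it also forwards the clean request to the secondaries.
If the backup_context exists, it deletes the backup_context. If the replica is the primary, it also forwards the clean request to the secondaries. It then deletes the corresponding directories.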
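
As a rough illustration of the two branches just described, here is a minimal sketch. The types Role, BackupContext and Replica, and the string trace in actions, are hypothetical stand-ins invented for this example; they are not the real Pegasus classes.

```cpp
#include <cstdint>
#include <map>
#include <memory>
#include <string>
#include <vector>

// Hypothetical, simplified model of the clean-request path described above.
// None of these types are the real Pegasus classes.
enum class Role { Primary, Secondary };

struct BackupContext {
    int64_t backup_id = 0;
};

struct Replica {
    Role role = Role::Secondary;
    // Backup contexts keyed by policy name (an assumption made for this sketch).
    std::map<std::string, std::shared_ptr<BackupContext>> contexts;
    // Trace of the steps taken, so the order can be inspected.
    std::vector<std::string> actions;

    void on_clean_request(const std::string &policy)
    {
        auto it = contexts.find(policy);
        if (it == contexts.end()) {
            // No context: still clear the directories so nothing is left behind.
            actions.push_back("clear_backup_checkpoint");
            if (role == Role::Primary)
                actions.push_back("forward_to_secondaries");
            return;
        }
        // Context exists: drop it, forward if primary, then delete the directories.
        contexts.erase(it);
        actions.push_back("erase_context");
        if (role == Role::Primary)
            actions.push_back("forward_to_secondaries");
        actions.push_back("clear_backup_checkpoint");
    }
};
```

Note that neither branch looks at whether a checkpoint is still being written before the directories are removed.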

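The problem is that a clean request may arrive while checkpointing is still in progress. If the directory is deleted midway through the copy to dir, the process crashes with a core dump.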

Example

A core dump occurred at 01:05 on 2019-10-10; the precise timestamp is Modify: 2019-10-10 01:05:37.504636201 +0800:

pegasus::server::pegasus_server_impl::copy_checkpoint_to_dir_unsafe(
    /home/work/ssd1/pegasus/azmbcloudsrv-storecom/replica/reps/3.16.pegasus/backup/backup_tmp.every_day.1570640707754.1570640708469

And the logs:

D2019-10-10 01:05:28.469 (1570640728469920756 11607) replica.replica13.040015d28eb66b87: replica_stub.cpp:893:on_cold_backup(): received cold backup request: backup{3.16.every_day.1570640707754}
D2019-10-10 01:05:28.469 (1570640728469936097 11607) replica.replica13.040015d28eb66b87: replica_backup.cpp:30:on_cold_backup(): backup{3.16.every_day.1570640707754}: received cold backup request, partition_status = replication::partition_status::PS_SECONDARY
D2019-10-10 01:05:28.470 (1570640728470260034 11607) replica.replica13.040015d28eb66b89: replica_stub.cpp:893:on_cold_backup(): received cold backup request: backup{3.16.every_day.0}
D2019-10-10 01:05:28.470 (1570640728470275669 11607) replica.replica13.040015d28eb66b89: replica_backup.cpp:30:on_cold_backup(): backup{3.16.every_day.0}: received cold backup request, partition_status = replication::partition_status::PS_SECONDARY, this is a clear request
D2019-10-10 01:05:28.470 (1570640728470281984 11607) replica.replica13.040015d28eb66b89: replica_backup.cpp:84:on_cold_backup(): backup{3.16.every_day.0}: clear obsoleted cold backup context, old_backup_id = 1570640707754, old_backup_status = ColdBackupCheckpointing
D2019-10-10 01:05:28.470 (1570640728470322651 11631) replica.rep_long5.0404000d000ac4a7: replica_backup.cpp:412:clear_backup_checkpoint(): [[email protected]:32801] clear all checkpoint dirs of policy(every_day)
D2019-10-10 01:05:28.471 (1570640728471839107 11631) replica.rep_long5.0404000d000ac4a7: replica_backup.cpp:425:clear_backup_checkpoint(): [[email protected]:32801] remove backup checkpoint dir(/home/work/ssd1/pegasus/azmbcloudsrv-storecom/replica/reps/3.16.pegasus/backup/backup_tmp.every_day.1570640707754.1570640708469.tmp) succeed

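As the logs show, this secondary started the backup in the background after receiving the backup request, and very shortly afterwards received a clean request. That immediately deleted the directory /home/work/ssd1/pegasus/azmbcloudsrv-storecom/replica/reps/3.16.pegasus/backup/backup_tmp.every_day.1570640707754.1570640708469.tmp. The core dump happened precisely because copy_checkpoint_to_dir_unsafe was writing into that directory, that is, inside rocksdb's CreateCheckpointQuick.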
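
The idea in this PR's title, delaying the clear while the obsoleted backup is still checkpointing, can be sketched roughly as follows. This is not the code from the PR: DelayedClearer, its fields, and the task queue that stands in for a timer are all assumptions made for illustration. Only the status name ColdBackupCheckpointing appears in the logs above; the other enum values are invented.

```cpp
#include <cstddef>
#include <functional>
#include <queue>
#include <utility>

// Hypothetical status values. The log above shows ColdBackupCheckpointing;
// the other names are assumed for illustration.
enum class BackupStatus { Checkpointing, Checkpointed, Invalid };

struct DelayedClearer {
    BackupStatus status = BackupStatus::Checkpointing;
    bool dir_removed = false;
    int retries = 0;
    // Stands in for a timer-based delayed task queue.
    std::queue<std::function<void()>> delayed_tasks;

    void clear_obsoleted_backup()
    {
        if (status == BackupStatus::Checkpointing) {
            // Someone is still writing into the directory: do not delete it
            // now, re-check later instead.
            ++retries;
            delayed_tasks.push([this] { clear_obsoleted_backup(); });
            return;
        }
        // Safe to delete: nothing is writing into the directory any more.
        dir_removed = true;
    }

    // Runs only the tasks that were pending when tick() was called,
    // which models one expiry of the delay timer.
    void tick()
    {
        std::size_t pending = delayed_tasks.size();
        for (std::size_t i = 0; i < pending; ++i) {
            auto task = std::move(delayed_tasks.front());
            delayed_tasks.pop();
            task();
        }
    }
};
```

In this sketch the directory is removed only on the first tick after the status leaves Checkpointing, so the writer never has its target directory deleted underneath it.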

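Testing

In a test environment I tried making the secondary delay copy_checkpoint_to_dir_unsafe by a random number of seconds, but the bug is hard to reproduce. The usual result is an error that can be handled, such as No such file or directory.
So whether this bug is fully fixed still needs to be confirmed by observation.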

vagetablechicken · 2019-10-10T09:40:55Z

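I think the best approach would still be to delete the directories based on local detection, rather than having the primary send a clean request to the secondaries. Could this be handed over to replica_stub's gc?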
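
A rough sketch of what such a local check might look like, purely as an illustration of the suggestion. The functions parse_backup_id and select_dirs_to_gc are hypothetical, not part of replica_stub, and the directory-name shape is inferred only from the paths in the logs above, assuming the policy name contains no '.'.

```cpp
#include <cstddef>
#include <cstdint>
#include <exception>
#include <set>
#include <sstream>
#include <string>
#include <vector>

// Parses the backup id out of a name shaped like
// "backup_tmp.<policy>.<backup_id>.<timestamp>[.tmp]".
// Returns -1 if the name does not match that shape.
int64_t parse_backup_id(const std::string &dir_name)
{
    std::vector<std::string> parts;
    std::stringstream ss(dir_name);
    std::string part;
    while (std::getline(ss, part, '.'))
        parts.push_back(part);
    if (parts.size() < 4 || parts[0] != "backup_tmp")
        return -1;
    try {
        std::size_t pos = 0;
        long long id = std::stoll(parts[2], &pos);
        return pos == parts[2].size() ? static_cast<int64_t>(id) : -1;
    } catch (const std::exception &) {
        return -1; // not a number, or out of range
    }
}

// Hypothetical local gc pass: select directories whose backup is obsolete
// and is not being checkpointed right now.
std::vector<std::string> select_dirs_to_gc(const std::vector<std::string> &dirs,
                                           const std::set<int64_t> &obsolete_ids,
                                           const std::set<int64_t> &checkpointing_ids)
{
    std::vector<std::string> victims;
    for (const auto &dir : dirs) {
        int64_t id = parse_backup_id(dir);
        if (id < 0)
            continue; // not a backup directory, leave it alone
        if (obsolete_ids.count(id) != 0 && checkpointing_ids.count(id) == 0)
            victims.push_back(dir);
    }
    return victims;
}
```

Because the decision is made locally from the replica's own state, a directory that is still being written is skipped in this pass and picked up by a later one.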

neverchanje · 2019-10-11T06:00:50Z

I think this PR looks like it should solve the problem; it can be rolled out with the 1.11.7 upgrade.

fix(coldbackup): delay clean request when chkpting

3301c0a

vagetablechicken added the type/bug-fix This PR fixes a bug. label Oct 10, 2019

vagetablechicken requested review from qinzuoyan, acelyc111, hycdong, levy5307, neverchanje and foreverneverer October 10, 2019 06:47

neverchanje changed the title ~~fix(coldbackup): delay clean request when chkpting~~ fix(backup): delay clean request when checkpointing Oct 10, 2019

neverchanje changed the title ~~fix(backup): delay clean request when checkpointing~~ fix(backup): delay clearing obsoleted backup when it's still checkpointing Oct 11, 2019

neverchanje approved these changes Oct 11, 2019

acelyc111 approved these changes Oct 11, 2019

neverchanje merged commit 2b8147a into XiaoMi:master Oct 11, 2019

weekly-digest bot mentioned this pull request Oct 13, 2019

Weekly Digest (6 October, 2019 - 13 October, 2019) #328

Closed

neverchanje mentioned this pull request Oct 25, 2019

Release 1.12.0 apache/incubator-pegasus#409

Closed
