This repository has been archived by the owner on Jun 23, 2022. It is now read-only.

fix(backup): delay clearing obsoleted backup when it's still checkpointing #327

Merged
merged 1 commit into from
Oct 11, 2019


vagetablechicken
Member

@vagetablechicken vagetablechicken commented Oct 10, 2019

issue: apache/incubator-pegasus#311

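Current logic: on receiving a clean request (backup_id == 0), the replica first looks up the backup_context:

  1. If the backup_context does not exist, it calls clear_backup_checkpoint to make sure the corresponding directory is deleted. If the replica is the primary, it also forwards the clean request to the secondaries.

  2. If the backup_context exists, it removes the backup_context. If the replica is the primary, it also forwards the clean request to the secondaries. It then deletes the corresponding directory.

The problem is that a clean request may arrive while checkpointing is still in progress. If the directory is deleted in the middle of the copy into it, the process crashes with a core dump.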
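The change named in the PR title (delay clearing while the backup is still checkpointing) can be sketched roughly as follows. This is a simplified, hypothetical model and not the actual replica code: `ReplicaSketch`, `BackupStatus`, `ClearResult`, and every method name are assumptions made up for illustration, and the retry mechanism is left to the caller.

```cpp
#include <cstdint>
#include <map>
#include <string>

// Hypothetical, simplified model; none of these names come from the Pegasus code base.
enum class BackupStatus { Checkpointing, Checkpointed };
enum class ClearResult { NothingToClear, Deferred, Cleared };

struct BackupContext {
    int64_t backup_id;
    BackupStatus status;
};

class ReplicaSketch {
public:
    // A backup request creates a context and starts checkpointing in the background.
    void start_backup(const std::string &policy, int64_t backup_id) {
        contexts_[policy] = BackupContext{backup_id, BackupStatus::Checkpointing};
    }

    // Called when the background checkpoint copy has finished.
    void finish_checkpoint(const std::string &policy) {
        auto it = contexts_.find(policy);
        if (it != contexts_.end()) {
            it->second.status = BackupStatus::Checkpointed;
        }
    }

    // Handle a clean request (backup_id == 0) for the given policy.
    ClearResult on_clean_request(const std::string &policy) {
        auto it = contexts_.find(policy);
        if (it == contexts_.end()) {
            // No context: the real code would still sweep leftover directories here.
            return ClearResult::NothingToClear;
        }
        if (it->second.status == BackupStatus::Checkpointing) {
            // Still writing into the checkpoint directory: keep the context and
            // the directory, and let the caller retry the clean request later.
            return ClearResult::Deferred;
        }
        // Safe to drop the context; the directory would be removed at this point.
        contexts_.erase(it);
        return ClearResult::Cleared;
    }

    bool has_context(const std::string &policy) const {
        return contexts_.count(policy) != 0;
    }

private:
    std::map<std::string, BackupContext> contexts_;
};
```

In this sketch a clean request that arrives during checkpointing returns `Deferred` and leaves the context untouched, so the directory is never removed underneath an in-flight copy.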

Example

A core dump appeared at 2019-10-10 01:05; its exact modification time is Modify: 2019-10-10 01:05:37.504636201 +0800:

pegasus::server::pegasus_server_impl::copy_checkpoint_to_dir_unsafe(
    /home/work/ssd1/pegasus/azmbcloudsrv-storecom/replica/reps/3.16.pegasus/backup/backup_tmp.every_day.1570640707754.1570640708469

And the log:

D2019-10-10 01:05:28.469 (1570640728469920756 11607) replica.replica13.040015d28eb66b87: replica_stub.cpp:893:on_cold_backup(): received cold backup request: backup{3.16.every_day.1570640707754}
D2019-10-10 01:05:28.469 (1570640728469936097 11607) replica.replica13.040015d28eb66b87: replica_backup.cpp:30:on_cold_backup(): backup{3.16.every_day.1570640707754}: received cold backup request, partition_status = replication::partition_status::PS_SECONDARY
D2019-10-10 01:05:28.470 (1570640728470260034 11607) replica.replica13.040015d28eb66b89: replica_stub.cpp:893:on_cold_backup(): received cold backup request: backup{3.16.every_day.0}
D2019-10-10 01:05:28.470 (1570640728470275669 11607) replica.replica13.040015d28eb66b89: replica_backup.cpp:30:on_cold_backup(): backup{3.16.every_day.0}: received cold backup request, partition_status = replication::partition_status::PS_SECONDARY, this is a clear request
D2019-10-10 01:05:28.470 (1570640728470281984 11607) replica.replica13.040015d28eb66b89: replica_backup.cpp:84:on_cold_backup(): backup{3.16.every_day.0}: clear obsoleted cold backup context, old_backup_id = 1570640707754, old_backup_status = ColdBackupCheckpointing
D2019-10-10 01:05:28.470 (1570640728470322651 11631) replica.rep_long5.0404000d000ac4a7: replica_backup.cpp:412:clear_backup_checkpoint(): [[email protected]:32801] clear all checkpoint dirs of policy(every_day)
D2019-10-10 01:05:28.471 (1570640728471839107 11631) replica.rep_long5.0404000d000ac4a7: replica_backup.cpp:425:clear_backup_checkpoint(): [[email protected]:32801] remove backup checkpoint dir(/home/work/ssd1/pegasus/azmbcloudsrv-storecom/replica/reps/3.16.pegasus/backup/backup_tmp.every_day.1570640707754.1570640708469.tmp) succeed

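As the log shows, this secondary started the backup in the background after receiving the backup request and, a very short time later, received a clean request. The clean request immediately deleted the directory /home/work/ssd1/pegasus/azmbcloudsrv-storecom/replica/reps/3.16.pegasus/backup/backup_tmp.every_day.1570640707754.1570640708469.tmp. The core dump was produced because copy_checkpoint_to_dir_unsafe was writing into that directory at the time, that is, inside rocksdb's CreateCheckpointQuick.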
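To put a number on how close the two requests were, the nanosecond timestamps in the first and third log lines (the backup request and the clean request received in replica_stub.cpp) can be subtracted. The constant and function names below are made up for this illustration.

```cpp
#include <cstdint>

// Nanosecond timestamps copied from the on_cold_backup() log lines above.
constexpr int64_t kBackupRequestNs = 1570640728469920756LL; // backup{3.16.every_day.1570640707754}
constexpr int64_t kCleanRequestNs  = 1570640728470260034LL; // backup{3.16.every_day.0}

// Gap between the backup request and the clean request, in nanoseconds.
constexpr int64_t request_gap_ns() { return kCleanRequestNs - kBackupRequestNs; }

// Same gap expressed in whole microseconds.
constexpr int64_t request_gap_us() { return request_gap_ns() / 1000; }
```

The gap is 339278 ns, roughly 0.34 ms, so the clean request reached the secondary almost immediately after the backup request.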

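Testing

In a test environment I tried making the secondary delay copy_checkpoint_to_dir_unsafe by a random number of seconds, but the bug is hard to reproduce; the usual outcome is an error that can be handled, such as No such file or directory.
So whether this bug is fully fixed still needs to be observed.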
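A delay injection of the kind described above might look like the following. This is only a sketch of the idea, not the code used in the experiment: `run_with_random_delay` and its parameters are hypothetical, and `copy_step` stands in for the call to copy_checkpoint_to_dir_unsafe.

```cpp
#include <chrono>
#include <functional>
#include <random>
#include <thread>

// Hypothetical test hook: sleep for a random duration in [0, max_delay]
// before running `copy_step`, and return the delay that was applied.
inline std::chrono::milliseconds run_with_random_delay(
    std::chrono::milliseconds max_delay,
    std::mt19937 &rng,
    const std::function<void()> &copy_step)
{
    std::uniform_int_distribution<long long> dist(0, max_delay.count());
    std::chrono::milliseconds delay(dist(rng));
    // Widen the window in which a clean request can arrive before the copy starts.
    std::this_thread::sleep_for(delay);
    copy_step();
    return delay;
}
```

Returning the chosen delay makes it possible to log it next to the outcome of each run, which helps when correlating a failure with the timing that triggered it.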

@vagetablechicken vagetablechicken added the type/bug-fix This PR fixes a bug. label Oct 10, 2019
@neverchanje neverchanje changed the title from "fix(coldbackup): delay clean request when chkpting" to "fix(backup): delay clean request when checkpointing" Oct 10, 2019
@vagetablechicken
Member Author

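I feel the best approach would still be to delete the directory based on a local check, rather than having the primary send a clean request to the secondaries. Could we consider handing this over to replica_stub's gc?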

@neverchanje neverchanje changed the title from "fix(backup): delay clean request when checkpointing" to "fix(backup): delay clearing obsoleted backup when it's still checkpointing" Oct 11, 2019
@neverchanje
Contributor

I think this PR looks like it should solve the problem; we can roll it out with the 1.11.7 upgrade.
