This repository has been archived by the owner on Jun 23, 2022. It is now read-only.
fix(backup): delay clearing obsoleted backup when it's still checkpointing #327
issue: apache/incubator-pegasus#311
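The current logic: on receiving a clean request (backup_id == 0), the replica first fetches the backup_context:
If the backup_context does not exist, it calls clear_backup_checkpoint to make sure the corresponding directory is deleted. If the replica is the primary, it also forwards the clean request to the secondaries.
If the backup_context exists, it deletes the backup_context. If the replica is the primary, it also forwards the clean request to the secondaries. It then deletes the corresponding directory.
The problem is that a clean request may arrive while the replica is still checkpointing. If the directory is deleted in the middle of the copy into it, the process crashes with a core dump.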
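The idea in the PR title (delay clearing an obsoleted backup while it is still checkpointing) can be sketched roughly as follows. This is a minimal illustration, not the actual patch: `replica_sketch`, `delayed_tasks`, `run_delayed`, and `dirs_cleared` are hypothetical stand-ins for the replica, its delayed-task mechanism, and the directory removal, and forwarding the clean request to secondaries is omitted.

```cpp
#include <cassert>
#include <functional>
#include <memory>
#include <vector>

// All names below are hypothetical stand-ins, not the real Pegasus/rDSN types.
enum class backup_status { checkpointing, checkpointed };

struct backup_context {
    long long backup_id;
    backup_status status;
};

struct replica_sketch {
    std::unique_ptr<backup_context> ctx;              // current backup, if any
    std::vector<std::function<void()>> delayed_tasks; // stand-in for a delayed task queue
    int dirs_cleared = 0;                             // counts simulated directory removals

    // Stand-in for removing the checkpoint directories on disk.
    void clear_backup_checkpoint() { ++dirs_cleared; }

    // Handles a clean request (backup_id == 0).
    void on_clean_request() {
        if (ctx && ctx->status == backup_status::checkpointing) {
            // A checkpoint copy may be writing into the directory right now,
            // so do not delete it yet; retry the whole clean request later.
            delayed_tasks.push_back([this] { on_clean_request(); });
            return;
        }
        // No checkpoint in flight: drop the context, then remove the directories.
        ctx.reset();
        clear_backup_checkpoint();
        assert(!ctx);
    }

    // Runs the tasks that were queued so far (simulates the delay expiring).
    void run_delayed() {
        std::vector<std::function<void()>> tasks;
        tasks.swap(delayed_tasks);
        for (auto &task : tasks) task();
    }
};
```

With this shape, a clean request that arrives in the ColdBackupCheckpointing state leaves both the context and the directory untouched, and the removal only happens on a later retry once the status has moved on.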
Example
A core dump appeared at 2019-10-10 01:05; the exact timestamp is Modify: 2019-10-10 01:05:37.504636201 +0800:
and the log:
D2019-10-10 01:05:28.469 (1570640728469920756 11607) replica.replica13.040015d28eb66b87: replica_stub.cpp:893:on_cold_backup(): received cold backup request: backup{3.16.every_day.1570640707754}
D2019-10-10 01:05:28.469 (1570640728469936097 11607) replica.replica13.040015d28eb66b87: replica_backup.cpp:30:on_cold_backup(): backup{3.16.every_day.1570640707754}: received cold backup request, partition_status = replication::partition_status::PS_SECONDARY
D2019-10-10 01:05:28.470 (1570640728470260034 11607) replica.replica13.040015d28eb66b89: replica_stub.cpp:893:on_cold_backup(): received cold backup request: backup{3.16.every_day.0}
D2019-10-10 01:05:28.470 (1570640728470275669 11607) replica.replica13.040015d28eb66b89: replica_backup.cpp:30:on_cold_backup(): backup{3.16.every_day.0}: received cold backup request, partition_status = replication::partition_status::PS_SECONDARY, this is a clear request
D2019-10-10 01:05:28.470 (1570640728470281984 11607) replica.replica13.040015d28eb66b89: replica_backup.cpp:84:on_cold_backup(): backup{3.16.every_day.0}: clear obsoleted cold backup context, old_backup_id = 1570640707754, old_backup_status = ColdBackupCheckpointing
D2019-10-10 01:05:28.470 (1570640728470322651 11631) replica.rep_long5.0404000d000ac4a7: replica_backup.cpp:412:clear_backup_checkpoint(): [[email protected]:32801] clear all checkpoint dirs of policy(every_day)
D2019-10-10 01:05:28.471 (1570640728471839107 11631) replica.rep_long5.0404000d000ac4a7: replica_backup.cpp:425:clear_backup_checkpoint(): [[email protected]:32801] remove backup checkpoint dir(/home/work/ssd1/pegasus/azmbcloudsrv-storecom/replica/reps/3.16.pegasus/backup/backup_tmp.every_day.1570640707754.1570640708469.tmp) succeed
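As the log shows, this secondary started the backup in the background after receiving the backup request, and very shortly afterwards received a clean request, which immediately deleted the folder /home/work/ssd1/pegasus/azmbcloudsrv-storecom/replica/reps/3.16.pegasus/backup/backup_tmp.every_day.1570640707754.1570640708469.tmp. The core dump occurred precisely while the function copy_checkpoint_to_dir_unsafe was writing into that folder, that is, during rocksdb's CreateCheckpointQuick.
Testing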
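In a test environment I tried making the secondary delay copy_checkpoint_to_dir_unsafe by a random number of seconds, but the bug is hard to reproduce. The usual outcome is a handleable error such as No such file or directory.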
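For illustration, the benign outcome described above (the directory is already gone before the write starts) can be shown with a small POSIX-only sketch. It does not reproduce the crash, which requires the deletion to land in the middle of the copy; the function name, directory template, and file name here are hypothetical.

```cpp
#include <cerrno>
#include <cstdio>
#include <stdlib.h>
#include <string>
#include <unistd.h>

// Creates a temporary directory, removes it, then tries to create a file
// inside it. Returns the errno from the failed open, 0 if the open
// unexpectedly succeeded, or -1 if the setup itself failed.
int write_after_dir_removed() {
    char tmpl[] = "/tmp/backup_tmp.XXXXXX"; // hypothetical directory template
    char *dir = mkdtemp(tmpl);
    if (dir == nullptr) return -1;

    std::string file = std::string(dir) + "/checkpoint.part"; // hypothetical file name

    // Simulate the clean request winning the race before the copy starts.
    if (rmdir(dir) != 0) return -1;

    errno = 0;
    std::FILE *f = std::fopen(file.c_str(), "w");
    if (f != nullptr) {
        std::fclose(f);
        std::remove(file.c_str());
        return 0;
    }
    return errno; // expected to be ENOENT ("No such file or directory")
}
```

A caller can check the returned value against ENOENT and fail the backup attempt cleanly, which is the handleable case seen in testing.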
Whether this bug is fully fixed therefore still needs to be confirmed by further observation.