
fix: fix the bug in restore #459

Merged
merged 5 commits into XiaoMi:master from stop-restore on May 13, 2020

Conversation

@levy5307 (Contributor) commented May 12, 2020

In restore, _restore_status = ERR_CORRUPTION means that an error was encountered and the restore process should be stopped.
When file.md5sum does not match the corresponding md5sum in the metadata, something is wrong with that file, so we should stop the restore process by setting _restore_status = ERR_CORRUPTION. Otherwise the restore process will repeat forever, as our system shows now.
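
For reference, the essence of the change can be sketched as follows. This is a standalone, illustrative C++ sketch rather than the actual patch: file_meta, verify_checkpoint and md5_of are made-up names, and only the idea of mapping an md5 mismatch to a corruption status comes from the description above.

// Illustration only: an md5 mismatch must be treated as corruption so that
// the caller stops the restore instead of retrying a checkpoint that can
// never become valid.
#include <functional>
#include <string>
#include <vector>

enum class restore_status { ok, err_corruption };

struct file_meta {
    std::string name;
    std::string md5; // md5 recorded in the backup metadata
};

// md5_of is supplied by the caller and stands in for the project's md5 helper.
restore_status verify_checkpoint(
    const std::vector<file_meta> &metadata_files,
    const std::function<std::string(const std::string &)> &md5_of)
{
    for (const file_meta &f : metadata_files) {
        if (md5_of(f.name) != f.md5) {
            // The downloaded file does not match the backup metadata,
            // so report corruption rather than letting the restore retry.
            return restore_status::err_corruption;
        }
    }
    return restore_status::ok;
}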

Manual Test

Action

Restore with corrupt checkpoint files.

restore_app -c onebox -p policy -a temp -i 2 -t 1589350598363 -b local_service -n result

Note that result is the new app created by the restore process.

Before Modification

There are a lot of errors in the log file for each partition, which means the restore retried many times:

E2020-05-13 14:43:05.819 (1589352185819643053 06c2) replica.rep_long0.03040001000000d1: replica_restore.cpp:296:download_checkpoint(): [email protected]:34801: checkpoint is damaged, chkpt = /home/mi/work
E2020-05-13 14:43:05.819 (1589352185819675493 06c2) replica.rep_long0.03040001000000d1: replica_init.cpp:85:newr(): try to restore replica [email protected]:34801 failed, error(ERR_CORRUPTION) 
E2020-05-13 14:43:15.824 (1589352195824503443 06c2) replica.rep_long0.03040001000000e6: replica_restore.cpp:296:download_checkpoint(): [email protected]:34801: checkpoint is damaged, chkpt = /home/mi/work
E2020-05-13 14:43:15.824 (1589352195824548942 06c2) replica.rep_long0.03040001000000e6: replica_init.cpp:85:newr(): try to restore replica [email protected]:34801 failed, error(ERR_CORRUPTION)
E2020-05-13 14:43:25.827 (1589352205827476263 06c3) replica.rep_long1.03040001000000fd: replica_restore.cpp:296:download_checkpoint(): [email protected]:34801: checkpoint is damaged, chkpt = /home/mi/work
E2020-05-13 14:43:25.827 (1589352205827637665 06c3) replica.rep_long1.03040001000000fd: replica_init.cpp:85:newr(): try to restore replica [email protected]:34801 failed, error(ERR_CORRUPTION)

And we can see that the result app will exist forever.

>>> ls -d
[general_info]
app_id  status     app_name  app_type  partition_count  replica_count  is_stateful  create_time          drop_time  drop_expire  envs_count  
1       AVAILABLE  stat      pegasus   4                3              true         2020-05-13_14:16:12  -          -            0           
2       AVAILABLE  temp      pegasus   8                3              true         2020-05-13_14:16:12  -          -            0           
3       AVAILABLE  result    pegasus   8                3              true         2020-05-13_14:16:12  -          -            6           

[healthy_info]
app_id  app_name  partition_count  fully_healthy  unhealthy  write_unhealthy  read_unhealthy  
1       stat      4                4              0          0                0               
2       temp      8                8              0          0                0               
3       result    8                0              8          8                8       

After Modification

The error appears in the log only once for each partition:

E2020-05-13 14:37:02.165 (1589351822165922452 787c) replica.rep_long1.03040001000001c5: replica_restore.cpp:296:download_checkpoint(): [email protected]:34802: checkpoint is damaged, chkpt = /home/mi/work
E2020-05-13 14:37:02.165 (1589351822165971928 787c) replica.rep_long1.03040001000001c5: replica_init.cpp:85:newr(): try to restore replica [email protected]:34802 failed, error(ERR_CORRUPTION)

And the result app is deleted immediately when the error occurs.

>>> ls -d
[general_info]
app_id  status     app_name  app_type  partition_count  replica_count  is_stateful  create_time          drop_time  drop_expire  envs_count  
1       AVAILABLE  stat      pegasus   4                3              true         2020-05-13_14:35:12  -          -            0           
2       AVAILABLE  temp      pegasus   8                3              true         2020-05-13_14:35:12  -          -            0           

[healthy_info]
app_id  app_name  partition_count  fully_healthy  unhealthy  write_unhealthy  read_unhealthy  
1       stat      4                4              0          0                0               
2       temp      8                8              0          0                0               

acelyc111 previously approved these changes May 12, 2020
@neverchanje (Contributor) commented:
> Otherwise the restore process will repeat forever, as our system shows now.

How did it show? Do you have any manual tests or unit tests to verify this modification?

@hycdong (Contributor) commented May 13, 2020

> Otherwise the restore process will repeat forever, as our system shows now.
>
> How did it show? Do you have any manual tests or unit tests to verify this modification?

Current restore process:

  • the meta server creates a new table
  • the primary downloads the checkpoint from the remote provider
  • if the download fails, meta retries the restore process

If downloading returns ERR_CORRUPTION, meta should not retry the restore, because the checkpoint on the remote provider is corrupted. In the current implementation, meta doesn't handle this situation, which leads to the restore process repeating forever and never stopping.
I agree that it would be better to add a unit test for this pull request.
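
For illustration, the decision meta needs to make can be sketched like this. It is a standalone sketch with made-up names, not the actual meta-server code; the only grounded part is that ERR_CORRUPTION must stop the retry loop, while other failures may still be retried.

// Illustration only: how the restore result could map to meta's next action.
enum class restore_error { ok, corruption, transient };
enum class meta_action { finish, drop_app_and_stop, retry };

meta_action decide_next_step(restore_error err)
{
    switch (err) {
    case restore_error::corruption:
        // The checkpoint on the remote provider is damaged; retrying can
        // never succeed, so the half-restored table should be dropped.
        return meta_action::drop_app_and_stop;
    case restore_error::transient:
        // Recoverable failures (e.g. a download timeout) are retried.
        return meta_action::retry;
    default:
        return meta_action::finish;
    }
}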

@levy5307 (Contributor, Author) commented May 13, 2020

> Otherwise the restore process will repeat forever, as our system shows now.
>
> How did it show? Do you have any manual tests or unit tests to verify this modification?

Yes, I have done some manual tests to check whether the restore process stops or not.

@acelyc111 merged commit 704e1f0 into XiaoMi:master on May 13, 2020
@levy5307 deleted the stop-restore branch on May 13, 2020 at 06:57