
fix: fix the bug in restore #459

Merged
merged 5 commits into XiaoMi:master from stop-restore on May 13, 2020

Conversation

@levy5307 (Contributor) commented May 12, 2020

In restore, _restore_status = ERR_CORRUPTION means that an error was encountered and the restore process should be stopped.
When file.md5sum does not match the corresponding md5sum in the metadata, something is wrong with that file, so we should stop the restore process by setting _restore_status = ERR_CORRUPTION. Otherwise the restore process will repeat forever, as our system shows now.
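
For reference, the essence of the change can be sketched as follows. This is a standalone, illustrative C++ sketch rather than the actual patch: file_meta, verify_checkpoint and md5_of are made-up names, and only the idea of mapping an md5 mismatch to a corruption status comes from the description above.

// Illustration only: an md5 mismatch must be treated as corruption so that
// the caller stops the restore instead of retrying a checkpoint that can
// never become valid.
#include <functional>
#include <string>
#include <vector>

enum class restore_status { ok, err_corruption };

struct file_meta {
    std::string name;
    std::string md5; // md5 recorded in the backup metadata
};

// md5_of is supplied by the caller and stands in for the project's md5 helper.
restore_status verify_checkpoint(
    const std::vector<file_meta> &metadata_files,
    const std::function<std::string(const std::string &)> &md5_of)
{
    for (const file_meta &f : metadata_files) {
        if (md5_of(f.name) != f.md5) {
            // The downloaded file does not match the backup metadata,
            // so report corruption rather than letting the restore retry.
            return restore_status::err_corruption;
        }
    }
    return restore_status::ok;
}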

Manual Test

Action

Restore with corrupt checkpoint files.

restore_app -c onebox -p policy -a temp -i 2 -t 1589350598363 -b local_service -n result

Note that result is the new app created by the restore process.

Before Modification

There are a lot of errors in the log file for each partition, which means the restore retried many times:

E2020-05-13 14:43:05.819 (1589352185819643053 06c2) replica.rep_long0.03040001000000d1: replica_restore.cpp:296:download_checkpoint(): [email protected]:34801: checkpoint is damaged, chkpt = /home/mi/work
E2020-05-13 14:43:05.819 (1589352185819675493 06c2) replica.rep_long0.03040001000000d1: replica_init.cpp:85:newr(): try to restore replica [email protected]:34801 failed, error(ERR_CORRUPTION) 
E2020-05-13 14:43:15.824 (1589352195824503443 06c2) replica.rep_long0.03040001000000e6: replica_restore.cpp:296:download_checkpoint(): [email protected]:34801: checkpoint is damaged, chkpt = /home/mi/work
E2020-05-13 14:43:15.824 (1589352195824548942 06c2) replica.rep_long0.03040001000000e6: replica_init.cpp:85:newr(): try to restore replica [email protected]:34801 failed, error(ERR_CORRUPTION)
E2020-05-13 14:43:25.827 (1589352205827476263 06c3) replica.rep_long1.03040001000000fd: replica_restore.cpp:296:download_checkpoint(): [email protected]:34801: checkpoint is damaged, chkpt = /home/mi/work
E2020-05-13 14:43:25.827 (1589352205827637665 06c3) replica.rep_long1.03040001000000fd: replica_init.cpp:85:newr(): try to restore replica [email protected]:34801 failed, error(ERR_CORRUPTION)

And we can see that the result app will exist forever.

>>> ls -d
[general_info]
app_id  status     app_name  app_type  partition_count  replica_count  is_stateful  create_time          drop_time  drop_expire  envs_count  
1       AVAILABLE  stat      pegasus   4                3              true         2020-05-13_14:16:12  -          -            0           
2       AVAILABLE  temp      pegasus   8                3              true         2020-05-13_14:16:12  -          -            0           
3       AVAILABLE  result    pegasus   8                3              true         2020-05-13_14:16:12  -          -            6           

[healthy_info]
app_id  app_name  partition_count  fully_healthy  unhealthy  write_unhealthy  read_unhealthy  
1       stat      4                4              0          0                0               
2       temp      8                8              0          0                0               
3       result    8                0              8          8                8       

After Modification

The error appears in the log only once for each partition:

E2020-05-13 14:37:02.165 (1589351822165922452 787c) replica.rep_long1.03040001000001c5: replica_restore.cpp:296:download_checkpoint(): [email protected]:34802: checkpoint is damaged, chkpt = /home/mi/work
E2020-05-13 14:37:02.165 (1589351822165971928 787c) replica.rep_long1.03040001000001c5: replica_init.cpp:85:newr(): try to restore replica [email protected]:34802 failed, error(ERR_CORRUPTION)

And the result app is deleted immediately when the error occurs.

>>> ls -d
[general_info]
app_id  status     app_name  app_type  partition_count  replica_count  is_stateful  create_time          drop_time  drop_expire  envs_count  
1       AVAILABLE  stat      pegasus   4                3              true         2020-05-13_14:35:12  -          -            0           
2       AVAILABLE  temp      pegasus   8                3              true         2020-05-13_14:35:12  -          -            0           

[healthy_info]
app_id  app_name  partition_count  fully_healthy  unhealthy  write_unhealthy  read_unhealthy  
1       stat      4                4              0          0                0               
2       temp      8                8              0          0                0               

acelyc111 previously approved these changes May 12, 2020
@neverchanje (Contributor) commented:
> Otherwise the restore process will repeat forever, as our system shows now.

How did it show? Do you have any manual tests or unit tests to verify this modification?

@hycdong (Contributor) commented May 13, 2020

> Otherwise the restore process will repeat forever, as our system shows now.
>
> How did it show? Do you have any manual tests or unit tests to verify this modification?

Current restore process:

  • the meta server creates a new table
  • the primary downloads the checkpoint from the remote provider
  • if the download fails, meta retries the restore process

If downloading returns ERR_CORRUPTION, meta should not retry the restore, because the checkpoint on the remote provider is corrupted. In the current implementation, meta doesn't handle this situation, which leads to the restore process repeating forever and never stopping.
I agree that it would be better to add a unit test for this pull request.
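
For illustration, the decision meta needs to make can be sketched like this. It is a standalone sketch with made-up names, not the actual meta-server code; the only grounded part is that ERR_CORRUPTION must stop the retry loop, while other failures may still be retried.

// Illustration only: how the restore result could map to meta's next action.
enum class restore_error { ok, corruption, transient };
enum class meta_action { finish, drop_app_and_stop, retry };

meta_action decide_next_step(restore_error err)
{
    switch (err) {
    case restore_error::corruption:
        // The checkpoint on the remote provider is damaged; retrying can
        // never succeed, so the half-restored table should be dropped.
        return meta_action::drop_app_and_stop;
    case restore_error::transient:
        // Recoverable failures (e.g. a download timeout) are retried.
        return meta_action::retry;
    default:
        return meta_action::finish;
    }
}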

@levy5307 (Contributor, Author) commented May 13, 2020

> Otherwise the restore process will repeat forever, as our system shows now.
>
> How did it show? Do you have any manual tests or unit tests to verify this modification?

Yes, I have done some manual tests to check whether the restore process stops or not.

@acelyc111 merged commit 704e1f0 into XiaoMi:master on May 13, 2020
@levy5307 deleted the stop-restore branch on May 13, 2020 at 06:57