feat(bulk-load): meta handle bulk_load_response #463

hycdong · 2020-05-15T06:23:10Z

The whole bulk load process works as follows:

1. The client sends a request to the meta server to start bulk load.
2. Meta sends bulk_load_request to the primary replicas.
3. Each primary broadcasts group_bulk_load_request, collects bulk load states from its secondaries, and reports the group bulk load states to meta.
4. Meta receives bulk_load_response; if bulk load is not finished, it jumps back to step 2 and resends bulk_load_request.

This pull request covers step 4 above: meta handling the bulk_load_response received from a primary.

- If the RPC fails, or resp.err is OBJECT_NOT_FOUND or INVALID_STATE, meta treats it as a recoverable error: it tries to roll back to downloading and resends the bulk load request.
- Each node limits the number of replicas downloading concurrently. If resp.err is BUSY, the node already has enough replicas downloading, so no error handling is needed; meta just resends the bulk load request.
- If any other error occurs, such as ERR_CORRUPT, meta considers the bulk load failed and resends the bulk load request so that the replica server cleans up its bulk load states and context.
- If resp.err is OK and the ballot check passes, meta calls a different function depending on the app's current bulk_load_status. If bulk load is not finished, meta resends bulk_load_request. The one exception is the paused status: bulk load is not finished yet, but meta should not send requests to replicas to collect states.
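
The error-handling branches above can be sketched as a small classification function. This is only an illustration under assumptions: the enum types, their values, and the function name classify_response are hypothetical stand-ins, not identifiers from this pull request, and the ballot check is left out for brevity.

```cpp
// Hypothetical stand-ins for the error codes named in the description.
enum class resp_err { OK, OBJECT_NOT_FOUND, INVALID_STATE, BUSY, CORRUPT };

// Hypothetical names for what meta does next; not taken from the patch.
enum class meta_action {
    ROLLBACK_AND_RESEND, // recoverable: roll back to downloading, then resend
    RESEND_ONLY,         // node reached its download limit: just resend
    FAIL_AND_CLEANUP,    // unrecoverable: mark failed, resend so replicas clean up
    HANDLE_BY_STATUS     // success: dispatch on the app's bulk_load_status
};

// Maps an RPC outcome to the action described in the list above.
// The ballot check that follows a successful response is omitted here.
inline meta_action classify_response(bool rpc_failed, resp_err err)
{
    if (rpc_failed || err == resp_err::OBJECT_NOT_FOUND ||
        err == resp_err::INVALID_STATE) {
        return meta_action::ROLLBACK_AND_RESEND;
    }
    if (err == resp_err::BUSY) {
        return meta_action::RESEND_ONLY;
    }
    if (err != resp_err::OK) {
        return meta_action::FAIL_AND_CLEANUP;
    }
    return meta_action::HANDLE_BY_STATUS;
}
```

Note that the first three actions all end with a resend; they differ only in what meta does to its own state before resending.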

There are two intervals for resending the bulk load request to a primary: the default is 10 seconds and the short interval is 5 seconds. The short interval is used when the app status is downloaded or ingesting (partition status may be ingesting), because meta wants to get states faster while a partition is ingesting.
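
A minimal sketch of the interval choice, assuming a simplified status enum. The enum, its values, and the function name resend_interval are hypothetical; only the 10-second and 5-second values and the downloaded/ingesting rule come from the description.

```cpp
#include <chrono>

// Hypothetical, simplified status enum; the real bulk_load_status has its own definition.
enum class app_bulk_load_status { DOWNLOADING, DOWNLOADED, INGESTING, PAUSED, FAILED };

// Returns how long meta waits before resending bulk_load_request to a primary.
inline std::chrono::seconds resend_interval(app_bulk_load_status status)
{
    // The short interval applies while partitions may be ingesting.
    const bool use_short = status == app_bulk_load_status::DOWNLOADED ||
                           status == app_bulk_load_status::INGESTING;
    return use_short ? std::chrono::seconds(5) : std::chrono::seconds(10);
}
```

For example, resend_interval(app_bulk_load_status::INGESTING) yields 5 seconds, while the downloading status falls back to the 10-second default.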

This pull request adds many function declarations; the unit tests will be implemented in a later pull request.

src/dist/replication/meta_server/meta_bulk_load_service.cpp

acelyc111 · 2020-05-18T02:44:42Z

+                 pid,
+                 err.to_string());
+        try_rollback_to_downloading(app_name, pid);
+        try_resend_bulk_load_request(app_name, pid, interval);


How do you deal with a nonrecoverable error, e.g. files in remote storage being corrupt?

On L353 ~ L363, the bulk load status of the app and its partitions is set to bulk_load_failed (to be implemented in the function handle_bulk_load_failed), and the request is resent to the replica server to clean up bulk load states and context.

add meta on_partition_bulk_load_reply

58cf8ec

hycdong added the component/bulk-load label May 15, 2020

refactor code

a8b06b4

hycdong marked this pull request as ready for review May 15, 2020 09:39

Merge branch 'master' into meta_on_partition_bulk_load_reply

287898e

acelyc111 reviewed May 18, 2020

small fix

03cb7a7

acelyc111 previously approved these changes May 18, 2020

Merge branch 'master' into meta_on_partition_bulk_load_reply

f78bcb0

levy5307 reviewed May 18, 2020

fix log

98151a4

hycdong dismissed acelyc111’s stale review via 98151a4 May 18, 2020 07:44

levy5307 approved these changes May 18, 2020

acelyc111 approved these changes May 18, 2020

acelyc111 merged commit 466bca2 into XiaoMi:master May 18, 2020

hycdong added a commit to hycdong/rdsn that referenced this pull request May 19, 2020

merge feat(bulk-load): meta handle bulk_load_response XiaoMi#463

67b3470

hycdong deleted the meta_on_partition_bulk_load_reply branch June 3, 2020 06:29

hycdong mentioned this pull request Jun 18, 2020

feat(bulk-load): bulk load succeed part2 - meta handle bulk load succeed #508

Merged

This was referenced Jun 29, 2020

feat(bulk-load): pause bulk load part2 - implement bulk load pause #518

Merged

feat(bulk-load): restart bulk load #524

Merged

feat(bulk-load): cancel bulk load #531

Merged

This was referenced Jul 7, 2020

feat(bulk-load): handle bulk load failed and app unavailable #532

Merged

feat(bulk-load): handle recoverable errors during bulk load #535

Merged

hycdong mentioned this pull request Aug 17, 2020

Release 2.1.0 apache/incubator-pegasus#577

Closed

neverchanje added the 2.1.0 label Mar 2, 2021
