This repository has been archived by the owner on Jun 23, 2022. It is now read-only.

feat(bulk-load): meta handle bulk_load_response #463

Merged
merged 6 commits into from
May 18, 2020


hycdong
Contributor

@hycdong hycdong commented May 15, 2020

The whole bulk load process works as follows:

  1. The client sends a request to the meta server to start bulk load.
  2. Meta sends bulk_load_request to the primary replicas.
  3. Each primary broadcasts group_bulk_load_request, collects bulk load states from its secondaries, and reports the group bulk load states to meta.
  4. Meta receives bulk_load_response; if bulk load has not finished, it goes back to step 2 and resends bulk_load_request.

This pull request covers step 4 above: how meta handles the bulk_load_response it receives from the primary.

  • If the rpc fails, or resp.err is OBJECT_NOT_FOUND or INVALID_STATE, meta treats the error as recoverable: it tries to roll back to downloading and resends the bulk load request.
  • Each node limits the number of replicas downloading at the same time. If resp.err is BUSY, the node already has enough replicas downloading; no error handling is needed, and meta just resends the bulk load request.
  • If any other error happens, such as ERR_CORRUPT, meta considers the bulk load failed and resends the bulk load request so that the replica server cleans up its bulk load states and context.
  • If resp.err is ok and the response passes the ballot check, meta calls a different function depending on the current app's bulk_load_status. If bulk load has not finished, meta resends bulk_load_request. There is one exception: when the bulk load status is paused, bulk load is not finished yet, but meta should not send requests to replicas to collect states.
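
The four cases above can be read as a mapping from the response error to meta's next action. The following is only a rough sketch of that mapping, not the code in this pull request: `resp_error`, `meta_action` and `classify_response` are hypothetical names invented for illustration, and the ballot check is assumed to happen separately after the `ok` case.

```cpp
// Hypothetical stand-ins for the error codes named in the description;
// the real project uses its own error code type.
enum class resp_error { ok, rpc_failure, object_not_found, invalid_state, busy, corrupt };

// Hypothetical labels for what meta does next.
enum class meta_action {
    rollback_and_resend, // recoverable error: roll back to downloading, then resend
    resend_only,         // node is at its download limit: just resend later
    fail_and_cleanup,    // unrecoverable error: mark failed, resend so replicas clean up
    dispatch_on_status   // no error: continue according to the app's bulk_load_status
};

// Maps a response error to the action described in the bullets above.
// The ballot check is assumed to be done by the caller for the ok case.
inline meta_action classify_response(resp_error err)
{
    switch (err) {
    case resp_error::ok:
        return meta_action::dispatch_on_status;
    case resp_error::rpc_failure:
    case resp_error::object_not_found:
    case resp_error::invalid_state:
        return meta_action::rollback_and_resend;
    case resp_error::busy:
        return meta_action::resend_only;
    default:
        // Any other error (for example corruption) is treated as a failure.
        return meta_action::fail_and_cleanup;
    }
}
```

Putting the failure case in `default` means an error code that is not listed explicitly ends in cleanup rather than an endless retry, which matches the "if any other error happens" wording of the description.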

There are two intervals for resending the bulk load request to the primary: the default is 10 seconds and the short interval is 5 seconds. The short interval is used when the app status is downloaded or ingesting (the partition status may be ingesting). When a partition is ingesting, the meta server wants to get states faster.
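
As a minimal sketch of the interval rule, assuming a simplified status enum: `app_bulk_load_status`, `k_default_interval`, `k_short_interval` and `resend_interval` are hypothetical names, and only the 10 second and 5 second values come from the description.

```cpp
#include <chrono>

// Hypothetical, simplified status enum; the real bulk_load_status has its own values.
enum class app_bulk_load_status { downloading, downloaded, ingesting, paused, failed };

// Interval values taken from the description: 10 s by default, 5 s when short.
const std::chrono::seconds k_default_interval{10};
const std::chrono::seconds k_short_interval{5};

// Picks the short interval while the app is downloaded or ingesting,
// so that meta collects ingestion states faster.
inline std::chrono::seconds resend_interval(app_bulk_load_status status)
{
    const bool use_short = status == app_bulk_load_status::downloaded ||
                           status == app_bulk_load_status::ingesting;
    return use_short ? k_short_interval : k_default_interval;
}
```

The returned value would play the role of the `interval` argument in a call such as `try_resend_bulk_load_request(app_name, pid, interval)`.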

This pull request adds many function declarations; the unit tests will be implemented in a later pull request.

@hycdong hycdong marked this pull request as ready for review May 15, 2020 09:39
pid,
err.to_string());
try_rollback_to_downloading(app_name, pid);
try_resend_bulk_load_request(app_name, pid, interval);
Member

How do we deal with a nonrecoverable error, e.g. when files in remote storage are corrupt?

Contributor Author

On L353 ~ L363, the app's and partitions' bulk load status will be set to bulk_load_failed (to be implemented in the function handle_bulk_load_failed), and the request is resent to the replica server to clean up bulk load states and context.

acelyc111
acelyc111 previously approved these changes May 18, 2020