Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve error handling when reading request and Region meta change in concurrency #1101

Merged

Conversation

JaySon-Huang
Copy link
Contributor

@JaySon-Huang JaySon-Huang commented Sep 17, 2020

What problem does this PR solve?

Issue Number: close #1095

Problem Summary:
If region meta changed between learner read and get streams from storage, we can not ensure the correctness of read data. We should retry those key ranges.
Before this PR, if the super batch is enabled and happens to this error, an error will directly be thrown to users. Users need to retry their queries. We should handle those retry inside TiFlash.

What is changed and how it works?

  • If we happen to RegionException after read from storage, and super batch is enabled, then
    • clear streams read from local
    • push those regions into region_retry and read from remote storage later
  • Refine some codes for FailPoints

Related changes

  • Need to cherry-pick to the release branch 4.0

Check List

Tests

  • Integration test

Side effects

  • Performance regression
    • If we happen to this situation, it takes more time to read those key-ranges from remote. And we don't apply filters on remote storage now

Release note

  • Improve error handling when reading request and Region meta change in concurrency

@JaySon-Huang JaySon-Huang added type/enhancement The issue or PR belongs to an enhancement. needs-cherry-pick-release-4.0 PR which needs to be cherry-picked to release-4.0 labels Sep 17, 2020
@JaySon-Huang JaySon-Huang self-assigned this Sep 17, 2020
@JaySon-Huang JaySon-Huang changed the title Refine retry when validation fail Improve error handling when reading request and Region meta change in concurrency Sep 17, 2020
@JaySon-Huang JaySon-Huang force-pushed the refine_retry_when_validation_fail branch from f9fd60b to fd082d8 Compare September 17, 2020 06:28
@JaySon-Huang
Copy link
Contributor Author

/run-all-tests

{
/// Recover from region exception when super batch is enable
if (dag.isBatchCop())
{
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So even if there is only one region meet region error, TiFlash still needs to read all the region remotely?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can optimize it.
For the first time read region [1,2,...,10], if region [1,2] validate fail, then read [3,4,...,10] from local again and push [1,2] to region_retry.
If the second time of local read fails again, no matter how many regions fail, push [3,4,...,10] to region_retry.

What do you think about it? @windtalker

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You mean add an extra retry to handle this? Why can't we retry more than one time?

Copy link
Contributor Author

@JaySon-Huang JaySon-Huang Sep 17, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I worry that too much retry will make the whole process time out of control... Now we already have retry while doing learner read and reading from remote.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, let's retry it with limited times.

@JaySon-Huang
Copy link
Contributor Author

JaySon-Huang commented Sep 18, 2020

I will add some tests about retrying to read from local storage later.

@JaySon-Huang
Copy link
Contributor Author

/run-all-tests

Copy link
Contributor

@windtalker windtalker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@ti-srebot ti-srebot added the status/LGT1 Indicates that a PR has LGTM 1. label Sep 21, 2020
@JaySon-Huang JaySon-Huang removed the needs-cherry-pick-release-4.0 PR which needs to be cherry-picked to release-4.0 label Sep 21, 2020
@JaySon-Huang
Copy link
Contributor Author

/run-all-tests

1 similar comment
@JaySon-Huang
Copy link
Contributor Author

/run-all-tests

@JaySon-Huang JaySon-Huang merged commit b6a151a into pingcap:master Sep 21, 2020
@JaySon-Huang JaySon-Huang deleted the refine_retry_when_validation_fail branch September 21, 2020 08:58
JaySon-Huang added a commit that referenced this pull request Sep 21, 2020
… concurrency (#1101) (#1109)

* Improve error handling when reading request and Region meta change in concurrency
* Add retry from local storage
* Move definitions of FailPoints into cpp file

Signed-off-by: JaySon-Huang <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
status/LGT1 Indicates that a PR has LGTM 1. type/enhancement The issue or PR belongs to an enhancement.
Projects
None yet
3 participants