Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

copr: do not hide original error for MPP query that will not faillback to TiKV #29342

Merged
merged 2 commits into from
Nov 3, 2021

Conversation

windtalker
Copy link
Contributor

@windtalker windtalker commented Nov 2, 2021

What problem does this PR solve?

Issue Number: close #xxx

Problem Summary:
After #23078, in order to trigger fallback, when MPP query hit some error, the original error maybe hiden and instead just return a unified error named TiFlash server timeout, so in most of the case, when MPP query fails, user will get TiFlash server timeout, this is very confusing.

What is changed and how it works?

Only hide the original error message for mpp query if this query is allowed to fallback.

Check List

Tests

  • Unit test
  • Integration test
  • Manual test (add detailed scripts or steps below)
  • No code

Side effects

  • Performance regression: Consumes more CPU
  • Performance regression: Consumes more Memory
  • Breaking backward compatibility

Documentation

  • Affects user behaviors
  • Contains syntax changes
  • Contains variable changes
  • Contains experimental features
  • Changes MySQL compatibility

Release note

Return original error message for mpp query if that query is not allowed to fallback to TiKV.

@ti-chi-bot
Copy link
Member

ti-chi-bot commented Nov 2, 2021

[REVIEW NOTIFICATION]

This pull request has been approved by:

  • LittleFall
  • fzhedu

To complete the pull request process, please ask the reviewers in the list to review by filling /cc @reviewer in the comment.
After your PR has acquired the required number of LGTMs, you can assign this pull request to the committer in the list by filling /assign @committer in the comment to help you merge this pull request.

The full list of commands accepted by this bot can be found here.

Reviewer can indicate their review by submitting an approval review.
Reviewer can cancel approval by submitting a request changes review.

@ti-chi-bot ti-chi-bot added release-note Denotes a PR that will be considered when it comes time to generate release notes. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Nov 2, 2021
@windtalker windtalker requested a review from fzhedu November 2, 2021 06:47
@sre-bot
Copy link
Contributor

sre-bot commented Nov 2, 2021

Please follow PR Title Format:

  • pkg [, pkg2, pkg3]: what is changed

Or if the count of mainly changed packages are more than 3, use

  • *: what is changed

After you have format title, you can leave a comment /run-check_title to recheck it

@windtalker windtalker changed the title do not hide original error for MPP query that will not faillback to TiKV copr: do not hide original error for MPP query that will not faillback to TiKV Nov 2, 2021
@windtalker windtalker requested a review from LittleFall November 2, 2021 06:49
@zhouqiang-cl
Copy link
Contributor

/run-check_title

@ti-chi-bot ti-chi-bot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Nov 2, 2021
Copy link
Contributor

@LittleFall LittleFall left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR will eliminate the origin error message.


When fallback takes effect, error from tiflash will be recorded in the warnings.

But when we want to support other error types besides ErrTiFlashServerTimeout, we need another way to record warnings.

cc @windtalker @xuyifangreeneyes

@xuyifangreeneyes
Copy link
Contributor

@LittleFall Why will the pr eliminate the origin error message?(or do you mean that #22459 will eliminate the origin error message). If fallback doesn't take effect, then the user receives the origin error message. If fallback takes effect, tidb server logs the original error and the user receives TiFlash server timeout warning.

Besides, when the query is allowed to fall back, maybe adding the original error in the warnings is better than adding the unified error TiFlash server timeout.

@windtalker
Copy link
Contributor Author

@LittleFall Why will the pr eliminate the origin error message?(or do you mean that #22459 will eliminate the origin error message). If fallback doesn't take effect, then the user receives the origin error message. If fallback takes effect, tidb server logs the original error and the user receives TiFlash server timeout warning.

Besides, when the query is allowed to fall back, maybe adding the original error in the warnings is better than adding the unified error TiFlash server timeout.

My mistake, it was #23078 that hide the original error to make fallback work more widely.

@LittleFall
Copy link
Contributor

@LittleFall Why will the pr eliminate the origin error message?(or do you mean that #22459 will eliminate the origin error message). If fallback doesn't take effect, then the user receives the origin error message. If fallback takes effect, tidb server logs the original error and the user receives TiFlash server timeout warning.

Besides, when the query is allowed to fall back, maybe adding the original error in the warnings is better than adding the unified error TiFlash server timeout.

You are right, we should not cover up the original error with tiflash server timeout. This is what 23078 does, not this pr.

I want to explain if we want to change the behavior back (throw the original error, instead of tiflash server timeout) in this pr, Do you have any suggestions about the implementation?

Copy link
Contributor

@fzhedu fzhedu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

@fzhedu fzhedu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rest LGTM

@windtalker
Copy link
Contributor Author

windtalker commented Nov 3, 2021

what about this case? https://github.com/pingcap/tidb/pull/23078/files#diff-dec6808c7659579faf2c9c6f9d8b7a7e0d7095b61b9a07cfd2cbc9546a3c7c86R394

This is batchCop related, MPP query will never reach there.

Copy link
Contributor

@LittleFall LittleFall left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@ti-chi-bot ti-chi-bot added the status/LGT1 Indicates that a PR has LGTM 1. label Nov 3, 2021
@windtalker
Copy link
Contributor Author

/run-check_dev_2

@LittleFall
Copy link
Contributor

@LittleFall Why will the pr eliminate the origin error message?(or do you mean that #22459 will eliminate the origin error message). If fallback doesn't take effect, then the user receives the origin error message. If fallback takes effect, tidb server logs the original error and the user receives TiFlash server timeout warning.
Besides, when the query is allowed to fall back, maybe adding the original error in the warnings is better than adding the unified error TiFlash server timeout.

You are right, we should not cover up the original error with tiflash server timeout. This is what 23078 does, not this pr.

I want to explain if we want to change the behavior back (throw the original error, instead of tiflash server timeout) in this pr, Do you have any suggestions about the implementation?

it's not this pr's fault, record a issue: #29396

Copy link
Contributor

@fzhedu fzhedu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@ti-chi-bot ti-chi-bot added status/LGT2 Indicates that a PR has LGTM 2. and removed status/LGT1 Indicates that a PR has LGTM 1. labels Nov 3, 2021
@fzhedu
Copy link
Contributor

fzhedu commented Nov 3, 2021

what about this case? https://github.com/pingcap/tidb/pull/23078/files#diff-dec6808c7659579faf2c9c6f9d8b7a7e0d7095b61b9a07cfd2cbc9546a3c7c86R394

This is batchCop related, MPP query will never reach there.

remote read can not reach here?

@windtalker windtalker force-pushed the tiflash_server_timeout branch from 7b82b48 to af6f452 Compare November 3, 2021 07:30
@windtalker
Copy link
Contributor Author

/merge

@ti-chi-bot
Copy link
Member

This pull request has been accepted and is ready to merge.

Commit hash: af6f452

@ti-chi-bot ti-chi-bot added the status/can-merge Indicates a PR has been approved by a committer. label Nov 3, 2021
@windtalker
Copy link
Contributor Author

/run-check_dev_2

@ti-chi-bot ti-chi-bot merged commit a0cd121 into pingcap:master Nov 3, 2021
@xuyifangreeneyes
Copy link
Contributor

I want to explain if we want to change the behavior back (throw the original error, instead of tiflash server timeout) in this pr, Do you have any suggestions about the implementation?

I haven't thought of a good way to solve the problem. Maybe we should list the errors that need fallback rather than convert all the errors to ErrTiFlashServerTimeout. But the errors can be different errors (such as rpc errors and so on) and it may be hard to list them. At least showing the original error instead of ErrTiFlashServerTimeout in the warnings as #29396 suggests is an improvement.

@windtalker windtalker deleted the tiflash_server_timeout branch November 16, 2021 01:23
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
release-note Denotes a PR that will be considered when it comes time to generate release notes. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. status/can-merge Indicates a PR has been approved by a committer. status/LGT2 Indicates that a PR has LGTM 2.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants