encoding: skip utf8 charset validation in some cases #31061
[REVIEW NOTIFICATION] This pull request has been approved by:
To complete the pull request process, please ask the reviewers in the list to review by filling
The full list of commands accepted by this bot can be found here. Reviewer can indicate their review by submitting an approval review.
Code Coverage Details: https://codecov.io/github/pingcap/tidb/commit/eea4fd7ef1647df44df1508cf1db787be772c06e
parser/charset/encoding.go
@@ -29,6 +29,16 @@ func IsSupportedEncoding(charset string) bool {
	return ok
}

// FindEncodingUTF8AsNoop finds the encoding according to charset
// except that utf-8 is treated as binary encoding.
func FindEncodingUTF8AsNoop(charset string) Encoding {
- Please add a test case for this function.
- The comment is still confusing, to me at least: in which case is "utf-8 treated as binary encoding"?
- I think the logic here is simple enough...?
- Binary encoding is a noop encoding. Unlike the utf-8 encoding, all of its methods are trivial, including Transform(), IsValid() and others, so these operations should cost O(1). I am trying to skip string validation (before parsing or writing the result to the client) to see whether there is any improvement.
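To make the distinction concrete, here is a minimal sketch of a noop encoding next to a validating one. It is not the code from this PR: the `Encoding` interface is cut down to two methods, and `findEncoding`, `findEncodingUTF8AsNoop`, and the three encoding types are hypothetical stand-ins that only mirror the idea described above.

```go
package main

import (
	"fmt"
	"strings"
	"unicode/utf8"
)

// Encoding is a simplified, hypothetical stand-in for the parser's
// Encoding interface; the real one has more methods (e.g. Transform).
type Encoding interface {
	Name() string
	IsValid(src []byte) bool
}

// utf8Encoding validates by scanning the whole input: O(n).
type utf8Encoding struct{}

func (utf8Encoding) Name() string            { return "utf8mb4" }
func (utf8Encoding) IsValid(src []byte) bool { return utf8.Valid(src) }

// asciiEncoding also scans the input: O(n).
type asciiEncoding struct{}

func (asciiEncoding) Name() string { return "ascii" }
func (asciiEncoding) IsValid(src []byte) bool {
	for _, b := range src {
		if b >= utf8.RuneSelf {
			return false
		}
	}
	return true
}

// noopEncoding accepts any byte sequence without looking at it: O(1).
type noopEncoding struct{}

func (noopEncoding) Name() string        { return "binary" }
func (noopEncoding) IsValid([]byte) bool { return true }

// findEncoding is a hypothetical lookup that returns the validating
// encoding for each known charset and the noop encoding otherwise.
func findEncoding(charset string) Encoding {
	switch strings.ToLower(charset) {
	case "utf8", "utf8mb4":
		return utf8Encoding{}
	case "ascii":
		return asciiEncoding{}
	}
	return noopEncoding{}
}

// findEncodingUTF8AsNoop behaves like findEncoding, except that utf-8
// charsets map to the noop encoding so that validation is skipped.
func findEncodingUTF8AsNoop(charset string) Encoding {
	switch strings.ToLower(charset) {
	case "utf8", "utf8mb4":
		return noopEncoding{}
	}
	return findEncoding(charset)
}

func main() {
	invalid := []byte{0xff, 0xfe} // not a valid UTF-8 sequence

	fmt.Println(findEncoding("utf8mb4").IsValid(invalid))           // false
	fmt.Println(findEncodingUTF8AsNoop("utf8mb4").IsValid(invalid)) // true
	fmt.Println(findEncodingUTF8AsNoop("ascii").IsValid(invalid))   // false
}
```

The trade-off is visible in the second call: the noop path never inspects the bytes, so it is cheap, but it also accepts input that is not valid utf-8. Callers should use it only where validation is known to be unnecessary.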
> Binary encoding is a noop encoding. Different from utf-8 encoding, it means all the methods are trivial, including Transform(), IsValid() and others. The cost should be O(1) for these operations. I am trying to avoid string validation(before parsing or writing result to client) to see if there is any improvement.
Thanks! I think this can be added to the comment in some way...
@Yui-Song Could you try to validate the result using your benchmark?
> Thanks! I think this can be added to the comment in some way...
> @Yui-Song Could you try to validate the result using your benchmark?
I have told @tangenta how to validate it easily with Benchbot, a platform provided by the QA team for running benchmarks. All the benchmarks that the Perf Team runs daily can be run with it.
/run-build comment=true
Could you add some performance results? @tangenta
@bb7133 @Defined2014 I have compared the performance of e3c56b7 (the commit before #30288) and 4f8a041 (this PR). Here is the result:
I think the performance regression has been resolved by this PR.
/merge
This pull request has been accepted and is ready to merge. Commit hash: 71e4389
/run-check_dev_2
What problem does this PR solve?
Issue Number: close #31014
Problem Summary:
I suspect that utf8 validation in the common code path is the main cause of the performance regression, even though a faster version of utf8.Valid() is used.
What is changed and how it works?
This PR tries to reduce string validation as much as possible.
Check List
Tests
Side effects
Documentation
Release note