encoding: skip utf8 charset validation in some cases #31061

tangenta · 2021-12-27T12:23:28Z

What problem does this PR solve?

Issue Number: close #31014

Problem Summary:

I suspect that utf8 validation in the common code path is the main cause of the performance regression, even when a faster version of utf8.Valid() is used.

What is changed and how it works?

This PR tries to reduce string validation as much as possible.

Check List

Tests

Unit test
Integration test
Manual test (add detailed scripts or steps below)
No code

Side effects

Performance regression: Consumes more CPU
Performance regression: Consumes more Memory
Breaking backward compatibility

Documentation

Release note

None

ti-chi-bot · 2021-12-27T12:23:29Z

[REVIEW NOTIFICATION]

This pull request has been approved by:

Defined2014
xiongjiwei

To complete the pull request process, please ask the reviewers in the list to review by filling /cc @reviewer in the comment.
After your PR has acquired the required number of LGTMs, you can assign this pull request to the committer in the list by filling /assign @committer in the comment to help you merge this pull request.

The full list of commands accepted by this bot can be found here.

Reviewer can indicate their review by submitting an approval review.
Reviewer can cancel approval by submitting a request changes review.

sre-bot · 2021-12-27T12:40:03Z

Code Coverage Details: https://codecov.io/github/pingcap/tidb/commit/eea4fd7ef1647df44df1508cf1db787be772c06e

bb7133 · 2021-12-27T13:25:03Z

parser/charset/encoding.go

@@ -29,6 +29,16 @@ func IsSupportedEncoding(charset string) bool {
 	return ok
 }

+// FindEncodingUTF8AsNoop finds the encoding according to charset
+// except that utf-8 is treated as binary encoding.
+func FindEncodingUTF8AsNoop(charset string) Encoding {


Please add a test case for this function.

The comment is still confusing, at least to me: in which case is "utf-8 treated as binary encoding"?

I think the logic here is simple enough...?

Binary encoding is a no-op encoding. Unlike the utf-8 encoding, all of its methods are trivial, including Transform(), IsValid() and others, so these operations should cost O(1). I am trying to avoid string validation (before parsing or when writing the result to the client) to see if there is any improvement.
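
A minimal sketch of the idea described above, not the code from this PR. The `Encoding` interface here is a trimmed-down, hypothetical stand-in for the one in parser/charset, and the `registry`, `findEncoding` and `findEncodingUTF8AsNoop` names, as well as the charset names, are assumptions made for illustration only.

```go
package main

import "unicode/utf8"

// Encoding is a hypothetical, reduced version of the charset interface;
// the real interface has more methods (Transform() and others).
type Encoding interface {
	Name() string
	IsValid(src []byte) bool
}

// utf8Encoding scans every byte, so IsValid costs O(len(src)).
type utf8Encoding struct{}

func (utf8Encoding) Name() string            { return "utf8" }
func (utf8Encoding) IsValid(src []byte) bool { return utf8.Valid(src) }

// binEncoding is the no-op encoding: it accepts any input in O(1).
type binEncoding struct{}

func (binEncoding) Name() string        { return "binary" }
func (binEncoding) IsValid([]byte) bool { return true }

// registry maps charset names to encodings (names assumed for this sketch).
var registry = map[string]Encoding{
	"utf8":    utf8Encoding{},
	"utf8mb4": utf8Encoding{},
	"binary":  binEncoding{},
}

// findEncoding returns the registered encoding. Falling back to the
// binary encoding for unknown charsets is an assumption of this sketch.
func findEncoding(charset string) Encoding {
	if enc, ok := registry[charset]; ok {
		return enc
	}
	return binEncoding{}
}

// findEncodingUTF8AsNoop behaves like findEncoding, except that a utf-8
// encoding is swapped for the no-op binary encoding so that callers
// skip validation entirely.
func findEncodingUTF8AsNoop(charset string) Encoding {
	enc := findEncoding(charset)
	if _, isUTF8 := enc.(utf8Encoding); isUTF8 {
		return binEncoding{}
	}
	return enc
}
```

With this shape, a caller that looks up "utf8" through `findEncodingUTF8AsNoop` never pays for a byte-by-byte scan, at the price of no longer rejecting malformed input on that path.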

Thanks! I think this can be added to the comment in some way...

@Yui-Song Could you try to validate the result using your benchmark?

I have told @tangenta how to validate it easily with Benchbot, a platform provided by the QA team for running benchmarks. All the benchmarks the Perf Team runs daily can be run with it.

@Yui-Song For the benchmark today, could you share the profiling files for both 7555536 and this commit?

tangenta · 2021-12-27T14:34:08Z

/run-build comment=true

Defined2014 · 2021-12-28T02:22:37Z

Could you add some performance results? @tangenta

tangenta · 2021-12-28T08:46:15Z

@bb7133 @Defined2014 I have compared the performance of e3c56b7 (the commit before #30288) and 4f8a041 (this PR). Here is the result:

bench_start_time        bench_type      thread      tps_tpm     commit_hash
2021-12-28 15:39:24     shenma          200         5639        4f8a0413d39
2021-12-28 14:34:23     shenma          200         5570        e3c56b75eae

I think the performance regression has been resolved by this PR.

xiongjiwei · 2021-12-28T09:10:33Z

/merge

ti-chi-bot · 2021-12-28T09:10:36Z

This pull request has been accepted and is ready to merge.

Commit hash: 71e4389

tangenta · 2021-12-28T09:40:25Z

/run-check_dev_2

encoding: skip utf8 charset validation in some cases

4f8a041

ti-chi-bot added release-note-none Denotes a PR that doesn't merit a release note. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Dec 27, 2021

tangenta requested review from xiongjiwei and Defined2014 and removed request for xiongjiwei December 27, 2021 12:33

bb7133 reviewed Dec 27, 2021

Defined2014 approved these changes Dec 28, 2021

ti-chi-bot added the status/LGT1 Indicates that a PR has LGTM 1. label Dec 28, 2021

*: add comment in detail

71e4389

xiongjiwei approved these changes Dec 28, 2021

ti-chi-bot added status/LGT2 Indicates that a PR has LGTM 2. and removed status/LGT1 Indicates that a PR has LGTM 1. labels Dec 28, 2021

ti-chi-bot added the status/can-merge Indicates a PR has been approved by a committer. label Dec 28, 2021

ti-chi-bot added 2 commits December 28, 2021 17:10

Merge branch 'master' into skip-utf8-valid

d4e2332

Merge branch 'master' into skip-utf8-valid

eea4fd7

ti-chi-bot merged commit 61d13b5 into pingcap:master Dec 28, 2021
