Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

encoding: skip utf8 charset validation in some cases #31061

Merged
merged 4 commits into from
Dec 28, 2021

Conversation

tangenta
Copy link
Contributor

@tangenta tangenta commented Dec 27, 2021

What problem does this PR solve?

Issue Number: close #31014

Problem Summary:

I suspect that the utf8 validation in common code path is the main reason to cause performance regression, even if a faster version utf8.Valid() is used.

What is changed and how it works?

This PR tries to reduce the string validation as less as possible.

Check List

Tests

  • Unit test
  • Integration test
  • Manual test (add detailed scripts or steps below)
  • No code

Side effects

  • Performance regression: Consumes more CPU
  • Performance regression: Consumes more Memory
  • Breaking backward compatibility

Documentation

  • Affects user behaviors
  • Contains syntax changes
  • Contains variable changes
  • Contains experimental features
  • Changes MySQL compatibility

Release note

None

@ti-chi-bot
Copy link
Member

ti-chi-bot commented Dec 27, 2021

[REVIEW NOTIFICATION]

This pull request has been approved by:

  • Defined2014
  • xiongjiwei

To complete the pull request process, please ask the reviewers in the list to review by filling /cc @reviewer in the comment.
After your PR has acquired the required number of LGTMs, you can assign this pull request to the committer in the list by filling /assign @committer in the comment to help you merge this pull request.

The full list of commands accepted by this bot can be found here.

Reviewer can indicate their review by submitting an approval review.
Reviewer can cancel approval by submitting a request changes review.

@ti-chi-bot ti-chi-bot added release-note-none Denotes a PR that doesn't merit a release note. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Dec 27, 2021
@tangenta tangenta requested review from xiongjiwei and Defined2014 and removed request for xiongjiwei December 27, 2021 12:33
@sre-bot
Copy link
Contributor

sre-bot commented Dec 27, 2021

@@ -29,6 +29,16 @@ func IsSupportedEncoding(charset string) bool {
return ok
}

// FindEncodingUTF8AsNoop finds the encoding according to charset
// except that utf-8 is treated as binary encoding.
func FindEncodingUTF8AsNoop(charset string) Encoding {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. Please add a test case for this function.
  2. The comment is still confusing to me at least, what is the case when "utf-8 is treated as binary encoding"?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. I think the logic here is simple enough...?
  2. Binary encoding is a noop encoding. Different from utf-8 encoding, it means all the methods are trivial, including Transform(), IsValid() and others. The cost should be O(1) for these operations. I am trying to avoid string validation(before parsing or writing result to client) to see if there is any improvement.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Binary encoding is a noop encoding. Different from utf-8 encoding, it means all the methods are trivial, including Transform(), IsValid() and others. The cost should be O(1) for these operations. I am trying to avoid string validation(before parsing or writing result to client) to see if there is any improvement.

Thanks! I think this can be added to the comment in some way...

@Yui-Song Could you try to validate the result using your benchmark?

Copy link
Contributor

@Yui-Song Yui-Song Dec 27, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! I think this can be added to the comment in some way...

@Yui-Song Could you try to validate the result using your benchmark?

I have told @tangenta how to validate it easily with Benchbot, a platform provided by QA team to do benchmarks. All the benchmarks the Perf Team run daily could be run with it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Yui-Song For the benchmark today, could you share the profiling files for both 7555536 and this commit?

@tangenta
Copy link
Contributor Author

/run-build comment=true

@ti-chi-bot ti-chi-bot added the status/LGT1 Indicates that a PR has LGTM 1. label Dec 28, 2021
@Defined2014
Copy link
Contributor

Defined2014 commented Dec 28, 2021

Could you add some performance result? @tangenta

@tangenta
Copy link
Contributor Author

@bb7133 @Defined2014 I have compared the performance between e3c56b7(the commit before #30288) and 4f8a041(this PR). Here is the result:

bench_start_time        bench_type      thread      tps_tpm     commit_hash
2021-12-28 15:39:24     shenma          200         5639        4f8a0413d39
2021-12-28 14:34:23     shenma          200         5570        e3c56b75eae

I think the performance regression have been solved by this PR.

@ti-chi-bot ti-chi-bot added status/LGT2 Indicates that a PR has LGTM 2. and removed status/LGT1 Indicates that a PR has LGTM 1. labels Dec 28, 2021
@xiongjiwei
Copy link
Contributor

/merge

@ti-chi-bot
Copy link
Member

This pull request has been accepted and is ready to merge.

Commit hash: 71e4389

@ti-chi-bot ti-chi-bot added the status/can-merge Indicates a PR has been approved by a committer. label Dec 28, 2021
@tangenta
Copy link
Contributor Author

/run-check_dev_2

@ti-chi-bot ti-chi-bot merged commit 61d13b5 into pingcap:master Dec 28, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
release-note-none Denotes a PR that doesn't merit a release note. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. status/can-merge Indicates a PR has been approved by a committer. status/LGT2 Indicates that a PR has LGTM 2.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

PR #30288 results in 5.4%-10% performance regression in read-heavy workloads
7 participants