resource_control: support dynamic calibrate resource #43098

Merged

Conversation

CabinfeverB
Contributor

@CabinfeverB CabinfeverB commented Apr 17, 2023

What problem does this PR solve?

Issue Number: ref #38825

Problem Summary:
This PR estimates the maximum RU capacity from an actual workload that is running on the user's cluster. The user can specify the time window to sample.

As in #42165, we consider only TiDB CPU or TiKV CPU as the bottleneck, and we assume that the consumed resources are linearly correlated with each other.
For each metrics sampling point, the PR calculates the RU quota from the RU statistics, the TiDB CPU statistics, and the TiKV CPU statistics. It then discards the highest 10% and the lowest 10% of the values and averages the rest. In addition, sampling points where CPU utilization is low are excluded from the calculation.

This PR also updates the PD client; ref tikv/pd#6298.
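
The trimming step described in the summary (drop the highest 10% and lowest 10% of the per-point quotas, then average) can be sketched as below. This is a minimal illustration, not the code from the PR: the function name trimmedMean, its signature, and the fallback for very small inputs are assumptions.

```go
package main

import (
	"fmt"
	"sort"
)

// trimmedMean drops the lowest and highest `rate` fraction of the samples
// and averages what is left. Name and signature are illustrative only.
func trimmedMean(samples []float64, rate float64) float64 {
	if len(samples) == 0 {
		return 0
	}
	// Copy before sorting so the caller's slice is not reordered.
	sorted := append([]float64(nil), samples...)
	sort.Float64s(sorted)

	drop := int(float64(len(sorted)) * rate)
	if 2*drop >= len(sorted) {
		// Assumed fallback: too few samples to trim, use the plain mean.
		drop = 0
	}
	kept := sorted[drop : len(sorted)-drop]

	sum := 0.0
	for _, v := range kept {
		sum += v
	}
	return sum / float64(len(kept))
}

func main() {
	// Hypothetical per-point RU quotas with one outlier at each end.
	quotas := []float64{100, 900, 1000, 1000, 1000, 1100, 1000, 1000, 1000, 5000}
	fmt.Println(trimmedMean(quotas, 0.1))
}
```

Trimming both tails keeps a single spike or idle sampling point from dominating the estimate, which a plain mean would not do.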

What is changed and how it works?

Check List

Tests

  • Unit test
  • Manual test (add detailed scripts or steps below)
    image

Side effects

  • Performance regression: Consumes more CPU
  • Performance regression: Consumes more Memory
  • Breaking backward compatibility

Documentation

  • Affects user behaviors
  • Contains syntax changes
  • Contains variable changes
  • Contains experimental features
  • Changes MySQL compatibility

Release note

Please refer to Release Notes Language Style Guide to write a quality release note.

CALIBRATE RESOURCE now supports dynamic calibration based on the user's actual workload over a specified time window.

Signed-off-by: Cabinfever_B <[email protected]>
@ti-chi-bot
Member

ti-chi-bot commented Apr 17, 2023

[REVIEW NOTIFICATION]

This pull request has been approved by:

  • glorv
  • nolouch

To complete the pull request process, please ask the reviewers in the list to review by filling /cc @reviewer in the comment.
After your PR has acquired the required number of LGTMs, you can assign this pull request to the committer in the list by filling /assign @committer in the comment to help you merge this pull request.

The full list of commands accepted by this bot can be found here.

Reviewer can indicate their review by submitting an approval review.
Reviewer can cancel approval by submitting a request changes review.

@ti-chi-bot ti-chi-bot added do-not-merge/needs-linked-issue release-note Denotes a PR that will be considered when it comes time to generate release notes. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. labels Apr 17, 2023
Signed-off-by: Cabinfever_B <[email protected]>
parser/parser.y Outdated
@@ -14827,13 +14836,54 @@ PlanReplayerStmt:
* CALIBRATE RESOURCE
*******************************************************************/
CalibrateResourceStmt:
"CALIBRATE" "RESOURCE" CalibrateResourceWorkloadOption
"CALIBRATE" "RESOURCE" CalibrateResourceWorkloadOption DynamicCalibrateOptionListOpt
Contributor

Should only provide CalibrateResourceWorkloadOption or DynamicCalibrateOptionListOpt but not both

Contributor Author

@CabinfeverB CabinfeverB Apr 17, 2023

They are optional, but I don't know how to check it in yacc. Maybe I can check it in calibrateResourceExec.

Contributor Author

Updated.

parser/ast/misc.go Outdated (resolved)
executor/calibrate_resource.go Outdated (resolved)
executor/calibrate_resource.go Outdated (resolved)
}
return nil
}
if len(e.optionList) == 2 {
Contributor

I'd like to replace this if-else with something like the following:

var start, end, dur *ptr
for _, op := range e.optionList {
  ...
}
if dur == nil { ...default for duration }
if start == nil { ...default for start }
if end == nil { ...default for end }

validate_start_end_duration()

...rest of the logic

And since the static and dynamic branches share little logic with each other, please wrap each of them in a separate function to avoid an overly long if..else.. block.

executor/calibrate_resource.go Outdated (resolved)
if idx >= len(tikvCPUs) || idx >= len(tidbCPUs) {
break
}
tikvQuota := totalKVCPUQuota / tikvCPUs[idx]
Contributor

What if tikvCPUs[idx] == 0 here? I think checking tikvCPUs[idx]/totalKVCPUQuota, that is, the CPU usage percentage, is a more ergonomic approach.
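
A small sketch of the point made in this comment: dividing the measured CPU by the quota (usage as a fraction) keeps the possibly-zero sample in the numerator, whereas quota divided by usage breaks on an idle sampling point. The helper name cpuUsage and its signature are hypothetical.

```go
package main

import "fmt"

// cpuUsage returns used/quota as a fraction of the quota.
// ok is false when the quota is not positive, so callers never divide by zero.
func cpuUsage(used, quota float64) (usage float64, ok bool) {
	if quota <= 0 {
		return 0, false
	}
	return used / quota, true
}

func main() {
	// An idle sampling point: used == 0 is harmless as a numerator.
	idle, ok := cpuUsage(0, 16)
	fmt.Println(idle, ok)

	// A sampling point using 4 of 16 cores.
	busy, ok := cpuUsage(4, 16)
	fmt.Println(busy, ok)
}
```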

if tikvQuota > lowUsageThreshold {
lowCount++
tikvCPULowCOunt++
if tidbQuota > lowUsageThreshold {
Contributor

I think if either of the two CPU usages is greater than lowUsageThreshold, we should keep the sample. There may be cluster topologies where the TiDB CPU quota >> the TiKV CPU quota or vice versa, in which case no samples would be valid here.

Contributor Author

How about adding a valuableUsageThreshold? If either of the two CPU usages is greater than valuableUsageThreshold, we can accept the sample.
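
One possible reading of the two-threshold rule discussed in this thread is sketched below. It is an interpretation, not the merged logic: valuableUsageThreshold = 0.2 appears in the PR diff, while the value 0.1 for lowUsageThreshold and the helper name keepSample are assumptions.

```go
package main

import "fmt"

const (
	lowUsageThreshold      = 0.1 // assumed value, not taken from the PR
	valuableUsageThreshold = 0.2 // value that appears in the PR diff
)

// keepSample decides whether a sampling point is used for calibration.
// Inputs are CPU usages expressed as fractions of the component's quota.
func keepSample(tidbUsage, tikvUsage float64) bool {
	// Either component being clearly busy is enough, which covers
	// topologies where one component has far more CPU quota than the other.
	if tidbUsage > valuableUsageThreshold || tikvUsage > valuableUsageThreshold {
		return true
	}
	// Otherwise require that neither component is nearly idle.
	return tidbUsage > lowUsageThreshold && tikvUsage > lowUsageThreshold
}

func main() {
	fmt.Println(keepSample(0.50, 0.01)) // TiDB busy, TiKV nearly idle
	fmt.Println(keepSample(0.15, 0.15)) // both moderately used
	fmt.Println(keepSample(0.15, 0.05)) // neither busy, TiKV nearly idle
}
```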

executor/calibrate_resource.go Outdated (resolved)
Comment on lines 357 to 375
func getRUPerSec(ctx context.Context, exec sqlexec.RestrictedSQLExecutor, startTime, endTime string) ([]float64, error) {
	query := fmt.Sprintf("SELECT value FROM METRICS_SCHEMA.resource_manager_resource_unit where time >= '%s' and time <= '%s' ORDER BY time desc", startTime, endTime)
	logutil.BgLogger().Info("getRUPerSec", zap.String("query", query))
	return getValuesFromMetrics(ctx, exec, query, "resource_manager_resource_unit")
}

func getTiDBCPUUsagePerSec(ctx context.Context, exec sqlexec.RestrictedSQLExecutor, startTime, endTime string) ([]float64, error) {
	query := fmt.Sprintf("SELECT sum(value) FROM METRICS_SCHEMA.process_cpu_usage where time >= '%s' and time <= '%s' and job like '%%tidb' GROUP BY time ORDER BY time desc", startTime, endTime)
	logutil.BgLogger().Info("getTiDBCPUUsagePerSec", zap.String("getTiDBCPUUsagePerSec", query))
	return getValuesFromMetrics(ctx, exec, query, "process_cpu_usage")
}

func getTiKVCPUUsagePerSec(ctx context.Context, exec sqlexec.RestrictedSQLExecutor, startTime, endTime string) ([]float64, error) {
	query := fmt.Sprintf("SELECT sum(value) FROM METRICS_SCHEMA.process_cpu_usage where time >= '%s' and time <= '%s' and job like '%%tikv' GROUP BY time ORDER BY time desc", startTime, endTime)
	logutil.BgLogger().Info("getTiKVCPUUsagePerSec", zap.String("getTiKVCPUUsagePerSec", query))
	return getValuesFromMetrics(ctx, exec, query, "process_cpu_usage")
}

func getNumberFromMetrics(ctx context.Context, exec sqlexec.RestrictedSQLExecutor, query, metrics string) (float64, error) {
Contributor

Maybe make the keyword case uniform within each SQL statement (the queries mix uppercase SELECT/FROM/ORDER BY with lowercase where/and/like/desc).

executor/calibrate_resource.go Outdated (resolved)
executor/calibrate_resource.go Outdated (resolved)
return nil
}
if len(e.optionList) == 2 {
if e.optionList[0].Tp != ast.CalibrateStartTime || (e.optionList[1].Tp != ast.CalibrateEndTime && e.optionList[1].Tp != ast.CalibrateDuration) {
Contributor

e.optionList[1].Tp != ast.CalibrateEndTime && e.optionList[1].Tp != ast.CalibrateDuration
I'm not sure why the same parameter is checked twice with &&.

Contributor Author

Because we are checking for the invalid case. The condition for a valid option is e.optionList[1].Tp == ast.CalibrateEndTime || e.optionList[1].Tp == ast.CalibrateDuration, and its negation is the && form used here.
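
The equivalence behind this reply is De Morgan's law: !(a || b) is the same as !a && !b. A minimal sketch with a hypothetical optionType enum (standing in for the ast option types) shows that the two forms always agree:

```go
package main

import "fmt"

// optionType is a hypothetical stand-in for the ast option type.
type optionType int

const (
	calibrateStartTime optionType = iota
	calibrateEndTime
	calibrateDuration
)

// validSecond states the accepted case positively.
func validSecond(tp optionType) bool {
	return tp == calibrateEndTime || tp == calibrateDuration
}

// invalidSecond is the negated form used in the error check.
func invalidSecond(tp optionType) bool {
	return tp != calibrateEndTime && tp != calibrateDuration
}

func main() {
	for _, tp := range []optionType{calibrateStartTime, calibrateEndTime, calibrateDuration} {
		// The last column is true for every value.
		fmt.Println(tp, validSecond(tp), invalidSecond(tp) == !validSecond(tp))
	}
}
```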

case CalibrateEndTime:
ctx.WriteKeyWord("END_TIME ")
if err := n.Ts.Restore(ctx); err != nil {
return errors.Annotate(err, "An error occurred while splicing DynamicCalibrateResourceOption EndTime")
Contributor

Maybe this can return the error directly, because the same check already exists at if err := option.Restore(ctx); err != nil {

parser/misc.go Outdated (resolved)
parser/parser.y Outdated (resolved)
Signed-off-by: Cabinfever_B <[email protected]>
@CabinfeverB
Contributor Author

/test unit-test


@ti-chi-bot ti-chi-bot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Apr 17, 2023
@hfxsd hfxsd assigned hfxsd and unassigned hfxsd Apr 18, 2023
Signed-off-by: Cabinfever_B <[email protected]>
@@ -99,7 +177,95 @@ func (e *calibrateResourceExec) Next(ctx context.Context, req *chunk.Chunk) erro

exec := e.ctx.(sqlexec.RestrictedSQLExecutor)
ctx = kv.WithInternalSourceType(ctx, kv.InternalTxnOthers)
if len(e.optionList) > 0 && e.workloadType != ast.WorkloadNone {
Contributor

Maybe this check can be moved into dynamicCalibrate?

Contributor Author

Updated.

@@ -715,6 +717,7 @@ import (
s3 "S3"
schedule "SCHEDULE"
staleness "STALENESS"
startTime "START_TIME"
Contributor

I noticed there is a startTS token defined below. Do we need to use startTS instead?

Contributor Author

IMO, START_TIME is better

Contributor

I prefer start_time too. StartTs represents the PD transaction timestamp in this context, whereas start_time is a real datetime, so it is better not to mix them.

Contributor

Makes sense

Signed-off-by: Cabinfever_B <[email protected]>
@ti-chi-bot ti-chi-bot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Apr 18, 2023
Signed-off-by: Cabinfever_B <[email protected]>
@ti-chi-bot ti-chi-bot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Apr 18, 2023
Signed-off-by: Cabinfever_B <[email protected]>
@ti-chi-bot ti-chi-bot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Apr 19, 2023
@CabinfeverB
Contributor Author

/test unit-test

@HuSharp
Contributor

HuSharp commented Apr 19, 2023

/retest

@@ -83,13 +91,72 @@ type baseResourceCost struct {
writeReqCount uint64
}

const (
valuableUsageThreshold = 0.2
Contributor

Please add comments for these constants

Signed-off-by: Cabinfever_B <[email protected]>
Contributor

@glorv glorv left a comment

LGTM

@ti-chi-bot ti-chi-bot added the status/LGT1 Indicates that a PR has LGTM 1. label Apr 19, 2023
Contributor

@HuSharp HuSharp left a comment

LGTM

@ti-chi-bot
Member

@HuSharp: Thanks for your review. The bot only counts approvals from reviewers and higher roles in list, but you're still welcome to leave your comments.

In response to this:

LGTM

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the ti-community-infra/tichi repository.

@HuSharp
Contributor

HuSharp commented Apr 19, 2023

/retest

go.mod Outdated
@@ -281,5 +281,6 @@ replace (
// fix potential security issue(CVE-2020-26160) introduced by indirect dependency.
github.com/dgrijalva/jwt-go => github.com/form3tech-oss/jwt-go v3.2.6-0.20210809144907-32ab6a8243d7+incompatible
github.com/pingcap/tidb/parser => ./parser
github.com/tikv/pd/client => github.com/CabinfeverB/pd/client v0.0.0-20230418121422-fb8aaee248a8
Member

why replace it?

Contributor Author

Two PRs are about to be merged in /pd/client, and I want to wait until they are merged.

Member

Better to use a separate PR to update it.

Contributor Author

Updated.

Signed-off-by: Cabinfever_B <[email protected]>
@ti-chi-bot ti-chi-bot added status/LGT2 Indicates that a PR has LGTM 2. and removed status/LGT1 Indicates that a PR has LGTM 1. labels Apr 19, 2023
@nolouch
Member

nolouch commented Apr 19, 2023

/test unit-test

Signed-off-by: Cabinfever_B <[email protected]>
@glorv
Contributor

glorv commented Apr 19, 2023

/merge

@ti-chi-bot
Member

This pull request has been accepted and is ready to merge.

Commit hash: 8172174

@ti-chi-bot ti-chi-bot added the status/can-merge Indicates a PR has been approved by a committer. label Apr 19, 2023
@ti-chi-bot ti-chi-bot merged commit 268901f into pingcap:master Apr 19, 2023
@CabinfeverB CabinfeverB deleted the resource_manager/dynamic_calibrate branch April 20, 2023 07:11
Labels
release-note Denotes a PR that will be considered when it comes time to generate release notes. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. status/can-merge Indicates a PR has been approved by a committer. status/LGT2 Indicates that a PR has LGTM 2.
7 participants