-
Notifications
You must be signed in to change notification settings - Fork 5.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
resource_control: support dynamic calibrate resource #43098
resource_control: support dynamic calibrate resource #43098
Conversation
Signed-off-by: Cabinfever_B <[email protected]>
Signed-off-by: Cabinfever_B <[email protected]>
Signed-off-by: Cabinfever_B <[email protected]>
Signed-off-by: Cabinfever_B <[email protected]>
Signed-off-by: Cabinfever_B <[email protected]>
[REVIEW NOTIFICATION] This pull request has been approved by:
To complete the pull request process, please ask the reviewers in the list to review by filling The full list of commands accepted by this bot can be found here. Reviewer can indicate their review by submitting an approval review. |
Signed-off-by: Cabinfever_B <[email protected]>
Signed-off-by: Cabinfever_B <[email protected]>
parser/parser.y
Outdated
@@ -14827,13 +14836,54 @@ PlanReplayerStmt: | |||
* CALIBRATE RESOURCE | |||
*******************************************************************/ | |||
CalibrateResourceStmt: | |||
"CALIBRATE" "RESOURCE" CalibrateResourceWorkloadOption | |||
"CALIBRATE" "RESOURCE" CalibrateResourceWorkloadOption DynamicCalibrateOptionListOpt |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should only provide CalibrateResourceWorkloadOption or DynamicCalibrateOptionListOpt but not both
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
They are optional, but I don't kown how to check it in yacc. Maybe I can check it in calibrateResourceExec
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
update it
executor/calibrate_resource.go
Outdated
} | ||
return nil | ||
} | ||
if len(e.optionList) == 2 { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd like to replace this if-else with something like following:
var start, end, dur *ptr
for _, op := range e.optionList[0] {
...
}
if duration == nil { ...default for duration}
if start == nil { ...default-for-start }
if end == nil { ...default-for-end }
validate_start_end_duration()
...rest logics
And Since The static and dynamic branch have little common logic with each other, please wrap both of them with a separate function to avoid too long if..else.. block
executor/calibrate_resource.go
Outdated
if idx >= len(tikvCPUs) || idx >= len(tidbCPUs) { | ||
break | ||
} | ||
tikvQuota := totalKVCPUQuota / tikvCPUs[idx] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What if tikvCPUs[idx] == 0
here, I think check tikvCPUs[idx]/totalKVCPUQuota
as the cpu usage percentage is a more ergonomic way
executor/calibrate_resource.go
Outdated
if tikvQuota > lowUsageThreshold { | ||
lowCount++ | ||
tikvCPULowCOunt++ | ||
if tidbQuota > lowUsageThreshold { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think if one of the two cpu usage is greater than the lowUsageThreshold, we should keep it. Maybe there are cluster topologies that tidb cpu quota >> tikv cpu
quota or vice verse, then no samples can be valid here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about add valuableUsageThreshold
? If one of the two cpu usage is greater than the valuableUsageThreshold
, we can accept it
executor/calibrate_resource.go
Outdated
func getRUPerSec(ctx context.Context, exec sqlexec.RestrictedSQLExecutor, startTime, endTime string) ([]float64, error) { | ||
query := fmt.Sprintf("SELECT value FROM METRICS_SCHEMA.resource_manager_resource_unit where time >= '%s' and time <= '%s' ORDER BY time desc", startTime, endTime) | ||
logutil.BgLogger().Info("getRUPerSec", zap.String("query", query)) | ||
return getValuesFromMetrics(ctx, exec, query, "resource_manager_resource_unit") | ||
} | ||
|
||
func getTiDBCPUUsagePerSec(ctx context.Context, exec sqlexec.RestrictedSQLExecutor, startTime, endTime string) ([]float64, error) { | ||
query := fmt.Sprintf("SELECT sum(value) FROM METRICS_SCHEMA.process_cpu_usage where time >= '%s' and time <= '%s' and job like '%%tidb' GROUP BY time ORDER BY time desc", startTime, endTime) | ||
logutil.BgLogger().Info("getTiDBCPUUsagePerSec", zap.String("getTiDBCPUUsagePerSec", query)) | ||
return getValuesFromMetrics(ctx, exec, query, "process_cpu_usage") | ||
} | ||
|
||
func getTiKVCPUUsagePerSec(ctx context.Context, exec sqlexec.RestrictedSQLExecutor, startTime, endTime string) ([]float64, error) { | ||
query := fmt.Sprintf("SELECT sum(value) FROM METRICS_SCHEMA.process_cpu_usage where time >= '%s' and time <= '%s' and job like '%%tikv' GROUP BY time ORDER BY time desc", startTime, endTime) | ||
logutil.BgLogger().Info("getTiKVCPUUsagePerSec", zap.String("getTiKVCPUUsagePerSec", query)) | ||
return getValuesFromMetrics(ctx, exec, query, "process_cpu_usage") | ||
} | ||
|
||
func getNumberFromMetrics(ctx context.Context, exec sqlexec.RestrictedSQLExecutor, query, metrics string) (float64, error) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe can uniform case of words in each statement.
executor/calibrate_resource.go
Outdated
return nil | ||
} | ||
if len(e.optionList) == 2 { | ||
if e.optionList[0].Tp != ast.CalibrateStartTime || (e.optionList[1].Tp != ast.CalibrateEndTime && e.optionList[1].Tp != ast.CalibrateDuration) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
e.optionList[1].Tp != ast.CalibrateEndTime && e.optionList[1].Tp != ast.CalibrateDuration
I'm not sure why the same parameter would have && judgment twice
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because we're trying to determine if it's wrong. The expression to assert truth is e.optionList[1].Tp == ast.CalibrateEndTime || e.optionList[1].Tp == ast.CalibrateDuration
case CalibrateEndTime: | ||
ctx.WriteKeyWord("END_TIME ") | ||
if err := n.Ts.Restore(ctx); err != nil { | ||
return errors.Annotate(err, "An error occurred while splicing DynamicCalibrateResourceOption EndTime") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe can return error directly, because there is same check at if err := option.Restore(ctx); err != nil {
Signed-off-by: Cabinfever_B <[email protected]>
Signed-off-by: Cabinfever_B <[email protected]>
Signed-off-by: Cabinfever_B <[email protected]>
/test unit-test |
1 similar comment
/test unit-test |
Signed-off-by: Cabinfever_B <[email protected]>
executor/calibrate_resource.go
Outdated
@@ -99,7 +177,95 @@ func (e *calibrateResourceExec) Next(ctx context.Context, req *chunk.Chunk) erro | |||
|
|||
exec := e.ctx.(sqlexec.RestrictedSQLExecutor) | |||
ctx = kv.WithInternalSourceType(ctx, kv.InternalTxnOthers) | |||
if len(e.optionList) > 0 && e.workloadType != ast.WorkloadNone { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe can put this check in dynamicCalibrate
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
update it
@@ -715,6 +717,7 @@ import ( | |||
s3 "S3" | |||
schedule "SCHEDULE" | |||
staleness "STALENESS" | |||
startTime "START_TIME" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I noticed there is a startTS
located below, do we need to use startTS
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMO, START_TIME is better
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I prefer start_time too. StartTs represent the pd txn timestamp in this context, but the start_time is a real datetime, better not to mix them.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense
Signed-off-by: Cabinfever_B <[email protected]>
Signed-off-by: Cabinfever_B <[email protected]>
Signed-off-by: Cabinfever_B <[email protected]>
Signed-off-by: Cabinfever_B <[email protected]>
Signed-off-by: Cabinfever_B <[email protected]>
/test unit-test |
/retest |
@@ -83,13 +91,72 @@ type baseResourceCost struct { | |||
writeReqCount uint64 | |||
} | |||
|
|||
const ( | |||
valuableUsageThreshold = 0.2 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add comments for these constants
Signed-off-by: Cabinfever_B <[email protected]>
Signed-off-by: Cabinfever_B <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
@HuSharp: Thanks for your review. The bot only counts approvals from reviewers and higher roles in list, but you're still welcome to leave your comments. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the ti-community-infra/tichi repository. |
/retest |
go.mod
Outdated
@@ -281,5 +281,6 @@ replace ( | |||
// fix potential security issue(CVE-2020-26160) introduced by indirect dependency. | |||
github.com/dgrijalva/jwt-go => github.com/form3tech-oss/jwt-go v3.2.6-0.20210809144907-32ab6a8243d7+incompatible | |||
github.com/pingcap/tidb/parser => ./parser | |||
github.com/tikv/pd/client => github.com/CabinfeverB/pd/client v0.0.0-20230418121422-fb8aaee248a8 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why replace it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
there are two PRs will be merged in /pd/client, I want to wait until they are merged
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
better use a separate pr to update it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
update
Signed-off-by: Cabinfever_B <[email protected]>
/test unit-test |
Signed-off-by: Cabinfever_B <[email protected]>
Signed-off-by: Cabinfever_B <[email protected]>
/merge |
This pull request has been accepted and is ready to merge. Commit hash: 8172174
|
What problem does this PR solve?
Issue Number: ref #38825
Problem Summary:
The maximum RU estimated by this PR is based on an actual and running workload by user. And user can set the time point.
Similar to #42165, we only consider TiDB CPU or TiKV CPU as bottleneck. Also, the resource consuming is linear co-related with each other.
For each metrics sampling point, the PR calculates the RU quota at each point in time using RU statistics, tidb CPU statistics, and tikv statistics. Then removes the 10% maximum and 10% minimum, then calculates the average. In addition, if CPU resource utilization is low at some point in time, it will not be included in the calculation
And ref tikv/pd#6298, update pd client.
What is changed and how it works?
Check List
Tests
Side effects
Documentation
Release note
Please refer to Release Notes Language Style Guide to write a quality release note.