Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

planner: fix the inappropriate heuristic rule to estimate the EQ selectivity when out of range #18543

Merged
merged 7 commits into from
Aug 5, 2020

Conversation

qw4990
Copy link
Contributor

@qw4990 qw4990 commented Jul 14, 2020

What problem does this PR solve?

Issue Number: close #18461

Problem Summary: If the estimated value is out of range, an inappropriate heuristic rule sel = 1/NDV*(modifyRows/totalRows) is used, which may cause unexpected low sel when a few rows are modified.

What is changed and how it works?

Change this rule to:

func outOfRangeEQSelectivity(ndv, modifyRows, totalRows int64) float64 {
	if modifyRows == 0 {
		return 0 // it must be 0 since the histogram contains the whole data
	}
	if ndv < outOfRangeBetweenRate {
		ndv = outOfRangeBetweenRate // avoid inaccurate selectivity caused by small NDV
	}
	selectivity := 1 / float64(ndv) // TODO: After extracting TopN from histograms, we can minus the TopN fraction here.
	if selectivity*float64(totalRows) > float64(modifyRows) {
		selectivity = float64(modifyRows) / float64(totalRows)
	}
	return selectivity
}

Check List

Tests

  • Unit test

Release note

  • planner: fix the inappropriate heuristic rule to estimate the EQ selectivity when out of range

@qw4990 qw4990 requested a review from a team as a code owner July 14, 2020 07:44
@qw4990 qw4990 requested review from SunRunAway and removed request for a team July 14, 2020 07:44
@sre-bot
Copy link
Contributor

sre-bot commented Jul 14, 2020

@sre-bot
Copy link
Contributor

sre-bot commented Jul 14, 2020

@qw4990 qw4990 requested review from eurekaka and winoros July 14, 2020 08:04
@qw4990 qw4990 added sig/planner SIG: Planner type/enhancement The issue or PR belongs to an enhancement. labels Jul 14, 2020
@codecov
Copy link

codecov bot commented Jul 14, 2020

Codecov Report

Merging #18543 into master will not change coverage.
The diff coverage is n/a.

@@             Coverage Diff             @@
##             master     #18543   +/-   ##
===========================================
  Coverage   79.4399%   79.4399%           
===========================================
  Files           546        546           
  Lines        148098     148098           
===========================================
  Hits         117649     117649           
  Misses        20971      20971           
  Partials       9478       9478           

@qw4990
Copy link
Contributor Author

qw4990 commented Jul 21, 2020

PTAL @eurekaka @winoros

Copy link
Contributor

@eurekaka eurekaka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand the rationale behind this new formula. For the old formula, selectivity = 1 / NDV * (ModifyRows / TotalRows), it assumes that the un-analyzed data rows are uniformly distributed and has similar distribution with analyzed data rows, this assumption sounds reasonable for me, while the new formula 1 / ndv indicates that the analyzed data rows contain the target value as well, which is contradictive with the truth that the target value is out of range / not in CMSketch.

@qw4990
Copy link
Contributor Author

qw4990 commented Jul 22, 2020

I don't understand the rationale behind this new formula. For the old formula, selectivity = 1 / NDV * (ModifyRows / TotalRows), it assumes that the un-analyzed data rows are uniformly distributed and has similar distribution with analyzed data rows, this assumption sounds reasonable for me, while the new formula 1 / ndv indicates that the analyzed data rows contain the target value as well, which is contradictive with the truth that the target value is out of range / not in CMSketch.

Actually which formula is more rational depends on what specific case it is.

If all ModifyRows rows are new-inserted and all of them are out-of-range, the formula 1/NDV is the correct formula of your assumption the un-analyzed data rows are uniformly distributed and has similar distribution with analyzed data rows.
Because for every new-inserted unique value, there should be Tot/NDV rows with the same value on it.

Since I can't find any specific case that makes 1/NDV * (Modi/Tot) be more rational than 1/NDV, and according to other DB's strategies(#18461), can we think 1/NDV is more robust than 1/NDV * (Modi/Tot)?

Copy link
Contributor

@eurekaka eurekaka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@ti-srebot ti-srebot added the status/LGT1 Indicates that a PR has LGTM 1. label Jul 27, 2020
Copy link
Member

@winoros winoros left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@ti-srebot ti-srebot removed the status/LGT1 Indicates that a PR has LGTM 1. label Aug 5, 2020
@ti-srebot ti-srebot added the status/LGT2 Indicates that a PR has LGTM 2. label Aug 5, 2020
@qw4990
Copy link
Contributor Author

qw4990 commented Aug 5, 2020

/run-all-tests

@qw4990 qw4990 merged commit aeee152 into pingcap:master Aug 5, 2020
ti-srebot pushed a commit to ti-srebot/tidb that referenced this pull request Aug 5, 2020
@ti-srebot
Copy link
Contributor

cherry pick to release-3.0 in PR #18994

@ti-srebot
Copy link
Contributor

cherry pick to release-3.1 in PR #18995

@ti-srebot
Copy link
Contributor

cherry pick to release-4.0 in PR #18997

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
component/statistics sig/planner SIG: Planner status/LGT2 Indicates that a PR has LGTM 2. type/enhancement The issue or PR belongs to an enhancement.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Fix the inappropriate heuristic method to estimate the equal condition selectivity
5 participants