opt, sql: improve automatic statistics for small tables #56615
Always assuming 500 rows seems like a nice and simple solution, but it does have a few complications/drawbacks in addition to the ones mentioned above. In particular, we need to figure out what to do about the column statistics. Column statistics consist of histograms, distinct counts, and null counts, and we use them in the optimizer to estimate the selectivity of predicates, estimate the cardinality of GROUP BY operations, and more. How would we change the histograms, distinct counts, and null counts to match up with a row count of 500? Presumably we would want to ignore any existing column statistics if the row count is very small (e.g., 0 or 1 rows), but at what point should we actually use those stats? When there are at least 5 rows? 10 rows? This complication could be one reason to prefer the alternative idea listed above of allowing more frequent refreshes of small tables if there are fewer than 4-5 recent refreshes.
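To make the coupling concrete, here is a minimal sketch of how an optimizer can derive equality-predicate selectivity from distinct and null counts, and why inflating the row count to 500 without also adjusting those counts skews the estimates. Everything here (`ColumnStat`, `equalitySelectivity`) is hypothetical illustration, not CockroachDB's actual code:

```go
package main

import "fmt"

// ColumnStat is a hypothetical container for the per-column statistics
// discussed above: row count, distinct count, and null count (histograms
// omitted for brevity).
type ColumnStat struct {
	RowCount      float64
	DistinctCount float64
	NullCount     float64
}

// equalitySelectivity estimates the fraction of rows matching `col = <const>`,
// assuming values are uniformly distributed across the distinct values and
// that NULLs never match an equality predicate.
func equalitySelectivity(s ColumnStat) float64 {
	if s.RowCount == 0 || s.DistinctCount == 0 {
		return 0
	}
	nonNullFraction := (s.RowCount - s.NullCount) / s.RowCount
	return nonNullFraction / s.DistinctCount
}

func main() {
	// Stats collected when the table really had 2 rows.
	s := ColumnStat{RowCount: 2, DistinctCount: 2, NullCount: 0}
	sel := equalitySelectivity(s)
	// If we force the row count to 500 but keep DistinctCount at 2, the
	// estimated output of `col = x` jumps from 1 row to 250 rows.
	fmt.Printf("estimated rows: %.0f\n", sel*500)
}
```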
Could you explain why always assuming a table has at least 500 rows would result in not using a lookup join when we should? If anything, I would expect it to have the opposite effect – using a lookup join when we shouldn't.
Are there cases that we know about where this would result in worse plans?
That could happen if the small table is the input to the lookup join. We might choose a hash or merge join because we think the lookup join will be too expensive, when in fact that would be the better plan.
I don't know of a particular issue/example, but I would think that if we had to choose between doing a full table scan of a small table vs. doing a bunch of index/lookup joins so that we could do a constrained scan of a larger table, that would not be worth it. There is definitely some tuning of the overall cost model that we can and should do, but it's dangerous to change the costs of different operators in isolation since it changes the relative costs of everything else. I'm not saying that increasing the cost of unconstrained scans is a bad idea; I just don't think we should do it without considering all the plan changes that will inevitably result.
I see, like joining a 10 row table with a 10000 row table. 10 lookup joins is clearly better than a full scan of the big table, but 500 lookup joins could be too expensive. Maybe 500 is too much, maybe it should be around the smallest number where we prefer index joins for single-key accesses (over full table scans).
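A back-of-the-envelope sketch of that trade-off (the per-row constants are invented for illustration and are not CockroachDB's real cost model):

```go
package main

import "fmt"

// Hypothetical per-row costs; the real cost model is far more nuanced.
const (
	seqScanCostPerRow = 1.0 // sequential scan of the big table
	lookupCostPerRow  = 5.0 // one index lookup per input row
)

// preferLookupJoin compares a lookup join driven by the small table against
// a full scan of the big table (as a hash or merge join would use).
func preferLookupJoin(smallRows, bigRows float64) bool {
	return smallRows*lookupCostPerRow < bigRows*seqScanCostPerRow
}

func main() {
	fmt.Println(preferLookupJoin(10, 10000)) // true: 50 < 10000
	fmt.Println(preferLookupJoin(500, 1000)) // false: 2500 > 1000
}
```

With 10 assumed input rows the lookup join wins easily, but an assumed floor of 500 rows can tip the comparison the other way even when the table really holds a handful of rows.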
One problem with this is that the tombstones can expire without any mutations to the table, and we won't know to refresh the stats. Maybe it should be a historical average of what we've seen rather than what was there when the stats last ran. Increasing the cost of unconstrained scans seems too arbitrary to me. A constrained scan can have the same problem (especially something like …).
That could work -- we could do a hybrid solution for the first issue: pick a minimum value that's larger than 1 but smaller than 500, and also trigger stats refreshes more frequently for small tables.
That makes sense. We're already keeping the last 4-5 stats refreshes, so we could calculate the average over all of those refreshes. To add another data point about how it's difficult to find the correct relative cost between operators: here's an example where a full table scan is a better plan than doing a constrained scan + index join: #46677. This PR was an attempt to fix similar issues by making index joins more expensive: #54768.
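On the averaging idea: a minimal sketch, assuming a hypothetical `TableStatistic` record per retained refresh (the real statistics cache stores much more than a row count):

```go
package main

import "fmt"

// TableStatistic is a hypothetical record of one stats collection;
// roughly the last 4-5 of these are kept per table.
type TableStatistic struct {
	RowCount uint64
}

// avgRowCount averages the row count across all retained refreshes, so a
// table that oscillates between empty and a few hundred rows is not costed
// as if it were permanently empty.
func avgRowCount(history []TableStatistic) float64 {
	if len(history) == 0 {
		return 0
	}
	var sum uint64
	for _, s := range history {
		sum += s.RowCount
	}
	return float64(sum) / float64(len(history))
}

func main() {
	history := []TableStatistic{{0}, {480}, {20}, {350}}
	fmt.Printf("%.1f\n", avgRowCount(history)) // 212.5
}
```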
Another idea for the problem of no stats right when a new cluster starts up: determine some crude stats after data is first loaded, based on the sizes of the tables on disk. The import process should also know how many rows were created.
I have been playing around with various changes. Increasing the row count artificially is problematic - you have to figure out a way to do that to the distinct counts as well somehow, otherwise it doesn't work in many cases (e.g. it won't help with choosing lookup joins). A change like this causes many plan changes and seems pretty risky. I am also worried that it would lead to bad plans when tables are legitimately small. A more promising change seems to be adding a small penalty for full table scans. It is enough to dissuade unnecessary full scans on small tables (e.g. for FKs). It still changes a few plans that will need to be checked.
This is what I was finding as well when I was working on the fix to #60493 (I initially tried to fix it by artificially inflating the row count). I think increasing the cost of full scans could help. I also want to explore the option of triggering more frequent stats refreshes for small tables, perhaps until there are at least 4 stats refreshes (which is how much history we keep around). I might have time to try that out next week.
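A rough, deterministic sketch of what that trigger could look like. The two thresholds mirror the existing `sql.stats.automatic_collection.min_stale_rows` and `sql.stats.automatic_collection.fraction_stale_rows` cluster settings, but the actual refresher is probabilistic, and the small-table override and all names here are hypothetical:

```go
package main

import "fmt"

const (
	minStaleRows      = 500 // sql.stats.automatic_collection.min_stale_rows
	fractionStaleRows = 0.2 // sql.stats.automatic_collection.fraction_stale_rows
	keptRefreshes     = 4   // approximate number of stats kept per table
)

// shouldRefresh decides whether enough rows have changed to trigger an
// automatic stats refresh. The proposed tweak: while a table still has fewer
// than keptRefreshes collected stats, refresh on any change instead of
// waiting for the 500-row threshold.
func shouldRefresh(rowsChanged, rowCount float64, numCollectedStats int) bool {
	if numCollectedStats < keptRefreshes {
		return rowsChanged > 0
	}
	threshold := fractionStaleRows * rowCount
	if threshold < minStaleRows {
		threshold = minStaleRows
	}
	return rowsChanged >= threshold
}

func main() {
	fmt.Println(shouldRefresh(10, 0, 1))   // true: new table, refresh eagerly
	fmt.Println(shouldRefresh(10, 499, 4)) // false: below the 500-row threshold
}
```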
61680: opt: increase cost of full scans r=RaduBerinde a=RaduBerinde

The main motivation for this change is to avoid seemingly bad plans when the tables are empty or very small, in particular FK checks doing full scans instead of using an index. This has caused confusion with customers and users on multiple occasions. In addition, even if the plans are good for small tables, they are risky as the tables could get bigger than our stats show. Also, there are other hidden costs to scanning more than we need (like contention).

This change makes a relatively small cost adjustment - it adds the cost of 10 rows to full scans. This is enough to discourage full scans on small tables, and to prefer constrained scans even when the row count is the same.

Informs #56615. Fixes #56661.

Release justification: low risk, high benefit changes to existing functionality.

Release note (performance improvement): fixed cases where the optimizer was doing unnecessary full table scans when the table was very small (according to the last collected statistics).

Co-authored-by: Radu Berinde <[email protected]>
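The shape of the adjustment is easy to sketch (invented constants and names, not the actual coster code):

```go
package main

import "fmt"

const (
	cpuCostFactor   = 0.01 // hypothetical per-row cost unit
	fullScanPenalty = 10   // the "cost of 10 rows" added to every full scan
)

// scanCost sketches the adjustment: a full scan is charged as if it read 10
// extra rows, so a constrained scan wins even when both would return every
// row the stats currently show.
func scanCost(rowCount float64, isFullScan bool) float64 {
	cost := rowCount * cpuCostFactor
	if isFullScan {
		cost += fullScanPenalty * cpuCostFactor
	}
	return cost
}

func main() {
	// With stats showing 1 row, a full scan now costs like 11 rows, so an
	// index-constrained scan of the same 1 row is strictly cheaper.
	fmt.Println(scanCost(1, true) > scanCost(1, false)) // true
}
```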
#61680 is the change that we wanted in 21.1 for this issue. Moving out of the 21.1 bucket.
We have recently had reports of the optimizer choosing a bad plan due to inaccurate or misleading statistics on small tables. This can happen for a couple of reasons:

1. If a table was just created (e.g., with `CREATE TABLE AS ...`), the first statistics collection will show that there are 0 rows.
2. To avoid constantly refreshing small tables, we have a setting that prevents refreshing until at least 500 rows have changed. So we won't refresh the stats for this table again until at least 500 rows have been inserted. Therefore, the table could have 499 rows, but the optimizer will still think it has 0 rows. (We actually assume that all tables have at least one row to prevent creating really bad plans, but an efficient plan for 1 row could be very different from an efficient plan for 499 rows.)

To fix the first issue, we have discussed a couple of ideas, such as assuming that every table has some minimum number of rows (e.g., 500).

To fix the second issue, there are some additional options, such as triggering more frequent stats refreshes for small tables (e.g., while there are fewer than 4-5 recent refreshes) or increasing the cost of unconstrained scans.
To deal effectively with this issue, we'll probably want to implement some combination of the above ideas.
cc @RaduBerinde, @nvanbenschoten
Epic: CRDB-16930
Jira issue: CRDB-2921
Jira issue: CRDB-13904