Support cost-based access path selection for the multi-valued index #46539
Enhancement
Statistics collection on multi-valued indexes is currently skipped in TiDB. To meet user requirements, we need to support it so that estimation for multi-valued index access paths becomes more accurate.
This issue tracks the work on this series of features and also serves as a simple design doc.
Background
Currently, statistics collection on multi-valued indexes is disabled outright. As a result, cost calculation for multi-valued index access paths is done without actual statistics, which means it relies on "pseudo estimation" in TiDB.
To achieve cost-based access path selection for the multi-valued index based on real statistics, several pieces of work remain to be done:
Much of the required infrastructure already exists, and we don't need to implement it again. For example:
We just need to enhance it, make it work correctly for the multi-valued index, and implement some new logic on top of it.
Design/Implementation Considerations
Resource consumption of statistics collection
In the standard v2 analyze method, the same samples are used to build statistics for all columns and indexes. If we collected statistics for multi-valued indexes this way, TiDB would have to load samples of the JSON columns into its own memory. However, JSON values may consume too much memory, and we can't handle that case well yet.
Therefore, we choose to reuse another piece of existing infrastructure in TiDB/TiKV: collecting statistics on an index through a sequential full scan in TiKV. Statistics are built in TiKV during the scan and then merged in TiDB. This way, TiDB only needs to collect and merge the statistics, which costs far fewer resources than collecting JSON value samples and building the full statistics in TiDB.
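The "build per scan range, then merge" idea can be illustrated with a minimal sketch. It is not TiDB's or TiKV's actual code: the `PartialIndexStats` type and both functions are hypothetical, and it assumes each scan range yields its index keys in sorted order and that ranges are merged in key order. Under those assumptions only small summaries, never the JSON values themselves, have to reach the merging side.

```python
from dataclasses import dataclass
from typing import Any, List


@dataclass
class PartialIndexStats:
    """Summary of one contiguous, sorted slice of index entries (hypothetical shape)."""
    row_count: int  # number of index entries in the slice
    ndv: int        # number of distinct keys in the slice
    first_key: Any  # smallest key in the slice
    last_key: Any   # largest key in the slice


def build_partial(sorted_keys: List[Any]) -> PartialIndexStats:
    """Build a summary in one sequential pass over the sorted keys of a slice."""
    if not sorted_keys:
        raise ValueError("slice must not be empty")
    ndv = 1
    for prev, cur in zip(sorted_keys, sorted_keys[1:]):
        if cur != prev:  # a new distinct key starts here
            ndv += 1
    return PartialIndexStats(len(sorted_keys), ndv, sorted_keys[0], sorted_keys[-1])


def merge_partials(parts: List[PartialIndexStats]) -> PartialIndexStats:
    """Merge slice summaries, given in key order, into one index-level summary."""
    if not parts:
        raise ValueError("nothing to merge")
    head = parts[0]
    merged = PartialIndexStats(head.row_count, head.ndv, head.first_key, head.last_key)
    for part in parts[1:]:
        merged.row_count += part.row_count
        merged.ndv += part.ndv
        if part.first_key == merged.last_key:
            merged.ndv -= 1  # the boundary key was counted in both slices
        merged.last_key = part.last_key
    return merged


# Three slices of a sorted index; key 3 straddles the first boundary.
slices = [[1, 1, 2, 3], [3, 3, 4], [5, 6]]
merged = merge_partials([build_partial(s) for s in slices])
```

A real implementation would also merge histograms and other sketches; this example covers only the row count and NDV.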
Specialness of the multi-valued index
By definition, a multi-valued index can contain multiple entries for a single table row, so its row count and NDV may be higher than the table row count.
This requires us to treat it differently and carefully in many places. For example, when collecting (and storing) statistics, we must avoid using the index row count as the table-level row count. And when maintaining and using the statistics, we must allow the row count and NDV of the multi-valued index to exceed the table-level row count and NDV.
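A small worked example shows why the table row count cannot serve as an upper bound here. It assumes a simplified model in which every element of a row's JSON array produces one index entry and no array repeats an element; the `index_entries` helper and the sample data are hypothetical.

```python
import json
from typing import Dict, List, Tuple


def index_entries(rows: Dict[int, str]) -> List[Tuple[int, int]]:
    """Expand each row's JSON array into sorted (element, row_id) index entries."""
    entries = []
    for row_id, doc in rows.items():
        for element in json.loads(doc):
            entries.append((element, row_id))
    return sorted(entries)


# Three table rows, each holding a JSON array in the indexed column.
rows = {1: "[1, 2, 3]", 2: "[3, 4]", 3: "[5, 6, 7]"}
entries = index_entries(rows)

table_row_count = len(rows)                              # 3
index_row_count = len(entries)                           # 8
index_ndv = len({element for element, _ in entries})     # 7
```

Here both the index row count (8) and the index NDV (7) exceed the table row count (3), so any logic that caps index statistics at the table-level row count would distort them.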
Development
Tests
Unit tests
details omitted here
Statistics collection
Analyze a table with multi-valued indexes and valid data
Analyze a partitioned table with multi-valued indexes and valid data
Statistics loading
Estimation
Scenario tests
Construct a table with
Construct queries with WHERE conditions so that
After statistics are collected, TiDB should choose the more efficient access path when the row count difference between the access paths is large.
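The expectation in such a scenario test could be expressed along these lines. This is only a sketch: the `expected_path` helper, the path names, and the `min_ratio` threshold are hypothetical and not part of TiDB's test suite. The idea is that the test asserts a plan choice only when the actual row counts of the candidate paths differ by a large factor.

```python
from typing import Dict, Optional


def expected_path(actual_rows: Dict[str, int], min_ratio: float = 10.0) -> Optional[str]:
    """Return the path a test should expect, or None if the gap is too small to assert.

    actual_rows maps an access path name to the number of rows it would scan.
    min_ratio is a hypothetical threshold for "the difference is large".
    """
    if len(actual_rows) < 2:
        raise ValueError("need at least two candidate access paths")
    ranked = sorted(actual_rows.items(), key=lambda item: item[1])
    (best_name, best_rows), (_, second_rows) = ranked[0], ranked[1]
    if second_rows >= min_ratio * max(best_rows, 1):
        return best_name
    return None


# One path scans far fewer rows: the test should expect it to be chosen.
clear_case = expected_path({"idx_a": 50, "idx_b": 20000})
# Similar row counts: either plan is acceptable, so nothing is asserted.
close_case = expected_path({"idx_a": 900, "idx_b": 1000})
```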