Support cost-based access path selection for the multi-valued index #46539

time-and-fate · 2023-08-31T01:41:59Z

Enhancement

Now the statistics collection on the multi-valued index is skipped in tidb. Because of user requirements, we need to support this to make estimation for multi-valued index access path more accurate.
This issue is for tracking the work on the series of functionalities and also acts as a simple design doc.

Background

Currently, statistics collection on the multi-valued index is directly disabled. So cost calculation for the multi-valued index access paths is done without actual statistics, which means it's based on "pseudo estimation" in tidb.
To achieve cost-based access path selection for the multi-valued index based on real statistics, there are generally several parts of work remaining needs to be done:

Support collecting (and storing) statistics for multi-valued index
Support loading and maintaining the statistics in the memory
Support using these statistics to estimate row count

Among these things, many infrastructures are already there and we don't need to implement them again. Like:

Build statistics on columns/indexes using table row samples or full index data.
Maintain the statistics in the system table.
Load the statistics metadata (like row count, NDV, etc.) into tidb memory (stats cache).
Load the concrete statistics (like histograms, TopN, etc.) into tidb memory according to the query content synchronously or asynchronously.
Use the statistics to estimate the row count of a range on an index.
Use independence assumption to handle estimation for the composite expressions.

We just need to enhance them, make them work correctly for the multi-valued index, and implement some new logic based on them.

Design/Implementation Considerations

Resource consumption of statistics collection

In the standard method to collect statistics in v2 analyze, we should use the same samples to build stats for all columns/indexes. If we implement collecting statistics for multi-valued index in this way, it will need tidb to load the samples of JSON columns to tidb. However, the JSON values may cost too much memory, and we can't handle such case very well now.
Therefore, we choose to reuse another infrastructure in tidb/tikv here, which is collect statistics on an index by a sequential full scan on tikv. Statistics are built during this process in tikv and then merged in tidb. In this way, we only need to collect and merge statistics in tidb, which would cost much less resources in tidb than collecting JSON value samples and building full statistics in tidb.

Specialness of the multi-valued index

As the definition of multi-valued index, the row count and NDV of this index may be higher than the table row count.
This needs us to treat it differently and carefully in many places. For example, when collecting (and storing) statistics, we need to avoid using it as the table-level row count. And when maintaining and using the statistics, we need to allow the row count and NDV of the multi-valued index to exceed the table-level row count and NDV.

Development

Support collecting statistics for multi-valued index
- Build and send a separate v2 analyze index task for each multi-valued index in this table.
- Store the analyze result
  - Properly handle the specialness of the multi-valued index that its NDV and total row count may be higher than the table NDV and the table row count.
  - Make the multi-valued index analyze task and the normal v2 analyze task cooperate well.
Support loading statistics for multi-valued index
- Support async load statistics for multi-valued index
- Support sync load statistics for multi-valued index
  - Find needed expression index stats by implementing the logic to find virtual columns that are dependent on histogram needed normal columns.
Support using these statistics to estimate row count for multi-valued index access path
- For range estimation, properly handle the specialness of the multi-valued index.
- For expression estimation, make sure related expressions can make use of statistics on the multi-valued index to estimate the row count instead of using pseudo estimation.

Tests

Unit tests

details omitted here

Statistics collection

Analyze a table with multi-valued indexes and valid data

The analyze should succeed.
The normal v2 table analyze task and each multi-valued index analyze task should be recorded by the analyze jobs.
The table-level and index-level metadata (row count, NDV, null count, etc.) should be correct.
For each multi-valued index, the concrete statistics (histogram buckets and topn entries) should be built and as expected.
The above items should be tested on all supported data types.

Analyze a partitioned table with multi-valued indexes and valid data

It should merge global statistics for the multi-valued index.
The global index-level metadata and the concrete statistics should be built and as expected.

Statistics loading

On a table with multi-valued index and collected statistics, run a query that can make use of the multi-valued index access path.
- It can successfully load statistics for the multi-valued index after the query finishes based on the async loading mechanism.
- It can successfully load statistics for the multi-valued index for the estimation to make use of it based on the sync loading mechanism.

Estimation

On a table with multi-valued index and loaded statistics, run a query that can make use of the multi-valued index access path.
- For range scan operator on the multi-valued index, it can make use of concrete statistics (histogram buckets and topn entries) to estimate the row count.
- For expression-based estimation, it can make use of corresponding multi-valued index statistics to handle expression that satisfies the multi-valued index.

Scenario tests

Construct a table with

multiple normal indexes and multi-valued indexes
nonuniform data distribution

Construct queries with WHERE conditions so that

both normal index and multi-valued index access paths are potential efficient access paths
several multi-valued index access paths are potential efficient access paths

After collecting statistics, tidb should choose the more efficient access path if the row count difference between the access paths are large.

time-and-fate · 2023-09-13T06:51:39Z

/sig planner

time-and-fate · 2023-09-13T06:53:16Z

/remove-sig planner

time-and-fate · 2023-09-13T06:54:08Z

/sig planner

time-and-fate · 2023-09-13T06:54:22Z

/label sig/planner

ti-chi-bot · 2023-09-13T06:54:24Z

@time-and-fate: The label(s) sig/planner cannot be applied. These labels are supported: fuzz/sqlancer, challenge-program, compatibility-breaker, first-time-contributor, contribution, require-LGT3, good first issue, correctness, duplicate, proposal, security, needs-more-info, needs-cherry-pick-release-5.3, needs-cherry-pick-release-5.4, needs-cherry-pick-release-6.0, needs-cherry-pick-release-6.1, needs-cherry-pick-release-6.2, needs-cherry-pick-release-6.3, needs-cherry-pick-release-6.4, needs-cherry-pick-release-6.5, needs-cherry-pick-release-6.6, needs-cherry-pick-release-7.0, needs-cherry-pick-release-7.1, needs-cherry-pick-release-7.2, needs-cherry-pick-release-7.3, affects-5.3, affects-5.4, affects-6.0, affects-6.1, affects-6.2, affects-6.3, affects-6.4, affects-6.5, affects-6.6, affects-7.0, affects-7.1, affects-7.2, affects-7.3, may-affects-5.3, may-affects-5.4, may-affects-6.0, may-affects-6.1, may-affects-6.2, may-affects-6.3, may-affects-6.4, may-affects-6.5, may-affects-6.6, may-affects-7.0, may-affects-7.1, may-affects-7.2, may-affects-7.3.

In response to this:

/label sig/planner

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the ti-community-infra/tichi repository.

ref #46539

…test=pr/2223 (#47529) ref #46539

…on (#49937) ref #46539

) close #46539

…gcap#49852) close pingcap#46539

time-and-fate added type/enhancement The issue or PR belongs to an enhancement. sig/planner SIG: Planner labels Aug 31, 2023

time-and-fate self-assigned this Aug 31, 2023

time-and-fate mentioned this issue Sep 6, 2023

*: support collect and async load stats for mv index #46731

Merged

12 tasks

time-and-fate changed the title ~~Support collecting and loading statistics for the multi-valued index~~ Support cost-based access path selection for the multi-valued index Sep 12, 2023

time-and-fate removed the sig/planner SIG: Planner label Sep 13, 2023

time-and-fate added the sig/planner SIG: Planner label Sep 13, 2023

ti-chi-bot bot pushed a commit that referenced this issue Sep 13, 2023

*: support collect and async load stats for mv index (#46731)

fd7ff09

ref #46539

time-and-fate mentioned this issue Sep 20, 2023

planner: support sync loading for mv index stats #47146

Merged

13 tasks

ti-chi-bot bot pushed a commit that referenced this issue Sep 25, 2023

planner: support sync loading for mv index stats (#47146)

3a1112b

ref #46539

time-and-fate mentioned this issue Oct 10, 2023

planner: temporarily disable analyze mv index for the release | tidb-test=pr/2223 #47529

Merged

13 tasks

ti-chi-bot bot pushed a commit that referenced this issue Oct 12, 2023

planner: temporarily disable analyze mv index for the release | tidb-…

8dbfdfb

…test=pr/2223 (#47529) ref #46539

time-and-fate mentioned this issue Dec 29, 2023

planner: re-enable analyze for mv index and code prepare for estimation | tidb-test=pr/2272 #49937

Merged

13 tasks

ti-chi-bot bot pushed a commit that referenced this issue Jan 2, 2024

planner: re-enable analyze for mv index and code prepare for estimati…

2d7d83d

…on (#49937) ref #46539

ti-chi-bot bot closed this as completed in #49852 Jan 5, 2024

ti-chi-bot bot pushed a commit that referenced this issue Jan 5, 2024

planner, statistics: support estimation for mv index access path (#49852

bf166d9

) close #46539

AilinKid pushed a commit to AilinKid/tidb that referenced this issue Jan 17, 2024

planner, statistics: support estimation for mv index access path (pin…

bbe0fc1

…gcap#49852) close pingcap#46539

Oreoxmt mentioned this issue Feb 1, 2024

create-index: remove the description that no stats are collected for mv index pingcap/docs-cn#16472

Merged

18 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Support cost-based access path selection for the multi-valued index #46539

Support cost-based access path selection for the multi-valued index #46539

time-and-fate commented Aug 31, 2023 •

edited

Loading

time-and-fate commented Sep 13, 2023

time-and-fate commented Sep 13, 2023

time-and-fate commented Sep 13, 2023

time-and-fate commented Sep 13, 2023

ti-chi-bot bot commented Sep 13, 2023

Support cost-based access path selection for the multi-valued index #46539

Support cost-based access path selection for the multi-valued index #46539

Comments

time-and-fate commented Aug 31, 2023 • edited Loading

Enhancement

Background

Design/Implementation Considerations

Resource consumption of statistics collection

Specialness of the multi-valued index

Development

Tests

Unit tests

Statistics collection

Statistics loading

Estimation

Scenario tests

time-and-fate commented Sep 13, 2023

time-and-fate commented Sep 13, 2023

time-and-fate commented Sep 13, 2023

time-and-fate commented Sep 13, 2023

ti-chi-bot bot commented Sep 13, 2023

time-and-fate commented Aug 31, 2023 •

edited

Loading