Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support cost-based access path selection for the multi-valued index #46539

Closed
time-and-fate opened this issue Aug 31, 2023 · 5 comments · Fixed by #49852
Closed

Support cost-based access path selection for the multi-valued index #46539

time-and-fate opened this issue Aug 31, 2023 · 5 comments · Fixed by #49852
Assignees
Labels
sig/planner SIG: Planner type/enhancement The issue or PR belongs to an enhancement.

Comments

@time-and-fate
Copy link
Member

time-and-fate commented Aug 31, 2023

Enhancement

Now the statistics collection on the multi-valued index is skipped in tidb. Because of user requirements, we need to support this to make estimation for multi-valued index access path more accurate.
This issue is for tracking the work on the series of functionalities and also acts as a simple design doc.

Background

Currently, statistics collection on the multi-valued index is directly disabled. So cost calculation for the multi-valued index access paths is done without actual statistics, which means it's based on "pseudo estimation" in tidb.
To achieve cost-based access path selection for the multi-valued index based on real statistics, there are generally several parts of work remaining needs to be done:

  1. Support collecting (and storing) statistics for multi-valued index
  2. Support loading and maintaining the statistics in the memory
  3. Support using these statistics to estimate row count

Among these things, many infrastructures are already there and we don't need to implement them again. Like:

  1. Build statistics on columns/indexes using table row samples or full index data.
  2. Maintain the statistics in the system table.
  3. Load the statistics metadata (like row count, NDV, etc.) into tidb memory (stats cache).
  4. Load the concrete statistics (like histograms, TopN, etc.) into tidb memory according to the query content synchronously or asynchronously.
  5. Use the statistics to estimate the row count of a range on an index.
  6. Use independence assumption to handle estimation for the composite expressions.

We just need to enhance them, make them work correctly for the multi-valued index, and implement some new logic based on them.

Design/Implementation Considerations

Resource consumption of statistics collection

In the standard method to collect statistics in v2 analyze, we should use the same samples to build stats for all columns/indexes. If we implement collecting statistics for multi-valued index in this way, it will need tidb to load the samples of JSON columns to tidb. However, the JSON values may cost too much memory, and we can't handle such case very well now.
Therefore, we choose to reuse another infrastructure in tidb/tikv here, which is collect statistics on an index by a sequential full scan on tikv. Statistics are built during this process in tikv and then merged in tidb. In this way, we only need to collect and merge statistics in tidb, which would cost much less resources in tidb than collecting JSON value samples and building full statistics in tidb.

Specialness of the multi-valued index

As the definition of multi-valued index, the row count and NDV of this index may be higher than the table row count.
This needs us to treat it differently and carefully in many places. For example, when collecting (and storing) statistics, we need to avoid using it as the table-level row count. And when maintaining and using the statistics, we need to allow the row count and NDV of the multi-valued index to exceed the table-level row count and NDV.

Development

  • Support collecting statistics for multi-valued index
    • Build and send a separate v2 analyze index task for each multi-valued index in this table.
    • Store the analyze result
      • Properly handle the specialness of the multi-valued index that its NDV and total row count may be higher than the table NDV and the table row count.
      • Make the multi-valued index analyze task and the normal v2 analyze task cooperate well.
  • Support loading statistics for multi-valued index
    • Support async load statistics for multi-valued index
    • Support sync load statistics for multi-valued index
      • Find needed expression index stats by implementing the logic to find virtual columns that are dependent on histogram needed normal columns.
  • Support using these statistics to estimate row count for multi-valued index access path
    • For range estimation, properly handle the specialness of the multi-valued index.
    • For expression estimation, make sure related expressions can make use of statistics on the multi-valued index to estimate the row count instead of using pseudo estimation.

Tests

Unit tests

details omitted here

Statistics collection

Analyze a table with multi-valued indexes and valid data

  • The analyze should succeed.
  • The normal v2 table analyze task and each multi-valued index analyze task should be recorded by the analyze jobs.
  • The table-level and index-level metadata (row count, NDV, null count, etc.) should be correct.
  • For each multi-valued index, the concrete statistics (histogram buckets and topn entries) should be built and as expected.
  • The above items should be tested on all supported data types.

Analyze a partitioned table with multi-valued indexes and valid data

  • It should merge global statistics for the multi-valued index.
  • The global index-level metadata and the concrete statistics should be built and as expected.

Statistics loading

  • On a table with multi-valued index and collected statistics, run a query that can make use of the multi-valued index access path.
    • It can successfully load statistics for the multi-valued index after the query finishes based on the async loading mechanism.
    • It can successfully load statistics for the multi-valued index for the estimation to make use of it based on the sync loading mechanism.

Estimation

  • On a table with multi-valued index and loaded statistics, run a query that can make use of the multi-valued index access path.
    • For range scan operator on the multi-valued index, it can make use of concrete statistics (histogram buckets and topn entries) to estimate the row count.
    • For expression-based estimation, it can make use of corresponding multi-valued index statistics to handle expression that satisfies the multi-valued index.

Scenario tests

Construct a table with

  • multiple normal indexes and multi-valued indexes
  • nonuniform data distribution

Construct queries with WHERE conditions so that

  • both normal index and multi-valued index access paths are potential efficient access paths
  • several multi-valued index access paths are potential efficient access paths

After collecting statistics, tidb should choose the more efficient access path if the row count difference between the access paths are large.

@time-and-fate time-and-fate added type/enhancement The issue or PR belongs to an enhancement. sig/planner SIG: Planner labels Aug 31, 2023
@time-and-fate time-and-fate self-assigned this Aug 31, 2023
@time-and-fate time-and-fate changed the title Support collecting and loading statistics for the multi-valued index Support cost-based access path selection for the multi-valued index Sep 12, 2023
@time-and-fate
Copy link
Member Author

/sig planner

@time-and-fate
Copy link
Member Author

/remove-sig planner

@time-and-fate time-and-fate removed the sig/planner SIG: Planner label Sep 13, 2023
@time-and-fate
Copy link
Member Author

/sig planner

@time-and-fate
Copy link
Member Author

/label sig/planner

@ti-chi-bot
Copy link

ti-chi-bot bot commented Sep 13, 2023

@time-and-fate: The label(s) sig/planner cannot be applied. These labels are supported: fuzz/sqlancer, challenge-program, compatibility-breaker, first-time-contributor, contribution, require-LGT3, good first issue, correctness, duplicate, proposal, security, needs-more-info, needs-cherry-pick-release-5.3, needs-cherry-pick-release-5.4, needs-cherry-pick-release-6.0, needs-cherry-pick-release-6.1, needs-cherry-pick-release-6.2, needs-cherry-pick-release-6.3, needs-cherry-pick-release-6.4, needs-cherry-pick-release-6.5, needs-cherry-pick-release-6.6, needs-cherry-pick-release-7.0, needs-cherry-pick-release-7.1, needs-cherry-pick-release-7.2, needs-cherry-pick-release-7.3, affects-5.3, affects-5.4, affects-6.0, affects-6.1, affects-6.2, affects-6.3, affects-6.4, affects-6.5, affects-6.6, affects-7.0, affects-7.1, affects-7.2, affects-7.3, may-affects-5.3, may-affects-5.4, may-affects-6.0, may-affects-6.1, may-affects-6.2, may-affects-6.3, may-affects-6.4, may-affects-6.5, may-affects-6.6, may-affects-7.0, may-affects-7.1, may-affects-7.2, may-affects-7.3.

In response to this:

/label sig/planner

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the ti-community-infra/tichi repository.

@time-and-fate time-and-fate added the sig/planner SIG: Planner label Sep 13, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
sig/planner SIG: Planner type/enhancement The issue or PR belongs to an enhancement.
Projects
None yet
1 participant