From 1e4dc25496caf12ee9e2e04e4a610fb0a2d5abf0 Mon Sep 17 00:00:00 2001
From: Marcus Gartner
Date: Thu, 7 May 2020 17:21:44 -0700
Subject: [PATCH] rfcs: partial indexes

Release note: None
---
 docs/RFCS/20200507_partial_indexes.md | 318 ++++++++++++++++++++++++++
 1 file changed, 318 insertions(+)
 create mode 100644 docs/RFCS/20200507_partial_indexes.md

diff --git a/docs/RFCS/20200507_partial_indexes.md b/docs/RFCS/20200507_partial_indexes.md
new file mode 100644
index 000000000000..d16069d85b8f
--- /dev/null
+++ b/docs/RFCS/20200507_partial_indexes.md
@@ -0,0 +1,318 @@
+- Feature Name: Partial Indexes
+- Status: draft
+- Start Date: 2020-05-07
+- Authors: mgartner
+- RFC PR: [#48557](https://github.com/cockroachdb/cockroach/pull/48557)
+- Cockroach Issue: [#9683](https://github.com/cockroachdb/cockroach/issues/9683)
+
+# Summary
+
+This RFC proposes the addition of partial indexes to CockroachDB. A partial
+index is an index with a boolean predicate expression that only indexes rows in
+which the predicate evaluates to true.
+
+Partial indexes are a common feature in RDBMSs. They can be beneficial in
+multiple use-cases. Partial indexes can:
+
+- Allow users to reduce the total size of the set of indexes required to satisfy
+  queries, by both reducing the number of rows indexed, and reducing the number
+  of columns indexed.
+- Avoid the overhead of updating an index for rows that are mutated and don't
+  satisfy the predicate.
+- Reduce the number of rows examined when scanning.
+- Provide a mechanism for ensuring uniqueness on a subset of rows in a table,
+  when paired with unique indexes.
+
+# Guide-level Explanation
+
+Partial indexes are created by including a _predicate expression_ via a `WHERE`
+clause in a `CREATE INDEX` statement. For example:
+
+```sql
+CREATE INDEX popular_products ON products (price) WHERE units_sold > 1000;
+```
+
+The `popular_products` index only indexes rows where the `units_sold` column has
+a value greater than `1000`.
+
+Partial indexes can only be used to satisfy a query that has a _filter
+expression_ (the query's `WHERE` clause) that implies the predicate expression.
+For example, consider the following queries:
+
+```sql
+SELECT max(price) FROM products;
+
+SELECT max(price) FROM products WHERE review_count > 100;
+
+SELECT max(price) FROM products WHERE units_sold > 500;
+
+SELECT max(price) FROM products WHERE units_sold > 1500;
+```
+
+Only the last query can utilize `popular_products`. Its filter expression,
+`units_sold > 1500`, _implies_ the predicate expression, `units_sold > 1000`.
+Every value for `units_sold` that is greater than `1500` is also greater than
+`1000`. Stated differently, the predicate expression _contains_ the filter
+expression: every row that satisfies the filter also satisfies the predicate.
+
+Attempting to force a partial index to be used for a query that does not imply
+the partial index's predicate will result in an error.
+
+There are some notable restrictions that are enforced on partial index
+predicates.
+
+1. They must result in a boolean.
+2. They can only refer to columns in the table being indexed.
+3. Functions used within predicates cannot be impure. For example, `now()` is
+   not allowed because its result depends on more than its arguments.
+
+# Reference-level Explanation
+
+This design covers 5 major aspects of implementing partial indexes: parsing,
+testing predicate implication, generating partial index scans, statistics, and
+mutation.
+
+## Parsing
+
+In order to ensure that predicates are valid (e.g. they result in booleans and
+contain no impure functions), we will use the same logic that validates `CHECK`
+constraints, `sqlbase.SanitizeVarFreeExpr`. The restrictions for `CHECK`
+constraints and partial index predicates are the same.
+
+## Testing Predicate Implication
+
+In order to use a partial index to satisfy a query, the filter expression of the
+query must _imply_ that the partial index predicate is true.
+If the predicate is not provably true, the rows to be returned may not exist in
+the partial index, and it cannot be used.
+
+### Exact matches
+
+First, we will check if any conjuncted expression in the filter is an exact
+match to the predicate.
+
+For example, consider the filter expression `a > 10 AND b < 100` and the partial
+index predicate `b < 100`. The second conjuncted expression in the filter, `b <
+100`, is an exact match to the predicate `b < 100`. Therefore this filter
+implies this predicate.
+
+We can test for pointer equality to check if the conjuncted expressions are an
+exact match. The `interner` ensures that identical expressions have the same
+memory address.
+
+### Non-exact matches
+
+There are cases when an expression implies a predicate, but is not an exact
+match.
+
+For example, `a > 10` implies `a > 0` because all values for `a` that satisfy
+`a > 10` also satisfy `a > 0`.
+
+Constraints and constraint sets can be leveraged to help perform implication
+checks. However, they are not a full solution. Constraint sets cannot represent
+a disjunction with different columns on each side.
+
+Consider the following example:
+
+```sql
+CREATE TABLE products (id INT PRIMARY KEY, price INT, units_sold INT, review_count INT);
+CREATE INDEX popular_prds ON products (price) WHERE units_sold > 1000 OR review_count > 100;
+```
+
+No constraint can be created for the top-level predicate expression of
+`popular_prds`.
+
+Therefore, constraints alone cannot help us determine that `popular_prds` can be
+scanned to satisfy any of the below queries:
+
+```sql
+SELECT COUNT(id) FROM products WHERE units_sold > 1500 AND price > 100;
+
+SELECT COUNT(id) FROM products WHERE review_count > 200 AND price < 100;
+
+SELECT COUNT(id) FROM products WHERE (units_sold > 1000 OR review_count > 200) AND price < 100;
+```
+
+In order to accommodate such expressions, we must walk the filter and predicate
+expression trees.
+At each predicate expression node, we will check if it is implied by the filter
+expression node.
+
+Postgres's [predtest library](https://github.com/postgres/postgres/blob/c9d29775195922136c09cc980bb1b7091bf3d859/src/backend/optimizer/util/predtest.c#L251-L287)
+uses this method to determine if a partial index can be used to satisfy a query.
+The logic Postgres uses for testing implication of conjunctions, disjunctions,
+and "atoms" (anything that is not an `AND` or `OR`) is as follows, where `A` is
+the filter side and `B` is the predicate side:
+
+    ("=>" means "implies")
+
+    atom A     => atom B      if:  B contains A
+    atom A     => AND-expr B  if:  A => each of B's children
+    atom A     => OR-expr B   if:  A => any of B's children
+
+    AND-expr A => atom B      if:  any of A's children => B
+    AND-expr A => AND-expr B  if:  A => each of B's children
+    AND-expr A => OR-expr B   if:  A => any of B's children OR
+                                     any of A's children => B
+
+    OR-expr A  => atom B      if:  each of A's children => B
+    OR-expr A  => AND-expr B  if:  A => each of B's children
+    OR-expr A  => OR-expr B   if:  each of A's children => any of B's children
+
+The time complexity of this check is `O(P * F)`, where `P` is the number of
+nodes in the predicate expression and `F` is the number of nodes in the filter
+expression.
+
+## Generating Partial Index Scans
+
+We will consider utilizing partial indexes for both unconstrained and
+constrained scans. Therefore, we'll need to modify both the `GenerateIndexScans`
+and `GenerateConstrainedScans` exploration rules (or make new, similar rules).
+
+In addition, we'll need to update exploration rules for zig-zag joins and
+inverted index scans.
+
+We'll remove redundant filters from the expression when generating a scan over a
+partial index.
+For example:
+
+```sql
+CREATE TABLE products (id INT PRIMARY KEY, price INT, units_sold INT, units_in_stock INT);
+CREATE INDEX idx1 ON products (price) WHERE units_sold > 1000;
+
+SELECT * FROM products WHERE price > 20 AND units_sold > 1000 AND units_in_stock > 0;
+```
+
+When generating the constrained scan over `idx1`, the `units_sold > 1000` filter
+can be removed from the outer `Select`, such that only the `units_in_stock > 0`
+filter remains.
+
+## Statistics
+
+The statistics builder must take into account the predicate expression, in
+addition to the filter expression, when generating statistics for a partial
+index scan. This is because the number of rows examined via a partial index scan
+is dependent on the predicate expression.
+
+For example, consider the following table, indexes, and query:
+
+```sql
+CREATE TABLE products (id INT PRIMARY KEY, price INT, units_sold INT, type TEXT);
+CREATE INDEX idx1 ON products (price) WHERE units_sold > 1000;
+CREATE INDEX idx2 ON products (price) WHERE units_sold > 1000 AND type = 'toy';
+
+SELECT COUNT(*) FROM products WHERE units_sold > 1000 AND type = 'toy' AND price > 20;
+```
+
+Both indexes are keyed on `price`, so a scan on `idx1` will scan `[/21 - ]`. A
+scan on `idx2` will have the same scan, `[/21 - ]`, but will examine fewer rows:
+only those where `type = 'toy'`. Therefore, the optimizer cannot rely solely on
+the scan constraints to determine the number of rows returned from scanning a
+partial index. It must also take into account the selectivity of the predicate
+to correctly determine that scanning `idx2` is a lower-cost plan than scanning
+`idx1`.
+
+We can estimate the number of rows returned from scanning a partial index with
+the following formula:
+
+    num_rows = rows_in_table * selectivity(predicate_expression) * selectivity(scan_constraint)
+
+This formula is similar to the one described by Michael Stonebraker in
+["The Case For Partial Indexes"](https://dsf.berkeley.edu/papers/ERL-M89-17.pdf).
+It has been simplified such that it does not make special considerations for
+columns that appear both in the partial index column set and in the partial
+index predicate. In a series of examples, this proved to have an insignificant
+effect on the resulting estimate.
+
+## Mutation
+
+Partial indexes only index rows that satisfy the partial index's predicate
+expression. In order to maintain this property, `INSERT`s, `UPDATE`s, and
+`DELETE`s to a table must update the partial index in the event that they change
+the candidacy of a row.
+
+In order for the execution engine to determine when a partial index needs to be
+updated, the optimizer will project boolean columns that represent whether or not
+partial indexes will be updated. This will operate similarly to `CHECK`
+constraint verification.
+
+### Insert
+
+If the row being inserted satisfies the predicate, write to the partial index.
+
+### Delete
+
+If the row being deleted satisfies the predicate, delete it from the partial
+index.
+
+### Updates
+
+Updates will require two columns to be projected for each partial index. The
+first is true if the old version of the row is in the index and needs to be
+deleted. The second is true if the new version of the row needs to be written to
+the index.
+
+If the current version of the row matches the predicate, and the updated version
+of the row does not match the predicate, delete it from the partial index.
+
+If the current version of the row does not match the predicate, and the updated
+version of the row does match, write it to the partial index.
+
+If both versions of the row match the predicate, delete the old entry and insert
+the new. Note that if the columns indexed in the partial index are not updated,
+there is no need to perform a delete and insert.
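
The update cases above can be sketched as a small decision function. This is only an illustration under stated assumptions, not the optimizer's or execution engine's actual code: the name `partialIndexUpdate` and its parameters `oldMatches`, `newMatches`, and `indexedColsChanged` are hypothetical, and the two returned booleans stand in for the two projected columns.

```go
package main

import "fmt"

// partialIndexUpdate is a hypothetical helper that mirrors the two boolean
// columns projected for an UPDATE on a table with one partial index.
//
//	oldMatches:         the current version of the row satisfies the predicate
//	newMatches:         the updated version of the row satisfies the predicate
//	indexedColsChanged: the UPDATE modifies a column indexed by the partial index
//
// It returns del (remove the old index entry) and put (write a new entry).
func partialIndexUpdate(oldMatches, newMatches, indexedColsChanged bool) (del, put bool) {
	switch {
	case oldMatches && !newMatches:
		// The row no longer qualifies: remove its entry.
		return true, false
	case !oldMatches && newMatches:
		// The row newly qualifies: add an entry.
		return false, true
	case oldMatches && newMatches:
		// The row stays in the index. The entry only needs to be rewritten
		// if an indexed column changed.
		return indexedColsChanged, indexedColsChanged
	default:
		// The row was never in the index and still is not.
		return false, false
	}
}

func main() {
	// A row whose update makes it stop satisfying the predicate.
	del, put := partialIndexUpdate(true, false, false)
	fmt.Println("leaves index:", del, put)

	// A row that stays in the index with unchanged indexed columns.
	del, put = partialIndexUpdate(true, true, false)
	fmt.Println("stays, unchanged:", del, put)
}
```

Returning a pair of booleans, rather than a single "needs update" flag, keeps the delete and the write independent, which matches the description of projecting two columns per partial index.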
+
+# Alternatives considered
+
+## Disallow `OR` operators in partial index predicates
+
+**This alternative is not being considered because it would make CRDB partial
+indexes incompatible with Postgres's partial indexes.**
+
+Testing for predicate implication could be simplified by disallowing `OR`
+operators in partial index predicates. A predicate expression without `OR` can
+always be represented by a constraint. Therefore, to test if a filter implies
+the predicate, we simply check if the predicate constraint contains any of the
+filter's constraints. Walking the expression trees would not be required.
+
+[SQL Server imposes this limitation for its form of partial
+indexes](https://docs.microsoft.com/en-us/sql/t-sql/statements/create-index-transact-sql?view=sql-server-ver15).
+
+Note that the `IN` operator would still be allowed, which provides a form of
+disjunction. The `IN` operator can easily be supported because it represents a
+disjunction on only one column, which a constraint _can_ represent.
+
+# Work Items
+
+Below is a list of the steps (PRs) to implement partial indexes, roughly
+ordered.
+
+- [ ] Add partial index predicate to internal index data structures, add parser
+  support for the `WHERE` clause in `CREATE INDEX`, and add a cluster flag,
+  defaulted to "off", that gates the feature.
+- [ ] Add simple equality implication check to optimizer when generating index
+  scans, in GenerateIndexScans.
+- [ ] Same, for GenerateConstrainedScans.
+- [ ] Add support for updating partial indexes on inserts.
+- [ ] Add support for updating partial indexes on deletes.
+- [ ] Add support for updating partial indexes on updates and upserts.
+- [ ] Add support for backfilling partial indexes.
+- [ ] Update the statistics builder to account for the selectivity of the partial index
+  predicate.
+- [ ] Add more advanced implication logic for filter and predicate expressions.
+- [ ] Add support in other index exploration rules:
+  - [ ] GenerateInvertedIndexScans
+  - [ ] GenerateZigZagJoin
+  - [ ] GenerateInvertedIndexZigZagJoin
+
+# Resources
+
+- [Postgres partial indexes documentation](https://www.postgresql.org/docs/current/indexes-partial.html)
+- [Postgres CREATE INDEX documentation](https://www.postgresql.org/docs/12/sql-createindex.html)
+- [Postgres predicate test source code](https://github.com/postgres/postgres/blob/master/src/backend/optimizer/util/predtest.c)
+- ["The Case For Partial Indexes", Michael Stonebraker](https://dsf.berkeley.edu/papers/ERL-M89-17.pdf)
+- [Use the Index Luke - Partial Indexes](https://use-the-index-luke.com/sql/where-clause/partial-and-filtered-indexes)
+
+# Unresolved questions
+
+- Is special work required for partial unique indexes?
+  - What about support for `ON CONFLICT`?