rfcs: partial indexes

Release note: None
cockroachdb · May 18, 2020 · efe6123 · efe6123
1 parent 61f6b0f
commit efe6123
Showing 1 changed file with 388 additions and 0 deletions.
diff --git a/docs/RFCS/20200507_partial_indexes.md b/docs/RFCS/20200507_partial_indexes.md
@@ -0,0 +1,388 @@
+- Feature Name: Partial Indexes
+- Status: draft
+- Start Date: 2020-05-07
+- Authors: mgartner
+- RFC PR: [#48557](https://github.com/cockroachdb/cockroach/pull/48557)
+- Cockroach Issue: [#9683](https://github.com/cockroachdb/cockroach/issues/9683)
+
+# Summary
+
+This RFC proposes the addition of partial indexes to CockroachDB. A partial
+index is an index with a boolean predicate expression that only indexes rows in
+which the predicate evaluates to true.
+
+Partial indexes are a common feature in RDBMSs. They can be beneficial in
+multiple use-cases. Partial indexes can:
+
+- Improve read performance, like normal indexes, without overhead for writes
+  that don't match the predicate expression.
+- Scan fewer rows than normal indexes on the same columns due to the partial
+  index indexing a subset of rows.
+- Reduce the total size of the set of indexes required to satisfy queries, by
+  both reducing the number of rows indexed, and reducing the number of columns
+  indexed.
+- Ensure uniqueness on a subset of rows in a table via `CREATE UNIQUE INDEX
+  ...`.
+
+# Guide-level Explanation
+
+## Usage
+
+Partial indexes are created by including a _predicate expression_ via `WHERE
+<predicate>` in a `CREATE INDEX` statement. For example:
+
+```sql
+CREATE INDEX popular_products ON products (price) WHERE units_sold > 1000
+```
+
+The `popular_products` index only indexes rows where the `units_sold` column has
+a value greater than `1000`.
+
+Partial indexes can only be used to satisfy a query that has a _filter
+expression_, `WHERE <filter>`, that implies the predicate expression. For
+example, consider the following queries:
+
+```sql
+SELECT max(price) FROM products
+
+SELECT max(price) FROM products WHERE review_count > 100
+
+SELECT max(price) FROM products WHERE units_sold > 500
+
+SELECT max(price) FROM products WHERE units_sold > 1500
+```
+
+Only the last query can utilize the partial index `popular_products`. Its filter
+expression, `units_sold > 1500`, _implies_ the predicate expression, `units_sold
+> 1000`. Every value for `units_sold` that is greater than `1500` is also
+greater than `1000`. Stated differently, the predicate expression _contains_ the
+filter expression.
+
+Note that CRDB, like Postgres, will perform a best-effort attempt to prove that
+a query filter expression implies a partial index predicate. It is not
+guaranteed to prove implication of arbitrarily complex expressions.
+
+## Valid Predicate Expressions
+
+There are some notable restrictions that are enforced on partial index
+predicates.
+
+1. They must result in a boolean.
+2. They can only refer to columns in the table being indexed.
+3. Functions used within predicates must be immutable. For example, `now()` is
+   not allowed because its result depends on more than its arguments.
+
+## Index Hints
+
+Index hinting with the `table@index` syntax with partial indexes behaves
+differently from hinting normal indexes. Hinting a partial index will behave as
+follows:
+
+* If the query filter expression implies the partial index predicate, the
+  partial index will be used in the query plan.
+* If not, the index hint will be ignored, and the optimizer will pick the best
+  index to satisfy the query.
+
+Different behavior is required because a partial index cannot always be used to
+satisfy a query. If CRDB cannot ascertain that the query filter predicate
+implies the partial index predicate, rows that should be in the result may not
+exist in the index. In this case, if the partial index was used, the query
+results could be incorrect.
+
+# Reference-level Explanation
+
+This design covers 5 major aspects of implementing partial indexes: parsing,
+testing predicate implication, generating partial index scans, statistics, and
+mutation.
+
+## Parsing
+
+In order to ensure that predicates are valid (e.g., they result in booleans and
+contain no impure functions), we will use the same logic that validates `CHECK`
+constraints, `sqlbase.SanitizeVarFreeExpr`. The restrictions for `CHECK`
+constraints and partial index predicates are the same.
+
+## Testing Predicate Implication
+
+In order to use a partial index to satisfy a query, the filter expression of the
+query must _imply_ that the partial index predicate is true. If the predicate is
+not provably true, the rows to be returned may not exist in the partial index,
+and it cannot be used. Note that other indexes, partial or not, could still be
+used to satisfy the query.
+
+### Exact matches
+
+First, we will check if any conjuncted-expression in the filter is an exact
+match to the predicate.
+
+For example, consider the filter expression `a > 10 AND b < 100` and the partial
+index predicate `b < 100`. The second conjuncted expression in the filter, `b <
+100`, is an exact match to the predicate `b < 100`. Therefore this filter
+implies this predicate.
+
+We can test for pointer equality to check if the conjuncted-expressions are an
+exact match. The `interner` ensures that identical expressions have the same
+memory address.
+
+### Non-exact matches
+
+There are cases when an expression implies a predicate, but is not an exact
+match.
+
+For example, `a > 10` implies `a > 0` because all values for `a` that satisfy
+`a > 10` also satisfy `a > 0`.
+
+Constraints and constraint sets can be leveraged to help perform implication
+checks. However, they are not a full solution. Constraint sets cannot represent
+a disjunction with different columns on each side.
+
+Consider the following example:
+
+```sql
+CREATE TABLE products (id INT PRIMARY KEY, price INT, units_sold INT, review_count INT)
+CREATE INDEX popular_prds ON t (price) WHERE units_sold > 1000 OR review_count > 100
+```
+
+No constraint can be created for the top-level predicate expression of
+`popular_prds`.
+
+Therefore, constraints alone cannot help us determine that `popular_prds` can be
+scanned to satisfy any of the below queries:
+
+```sql
+SELECT COUNT(id) FROM products WHERE units_sold > 1500 AND price > 100
+
+SELECT COUNT(id) FROM products WHERE review_count > 200 AND price < 100
+
+SELECT COUNT(id) FROM products WHERE (units_sold > 1000 OR review_count > 200) AND price < 100
+```
+
+In order to accommodate for such expressions, we must walk the filter and
+expression trees. At each predicate expression node, we will check if it is
+implied by the filter expression node.
+
+Postgres's [predtest library](https://github.com/postgres/postgres/blob/c9d29775195922136c09cc980bb1b7091bf3d859/src/backend/optimizer/util/predtest.c#L251-L287)
+uses this method to determine if a partial index can be used to satisfy a query.
+The logic Postgres uses for testing implication of conjunctions, disjunctions,
+and "atoms" (anything that is not an `AND` or `OR`) is as follows:
+
+    ("=>" means "implies")
+
+    atom A => atom B if:          A contains B
+    atom A => AND-expr B if:      A => each of B's children
+    atom A => OR-expr B if:       A => any of B's children
+
+    AND-expr A => atom B if:      any of A's children => B
+    AND-expr A => AND-expr B if:  A => each of B's children
+    AND-expr A => OR-expr B if:   A => any of B's children OR
+                                    any of A's children => B
+
+    OR-expr A => atom B if:       each of A's children => B
+    OR-expr A => AND-expr B if:   A => each of B's children
+    OR-expr A => OR-expr B if:    each of A's children => any of B's children
+
+Because atoms will not contain any `AND` or `OR` expressions, we can generate a
+`constraint.Span` for each of them in order to check for containment. There may
+be edge-cases which cannot be handled by `constraint.Span`, such as `IS NULL`
+expressions or multi-column values, like tuples.
+
+At a high-level, to test whether or not `atom A => atom B`, we can perform the
+following tests, in order:
+
+1. If the atoms are equal (pointer equality), then `A => B`.
+2. If the column referenced in `A` is not the column referenced in `B`, then `A`
+   does not imply `B`.
+3. If the `constraint.Span` of `A` is contained by the `constraint.Span` of `B`,
+   then `A => B`.
+
+There may be special considerations required for handling multi-column atoms,
+such as `(a, b) > (1, 2)`. Multi-column spans should be helpful in proving
+containment, though it should be noted that Postgres only supports simple
+multiple column implications.
+
+The time complexity of this check is `O(P * F)`, where `P` is the number of
+nodes in the predicate expression and `F` is the number of nodes in the filter
+expression.
+
+## Generating Partial Index Scans
+
+We will consider utilizing partial indexes for both unconstrained and
+constrained scans. Therefore, we'll need to modify both the `GenerateIndexScans`
+and `GenerateConstrainedScans` exploration rules (or make new, similar rules).
+
+In addition, we'll need to update exploration rules for zig-zag joins and
+inverted index scans.
+
+We'll remove redundant filters from the expression when generating a scan over a
+partial index. For example:
+
+```sql
+CREATE TABLE products (id INT PRIMARY KEY, price INT, units_sold INT, units_in_stock INT)
+CREATE INDEX idx1 ON products (price) WHERE units_sold > 1000
+
+SELECT * FROM products WHERE price > 20 AND units_sold > 1000 AND units_in_stock > 0
+```
+
+When generating the constrained scan over `idx1`, the `units_sold > 1000` filter
+can be removed from the outer `Select`, such that only the `units_in_stock > 0`
+filter remains.
+
+Only conjuncted filter expressions that _exactly_ match the
+predicate expression can be removed. For example, a filter expression
+`units_sold > 1200` could not be removed. This filter would remain and be
+applied after the scan in order to remove any rows returned by the scan with
+`units_sold` between `1000` and `1200`.
+
+## Statistics
+
+The statistics builder must take into account the predicate expression, in
+addition to the filter expression, when generating statistics for a partial
+index scan. This is because the number of rows examined via a partial index scan
+is dependent on the predicate expression.
+
+For example, consider the following table, indexes, and query:
+
+```sql
+CREATE TABLE products (id INT PRIMARY KEY, price INT, units_sold INT, type TEXT)
+CREATE INDEX idx1 ON t (price) WHERE units_sold > 1000
+CREATE INDEX idx2 ON t (price) WHERE units_sold > 1000 AND type = 'toy'
+
+SELECT COUNT(*) FROM products where units_sold > 1000 AND type = 'toy' AND price > 20
+```
+
+A scan on `idx1` will scan `[/1001 - ]`. A scan on on `idx2` will have the same
+scan, `[/1001 - ]`, but will examine fewer rows - only those where `type = 'toy'`.
+Therefore, the optimizer cannot rely solely on the scan constraints to determine
+the number of rows returned from scanning a partial index. It must also take
+into account the selectivity of the predicate to correctly determine that
+scanning `idx2` is a lower-cost plan than scanning `idx1`.
+
+We can estimate the number of rows returned from scanning a partial index with
+the following formula:
+
+    num_rows = rows_in_table * selectivity(predicate_expression) * selectivity(scan_constraint)
+
+This formula is similar one described by Michael Stonebraker in
+["The Case For Partial Indexes"](https://dsf.berkeley.edu/papers/ERL-M89-17.pdf).
+It has been simplified such that it does not make special considerations for
+columns both in the partial index column set and in the partial index
+predicate. In a series of examples, this proved to have an insignificant effect
+on the resulting estimate.
+
+## Mutation
+
+Partial indexes only index rows that satisfy the partial index's predicate
+expression. In order to maintain this property, `INSERT`s, `UPDATE`s, and
+`DELETE`s to a table must update the partial index in the event that they change
+the candidacy of a row. Partial indexes must also be updated if an `UPDATE`d row
+matches the predicate both before and after the update, and the value of the
+indexed columns change.
+
+In order for the execution engine to determine when a partial index needs to be
+updated, the optimizer will project boolean columns that represent whether or not
+partial indexes will be updated. This will operate similarly to `CHECK`
+constraint verification.
+
+### Insert
+
+If the row being inserted satisfies the predicate, write to the partial index.
+
+### Delete
+
+If the row being deleted satisfies the predicate, delete it from the partial
+index.
+
+### Updates
+
+Updates will require two columns to be projected for each partial index. The
+first is true if the old version of the row is in the index and needs to be
+deleted. The second is true if the new version of the row needs to be written to
+the index.
+
+Consider the following table of possibilities, where:
+
+ * `r` is the version of the row before the update
+ * `r'` is the version of the row after the update
+ * `pred_match(r)` is `true` when `r` matches the partial index predicate
+
+|                Case                  | Delete `r` from index | Insert `r'` to index |
+| ------------------------------------ | --------------------- | -------------------- |
+| `pred_match(r) AND !pred_match(r')`  | `True`                | `False`              |
+| `!pred_match(r) AND pred_match(r')`  | `False`               | `True`               |
+| `pred_match(r) AND pred_match(r')`*  | `True`                | `True`               |
+| `!pred_match(r) AND !pred_match(r')` | `False`               | `False`              |
+
+
+*Note that in the case that the row was already in the partial index and will
+remain in the partial index after the update, the index only needs to be updated
+(delete `r` and insert `r'`) if the value of the indexed columns changes. If the
+value of the indexed columns is not changing, there is no need to update the
+index.
+
+### Primary Key Changes
+
+If a primary key change occurs, the partial index will need to be rewritten so
+that the values in the index store the new primary key. This is similar to other
+secondary indexes, only that the added complexity of checking if rows belong in
+the partial index must be considered.
+
+# Alternatives considered
+
+## Disallow `OR` operators in partial index predicates
+
+**This alternative is not being considered because it would make CRDB partial
+indexes incompatible with Postgres's partial indexes.**
+
+Testing for predicate implication could be simplified by disallowing `OR`
+operators in partial index predicates. A predicate expression without `OR` can
+always be represented by a constraint. Therefore, to test if a filter implies
+the predicate, we simply check if any of the filter's constraints contain the
+predicate constraint. Walking the expression trees would not be required.
+
+[SQL Server imposes this limitation for its form of partial
+indexes](https://docs.microsoft.com/en-us/sql/t-sql/statements/create-index-transact-sql?view=sql-server-ver15).
+Such an expression could always be represented by a constraint. Therefore, to
+test if a filter implies the predicate, we simply check if any of the filter's
+constraints contain the predicate constraint.
+
+Note that the `IN` operator would still be allowed, which provides a form of
+disjunction. The `IN` operator can easily be supported because it represents a
+disjunction on only one column, which a constraint _can_ represent.
+
+# Work Items
+
+Below is a list of the steps (PRs) to implement partial indexes, roughly
+ordered.
+
+- [ ] Add partial index predicate to internal index data structures, add parser
+  support for `WHERE <predicate>`, add a cluster flag for gating this
+  defaulted to "off"
+- [ ] Add simple equality implication check to optimizer when generating index
+  scans, in GenerateIndexScans.
+- [ ] Same, for GenerateConstrainedScans.
+- [ ] Add support for updating partial indexes on inserts.
+- [ ] Add support for updating partial indexes on deletes.
+- [ ] Add support for updating partial indexes on updates and upserts.
+- [ ] Add support for backfilling partial indexes.
+- [ ] Update the statistics builder to account for the selectivity of the partial index
+  predicate.
+- [ ] Add more advanced implication logic for filter and predicate expressions.
+- [ ] Add support in other index exploration rules:
+  - [ ] GenerateInvertedIndexScans
+  - [ ] GenerateZigZagJoin
+  - [ ] GenerateInvertedIndexZigZagJoin
+- [ ]  Add support for using partial indexes in Lookup Joins
+- [ ] [Stretch goal] Add support for `ON CONFLICT WHERE [index_predicate] DO
+  ...` for identifying conflict behavior for uniquer partial indexes.
+  - More info in the [Postgres
+    docs](https://www.postgresql.org/docs/9.5/sql-insert.html#SQL-ON-CONFLICT)
+    and [this blog
+    post](https://medium.com/@betakuang/why-postgresqls-on-conflict-cannot-find-my-partial-unique-index-552327b85e1)
+
+# Resources
+
+- [Postgres partial indexes documentation](https://www.postgresql.org/docs/current/indexes-partial.html)
+- [Postgres CREATE INDEX documentation](https://www.postgresql.org/docs/12/sql-createindex.html)
+- [Postgres predicate test source code](https://github.com/postgres/postgres/blob/master/src/backend/optimizer/util/predtest.c)
+- ["The Case For Partial Indexes", Michael Stonebraker](https://dsf.berkeley.edu/papers/ERL-M89-17.pdf)
+- [Use the Index Luke - Partial Indexes](https://use-the-index-luke.com/sql/where-clause/partial-and-filtered-indexes)