Skip to content

Commit

Permalink
rfcs: partial indexes
Browse files Browse the repository at this point in the history
Release note: None
  • Loading branch information
mgartner committed May 8, 2020
1 parent aa07ae0 commit 1e4dc25
Showing 1 changed file with 318 additions and 0 deletions.
318 changes: 318 additions & 0 deletions docs/RFCS/20200507_partial_indexes.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,318 @@
- Feature Name: Partial Indexes
- Status: draft
- Start Date: 2020-05-07
- Authors: mgartner
- RFC PR: [#48557](https://github.com/cockroachdb/cockroach/pull/48557)
- Cockroach Issue: [#9683](https://github.com/cockroachdb/cockroach/issues/9683)

# Summary

This RFC proposes the addition of partial indexes to CockroachDB. A partial
index is an index with a boolean predicate expression that only indexes rows in
which the predicate evaluates to true.

Partial indexes are a common feature in RDBMSs. They can be beneficial in
multiple use-cases. Partial indexes can:

- Allow users to reduce the total size of the set of indexes required to satisfy
queries, by both reducing the number of rows indexed, and reducing the number
of columns indexed.
- Avoid overhead of updating an index for rows that are mutated and don't
satisfy the predicate.
- Reduce the number of rows examined when scanning.
- Provide a mechanism for ensuring uniqueness on a subset of rows in a table,
when paired with unique indexes.

# Guide-level Explanation

Partial indexes are created by including a _predicate expression_ via `WHERE
<predicate>` in a `CREATE INDEX` statement. For example:

```sql
CREATE INDEX popular_products ON products (price) WHERE units_sold > 1000
```

The `popular_products` index only indexes rows where the `units_sold` column has
a value greater than `1000`.

Partial indexes can only be used to satisfy a query that has a _filter
expression_, `WHERE <filter>`, that implies the predicate expression. For
example, consider the following queries:

```sql
SELECT max(price) FROM products

SELECT max(price) FROM products WHERE review_count > 100

SELECT max(price) FROM products WHERE units_sold > 500

SELECT max(price) FROM products WHERE units_sold > 1500
```

Only the last query can utilize `popular_products`. Its filter expression,
`units_sold > 1500`, _implies_ the predicate expression, `units_sold > 1000`.
Every value for `units_sold` that is greater than `1500` is also greater than
`1000`. Stated differently, the predicate expression _contains_ the filter
expression.

Attempting to force a partial index to be used for a query that does not imply
the partial index's predicate will result in an error.

There are some notable restrictions that are enforced on partial index
predicates.

1. They must result in a boolean.
2. They can only refer to columns in the table being indexed.
3. Functions used within predicates cannot be impure. For example, `now()` is
not allowed because its result depends on more than its arguments.

# Reference-level Explanation

This design covers 5 major aspects of implementing partial indexes: parsing,
testing predicate implication, generating partial index scans, statistics, and
mutation.

## Parsing

In order to ensure that predicates are valid (e.g. they result in booleans and
contain no impure functions), we will use the same logic that validates `CHECK`
constraints, `sqlbase.SanitizeVarFreeExpr`. The restrictions for `CHECK`
constraints and partial index predicates are the same.

## Testing Predicate Implication

In order to use a partial index to satisfy a query, the filter expression of the
query must _imply_ that the partial index predicate is true. If the predicate is
not provably true, the rows to be returned may not exist in the partial index,
and it cannot be used.

### Exact matches

First, we will check if any conjuncted-expression in the filter is an exact
match to the predicate.

For example, consider the filter expression `a > 10 AND b < 100` and the partial
index predicate `b < 100`. The second conjuncted expression in the filter, `b <
100`, is an exact match to the predicate `b < 100`. Therefore this filter
implies this predicate.

We can test for pointer equality to check if the conjuncted-expressions are an
exact match. The `interner` ensures that identical expressions have the same
memory address.

### Non-exact matches

There are cases when an expression implies a predicate, but is not an exact
match.

For example, `a > 10` implies `a > 0` because all values for `a` that satisfy
`a > 10` also satisfy `a > 0`.

Constraints and constraint sets can be leveraged to help perform implication
checks. However, they are not a full solution. Constraint sets cannot represent
a disjunction with different columns on each side.

Consider the following example:

```sql
CREATE TABLE products (id INT PRIMARY KEY, price INT, units_sold INT, review_count INT)
CREATE INDEX popular_prds ON t (price) WHERE units_sold > 1000 OR review_count > 100
```

No constraint can be created for the top-level predicate expression of
`popular_prds`.

Therefore, constraints alone cannot help us determine that `popular_prds` can be
scanned to satisfy any of the below queries:

```sql
SELECT COUNT(id) FROM products WHERE units_sold > 1500 AND price > 100

SELECT COUNT(id) FROM products WHERE review_count > 200 AND price < 100

SELECT COUNT(id) FROM products WHERE (units_sold > 1000 OR review_count > 200) AND price < 100
```

In order to accommodate for such expressions, we must walk the filter and
expression trees. At each predicate expression node, we will check if it is
implied by the filter expression node.

Postgres's [predtest library](https://github.com/postgres/postgres/blob/c9d29775195922136c09cc980bb1b7091bf3d859/src/backend/optimizer/util/predtest.c#L251-L287)
uses this method to determine if a partial index can be used to satisfy a query.
The logic Postgres uses for testing implication of conjunctions, disjunctions,
and "atoms" (anything that is not an `AND` or `OR`) is as follows:

("=>" means "implies")

atom A => atom B if: A contains B
atom A => AND-expr B if: A => each of B's children
atom A => OR-expr B if: A => any of B's children

AND-expr A => atom B if: any of A's children => B
AND-expr A => AND-expr B if: A => each of B's children
AND-expr A => OR-expr B if: A => any of B's children OR
any of A's children => B

OR-expr A => atom B if: each of A's children => B
OR-expr A => AND-expr B if: A => each of B's children
OR-expr A => OR-expr B if: each of A's children => any of B's children

The time complexity of this check is `O(P * F)`, where `P` is the number of
nodes in the predicate expression and `F` is the number of nodes in the filter
expression.

## Generating Partial Index Scans

We will consider utilizing partial indexes for both unconstrained and
constrained scans. Therefore, we'll need to modify both the `GenerateIndexScans`
and `GenerateConstrainedScans` exploration rules (or make new, similar rules).

In addition, we'll need to update exploration rules for zig-zag joins and
inverted index scans.

We'll remove redundant filters from the expression when generating a scan over a
partial index. For example:

```sql
CREATE TABLE products (id INT PRIMARY KEY, price INT, units_sold INT, units_in_stock INT)
CREATE INDEX idx1 ON products (price) WHERE units_sold > 1000

SELECT * FROM products WHERE price > 20 AND units_sold > 1000 AND units_in_stock > 0
```

When generating the constrained scan over `idx1`, the `units_sold > 1000` filter
can be removed from the outer `Select`, such that only the `units_in_stock > 0`
filter remains.

## Statistics

The statistics builder must take into account the predicate expression, in
addition to the filter expression, when generating statistics for partial index
scan. This is because the number of rows examined via a partial index scan is
dependent on the predicate expression.

For example, consider the following table, indexes, and query:

```sql
CREATE TABLE products (id INT PRIMARY KEY, price INT, units_sold INT, type TEXT)
CREATE INDEX idx1 ON t (price) WHERE units_sold > 1000
CREATE INDEX idx2 ON t (price) WHERE units_sold > 1000 AND type = 'toy'

SELECT COUNT(*) FROM products where units_sold > 1000 AND type = 'toy' AND price > 20
```

A scan on `idx1` will scan `[/1001 - ]`. A scan on on `idx2` will have the same
scan, `[/1001 - ]`, but will examine fewer rows—only those where `type = 'toy'`.
Therefore, the optimizer cannot rely solely on the scan constraints to determine
the number of rows returned from scanning a partial index. It must also take
into account the selectivity of the predicate to correctly determine that
scanning `idx2` is a lower-cost plan than scanning `idx1`.

We can estimate the number of rows returned from scanning a partial index with
the following formula:

num_rows = rows_in_table * selectivity(predicate_expression) * selectivity(scan_constraint)

This formula is similar one described by Michael Stonebraker in
["The Case For Partial Indexes"](https://dsf.berkeley.edu/papers/ERL-M89-17.pdf).
It has been simplified such that it does not make special considerations for
columns both in the partial index column set and in the partial index
predicate. In a series of examples, this proved to have an insignificant effect
on the resulting estimate.

## Mutation

Partial indexes only index rows that satisfy the partial index's predicate
expression. In order to maintain this property, `INSERT`s, `UPDATE`s, and
`DELETE`s to a table must update the partial index in the event that they change
the candidacy of a row.

In order for the execution engine to determine when a partial index needs to be
updated, the optimizer will project boolean columns that represent whether or not
partial indexes will be updated. This will operate similarly to `CHECK`
constraint verification.

### Insert

If the row being inserted satisfies the predicate, write to the partial index.

### Delete

If the row being deleted satisfies the predicate, delete it from the partial
index.

### Updates

Updates will require two columns to be projected for each partial index. The
first is true if the old version of the row is in the index and needs to be
deleted. The second is true if the new version of the row needs to be written to
the index.

If the current version of the row matches the predicate, and the updated version
of the row does not match the predicate, delete it from the partial index.

If the current version of the row does not match the predicate, and the updated
version of the row does match, write it to the partial index.

If both versions of the row match the predicate, delete the old entry and insert
the new. Note that if the columns indexed in the partial index are not updated,
there is no need to perform a delete and update.

# Alternatives considered

## Disallow `OR` operators in partial index predicates

**This alternative is not being considered because it would make CRDB partial
indexes incompatible with Postgres's partial indexes.**

Testing for predicate implication could be simplified by disallowing `OR`
operators in partial index predicates. A predicate expression without `OR` can
always be represented by a constraint. Therefore, to test if a filter implies
the predicate, we simply check if any of the filter's constraints contain the
predicate constraint. Walking the expression trees would not be required.

[SQL Server imposes this limitation for its form of partial
indexes](https://docs.microsoft.com/en-us/sql/t-sql/statements/create-index-transact-sql?view=sql-server-ver15).
Such an expression could always be represented by a constraint. Therefore, to
test if a filter implies the predicate, we simply check if any of the filter's
constraints contain the predicate constraint.

Note that the `IN` operator would still be allowed, which provides a form of
disjunction. The `IN` operator can easily be supported because it represents a
disjunction on only one column, which a constraint _can_ represent.

# Work Items

Below is a list of the steps (PRs) to implement partial indexes, roughly
ordered.

- [ ] Add partial index predicate to internal index data structures, add parser
support for `WHERE <predicate>`, add a cluster flag for gating this
defaulted to "off"
- [ ] Add simple equality implication check to optimizer when generating index
scans, in GenerateIndexScans.
- [ ] Same, for GenerateConstrainedScans.
- [ ] Add support for updating partial indexes on inserts.
- [ ] Add support for updating partial indexes on deletes.
- [ ] Add support for updating partial indexes on updates and upserts.
- [ ] Add support for backfilling partial indexes.
- [ ] Update the statistics builder to account for the selectivity of the partial index
predicate.
- [ ] Add more advance implication logic for filter and predicate expressions.
- [ ] Add support in other index exploration rules:
- [ ] GenerateInvertedIndexScans
- [ ] GenerateZigZagJoin
- [ ] GenerateInvertedIndexZigZagJoin

# Resources

- [Postgres partial indexes documentation](https://www.postgresql.org/docs/current/indexes-partial.html)
- [Postgres CREATE INDEX documentation](https://www.postgresql.org/docs/12/sql-createindex.html)
- [Postgres predicate test source code](https://github.com/postgres/postgres/blob/master/src/backend/optimizer/util/predtest.c)
- ["The Case For Partial Indexes", Michael Stonebraker](https://dsf.berkeley.edu/papers/ERL-M89-17.pdf)
- [Use the Index Luke - Partial Indexes](https://use-the-index-luke.com/sql/where-clause/partial-and-filtered-indexes)

# Unresolved questions

- Is special work required for partial unique indexes?
- What about support for `ON CONFLICT`?

0 comments on commit 1e4dc25

Please sign in to comment.