Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

rfcs: partial indexes #48557

Merged
merged 1 commit into from
May 22, 2020
Merged

Conversation

mgartner
Copy link
Collaborator

@mgartner mgartner commented May 8, 2020

Release note: None

@cockroach-teamcity
Copy link
Member

This change is Reviewable

@mgartner mgartner force-pushed the rfc-partial-indexes branch 8 times, most recently from 4e6b352 to 06129d5 Compare May 8, 2020 00:52
@mgartner mgartner force-pushed the rfc-partial-indexes branch 3 times, most recently from 37e6732 to 1e4dc25 Compare May 8, 2020 16:34
Copy link
Member

@jordanlewis jordanlewis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice RFC! I didn't carefully review, but the execution portions look good to me.

Copy link
Contributor

@andy-kimball andy-kimball left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:, very nice, just a few comments, nothing blocking

- Avoid overhead of updating an index for rows that are mutated and don't
satisfy the predicate.
- Reduce the number of rows examined when scanning.
- Provide a mechanism for ensuring uniqueness on a subset of rows in a table,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting, I hadn't thought of this benefit.


("=>" means "implies")

atom A => atom B if: A contains B
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would we use constraint.Span to do atom "contains" comparisons? Seems like the Span might be good way to represent atoms, even if we can't reuse the other objects. This also has a ton of overlap with Sumeer's inverted index representation, which also has spans as "atoms", and then allows AND/OR to arbitrarily compose.

Copy link
Collaborator Author

@mgartner mgartner May 13, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea. I was originally thinking of using constraint.Constraint, but we only need constraint.Span if we determine the columns being operated on are the same. I'll update the RFC.

However, I'm just now realizing there are some tricky cases involving NULL that we may have to handle without the aid of constraint.Span.

For example, imagine these predicates:

a IS NOT NULL

Postgres proves that the following filters imply this:

  • a IS NOT NULL
  • a = 1
  • a > 3

(a, b) IS NULL

Postgres proves that the following filters imply this:

  • (a, b) IS NULL
  • a IS NULL and b IS NULL

(a, b) IS NOT NULL

Postgres proves that the following filters imply this:

  • (a, b) IS NOT NULL
  • a IS NOT NULL AND b IS NOT NULL
  • a = 1 AND b = 2
  • a > 5 and b < 7

For the tuple cases, it might be pragmatic to only handle cases when an atom matches exactly (pointer equality), and leave it at that until we hear from a customer. Heck, it might even be good to push users towards using an equivalent predicate expression that's easier to comprehend, like a IS NOT NULL AND b IS NOT NULL.

Also, the PR for fixing <tuple> IS NULL logic may make (2) and (3) easier to deal with, if we normalize-away the tuples: #48299 (review)

Postgres's [predtest library](https://github.com/postgres/postgres/blob/c9d29775195922136c09cc980bb1b7091bf3d859/src/backend/optimizer/util/predtest.c#L251-L287)
uses this method to determine if a partial index can be used to satisfy a query.
The logic Postgres uses for testing implication of conjunctions, disjunctions,
and "atoms" (anything that is not an `AND` or `OR`) is as follows:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does PG's predtest library handle tuple comparisons, like (a,b)>(1,2)? In particular, I'm wondering if it handles atoms having multiple columns.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like they only handle the case when the tuple "atom" expression is an exact match, except for <tuple> IS NULL and <tuple> IS NOT NULL mentioned in my other comment.

marcus=# create table t (k int, a int, b int);
CREATE TABLE
marcus=# create index i on t (k) where (a, b) >= (1, 2);
CREATE INDEX
marcus=# explain select k from t where (a, b) >= (1, 2) and k = 1;
                           QUERY PLAN
----------------------------------------------------------------
 Bitmap Heap Scan on t  (cost=4.13..11.26 rows=3 width=4)
   Recheck Cond: ((k = 1) AND (ROW(a, b) >= ROW(1, 2)))
   ->  Bitmap Index Scan on i  (cost=0.00..4.13 rows=3 width=0)
         Index Cond: (k = 1)
(4 rows)

marcus=# explain select k from t where (a, b) > (1, 2) and k = 1;
                    QUERY PLAN
--------------------------------------------------
 Seq Scan on t  (cost=0.00..45.70 rows=3 width=4)
   Filter: ((k = 1) AND (ROW(a, b) > ROW(1, 2)))
(2 rows)

marcus=# explain select k from t where a >= 1 and b >= 2 and k = 1;
                    QUERY PLAN
--------------------------------------------------
 Seq Scan on t  (cost=0.00..45.70 rows=1 width=4)
   Filter: ((a >= 1) AND (b >= 2) AND (k = 1))
(2 rows)

marcus=# explain select k from t where a > 3 and b > 4 and k = 1;
                    QUERY PLAN
--------------------------------------------------
 Seq Scan on t  (cost=0.00..45.70 rows=1 width=4)
   Filter: ((a > 3) AND (b > 4) AND (k = 1))
(2 rows)

docs/RFCS/20200507_partial_indexes.md Show resolved Hide resolved
docs/RFCS/20200507_partial_indexes.md Outdated Show resolved Hide resolved

## Mutation

Partial indexes only index rows that satisfy the partial index's predicate
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One tricky thing to make sure you consider is the effect of a partial index that is currently being added or removed and/or if an existing partial index is having columns added or removed. It should follow the same patterns as regular indexes, but make a special point of carefully testing those kinds of variations.

Copy link
Collaborator

@rytaft rytaft left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm: Really nice writeup!

Reviewed 1 of 1 files at r1.
Reviewable status: :shipit: complete! 2 of 0 LGTMs obtained (waiting on @awoods187, @mgartner, and @RaduBerinde)


docs/RFCS/20200507_partial_indexes.md, line 66 at r1 (raw file):

1. They must result in a boolean.
2. They can only refer to columns in the table being indexed.
3. Functions used within predicates cannot be impure. For example, `now()` is

[nit] since we've moved from impure to immutable/stable/volatile, maybe this should be changed to "must be immutable"?


docs/RFCS/20200507_partial_indexes.md, line 171 at r1 (raw file):

In addition, we'll need to update exploration rules for zig-zag joins and
inverted index scans.

What about lookup joins? Is it possible to use a partial index as the lookup index?


docs/RFCS/20200507_partial_indexes.md, line 190 at r1 (raw file):

The statistics builder must take into account the predicate expression, in
addition to the filter expression, when generating statistics for partial index

[nit] a partial index scan


docs/RFCS/20200507_partial_indexes.md, line 221 at r1 (raw file):

columns both in the partial index column set and in the partial index
predicate. In a series of examples, this proved to have an insignificant effect
on the resulting estimate.

I think we may want to try to combine the predicates before calculating selectivity to get a more accurate estimate. For example, if the index is on units_sold > 1000 but the filter is on units_sold > 1200, we probably want to calculate the selectivity of units sold > 1200 only, not (selectivity of units_sold > 1000) * (selectivity of units_sold > 1200) . Since we currently use constraints for this calculation, this would be as simple as performing an intersection of the constraints before calculating the selectivity.


docs/RFCS/20200507_partial_indexes.md, line 228 at r1 (raw file):

expression. In order to maintain this property, `INSERT`s, `UPDATE`s, and
`DELETE`s to a table must update the partial index in the event that they change
the candidacy of a row.

[nit] it's not just candidacy, right? You still need to update the row even if it remains in the index.


docs/RFCS/20200507_partial_indexes.md, line 301 at r1 (raw file):

- [ ] Update the statistics builder to account for the selectivity of the partial index
  predicate.
- [ ] Add more advance implication logic for filter and predicate expressions.

advance -> advanced


docs/RFCS/20200507_partial_indexes.md, line 303 at r1 (raw file):

- [ ] Add more advance implication logic for filter and predicate expressions.
- [ ] Add support in other index exploration rules:
  - [ ] GenerateInvertedIndexScans

Are there any special considerations needed for inverted indexes?

@mgartner mgartner force-pushed the rfc-partial-indexes branch from 1e4dc25 to cefd371 Compare May 13, 2020 22:09
@mgartner mgartner requested a review from a team as a code owner May 13, 2020 22:09
@mgartner mgartner force-pushed the rfc-partial-indexes branch from cefd371 to 33c0e7b Compare May 13, 2020 22:27
Copy link
Collaborator Author

@mgartner mgartner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 2 stale) (waiting on @andy-kimball, @awoods187, @mgartner, @RaduBerinde, and @rytaft)


docs/RFCS/20200507_partial_indexes.md, line 66 at r1 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

[nit] since we've moved from impure to immutable/stable/volatile, maybe this should be changed to "must be immutable"?

Done.


docs/RFCS/20200507_partial_indexes.md, line 190 at r1 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

[nit] a partial index scan

Done.


docs/RFCS/20200507_partial_indexes.md, line 301 at r1 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

advance -> advanced

Done.

@mgartner mgartner force-pushed the rfc-partial-indexes branch from 33c0e7b to 1606d7c Compare May 13, 2020 22:47
Copy link
Collaborator Author

@mgartner mgartner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 2 stale) (waiting on @andy-kimball, @awoods187, @mgartner, @RaduBerinde, and @rytaft)


docs/RFCS/20200507_partial_indexes.md, line 228 at r1 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

[nit] it's not just candidacy, right? You still need to update the row even if it remains in the index.

Yes you're right. But only in the case that the value of the index columns changes.

I added a comment here and rewrote the Update section below to make this more clear.

@mgartner mgartner force-pushed the rfc-partial-indexes branch from 1606d7c to 2f9ce42 Compare May 13, 2020 22:50
Copy link
Collaborator Author

@mgartner mgartner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 2 stale) (waiting on @andy-kimball, @awoods187, @mgartner, @RaduBerinde, and @rytaft)


docs/RFCS/20200507_partial_indexes.md, line 303 at r1 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

Are there any special considerations needed for inverted indexes?

I don't foresee needing special considerations, at the moment. But there could very well be some gotchas. AFAIK As long as the filter matches the predicate, the partial inverted index can be used.

Copy link
Contributor

@awoods187 awoods187 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 2 stale) (waiting on @andy-kimball, @mgartner, @RaduBerinde, and @rytaft)


docs/RFCS/20200507_partial_indexes.md, line 16 at r2 (raw file):

Partial indexes are a common feature in RDBMSs. They can be beneficial in
multiple use-cases. Partial indexes can:

might be worth upleveling the overview. at a high level, partial indexes let you get the same read performance as regular indexes with improve write performance provided you only need strong read performance on predicate values


docs/RFCS/20200507_partial_indexes.md, line 22 at r2 (raw file):

- Avoid overhead of updating an index for rows that are mutated and don't
  satisfy the predicate.
- Reduce the number of rows examined when scanning.

this shouldnt be any different when compared to a normal index though right?


docs/RFCS/20200507_partial_indexes.md, line 52 at r2 (raw file):

Only the last query can utilize popular_products. Its filter expression,

utilize the partial index popular_product


docs/RFCS/20200507_partial_indexes.md, line 58 at r2 (raw file):

expression.

Attempting to force a partial index to be used for a query that does not imply

it might be good to introduce the concept that the optimizer will attempt to use the partial index by default but explain when you may need to force a partial index. Then you can mention that you can force a partial index like a normal index using the hint syntax @. Although to be honest I'm struggling to think of a case in which you would want to force a partial index that wouldn't be identifiable by the CBO


docs/RFCS/20200507_partial_indexes.md, line 77 at r2 (raw file):

## Parsing

In order to ensure that predicates are valid (e.g. they result in booleans and

nit e.g.,


docs/RFCS/20200507_partial_indexes.md, line 87 at r2 (raw file):

query must _imply_ that the partial index predicate is true. If the predicate is
not provably true, the rows to be returned may not exist in the partial index,
and it cannot be used.

its implied and kinda obvious but perhaps worth noting the CBO will try to use another index and if none are available the PK


docs/RFCS/20200507_partial_indexes.md, line 176 at r2 (raw file):

The time complexity of this check is `O(P * F)`, where `P` is the number of
nodes in the predicate expression and `F` is the number of nodes in the filter
expression.

is there ever the case where a partial index is more expensive than a full index for reads if this time complexity check is too long?


docs/RFCS/20200507_partial_indexes.md, line 242 at r2 (raw file):

predicate. In a series of examples, this proved to have an insignificant effect
on the resulting estimate.

I assume this would extensible in the future if we found uses cases in which we needed the additional complexity?


docs/RFCS/20200507_partial_indexes.md, line 292 at r2 (raw file):

value of the indexed columns is not changing, there is no need to update the
index.

What happens when a PK change occurs? I assume partial indexes are like normal secondary indexes and get re-written but its more mentioning.

Similarly, is there any interaction with partitioning? I assume you can partition a partial index just like a secondary index?

Can you set up a partial index with a where clause that has the same locality as the compound primary key in a partitioned or replicated table? If so would this be an index this is only built/stored in that locality?

Copy link
Collaborator

@rytaft rytaft left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 1 of 1 files at r2.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 2 stale) (waiting on @andy-kimball, @mgartner, @RaduBerinde, and @sumeerbhola)


docs/RFCS/20200507_partial_indexes.md, line 303 at r1 (raw file):

Previously, mgartner (Marcus Gartner) wrote…

I don't foresee needing special considerations, at the moment. But there could very well be some gotchas. AFAIK As long as the filter matches the predicate, the partial inverted index can be used.

We'll probably need to think about the interaction of partial inverted indexes with the new invertedexpr.SpanExpression that @sumeerbhola is building. For example, even if we can confirm that the filter condition implies the index predicate, we might need to remove any sub-expressions from the filter that aren't part of the partial index. Probably not necessary to add details to this RFC, but it's something that I'm sure we'll have to wrestle with a bit at some point...

Copy link
Collaborator

@sumeerbhola sumeerbhola left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 2 stale) (waiting on @andy-kimball, @mgartner, @RaduBerinde, and @rytaft)


docs/RFCS/20200507_partial_indexes.md, line 303 at r1 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

We'll probably need to think about the interaction of partial inverted indexes with the new invertedexpr.SpanExpression that @sumeerbhola is building. For example, even if we can confirm that the filter condition implies the index predicate, we might need to remove any sub-expressions from the filter that aren't part of the partial index. Probably not necessary to add details to this RFC, but it's something that I'm sure we'll have to wrestle with a bit at some point...

I think the decomposition rules stated earlier are general enough to apply to any type of data. So we would use the same logic to decide whether
user expression => inverted index predicate

What is special about the inverted index is the row structure, which is orthogonal.


docs/RFCS/20200507_partial_indexes.md, line 147 at r2 (raw file):

    ("=>" means "implies")

    atom A => atom B if:          A contains B

Is an atom a single column set, or a multi-column set -- I was imagining single column, but the mention of constraint.Span which allows multi-column made me less sure.


docs/RFCS/20200507_partial_indexes.md, line 248 at r2 (raw file):

expression. In order to maintain this property, `INSERT`s, `UPDATE`s, and
`DELETE`s to a table must update the partial index in the event that they change
the candidacy of a row. Partial indexes must also be updated if an `UPDATE`d row

(just trying to make sure I understood this correctly)
for UPDATE, unlike a regular index update, that only happens if one of the indexed columns is changed, here if any of the columns involved in a partial index predicate changes we need to read all the columns needed for that predicate computation (potentially from different column families) and then compute the predicate and if changed from false => true write to the index (which may actually be a different set of columns from the ones updated or needed for the predicate computation)? So a user could construct partial index predicates that are actually quite expensive in terms of the read cost?

Copy link
Contributor

@awoods187 awoods187 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (and 2 stale) (waiting on @andy-kimball, @bdarnell, @mgartner, @RaduBerinde, @rytaft, and @sumeerbhola)


docs/RFCS/20200507_partial_indexes.md, line 58 at r2 (raw file):

Previously, mgartner (Marcus Gartner) wrote…

Brilliant. I've added this approach and why its required to the doc.

SGTM


docs/RFCS/20200507_partial_indexes.md, line 176 at r2 (raw file):

Previously, mgartner (Marcus Gartner) wrote…

Scanning a partial index is not guaranteed to always be more efficient than scanning a full index. Here's an example:

CREATE TABLE orders (id INT PRIMARY KEY, user_id INT, price INT)
CREATE INDEX orders_user_id ON orders (user_id)
CREATE INDEX orders_price_20s ON orders (id) WHERE price >= 20 and price < 30

SELECT COUNT(*) FROM orders WHERE price >= 20 AND price < 30 AND user_id = 123

In this case, it'd likely be more selective to use the orders_user_id to retrieve the orders for user 123 and then filter by price afterwords, rather than using orders_price_20s and retrieving all the orders from all users in the price range, and then filtering by user id.

However, I expect this check to be very fast. It's proportional to the number of expressions in the predicate and filter expressions. I doubt that performing this check would ever tip the scales such that the cost of this check plus scanning the partial index become more costly than no-check (ignore partial indexes) plus a full index. In other words, the cost of this check should always be very low, and the potential gains of using a partial index vs a full index could be very high.

that makes sense--that's helpful!

Copy link
Collaborator

@rytaft rytaft left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (and 2 stale) (waiting on @andy-kimball, @bdarnell, @mgartner, @RaduBerinde, and @sumeerbhola)


docs/RFCS/20200507_partial_indexes.md, line 221 at r1 (raw file):

just for my curiosity, is that the correlated columns and FD_strength_scaled commit?

It's actually this one: 25d515d (the second commit in that PR).

because the column is completely correlated with itself, so one can't multiply selectivities, which requires independence, yes?

Yea, that's a good way to look at it.

@mgartner mgartner force-pushed the rfc-partial-indexes branch from e2d8f17 to 836e598 Compare May 18, 2020 19:53
Copy link
Collaborator Author

@mgartner mgartner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (and 2 stale) (waiting on @andy-kimball, @awoods187, @mgartner, @RaduBerinde, @rytaft, and @sumeerbhola)


docs/RFCS/20200507_partial_indexes.md, line 147 at r1 (raw file):

Previously, mgartner (Marcus Gartner) wrote…

Good idea. I was originally thinking of using constraint.Constraint, but we only need constraint.Span if we determine the columns being operated on are the same. I'll update the RFC.

However, I'm just now realizing there are some tricky cases involving NULL that we may have to handle without the aid of constraint.Span.

For example, imagine these predicates:

a IS NOT NULL

Postgres proves that the following filters imply this:

  • a IS NOT NULL
  • a = 1
  • a > 3

(a, b) IS NULL

Postgres proves that the following filters imply this:

  • (a, b) IS NULL
  • a IS NULL and b IS NULL

(a, b) IS NOT NULL

Postgres proves that the following filters imply this:

  • (a, b) IS NOT NULL
  • a IS NOT NULL AND b IS NOT NULL
  • a = 1 AND b = 2
  • a > 5 and b < 7

For the tuple cases, it might be pragmatic to only handle cases when an atom matches exactly (pointer equality), and leave it at that until we hear from a customer. Heck, it might even be good to push users towards using an equivalent predicate expression that's easier to comprehend, like a IS NOT NULL AND b IS NOT NULL.

Also, the PR for fixing <tuple> IS NULL logic may make (2) and (3) easier to deal with, if we normalize-away the tuples: #48299 (review)

We talked more and multi-column spans should help with some of this, but there may be special considerations for null cases or others. I've added a sentence or two about this to the RFC.


docs/RFCS/20200507_partial_indexes.md, line 171 at r1 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

What about lookup joins? Is it possible to use a partial index as the lookup index?

Yes it should be possible, as long as we can prove that every row being looked up will exist in the partial index. This could be difficult for more complex join expressions.

For example:

CREATE TABLE ab (a INT PRIMARY KEY, b INT)
CREATE TABLE xy (x INT PRIMARY KEY, y INT, z INT)
CREATE INDEX i ON xy (y) WHERE z >10

The following join and filter make it fairly obvious to see that the index i could be used for a lookup-join.

SELECT * FROM ab JOIN xy ON a = y WHERE z = 10

But for this query, it may be harder to prove that i can be used:

SELECT * FROM ab JOIN xy ON a = y AND b = z WHERE b = 10

I've added a work item to the list for utilizing partial indexes during lookup joins.


docs/RFCS/20200507_partial_indexes.md, line 147 at r2 (raw file):

Previously, mgartner (Marcus Gartner) wrote…

I was imagining single columns originally, but in order to support more complex tuple comparisons, we will need multi-column sets. The being said, Postgres's support for proving containment

Looks like my comment got cut off. I believe I wrote something like "The being said, Postgres's support for proving containment for multi-column atoms is fairly simple. It doesn't support complex proofs of implication for them."


docs/RFCS/20200507_partial_indexes.md, line 292 at r2 (raw file):

What happens when a PK change occurs? I assume partial indexes are like normal secondary indexes and get re-written but its more mentioning.

Correct. I've added a section mentioning this so that it is explicit.

Similarly, is there any interaction with partitioning? I assume you can partition a partial index just like a secondary index?

Can you set up a partial index with a where clause that has the same locality as the compound primary key in a partitioned or replicated table? If so would this be an index this is only built/stored in that locality?

That should be possible, yes. I need to do some more research and figure out if there are special considerations that need to be made in order to support partial partitioned indexes.


docs/RFCS/20200507_partial_indexes.md, line 83 at r3 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

Probably -- we produce an error in other cases when the hinted plan is not possible. For example, if a lookup join was hinted but not possible, we return the following error: "ERROR: could not produce a query plan conforming to the LOOKUP JOIN hint".

I wasn't aware that an error was produce in that case. Thanks for pointing it out.

In those lookup join hint cases, is it possible that placeholder values can change whether the error is produced or not?

That is what persuaded this decision for partial indexes. It seemed odd to have a query statement with some placeholders that could error or not depending on those values. It puts the responsibility on the user to catch the error in the application and retry without the hint, or prove that the partial index can be used in the application layer before adding the hint. Both of those seem non-ideal.

But I'm not sure how index hints are used in the real world. Are they used in production applications, or for testing query plans and indexes?

@RaduBerinde and @awoods187 looping you in from the discussion we had on this elsewhere. Any new thoughts?

Copy link
Collaborator

@rytaft rytaft left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 1 of 1 files at r4.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (and 2 stale) (waiting on @andy-kimball, @awoods187, @mgartner, @RaduBerinde, and @sumeerbhola)


docs/RFCS/20200507_partial_indexes.md, line 171 at r1 (raw file):

Previously, mgartner (Marcus Gartner) wrote…

Yes it should be possible, as long as we can prove that every row being looked up will exist in the partial index. This could be difficult for more complex join expressions.

For example:

CREATE TABLE ab (a INT PRIMARY KEY, b INT)
CREATE TABLE xy (x INT PRIMARY KEY, y INT, z INT)
CREATE INDEX i ON xy (y) WHERE z >10

The following join and filter make it fairly obvious to see that the index i could be used for a lookup-join.

SELECT * FROM ab JOIN xy ON a = y WHERE z = 10

But for this query, it may be harder to prove that i can be used:

SELECT * FROM ab JOIN xy ON a = y AND b = z WHERE b = 10

I've added a work item to the list for utilizing partial indexes during lookup joins.

Sounds good.


docs/RFCS/20200507_partial_indexes.md, line 83 at r3 (raw file):

Previously, mgartner (Marcus Gartner) wrote…

I wasn't aware that an error was produce in that case. Thanks for pointing it out.

In those lookup join hint cases, is it possible that placeholder values can change whether the error is produced or not?

That is what persuaded this decision for partial indexes. It seemed odd to have a query statement with some placeholders that could error or not depending on those values. It puts the responsibility on the user to catch the error in the application and retry without the hint, or prove that the partial index can be used in the application layer before adding the hint. Both of those seem non-ideal.

But I'm not sure how index hints are used in the real world. Are they used in production applications, or for testing query plans and indexes?

@RaduBerinde and @awoods187 looping you in from the discussion we had on this elsewhere. Any new thoughts?

I think placeholders could theoretically change whether or not the error is produced for lookup joins, since we don't trigger that error until execbuilder time (i.e., after placeholders have been substituted). I don't know of an example off hand where the placeholder value would sometimes error and sometimes not, though....

Ideally index hints would just be used for testing, but there are definitely some customers that use them in production if the optimizer would otherwise choose the wrong plan.


docs/RFCS/20200507_partial_indexes.md, line 324 at r4 (raw file):

### Primary Key Changes

If a primary key change occurs, the partial index will need to be rewritten to

[nit] to so -> so

@mgartner mgartner force-pushed the rfc-partial-indexes branch from 836e598 to efe6123 Compare May 18, 2020 20:30
Copy link
Collaborator Author

@mgartner mgartner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (and 2 stale) (waiting on @andy-kimball, @awoods187, @mgartner, @RaduBerinde, @rytaft, and @sumeerbhola)


docs/RFCS/20200507_partial_indexes.md, line 324 at r4 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

[nit] to so -> so

Done.

Copy link
Collaborator

@rytaft rytaft left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 1 of 1 files at r5.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (and 2 stale) (waiting on @andy-kimball, @awoods187, @mgartner, @RaduBerinde, and @sumeerbhola)

Copy link
Contributor

@bdarnell bdarnell left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (and 2 stale) (waiting on @andy-kimball, @awoods187, @mgartner, @RaduBerinde, and @sumeerbhola)


docs/RFCS/20200507_partial_indexes.md, line 83 at r3 (raw file):
A placeholder-dependent error wouldn't be a great experience, and I think I'd prefer to issue the error unless the query can be statically determined to be satisfiable with the partial index independent of any placeholder value.

I'm assuming it would always be possible to write the query to do this, such as WHERE col > $1 AND col > 100. This would trade the error for possibly missing results, but at least it would make things explicit for the user.

But I'm not sure how index hints are used in the real world. Are they used in production applications, or for testing query plans and indexes?

I have no real-world data, but they're generally used by users who (rightly or wrongly) believe they can plan their query better than the optimizer (I'm one of these users; given the choice I'd rather write my own plan tree than a SQL statement). In OLAP situations, this could mean that our optimizer isn't choosing the right plan without the hint and this puts it on the right track. In OLTP applications, it's more of an assertion about worst-case running time ("this page has an SLA to return in 250ms, so I know that this query must hit an index and be processed in O(1) time, and I'd rather get an error than have it start down an O(n) path").

With that in mind, errors that are dependent on placeholder values aren't great, but they're not the worst outcome. The hint could then be a promise from the user to use a placeholder value that makes the partial index work, and it's ok to return an error if they give a value out of bounds.

@mgartner mgartner force-pushed the rfc-partial-indexes branch 2 times, most recently from 6c93ce6 to ed78903 Compare May 20, 2020 18:09
Copy link
Collaborator Author

@mgartner mgartner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (and 2 stale) (waiting on @andy-kimball, @awoods187, @mgartner, @RaduBerinde, @rytaft, and @sumeerbhola)


docs/RFCS/20200507_partial_indexes.md, line 221 at r1 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

just for my curiosity, is that the correlated columns and FD_strength_scaled commit?

It's actually this one: 25d515d (the second commit in that PR).

because the column is completely correlated with itself, so one can't multiply selectivities, which requires independence, yes?

Yea, that's a good way to look at it.

I've rewritten this section so that the exact statistics calculations are TBD. I've also included a few sentences and examples about the special case mentioned by @rytaft above where the predicate includes indexed columns, so that it is not forgotten.


docs/RFCS/20200507_partial_indexes.md, line 292 at r2 (raw file):

Previously, mgartner (Marcus Gartner) wrote…

What happens when a PK change occurs? I assume partial indexes are like normal secondary indexes and get re-written but its more mentioning.

Correct. I've added a section mentioning this so that it is explicit.

Similarly, is there any interaction with partitioning? I assume you can partition a partial index just like a secondary index?

Can you set up a partial index with a where clause that has the same locality as the compound primary key in a partitioned or replicated table? If so would this be an index this is only built/stored in that locality?

That should be possible, yes. I need to do some more research and figure out if there are special considerations that need to be made in order to support partial partitioned indexes.

I've added a work item to add full support for partitioned partial indexes, so we don't lose track of it.

@mgartner mgartner force-pushed the rfc-partial-indexes branch from ed78903 to 59df835 Compare May 20, 2020 18:19
Copy link
Collaborator Author

@mgartner mgartner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@RaduBerinde @ajwerner I added a work item to "consider using partial indexes for auto-generated indexes used for foreign keys.

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (and 2 stale) (waiting on @andy-kimball, @awoods187, @mgartner, @RaduBerinde, @rytaft, and @sumeerbhola)

Copy link
Collaborator

@rytaft rytaft left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 1 of 1 files at r6.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (and 2 stale) (waiting on @andy-kimball, @awoods187, @mgartner, @RaduBerinde, and @sumeerbhola)


docs/RFCS/20200507_partial_indexes.md, line 319 at r6 (raw file):

column, the selectivity of each is _not_ independent. The number of rows
examined is approximately the total number of rows in the table, multiplied by
the selectivity of `[/201 - ]`. It would be inaccurate to naively to instead

[nit] to naively to instead

@mgartner mgartner force-pushed the rfc-partial-indexes branch from 59df835 to 9e2469f Compare May 20, 2020 18:39
Copy link
Collaborator Author

@mgartner mgartner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (and 2 stale) (waiting on @andy-kimball, @awoods187, @mgartner, @RaduBerinde, @rytaft, and @sumeerbhola)


docs/RFCS/20200507_partial_indexes.md, line 319 at r6 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

[nit] to naively to instead

Done.

@mgartner mgartner force-pushed the rfc-partial-indexes branch from 9e2469f to 9e2abb0 Compare May 20, 2020 18:40
Copy link
Collaborator

@rytaft rytaft left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 1 of 1 files at r7.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (and 2 stale) (waiting on @andy-kimball, @awoods187, @mgartner, @RaduBerinde, and @sumeerbhola)

@awoods187
Copy link
Contributor


docs/RFCS/20200507_partial_indexes.md, line 292 at r2 (raw file):

Previously, mgartner (Marcus Gartner) wrote…

I've added a work item to add full support for partitioned partial indexes, so we don't lose track of it.

awesome! im excited about this.

Copy link
Contributor

@awoods187 awoods187 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (and 2 stale) (waiting on @andy-kimball, @awoods187, @mgartner, @RaduBerinde, and @sumeerbhola)

@mgartner mgartner force-pushed the rfc-partial-indexes branch 2 times, most recently from 9949abe to 3788946 Compare May 22, 2020 17:32
Copy link
Collaborator Author

@mgartner mgartner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (and 2 stale) (waiting on @andy-kimball, @awoods187, @mgartner, @RaduBerinde, @rytaft, and @sumeerbhola)


docs/RFCS/20200507_partial_indexes.md, line 83 at r3 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

A placeholder-dependent error wouldn't be a great experience, and I think I'd prefer to issue the error unless the query can be statically determined to be satisfiable with the partial index independent of any placeholder value.

I'm assuming it would always be possible to write the query to do this, such as WHERE col > $1 AND col > 100. This would trade the error for possibly missing results, but at least it would make things explicit for the user.

But I'm not sure how index hints are used in the real world. Are they used in production applications, or for testing query plans and indexes?

I have no real-world data, but they're generally used by users who (rightly or wrongly) believe they can plan their query better than the optimizer (I'm one of these users; given the choice I'd rather write my own plan tree than a SQL statement). In OLAP situations, this could mean that our optimizer isn't choosing the right plan without the hint and this puts it on the right track. In OLTP applications, it's more of an assertion about worst-case running time ("this page has an SLA to return in 250ms, so I know that this query must hit an index and be processed in O(1) time, and I'd rather get an error than have it start down an O(n) path").

With that in mind, errors that are dependent on placeholder values aren't great, but they're not the worst outcome. The hint could then be a promise from the user to use a placeholder value that makes the partial index work, and it's ok to return an error if they give a value out of bounds.

I've decided for now on using the hinted partial index if the filter implies the predicate, and erring if the filter does not provably imply the predicate.

Release note: None
@mgartner mgartner force-pushed the rfc-partial-indexes branch from 3788946 to 41a1f4b Compare May 22, 2020 17:34
@mgartner
Copy link
Collaborator Author

bors r+

@craig
Copy link
Contributor

craig bot commented May 22, 2020

Build succeeded

@craig craig bot merged commit 54acaaf into cockroachdb:master May 22, 2020
@mgartner mgartner deleted the rfc-partial-indexes branch May 22, 2020 18:35
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

10 participants