Partial indexes #8242
e489fe2 to 7a895d9
Looks great!!!
I left a few comments. Other than those, I think the only thing potentially missing is a note that the optimizer cannot always prove that a query filter implies a partial index predicate. From the RFC:
Note that CRDB, like Postgres, will perform a best-effort attempt to prove that a query filter expression implies a partial index predicate. It is not guaranteed to prove implication of arbitrarily complex expressions.
In other words, false negatives are possible (where a filter theoretically implies a predicate, but cannot be proven by the optimizer in practice). It should be very unlikely, but it is possible, and calling it out in our docs may help prevent confusion. Here's one example:
[email protected]:58391/defaultdb> CREATE TABLE t (a INT, b INT, c INT, INDEX (a) WHERE b = 1 OR c = 2 OR b = 3);
CREATE TABLE
Server Execution Time: 4.515ms
Network Latency: 1.406ms
[email protected]:58391/defaultdb> EXPLAIN SELECT a FROM t WHERE b IN (1, 3) OR c = 2;
tree | field | description
------------+---------------+---------------------------
| distribution | full
| vectorized | false
filter | |
│ | filter | (b IN (1, 3)) OR (c = 2)
└── scan | |
| missing stats |
| table | t@primary
| spans | FULL SCAN
(8 rows)
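To make the false negative concrete: the query filter above is logically equivalent to the index predicate, so the implication holds even though the plan falls back to a full scan of `t@primary`. The following is a minimal Python sketch, not CockroachDB code. The function names and the small integer domain are assumptions chosen for illustration, and `NULL` handling is ignored. It checks the implication by brute force:

```python
from itertools import product

def predicate(b, c):
    # Partial index predicate: b = 1 OR c = 2 OR b = 3
    return b == 1 or c == 2 or b == 3

def query_filter(b, c):
    # Query filter: b IN (1, 3) OR c = 2
    return b in (1, 3) or c == 2

def implies(f, p, domain):
    # f implies p if every (b, c) pair accepted by f is also accepted by p.
    return all(p(b, c) for b, c in product(domain, repeat=2) if f(b, c))

# Hypothetical small domain; large enough to cover every constant used above.
DOMAIN = range(0, 6)

print(implies(query_filter, predicate, DOMAIN))
print(implies(lambda b, c: b == 2, predicate, DOMAIN))
```

The first check succeeds and the second fails, because `b = 2` says nothing about the predicate. An optimizer cannot enumerate values this way and has to reason about the expressions symbolically, which is where a best-effort proof can miss a valid implication.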
v20.2/sql-feature-support.md
Outdated
Multi-column indexes | ✓ | Common Extension | We do not limit on the number of columns indexes can include
Covering indexes | ✓ | Common Extension | [Storing Columns documentation](create-index.html#store-columns)
Inverted indexes | ✓ | Common Extension | [Inverted Indexes documentation](inverted-indexes.html)
Partial indexes | ✓ | Common Extension | [Partial indexes documentation](partial-indexes.html)
Multiple indexes per query | Planned | Common Extension | Use multiple indexes to filter the table's values for a single query
Unrelated to partial indexes but I noticed this "Multiple indexes per query" is marked planned. Since cockroachdb/cockroach#2142 was fixed by cockroachdb/cockroach#47094, there is a case where a single query can use multiple indexes (example below). This will be new in 20.2.
[email protected]:58391/defaultdb> create table t (k int primary key, a int, b int, index a_idx (a), index b_idx (b));
CREATE TABLE
Server Execution Time: 2.755ms
Network Latency: 810µs
[email protected]:58391/defaultdb> explain select k from t where a = 10 or b = 20;
tree | field | description
-----------------------+---------------+--------------
| distribution | local
| vectorized | false
distinct | |
│ | distinct on | k
└── union all | |
├── index join | |
│ │ | table | t@primary
│ └── scan | |
│ | missing stats |
│ | table | t@a_idx
│ | spans | [/10 - /10]
└── index join | |
│ | table | t@primary
└── scan | |
| missing stats |
| table | t@b_idx
| spans | [/20 - /20]
(17 rows)
Server Execution Time: 154µs
Network Latency: 311µs
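The plan above can be read as: scan each secondary index for matching keys, concatenate the results, then de-duplicate on `k`. Here is a rough Python model of that shape, assuming made-up rows and plain dictionaries standing in for `a_idx` and `b_idx`. The index joins back to the primary index are omitted, and none of this reflects how CockroachDB implements the operators:

```python
from collections import defaultdict

# Hypothetical table contents: k -> (a, b)
rows = {1: (10, 5), 2: (7, 20), 3: (10, 20), 4: (1, 2)}

# Stand-ins for the secondary indexes a_idx and b_idx: value -> list of k.
a_idx = defaultdict(list)
b_idx = defaultdict(list)
for k, (a, b) in rows.items():
    a_idx[a].append(k)
    b_idx[b].append(k)

def select_k(a_val, b_val):
    # "union all": concatenate the results of the two constrained index scans.
    union_all = a_idx.get(a_val, []) + b_idx.get(b_val, [])
    # "distinct on k": drop keys that matched both branches.
    seen = set()
    result = []
    for k in union_all:
        if k not in seen:
            seen.add(k)
            result.append(k)
    return result

matches = select_k(10, 20)
print(matches)
```

Row 3 satisfies both `a = 10` and `b = 20`, so it appears in both scans and the distinct step keeps it once.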
v20.2/partial-indexes.md
Outdated
- They contain fewer rows than full indexes, making them less expensive to create and store on a cluster.
- Read queries on rows included in a partial index only scan the rows in the partial index. This contrasts with queries on columns in full indexes, which must scan all rows in the indexed column.
- Write queries on rows implied by a partial index only modify rows in the partial index. This contrasts with write queries on columns in full indexes, which must modify the larger set of rows that make up a full-column index.
I think this last bullet point is confusing. The advantage of partial indexes with regard to writes is that the overhead of writing to an index is only incurred for rows that must be added to or removed from the partial index, whereas a non-partial index incurs this overhead for every row. For example, if we have an `INDEX (a) WHERE b = 'foo'`, and we `INSERT INTO t (a, b) VALUES (1, 'bar')`, there is no overhead of writing to the partial index because that row does not belong.
Would something like below be more clear?
With a partial index, write queries only incur the overhead of an index write when the row satisfies the predicate. This contrasts with full indexes, which incur the overhead of an index write for all rows when the indexed column is modified.
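The write-overhead point can be illustrated with a toy model. This is only a sketch in Python under stated assumptions: `IndexedTable` and its `index_writes` counter are hypothetical names, and the class counts index writes on insert only, ignoring updates and deletes.

```python
class IndexedTable:
    """Toy table with one secondary index on column a."""

    def __init__(self, predicate=None):
        # predicate=None models a full (non-partial) index.
        self.predicate = predicate
        self.rows = []
        self.index = []
        self.index_writes = 0

    def insert(self, a, b):
        self.rows.append((a, b))
        # The index is only written when the row satisfies the predicate.
        if self.predicate is None or self.predicate(a, b):
            self.index.append(a)
            self.index_writes += 1

# Models INDEX (a) WHERE b = 'foo' versus a plain INDEX (a).
partial = IndexedTable(predicate=lambda a, b: b == "foo")
full = IndexedTable()
for table in (partial, full):
    table.insert(1, "bar")  # does not satisfy the predicate
    table.insert(2, "foo")  # satisfies the predicate

print(partial.index_writes, full.index_writes)
```

Both tables store two rows, but the partial index is written once while the full index is written twice.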
v20.2/partial-indexes.md
Outdated
- [Functions](functions-and-operators.html) used in predicates must be immutable. For example, the `now()` function is not allowed in predicates because its value depends on more than its arguments.
{{site.data.alerts.callout_info}}
Partial indexes cannot be created at [table creation](create-table.html).
It should be possible to create them in a `CREATE TABLE` statement. Let me know if you ran into a case that didn't work.
[email protected]:58391/defaultdb> create table t (a int, index (a) where a > 0);
CREATE TABLE
Server Execution Time: 3.143ms
Network Latency: 998µs
[email protected]:58391/defaultdb> show create table t;
table_name | create_statement
-------------+------------------------------------------------
t | CREATE TABLE public.t (
| a INT8 NULL,
| INDEX t_a_idx (a ASC) WHERE a > 0:::INT8,
| FAMILY "primary" (a, rowid)
| )
(1 row)
Server Execution Time: 5.263ms
Network Latency: 231µs
v20.2/partial-indexes.md
Outdated
{{site.data.alerts.end}}
{{site.data.alerts.callout_info}}
CockroachDB returns an error if there are multiple unique or exclusion constraints matching the `ON CONFLICT` specification. See [tracking issue](https://github.com/cockroachdb/cockroach/issues/53170).
This is only the case for `ON CONFLICT ... DO UPDATE`, but not for `ON CONFLICT ... DO NOTHING`. There should be no issues with `ON CONFLICT ... DO NOTHING`.
We'll probably also want to document the new `WHERE` clause syntax in the `INSERT ON CONFLICT` statement. There are some examples here. This is particularly confusing, so I'm happy to explain more.
Ahh I see now that you have this correct below in "Known Limitations". I think this Note should include the "DO UPDATE" clarification or be removed.
{% include copy-clipboard.html %}
~~~ sql
> CREATE INDEX ON rides (city, revenue) WHERE revenue > 80;
[nit] There is no query plan in this section that takes advantage of `revenue` as an indexed column. If you added another example like `SELECT * FROM rides WHERE city = 'new york' AND revenue >= 100 AND revenue < 150`, the query plan should be a constrained scan over the partial index, rather than a `FULL SCAN`.
I'm not sure it's necessary but it might be a nice example to highlight.
TFTR, @mgartner!
I think I addressed all of your comments. I also added a note about the false negatives. I don't think we need to point out an example, but I agree that adding a disclaimer note will be helpful. Those changes are all in the "mgartner feedback" commit.
re: docs unrelated to partial indexes, I've added some simple updates to separate commits. I'd prefer to have separate PRs for unrelated docs updates, especially if they are more involved, but separate commits for these small updates should be fine.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @mgartner)
v20.2/partial-indexes.md, line 19 at r1 (raw file):
Previously, mgartner (Marcus Gartner) wrote…
I think this last bullet point is confusing. The advantage of partial indexes with regard to writes is that the overhead of writing to an index is only incurred for rows that must be added to or removed from the partial index, whereas a non-partial index incurs this overhead for every row. For example, if we have an `INDEX (a) WHERE b = 'foo'`, and we `INSERT INTO t (a, b) VALUES (1, 'bar')`, there is no overhead of writing to the partial index because that row does not belong.
Would something like below be more clear?
With a partial index, write queries only incur the overhead of an index write when the row satisfies the predicate. This contrasts with full indexes, which incur the overhead of an index write for all rows when the indexed column is modified.
Gotcha. I rewrote that last bullet, using a lot of your wording.
v20.2/partial-indexes.md, line 58 at r1 (raw file):
Previously, mgartner (Marcus Gartner) wrote…
It should be possible to create them in a `CREATE TABLE` statement. Let me know if you ran into a case that didn't work.
[email protected]:58391/defaultdb> create table t (a int, index (a) where a > 0);
CREATE TABLE
Server Execution Time: 3.143ms
Network Latency: 998µs
[email protected]:58391/defaultdb> show create table t;
table_name | create_statement
-------------+------------------------------------------------
t | CREATE TABLE public.t (
| a INT8 NULL,
| INDEX t_a_idx (a ASC) WHERE a > 0:::INT8,
| FAMILY "primary" (a, rowid)
| )
(1 row)
Server Execution Time: 5.263ms
Network Latency: 231µs
Ah! Okay. Removed this note. Didn't run into any issues.
v20.2/partial-indexes.md, line 80 at r1 (raw file):
Previously, mgartner (Marcus Gartner) wrote…
Ahh I see now that you have this correct below in "Known Limitations". I think this Note should include the "DO UPDATE" clarification or be removed.
I updated the Known Limitations bullet and removed the note.
v20.2/partial-indexes.md, line 149 at r1 (raw file):
there is no query plan in this section that takes advantage of `revenue` as an indexed column.
I'm not sure I understand what you mean by this. All of the queries filter on `revenue`. Do you mean "takes advantage of `city` as an indexed column"?
I added an example that includes the city in the filter clause.
v20.2/sql-feature-support.md, line 80 at r1 (raw file):
Previously, mgartner (Marcus Gartner) wrote…
Unrelated to partial indexes but I noticed this "Multiple indexes per query" is marked planned. Since cockroachdb/cockroach#2142 was fixed by cockroachdb/cockroach#47094, there is a case where a single query can use multiple indexes (example below). This will be new in 20.2.
[email protected]:58391/defaultdb> create table t (k int primary key, a int, b int, index a_idx (a), index b_idx (b));
CREATE TABLE
Server Execution Time: 2.755ms
Network Latency: 810µs
[email protected]:58391/defaultdb> explain select k from t where a = 10 or b = 20;
tree | field | description
-----------------------+---------------+--------------
| distribution | local
| vectorized | false
distinct | |
│ | distinct on | k
└── union all | |
├── index join | |
│ │ | table | t@primary
│ └── scan | |
│ | missing stats |
│ | table | t@a_idx
│ | spans | [/10 - /10]
└── index join | |
│ | table | t@primary
└── scan | |
| missing stats |
| table | t@b_idx
| spans | [/20 - /20]
(17 rows)
Server Execution Time: 154µs
Network Latency: 311µs
I don't see a docs issue opened for this, so I'll just sneak in an update to this table into this PR, making this support "partial" (in a separate commit).
I'd rather document updates fully in separate PRs. I opened an issue for this: #8260.
Reviewed 2 of 3 files at r2.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @ericharmeling)
v20.2/partial-indexes.md, line 58 at r1 (raw file):
Previously, ericharmeling (Eric Harmeling) wrote…
Ah! Okay. Removed this note. Didn't run into any issues.
Should the `CREATE TABLE` docs be updated so that the `index_def` syntax graph also has an `opt_where_clause` at the end?
v20.2/partial-indexes.md, line 149 at r1 (raw file):
Previously, ericharmeling (Eric Harmeling) wrote…
there is no query plan in this section that takes advantage of `revenue` as an indexed column.
I'm not sure I understand what you mean by this. All of the queries filter on `revenue`. Do you mean "takes advantage of `city` as an indexed column"?
I added an example that includes the city in the filter clause.
Sorry for creating confusion.
This is a step in the right direction. Before, all the spans were `FULL SCAN`s, meaning that they scan the entire partial index. This is a useful case, but another useful example that I wanted to highlight would be a scan over just a small part of the partial index.
Now that you've added `city = 'new york'`, you can see that the span is constrained to `[/'new york' - /'new york']`. ✔️
If you want to take it a step further to show all the indexed columns being used to constrain the spans, you can change the query filter to `WHERE city = 'new york' AND revenue >= 100 AND revenue < 150`.
Because `revenue >= 100 AND revenue < 150` implies `revenue > 80`, the partial index can be used. But it will still need to apply the revenue filter to remove indexed rows where revenue is either above 80 but below 100, or 150 and above. Luckily, the revenue column is the second indexed column, and the first indexed column, city, is constrained to a single value by the `city = 'new york'` filter. So the scan over the partial index would constrain both indexed columns, `city` and `revenue`, with the span `[/'new york'/100 - /'new york'/149]`.
I think this better shows the full potential of a 2-column partial index. But it's up to you if you want to include it. You may want to reword (or remove?) the `EXPLAIN SELECT city, revenue FROM rides WHERE revenue > 95;` example since that is similar to my suggested example, but it does a `FULL SCAN` because the first indexed column, `city`, is not constrained by the query filter. That is a crucial difference that may be worth calling out if both examples are on the page.
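The difference between the constrained span and the `FULL SCAN` can be sketched with a sorted list standing in for the partial index. This is an illustrative Python model, not CockroachDB behavior: the sample rides are invented, and `revenue` is assumed to be an integer so that `revenue < 150` corresponds to the upper bound 149 used in the span above.

```python
import bisect

# Hypothetical (city, revenue) rows from a rides table.
rides = [
    ("amsterdam", 90), ("boston", 130), ("new york", 50), ("new york", 85),
    ("new york", 100), ("new york", 120), ("new york", 149),
    ("new york", 150), ("new york", 300), ("seattle", 110),
]

# Model of the partial index on (city, revenue) WHERE revenue > 80:
# only qualifying rows are stored, ordered by (city, revenue).
index = sorted((city, rev) for city, rev in rides if rev > 80)

# city = 'new york' AND revenue >= 100 AND revenue < 150
# Both indexed columns are constrained, so only one contiguous span is read.
lo = bisect.bisect_left(index, ("new york", 100))
hi = bisect.bisect_right(index, ("new york", 149))
constrained = index[lo:hi]

# revenue > 95 with no city constraint: the leading column is unconstrained,
# so every index entry is examined and the filter is applied afterwards.
full_scan_matches = [(city, rev) for city, rev in index if rev > 95]

print(len(constrained), "entries read for the constrained span")
print(len(index), "entries read for the full scan of the partial index")
```

With this sample data the constrained scan reads three index entries, while the second query reads all nine entries in the partial index to return seven rows.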
4bbb4fb to 8713029
@mgartner Thanks for iterating on this! Just updated the PR again. See the latest commit.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @mgartner)
v20.2/partial-indexes.md, line 58 at r1 (raw file):
Previously, mgartner (Marcus Gartner) wrote…
Should the `CREATE TABLE` docs be updated so that the `index_def` syntax graph also has an `opt_where_clause` at the end?
Good catch! Just added it.
v20.2/partial-indexes.md, line 149 at r1 (raw file):
Previously, mgartner (Marcus Gartner) wrote…
Sorry for creating confusion.
This is a step in the right direction. Before, all the spans were `FULL SCAN`s, meaning that they scan the entire partial index. This is a useful case, but another useful example that I wanted to highlight would be a scan over just a small part of the partial index.
Now that you've added `city = 'new york'`, you can see that the span is constrained to `[/'new york' - /'new york']`. ✔️
If you want to take it a step further to show all the indexed columns being used to constrain the spans, you can change the query filter to `WHERE city = 'new york' AND revenue >= 100 AND revenue < 150`.
Because `revenue >= 100 AND revenue < 150` implies `revenue > 80`, the partial index can be used. But it will still need to apply the revenue filter to remove indexed rows where revenue is either above 80 but below 100, or 150 and above. Luckily, the revenue column is the second indexed column, and the first indexed column, city, is constrained to a single value by the `city = 'new york'` filter. So the scan over the partial index would constrain both indexed columns, `city` and `revenue`, with the span `[/'new york'/100 - /'new york'/149]`.
I think this better shows the full potential of a 2-column partial index. But it's up to you if you want to include it. You may want to reword (or remove?) the `EXPLAIN SELECT city, revenue FROM rides WHERE revenue > 95;` example since that is similar to my suggested example, but it does a `FULL SCAN` because the first indexed column, `city`, is not constrained by the query filter. That is a crucial difference that may be worth calling out if both examples are on the page.
Ahh. I see. I've updated the example again!
Reviewed 4 of 4 files at r3.
Reviewable status: complete! 1 of 0 LGTMs obtained
v20.2/partial-indexes.md, line 149 at r1 (raw file):
Previously, ericharmeling (Eric Harmeling) wrote…
Ahh. I see. I've updated the example again!
Looks great!
- just a couple of nits. Nice job!
Reviewable status: complete! 2 of 0 LGTMs obtained (waiting on @ericharmeling and @lnhsingh)
v20.2/partial-indexes.md, line 3 at r3 (raw file):
--- title: Partial Indexes summary: Partial indexes
nit: Add a more descriptive summary. A variation of the first sentence or two of the doc is good
v20.2/partial-indexes.md, line 7 at r3 (raw file):
--- <span class="version-tag">New in v20.2:</span> Partial indexes allow you to specify a subset of rows and columns to add to an [index](indexes.html). Partial indexes include the subset of rows in a table that evaluate to true on a boolean *predicate expression* (i.e. a `WHERE` filter) defined at [index creation](#creation).
nit: add `,` after `i.e.`
v20.2/partial-indexes.md, line 186 at r3 (raw file):
~~~ Note that query's `SELECT` statement queries all columns in the `rides` table, not just the indexed columns. As a result, an "index join" is required on both the primary index and the partial index.
"Note that query's" > "Note that the query's"
8713029 to c378ac4
Fixes #7747.
`CREATE INDEX` syntax diagram and parameters.
This PR will likely fix some future "opt" release note issues.