Partial indexes #8242

Merged
merged 1 commit into from
Sep 14, 2020

ericharmeling
Contributor

Fixes #7747.

  • Added a new page for Partial Indexes.
  • Updated CREATE INDEX syntax diagram and parameters.

This PR will likely fix some future "opt" release note issues.

@mgartner
Contributor

mgartner commented Sep 8, 2020

Not related to your changes, but I noticed that it isn't mentioned that inverted indexes can be created on ARRAY types here:
image

@mgartner mgartner left a comment

Looks great!!!

I left a few comments. Other than those, I think the only other thing potentially missing is that the optimizer is not perfect in proving that some query filters imply partial index predicates. From the RFC:

Note that CRDB, like Postgres, will perform a best-effort attempt to prove that a query filter expression implies a partial index predicate. It is not guaranteed to prove implication of arbitrarily complex expressions.

In other words, false negatives are possible (where a filter theoretically implies a predicate, but cannot be proven by the optimizer in practice). It should be very unlikely, but it is possible, and calling it out in our docs may help prevent confusion. Here's one example:

> CREATE TABLE t (a INT, b INT, c INT, INDEX (a) WHERE b = 1 OR c = 2 OR b = 3);
CREATE TABLE

Server Execution Time: 4.515ms
Network Latency: 1.406ms

> EXPLAIN SELECT a FROM t WHERE b IN (1, 3) OR c = 2;
    tree    |     field     |       description
------------+---------------+---------------------------
            | distribution  | full
            | vectorized    | false
  filter    |               |
   │        | filter        | (b IN (1, 3)) OR (c = 2)
   └── scan |               |
            | missing stats |
            | table         | t@primary
            | spans         | FULL SCAN
(8 rows)
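
To make the false negative concrete, here is a minimal sketch of one possible workaround. It assumes that a filter written with the same disjuncts as the predicate is easier for the optimizer to match. That is an assumption, not captured output, so the plans are left for the reader to check.

```sql
-- Same table as in the comment above: the partial index on (a) only
-- contains rows where b = 1 OR c = 2 OR b = 3.
CREATE TABLE t (a INT, b INT, c INT, INDEX (a) WHERE b = 1 OR c = 2 OR b = 3);

-- Logically equivalent to the predicate, but written with IN.
-- Per the plan shown above, this falls back to a full scan of t@primary.
EXPLAIN SELECT a FROM t WHERE b IN (1, 3) OR c = 2;

-- Assumption: repeating the predicate's own disjuncts makes the implication
-- simple to prove, so the partial index can be considered by the optimizer.
EXPLAIN SELECT a FROM t WHERE b = 1 OR c = 2 OR b = 3;
```

Comparing the two `EXPLAIN` results on a test cluster would show whether the rewrite changes the chosen index.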

Multi-column indexes | ✓ | Common Extension | We do not limit on the number of columns indexes can include
Covering indexes | ✓ | Common Extension | [Storing Columns documentation](create-index.html#store-columns)
Inverted indexes | ✓ | Common Extension | [Inverted Indexes documentation](inverted-indexes.html)
Partial indexes | ✓ | Common Extension | [Partial indexes documentation](partial-indexes.html)
Multiple indexes per query | Planned | Common Extension | Use multiple indexes to filter the table's values for a single query
@mgartner mgartner Sep 8, 2020

Unrelated to partial indexes but I noticed this "Multiple indexes per query" is marked planned. Since cockroachdb/cockroach#2142 was fixed by cockroachdb/cockroach#47094, there is a case where a single query can use multiple indexes (example below). This will be new in 20.2.

> create table t (k int primary key, a int, b int, index a_idx (a), index b_idx (b));
CREATE TABLE

Server Execution Time: 2.755ms
Network Latency: 810µs

> explain select k from t where a = 10 or b = 20;
          tree         |     field     | description
-----------------------+---------------+--------------
                       | distribution  | local
                       | vectorized    | false
  distinct             |               |
   │                   | distinct on   | k
   └── union all       |               |
        ├── index join |               |
        │    │         | table         | t@primary
        │    └── scan  |               |
        │              | missing stats |
        │              | table         | t@a_idx
        │              | spans         | [/10 - /10]
        └── index join |               |
             │         | table         | t@primary
             └── scan  |               |
                       | missing stats |
                       | table         | t@b_idx
                       | spans         | [/20 - /20]
(17 rows)

Server Execution Time: 154µs
Network Latency: 311µs


- They contain fewer rows than full indexes, making them less expensive to create and store on a cluster.
- Read queries on rows included in a partial index only scan the rows in the partial index. This contrasts with queries on columns in full indexes, which must scan all rows in the indexed column.
- Write queries on rows implied by a partial index only modify rows in the partial index. This contrasts with write queries on columns in full indexes, which must modify the larger set of rows that make up a full-column index.
@mgartner mgartner Sep 8, 2020

I think this last bullet point is confusing. The advantage of partial indexes with regard to writes is that the overhead of writing to an index is only incurred for rows that must be added to or removed from the partial index, whereas a non-partial index incurs this overhead for every row. For example, if we have an INDEX (a) WHERE b = 'foo', and we INSERT INTO t (a, b) VALUES (1, 'bar'), there is no overhead of writing to the partial index because that row does not belong to it.

Would something like the following be clearer?

With a partial index, write queries only incur the overhead of an index write when the row satisfies the predicate. This contrasts with full indexes, which incur the overhead of an index write for all rows when the indexed column is modified.
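
The write-overhead point can be sketched with the table from the comment above. This is only an illustration: the table definition is hypothetical and the comments describe the expected index maintenance, not measured behavior.

```sql
-- Hypothetical table; the index definition follows the example in the comment.
CREATE TABLE t (a INT, b STRING, INDEX (a) WHERE b = 'foo');

-- b = 'bar' does not satisfy the predicate, so no partial index entry is
-- written. Only the primary index is updated.
INSERT INTO t (a, b) VALUES (1, 'bar');

-- b = 'foo' satisfies the predicate, so this row is also written to the
-- partial index.
INSERT INTO t (a, b) VALUES (2, 'foo');

-- The row stops satisfying the predicate, so its partial index entry
-- must be removed.
UPDATE t SET b = 'bar' WHERE a = 2;
```

With a full index on `a`, all three statements would pay for an index write.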

- [Functions](functions-and-operators.html) used in predicates must be immutable. For example, the `now()` function is not allowed in predicates because its value depends on more than its arguments.

{{site.data.alerts.callout_info}}
Partial indexes cannot be created at [table creation](create-table.html).

It should be possible to create them in a CREATE TABLE statement. Let me know if you ran into a case that didn't work.

> create table t (a int, index (a) where a > 0);
CREATE TABLE

Server Execution Time: 3.143ms
Network Latency: 998µs

> show create table t;
  table_name |               create_statement
-------------+------------------------------------------------
  t          | CREATE TABLE public.t (
             |     a INT8 NULL,
             |     INDEX t_a_idx (a ASC) WHERE a > 0:::INT8,
             |     FAMILY "primary" (a, rowid)
             | )
(1 row)

Server Execution Time: 5.263ms
Network Latency: 231µs

{{site.data.alerts.end}}

{{site.data.alerts.callout_info}}
CockroachDB returns an error if there are multiple unique or exclusion constraints matching the `ON CONFLICT` specification. See [tracking issue](https://github.com/cockroachdb/cockroach/issues/53170).

This is only the case for ON CONFLICT ... DO UPDATE, but not for ON CONFLICT ... DO NOTHING. There should be no issues with ON CONFLICT ... DO NOTHING.

We'll probably also want to document the new WHERE clause syntax in the INSERT ON CONFLICT statement. There are some examples here. This is particularly confusing, so I'm happy to explain more.
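
As a rough sketch of the `WHERE` clause in `ON CONFLICT`, the statements below assume CockroachDB accepts the same conflict-target form as PostgreSQL, where a predicate after the conflict columns selects a partial unique index. The table and values are hypothetical, and only `DO NOTHING` is shown because of the `DO UPDATE` limitation discussed above.

```sql
-- Hypothetical table: a must be unique only among rows where b > 0.
CREATE TABLE t (a INT, b INT);
CREATE UNIQUE INDEX ON t (a) WHERE b > 0;

INSERT INTO t (a, b) VALUES (1, 10);

-- The WHERE clause after the conflict columns identifies the partial unique
-- index used to detect conflicts. This row has the same a as (1, 10) and
-- also satisfies b > 0, so the insert is expected to be skipped.
INSERT INTO t (a, b) VALUES (1, 20) ON CONFLICT (a) WHERE b > 0 DO NOTHING;
```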

Ahh I see now that you have this correct below in "Known Limitations". I think this Note should include the "DO UPDATE" clarification or be removed.


{% include copy-clipboard.html %}
~~~ sql
> CREATE INDEX ON rides (city, revenue) WHERE revenue > 80;

[nit] there is no query plan in this section that takes advantage of revenue as an indexed column. If you added another example like SELECT * FROM rides WHERE city = 'new york' AND revenue >= 100 AND revenue < 150, the query plan should be a constrained scan over the partial index, rather than a FULL SCAN.

I'm not sure it's necessary but it might be a nice example to highlight.

@ericharmeling ericharmeling left a comment

TFTR, @mgartner!

I think I addressed all of your comments. I also added a note about the false negatives. I don't think we need to point out an example, but I agree that adding a disclaimer note will be helpful. Those changes are all in the "mgartner feedback" commit.

re: docs unrelated to partial indexes, I've added some simple updates to separate commits. I'd prefer to have separate PRs for unrelated docs updates, especially if they are more involved, but separate commits for these small updates should be fine.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @mgartner)


v20.2/partial-indexes.md, line 19 at r1 (raw file):

Previously, mgartner (Marcus Gartner) wrote…

I think this last bullet point is confusing. The advantage of partial indexes with regard to writes is that the overhead of writing to an index is only incurred for rows that must be added to or removed from the partial index, whereas a non-partial index incurs this overhead for every row. For example, if we have an INDEX (a) WHERE b = 'foo', and we INSERT INTO t (a, b) VALUES (1, 'bar'), there is no overhead of writing to the partial index because that row does not belong to it.

Would something like the following be clearer?

With a partial index, write queries only incur the overhead of an index write when the row satisfies the predicate. This contrasts with full indexes, which incur the overhead of an index write for all rows when the indexed column is modified.

Gotcha. I rewrote that last bullet, using a lot of your wording.


v20.2/partial-indexes.md, line 58 at r1 (raw file):

Previously, mgartner (Marcus Gartner) wrote…

It should be possible to create them in a CREATE TABLE statement. Let me know if you ran into a case that didn't work.

> create table t (a int, index (a) where a > 0);
CREATE TABLE

Server Execution Time: 3.143ms
Network Latency: 998µs

> show create table t;
  table_name |               create_statement
-------------+------------------------------------------------
  t          | CREATE TABLE public.t (
             |     a INT8 NULL,
             |     INDEX t_a_idx (a ASC) WHERE a > 0:::INT8,
             |     FAMILY "primary" (a, rowid)
             | )
(1 row)

Server Execution Time: 5.263ms
Network Latency: 231µs

Ah! Okay. Removed this note. Didn't run into any issues.


v20.2/partial-indexes.md, line 80 at r1 (raw file):

Previously, mgartner (Marcus Gartner) wrote…

Ahh I see now that you have this correct below in "Known Limitations". I think this Note should include the "DO UPDATE" clarification or be removed.

I updated the Known Limitations bullet and removed the note.


v20.2/partial-indexes.md, line 149 at r1 (raw file):

there is no query plan in this section that takes advantage of revenue as an indexed column.

I'm not sure I understand what you mean by this. All of the queries filter on revenue. Do you mean "takes advantage of city as an indexed column"?

I added an example that includes the city in the filter clause.


v20.2/sql-feature-support.md, line 80 at r1 (raw file):

Previously, mgartner (Marcus Gartner) wrote…

Unrelated to partial indexes but I noticed this "Multiple indexes per query" is marked planned. Since cockroachdb/cockroach#2142 was fixed by cockroachdb/cockroach#47094, there is a case where a single query can use multiple indexes (example below). This will be new in 20.2.

> create table t (k int primary key, a int, b int, index a_idx (a), index b_idx (b));
CREATE TABLE

Server Execution Time: 2.755ms
Network Latency: 810µs

> explain select k from t where a = 10 or b = 20;
          tree         |     field     | description
-----------------------+---------------+--------------
                       | distribution  | local
                       | vectorized    | false
  distinct             |               |
   │                   | distinct on   | k
   └── union all       |               |
        ├── index join |               |
        │    │         | table         | t@primary
        │    └── scan  |               |
        │              | missing stats |
        │              | table         | t@a_idx
        │              | spans         | [/10 - /10]
        └── index join |               |
             │         | table         | t@primary
             └── scan  |               |
                       | missing stats |
                       | table         | t@b_idx
                       | spans         | [/20 - /20]
(17 rows)

Server Execution Time: 154µs
Network Latency: 311µs

I don't see a docs issue opened for this, so I'll just sneak an update to this table into this PR, marking this support as "partial" (in a separate commit).

I'd rather document updates for this fully in separate PRs. I opened an issue for this: #8260.

@mgartner mgartner left a comment

Reviewed 2 of 3 files at r2.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ericharmeling)


v20.2/partial-indexes.md, line 58 at r1 (raw file):

Previously, ericharmeling (Eric Harmeling) wrote…

Ah! Okay. Removed this note. Didn't run into any issues.

Should the CREATE TABLE docs be updated so that the index_def syntax graph also has an opt_where_clause at the end?


v20.2/partial-indexes.md, line 149 at r1 (raw file):

Previously, ericharmeling (Eric Harmeling) wrote…

there is no query plan in this section that takes advantage of revenue as an indexed column.

I'm not sure I understand what you mean by this. All of the queries filter on revenue. Do you mean "takes advantage of city as an indexed column"?

I added an example that includes the city in the filter clause.

Sorry for creating confusion.

This is a step in the right direction. Before, all the spans were FULL SCANs, meaning that they scanned the entire partial index. This is a useful case, but another useful example that I wanted to highlight is a scan over just a small part of the partial index.

Now that you've added city = 'new york', you can see that the span is constrained to [/'new york' - /'new york']. ✔️

If you want to take it a step further to show all the indexed columns being used to constrain the spans, you can change the query filter to WHERE city = 'new york' AND revenue >= 100 AND revenue < 150.

Because revenue >= 100 AND revenue < 150 implies revenue > 80, the partial index can be used. However, the revenue filter still needs to be applied to remove indexed rows where revenue is between 80 and 99, or 150 and above. Luckily, the revenue column is the second indexed column, and the first indexed column, city, is constrained to a single value by the city = 'new york' filter. So the scan over the partial index would constrain both indexed columns, city and revenue, with the span [/'new york'/100 - /'new york'/149].

I think this better shows the full potential of a 2-column partial index. But it's up to you if you want to include it. You may want to reword (or remove?) the EXPLAIN SELECT city, revenue FROM rides WHERE revenue > 95; example since that is similar to my suggested example, but it does a FULL SCAN because the first indexed column, city, is not constrained by the query filter—a crucial difference that may be worth calling out if both examples are on the page.
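
The contrast described above can be sketched as two `EXPLAIN` statements against the same partial index. This assumes a `rides` table with `city` and `revenue` columns, as in the page under review. The expected spans in the comments come from the discussion above and were not re-run here.

```sql
-- Partial index from the page under review.
CREATE INDEX ON rides (city, revenue) WHERE revenue > 80;

-- The filter implies revenue > 80, so the partial index qualifies, but city
-- (the first indexed column) is unconstrained: expect a FULL SCAN of the index.
EXPLAIN SELECT city, revenue FROM rides WHERE revenue > 95;

-- city is pinned to a single value and revenue to a range, so both indexed
-- columns can constrain the scan: expect a span such as
-- [/'new york'/100 - /'new york'/149].
EXPLAIN SELECT * FROM rides
  WHERE city = 'new york' AND revenue >= 100 AND revenue < 150;
```

Because the second query selects all columns, its plan would also be expected to include an index join back to the primary index.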

@ericharmeling ericharmeling left a comment

@mgartner Thanks for iterating on this! Just updated the PR again. See the latest commit.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @mgartner)


v20.2/partial-indexes.md, line 58 at r1 (raw file):

Previously, mgartner (Marcus Gartner) wrote…

Should the CREATE TABLE docs be updated so that the index_def syntax graph also has an opt_where_clause at the end?

Good catch! Just added it.


v20.2/partial-indexes.md, line 149 at r1 (raw file):

Previously, mgartner (Marcus Gartner) wrote…

Sorry for creating confusion.

This is a step in the right direction. Before, all the spans were FULL SCANs, meaning that they scanned the entire partial index. This is a useful case, but another useful example that I wanted to highlight is a scan over just a small part of the partial index.

Now that you've added city = 'new york', you can see that the span is constrained to [/'new york' - /'new york']. ✔️

If you want to take it a step further to show all the indexed columns being used to constrain the spans, you can change the query filter to WHERE city = 'new york' AND revenue >= 100 AND revenue < 150.

Because revenue >= 100 AND revenue < 150 implies revenue > 80, the partial index can be used. However, the revenue filter still needs to be applied to remove indexed rows where revenue is between 80 and 99, or 150 and above. Luckily, the revenue column is the second indexed column, and the first indexed column, city, is constrained to a single value by the city = 'new york' filter. So the scan over the partial index would constrain both indexed columns, city and revenue, with the span [/'new york'/100 - /'new york'/149].

I think this better shows the full potential of a 2-column partial index. But it's up to you if you want to include it. You may want to reword (or remove?) the EXPLAIN SELECT city, revenue FROM rides WHERE revenue > 95; example since that is similar to my suggested example, but it does a FULL SCAN because the first indexed column, city, is not constrained by the query filter—a crucial difference that may be worth calling out if both examples are on the page.

Ahh. I see. I've updated the example again!

@mgartner mgartner left a comment

:lgtm_strong: This looks great!

Reviewed 4 of 4 files at r3.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained


v20.2/partial-indexes.md, line 149 at r1 (raw file):

Previously, ericharmeling (Eric Harmeling) wrote…

Ahh. I see. I've updated the example again!

Looks great!

@lnhsingh lnhsingh left a comment

:lgtm: - just a couple of nits. Nice job!

Reviewable status: :shipit: complete! 2 of 0 LGTMs obtained (waiting on @ericharmeling and @lnhsingh)


v20.2/partial-indexes.md, line 3 at r3 (raw file):

---
title: Partial Indexes
summary: Partial indexes

nit: Add a more descriptive summary. A variation of the first sentence or two of the doc is good


v20.2/partial-indexes.md, line 7 at r3 (raw file):

---

<span class="version-tag">New in v20.2:</span> Partial indexes allow you to specify a subset of rows and columns to add to an [index](indexes.html). Partial indexes include the subset of rows in a table that evaluate to true on a boolean *predicate expression* (i.e. a `WHERE` filter) defined at [index creation](#creation).

nit: add , after i.e.


v20.2/partial-indexes.md, line 186 at r3 (raw file):

~~~

Note that query's `SELECT` statement queries all columns in the `rides` table, not just the indexed columns. As a result, an "index join" is required on both the primary index and the partial index.

Note that query's > Note that the query's

@ericharmeling ericharmeling merged commit ca166f2 into master Sep 14, 2020
@ericharmeling ericharmeling deleted the partial-indexes branch September 14, 2020 23:13
Successfully merging this pull request may close these issues.

Partial indexes
4 participants