Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

opt,sql: ensure validation of implicitly-partitioned partial unique indexes is efficient #65473

Closed
rytaft opened this issue May 19, 2021 · 0 comments · Fixed by #66565
Closed
Assignees
Labels
A-multiregion Related to multi-region C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) C-performance Perf of queries or internals. Solution not expected to change functional behavior. T-multiregion T-sql-queries SQL Queries Team

Comments

@rytaft
Copy link
Collaborator

rytaft commented May 19, 2021

#65355 ensured that validation of implicitly partitioned full unique indexes is efficient. However, it did not improve efficiency for partial indexes. For example, validating the index UNIQUE (a) WHERE b > 0 results in this plan, which uses a hash aggregate:

validating unique constraint "uniq_idx" ("regional_by_row_table" [[a]]) with query "SELECT a FROM [73 AS tbl] WHERE a IS NOT NULL AND (b > 0:::INT8) GROUP BY a HAVING count(*) > 1  LIMIT 1" and plan

distribution: full
vectorized: false

• project
│ columns: (a)
│ estimated row count: 1 (missing stats)
│
└── • limit
    │ columns: (a, count_rows)
    │ estimated row count: 1 (missing stats)
    │ count: 1
    │
    └── • filter
        │ columns: (a, count_rows)
        │ estimated row count: 33 (missing stats)
        │ filter: count_rows > 1
        │
        └── • group
            │ columns: (a, count_rows)
            │ estimated row count: 98 (missing stats)
            │ aggregate 0: count_rows()
            │ group by: a
            │
            └── • project
                │ columns: (a)
                │
                └── • filter
                    │ columns: (a, b)
                    │ estimated row count: 333 (missing stats)
                    │ filter: b > 0
                    │
                    └── • scan
                          columns: (a, b)
                          estimated row count: 111 (missing stats)
                          table: regional_by_row_table@primary
                          spans: /"@"-/"\\xc0"/PrefixEnd

Epic CRDB-2528

@rytaft rytaft added C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) C-performance Perf of queries or internals. Solution not expected to change functional behavior. A-multiregion Related to multi-region T-multiregion T-sql-queries SQL Queries Team labels May 19, 2021
@rytaft rytaft self-assigned this Jun 2, 2021
rytaft added a commit to rytaft/cockroach that referenced this issue Jun 16, 2021
This commit adds a new exploration rule, EliminateIndexJoinInsideGroupBy.
EliminateIndexJoinInsideGroupBy removes an IndexJoin operator if it can be
proven that the removal does not affect the output of the parent grouping
operator. This is the case if:

1. Only columns from the index join's input are being used by the grouping
   operator.

2. The OrderingChoice of the grouping operator can be expressed with only
   columns from the index join's input. Or in other words, at least one column
   in every ordering group is one of the output columns from the index join's
   input.

This rule is useful when using partial indexes. When generating partial index
scans, expressions can be removed from filters because they exactly match
expressions in partial index predicates and there is no need to apply the
filter after the scan. Columns referenced in the removed expressions may no
longer need to be fetched.

Consider the example:

  CREATE TABLE t (i INT, s STRING, INDEX (i) WHERE s IN ('foo','bar'))

  SELECT DISTINCT i FROM t WHERE s IN ('foo','bar')

The normalized expression for the SELECT query is:

  distinct-on
   ├── columns: i:1
   ├── grouping columns: i:1
   └── select
        ├── columns: i:1 s:2!null
        ├── scan t
        │    └── columns: i:1 s:2
        └── filters
             └── s:2 IN ('foo','bar')

GeneratePartialIndexScans will generate this expression:

  distinct-on
   ├── columns: i:1
   ├── grouping columns: i:1
   └── index-join t
        ├── columns: i:1 s:2!null
        └── scan t@secondary,partial
             └── columns: i:1 rowid:4!null

The IndexJoin is created because the Select expression in the previous
expression required s in order to apply the (s IN ('foo','bar')) filter.
However, because rows in the partial index are already filtered by
(s IN ('foo','bar')), column s does not need to be fetched. The IndexJoin
can be eliminated, resulting in the expression:

  distinct-on
   ├── columns: i:1
   ├── grouping columns: i:1
   └── scan t@secondary,partial
        └── columns: i:1 rowid:4!null

This rule is needed to ensure that SplitGroupByScanIntoUnionScans can
fire in order to create an efficient plan to validate new implicitly
partitioned partial unique indexes.

Fixes cockroachdb#65473

Release note (performance improvement): Validation of a new partial
UNIQUE index in a REGIONAL BY ROW table no longer requires an inefficient
and memory-intensive hash aggregation query. The optimizer can now plan
the validation query so that it uses all streaming operations, which are
much more efficient.
craig bot pushed a commit that referenced this issue Jun 24, 2021
66565: opt: eliminate unnecessary index join inside group by r=rytaft a=rytaft

This commit adds a new exploration rule, `EliminateIndexJoinInsideGroupBy`.
`EliminateIndexJoinInsideGroupBy` removes an `IndexJoin` operator if it can be
proven that the removal does not affect the output of the parent grouping
operator. This is the case if:

1. Only columns from the index join's input are being used by the grouping
   operator.

2. The `OrderingChoice` of the grouping operator can be expressed with only
   columns from the index join's input. Or in other words, at least one column
   in every ordering group is one of the output columns from the index join's
   input.

This rule is useful when using partial indexes. When generating partial index
scans, expressions can be removed from filters because they exactly match
expressions in partial index predicates and there is no need to apply the
filter after the scan. Columns referenced in the removed expressions may no
longer need to be fetched.

Consider the example:
```
  CREATE TABLE t (i INT, s STRING, INDEX (i) WHERE s IN ('foo','bar'))

  SELECT DISTINCT i FROM t WHERE s IN ('foo','bar')
```
The normalized expression for the `SELECT` query is:
```
  distinct-on
   ├── columns: i:1
   ├── grouping columns: i:1
   └── select
        ├── columns: i:1 s:2!null
        ├── scan t
        │    └── columns: i:1 s:2
        └── filters
             └── s:2 IN ('foo','bar')
```
`GeneratePartialIndexScans` will generate this expression:
```
  distinct-on
   ├── columns: i:1
   ├── grouping columns: i:1
   └── index-join t
        ├── columns: i:1 s:2!null
        └── scan t@secondary,partial
             └── columns: i:1 rowid:4!null
```
The `IndexJoin` is created because the `Select` expression in the previous
expression required `s` in order to apply the `(s IN ('foo','bar'))` filter.
However, because rows in the partial index are already filtered by
`(s IN ('foo','bar'))`, column s does not need to be fetched. The `IndexJoin`
can be eliminated, resulting in the expression:
```
  distinct-on
   ├── columns: i:1
   ├── grouping columns: i:1
   └── scan t@secondary,partial
        └── columns: i:1 rowid:4!null
```
This rule is needed to ensure that `SplitGroupByScanIntoUnionScans` can
fire in order to create an efficient plan to validate new implicitly
partitioned partial unique indexes.

Fixes #65473

Release note (performance improvement): Validation of a new partial
`UNIQUE` index in a `REGIONAL BY ROW` table no longer requires an inefficient
and memory-intensive hash aggregation query. The optimizer can now plan
the validation query so that it uses all streaming operations, which are
much more efficient.

66840: backupccl: fix error during span-merging optimization r=pbardea a=pbardea

This commit fixes a bug where a backup may crash if a full backup with
revision history is taken on a table that has index-merging
opportunities on descriptor revisions that lie before the GC TTL of the
table.

Since the merging of index-spans during backup planning is entirely an
optimization (in an effort to reduce the number of spans we place a
protectedts over), it is safe to ignore such errors and treat it as
un-mergable.

Fixes #66797.

Release note (bug fix): Fix an error that backup would produce in some
full backups with revision history. Previously some full backups would
erroneously produce an error in the form of `batch timestamp
<some_timestamp> must be after replica GC threshold
<some_other_timestamp>`.

Co-authored-by: Rebecca Taft <[email protected]>
Co-authored-by: Paul Bardea <[email protected]>
@craig craig bot closed this as completed in 7cf7e67 Jun 24, 2021
rytaft added a commit to rytaft/cockroach that referenced this issue Jul 6, 2021
This commit updates the exploration rule EliminateIndexJoinInsideGroupBy
and renames it to EliminateIndexJoinOrProjectInsideGroupBy. The rule now
removes either an IndexJoin or Project operator if it can be proven that
the removal does not affect the output of the parent grouping operator.

Removal of a Project is needed in cases where the partial index predicate
constrains some columns to be constant, and therefore provides those
columns as constant projections. If the projected columns are not actually
needed by the GroupBy, however, the Project is not necessary and interferes
with other rules matching, such as SplitGroupByScanIntoUnionScans.

Informs cockroachdb#65473

Release note (performance improvement): Improved the efficiency of validation
for some partial unique indexes in REGIONAL BY ROW tables by improving the
query plan to use all streaming operations.
craig bot pushed a commit that referenced this issue Jul 6, 2021
66989: roachtest: check metrics during multi-region TPC-C runs r=ajstorm a=otan

See individual commits for details.

Resolves #65911
Resolves #66953

67263: opt,ccl: eliminate Project inside GroupBy, add tests for validation of partial unique indexes r=rytaft a=rytaft

**opt: add a test-only check for `NullsAreDistinct` in `GroupingPrivate`**

This commit adds a check that `NullsAreDistinct` is true iff the grouping
operator is `UpsertDistinctOn` or `EnsureUpsertDistinctOn`.

Release note: None

**opt: add exploration rule to eliminate `Project` inside `GroupBy`**

This commit updates the exploration rule `EliminateIndexJoinInsideGroupBy`
and renames it to `EliminateIndexJoinOrProjectInsideGroupBy`. The rule now
removes either an `IndexJoin` or `Project` operator if it can be proven that
the removal does not affect the output of the parent grouping operator.

Removal of a `Project` is needed in cases where the partial index predicate
constrains some columns to be constant, and therefore provides those
columns as constant projections. If the projected columns are not actually
needed by the `GroupBy`, however, the `Project` is not necessary and interferes
with other rules matching, such as `SplitGroupByScanIntoUnionScans`.

Informs #65473

Release note (performance improvement): Improved the efficiency of validation
for some partial unique indexes in `REGIONAL BY ROW` tables by improving the
query plan to use all streaming operations.

**ccl: add tests for validation of partial unique indexes**

This commit adds tests to ensure that the expected query plan is used
to validate new partial unique indexes in `REGIONAL BY ROW` tables.

Informs #65473

Release note: None

Co-authored-by: Oliver Tan <[email protected]>
Co-authored-by: Rebecca Taft <[email protected]>
@mgartner mgartner moved this to Done in SQL Queries Jul 24, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-multiregion Related to multi-region C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) C-performance Perf of queries or internals. Solution not expected to change functional behavior. T-multiregion T-sql-queries SQL Queries Team
Projects
Archived in project
Development

Successfully merging a pull request may close this issue.

1 participant