Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

opt: enable usage of inverted index with OR clauses #47340

Closed
andy-kimball opened this issue Apr 10, 2020 · 3 comments
Closed

opt: enable usage of inverted index with OR clauses #47340

andy-kimball opened this issue Apr 10, 2020 · 3 comments
Assignees

Comments

@andy-kimball
Copy link
Contributor

This is a new issue derived from comments made by @Bessonov on issue #2142.

CREATE TABLE t (
	id UUID NOT NULL DEFAULT gen_random_uuid(),
	perms JSONB NOT NULL,
	CONSTRAINT "primary" PRIMARY KEY (id ASC),
	INVERTED INDEX ix_perms (perms)
);

> explain (opt) select * from t where perms->'scope'->'xx2000' @> '[{"type":"column"}]' or perms->'scope'->'xx4000' @> '[{"type":"column"}]';
                                                               text                                                               
+--------------------------------------------------------------------------------------------------------------------------------+
  select                                                                                                                          
   ├── scan t                                                                                                                     
   └── filters                                                                                                                    
        └── (perms @> '{"scope": {"xx2000": [{"type": "column"}]}}') OR (perms @> '{"scope": {"xx4000": [{"type": "column"}]}}')

EXPECTED: The ix_perms index should be used.
ACTUAL: The primary index is used instead.

We could fix this in one of two ways:

  1. Extend the work for sql: better query planning for OR expressions #2142 to use UNION in the case where both OR predicates reference the same columns. With regular indexes, that's not necessary because index selection code creates multiple spans for this case. But with inverted indexes, multiple spans can create duplicates.
  2. Update inverted index selection code to generate multiple spans; in this case, we'd need to add a DISTINCT operator to remove any duplicates.
@steeling
Copy link

Do you know what the SQL query would be that takes advantage of this feature on inverted indexes? Would it be:

SELECT * FROM t WHERE t.int_arr && ARRAY[1,2,3];

@mgartner
Copy link
Collaborator

mgartner commented May 22, 2020

Initially, this feature would support using inverted indexes when there is an explicit OR disjunction, like:

SELECT * FROM t WHERE t.int_arr @> ARRAY[1] OR t.int_arr @> ARRAY[2]

&& is the "overlap" operator, which is an implicit disjunction—it returns true if the left and right have at least one element in common. We should be able to extend this disjunction work to also optimize the && operator.

Ideally the new "InvertedExpr" (work in progress) can handle an arbitrary number of disjunctions, which should make this straight forward. If necessary, we could normalize a && ARRAY[1, 2] to a @> ARRAY[1] OR a @> ARRAY[2].

rytaft added a commit to rytaft/cockroach that referenced this issue Nov 16, 2020
This commit adds support for planning and executing an inverted index
scan + invertedFilter when there is a contains (@>) predicate on a JSON or
array column that has an inverted index. Similar to spatial columns, the
optimizer may be able to plan an inverted index scan + invertedFilter
for JSON and Array columns even if the filter condition is complex, with
multiple predicates combined with AND and OR expressions.

Informs cockroachdb#47340

Release note (performance improvement): The optimizer now supports using
an inverted index on JSON or Array columns for a wider variety of filter
predicates. Previously, inverted index usage was only supported for simple
predicates (e.g., `WHERE a @> '{"foo": "bar"}'), but now more complicated
predicates are supported by combining simple contains (@>) expressions with
AND and OR (e.g., `WHERE a @> '{"foo": "bar"}' OR a @> '{"foo": "baz"}'`).
An inverted index will be used if it is available on the filtered column and
the optimizer expects it to be more efficient than any alternative plan.
This may result in performance improvements for queries involving JSON and
Array columns.
rytaft added a commit to rytaft/cockroach that referenced this issue Nov 18, 2020
This commit adds support for planning and executing an inverted index
scan + invertedFilter when there is a contains (@>) predicate on a JSON or
array column that has an inverted index. Similar to spatial columns, the
optimizer may be able to plan an inverted index scan + invertedFilter
for JSON and Array columns even if the filter condition is complex, with
multiple predicates combined with AND and OR expressions.

Informs cockroachdb#47340

Release note (performance improvement): The optimizer now supports using
an inverted index on JSON or Array columns for a wider variety of filter
predicates. Previously, inverted index usage was only supported for simple
predicates (e.g., `WHERE a @> '{"foo": "bar"}'), but now more complicated
predicates are supported by combining simple contains (@>) expressions with
AND and OR (e.g., `WHERE a @> '{"foo": "bar"}' OR a @> '{"foo": "baz"}'`).
An inverted index will be used if it is available on the filtered column and
the optimizer expects it to be more efficient than any alternative plan.
This may result in performance improvements for queries involving JSON and
Array columns.
rytaft added a commit to rytaft/cockroach that referenced this issue Nov 19, 2020
This commit adds support for planning and executing an inverted index
scan + invertedFilter when there is a contains (@>) predicate on a JSON or
array column that has an inverted index. Similar to spatial columns, the
optimizer may be able to plan an inverted index scan + invertedFilter
for JSON and Array columns even if the filter condition is complex, with
multiple predicates combined with AND and OR expressions.

Informs cockroachdb#47340

Release note (performance improvement): The optimizer now supports using
an inverted index on JSON or Array columns for a wider variety of filter
predicates. Previously, inverted index usage was only supported for simple
predicates (e.g., `WHERE a @> '{"foo": "bar"}'), but now more complicated
predicates are supported by combining simple contains (@>) expressions with
AND and OR (e.g., `WHERE a @> '{"foo": "bar"}' OR a @> '{"foo": "baz"}'`).
An inverted index will be used if it is available on the filtered column and
the optimizer expects it to be more efficient than any alternative plan.
This may result in performance improvements for queries involving JSON and
Array columns.
rytaft added a commit to rytaft/cockroach that referenced this issue Nov 19, 2020
This commit adds support for planning and executing an inverted index
scan + invertedFilter when there is a contains (@>) predicate on a JSON or
array column that has an inverted index. Similar to spatial columns, the
optimizer may be able to plan an inverted index scan + invertedFilter
for JSON and Array columns even if the filter condition is complex, with
multiple predicates combined with AND and OR expressions.

Informs cockroachdb#47340

Release note (performance improvement): The optimizer now supports using
an inverted index on JSON or Array columns for a wider variety of filter
predicates. Previously, inverted index usage was only supported for simple
predicates (e.g., `WHERE a @> '{"foo": "bar"}'), but now more complicated
predicates are supported by combining simple contains (@>) expressions with
AND and OR (e.g., `WHERE a @> '{"foo": "bar"}' OR a @> '{"foo": "baz"}'`).
An inverted index will be used if it is available on the filtered column and
the optimizer expects it to be more efficient than any alternative plan.
This may result in performance improvements for queries involving JSON and
Array columns.
rytaft added a commit to rytaft/cockroach that referenced this issue Nov 23, 2020
This commit adds support for planning and executing an inverted index
scan + invertedFilter when there is a contains (@>) predicate on a JSON or
array column that has an inverted index. Similar to spatial columns, the
optimizer may be able to plan an inverted index scan + invertedFilter
for JSON and Array columns even if the filter condition is complex, with
multiple predicates combined with AND and OR expressions.

Informs cockroachdb#47340

Release note (performance improvement): The optimizer now supports using
an inverted index on JSON or Array columns for a wider variety of filter
predicates. Previously, inverted index usage was only supported for simple
predicates (e.g., `WHERE a @> '{"foo": "bar"}'), but now more complicated
predicates are supported by combining simple contains (@>) expressions with
AND and OR (e.g., `WHERE a @> '{"foo": "bar"}' OR a @> '{"foo": "baz"}'`).
An inverted index will be used if it is available on the filtered column and
the optimizer expects it to be more efficient than any alternative plan.
This may result in performance improvements for queries involving JSON and
Array columns.
rytaft added a commit to rytaft/cockroach that referenced this issue Nov 25, 2020
This commit adds support for planning and executing an inverted index
scan + invertedFilter when there is a contains (@>) predicate on a JSON or
array column that has an inverted index. Similar to spatial columns, the
optimizer may be able to plan an inverted index scan + invertedFilter
for JSON and Array columns even if the filter condition is complex, with
multiple predicates combined with AND and OR expressions.

Informs cockroachdb#47340

Release note (performance improvement): The optimizer now supports using
an inverted index on JSON or Array columns for a wider variety of filter
predicates. Previously, inverted index usage was only supported for simple
predicates (e.g., `WHERE a @> '{"foo": "bar"}'), but now more complicated
predicates are supported by combining simple contains (@>) expressions with
AND and OR (e.g., `WHERE a @> '{"foo": "bar"}' OR a @> '{"foo": "baz"}'`).
An inverted index will be used if it is available on the filtered column and
the optimizer expects it to be more efficient than any alternative plan.
This may result in performance improvements for queries involving JSON and
Array columns.
craig bot pushed a commit that referenced this issue Nov 25, 2020
56732: sql,util,opt: add support for inverted filter with JSON and Array columns r=rytaft a=rytaft

**sql,util: return `unique` boolean from `EncodeContainingInvertedIndexSpans`**

This commit adds a result parameter for `EncodeContainingInvertedIndexSpans`
called `unique`. The function returns `unique=true` if the spans are
guaranteed not to produce duplicate primary keys. Otherwise, returns
`unique=false`. The logic is based on the fact that spans for a given JSON
or Array datum will not produce duplicate primary keys if there is exactly one
JSON or Array path ending in a scalar (i.e., not an empty object or array).

**sql,opt: add support for inverted filter with JSON and Array columns**

This commit adds support for planning and executing an inverted index
scan + invertedFilter when there is a contains (`@>`) predicate on a JSON or
array column that has an inverted index. Similar to spatial columns, the
optimizer may be able to plan an inverted index scan + invertedFilter
for JSON and Array columns even if the filter condition is complex, with
multiple predicates combined with AND and OR expressions.

Informs #47340

Release note (performance improvement): The optimizer now supports using
an inverted index on JSON or Array columns for a wider variety of filter
predicates. Previously, inverted index usage was only supported for simple
predicates (e.g., `WHERE a @> '{"foo": "bar"}'`), but now more complicated
predicates are supported by combining simple contains (@>) expressions with
AND and OR (e.g., `WHERE a @> '{"foo": "bar"}' OR a @> '{"foo": "baz"}'`).
An inverted index will be used if it is available on the filtered column and
the optimizer expects it to be more efficient than any alternative plan.
This may result in performance improvements for queries involving JSON and
Array columns.

57050: storage: delete mvccGetInternal, use mvccGet directly r=nvanbenschoten a=nvanbenschoten

This commit deletes a major chunk of legacy code in MVCC mutation code. `mvccPutInternal` was avoiding using mvccGet because this used to result in a cgo call. This led to a large amount of duplicate logic. Now that `mvccGet` calls into the `pebbleIterator` in Go, using it is much less expensive, so we can remove the re-implementation.

With a bit of tuning in follow-on commits, this results in a negligible performance hit which is well worth the elimination of duplicate code:
```
name                                                  old time/op    new time/op    delta
MVCCConditionalPut_Pebble/Create/valueSize=10-16        7.33µs ± 8%    7.04µs ± 7%     ~     (p=0.075 n=10+10)
MVCCConditionalPut_Pebble/Create/valueSize=100-16       7.99µs ± 8%    7.89µs ± 8%     ~     (p=0.631 n=10+10)
MVCCConditionalPut_Pebble/Create/valueSize=1000-16      8.51µs ± 7%    8.47µs ± 4%     ~     (p=0.529 n=10+10)
MVCCConditionalPut_Pebble/Create/valueSize=10000-16     23.7µs ± 6%    23.8µs ± 5%     ~     (p=0.897 n=10+8)
MVCCConditionalPut_Pebble/Replace/valueSize=10-16       12.0µs ± 6%    12.0µs ± 7%     ~     (p=0.704 n=10+9)
MVCCConditionalPut_Pebble/Replace/valueSize=100-16      13.2µs ± 7%    13.0µs ± 9%     ~     (p=0.604 n=9+10)
MVCCConditionalPut_Pebble/Replace/valueSize=10000-16    28.8µs ± 7%    30.2µs ± 3%   +4.91%  (p=0.008 n=10+9)
MVCCConditionalPut_Pebble/Replace/valueSize=1000-16     11.5µs ± 6%    12.1µs ± 6%   +5.22%  (p=0.007 n=10+10)

name                                                  old speed      new speed      delta
MVCCConditionalPut_Pebble/Create/valueSize=10-16      1.37MB/s ± 8%  1.42MB/s ± 8%     ~     (p=0.085 n=10+10)
MVCCConditionalPut_Pebble/Create/valueSize=100-16     12.5MB/s ± 7%  12.7MB/s ± 9%     ~     (p=0.645 n=10+10)
MVCCConditionalPut_Pebble/Create/valueSize=1000-16     118MB/s ± 8%   118MB/s ± 4%     ~     (p=0.529 n=10+10)
MVCCConditionalPut_Pebble/Create/valueSize=10000-16    423MB/s ± 6%   421MB/s ± 5%     ~     (p=0.897 n=10+8)
MVCCConditionalPut_Pebble/Replace/valueSize=10-16      835kB/s ± 5%   833kB/s ± 7%     ~     (p=0.734 n=10+9)
MVCCConditionalPut_Pebble/Replace/valueSize=100-16    7.56MB/s ± 7%  7.69MB/s ±10%     ~     (p=0.619 n=9+10)
MVCCConditionalPut_Pebble/Replace/valueSize=10000-16   347MB/s ± 7%   331MB/s ± 3%   -4.80%  (p=0.008 n=10+9)
MVCCConditionalPut_Pebble/Replace/valueSize=1000-16   87.1MB/s ± 6%  82.8MB/s ± 6%   -4.96%  (p=0.007 n=10+10)

name                                                  old alloc/op   new alloc/op   delta
MVCCConditionalPut_Pebble/Create/valueSize=10-16          366B ± 3%      366B ± 2%     ~     (p=1.000 n=10+10)
MVCCConditionalPut_Pebble/Create/valueSize=100-16         909B ± 2%      905B ± 2%     ~     (p=0.492 n=10+10)
MVCCConditionalPut_Pebble/Create/valueSize=1000-16      6.42kB ± 2%    6.46kB ± 1%     ~     (p=0.197 n=10+10)
MVCCConditionalPut_Pebble/Create/valueSize=10000-16     79.8kB ± 8%    77.7kB ±10%     ~     (p=0.190 n=10+10)
MVCCConditionalPut_Pebble/Replace/valueSize=1000-16     8.37kB ± 6%    9.05kB ± 4%   +8.20%  (p=0.000 n=10+9)
MVCCConditionalPut_Pebble/Replace/valueSize=10-16         390B ± 3%      435B ± 4%  +11.41%  (p=0.000 n=10+10)
MVCCConditionalPut_Pebble/Replace/valueSize=10000-16    72.0kB ± 7%    81.0kB ± 3%  +12.38%  (p=0.000 n=9+10)
MVCCConditionalPut_Pebble/Replace/valueSize=100-16      1.02kB ± 3%    1.16kB ± 2%  +14.05%  (p=0.000 n=10+10)

name                                                  old allocs/op  new allocs/op  delta
MVCCConditionalPut_Pebble/Create/valueSize=10-16          2.00 ± 0%      2.00 ± 0%     ~     (all equal)
MVCCConditionalPut_Pebble/Create/valueSize=100-16         2.00 ± 0%      2.00 ± 0%     ~     (all equal)
MVCCConditionalPut_Pebble/Create/valueSize=1000-16        2.00 ± 0%      2.00 ± 0%     ~     (all equal)
MVCCConditionalPut_Pebble/Create/valueSize=10000-16       3.00 ± 0%      3.00 ± 0%     ~     (all equal)
MVCCConditionalPut_Pebble/Replace/valueSize=10-16         4.00 ± 0%      4.00 ± 0%     ~     (all equal)
MVCCConditionalPut_Pebble/Replace/valueSize=100-16        4.00 ± 0%      4.00 ± 0%     ~     (all equal)
MVCCConditionalPut_Pebble/Replace/valueSize=1000-16       4.00 ± 0%      4.00 ± 0%     ~     (all equal)
MVCCConditionalPut_Pebble/Replace/valueSize=10000-16      6.00 ± 0%      6.00 ± 0%     ~     (all equal)
```

If we want to continue pulling on this – 2 of the 4 remaining heap allocations here are due to the key copy in [`pebble.Iterator.SeekPrefixGE`](https://github.com/cockroachdb/pebble/blob/2da62788eedefce8dfbeef56a2324603606e18fb/iterator.go#L386). Caching this buffer across calls to `SeekPrefixGE` seems trivial. Caching the buffer across other method calls seems trickier due to the way we use the existence of the slice to indicate "prefix iteration". Maybe someone closer to this code will have good ideas about what to do about this.

Co-authored-by: Rebecca Taft <[email protected]>
Co-authored-by: Nathan VanBenschoten <[email protected]>
mgartner added a commit to mgartner/cockroach that referenced this issue Jan 22, 2021
GenerateInvertedIndexScans now generates InvertedConstraints from
equality expressions with a JSON FetchVal operator on the left side and
a constant on the right side, like `j->'a' = '1'`. Previously
Constraints were built for these expressions with the `idxconstraint`
package.

By building InvertedConstraints from FetchVal operators, the optimizer
now has the ability to generate inverted index scans with conjunctive
and disjunctive filters containing FetchVal operators. For example, the
optimizer will plan an inverted index scan with the query below.

    CREATE TABLE t (k INT PRIMARY KEY, j JSON, INVERTED INDEX (j))

    SELECT k FROM t WHERE j->'a' = '1' AND j->'b' = 2

This change also brings us a step closer to cleaning up the
`idxconstraint` by removing code related to inverted indexes.

Informs cockroachdb#47340

Release note (performance improvement): The query optimizer now plans
scans over inverted indexes on JSON columns for query filters that
constrain the JSON column with equality and fetch value operators (`->`)
inside conjunctions and disjunctions, like
`j->'a' = '1' AND j->'b' = '2'`.
mgartner added a commit to mgartner/cockroach that referenced this issue Jan 26, 2021
GenerateInvertedIndexScans now generates InvertedConstraints from
equality expressions with a JSON FetchVal operator on the left side and
a constant on the right side, like `j->'a' = '1'`. Previously
Constraints were built for these expressions with the `idxconstraint`
package.

By building InvertedConstraints from FetchVal operators, the optimizer
now has the ability to generate inverted index scans with conjunctive
and disjunctive filters containing FetchVal operators. For example, the
optimizer will plan an inverted index scan with the query below.

    CREATE TABLE t (k INT PRIMARY KEY, j JSON, INVERTED INDEX (j))

    SELECT k FROM t WHERE j->'a' = '1' AND j->'b' = 2

This change also brings us a step closer to cleaning up the
`idxconstraint` by removing code related to inverted indexes.

Informs cockroachdb#47340

Release note (performance improvement): The query optimizer now plans
scans over inverted indexes on JSON columns for query filters that
constrain the JSON column with equality and fetch value operators (`->`)
inside conjunctions and disjunctions, like
`j->'a' = '1' AND j->'b' = '2'`.
craig bot pushed a commit that referenced this issue Jan 27, 2021
59266: opt: build scan InvertedConstraints from JSON FetchVal operator r=mgartner a=mgartner

#### opt: rename inverted constraint contains extraction function

The function name now reflects that it specifically deals with the
Contains expression, `@>`.

Release note: None

#### opt: add rule expectations for GenerateInvertedIndexScan tests

Release note: None

#### invertedidx: add convenience function for extracting index column variable

Release note: None

#### opt: build inverted index scan constraints from JSON fetch val operator

GenerateInvertedIndexScans now generates InvertedConstraints from
equality expressions with a JSON FetchVal operator on the left side and
a constant on the right side, like `j->'a' = '1'`. Previously
Constraints were built for these expressions with the `idxconstraint`
package.

By building InvertedConstraints from FetchVal operators, the optimizer
now has the ability to generate inverted index scans with conjunctive
and disjunctive filters containing FetchVal operators. For example, the
optimizer will plan an inverted index scan with the query below.

    CREATE TABLE t (k INT PRIMARY KEY, j JSON, INVERTED INDEX (j))

    SELECT k FROM t WHERE j->'a' = '1' AND j->'b' = 2

This change also brings us a step closer to cleaning up the
`idxconstraint` by removing code related to inverted indexes.

Informs #47340

Release note (performance improvement): The query optimizer now plans
scans over inverted indexes on JSON columns for query filters that
constrain the JSON column with equality and fetch value operators (`->`)
inside conjunctions and disjunctions, like
`j->'a' = '1' AND j->'b' = '2'`.

#### opt: build InvertedConstraints for inverted index scans on computed columns

GenerateInvertedIndexScans now builds InvertedConstraints over inverted
indexes on computed columns when a filter constrains an expression that
is equivalent to the computed column expression. Previously Constraints
were built for these expressions with the `idxconstraint` package.

This change provides no functional change, but brings us a step closer
to cleaning up the `idxconstraint` by removing code related to inverted
indexes.

Release note: None

#### opt: do not build Constraints in GenerateInvertedIndexScans

InvertedConstraints can be built for inverted index scans in all cases
that Constraints could be built. Therefore there is no longer a need to
build Constraints in GenerateInvertedIndexScans. This commit removes the
code related to this.

Release note: None


59408: *:upgrade vendored pgconn to v1.8.0 r=gaoxk a=gaoxk

The old version of pgconn asynchronously cancelled requests to the server. However, .connect synchronously closes connections in responses to errors. This causes race conditions for goroutine leaks.
The newer pgconn synchronously closes connections to the server, ideally eliminating this race condition. See jackc/pgconn@340bfec

Release note (general change): Upgrade to v1.8.0 `pgconn` to patch potential race condition errors.

Co-authored-by: Marcus Gartner <[email protected]>
Co-authored-by: Kristy Gao <[email protected]>
@mgartner
Copy link
Collaborator

mgartner commented Mar 3, 2021

There are a handful of improvement to inverted indexes that will be released in 21.1, including index accelerating disjunctions on the same inverted-indexed column, e.g. j->'a' = '1' OR j->'a' = '2'.

However, we do not currently index accelerate expressions of the form j->'a' @> '1', which are present in the query at the top of this issue. I've opened #61430 to track this enhancement separately.

@mgartner mgartner closed this as completed Mar 3, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants