opt: enable usage of inverted index with OR clauses #47340

andy-kimball · 2020-04-10T16:30:06Z

This is a new issue derived from comments made by @Bessonov on issue #2142.

CREATE TABLE t (
	id UUID NOT NULL DEFAULT gen_random_uuid(),
	perms JSONB NOT NULL,
	CONSTRAINT "primary" PRIMARY KEY (id ASC),
	INVERTED INDEX ix_perms (perms)
);

> explain (opt) select * from t where perms->'scope'->'xx2000' @> '[{"type":"column"}]' or perms->'scope'->'xx4000' @> '[{"type":"column"}]';
                                                               text                                                               
+--------------------------------------------------------------------------------------------------------------------------------+
  select                                                                                                                          
   ├── scan t                                                                                                                     
   └── filters                                                                                                                    
        └── (perms @> '{"scope": {"xx2000": [{"type": "column"}]}}') OR (perms @> '{"scope": {"xx4000": [{"type": "column"}]}}')

EXPECTED: The ix_perms index should be used.
ACTUAL: The primary index is used instead.

We could fix this in one of two ways:

Extend the work for sql: better query planning for OR expressions #2142 to use UNION in the case where both OR predicates reference the same columns. With regular indexes, that's not necessary because index selection code creates multiple spans for this case. But with inverted indexes, multiple spans can create duplicates.
Update inverted index selection code to generate multiple spans; in this case, we'd need to add a DISTINCT operator to remove any duplicates.

The text was updated successfully, but these errors were encountered:

steeling · 2020-05-22T00:26:32Z

Do you know what the SQL query would be that takes advantage of this feature on inverted indexes? Would it be:

SELECT * FROM t WHERE t.int_arr && ARRAY[1,2,3];

mgartner · 2020-05-22T16:59:36Z

Initially, this feature would support using inverted indexes when there is an explicit OR disjunction, like:

SELECT * FROM t WHERE t.int_arr @> ARRAY[1] OR t.int_arr @> ARRAY[2]

&& is the "overlap" operator, which is an implicit disjunction—it returns true if the left and right have at least one element in common. We should be able to extend this disjunction work to also optimize the && operator.

Ideally the new "InvertedExpr" (work in progress) can handle an arbitrary number of disjunctions, which should make this straight forward. If necessary, we could normalize a && ARRAY[1, 2] to a @> ARRAY[1] OR a @> ARRAY[2].

This commit adds support for planning and executing an inverted index scan + invertedFilter when there is a contains (@>) predicate on a JSON or array column that has an inverted index. Similar to spatial columns, the optimizer may be able to plan an inverted index scan + invertedFilter for JSON and Array columns even if the filter condition is complex, with multiple predicates combined with AND and OR expressions. Informs cockroachdb#47340 Release note (performance improvement): The optimizer now supports using an inverted index on JSON or Array columns for a wider variety of filter predicates. Previously, inverted index usage was only supported for simple predicates (e.g., `WHERE a @> '{"foo": "bar"}'), but now more complicated predicates are supported by combining simple contains (@>) expressions with AND and OR (e.g., `WHERE a @> '{"foo": "bar"}' OR a @> '{"foo": "baz"}'`). An inverted index will be used if it is available on the filtered column and the optimizer expects it to be more efficient than any alternative plan. This may result in performance improvements for queries involving JSON and Array columns.

56732: sql,util,opt: add support for inverted filter with JSON and Array columns r=rytaft a=rytaft **sql,util: return `unique` boolean from `EncodeContainingInvertedIndexSpans`** This commit adds a result parameter for `EncodeContainingInvertedIndexSpans` called `unique`. The function returns `unique=true` if the spans are guaranteed not to produce duplicate primary keys. Otherwise, returns `unique=false`. The logic is based on the fact that spans for a given JSON or Array datum will not produce duplicate primary keys if there is exactly one JSON or Array path ending in a scalar (i.e., not an empty object or array). **sql,opt: add support for inverted filter with JSON and Array columns** This commit adds support for planning and executing an inverted index scan + invertedFilter when there is a contains (`@>`) predicate on a JSON or array column that has an inverted index. Similar to spatial columns, the optimizer may be able to plan an inverted index scan + invertedFilter for JSON and Array columns even if the filter condition is complex, with multiple predicates combined with AND and OR expressions. Informs #47340 Release note (performance improvement): The optimizer now supports using an inverted index on JSON or Array columns for a wider variety of filter predicates. Previously, inverted index usage was only supported for simple predicates (e.g., `WHERE a @> '{"foo": "bar"}'`), but now more complicated predicates are supported by combining simple contains (@>) expressions with AND and OR (e.g., `WHERE a @> '{"foo": "bar"}' OR a @> '{"foo": "baz"}'`). An inverted index will be used if it is available on the filtered column and the optimizer expects it to be more efficient than any alternative plan. This may result in performance improvements for queries involving JSON and Array columns. 57050: storage: delete mvccGetInternal, use mvccGet directly r=nvanbenschoten a=nvanbenschoten This commit deletes a major chunk of legacy code in MVCC mutation code. `mvccPutInternal` was avoiding using mvccGet because this used to result in a cgo call. This led to a large amount of duplicate logic. Now that `mvccGet` calls into the `pebbleIterator` in Go, using it is much less expensive, so we can remove the re-implementation. With a bit of tuning in follow-on commits, this results in a negligible performance hit which is well worth the elimination of duplicate code: ``` name old time/op new time/op delta MVCCConditionalPut_Pebble/Create/valueSize=10-16 7.33µs ± 8% 7.04µs ± 7% ~ (p=0.075 n=10+10) MVCCConditionalPut_Pebble/Create/valueSize=100-16 7.99µs ± 8% 7.89µs ± 8% ~ (p=0.631 n=10+10) MVCCConditionalPut_Pebble/Create/valueSize=1000-16 8.51µs ± 7% 8.47µs ± 4% ~ (p=0.529 n=10+10) MVCCConditionalPut_Pebble/Create/valueSize=10000-16 23.7µs ± 6% 23.8µs ± 5% ~ (p=0.897 n=10+8) MVCCConditionalPut_Pebble/Replace/valueSize=10-16 12.0µs ± 6% 12.0µs ± 7% ~ (p=0.704 n=10+9) MVCCConditionalPut_Pebble/Replace/valueSize=100-16 13.2µs ± 7% 13.0µs ± 9% ~ (p=0.604 n=9+10) MVCCConditionalPut_Pebble/Replace/valueSize=10000-16 28.8µs ± 7% 30.2µs ± 3% +4.91% (p=0.008 n=10+9) MVCCConditionalPut_Pebble/Replace/valueSize=1000-16 11.5µs ± 6% 12.1µs ± 6% +5.22% (p=0.007 n=10+10) name old speed new speed delta MVCCConditionalPut_Pebble/Create/valueSize=10-16 1.37MB/s ± 8% 1.42MB/s ± 8% ~ (p=0.085 n=10+10) MVCCConditionalPut_Pebble/Create/valueSize=100-16 12.5MB/s ± 7% 12.7MB/s ± 9% ~ (p=0.645 n=10+10) MVCCConditionalPut_Pebble/Create/valueSize=1000-16 118MB/s ± 8% 118MB/s ± 4% ~ (p=0.529 n=10+10) MVCCConditionalPut_Pebble/Create/valueSize=10000-16 423MB/s ± 6% 421MB/s ± 5% ~ (p=0.897 n=10+8) MVCCConditionalPut_Pebble/Replace/valueSize=10-16 835kB/s ± 5% 833kB/s ± 7% ~ (p=0.734 n=10+9) MVCCConditionalPut_Pebble/Replace/valueSize=100-16 7.56MB/s ± 7% 7.69MB/s ±10% ~ (p=0.619 n=9+10) MVCCConditionalPut_Pebble/Replace/valueSize=10000-16 347MB/s ± 7% 331MB/s ± 3% -4.80% (p=0.008 n=10+9) MVCCConditionalPut_Pebble/Replace/valueSize=1000-16 87.1MB/s ± 6% 82.8MB/s ± 6% -4.96% (p=0.007 n=10+10) name old alloc/op new alloc/op delta MVCCConditionalPut_Pebble/Create/valueSize=10-16 366B ± 3% 366B ± 2% ~ (p=1.000 n=10+10) MVCCConditionalPut_Pebble/Create/valueSize=100-16 909B ± 2% 905B ± 2% ~ (p=0.492 n=10+10) MVCCConditionalPut_Pebble/Create/valueSize=1000-16 6.42kB ± 2% 6.46kB ± 1% ~ (p=0.197 n=10+10) MVCCConditionalPut_Pebble/Create/valueSize=10000-16 79.8kB ± 8% 77.7kB ±10% ~ (p=0.190 n=10+10) MVCCConditionalPut_Pebble/Replace/valueSize=1000-16 8.37kB ± 6% 9.05kB ± 4% +8.20% (p=0.000 n=10+9) MVCCConditionalPut_Pebble/Replace/valueSize=10-16 390B ± 3% 435B ± 4% +11.41% (p=0.000 n=10+10) MVCCConditionalPut_Pebble/Replace/valueSize=10000-16 72.0kB ± 7% 81.0kB ± 3% +12.38% (p=0.000 n=9+10) MVCCConditionalPut_Pebble/Replace/valueSize=100-16 1.02kB ± 3% 1.16kB ± 2% +14.05% (p=0.000 n=10+10) name old allocs/op new allocs/op delta MVCCConditionalPut_Pebble/Create/valueSize=10-16 2.00 ± 0% 2.00 ± 0% ~ (all equal) MVCCConditionalPut_Pebble/Create/valueSize=100-16 2.00 ± 0% 2.00 ± 0% ~ (all equal) MVCCConditionalPut_Pebble/Create/valueSize=1000-16 2.00 ± 0% 2.00 ± 0% ~ (all equal) MVCCConditionalPut_Pebble/Create/valueSize=10000-16 3.00 ± 0% 3.00 ± 0% ~ (all equal) MVCCConditionalPut_Pebble/Replace/valueSize=10-16 4.00 ± 0% 4.00 ± 0% ~ (all equal) MVCCConditionalPut_Pebble/Replace/valueSize=100-16 4.00 ± 0% 4.00 ± 0% ~ (all equal) MVCCConditionalPut_Pebble/Replace/valueSize=1000-16 4.00 ± 0% 4.00 ± 0% ~ (all equal) MVCCConditionalPut_Pebble/Replace/valueSize=10000-16 6.00 ± 0% 6.00 ± 0% ~ (all equal) ``` If we want to continue pulling on this – 2 of the 4 remaining heap allocations here are due to the key copy in [`pebble.Iterator.SeekPrefixGE`](https://github.com/cockroachdb/pebble/blob/2da62788eedefce8dfbeef56a2324603606e18fb/iterator.go#L386). Caching this buffer across calls to `SeekPrefixGE` seems trivial. Caching the buffer across other method calls seems trickier due to the way we use the existence of the slice to indicate "prefix iteration". Maybe someone closer to this code will have good ideas about what to do about this. Co-authored-by: Rebecca Taft <[email protected]> Co-authored-by: Nathan VanBenschoten <[email protected]>

GenerateInvertedIndexScans now generates InvertedConstraints from equality expressions with a JSON FetchVal operator on the left side and a constant on the right side, like `j->'a' = '1'`. Previously Constraints were built for these expressions with the `idxconstraint` package. By building InvertedConstraints from FetchVal operators, the optimizer now has the ability to generate inverted index scans with conjunctive and disjunctive filters containing FetchVal operators. For example, the optimizer will plan an inverted index scan with the query below. CREATE TABLE t (k INT PRIMARY KEY, j JSON, INVERTED INDEX (j)) SELECT k FROM t WHERE j->'a' = '1' AND j->'b' = 2 This change also brings us a step closer to cleaning up the `idxconstraint` by removing code related to inverted indexes. Informs cockroachdb#47340 Release note (performance improvement): The query optimizer now plans scans over inverted indexes on JSON columns for query filters that constrain the JSON column with equality and fetch value operators (`->`) inside conjunctions and disjunctions, like `j->'a' = '1' AND j->'b' = '2'`.

59266: opt: build scan InvertedConstraints from JSON FetchVal operator r=mgartner a=mgartner #### opt: rename inverted constraint contains extraction function The function name now reflects that it specifically deals with the Contains expression, `@>`. Release note: None #### opt: add rule expectations for GenerateInvertedIndexScan tests Release note: None #### invertedidx: add convenience function for extracting index column variable Release note: None #### opt: build inverted index scan constraints from JSON fetch val operator GenerateInvertedIndexScans now generates InvertedConstraints from equality expressions with a JSON FetchVal operator on the left side and a constant on the right side, like `j->'a' = '1'`. Previously Constraints were built for these expressions with the `idxconstraint` package. By building InvertedConstraints from FetchVal operators, the optimizer now has the ability to generate inverted index scans with conjunctive and disjunctive filters containing FetchVal operators. For example, the optimizer will plan an inverted index scan with the query below. CREATE TABLE t (k INT PRIMARY KEY, j JSON, INVERTED INDEX (j)) SELECT k FROM t WHERE j->'a' = '1' AND j->'b' = 2 This change also brings us a step closer to cleaning up the `idxconstraint` by removing code related to inverted indexes. Informs #47340 Release note (performance improvement): The query optimizer now plans scans over inverted indexes on JSON columns for query filters that constrain the JSON column with equality and fetch value operators (`->`) inside conjunctions and disjunctions, like `j->'a' = '1' AND j->'b' = '2'`. #### opt: build InvertedConstraints for inverted index scans on computed columns GenerateInvertedIndexScans now builds InvertedConstraints over inverted indexes on computed columns when a filter constrains an expression that is equivalent to the computed column expression. Previously Constraints were built for these expressions with the `idxconstraint` package. This change provides no functional change, but brings us a step closer to cleaning up the `idxconstraint` by removing code related to inverted indexes. Release note: None #### opt: do not build Constraints in GenerateInvertedIndexScans InvertedConstraints can be built for inverted index scans in all cases that Constraints could be built. Therefore there is no longer a need to build Constraints in GenerateInvertedIndexScans. This commit removes the code related to this. Release note: None 59408: *:upgrade vendored pgconn to v1.8.0 r=gaoxk a=gaoxk The old version of pgconn asynchronously cancelled requests to the server. However, .connect synchronously closes connections in responses to errors. This causes race conditions for goroutine leaks. The newer pgconn synchronously closes connections to the server, ideally eliminating this race condition. See jackc/pgconn@340bfec Release note (general change): Upgrade to v1.8.0 `pgconn` to patch potential race condition errors. Co-authored-by: Marcus Gartner <[email protected]> Co-authored-by: Kristy Gao <[email protected]>

mgartner · 2021-03-03T22:16:19Z

There are a handful of improvement to inverted indexes that will be released in 21.1, including index accelerating disjunctions on the same inverted-indexed column, e.g. j->'a' = '1' OR j->'a' = '2'.

However, we do not currently index accelerate expressions of the form j->'a' @> '1', which are present in the query at the top of this issue. I've opened #61430 to track this enhancement separately.

andy-kimball assigned RaduBerinde Apr 10, 2020

andy-kimball mentioned this issue Apr 10, 2020

sql: better query planning for OR expressions #2142

Closed

andy-kimball assigned mgartner and unassigned RaduBerinde Apr 17, 2020

steeling mentioned this issue May 22, 2020

Leverage inverted indexes on array in CRDB 20.1 to remove joins tables interuss/dss#304

Merged

rytaft mentioned this issue Nov 16, 2020

sql,util,opt: add support for inverted filter with JSON and Array columns #56732

Merged

mgartner mentioned this issue Jan 21, 2021

opt: build scan InvertedConstraints from JSON FetchVal operator #59266

Merged

mgartner mentioned this issue Jan 23, 2021

opt: inverted index improvements #59331

Open

31 tasks

mgartner closed this as completed Mar 3, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

opt: enable usage of inverted index with OR clauses #47340

opt: enable usage of inverted index with OR clauses #47340

andy-kimball commented Apr 10, 2020

steeling commented May 22, 2020

mgartner commented May 22, 2020 •

edited

Loading

mgartner commented Mar 3, 2021

opt: enable usage of inverted index with OR clauses #47340

opt: enable usage of inverted index with OR clauses #47340

Comments

andy-kimball commented Apr 10, 2020

steeling commented May 22, 2020

mgartner commented May 22, 2020 • edited Loading

mgartner commented Mar 3, 2021

mgartner commented May 22, 2020 •

edited

Loading