-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
sql: implement trigram inverted indexes and index acceleration for % and LIKE #79705
Conversation
For an example, here's a log from the CLI:
|
e5300b1
to
3963a96
Compare
18dd990
to
0c70018
Compare
8c17570
to
0d1bcd9
Compare
I added opclass parsing and persistence. I think this is RFAL! |
80adc59
to
2076c41
Compare
This is super cool! What an unexpected win! |
I've split off the builtin functions, operators, session settings, etc into #81418. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Awesome! Exciting stuff!
Reviewed 47 of 47 files at r10, 9 of 9 files at r11, 23 of 23 files at r12, 15 of 15 files at r13, 13 of 13 files at r14, 4 of 4 files at r15, 17 of 17 files at r16, 6 of 6 files at r17, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @jordanlewis)
pkg/sql/opt/invertedidx/trigram.go
line 32 at r14 (raw file):
} func (t *trigramFilterPlanner) extractInvertedFilterConditionFromLeaf(
nit: add comment:
// extractInvertedFilterConditionFromLeaf is part of the invertedFilterPlanner
// interface.
pkg/sql/opt/invertedidx/trigram.go
line 81 at r14 (raw file):
} var _ invertedFilterPlanner = &trigramFilterPlanner{}
nit: move this just below the type definition
pkg/sql/opt/invertedidx/trigram.go
line 47 at r15 (raw file):
case *memo.ILikeExpr: left, right = e.Left, e.Right // If we're doing a LIKE (or ILIKE) expression, we need to construct an AND out of all
nit: make comment <= 80 columns wide
pkg/sql/opt/invertedidx/trigram.go
line 32 at r17 (raw file):
} func (t *trigramFilterPlanner) extractInvertedFilterConditionFromLeaf(
This could use a unit test (similar to TestTryFilterJsonOrArrayIndex
)
pkg/sql/parser/sql.y
line 8710 at r10 (raw file):
nullsOrder := $3.nullsOrder() if opClass != "" { if opClass == "gin_trgm_ops" || opClass == "gist_trgm_ops" {
you only added support for "gin_trgm_ops" above -- do you also want to add support for the gist version?
pkg/sql/rowenc/index_encoding.go
line 896 at r14 (raw file):
// trigrams present in the input string. func EncodeLikeTrigramSpans(val *tree.DString) (inverted.Expression, error) { chunks := strings.Split(string(*val), "%")
nit: add a comment to explain why you split on %
pkg/sql/rowenc/index_encoding.go
line 898 at r14 (raw file):
chunks := strings.Split(string(*val), "%") // Each chunk will have at minimum 2 trigrams, so start at that default size. keys := make([][]byte, 0, len(chunks)*2)
should this be len(chunks)+2
?
pkg/sql/rowenc/index_encoding.go
line 897 at r17 (raw file):
// expression must match every trigram in the input. Otherwise, it will match // any trigram in the input. func EncodeTrigramSpans(val *tree.DString, allMustMatch bool) (inverted.Expression, error) {
This could use a unit test (similar to TestEncodeContainingArrayInvertedIndexSpans
, TestEncodeOverlapsArrayInvertedIndexSpans
or TestEncodeContainedArrayInvertedIndexSpans
).
pkg/sql/sem/tree/create.go
line 141 at r16 (raw file):
//} // return unimplementedWithIssue(sqllex, 47420) //}
Looks like dead code
pkg/sql/logictest/testdata/logic_test/trigram_indexes
line 17 at r16 (raw file):
statement ok CREATE INDEX ON a USING GIN(t gin_trgm_ops)
do you want to add a test for USING GIST
and gist_trgm_ops
too?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 6 of 13 files at r14, 2 of 46 files at r18, 2 of 25 files at r22, 23 of 23 files at r26, 17 of 17 files at r27, 4 of 6 files at r28.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @ajwerner, @jordanlewis, @rafiss, and @rytaft)
pkg/sql/create_index.go
line 351 at r27 (raw file):
} indexDesc.InvertedColumnKinds[0] = descpb.IndexDescriptor_TRIGRAM }
is this going to just accept an opclass for a column type which is, say, INT8
?
For some defense in depth, can you return an assertion failure if this gets called with a type that is not inverted indexable?
More generally, I don't see much if any protection against opclasses showing up in the grammar and then being summarily ignored. Maybe I missed it. Can you add some logic tests which make sure things like CREATE INDEX idx ON (int_col gin_trgm_ops)
fail. F
Code quote:
}
return nil
pkg/sql/catalog/descpb/structured.proto
line 349 at r27 (raw file):
// wouldn't be possible to distinguish a text inverted index that uses // trigrams, vs a text inverted index that uses stemming. enum InvertedIndexColumnKind {
can you define this in catpb
? We're going to need it in the declarative schema changer which sits below descpb
.
Code quote:
enum InvertedIndexColumnKind {
// DEFAULT is the default kind of inverted index column. JSON, Array, and
// geo inverted indexes all are DEFAULT, though prior to 22.2 they had no
// kind at all.
DEFAULT = 0;
// TRIGRAM is the trigram kind of inverted index column. It's only valid on
// text columns.
TRIGRAM = 1;
}
pkg/sql/rowenc/index_encoding_test.go
line 963 at r26 (raw file):
runTest := func(indexedValue, value string, searchType trigramSearchType, expectContainsKeys, expected, expectUnique bool) { t.Logf("test case: %s %s %v %t %t %t", indexedValue, value, searchType, expectContainsKeys, expected, expectUnique)
nit: is this better than a subtest?
pkg/sql/rowenc/index_encoding_test.go
line 1098 at r26 (raw file):
{[]string{}}, } for _, tc := range tcs {
nit: subtests
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 23 of 23 files at r26, 17 of 17 files at r27, 6 of 6 files at r28, 1 of 1 files at r29, all commit messages.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @jordanlewis and @rafiss)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 6 of 13 files at r14, 2 of 46 files at r18, 2 of 25 files at r22, 23 of 23 files at r26, 17 of 17 files at r27, 6 of 6 files at r28, 1 of 1 files at r29, all commit messages.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @jordanlewis and @rafiss)
pkg/sql/opt/invertedidx/trigram.go
line 97 at r26 (raw file):
// For a LIKE or ILIKE expr, we split the string by the % separator // character which is used in LIKE expressions to separate matches. chunks = strings.Split(s, "%")
This seems too naïve to handle escaped %
s:
marcus=# select 'hello' LIKE '%ell%';
?column?
----------
t
(1 row)
marcus=# select 'hello' LIKE '\%ell%';
?column?
----------
f
(1 row)
We've run had previous bugs with escaped strings in LIKE expressions: #68289
Very cool!!! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @ajwerner, @mgartner, @rafiss, and @rytaft)
pkg/sql/create_index.go
line 351 at r27 (raw file):
Previously, ajwerner wrote…
is this going to just accept an opclass for a column type which is, say,
INT8
?For some defense in depth, can you return an assertion failure if this gets called with a type that is not inverted indexable?
More generally, I don't see much if any protection against opclasses showing up in the grammar and then being summarily ignored. Maybe I missed it. Can you add some logic tests which make sure things like
CREATE INDEX idx ON (int_col gin_trgm_ops)
fail. F
Thanks for the catch. That's a good point, I was missing this. Done.
pkg/sql/opt/invertedidx/trigram.go
line 97 at r26 (raw file):
Previously, mgartner (Marcus Gartner) wrote…
This seems too naïve to handle escaped
%
s:marcus=# select 'hello' LIKE '%ell%'; ?column? ---------- t (1 row) marcus=# select 'hello' LIKE '\%ell%'; ?column? ---------- f (1 row)
We've run had previous bugs with escaped strings in LIKE expressions: #68289
It's actually alright, but for a silly reason that I didn't think of either - the trigrams of a string do not include non-alpha characters. So the output of this method will be the same regardless of whether we split the string at all or not.
This makes all of this chunk
stuff totally unnecessary, so I've removed it, which simplifies things significantly. Thanks for asking about this!
The other thing is that we always need to re-check the results of trigram searches when searching with more than 1 trigram because the index doesn't contain information about whether two trigrams are adjacent. So that would "save" us here as well, though I suppose not in the case where we include literal % characters in trigrams that we store or search by, but we don't do that anyway.
pkg/sql/rowenc/index_encoding_test.go
line 963 at r26 (raw file):
Previously, ajwerner wrote…
nit: is this better than a subtest?
I was just following the convention in this file, so I'll keep it as is.
pkg/sql/rowenc/index_encoding_test.go
line 1098 at r26 (raw file):
Previously, ajwerner wrote…
nit: subtests
Simplified this, and there are so few that I don't think it matters too much.
027299d
to
0bc7511
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 25 of 25 files at r30, 16 of 19 files at r31, 9 of 9 files at r34, 6 of 6 files at r35, 1 of 1 files at r36, all commit messages.
Reviewable status: complete! 1 of 0 LGTMs obtained (and 1 stale) (waiting on @ajwerner, @jordanlewis, and @rafiss)
pkg/sql/opt/invertedidx/trigram.go
line 97 at r26 (raw file):
Previously, jordanlewis (Jordan Lewis) wrote…
It's actually alright, but for a silly reason that I didn't think of either - the trigrams of a string do not include non-alpha characters. So the output of this method will be the same regardless of whether we split the string at all or not.
This makes all of this
chunk
stuff totally unnecessary, so I've removed it, which simplifies things significantly. Thanks for asking about this!The other thing is that we always need to re-check the results of trigram searches when searching with more than 1 trigram because the index doesn't contain information about whether two trigrams are adjacent. So that would "save" us here as well, though I suppose not in the case where we include literal % characters in trigrams that we store or search by, but we don't do that anyway.
Interesting. Makes sense! 👍
pkg/sql/rowenc/index_encoding_test.go
line 1085 at r30 (raw file):
"fo", "a", "",
What if there are 3 characters but not all are alpha characters, like fo%
? Do we expect that to succeed or err?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 1 of 6 files at r35.
Reviewable status: complete! 2 of 0 LGTMs obtained (and 1 stale) (waiting on @jordanlewis and @rafiss)
This commit implements trigram inverted indexes that are searchable using the =, LIKE, ILIKE and % (similarity) operators. This is a partial implementation of the gin_trgm_ops opclass from Postgres. gin_trgm_ops also supports regex matches in Postgres. Release note (sql change): string columns now support inverted indexes using trigrams. These indexes can be searched using the =, LIKE, ILIKE and % (similarity) predicates.
Previously, it was not permitted to use opclass syntax for index creation. For example, the `blah_ops` in `CREATE INDEX ON t USING GIN (col blah_ops))`. Now, this syntax is legal when the opclass is supported for a given type. Release note (sql change): permit usage of jsonb_ops, array_ops, gin_trgm_ops, and gist_trgm_ops as an opclass in inverted index creation.
Make sure that it's not possible to create an inverted index until the cluster is at a new enough version. Release note: None
Release note: None
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 3 stale) (waiting on @ajwerner, @mgartner, and @rafiss)
pkg/sql/rowenc/index_encoding_test.go
line 1085 at r30 (raw file):
Previously, mgartner (Marcus Gartner) wrote…
What if there are 3 characters but not all are alpha characters, like
fo%'
? Do we expect that to succeed or err?
Error. Added a test.
Thanks all for the reviews! bors r+ |
Build succeeded: |
Updates #41285
See commits for details. This PR implements inverted indexing on string columns using trigrams and searching trigram indexes using the
=
,LIKE
,ILIKE
and%
(text similarity) operators.Inverted indexes on text columns are built with the
gin_trgm_ops
orgist_trgm_ops
opclasses, withCREATE INDEX ON t USING GIN(col gin_trgm_ops)
,CREATE INVERTED INDEX ON t(col gin_tgrm_ops)
, orCREATE INDEX ON t USING GIST(col gist_trgm_ops)
. Thegin
andgist
opclasses are currently implemented identically.