-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
sql: add inverted indexing for tsvector and acceleration for @@ queries #93769
Conversation
c9608d5
to
50871a3
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice! Would it be possible to add some randomized tests similar to the ones we have for array / json / trigram?
Reviewed 8 of 8 files at r1, 11 of 11 files at r2, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @jordanlewis)
pkg/sql/opt/invertedidx/tsearch.go
line 1 at r2 (raw file):
// Copyright 2022 The Cockroach Authors.
Can you add a unit test for this similar to trigram_test.go
?
pkg/sql/sem/builtins/builtins.go
line 5928 at r2 (raw file):
Types: tree.ParamTypes{ {Name: "val", Typ: types.TSVector}, {Name: "version", Typ: types.Int},
nit: this change should go with the first commit
pkg/util/tsearch/encoding.go
line 470 at r1 (raw file):
outKeys = append(outKeys, newKey) } return outKeys, nil
Can you guarantee that all terms are unique?
pkg/util/tsearch/tsquery.go
line 184 at r2 (raw file):
key := EncodeInvertedIndexKey(nil /* inKey */, n.term.lexeme) span := inverted.MakeSingleValSpan(key) return inverted.ExprForSpan(span, true), nil
nit: true /* tight */
Also, should we set Unique = true
on this span expression?
pkg/util/tsearch/tsquery.go
line 186 at r2 (raw file):
return inverted.ExprForSpan(span, true), nil case followedby: fallthrough
should we do SetNotTight()
in this case?
pkg/sql/opt/exec/execbuilder/testdata/tsvector_index
line 1 at r2 (raw file):
# LogicTest: local
it would be good to add a similar file in logictests
to check that the query output is as expected.
50871a3
to
693df3c
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the review, responded to all inline comments. I'll add some randomized encoded span containment tests as well, I forgot about those. The amount of places we need the same kinds of test cases for the inverted expressions is a little bit annoying (we'd like the same test cases in logic tests, execbuilder tests, eval tests, encoded span containment tests, and probably more), but I'm not really sure if it's possible to factor them out well.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @rytaft)
pkg/sql/opt/invertedidx/tsearch.go
line 1 at r2 (raw file):
Previously, rytaft (Rebecca Taft) wrote…
Can you add a unit test for this similar to
trigram_test.go
?
🤦 not sure why I didn't do this in the first place, thanks for the nudge. Added a suite.
pkg/util/tsearch/encoding.go
line 470 at r1 (raw file):
Previously, rytaft (Rebecca Taft) wrote…
Can you guarantee that all terms are unique?
Yes, TSVector is guaranteed to be sorted and contain unique terms only - that's part of its contract. It's enforced during TSVector construction time. Added a comment to help motivate.
pkg/util/tsearch/tsquery.go
line 184 at r2 (raw file):
Previously, rytaft (Rebecca Taft) wrote…
nit: true /* tight */
Also, should we set
Unique = true
on this span expression?
There were some difference in behavior here I overlooked when I made this change, until the changes in #94431. Added commentary and updated behavior (to deal with prefix matches and weight matches).
pkg/util/tsearch/tsquery.go
line 186 at r2 (raw file):
Previously, rytaft (Rebecca Taft) wrote…
should we do
SetNotTight()
in this case?
Yes, good point. Done. A few changes were necessary, PTAL.
pkg/sql/opt/exec/execbuilder/testdata/tsvector_index
line 1 at r2 (raw file):
Previously, rytaft (Rebecca Taft) wrote…
it would be good to add a similar file in
logictests
to check that the query output is as expected.
Done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The amount of places we need the same kinds of test cases for the inverted expressions is a little bit annoying
I agree, although these things have also been notoriously full of bugs, so I think it's better to err on the side of over-testing rather than under-testing. But let me know if you have ideas...
Reviewed 14 of 14 files at r3, 10 of 10 files at r4, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @jordanlewis)
pkg/util/tsearch/encoding.go
line 463 at r4 (raw file):
func EncodeInvertedIndexKeys(inKey []byte, vector TSVector) ([][]byte, error) { outKeys := make([][]byte, 0, len(vector)) // Note that by construction, TSVector contains only unique term, so we don't
nit: term -> terms
pkg/util/tsearch/tsquery.go
line 192 at r4 (raw file):
// 3. Weighted match. // In this case, we make the match non-tight, because we don't store the // weights of the lexemes in the index, and are forced to re-check the
nit: trailing "the"
pkg/util/tsearch/tsquery.go
line 252 at r4 (raw file):
// We never have a unique match when combining terms, since a single primary // key could be represented in both terms. expr.(*inverted.SpanExpression).Unique = false
I don't think this is needed. The comment for Unique
says "it holds when unique SpanExpressions are combined with And". The code in expression.go
should already be doing the right thing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @rytaft)
pkg/util/tsearch/tsquery.go
line 252 at r4 (raw file):
Previously, rytaft (Rebecca Taft) wrote…
I don't think this is needed. The comment for
Unique
says "it holds when unique SpanExpressions are combined with And". The code inexpression.go
should already be doing the right thing.
It does seem to be required, unless I'm using the API wrong. The following test case fails without this line of code:
tsearch_test.go:94: test case: {t @@ 'a & b' true true false}
tsearch_test.go:122: For (t @@ 'a & b'), expected unique=false, but got true
Reading the code of And
, it seems like Unique
is populated like this:
Unique: left.Unique && right.Unique,
Which seems wrong to me in this case? Or am I missing something and the test case is wrong?
The trigram inverted filter builder also does this - it assumes that Unique should be false if you're combining 2 spans with And. I guess there are fixes to make over there as well.
Is the idea that we can still assume they're Unique after And because the rows will be intersected during execution?
Then again, I've read the code that reads the Unique field and it appears like nobody ever uses the Unique property of an expression that isn't a leaf, so I think all of this is immaterial. Perhaps we could teach the optimizer to not plan an inverted filter if the operator is AND and the expression is Unique, at least in some cases? But that's not done today.
693df3c
to
4bf87c0
Compare
Nevermind, you still of course need the inverted filter - that's the thing that's doing the intersection. I still think there could be some duplicate work being performed in the inverted filterer - can't we avoid checking each span for containment within the inverted expression's span, since we know by construction that each returned key is in fact contained in the search spans? But I don't really understand that code very well so I could be wrong. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nevermind, you still of course need the inverted filter...
I think we discussed this offline, but let me know if you're still unsure about anything.
Reviewed 13 of 13 files at r5, 10 of 10 files at r6, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @jordanlewis)
pkg/util/tsearch/tsquery.go
line 252 at r4 (raw file):
Previously, jordanlewis (Jordan Lewis) wrote…
It does seem to be required, unless I'm using the API wrong. The following test case fails without this line of code:
tsearch_test.go:94: test case: {t @@ 'a & b' true true false} tsearch_test.go:122: For (t @@ 'a & b'), expected unique=false, but got true
Reading the code of
And
, it seems likeUnique
is populated like this:Unique: left.Unique && right.Unique,
Which seems wrong to me in this case? Or am I missing something and the test case is wrong?
The trigram inverted filter builder also does this - it assumes that Unique should be false if you're combining 2 spans with And. I guess there are fixes to make over there as well.
Is the idea that we can still assume they're Unique after And because the rows will be intersected during execution?
Then again, I've read the code that reads the Unique field and it appears like nobody ever uses the Unique property of an expression that isn't a leaf, so I think all of this is immaterial. Perhaps we could teach the optimizer to not plan an inverted filter if the operator is AND and the expression is Unique, at least in some cases? But that's not done today.
We discussed this offline -- I think based on that discussion my original comment stands? Let me know if you still disagree.
This commit adds the ability to create an inverted index on a tsvector column. Release note: None
4bf87c0
to
3537d92
Compare
I went ahead and added an index span containment suite with randomization like we do with JSON and Array inverted indexes. I think this is what you were thinking of - PTAL! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @rytaft)
pkg/util/tsearch/tsquery.go
line 252 at r4 (raw file):
Previously, rytaft (Rebecca Taft) wrote…
We discussed this offline -- I think based on that discussion my original comment stands? Let me know if you still disagree.
Yup, agreed, this is now corrected.
92608fe
to
22f6553
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 15 of 15 files at r7, 12 of 14 files at r8, 2 of 2 files at r9, all commit messages.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @jordanlewis)
pkg/sql/inverted/expression.go
line 303 at r9 (raw file):
Tight bool // Unique is true if the spans are guaranteed not to produce duplicate
maybe this should be "the spans in FactoredUnionSpans"
pkg/sql/inverted/expression.go
line 311 at r9 (raw file):
// Note that the uniqueness property represented here holds for the *output* // of the invertedFilter operator that executes the And/Or, not the raw, input // spans of data.
Technically the invertedFilter always returns unique keys. How about something like this:
// Once a SpanExpression is built, this field is relevant if the root SpanExpression
// has no children (i.e., Operator is None). In this case, Unique is used to determine
// whether an invertedFilter is needed on top of the inverted index scan to deduplicate
// keys (an invertedFilter is always necessary if Operator is not None).
22f6553
to
49a4e7d
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TFTR!
bors r+
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @rytaft)
pkg/sql/inverted/expression.go
line 303 at r9 (raw file):
Previously, rytaft (Rebecca Taft) wrote…
maybe this should be "the spans in FactoredUnionSpans"
Done, I agree that helps matters.
pkg/sql/inverted/expression.go
line 311 at r9 (raw file):
Previously, rytaft (Rebecca Taft) wrote…
Technically the invertedFilter always returns unique keys. How about something like this:
// Once a SpanExpression is built, this field is relevant if the root SpanExpression // has no children (i.e., Operator is None). In this case, Unique is used to determine // whether an invertedFilter is needed on top of the inverted index scan to deduplicate // keys (an invertedFilter is always necessary if Operator is not None).
This sounds better, thanks!
This commit adds inverted index acceleration for expressions that evaluate a tsquery against a tsvector using the `@@` operator. Release note (sql change): it's now possible to run efficient tsvector @@ tsquery searches when there is an inverted index on the tsvector column being searched.
49a4e7d
to
5dd8ae2
Compare
Canceled. |
bors r+ |
Build succeeded: |
Updates #41288
This PR adds the ability to create an inverted index on a tsvector column, and adds inverted index acceleration for expressions that evaluate a tsquery against a tsvector using the
@@
operator.Epic: CRDB-22357