Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sql: add inverted indexing for tsvector and acceleration for @@ queries #93769

Merged
merged 2 commits into from
Jan 21, 2023

Conversation

jordanlewis
Copy link
Member

Updates #41288

This PR adds the ability to create an inverted index on a tsvector column, and adds inverted index acceleration for expressions that evaluate a tsquery against a tsvector using the @@ operator.

Epic: CRDB-22357

@jordanlewis jordanlewis requested review from a team December 16, 2022 03:59
@jordanlewis jordanlewis requested a review from a team as a code owner December 16, 2022 03:59
@cockroach-teamcity
Copy link
Member

This change is Reviewable

@jordanlewis jordanlewis force-pushed the tsvector-inverted branch 3 times, most recently from c9608d5 to 50871a3 Compare December 16, 2022 19:55
Copy link
Collaborator

@rytaft rytaft left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice! Would it be possible to add some randomized tests similar to the ones we have for array / json / trigram?

Reviewed 8 of 8 files at r1, 11 of 11 files at r2, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @jordanlewis)


pkg/sql/opt/invertedidx/tsearch.go line 1 at r2 (raw file):

// Copyright 2022 The Cockroach Authors.

Can you add a unit test for this similar to trigram_test.go?


pkg/sql/sem/builtins/builtins.go line 5928 at r2 (raw file):

			Types: tree.ParamTypes{
				{Name: "val", Typ: types.TSVector},
				{Name: "version", Typ: types.Int},

nit: this change should go with the first commit


pkg/util/tsearch/encoding.go line 470 at r1 (raw file):

		outKeys = append(outKeys, newKey)
	}
	return outKeys, nil

Can you guarantee that all terms are unique?


pkg/util/tsearch/tsquery.go line 184 at r2 (raw file):

		key := EncodeInvertedIndexKey(nil /* inKey */, n.term.lexeme)
		span := inverted.MakeSingleValSpan(key)
		return inverted.ExprForSpan(span, true), nil

nit: true /* tight */

Also, should we set Unique = true on this span expression?


pkg/util/tsearch/tsquery.go line 186 at r2 (raw file):

		return inverted.ExprForSpan(span, true), nil
	case followedby:
		fallthrough

should we do SetNotTight() in this case?


pkg/sql/opt/exec/execbuilder/testdata/tsvector_index line 1 at r2 (raw file):

# LogicTest: local

it would be good to add a similar file in logictests to check that the query output is as expected.

Copy link
Member Author

@jordanlewis jordanlewis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the review, responded to all inline comments. I'll add some randomized encoded span containment tests as well, I forgot about those. The amount of places we need the same kinds of test cases for the inverted expressions is a little bit annoying (we'd like the same test cases in logic tests, execbuilder tests, eval tests, encoded span containment tests, and probably more), but I'm not really sure if it's possible to factor them out well.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @rytaft)


pkg/sql/opt/invertedidx/tsearch.go line 1 at r2 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

Can you add a unit test for this similar to trigram_test.go?

🤦 not sure why I didn't do this in the first place, thanks for the nudge. Added a suite.


pkg/util/tsearch/encoding.go line 470 at r1 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

Can you guarantee that all terms are unique?

Yes, TSVector is guaranteed to be sorted and contain unique terms only - that's part of its contract. It's enforced during TSVector construction time. Added a comment to help motivate.


pkg/util/tsearch/tsquery.go line 184 at r2 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

nit: true /* tight */

Also, should we set Unique = true on this span expression?

There were some difference in behavior here I overlooked when I made this change, until the changes in #94431. Added commentary and updated behavior (to deal with prefix matches and weight matches).


pkg/util/tsearch/tsquery.go line 186 at r2 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

should we do SetNotTight() in this case?

Yes, good point. Done. A few changes were necessary, PTAL.


pkg/sql/opt/exec/execbuilder/testdata/tsvector_index line 1 at r2 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

it would be good to add a similar file in logictests to check that the query output is as expected.

Done.

Copy link
Collaborator

@rytaft rytaft left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The amount of places we need the same kinds of test cases for the inverted expressions is a little bit annoying

I agree, although these things have also been notoriously full of bugs, so I think it's better to err on the side of over-testing rather than under-testing. But let me know if you have ideas...

Reviewed 14 of 14 files at r3, 10 of 10 files at r4, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @jordanlewis)


pkg/util/tsearch/encoding.go line 463 at r4 (raw file):

func EncodeInvertedIndexKeys(inKey []byte, vector TSVector) ([][]byte, error) {
	outKeys := make([][]byte, 0, len(vector))
	// Note that by construction, TSVector contains only unique term, so we don't

nit: term -> terms


pkg/util/tsearch/tsquery.go line 192 at r4 (raw file):

		// 3. Weighted match.
		//    In this case, we make the match non-tight, because we don't store the
		//    weights of the lexemes in the index, and are forced to re-check the

nit: trailing "the"


pkg/util/tsearch/tsquery.go line 252 at r4 (raw file):

		// We never have a unique match when combining terms, since a single primary
		// key could be represented in both terms.
		expr.(*inverted.SpanExpression).Unique = false

I don't think this is needed. The comment for Unique says "it holds when unique SpanExpressions are combined with And". The code in expression.go should already be doing the right thing.

@jordanlewis jordanlewis requested a review from rytaft January 15, 2023 17:17
Copy link
Member Author

@jordanlewis jordanlewis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @rytaft)


pkg/util/tsearch/tsquery.go line 252 at r4 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

I don't think this is needed. The comment for Unique says "it holds when unique SpanExpressions are combined with And". The code in expression.go should already be doing the right thing.

It does seem to be required, unless I'm using the API wrong. The following test case fails without this line of code:

    tsearch_test.go:94: test case: {t @@ 'a & b' true true false}
    tsearch_test.go:122: For (t @@ 'a & b'), expected unique=false, but got true

Reading the code of And, it seems like Unique is populated like this:

		Unique: left.Unique && right.Unique,

Which seems wrong to me in this case? Or am I missing something and the test case is wrong?

The trigram inverted filter builder also does this - it assumes that Unique should be false if you're combining 2 spans with And. I guess there are fixes to make over there as well.

Is the idea that we can still assume they're Unique after And because the rows will be intersected during execution?

Then again, I've read the code that reads the Unique field and it appears like nobody ever uses the Unique property of an expression that isn't a leaf, so I think all of this is immaterial. Perhaps we could teach the optimizer to not plan an inverted filter if the operator is AND and the expression is Unique, at least in some cases? But that's not done today.

@jordanlewis
Copy link
Member Author

Nevermind, you still of course need the inverted filter - that's the thing that's doing the intersection. I still think there could be some duplicate work being performed in the inverted filterer - can't we avoid checking each span for containment within the inverted expression's span, since we know by construction that each returned key is in fact contained in the search spans? But I don't really understand that code very well so I could be wrong.

Copy link
Collaborator

@rytaft rytaft left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nevermind, you still of course need the inverted filter...

I think we discussed this offline, but let me know if you're still unsure about anything.

Reviewed 13 of 13 files at r5, 10 of 10 files at r6, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @jordanlewis)


pkg/util/tsearch/tsquery.go line 252 at r4 (raw file):

Previously, jordanlewis (Jordan Lewis) wrote…

It does seem to be required, unless I'm using the API wrong. The following test case fails without this line of code:

    tsearch_test.go:94: test case: {t @@ 'a & b' true true false}
    tsearch_test.go:122: For (t @@ 'a & b'), expected unique=false, but got true

Reading the code of And, it seems like Unique is populated like this:

		Unique: left.Unique && right.Unique,

Which seems wrong to me in this case? Or am I missing something and the test case is wrong?

The trigram inverted filter builder also does this - it assumes that Unique should be false if you're combining 2 spans with And. I guess there are fixes to make over there as well.

Is the idea that we can still assume they're Unique after And because the rows will be intersected during execution?

Then again, I've read the code that reads the Unique field and it appears like nobody ever uses the Unique property of an expression that isn't a leaf, so I think all of this is immaterial. Perhaps we could teach the optimizer to not plan an inverted filter if the operator is AND and the expression is Unique, at least in some cases? But that's not done today.

We discussed this offline -- I think based on that discussion my original comment stands? Let me know if you still disagree.

This commit adds the ability to create an inverted index on a tsvector
column.

Release note: None
@jordanlewis
Copy link
Member Author

I went ahead and added an index span containment suite with randomization like we do with JSON and Array inverted indexes. I think this is what you were thinking of - PTAL!

@jordanlewis jordanlewis requested a review from rytaft January 20, 2023 21:58
Copy link
Member Author

@jordanlewis jordanlewis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @rytaft)


pkg/util/tsearch/tsquery.go line 252 at r4 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

We discussed this offline -- I think based on that discussion my original comment stands? Let me know if you still disagree.

Yup, agreed, this is now corrected.

@jordanlewis jordanlewis force-pushed the tsvector-inverted branch 2 times, most recently from 92608fe to 22f6553 Compare January 20, 2023 22:39
Copy link
Collaborator

@rytaft rytaft left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm: thanks!

Reviewed 15 of 15 files at r7, 12 of 14 files at r8, 2 of 2 files at r9, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @jordanlewis)


pkg/sql/inverted/expression.go line 303 at r9 (raw file):

	Tight bool

	// Unique is true if the spans are guaranteed not to produce duplicate

maybe this should be "the spans in FactoredUnionSpans"


pkg/sql/inverted/expression.go line 311 at r9 (raw file):

	// Note that the uniqueness property represented here holds for the *output*
	// of the invertedFilter operator that executes the And/Or, not the raw, input
	// spans of data.

Technically the invertedFilter always returns unique keys. How about something like this:

// Once a SpanExpression is built, this field is relevant if the root SpanExpression
// has no children (i.e., Operator is None). In this case, Unique is used to determine
// whether an invertedFilter is needed on top of the inverted index scan to deduplicate
// keys (an invertedFilter is always necessary if Operator is not None).

Copy link
Member Author

@jordanlewis jordanlewis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TFTR!

bors r+

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @rytaft)


pkg/sql/inverted/expression.go line 303 at r9 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

maybe this should be "the spans in FactoredUnionSpans"

Done, I agree that helps matters.


pkg/sql/inverted/expression.go line 311 at r9 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

Technically the invertedFilter always returns unique keys. How about something like this:

// Once a SpanExpression is built, this field is relevant if the root SpanExpression
// has no children (i.e., Operator is None). In this case, Unique is used to determine
// whether an invertedFilter is needed on top of the inverted index scan to deduplicate
// keys (an invertedFilter is always necessary if Operator is not None).

This sounds better, thanks!

This commit adds inverted index acceleration for expressions that
evaluate a tsquery against a tsvector using the `@@` operator.

Release note (sql change): it's now possible to run efficient tsvector
@@ tsquery searches when there is an inverted index on the tsvector
column being searched.
@craig
Copy link
Contributor

craig bot commented Jan 21, 2023

Canceled.

@jordanlewis
Copy link
Member Author

bors r+

@craig
Copy link
Contributor

craig bot commented Jan 21, 2023

Build succeeded:

@craig craig bot merged commit b21379b into cockroachdb:master Jan 21, 2023
@jordanlewis jordanlewis deleted the tsvector-inverted branch January 24, 2023 17:28
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants