Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sql: implement trigram inverted indexes and index acceleration for % and LIKE #79705

Merged
merged 4 commits into from
Jun 7, 2022

Conversation

jordanlewis
Copy link
Member

@jordanlewis jordanlewis commented Apr 9, 2022

Updates #41285

See commits for details. This PR implements inverted indexing on string columns using trigrams and searching trigram indexes using the =, LIKE, ILIKE and % (text similarity) operators.

Inverted indexes on text columns are built with the gin_trgm_ops or gist_trgm_ops opclasses, with CREATE INDEX ON t USING GIN(col gin_trgm_ops), CREATE INVERTED INDEX ON t(col gin_tgrm_ops), or CREATE INDEX ON t USING GIST(col gist_trgm_ops). The gin and gist opclasses are currently implemented identically.

@jordanlewis jordanlewis requested review from a team April 9, 2022 01:15
@jordanlewis jordanlewis requested a review from a team as a code owner April 9, 2022 01:15
@cockroach-teamcity
Copy link
Member

This change is Reviewable

@jordanlewis jordanlewis marked this pull request as draft April 9, 2022 01:15
@jordanlewis
Copy link
Member Author

For an example, here's a log from the CLI:

[email protected]:26257/defaultdb> create table a(a primary key, t) as select 1, 'afooa';
CREATE TABLE AS
[email protected]:26257/defaultdb> create inverted index on a(t);
CREATE INDEX
[email protected]:26257/defaultdb> explain select * from a where t ilike '%foo%zoo%' OR t ilike '%foo%bar%';
                            info
------------------------------------------------------------
  distribution: local
  vectorized: true

  • filter
  │ filter: (t ILIKE '%foo%zoo%') OR (t ILIKE '%foo%bar%')
  │
  └── • index join
      │ table: a@a_pkey
      │
      └── • inverted filter
          │ inverted column: t_inverted_key
          │ num spans: 3
          │
          └── • scan
                missing stats
                table: a@a_t_idx
                spans: 3 spans
(17 rows)
[email protected]:26257/defaultdb> explain select * from a where t ilike '%foo%';
             info
------------------------------
  distribution: local
  vectorized: true

  • filter
  │ filter: t ILIKE '%foo%'
  │
  └── • index join
      │ table: a@a_pkey
      │
      └── • scan
            missing stats
            table: a@a_t_idx
            spans: 1 span
(13 rows)
[email protected]:26257/defaultdb> select * from a where t ilike '%foo%';
  a |   t
----+--------
  1 | afooa
(1 row)
[email protected]:26257/defaultdb>

@jordanlewis jordanlewis force-pushed the trigram branch 4 times, most recently from e5300b1 to 3963a96 Compare April 15, 2022 23:21
@jordanlewis jordanlewis force-pushed the trigram branch 3 times, most recently from 18dd990 to 0c70018 Compare April 16, 2022 21:57
@otan otan removed the request for review from a team April 25, 2022 23:41
@jordanlewis jordanlewis force-pushed the trigram branch 2 times, most recently from 8c17570 to 0d1bcd9 Compare May 6, 2022 20:31
@jordanlewis
Copy link
Member Author

jordanlewis commented May 6, 2022

I added opclass parsing and persistence. I think this is RFAL!

@jordanlewis jordanlewis marked this pull request as ready for review May 6, 2022 23:02
@jordanlewis jordanlewis requested review from a team as code owners May 6, 2022 23:02
@jordanlewis jordanlewis changed the title [wip] sql: implement trigram builtins and inverted indexes sql: implement trigram builtins and inverted indexes May 6, 2022
@jordanlewis jordanlewis force-pushed the trigram branch 2 times, most recently from 80adc59 to 2076c41 Compare May 7, 2022 04:40
@awoods187
Copy link
Contributor

This is super cool! What an unexpected win!

@jordanlewis
Copy link
Member Author

I've split off the builtin functions, operators, session settings, etc into #81418.

@jordanlewis jordanlewis changed the title sql: implement trigram builtins and inverted indexes sql: implement trigram inverted indexes and index acceleration for % and LIKE May 18, 2022
Copy link
Collaborator

@rytaft rytaft left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome! Exciting stuff!

Reviewed 47 of 47 files at r10, 9 of 9 files at r11, 23 of 23 files at r12, 15 of 15 files at r13, 13 of 13 files at r14, 4 of 4 files at r15, 17 of 17 files at r16, 6 of 6 files at r17, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @jordanlewis)


pkg/sql/opt/invertedidx/trigram.go line 32 at r14 (raw file):

}

func (t *trigramFilterPlanner) extractInvertedFilterConditionFromLeaf(

nit: add comment:

// extractInvertedFilterConditionFromLeaf is part of the invertedFilterPlanner
// interface.

pkg/sql/opt/invertedidx/trigram.go line 81 at r14 (raw file):

}

var _ invertedFilterPlanner = &trigramFilterPlanner{}

nit: move this just below the type definition


pkg/sql/opt/invertedidx/trigram.go line 47 at r15 (raw file):

	case *memo.ILikeExpr:
		left, right = e.Left, e.Right
		// If we're doing a LIKE (or ILIKE) expression, we need to construct an AND out of all

nit: make comment <= 80 columns wide


pkg/sql/opt/invertedidx/trigram.go line 32 at r17 (raw file):

}

func (t *trigramFilterPlanner) extractInvertedFilterConditionFromLeaf(

This could use a unit test (similar to TestTryFilterJsonOrArrayIndex)


pkg/sql/parser/sql.y line 8710 at r10 (raw file):

    nullsOrder := $3.nullsOrder()
    if opClass != "" {
      if opClass == "gin_trgm_ops" || opClass == "gist_trgm_ops" {

you only added support for "gin_trgm_ops" above -- do you also want to add support for the gist version?


pkg/sql/rowenc/index_encoding.go line 896 at r14 (raw file):

// trigrams present in the input string.
func EncodeLikeTrigramSpans(val *tree.DString) (inverted.Expression, error) {
	chunks := strings.Split(string(*val), "%")

nit: add a comment to explain why you split on %


pkg/sql/rowenc/index_encoding.go line 898 at r14 (raw file):

	chunks := strings.Split(string(*val), "%")
	// Each chunk will have at minimum 2 trigrams, so start at that default size.
	keys := make([][]byte, 0, len(chunks)*2)

should this be len(chunks)+2?


pkg/sql/rowenc/index_encoding.go line 897 at r17 (raw file):

// expression must match every trigram in the input. Otherwise, it will match
// any trigram in the input.
func EncodeTrigramSpans(val *tree.DString, allMustMatch bool) (inverted.Expression, error) {

This could use a unit test (similar to TestEncodeContainingArrayInvertedIndexSpans, TestEncodeOverlapsArrayInvertedIndexSpans or TestEncodeContainedArrayInvertedIndexSpans).


pkg/sql/sem/tree/create.go line 141 at r16 (raw file):

	//}
	//	return unimplementedWithIssue(sqllex, 47420)
	//}

Looks like dead code


pkg/sql/logictest/testdata/logic_test/trigram_indexes line 17 at r16 (raw file):


statement ok
CREATE INDEX ON a USING GIN(t gin_trgm_ops)

do you want to add a test for USING GIST and gist_trgm_ops too?

@jordanlewis
Copy link
Member Author

TFTR!

@ajwerner @rafiss, I'd like your review on commits 2 and 3, which add opclass parsing and persistence and a version gate. Thanks!

Copy link
Contributor

@ajwerner ajwerner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 6 of 13 files at r14, 2 of 46 files at r18, 2 of 25 files at r22, 23 of 23 files at r26, 17 of 17 files at r27, 4 of 6 files at r28.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @ajwerner, @jordanlewis, @rafiss, and @rytaft)


pkg/sql/create_index.go line 351 at r27 (raw file):

		}
		indexDesc.InvertedColumnKinds[0] = descpb.IndexDescriptor_TRIGRAM
	}

is this going to just accept an opclass for a column type which is, say, INT8?

For some defense in depth, can you return an assertion failure if this gets called with a type that is not inverted indexable?

More generally, I don't see much if any protection against opclasses showing up in the grammar and then being summarily ignored. Maybe I missed it. Can you add some logic tests which make sure things like CREATE INDEX idx ON (int_col gin_trgm_ops) fail. F

Code quote:

	}
	return nil

pkg/sql/catalog/descpb/structured.proto line 349 at r27 (raw file):

  // wouldn't be possible to distinguish a text inverted index that uses
  // trigrams, vs a text inverted index that uses stemming.
  enum InvertedIndexColumnKind {

can you define this in catpb? We're going to need it in the declarative schema changer which sits below descpb.

Code quote:

  enum InvertedIndexColumnKind {
    // DEFAULT is the default kind of inverted index column. JSON, Array, and
    // geo inverted indexes all are DEFAULT, though prior to 22.2 they had no
    // kind at all.
    DEFAULT = 0;
    // TRIGRAM is the trigram kind of inverted index column. It's only valid on
    // text columns.
    TRIGRAM = 1;
  }

pkg/sql/rowenc/index_encoding_test.go line 963 at r26 (raw file):

	runTest := func(indexedValue, value string, searchType trigramSearchType,
		expectContainsKeys, expected, expectUnique bool) {
		t.Logf("test case: %s %s %v %t %t %t", indexedValue, value, searchType, expectContainsKeys, expected, expectUnique)

nit: is this better than a subtest?


pkg/sql/rowenc/index_encoding_test.go line 1098 at r26 (raw file):

		{[]string{}},
	}
	for _, tc := range tcs {

nit: subtests

Copy link
Collaborator

@rytaft rytaft left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewed 23 of 23 files at r26, 17 of 17 files at r27, 6 of 6 files at r28, 1 of 1 files at r29, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @jordanlewis and @rafiss)

Copy link
Collaborator

@mgartner mgartner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 6 of 13 files at r14, 2 of 46 files at r18, 2 of 25 files at r22, 23 of 23 files at r26, 17 of 17 files at r27, 6 of 6 files at r28, 1 of 1 files at r29, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @jordanlewis and @rafiss)


pkg/sql/opt/invertedidx/trigram.go line 97 at r26 (raw file):

		// For a LIKE or ILIKE expr, we split the string by the % separator
		// character which is used in LIKE expressions to separate matches.
		chunks = strings.Split(s, "%")

This seems too naïve to handle escaped %s:

marcus=# select 'hello' LIKE '%ell%';
 ?column?
----------
 t
(1 row)

marcus=# select 'hello' LIKE '\%ell%';
 ?column?
----------
 f
(1 row)

We've run had previous bugs with escaped strings in LIKE expressions: #68289

@mgartner
Copy link
Collaborator

mgartner commented Jun 1, 2022

Very cool!!!

Copy link
Member Author

@jordanlewis jordanlewis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @ajwerner, @mgartner, @rafiss, and @rytaft)


pkg/sql/create_index.go line 351 at r27 (raw file):

Previously, ajwerner wrote…

is this going to just accept an opclass for a column type which is, say, INT8?

For some defense in depth, can you return an assertion failure if this gets called with a type that is not inverted indexable?

More generally, I don't see much if any protection against opclasses showing up in the grammar and then being summarily ignored. Maybe I missed it. Can you add some logic tests which make sure things like CREATE INDEX idx ON (int_col gin_trgm_ops) fail. F

Thanks for the catch. That's a good point, I was missing this. Done.


pkg/sql/opt/invertedidx/trigram.go line 97 at r26 (raw file):

Previously, mgartner (Marcus Gartner) wrote…

This seems too naïve to handle escaped %s:

marcus=# select 'hello' LIKE '%ell%';
 ?column?
----------
 t
(1 row)

marcus=# select 'hello' LIKE '\%ell%';
 ?column?
----------
 f
(1 row)

We've run had previous bugs with escaped strings in LIKE expressions: #68289

It's actually alright, but for a silly reason that I didn't think of either - the trigrams of a string do not include non-alpha characters. So the output of this method will be the same regardless of whether we split the string at all or not.

This makes all of this chunk stuff totally unnecessary, so I've removed it, which simplifies things significantly. Thanks for asking about this!

The other thing is that we always need to re-check the results of trigram searches when searching with more than 1 trigram because the index doesn't contain information about whether two trigrams are adjacent. So that would "save" us here as well, though I suppose not in the case where we include literal % characters in trigrams that we store or search by, but we don't do that anyway.


pkg/sql/rowenc/index_encoding_test.go line 963 at r26 (raw file):

Previously, ajwerner wrote…

nit: is this better than a subtest?

I was just following the convention in this file, so I'll keep it as is.


pkg/sql/rowenc/index_encoding_test.go line 1098 at r26 (raw file):

Previously, ajwerner wrote…

nit: subtests

Simplified this, and there are so few that I don't think it matters too much.

@jordanlewis jordanlewis force-pushed the trigram branch 2 times, most recently from 027299d to 0bc7511 Compare June 4, 2022 04:16
@jordanlewis
Copy link
Member Author

RFAL @ajwerner.

@mgartner, let me know if the provided explanation for the LIKE splitting business needs work, or if I'm missing something.

Copy link
Collaborator

@mgartner mgartner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewed 25 of 25 files at r30, 16 of 19 files at r31, 9 of 9 files at r34, 6 of 6 files at r35, 1 of 1 files at r36, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (and 1 stale) (waiting on @ajwerner, @jordanlewis, and @rafiss)


pkg/sql/opt/invertedidx/trigram.go line 97 at r26 (raw file):

Previously, jordanlewis (Jordan Lewis) wrote…

It's actually alright, but for a silly reason that I didn't think of either - the trigrams of a string do not include non-alpha characters. So the output of this method will be the same regardless of whether we split the string at all or not.

This makes all of this chunk stuff totally unnecessary, so I've removed it, which simplifies things significantly. Thanks for asking about this!

The other thing is that we always need to re-check the results of trigram searches when searching with more than 1 trigram because the index doesn't contain information about whether two trigrams are adjacent. So that would "save" us here as well, though I suppose not in the case where we include literal % characters in trigrams that we store or search by, but we don't do that anyway.

Interesting. Makes sense! 👍


pkg/sql/rowenc/index_encoding_test.go line 1085 at r30 (raw file):

		"fo",
		"a",
		"",

What if there are 3 characters but not all are alpha characters, like fo%? Do we expect that to succeed or err?

Copy link
Contributor

@ajwerner ajwerner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewed 1 of 6 files at r35.
Reviewable status: :shipit: complete! 2 of 0 LGTMs obtained (and 1 stale) (waiting on @jordanlewis and @rafiss)

This commit implements trigram inverted indexes that are searchable
using the =, LIKE, ILIKE and % (similarity) operators. This is a partial
implementation of the gin_trgm_ops opclass from Postgres. gin_trgm_ops
also supports regex matches in Postgres.

Release note (sql change): string columns now support inverted indexes
using trigrams. These indexes can be searched using the =, LIKE, ILIKE
and % (similarity) predicates.
Previously, it was not permitted to use opclass syntax for index
creation. For example, the `blah_ops` in `CREATE INDEX ON t USING GIN
(col blah_ops))`.

Now, this syntax is legal when the opclass is supported for a given
type.

Release note (sql change): permit usage of jsonb_ops, array_ops,
gin_trgm_ops, and gist_trgm_ops as an opclass in inverted index
creation.
Make sure that it's not possible to create an inverted index until the
cluster is at a new enough version.

Release note: None
Copy link
Member Author

@jordanlewis jordanlewis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 3 stale) (waiting on @ajwerner, @mgartner, and @rafiss)


pkg/sql/rowenc/index_encoding_test.go line 1085 at r30 (raw file):

Previously, mgartner (Marcus Gartner) wrote…

What if there are 3 characters but not all are alpha characters, like fo%'? Do we expect that to succeed or err?

Error. Added a test.

@jordanlewis
Copy link
Member Author

Thanks all for the reviews!

bors r+

@craig
Copy link
Contributor

craig bot commented Jun 7, 2022

Build succeeded:

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants