Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fuzzy match support #2916

Merged
merged 54 commits into from
Feb 12, 2019
Merged

Conversation

srfrog
Copy link
Contributor

@srfrog srfrog commented Jan 19, 2019

This PR adds support for fuzzy matching to root queries and filters. It requires a trigram index on the predicate.

  • A new function match() that uses trigram index to make fuzzy comparisons.
  • Fuzzy terms are characters compared in order.

Examples:

// schema:
title: string @index(trigram) .

// max Levenshtein distance is 2
{
  q(func:match(title, "film", 2)) {
    title
  }
}

No breaking changes are expected.

Closes #1883


This change is Reviewable

@srfrog srfrog requested a review from manishrjain January 22, 2019 01:01
Copy link
Contributor Author

@srfrog srfrog left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 0 of 12 files reviewed, 7 unresolved discussions (waiting on @martinmr)


systest/queries_test.go, line 579 at r2 (raw file):

Previously, srfrog (Gus) wrote…

Move this down? Ok

Done.


tok/tokens.go, line 50 at r2 (raw file):

Previously, srfrog (Gus) wrote…

Because the caller carries the cost of the check, and typically we send a slice of arguments. But there's room for an optimization here.

I removed this func and added a general GetTokens func that uses tokenizer ID, and the args are variadic so don't need to send a slice. In another change I will refactor the code to use it as needed.


worker/tokens.go, line 76 at r2 (raw file):

Previously, srfrog (Gus) wrote…

actually I'll get rid of the switch and use if. thanks.

Done.

Copy link
Contributor

@martinmr martinmr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 0 of 12 files reviewed, 1 unresolved discussion (waiting on @srfrog)


systest/queries_test.go, line 579 at r2 (raw file):

Previously, srfrog (Gus) wrote…

Done.

Sorry, I meant having the code like

	tests := []struct {in, out, failure string} {

cause otherwise there's a line with just }{

Copy link
Contributor Author

@srfrog srfrog left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 0 of 12 files reviewed, 1 unresolved discussion (waiting on @martinmr)


systest/queries_test.go, line 579 at r2 (raw file):

Previously, martinmr (Martin Martinez Rivera) wrote…

Sorry, I meant having the code like

	tests := []struct {in, out, failure string} {

cause otherwise there's a line with just }{

ah i see. i leave like that for clarity and consistency with the other tests.

@srfrog srfrog requested a review from manishrjain January 31, 2019 18:54
manishrjain and others added 4 commits February 5, 2019 15:54
…actor, which would cause early termination of the algo to save CPU resources. Remove the fuzzysearch lib.
This change allows setting a second integer argument in match() to set the
max Levenshtein distance threshold. If no value is set, the default value
of 8 is used.
Updated test for new matchFuzzy using threshold.
@srfrog srfrog requested review from a team and removed request for manishrjain February 7, 2019 03:54
Copy link
Contributor

@manishrjain manishrjain left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Few more changes required.

Reviewed 3 of 8 files at r2, 3 of 6 files at r3, 1 of 6 files at r4, 3 of 3 files at r5, 1 of 1 files at r6.
Reviewable status: all files reviewed, 6 unresolved discussions (waiting on @martinmr and @srfrog)


vendor/vendor.json, line 655 at r6 (raw file):

		{
			"checksumSHA1": "9fTIdD63nJT3Y4QvHtw9dCBhzzE=",
			"path": "github.com/lithammer/fuzzysearch/fuzzy",

Needs to be removed.


worker/task.go, line 1195 at r6 (raw file):

			// convert data from binary to appropriate format
			strVal, err := types.Convert(val, types.StringID)
			if err == nil && matchFuzzy(matchQuery, strVal.Value.(string), int(arg.srcFn.threshold)) {

Convert the threshold to int once, and just use for all the uids.


worker/task.go, line 1507 at r6 (raw file):

		l := len(q.SrcFunc.Args)
		if l == 0 || l > 2 {
			return nil, x.Errorf("Function '%s' requires at most 2 arguments, but got %d (%v)",

at most? Isn't it always 2 args?


worker/task.go, line 1516 at r6 (raw file):

		fc.intersectDest = needsIntersect(f)
		// Max Levenshtein distance
		fc.threshold = 8

Don't add artificial limits. Remove this.

Note that the matching is only done on lists which have been returned by the trigram index. So, we're not matching against the universal data set. Additionally, the cost of reading these lists from the disk is a lot more than running a limited computation in memory.

In the worst case, we're returning all the uids that were returned by the trigram index, which is a manageable list.


worker/task.go, line 1524 at r6 (raw file):

				return nil, x.Errorf("Levenshtein distance value must be an int, got %v", s)
			}
			if max > 0 && max < 8 {

if less than zero, return an error.

Copy link
Contributor Author

@srfrog srfrog left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: all files reviewed, 6 unresolved discussions (waiting on @manishrjain and @martinmr)


vendor/vendor.json, line 655 at r6 (raw file):

Previously, manishrjain (Manish R Jain) wrote…

Needs to be removed.

Done.


worker/task.go, line 1195 at r6 (raw file):

Previously, manishrjain (Manish R Jain) wrote…

Convert the threshold to int once, and just use for all the uids.

Done.


worker/task.go, line 1507 at r6 (raw file):

Previously, manishrjain (Manish R Jain) wrote…

at most? Isn't it always 2 args?

No, I made the distance optional. But I will change it to expect 2 args.


worker/task.go, line 1516 at r6 (raw file):

Previously, manishrjain (Manish R Jain) wrote…

Don't add artificial limits. Remove this.

Note that the matching is only done on lists which have been returned by the trigram index. So, we're not matching against the universal data set. Additionally, the cost of reading these lists from the disk is a lot more than running a limited computation in memory.

In the worst case, we're returning all the uids that were returned by the trigram index, which is a manageable list.

Done.


worker/task.go, line 1524 at r6 (raw file):

Previously, manishrjain (Manish R Jain) wrote…

if less than zero, return an error.

Done.

Copy link
Contributor

@manishrjain manishrjain left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm: Nice work!

Reviewed 2 of 2 files at r7.
Reviewable status: all files reviewed, 1 unresolved discussion (waiting on @martinmr)

@srfrog srfrog merged commit 4d6992f into master Feb 12, 2019
@srfrog srfrog deleted the srfrog/issue-1883_fuzzy_match_support_in_dgraph branch February 12, 2019 01:48
dna2github pushed a commit to dna2fork/dgraph that referenced this pull request Jul 19, 2019
* added "match" to list of valid funcs

* added "match" to list of valid func names

* added handleMatchFunction and MatchFn type

* the match func needs a fulltext tokenizer

* added ngram bleve analizer

* added ngram tokenizer

* verify match uses ngram tokenizer

* get string tokens for match

* added func to build match tokens

* changed handleMatchFunction to use index

* cherry-pick schema.HasTokenizer

* configure bigram filter before the analyzer to update cache

* added EncodeTokens convenience func to encapsulate encodeToken

* we dont need to pre-get tokens, we do that when the task is running

* handleMatchFunction updated for fuzzy match, filter optimizations, and code cleanups

* matchFuzzy func using index and bigram (ngram) index

* adding Bleve ngram support

* added func comments

* all fuzzy tokens must match

* cp: test cases might not be mutex in the future, revert the order.

* cp: dont try to match all posting values against match

* cp: added comment

* removed extra branch

* switch to trigram for fuzzy match indexing

* fixed typo

* renamed ngram to match

* fixed typo

* added full posting value to search terms, minor cleanups

* added fuzzy matching test

* remove Bleve ngram pkg

* revert this change

* fixed needsIntersect

* switched to a new fuzzy matching pkg

* tweaked test with misspelling

* added func GetTokenizerByID to search registered tokenizers by id

* added tok.GetTokens for generating tokens by id

* changed to use tok.GetTokens

* fixed grammar in comment

* replaced underused switch with if block

* small test change

* Pick up and modify Levenshtein distance to introduce a max distance factor, which would cause early termination of the algo to save CPU resources. Remove the fuzzysearch lib.

* using threshold for lev distance max

* worker/task.go: added match() argument for specifying max distance.

This change allows setting a second integer argument in match() to set the
max Levenshtein distance threshold. If no value is set, the default value
of 8 is used.

* systest/queries_test.go: updated match query tests

Updated test for new matchFuzzy using threshold.

* wiki/content/query-language/index.md: added section for match function

* vendor/vendor.json: removed old fuzzy pkg

* worker/task.go: match func enforce 2 args, max distance must be gt zero

* wiki/content/query-language/index.md: updated syntax, example and fixed typos

* systest/queries_test.go: updated syntax in tests

* wiki/content/query-language/index.md: minior doc fixes
@lukaszlenart
Copy link
Contributor

Is there a chance to cherry-pick support for the match function into Dgraph 1.0.x branch?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

fuzzy match support in dgraph
4 participants