Reworking search: tokenization, handling of quoted literal search, and postgres fuzziness #2351

Merged
merged 36 commits into from
May 28, 2023

Conversation

jecorn
Contributor

@jecorn jecorn commented Apr 18, 2023

What type of PR is this?

  • feature
  • bug (one tiny one)

What this PR does / why we need it:

The current search is based on exact phrases entered into the search box. This behaviour is similar to quoting a search string in a search engine (though diacritics are removed and matches are case insensitive). For example, searching for beans pinto does not find pinto beans, and egg over easy does not find eggs over easy. This is convenient to code, but it leads to unexpected behaviour for users: "easy" search hits are missed.

This PR adds four improvements to search:

  1. Search is now tokenized on whitespace by default, so that search strings are separated and hits can be found from matches independent of intervening words and word order. For example, pinto beet burgers matches both pinto burgers and beet pinto burgers.
  2. Special characters except quotes (' and ") are removed from the search string and replaced by whitespace, which is then used for tokenization. So slow-cooker carrots is tokenized for search to slow and cooker and carrots.
  3. Tokenized search (including special character removal) can be overridden using quotes (either ' or "). Quoted searches are searched for literally and can be mixed and matched with non-quoted searches. For example, "slow-cooker" beans casserole 'with paprika' is tokenized to a search for slow-cooker, beans, casserole and with paprika.
  4. On a postgres backend, fuzzy searching is now the default, using word similarity trigrams with GIN indexing of the recipe.name_normalized, recipe.description_normalized, recipe_ingredient.note_normalized and recipe_ingredient.original_text_normalized columns. Trigram searching avoids the need to define a language for the database, as would otherwise be needed for full-text search stemming and stop word removal. Fuzziness can therefore work on mixed-language databases. The fuzziness is calibrated to try to return hits if a search term has one or two mismatches (depending on word length) while keeping false positives low. Fuzzy search orders hits by trigram similarity to the recipe name. Fuzzy search is incompatible with quoted literal search, so adding quoted strings to a search automatically falls back to token search.
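
To make points 1 to 3 concrete, here is a minimal sketch of a tokenizer with the described behaviour. It is not the code from this PR: the function name `tokenize_search` and both regular expressions are illustrative assumptions, and "special character" is taken to mean anything that is neither a word character nor whitespace.

```python
import re

# Hypothetical helper names; not taken from the Mealie code base.
# A phrase wrapped in matching double or single quotes; group 2 is the inner text.
_QUOTED = re.compile(r"""(["'])(.+?)\1""")
# Assumption: a "special character" is anything that is not a word character or whitespace.
_SPECIAL = re.compile(r"[^\w\s]")


def tokenize_search(query: str) -> list[str]:
    """Split a search string into literal phrases and whitespace tokens."""
    # Quoted phrases are kept verbatim, including any special characters.
    literals = [match.group(2) for match in _QUOTED.finditer(query)]
    # Cut the quoted phrases out, then replace special characters with spaces.
    remainder = _QUOTED.sub(" ", query)
    remainder = _SPECIAL.sub(" ", remainder)
    # Whatever is left is split on whitespace.
    return literals + remainder.split()
```

With this sketch, `tokenize_search("slow-cooker carrots")` yields the tokens slow, cooker and carrots, while the quoted example from point 3 keeps slow-cooker and with paprika intact (literals are returned first, so the order differs from the input). An apostrophe inside an unquoted word, as in grandma's, can be mistaken for an opening quote here; a real implementation would need to decide how to handle that.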

Which issue(s) this PR fixes:

(fixes an un-numbered bug in RecipeIngredientModel.note_normalized triggers)
Fixes #2325
Addresses #2335

Special notes for your reviewer:

I tried implementing both trigram and full text searching in postgres. Needing to define a language for the database for full-text search indexing represented a problem because text search depends on defining an index per language in each column. What if a database contains words in multiple languages? This is often the case for international recipes, and we don't know the languages used inside a recipe ahead of time or even at recipe entry/import time.

The main benefit of full-text search over trigrams is performance. But for a recipe database, we are unlikely to run into the million-row situation that is a problem for trigrams. I found that trigrams with GIN indexing are performant even on a database of 6,000 real recipes, and they have the huge benefit of being language-independent.

(The number of commits in this PR reflects my development style, which spans multiple machines in multiple physical locations, not the complexity of the final code)

Testing

I created/changed several recipe search tests in test_recipe_repository.py:

  1. Refactored the test strings for recipes. They are no longer randomly generated on each test run. With random generation, there is a small chance that two supposedly orthogonal test strings will be close enough to be cross-matched by fuzzy search, causing a spurious test failure. They are now based on foods (though somewhat silly ones).
  2. Test literal search: quoted strings bypass tokenised search and only match exact phrases in the search text
  3. Test special character removal from non-literal searches: non-quoted strings have special characters removed and are properly tokenised between those special characters
  4. Test token separation: non-quoted strings are separated into tokens based on whitespace and good search results are returned independent of token order
  5. Test fuzzy search: postgres only, typos and small word differences (e.g. plural or singular) return good search results with title-match prioritized

PR passes all automated tests run from scratch (both previous and added above) using both an sqlite and postgres backend.

In addition to the automated tests, I tested tokenized, quoted literal, and fuzzy search on both sqlite and postgres backend using a 6,000 row database of real recipes. Search is fast in all contexts. Fuzzy search returns false positives (as expected, e.g. mean is a false positive for bean), but also rescues searches that would otherwise fail (e.g. bean is a true positive for beans). Ordering by trigram similarity to the recipe name brings the most relevant hits to the top of the list.
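
The mean/bean example can be illustrated with a small pure-Python sketch of trigram similarity. This is a simplification for single words only: it mirrors the padding that pg_trgm applies (two spaces before the word, one after) and the plain set-overlap similarity, whereas this PR uses word similarity, which Postgres computes differently. The function names are illustrative.

```python
def trigrams(word: str) -> set[str]:
    """Return the set of 3-character substrings of a single padded, lowercased word."""
    # pg_trgm treats each word as prefixed by two spaces and suffixed by one.
    padded = f"  {word.lower()} "
    return {padded[i:i + 3] for i in range(len(padded) - 2)}


def similarity(a: str, b: str) -> float:
    """Shared trigrams divided by all distinct trigrams of the two words."""
    ta, tb = trigrams(a), trigrams(b)
    return len(ta & tb) / len(ta | tb)


# "bean" and "beans" share 4 of 7 distinct trigrams.
print(similarity("bean", "beans"))  # 4/7, roughly 0.57
# "bean" and "mean" share only "ean" and "an ", 2 of 8 distinct trigrams.
print(similarity("bean", "mean"))   # 0.25
```

No dictionary or stemmer is involved, which is why the approach does not depend on the language of the recipe text. Where exactly the cut-off sits between a hit and a miss is a calibration choice that this sketch does not attempt to reproduce.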

Release Notes

Search is now by default independent of word order and matches despite intervening words. @jecorn 
Searches can be quoted to find literal matches & quoted search can be mixed with non-quoted search. @jecorn 
On postgres only, search is now fuzzy by default. This can be overridden by using quoted search. @jecorn 

@michael-genson
Collaborator

The backend tests on the GitHub workflow run a few extra tests, namely ruff. Looks like some linting tests are failing.
You can run make backend-all, which runs all the linting tests too (or just run make backend-lint for ruff).

@jecorn
Contributor Author

jecorn commented Apr 19, 2023

Thanks! Saves a lot of time being able to run the extra tests locally. Stay tuned.

@jecorn
Contributor Author

jecorn commented Apr 19, 2023

While running the backend-all tests, I started now seeing that the test_get_scheduled_webhooks_filter_query test fails. It's not finding anything at all (result list is empty). But this test also fails on a clean checkout of the mealie-next head with all of my changes removed. So I'm inclined to ignore it and move on, since the failure seems unrelated to my diffs.

Should be ready for the GitHub auto-checks.

@jecorn jecorn marked this pull request as ready for review April 19, 2023 11:24
@jecorn jecorn marked this pull request as draft April 19, 2023 11:34
@jecorn jecorn marked this pull request as ready for review April 19, 2023 11:36
@jecorn jecorn marked this pull request as draft April 19, 2023 11:38
@michael-genson
Collaborator

While running the backend-all tests, I started now seeing that the test_get_scheduled_webhooks_filter_query test fails

Looks like it worked here ¯\_(ツ)_/¯ Might just be flaky

If this PR is ready, feel free to mark it ready for review. It might be a bit before hay-kot is able to get to it, I know he's got a lot going on

@jecorn jecorn marked this pull request as ready for review April 19, 2023 21:08
@jecorn
Contributor Author

jecorn commented Apr 19, 2023

Thanks, will do!

By the way, while doing all of the postgres development, I noticed that the tests are pretty stateful, in that they don't clean up after themselves. This sometimes leads to weird collisions between tests if they are run without intermediate manual cleaning (e.g. multiple duplicate entries in the database, migration data hanging around that can cross-match between tests, etc). The make backend-clean target works great for sqlite, but postgres is a pain. I was thinking of writing a helper function that wipes all data and could be run after each test that leaves data behind. But then again, the diversity of the tests means there would probably need to be many different cleanup schemes. Painful.

@jecorn
Contributor Author

jecorn commented Apr 30, 2023

Is it bad form to add my user name to the release notes? I saw a note about doing that somewhere in the docs, and noticed it in previous release notes, which is why I did so. But also just saw that recent pull requests do not have a user name in the release notes.

@jecorn
Contributor Author

jecorn commented May 13, 2023

Hi @hay-kot. This search PR hasn’t changed in 3 weeks, passes all tests, and improves both SQLite and Postgres search. Is there more info you need from me? Or just let me know if life stuff is too hectic and I’ll give it a rest.

Collaborator

@hay-kot hay-kot left a comment

Implementation looks good. There's a few small things. Looks like the bulk of changes only change Postgres installs (apart from the bug fix)

Could you also add a note in the documentation so this feature is easily discoverable? Maybe an entry in the FAQ and some mention of it in the getting started guide, mentioning enhanced search with Postgres installation.

mealie/db/models/recipe/ingredient.py Outdated Show resolved Hide resolved
mealie/repos/repository_recipes.py Show resolved Hide resolved
mealie/repos/repository_recipes.py Outdated Show resolved Hide resolved
@jecorn
Contributor Author

jecorn commented May 14, 2023

Thanks for taking a look! For clarity, tokenized search applies to sqlite (trigram search is already basically tokenized). Quoted search applies to both postgres and sqlite. And quoting parts of a postgres backend search falls back to tokenized search (you can't mix trigrams with exact substring matches).
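
As a rough sketch of that decision logic (not the actual code in this PR; the function name, the backend strings and the returned mode labels are all made up for illustration):

```python
import re

# A phrase wrapped in matching double or single quotes.
_QUOTED = re.compile(r"""(["']).+?\1""")


def choose_search_mode(backend: str, query: str) -> str:
    """Pick a search strategy from the backend and the presence of quoted phrases."""
    has_quoted = bool(_QUOTED.search(query))
    # Trigram (fuzzy) search is postgres-only and cannot be mixed with literals.
    if backend == "postgres" and not has_quoted:
        return "fuzzy"
    # sqlite always uses token search; postgres falls back to it when quotes appear.
    return "token"
```

So `choose_search_mode("postgres", "pinto beans")` gives "fuzzy", while adding a quoted phrase to the same query, or switching the backend to sqlite, gives "token".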

For documentation, I added something about fuzzy search in the FAQ. And Postgres-specificity is mentioned in the Features section and Postgres container part of my docs commit (already pulled)

@jecorn
Contributor Author

jecorn commented May 16, 2023

@hay-kot Not to confuse things with multiple PRs from me, but I think this search PR is ready to go again whenever you have a moment to take a look.

mealie/repos/repository_recipes.py Show resolved Hide resolved
@hay-kot hay-kot merged commit 7e0d29a into mealie-recipes:mealie-next May 28, 2023
@jecorn jecorn deleted the postgres-fuzz branch May 29, 2023 03:53
Development

Successfully merging this pull request may close these issues.

[v1.0.0 nightly] - Search misses "easy" hits when there are intervening words or the order is changed
3 participants