-
-
Notifications
You must be signed in to change notification settings - Fork 784
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Reworking search: tokenization, handling of quoted literal search, and postgres fuzziness #2351
Conversation
The backend tests on the GitHub workflow run a few extra tests, namely ruff. Looks like some linting tests are failing. |
Thanks! Saves a lot of time being able to run the extra tests locally. Stay tuned. |
While running the backend-all tests, I started now seeing that the test_get_scheduled_webhooks_filter_query test fails. It's not finding anything at all ( Should be ready for the GitHub auto-checks. |
Looks like it worked here ¯\_(ツ)_/¯ Might just be flaky If this PR is ready, feel free to mark it ready for review. It might be a bit before hay-kot is able to get to it, I know he's got a lot going on |
Thanks, will do! By the way, while doing all of the postgres development, I noticed that the tests are pretty stateful, in that they don't clean up after themselves. Which sometimes leads to weird collisions between tests if they are run without intermediate manual cleaning (e.g. multiple duplicate entries in the database, migration data hanging around that can cross-match between tests, etc). The |
Is it bad form to add my user name to the release notes? I saw a note about doing that somewhere in the docs, and noticed it in previous release notes, which is why I did so. But also just saw that recent pull requests do not have a user name in the release notes. |
Hi @hay-kot. This search PR hasn’t changed in 3 weeks, passes all tests, and improves both SQLite and Postgres search. Is there more info you need from me? Or just let me know if life stuff is too hectic and I’ll give it a rest. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Implementation looks good. There's a few small things. Looks like the bulk of changes only change Postgres installs (apart from the bug fix)
Could you also add a note in the documentation so this feature is easily discoverable? Maybe an entry in the FAQ and some mention of it in the getting started guide, mentioning enhanced search with Postgres installation.
Thanks for taking a look! For clarity, tokenized search applies to sqlite (trigram search is already basically tokenized). Quoted search applies to both postgres and sqlite. And quoting parts of a postgres backend search falls back to tokenized search (you can't mix trigrams with exact substring matches). For documentation, I added something about fuzzy search in the FAQ. And Postgres-specificity is mentioned in the Features section and Postgres container part of my docs commit (already pulled) |
@hay-kot Not to confuse things with multiple PRs from me, but I think this search PR is ready to go again whenever you had a moment to take a look. |
What type of PR is this?
What this PR does / why we need it:
The current search is based on exact phrases entered into the search box. This behaviour is similar to quoting a search string in a search engine (though diacritics are removed and matches are case insensitive). For example, searching for
beans pinto
does not findpinto beans
,egg over easy
does not find andeggs over easy
, etc.. This is convenient to code, but leads to some unexpected behaviour for users where "easy" search hits are missed.This PR adds four improvements to search:
pinto beet burgers
matches bothpinto burgers
andbeet pinto burgers
slow-cooker carrots
is tokenized for search toslow
andcooker
andcarrots
"slow-cooker" beans casserole 'with paprika'
is tokenized to a search forslow-cooker
,beans
,casserole
andwith paprika
.recipe.name_normalized
,recipe.description_normalized
,recipe_ingredient.note_normalized
andrecipe_ingredient.original_text_normalized
columns. Trigram searching avoids the need to define a language for the database, as would otherwise be needed for full-text search stemming and stop word removal. Fuzziness can therefore work on mixed-language databases. The fuzziness is calibrated to try to return hits if a search term has 1 or two mismatches (depending on word length) while keeping false positives low. Fuzzy search orders hits by trigram similarity to the recipe name. Fuzzy search is incompatible with quoted literal search, so adding quoted strings to search automatically falls back to token search.Which issue(s) this PR fixes:
(fixes an un-numbered bug in RecipeIngredientModel.note_normalized triggers)
Fixes #2325
Addresses #2335
Special notes for your reviewer:
I tried implementing both trigram and full text searching in postgres. Needing to define a language for the database for full-text search indexing represented a problem because text search depends on defining an index per language in each column. What if a database contains words in multiple languages? This is often the case for international recipes, and we don't know the languages used inside a recipe ahead of time or even at recipe entry/import time.
The main benefit of full-text search over trigrams is performance. But for a recipe database, we are unlikely to run into the a million-row situation that is a problem for trigrams. I found that trigrams with GIN indexing are performant on even a 6,000 real-recipe database, and they have the huge benefit of being language-independent.
(The number of commits in this PR reflects my development style, which spans multiple machines in multiple physical locations, not the complexity of the final code)
Testing
I created/changed several recipe search tests in
test_recipe_repository.py
:PR passes all automated tests run from scratch (both previous and added above) using both an sqlite and postgres backend.
In addition to the automated tests, I tested tokenized, quoted literal, and fuzzy search on both sqlite and postgres backend using a 6,000 row database of real recipes. Search is fast in all contexts. Fuzzy search returns false positives (as expected, e.g.
mean
is a false positive forbean
), but also rescues searches that would otherwise fail (e.g.bean
is a true positive forbeans
). Ordering by trigram similarity to the recipe name brings the most relevant hits to the top of the list.Release Notes