# Notes on stemming and wildcard searching
Possible improvements to stemming and tokenization that I did not implement (because they would need a lot of fine-tuning and/or manually inputting vocabularies):
- Synonym tokenization, which lets you define lists of synonyms to treat as matches
- Stemmer overriding, which similarly lets you add custom entries for improved stemming
- N-gram tokenization, which splits all text into tokens of between `min_gram` and `max_gram` letters
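For reference, here is a sketch of what these three token filters could look like as raw Elasticsearch analysis settings. The filter names and example synonym/stem entries are illustrative, not from this project:

```python
# Illustrative raw Elasticsearch analysis settings for the three ideas above.
# Filter names ("english_synonyms", "custom_stems", "grams") and the example
# entries are hypothetical; real vocabularies would need manual curation.
analysis_settings = {
    "analysis": {
        "filter": {
            # Synonym tokenization: terms in each list are treated as matches
            "english_synonyms": {
                "type": "synonym",
                "synonyms": ["ms, manuscript", "illum, illumination"],
            },
            # Stemmer overriding: hand-curated stemming rules
            "custom_stems": {
                "type": "stemmer_override",
                "rules": ["romans, roman => rome"],
            },
            # N-gram tokenization: split tokens into 2- to 4-letter grams
            "grams": {
                "type": "ngram",
                "min_gram": 2,
                "max_gram": 4,
            },
        }
    }
}
```

Each of these would then need to be added to the `filter` list of the index's analyzer to take effect.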
Wildcard search is possible, but not recommended for search performance at query time.
The search logic would have to change to accommodate this. Some possible changes, from this:
```python
# apps/readux/views.py
from elasticsearch_dsl.query import MultiMatch
...
multimatch_query = MultiMatch(query=search_query, fields=self.query_search_fields)
volumes = volumes.query(multimatch_query)
```
to this:
```python
# apps/readux/views.py
from elasticsearch_dsl.query import MultiMatch, Wildcard
...
multimatch_query = MultiMatch(query=search_query, fields=self.query_search_fields)
volumes = volumes.query(multimatch_query)
if "*" in search_query:
    # Wildcard takes the field name as a keyword argument
    summary_query = Wildcard(summary=search_query)
    label_query = Wildcard(label=search_query)
    volumes = volumes.query(summary_query).query(label_query)
```
It's possible this will also mess with highlighting or produce other weird results.
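For context, a wildcard query on a single field looks roughly like this in raw Elasticsearch query DSL, which is approximately what the `Wildcard` object above serializes to (the field name and query string here are illustrative):

```python
# Raw Elasticsearch query DSL for a wildcard search on the "summary" field.
# Wildcard queries expand the pattern against the index's terms at query
# time, which is why they can be slow on large indices.
search_query = "manuscr*"
wildcard_body = {
    "query": {
        "wildcard": {
            "summary": {"value": search_query},
        }
    }
}
```

Because the pattern is expanded at query time, leading wildcards (e.g. `*script`) are especially expensive, which is part of why this approach is not recommended above.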
Existing stemmers are algorithmic, meaning they try to guess stems, but they miss some pairs, such as "Rome" and "Roman". The main alternative, a dictionary-based stemmer called hunspell, is not supported on AWS managed Elasticsearch; to use it, you'd have to move off the AWS managed instance.
If you do want to do that, here is how to enable hunspell:
- Run the command `hunspell -D` on the machine where Elasticsearch is running to find out the hunspell search paths.
- Make the en-US dictionary files available in one of those paths, in a subfolder called `en_US`.
- Make the following change to the `stemmer` in `apps/iiif/manifests/documents.py`:

```python
# apps/iiif/manifests/documents.py
from elasticsearch_dsl import analyzer, token_filter

en_US_token_filter = token_filter(
    "en_US",
    type="hunspell",
    locale="en_US",
)

stemmer = analyzer(
    "en",
    tokenizer="standard",
    filter=["lowercase", en_US_token_filter],
)
```
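For reference, this analyzer/token filter pair corresponds roughly to the following raw index analysis settings (an approximation of what elasticsearch_dsl generates, written out here by hand):

```python
# Approximate raw Elasticsearch index settings produced by the analyzer
# and token filter above: a hunspell filter keyed to the en_US dictionary,
# applied after lowercasing in a custom analyzer named "en".
hunspell_settings = {
    "analysis": {
        "filter": {
            "en_US": {
                "type": "hunspell",
                "locale": "en_US",
            },
        },
        "analyzer": {
            "en": {
                "type": "custom",
                "tokenizer": "standard",
                "filter": ["lowercase", "en_US"],
            },
        },
    }
}
```

The hunspell filter only works if Elasticsearch can find the `en_US` dictionary files in its hunspell search path, which is why the dictionary steps above come first.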
For what it's worth, you'll likely need to do this if you eventually want to do stemming on Tibetan or other languages.