Notes on stemming and wildcard searching

  • Possible improvements to stemming and tokenization that I did not implement (because they would need a lot of fine-tuning and/or manually entered vocabularies):

  • Wildcard search is possible, but not recommended because of its impact on search performance at query time.

    • The search logic would have to change to accommodate this; for example, you could go from this:

      # apps/readux/views.py
      from elasticsearch_dsl.query import MultiMatch
      ...
      multimatch_query = MultiMatch(query=search_query, fields=self.query_search_fields)
      volumes = volumes.query(multimatch_query)

      to this:

      # apps/readux/views.py
      from elasticsearch_dsl.query import MultiMatch, Wildcard
      ...
      multimatch_query = MultiMatch(query=search_query, fields=self.query_search_fields)
      volumes = volumes.query(multimatch_query)
      if "*" in search_query:
          summary_query = Wildcard(query=search_query, field="summary")
          label_query = Wildcard(query=search_query, field="label")
          volumes = volumes.query(summary_query).query(label_query)

      It's possible this will also mess with highlighting or produce other weird results.
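
      To see exactly what would be sent to Elasticsearch, you can build and serialize the combined query by hand (a minimal sketch; the field list here is an assumption standing in for self.query_search_fields):

      from elasticsearch_dsl.query import MultiMatch, Wildcard

      search_query = "rom*"
      # & and | combine queries into a single bool query; to_dict() shows
      # the JSON body that would be sent to Elasticsearch
      combined = MultiMatch(query=search_query, fields=["label", "summary"]) & (
          Wildcard(summary=search_query) | Wildcard(label=search_query)
      )
      print(combined.to_dict())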

  • Existing stemmers are algorithmic, meaning they try to guess stems, but they miss some related forms, like "Rome" and "Roman". The only available alternative, a dictionary-based stemmer called hunspell, is not available on AWS managed Elasticsearch; in order to use it, you'd have to move off the AWS managed instance.

    If you do want to do that, here is how to enable hunspell:

    1. Run the command hunspell -D on the machine where Elasticsearch is running to find out the hunspell search paths. (Note that Elasticsearch loads hunspell dictionaries from a hunspell folder inside its own config directory, so that is the safest location.)
    2. Make the en_US dictionary files (en_US.aff and en_US.dic) available in one of those paths, in a subfolder called en_US.
    3. Make the following change to the stemmer in apps/iiif/manifests/documents.py:
      # apps/iiif/manifests/documents.py
      from elasticsearch_dsl import analyzer, token_filter

      # token filter backed by the hunspell en_US dictionary files
      en_US_token_filter = token_filter(
          "en_US",
          type="hunspell",
          locale="en_US",
      )

      # analyzer using the dictionary-based stemmer in place of the
      # algorithmic English stemmer
      stemmer = analyzer(
          "en",
          tokenizer="standard",
          filter=["lowercase", en_US_token_filter],
      )
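
    After restarting Elasticsearch and rebuilding the index, you can sanity-check the filter through the _analyze API (a minimal sketch using the low-level client; the host and client setup are assumptions):

      from elasticsearch import Elasticsearch

      es = Elasticsearch(hosts=["localhost:9200"])
      # run sample text through the same tokenizer/filter chain to inspect
      # how the dictionary stemmer handles forms the algorithmic stemmer misses
      resp = es.indices.analyze(body={
          "tokenizer": "standard",
          "filter": ["lowercase", {"type": "hunspell", "locale": "en_US"}],
          "text": "Rome Roman Romans",
      })
      print([t["token"] for t in resp["tokens"]])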

    For what it's worth, you'll likely need to do this if you eventually want stemming for Tibetan or other languages that Elasticsearch's algorithmic stemmers don't cover.
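
    As a rough sketch, the same pattern for Tibetan might look like this (hypothetical: the bo locale, the dictionary files, and the use of the standard tokenizer are all assumptions):

      # hypothetical Tibetan analyzer; assumes hunspell dictionary files
      # are installed in a subfolder called bo
      bo_token_filter = token_filter(
          "bo",
          type="hunspell",
          locale="bo",
      )

      tibetan_stemmer = analyzer(
          "bo",
          tokenizer="standard",
          filter=[bo_token_filter],
      )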
