Notes on stemming and wildcard searching

  • Possible improvements to stemming and tokenization that I did not implement (because they would need a lot of fine-tuning and/or manually entered vocabularies):

  • Wildcard search is possible, but not recommended because of its impact on search performance at query time.

    • The search logic would have to change to accommodate this; for example, you could go from this:

      # apps/readux/views.py
      from elasticsearch_dsl.query import MultiMatch
      ...
      multimatch_query = MultiMatch(query=search_query, fields=self.query_search_fields)
      volumes = volumes.query(multimatch_query)

      to this:

      # apps/readux/views.py
      from elasticsearch_dsl.query import MultiMatch, Wildcard
      ...
      multimatch_query = MultiMatch(query=search_query, fields=self.query_search_fields)
      volumes = volumes.query(multimatch_query)
      if "*" in search_query:
          summary_query = Wildcard(query=search_query, field="summary")
          label_query = Wildcard(query=search_query, field="label")
          volumes = volumes.query(summary_query).query(label_query)

      It's possible this will also mess with highlighting or produce other weird results.
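
      To see exactly what would be sent to Elasticsearch, you can build and serialize the combined query by hand (a minimal sketch; the field list here is an assumption standing in for self.query_search_fields):

      from elasticsearch_dsl.query import MultiMatch, Wildcard

      search_query = "rom*"
      # & and | combine queries into a single bool query; to_dict() shows
      # the JSON body that would be sent to Elasticsearch
      combined = MultiMatch(query=search_query, fields=["label", "summary"]) & (
          Wildcard(summary=search_query) | Wildcard(label=search_query)
      )
      print(combined.to_dict())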

  • Existing stemmers are algorithmic, meaning they try to guess stems, but they miss some related forms, like "Rome" and "Roman". The only available alternative, a dictionary-based stemmer called hunspell, is not available on AWS managed Elasticsearch; in order to use it, you'd have to move off the AWS managed instance.

    If you do want to do that, here is how to enable hunspell:

    1. Run the command hunspell -D on the machine where Elasticsearch is running to find out the hunspell search paths. (Note that Elasticsearch loads hunspell dictionaries from a hunspell folder inside its own config directory, so that is the safest location.)
    2. Make the en_US dictionary files (en_US.aff and en_US.dic) available in one of those paths, in a subfolder called en_US.
    3. Make the following change to the stemmer in apps/iiif/manifests/documents.py:
      # apps/iiif/manifests/documents.py
      from elasticsearch_dsl import analyzer, token_filter

      # token filter backed by the hunspell en_US dictionary files
      en_US_token_filter = token_filter(
          "en_US",
          type="hunspell",
          locale="en_US",
      )

      # analyzer using the dictionary-based stemmer in place of the
      # algorithmic English stemmer
      stemmer = analyzer(
          "en",
          tokenizer="standard",
          filter=["lowercase", en_US_token_filter],
      )
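
    After restarting Elasticsearch and rebuilding the index, you can sanity-check the filter through the _analyze API (a minimal sketch using the low-level client; the host and client setup are assumptions):

      from elasticsearch import Elasticsearch

      es = Elasticsearch(hosts=["localhost:9200"])
      # run sample text through the same tokenizer/filter chain to inspect
      # how the dictionary stemmer handles forms the algorithmic stemmer misses
      resp = es.indices.analyze(body={
          "tokenizer": "standard",
          "filter": ["lowercase", {"type": "hunspell", "locale": "en_US"}],
          "text": "Rome Roman Romans",
      })
      print([t["token"] for t in resp["tokens"]])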

    For what it's worth, you'll likely need to do this if you eventually want stemming for Tibetan or other languages that Elasticsearch's algorithmic stemmers don't cover.
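
    As a rough sketch, the same pattern for Tibetan might look like this (hypothetical: the bo locale, the dictionary files, and the use of the standard tokenizer are all assumptions):

      # hypothetical Tibetan analyzer; assumes hunspell dictionary files
      # are installed in a subfolder called bo
      bo_token_filter = token_filter(
          "bo",
          type="hunspell",
          locale="bo",
      )

      tibetan_stemmer = analyzer(
          "bo",
          tokenizer="standard",
          filter=[bo_token_filter],
      )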
