This repository has been archived by the owner on Feb 22, 2023. It is now read-only.

Add search algorithm documentation
sarayourfriend committed Jan 4, 2023
1 parent 1748feb commit 9d0410e
Showing 2 changed files with 228 additions and 0 deletions.
1 change: 1 addition & 0 deletions api/docs/reference/index.rst
@@ -7,6 +7,7 @@ Subpackages
.. toctree::
   :maxdepth: 4

   search_algorithm
   api/index
   urls

227 changes: 227 additions & 0 deletions api/docs/reference/search_algorithm.md
@@ -0,0 +1,227 @@
# Search Algorithm

Openverse currently uses a relatively simple and naïve search algorithm with
very limited options. The documentation on this page was written by referencing
the code in Openverse as well as parts of Openverse's historical development.
Parts of the story for how Openverse's indexes came to be configured as they are
today are likely missing. Future improvements to Openverse's indexing and search
will be more carefully documented here and in the code to ensure there is
greater longevity of understanding.

> **Note**: This document avoids covering details covered in the
> [Openverse Search Guide](https://wordpress.org/openverse/search-help).
> Specifically, this document does not describe _how_ to search (advanced
> techniques and syntax), rather _what is searched_ and in what way.

## Reference resources

This document includes links to specific parts of Elasticsearch's documentation
and Openverse code. The following are broadly useful entry points into learning
about Elasticsearch, full-text search, and Openverse's index configuration:

- [Elasticsearch 7.12 documentation](https://www.elastic.co/guide/en/elasticsearch/reference/7.12/index.html)
- [Full-text search (Wikipedia)](https://en.wikipedia.org/wiki/Full-text_search)
- [Stemming (Wikipedia)](https://en.wikipedia.org/wiki/Stemming)
- [`es_mapping.py` (index configuration)](https://github.com/WordPress/openverse-api/blob/main/ingestion_server/ingestion_server/es_mapping.py)

## Terms

- "Document": The total queryable information representing a single work
catalogued in Openverse.
- "Field": The individual queryable elements of a document. In the context of
Openverse, these may be textual, numerical, or keywords.
- "Result relevance": An individual result is "relevant" to a search when it
matches the expectations of the user for a given query. A result is irrelevant
if it does not match the expectations of the user for that same query. For
example, the query "bird watch" may produce pictures of a wrist watch with a
bird clockface illustration _or_ could surface pictures related to the
activity also known as "birding" (bird watching), due to stemming. In the case
of this specific query, "bird watching" may not be relevant, despite being a
technically correct match for the query given Openverse's current index
configuration. Other relevancy issues may be caused by descriptions that are
not related to the contents of an image. This often happens on Flickr where
users sometimes include blog-like text in the description of an image that
references things that happened outside of the context of the image itself.
- "Result quality": A combination of relevance and other factors like the actual
perceived "quality" of a given work. A work may be directly relevant to a
particular query but be of low quality. Quality is subjective, though there
may be certain characteristics that are broadly applicable to some subset of
searches.

## Technology

Openverse uses Elasticsearch's
[full text indexing and search capabilities](https://www.elastic.co/guide/en/elasticsearch/reference/7.12/full-text-queries.html).
We currently rely heavily on Elasticsearch's default behaviours in many aspects
of our search, including its default stemming configuration, aside from small
adjustments documented in the
[text analysis/tokenization](#text-analysistokenization) section below. The
"raw" index configuration can be found in the `es_mapping.py` module (see the
link in [reference resources](#reference-resources)). Information on how to
understand the configuration can be found in the
[Elasticsearch documentation for index configuration](https://www.elastic.co/guide/en/elasticsearch/reference/7.12/index-modules.html).

> **Note**: We also apply cluster-level configurations as part of our
> Elasticsearch deployment. These are intentionally not covered here as they
> primarily deal with cluster performance and are irrelevant to the way searches
> are executed.

## Text analysis/tokenization

> **Note**: A general understanding of
> [full-text search (Wikipedia)](https://en.wikipedia.org/wiki/Full-text_search)
> and the concepts of
> ["stemming" (Wikipedia)](https://en.wikipedia.org/wiki/Stemming) and
> ["tokenization" (Wikipedia)](https://en.wikipedia.org/wiki/Lexical_analysis#Tokenization)
> will be useful for understanding this section.

Text analysis or tokenization is the broad process Elasticsearch follows to
derive the index tokens for a given text document. A significant aspect of this
for Openverse (and many applications) is
[stemming](https://www.elastic.co/guide/en/elasticsearch/reference/7.12/stemming.html).
When Elasticsearch performs a full text search, it is searching the derived
tokens, for which it has created a quickly searchable index, rather than the
original text itself. This means that the text analysis configuration applied
has a significant impact on the way documents are searched and how relevance is
calculated.
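
To make the idea of searching derived tokens concrete, here is a minimal sketch of an analysis chain in Python. It is a toy: the suffix stripper below is a hand-rolled stand-in and not the stemmer Elasticsearch uses, and the function names `toy_analyze` and `toy_matches` are invented for illustration. It only shows why a query and a document can match even when their original text differs.

```python
import re
from typing import List

# Toy stand-in for an analysis chain. This is NOT the real English stemmer,
# only a crude suffix stripper used to illustrate the idea.
_SUFFIXES = ("ing", "es", "s")


def toy_analyze(text: str) -> List[str]:
    """Split text into lowercase, crudely stemmed index tokens."""
    tokens = []
    for raw in re.findall(r"[A-Za-z0-9']+", text):
        token = raw.lower()
        if token.endswith("'s"):  # drop the English possessive
            token = token[:-2]
        for suffix in _SUFFIXES:
            # Keep at least three characters so short words survive intact.
            if token.endswith(suffix) and len(token) - len(suffix) >= 3:
                token = token[: -len(suffix)]
                break
        tokens.append(token)
    return tokens


def toy_matches(query: str, document_text: str) -> bool:
    """True when every query token appears among the document's tokens."""
    return set(toy_analyze(query)) <= set(toy_analyze(document_text))
```

With this toy chain, `toy_analyze("Bird watching")` and `toy_analyze("bird watch")` produce the same tokens, which mirrors the "bird watch" relevance example in the Terms section.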

Openverse only applies our custom text analysis configuration to the "title",
"description", and "tags" fields. All other fields use the
[standard analyzer](https://www.elastic.co/guide/en/elasticsearch/reference/7.17/analysis-standard-analyzer.html).

Openverse is currently configured exclusively with English language text
analysis. This means Openverse does not properly index documents in any language
other than English. This is a known issue that we hope to be able to address
soon via Elasticsearch's internationalisation tools.

We primarily use the default English language stemming settings aside from
[minor changes (GitHub)](https://github.com/cc-archive/cccatalog-api/issues/574#issue-668091876)
present in the `es_mapping.py` configuration to address a specific issue with
the "anim" stem. We use the
[Snowball English stemmer](https://snowballstem.org/algorithms/porter/stemmer.html)
and
[Lucene's possessive English stemmer](https://lucene.apache.org/core/8_8_0/analyzers-common/org/apache/lucene/analysis/en/EnglishPossessiveFilter.html).

The rationale for this specific stemmer configuration is not documented aside
from the "anim" stem issue linked above.

Openverse also applies the `lowercase` text analysis filter making our indexes
case-insensitive.
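
As a rough illustration of how such a filter chain can be expressed, the sketch below writes Elasticsearch analysis settings as a Python dictionary. The filter types (`stemmer`, `stemmer_override`, `lowercase`) are standard Elasticsearch token filters, but the names `custom_english` and `stem_overrides`, the filter order, and the single override rule are assumptions made for this example and are not copied from `es_mapping.py`. The helper `undefined_filters` is likewise hypothetical.

```python
from typing import Any, Dict, List

# Hypothetical names throughout: "custom_english", "stem_overrides", and the
# single override rule are illustrative, not copied from es_mapping.py.
ANALYSIS_SETTINGS: Dict[str, Any] = {
    "filter": {
        "english_possessive_stemmer": {
            "type": "stemmer",
            "language": "possessive_english",
        },
        "english_stemmer": {"type": "stemmer", "language": "english"},
        # Overrides must run before the stemmer so protected words are kept.
        "stem_overrides": {
            "type": "stemmer_override",
            "rules": ["anime => anime"],
        },
    },
    "analyzer": {
        "custom_english": {
            "type": "custom",
            "tokenizer": "standard",
            "filter": [
                "english_possessive_stemmer",
                "lowercase",
                "stem_overrides",
                "english_stemmer",
            ],
        }
    },
}

# Filters that Elasticsearch ships with and that need no definition here.
BUILT_IN_FILTERS = {"lowercase"}


def undefined_filters(settings: Dict[str, Any]) -> List[str]:
    """List filters an analyzer references that are neither defined nor built in."""
    defined = set(settings["filter"]) | BUILT_IN_FILTERS
    missing = []
    for analyzer in settings["analyzer"].values():
        missing.extend(name for name in analyzer["filter"] if name not in defined)
    return missing
```

A check like `undefined_filters(ANALYSIS_SETTINGS)` returning an empty list is a cheap way to catch a misspelled filter name before the settings are sent to a cluster.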

## Search execution

There are essentially two broad types of "full-text" searches available via the
Openverse API:

1. General "query" searching, which applies a simple text query search against
several non-configurable fields.
2. Individual field querying, which enables searching specific fields with
independent query terms.

Both of these search types also allow further keyword filtering along the
following fields:

- Extension
- Category
- Length
- Aspect ratio
- Size
- Source
- License

Source is the only field for which you can currently also specify exclusions.

By default, items marked "mature" are excluded, but these can also be enabled.

See the following API documentation links for descriptions and options for each
field:

- [Audio search](https://api.openverse.engineering/v1/#operation/audio_search)
- [Image search](https://api.openverse.engineering/v1/#operation/image_search)

Each of these fields is searched relatively strictly, primarily because the
search domain in each is very small and "keyword"-like. That is, there is a
limited and specific set of terms that appear in the relevant document fields
for each of these query parameters. All of them are validated to only allow
specific options (documented in the API documentation links above), which
enforces the "keyword"-like nature of their usage.
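
The following sketch shows how a request combining a text query with keyword filters might be assembled. The parameter names (`license`, `excluded_source`, and so on), the `/images/` path, and the comma-separated form for multiple values are assumptions here; the API documentation linked above is the authority on the exact names and allowed values. `build_search_url` is a hypothetical helper and it only builds the URL without sending a request.

```python
from typing import Iterable, Optional, Union
from urllib.parse import urlencode

API_ROOT = "https://api.openverse.engineering/v1"

ParamValue = Optional[Union[str, Iterable[str]]]


def build_search_url(media_type: str, **params: ParamValue) -> str:
    """Build a search URL, skipping unset filters and joining list values."""
    cleaned = {}
    for key, value in params.items():
        if value is None:
            continue  # filter not requested
        if not isinstance(value, str):
            value = ",".join(value)  # assumed multi-value format
        cleaned[key] = value
    return f"{API_ROOT}/{media_type}/?{urlencode(cleaned)}"


# Example: a general query restricted by license, excluding one source.
url = build_search_url(
    "images",
    q="bird watch",
    license=["cc0", "by"],
    excluded_source="flickr",
    size=None,
)
```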

### General "query" searching

This is the type of searching most commonly used to query Openverse. It is
activated when the `q` query parameter is present. It searches the following
aspects of a document:

- Description
- Title
- Tags

> **Note**: The API does not currently surface the "description" field, though
> [discussion is underway](https://github.com/WordPress/openverse-catalog/issues/364)
> to potentially change this fact.

Of these, title is weighted 10000 times more heavily than the description and
tags. This makes searches that match a title very closely rise to the "top" of
the results, even if the same text is present word-for-word in a description. It
also breaks ties between documents. For example, if two documents are returned,
one because the title matches and one because a tag matches, the title-matched
document will be ranked higher and therefore appear first.

Additionally, if the query is wrapped in double quotes (`"`), we search each of
these _exactly_, which bypasses stemming and matches exact word order. Title
weighting is still applied in this case. This means that if the exact query text
is found in the title of one document and the description of another, the
title-matched document will appear ahead of the description-matched document.
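
One plausible shape for the Elasticsearch query body behind this behaviour is sketched below. It is not taken from the Openverse code: the `tags.name` field path, the `.exact` subfield suffix, the `AND` default operator, and the `bool` layout are all assumptions, and `build_general_query` is a hypothetical name. `simple_query_string`, `boost`, and `quote_field_suffix` are real Elasticsearch query features.

```python
from typing import Any, Dict

SEARCH_FIELDS = ["title", "description", "tags.name"]  # assumed field paths
TITLE_BOOST = 10000  # the title weighting described above


def build_general_query(q: str) -> Dict[str, Any]:
    """Sketch a query body: match any searched field, boost title matches."""
    base: Dict[str, Any] = {
        "query": q,
        "fields": SEARCH_FIELDS,
        "default_operator": "AND",  # assumption: all terms must match
    }
    if '"' in q:
        # Assumed unstemmed subfield so quoted phrases bypass stemming.
        base["quote_field_suffix"] = ".exact"
    phrase = q.replace('"', "")
    return {
        "bool": {
            # Every result must match the query in at least one field.
            "must": [{"simple_query_string": base}],
            # Documents whose title contains the phrase score far higher.
            "should": [
                {
                    "simple_query_string": {
                        "query": f'"{phrase}"',
                        "fields": ["title"],
                        "boost": TITLE_BOOST,
                    }
                }
            ],
        }
    }
```

Because the boosted clause sits under `should`, it never excludes documents. It only raises the score of those whose title matches, which is what produces the tie-breaking described above.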

### Individual field querying

This type of search is not commonly used. The Openverse frontend surfaces only
one of the fields that are queryable with this approach. It is only
available if the `q` query parameter is excluded. In other words, if the request
includes the `q` query parameter, _the API will ignore these options even if
they are present in the request_. In that case, it will execute the
[general "query" search](#general-query-searching) described above. Future work
may surface the possibility for combining general querying and individual field
querying.

These are the fields currently supported for individual field querying:

- Creator
- Title
- Tags

> **Note**: The "tags" filter applies to the `name` field of the tags. No other
> aspect of the tags is currently searchable.

Each of these can be stacked. You can make a request that queries for a specific
title by a specific creator. The following parameters would search only for
works by "Claude Monet" where the word "madame" (or its stemmed versions)
appears in the title: `?creator=Claude Monet&title=madame`.

As you can see, this differs from the general query searching in that it
allows you to apply separate queries for individual fields (hence the name
"individual field querying"). The other notable difference from general querying
is that the "description" field of the document is not available for individual
field querying.
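
The precedence rule and the stacking of fields can be sketched as follows. This is an illustration under assumptions, not the API's implementation: `build_field_clauses` is a hypothetical helper, and the mapping of each parameter to an index field (including `tags.name`) is assumed from the descriptions in this section.

```python
from typing import Dict, List

# Assumed mapping from request parameter to index field; "tags.name" follows
# the note above that only the tag `name` is searchable.
FIELD_PARAMS = {"creator": "creator", "title": "title", "tags": "tags.name"}


def build_field_clauses(params: Dict[str, str]) -> List[dict]:
    """Return one query clause per supplied field parameter."""
    if "q" in params:
        # A general query is present, so field parameters are ignored.
        return []
    return [
        {"simple_query_string": {"query": params[name], "fields": [field]}}
        for name, field in FIELD_PARAMS.items()
        if name in params
    ]


# Mirrors `?creator=Claude Monet&title=madame` from the example above.
clauses = build_field_clauses({"creator": "Claude Monet", "title": "madame"})
```

Each clause targets exactly one field, so the creator text is never compared against titles and vice versa. Adding `q` to the same request yields no field clauses at all.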

## Document scoring

Aside from the aforementioned weighting of document "title" matches, Openverse
also includes one other attempt at scoring documents to improve search relevancy
and quality. Future improvements to Openverse's search relevancy will most
likely involve changes to how we score documents. For example, we may score
documents higher if they are determined to be popular results in Openverse
itself.

### Provider supplied popularity

Some providers supply a "popularity" rating for individual works. We ingest this
data and calculate a normalised "popularity" score named `rank_feature`. Flickr
is one of the providers that supplies this information. This data is used so
that works that are popular on the provider side are ranked higher in Openverse
as well. The assumption here is that works that are popular on the provider's
own website are likely higher quality and therefore more desirable results.
Whether this has a significant impact on result relevancy or quality has not
been measured, in part due to loose definitions of "relevancy" and "quality" and
in part because we do not currently have tools for measuring user perception of
a result's relevancy or quality.
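
To give a feel for how a popularity value can nudge ranking without overwhelming text relevance, the sketch below uses the saturation function that Elasticsearch's `rank_feature` query offers, `S / (S + pivot)`. Whether Openverse uses this exact function, and what pivot or boost it would use, are assumptions; the names `saturation` and `combined_score` and the numbers in the example are invented for illustration.

```python
def saturation(popularity: float, pivot: float) -> float:
    """Map a non-negative popularity value into the range [0, 1)."""
    if popularity < 0 or pivot <= 0:
        raise ValueError("popularity must be >= 0 and pivot must be > 0")
    # A value equal to the pivot scores 0.5; larger values approach 1.
    return popularity / (popularity + pivot)


def combined_score(
    text_score: float, popularity: float, pivot: float, boost: float = 1.0
) -> float:
    """Add the bounded popularity contribution to the text relevance score."""
    return text_score + boost * saturation(popularity, pivot)


# Two works with the same text relevance but different provider popularity.
more_popular = combined_score(text_score=1.0, popularity=30, pivot=10)
less_popular = combined_score(text_score=1.0, popularity=5, pivot=10)
```

Because the popularity contribution is bounded, a very popular but poorly matching work cannot outrank a strong text match by popularity alone, unless the boost is set high.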
