Integrate ANN search #78473

Closed
17 tasks done
jtibshirani opened this issue Sep 29, 2021 · 12 comments
Labels: >feature, Meta, release highlight, :Search Relevance/Vectors, Team:Search Relevance

Comments

@jtibshirani
Contributor

jtibshirani commented Sep 29, 2021

Background

Currently Elasticsearch supports storing vectors through the dense_vector field type and using them when scoring documents. This allows users to perform an exact k-nearest neighbors (kNN) search by scanning all documents. This work builds on that functionality to support fast, approximate nearest neighbor search (ANN). The implementation will use Lucene's new ANN support, which is based on the HNSW algorithm. Since Lucene will ship ANN in its upcoming 9.0 release, this feature will only target Elasticsearch 8.x.
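
For readers who want a concrete picture of the exact kNN baseline described above, here is a minimal sketch of a brute-force scan. It is not how Elasticsearch implements scoring; the function names and the toy `docs` dictionary are made up for illustration.

```python
import heapq
import math


def l2_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def exact_knn(query, docs, k):
    """Return the ids of the k documents closest to `query`.

    `docs` maps a document id to its vector. Every vector is scored,
    so the cost grows linearly with the number of documents.
    """
    return heapq.nsmallest(k, docs, key=lambda doc_id: l2_distance(query, docs[doc_id]))


# Hypothetical toy data: three 2-dimensional vectors.
docs = {"a": [0.0, 0.0], "b": [1.0, 1.0], "c": [5.0, 5.0]}
print(exact_knn([0.9, 0.9], docs, k=2))  # ['b', 'a']
```

An ANN index such as HNSW avoids this full scan by visiting only a subset of candidates, trading some recall for speed.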

Our plan is to extend the dense_vector field type to support adding vectors to an ANN index. We'll then add a new REST endpoint focused on kNN search. This new endpoint will be marked 'experimental' in the first release, as we expect to make API improvements in response to feedback. At first the endpoint will only perform kNN, but we'll follow up with support for filtering, hybrid retrieval, aggregations, and more. We are really looking forward to everyone's feedback, which will help define the feature and set its direction.

Implementation Plan

Phase 0: Help prepare Lucene's HNSW implementation

Phase 1: Basic ANN support

Future Plans: Improvements to functionality and performance

@jtibshirani jtibshirani added :Search/Search Search-related issues that do not fall into other categories >feature Meta labels Sep 29, 2021
@elasticmachine elasticmachine added the Team:Search Meta label for search team label Sep 29, 2021
@elasticmachine
Collaborator

Pinging @elastic/es-search (Team:Search)

@MLnick

MLnick commented Sep 30, 2021

Very excited to see this!

Support cosine similarity instead of dot product (?)

IMO both should be supported. Similar items via cosine sim is a very common use case, as is dot product (for e.g. recommendations).

jtibshirani added a commit that referenced this issue Oct 5, 2021
This PR extends the `dense_vector` type to allow vectors to be added to an ANN index:
```
"mappings": {
  "properties": {
    "my_vector": {
      "type": "dense_vector",
      "dims": 128,
      "index": true,
      "similarity": "l2_norm"
    }
  }
}
```

A description of the parameters:
* `index`. Setting this to `true` indicates the field should be added to the ANN index. The values will be parsed as a `KnnVectorField` instead of a doc values field. By default, `index` is `false`, to provide a smooth transition from 7.x, where vectors are not indexed.
* `similarity`. When `index: true`, it's required to specify what similarity to use when indexing the vectors. Right now the accepted values are `l2_norm` and `dot_product`, which match the Lucene options. (We decided to require `similarity` to be set since there's no default choice that works in general, and it's easy to overlook and accidentally get poor results.)

Indexed vectors still support the same functionality as vectors based on doc values -- they work with vector script functions and `exists` queries.
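
The parameter rules above can be sketched as a small client-side helper. This is only an illustration under stated assumptions: `dense_vector_mapping` and `ALLOWED_SIMILARITIES` are hypothetical names, not part of Elasticsearch, and the real validation happens server-side in the field mapper.

```python
# Accepted values as listed in the description above.
ALLOWED_SIMILARITIES = ("l2_norm", "dot_product")


def dense_vector_mapping(dims, index=False, similarity=None):
    """Build a `dense_vector` field mapping as a plain dict (hypothetical helper)."""
    mapping = {"type": "dense_vector", "dims": dims}
    if not index:
        # Default: the vector is not added to the ANN index, as in 7.x.
        return mapping
    if similarity not in ALLOWED_SIMILARITIES:
        raise ValueError(
            f"`similarity` must be one of {ALLOWED_SIMILARITIES} when `index` is true"
        )
    mapping.update({"index": True, "similarity": similarity})
    return mapping


print(dense_vector_mapping(128, index=True, similarity="l2_norm"))
```

Requiring `similarity` explicitly mirrors the reasoning in the description: there is no default that works in general.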

Relates to #78473.
@jtibshirani
Contributor Author

@MLnick thanks for the feedback, I updated the plan to make sure we cover both. Dot product may need special handling as it's not a true metric (for example, it doesn't satisfy the triangle inequality). I've also seen dot product used as an optimized cosine similarity, by normalizing all vectors to unit length beforehand -- this is more straightforward to support.
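
To make the normalization point concrete, here is a small sketch showing that the dot product of unit-length vectors equals the cosine similarity of the originals. The helper names are made up for illustration and are not Elasticsearch or Lucene APIs.

```python
import math


def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    """Cosine similarity: dot product divided by both vector lengths."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


def normalize(v):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


a, b = [3.0, 4.0], [1.0, 2.0]
# After normalizing once at index time, a cheap dot product gives the cosine score.
assert math.isclose(cosine(a, b), dot(normalize(a), normalize(b)))
```

The division by the vector lengths is paid once per vector up front instead of once per comparison.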

mayya-sharipova added a commit to mayya-sharipova/elasticsearch that referenced this issue Oct 14, 2021
This PR extends the dense_vector type to allow configuring HNSW params in
`index_options`:
`m` – max number of connections for each node,
`ef_construction` – number of candidate neighbors to track while searching
the graph for each newly inserted node.

```
"mappings": {
  "properties": {
    "my_vector": {
      "type": "dense_vector",
      "dims": 128,
      "index": true,
      "similarity": "l2_norm",
      "index_options": {
        "type" : "hnsw",
        "m" : 15,
        "ef_construction" : 50
      }
    }
  }
}
```

`index_options`, as an object, and all parameters underneath it are optional.
If `m` or `ef_construction` is not provided, the default value from the
current codec will be used.

Relates to elastic#78473
mayya-sharipova added a commit that referenced this issue Oct 18, 2021
This PR extends the dense_vector type to allow configuring HNSW params in
`index_options`:
`m` – max number of connections for each node,
`ef_construction` – number of candidate neighbours to track while searching
the graph for each newly inserted node.

```
"mappings": {
  "properties": {
    "my_vector": {
      "type": "dense_vector",
      "dims": 128,
      "index": true,
      "similarity": "l2_norm",
      "index_options": {
        "type" : "hnsw",
        "m" : 15,
        "ef_construction" : 50
      }
    }
  }
}
```

`index_options` as an object is optional. If it is not provided, the default values from the
current codec will be used.
If `index_options` is provided, then all parameters related to the specific type
must be provided.
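
As a rough illustration of the all-or-nothing rule described here, the check could look like the sketch below. `validate_index_options` and `REQUIRED_PARAMS` are hypothetical names used only for this example; the actual validation lives in the Elasticsearch mapper and may differ.

```python
# Parameters each index type requires, per the commit message (only `hnsw` is mentioned).
REQUIRED_PARAMS = {"hnsw": ("m", "ef_construction")}


def validate_index_options(index_options):
    """Raise ValueError unless `index_options` is absent or fully specified."""
    if index_options is None:
        return  # Not provided: the codec defaults apply.
    index_type = index_options.get("type")
    if index_type not in REQUIRED_PARAMS:
        raise ValueError(f"unknown index_options type: {index_type!r}")
    missing = [p for p in REQUIRED_PARAMS[index_type] if p not in index_options]
    if missing:
        raise ValueError(f"index_options of type {index_type!r} is missing {missing}")


validate_index_options(None)  # OK
validate_index_options({"type": "hnsw", "m": 15, "ef_construction": 50})  # OK
```

Passing `{"type": "hnsw", "m": 15}` would raise, because `ef_construction` is missing.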

Relates to #78473
jtibshirani added a commit that referenced this issue Oct 20, 2021
The new kNN endpoint currently doesn't support searches on nested fields. This
PR updates the validation logic to detect this case and throw a clear error. It
also adds tests for kNN search when there are nested documents.

Relates to #78473.
mayya-sharipova added a commit to mayya-sharipova/elasticsearch that referenced this issue Oct 21, 2021
This PR throws an exception for kNN searches on filtered aliases.
We don't allow kNN searches on filtered aliases as currently filters are
applied only after kNN searches are done, which may lead to returning
fewer than k results.

In the future, we want to apply filters while doing a kNN search.
Once implemented, we will allow kNN searches on filtered aliases.
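
The shortfall described above is easy to reproduce with a toy sketch. This uses a brute-force search rather than HNSW, and the documents, the `color` field, and the function names are invented for illustration.

```python
def squared_l2(a, b):
    """Squared Euclidean distance (enough for ranking)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def top_k(query, docs, k):
    """The k documents nearest to `query`."""
    return sorted(docs, key=lambda d: squared_l2(query, d["vector"]))[:k]


def knn_then_filter(query, docs, k, keep):
    """Current behaviour: run kNN first, then drop non-matching hits."""
    return [d["id"] for d in top_k(query, docs, k) if keep(d)]


def filter_then_knn(query, docs, k, keep):
    """Desired behaviour: restrict to matching docs while searching."""
    return [d["id"] for d in top_k(query, [d for d in docs if keep(d)], k)]


docs = [
    {"id": 1, "vector": [0.0], "color": "red"},
    {"id": 2, "vector": [1.0], "color": "blue"},
    {"id": 3, "vector": [2.0], "color": "red"},
    {"id": 4, "vector": [3.0], "color": "blue"},
]


def is_blue(d):
    return d["color"] == "blue"


print(knn_then_filter([0.0], docs, 2, is_blue))  # [2]    -> fewer than k results
print(filter_then_knn([0.0], docs, 2, is_blue))  # [2, 4] -> k results
```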

Relates to elastic#78473
jtibshirani added a commit that referenced this issue Oct 23, 2021
This PR fixes some issues in `KnnVectorQueryBuilderTests`:
* Improve the check on the Lucene query
* Remove an unused field mapping

Relates to #78473.
jtibshirani added a commit that referenced this issue Oct 26, 2021
This PR ensures the `_knn_search` endpoint handles both FLS and DLS:
* Updates `FieldSubsetReader` to handle FLS for the vectors format
* Adds tests to check both DLS and FLS work

Relates to #78473.
jtibshirani added a commit that referenced this issue Nov 4, 2021
This commit updates the `dense_vector` docs to include information on the new
`index`, `similarity`, and `index_options` parameters. It also tries to clarify
the difference between `similarity` and `index_options` with the existing
parameters that have the same name.

Relates to #78473.
@jrodewig jrodewig self-assigned this Nov 5, 2021
jtibshirani added a commit that referenced this issue Nov 9, 2021
This commit adds docs for the new `_knn_search` endpoint.

It focuses on being an API reference and is light on details in terms of how
exactly the kNN search works, and how the endpoint contrasts with
`script_score` queries. We plan to add a high-level guide on kNN search that
will explain this in depth.

Relates to #78473.
@coreation

coreation commented Nov 23, 2021

@mayya-sharipova great to see this being put to work, I was going through the documentation WIP (https://elasticsearch_80857.docs-preview.app.elstc.co/guide/en/elasticsearch/reference/master/knn-search.html#exact-knn) and was a bit confused. There's a lot of work being done around ANN's, which is experimental, and there's "exact kNN", but correct me if I'm wrong, there's nothing new being done with regards to exact kNN's right? The function score there is already possible in 7.11 for example.

I was wondering if the effort being done will help speed up exact kNN searches. Is that something that will be improved by this issue? If I read the issue it doesn't sound like it, but I wanted to make sure I wasn't mistaken.

@mayya-sharipova
Contributor

@coreation

there's nothing new being done with regards to exact kNN's right?

You are right, this issue and the work done concern only approximate NN, and don't bring improvements to exact kNN search.

I'm wondering what your use case for exact kNN is. Would it be possible for your use case to use ANN tuned for high accuracy/recall (using a large value for num_candidates)?
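
As a sketch of what "a large value for num_candidates" could look like in practice, the helper below builds a request body for the experimental `_knn_search` endpoint. The body shape (`knn` with `field`, `query_vector`, `k`, `num_candidates`) and the `num_candidates >= k` check are assumptions based on the endpoint's draft documentation and may change; `knn_search_body` and the field name `my_vector` are hypothetical.

```python
import json


def knn_search_body(field, query_vector, k, num_candidates):
    """Build an assumed `_knn_search` request body as a plain dict."""
    if num_candidates < k:
        # Assumption: fewer candidates than k could never yield k results.
        raise ValueError("num_candidates must be at least k")
    return {
        "knn": {
            "field": field,
            "query_vector": list(query_vector),
            "k": k,
            # More candidates per shard generally means higher recall, at the cost of speed.
            "num_candidates": num_candidates,
        }
    }


body = knn_search_body("my_vector", [0.1, 0.2, 0.3], k=10, num_candidates=1000)
print(json.dumps(body, indent=2))
```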

Also, what kind of speedups are you thinking of for exact kNN search?

@coreation

@mayya-sharipova Our use case is this: given a set of vectors, find the N best-fitting other documents that also comply with a set of filters. Currently we use a query in combination with a script score function for this, where the script score function can have 1-30 cosine similarity calculations, since we don't have 1 vector to match against, but a set of vectors.

This takes quite a bit of time, which is understandable given that there are sometimes 30 cosine similarity computations per score. I think ANN with filters will help speed up this process dramatically, if I understand it correctly, because we don't need an exact score per se, just an idea of how well they match compared to the given vectors.

So your proposal of using ANN tuned for high accuracy/recall will suffice - and likely return results in a much faster manner.

@tholor

tholor commented Nov 26, 2021

+1 for the combination of ANN and filters!

From what I understand from this current draft, the combination of ANN and filtering won't be supported yet and will only be explored in the distant future?
Our use case also relies heavily on running kNN on a filtered subset of documents. As these subsets are growing into the millions, we've reached the limits of kNN and hoped to switch to ANN with the 8.0 release. However, if ANN doesn't support filtering, we will run into accuracy problems when running this on the whole index.

@jtibshirani
Contributor Author

Thanks for the feedback! I was too ambitious in listing all these extensions (like filtering) under "Phase 2". I changed the heading name to "Future Plans". We'll tackle them in their own dedicated GitHub issues.

@coreation

coreation commented Nov 30, 2021

Thanks for the update @jtibshirani, will that issue (filtering) be linked as well in the main post when available?

jrodewig added a commit that referenced this issue Nov 30, 2021
Adds a high-level guide for running an approximate or exact kNN search in Elasticsearch.

Relates to #78473.
@jrodewig jrodewig removed their assignment Dec 2, 2021
@msahamed

msahamed commented Dec 4, 2021

Thank you for the great work with ANN support. I could not agree more with @tholor's view on ANN and filtering. Filtering with ANN is among the powerful options that other databases lack. In my role as a data scientist, I feel this is a necessity every day. Hence, it would be more beneficial if ANN + filter were a higher priority.

jtibshirani added a commit that referenced this issue Dec 9, 2021
In order to perform a kNN search on a `dense_vector` field, it must have
`index: true` in its mapping. This commit clarifies the error message. Before,
the message was confusing, because the user likely didn't touch the `index`
parameter and might not even be aware of it.

It adds a note to the docs clarifying that when coming from 7.x, you must
explicitly update `index: true` and reindex the vectors.

Relates to #78473.
@jtibshirani
Contributor Author

I opened #81788 to track work on supporting ANN with filtering (also linked under "Future Plans" in the description). From your comments, it sounds like filtering would be really useful and a high priority for you.

I'm going to close out this issue, since we've merged the work required for basic ANN support. This is just a beginning -- we expect to iterate on and improve the feature through other GitHub issues.

@jtibshirani
Contributor Author

I opened a new meta issue to track our follow-up work: #84324.

@jtibshirani jtibshirani added :Search Relevance/Vectors Vector search and removed :Search/Search Search-related issues that do not fall into other categories labels Jul 21, 2022
@javanna javanna added Team:Search Relevance Meta label for the Search Relevance team in Elasticsearch and removed Team:Search Meta label for search team labels Jul 12, 2024