Integrate ANN search #78473
Comments
Pinging @elastic/es-search (Team:Search) |
Very excited to see this!
IMO both should be supported. Similar items via cosine sim is a very common use case, as is dot product (for e.g. recommendations). |
This PR extends the `dense_vector` type to allow vectors to be added to an ANN index:
```
"mappings": {
  "properties": {
    "my_vector": {
      "type": "dense_vector",
      "dims": 128,
      "index": true,
      "similarity": "l2_norm"
    }
  }
}
```
A description of the parameters:
* `index`. Setting this to `true` indicates the field should be added to the ANN index. The values will be parsed as a `KnnVectorField` instead of a doc values field. By default `index: false` to provide a smooth transition from 7.x, where vectors are not indexed.
* `similarity`. When `index: true`, it's required to specify what similarity to use when indexing the vectors. Right now the accepted values are `l2_norm` and `dot_product`, which matches the Lucene options. (We decided to require `similarity` to be set since there's no default choice that works in general, and it's easy to overlook and accidentally get poor results.)
Indexed vectors still support the same functionality as vectors based on doc values -- they work with vector script functions and `exists` queries. Relates to #78473.
@MLnick thanks for the feedback, I updated the plan to make sure we cover both. Dot product may need special handling as it's not a true metric (for example doesn't satisfy the triangle inequality). I've also seen dot product used as an optimized cosine similarity, by normalizing all vectors to unit length beforehand -- this is more straightforward to support. |
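To illustrate the point about dot product as an optimized cosine similarity, here is a minimal Python sketch. The helper names (`normalize`, `dot`, `cosine`) are hypothetical and not part of Elasticsearch or Lucene; the sketch only checks that, once both vectors are scaled to unit length, a plain dot product gives the same value as cosine similarity.

```python
import math


def normalize(v):
    """Scale v to unit length (L2 norm of 1)."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    """Cosine similarity computed directly from the raw vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


a = [3.0, 4.0]
b = [1.0, 2.0]

# After normalizing once (for example at index time), the cheaper dot
# product can stand in for cosine similarity at query time.
unit_dot = dot(normalize(a), normalize(b))
```

Normalizing ahead of time moves the two norm computations out of the per-comparison cost, which is why this variant is the more straightforward one to support.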
This PR extends the `dense_vector` type to allow configuring HNSW params in `index_options`: `m` – max number of connections for each node, `ef_construction` – number of candidate neighbors to track while searching the graph for each newly inserted node.
```
"mappings": {
  "properties": {
    "my_vector": {
      "type": "dense_vector",
      "dims": 128,
      "index": true,
      "similarity": "l2_norm",
      "index_options": {
        "type": "hnsw",
        "m": 15,
        "ef_construction": 50
      }
    }
  }
}
```
The `index_options` object and all parameters underneath it are optional. If `m` or `ef_construction` are not provided, the default values from the current codec will be used. Relates to elastic#78473
This PR extends the `dense_vector` type to allow configuring HNSW params in `index_options`: `m` – max number of connections for each node, `ef_construction` – number of candidate neighbours to track while searching the graph for each newly inserted node.
```
"mappings": {
  "properties": {
    "my_vector": {
      "type": "dense_vector",
      "dims": 128,
      "index": true,
      "similarity": "l2_norm",
      "index_options": {
        "type": "hnsw",
        "m": 15,
        "ef_construction": 50
      }
    }
  }
}
```
The `index_options` object is optional. If it is not provided, the default values from the current codec will be used. If `index_options` is provided, then all parameters related to the specific type must be provided. Relates to #78473
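The mapping rules described in these PRs can be summarized in a small Python sketch. This is not Elasticsearch's validation code: `validate_dense_vector_mapping` is a hypothetical helper, and it assumes the stricter rule from the description directly above (if `index_options` is present, both `m` and `ef_construction` must be set) together with the rule that `similarity` is required when `index: true`.

```python
ALLOWED_SIMILARITIES = {"l2_norm", "dot_product"}
HNSW_REQUIRED_PARAMS = ("m", "ef_construction")


def validate_dense_vector_mapping(field):
    """Return a list of problems found in a dense_vector field mapping.

    Hypothetical helper that mirrors the rules stated in the PR
    descriptions; an empty list means the mapping passes these checks.
    """
    errors = []

    # `index` defaults to false, matching the 7.x behaviour.
    if field.get("index", False):
        similarity = field.get("similarity")
        if similarity is None:
            errors.append("`similarity` is required when `index` is true")
        elif similarity not in ALLOWED_SIMILARITIES:
            errors.append(f"unsupported similarity: {similarity}")

    # `index_options` is optional, but must be complete when present.
    options = field.get("index_options")
    if options is not None:
        if options.get("type") != "hnsw":
            errors.append("`index_options.type` must be `hnsw`")
        missing = [p for p in HNSW_REQUIRED_PARAMS if p not in options]
        if missing:
            errors.append("missing index_options parameters: " + ", ".join(missing))

    return errors


example = {
    "type": "dense_vector",
    "dims": 128,
    "index": True,
    "similarity": "l2_norm",
    "index_options": {"type": "hnsw", "m": 15, "ef_construction": 50},
}
problems = validate_dense_vector_mapping(example)
```

Requiring the full parameter set keeps a stored mapping unambiguous even if codec defaults change later, which is one plausible reason for the stricter rule.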
The new kNN endpoint currently doesn't support searches on nested fields. This PR updates the validation logic to detect this case and throw a clear error. It also adds tests for kNN search when there are nested documents. Relates to #78473.
This PR throws an exception for kNN searches on filtered aliases. We don't allow kNN searches on filtered aliases because filters are currently applied only after the kNN search is done, which may return fewer than k results. In the future, we want to apply filters while doing a kNN search. Once that is implemented, we will allow kNN searches on filtered aliases. Relates to elastic#78473
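The problem with applying a filter after the kNN search can be shown with a toy example. The sketch below uses a brute-force search as a stand-in for the ANN step; the documents, the `color` field, and the helper names are made up for illustration.

```python
import math


def l2_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn(docs, query, k):
    """Brute-force nearest neighbors, standing in for the ANN search."""
    return sorted(docs, key=lambda d: l2_distance(d["vector"], query))[:k]


docs = [
    {"id": 1, "vector": [0.0, 0.0], "color": "red"},
    {"id": 2, "vector": [0.1, 0.0], "color": "red"},
    {"id": 3, "vector": [0.2, 0.0], "color": "blue"},
    {"id": 4, "vector": [5.0, 5.0], "color": "blue"},
    {"id": 5, "vector": [6.0, 6.0], "color": "blue"},
]
query = [0.0, 0.0]
k = 3

# Post-filtering: search first, then drop hits that fail the alias filter.
# Two of the three nearest documents are red, so only one hit survives.
post_filtered = [d["id"] for d in knn(docs, query, k) if d["color"] == "blue"]

# Filtering during the search: only matching documents are candidates,
# so the search can still return k results.
blue_docs = [d for d in docs if d["color"] == "blue"]
pre_filtered = [d["id"] for d in knn(blue_docs, query, k)]
```

Here post-filtering yields a single document even though three blue documents exist, which is the "fewer than k results" case the exception guards against.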
This PR fixes some issues in `KnnVectorQueryBuilderTests`: * Improve the check on the Lucene query * Remove an unused field mapping Relates to #78473.
This PR ensures the `_knn_search` endpoint handles both FLS and DLS: * Updates `FieldSubsetReader` to handle FLS for the vectors format * Adds tests to check both DLS and FLS work Relates to #78473.
This commit updates the `dense_vector` docs to include information on the new `index`, `similarity`, and `index_options` parameters. It also tries to clarify the difference between `similarity` and `index_options` with the existing parameters that have the same name. Relates to #78473.
This commit adds docs for the new `_knn_search` endpoint. It focuses on being an API reference and is light on details in terms of how exactly the kNN search works, and how the endpoint contrasts with `script_score` queries. We plan to add a high-level guide on kNN search that will explain this in depth. Relates to #78473.
@mayya-sharipova great to see this being put to work. I was going through the documentation WIP (https://elasticsearch_80857.docs-preview.app.elstc.co/guide/en/elasticsearch/reference/master/knn-search.html#exact-knn) and was a bit confused. There's a lot of work being done around ANN, which is experimental, and there's "exact kNN", but correct me if I'm wrong: there's nothing new being done with regard to exact kNN, right? The function score there is already possible in 7.11, for example. I was wondering whether this effort will help speed up exact kNN searches. Is that something that will be improved by this issue? If I read the issue correctly it doesn't sound like it, but I wanted to make sure I wasn't mistaken. |
You are right, this issue and the work done are concerned only with approximate NN, and don't bring improvements to exact kNN search. I'm wondering what your use case for exact kNN is. Would it be possible for your use case to use ANN tuned for high accuracy/recall (using a big number for `num_candidates`)? Also, what kind of speed-ups are you thinking of for exact kNN search? |
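For readers unfamiliar with `num_candidates`, here is a hedged Python sketch that builds a request body for the `_knn_search` endpoint. The body shape (`knn` with `field`, `query_vector`, `k`, `num_candidates`) follows the endpoint's API reference as I understand it for the first 8.x release and should be checked against the docs for your version; `build_knn_search_body` is a hypothetical helper, and the check that `num_candidates` is at least `k` is an assumption about the server-side validation.

```python
import json


def build_knn_search_body(field, query_vector, k, num_candidates):
    """Build a JSON-serializable body for a `_knn_search` request.

    A larger `num_candidates` means more candidates are considered per
    shard, trading speed for better recall.
    """
    if num_candidates < k:
        # Assumed constraint: fewer candidates than k cannot fill k hits.
        raise ValueError("num_candidates must be at least k")
    return {
        "knn": {
            "field": field,
            "query_vector": list(query_vector),
            "k": k,
            "num_candidates": num_candidates,
        }
    }


# Tuned toward recall: 10 results drawn from 1000 candidates per shard.
body = build_knn_search_body("my_vector", [0.1, 0.2, 0.3], k=10, num_candidates=1000)
payload = json.dumps(body)
```

The resulting `payload` string would be sent as the body of the search request against the index holding the `my_vector` field.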
@mayya-sharipova Our use case is this: given a set of vectors, find the (N) best-fitting other documents that also comply with a set of filters. Currently we use a query in combination with a script score function for this, where the script score function can have 1-30 cosine similarity calculations, since we don't have 1 vector to match against, but a set of vectors. This takes quite a bit of time, which is understandable given that a single score sometimes needs 30 cosine similarity computations. I think ANN with filters will help speed up this process dramatically, if I understand it correctly, because we don't need an exact score per se, just an idea of how well the documents match the given vectors. So your proposal of using ANN tuned for high accuracy/recall will suffice, and will likely return results much faster. |
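A rough Python sketch of the exact approach described in this comment shows why its cost grows with the number of query vectors. The comment does not say how the per-vector similarities are combined, so averaging them is purely an assumption here, as are the helper names and the `lang` filter field.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def exact_multi_vector_search(docs, query_vectors, keep, n):
    """Score every document passing `keep` against all query vectors.

    Each surviving document costs len(query_vectors) cosine computations,
    which is where the time goes with up to 30 query vectors.
    """
    scored = []
    for doc in docs:
        if not keep(doc):
            continue
        # Assumption: combine the per-vector similarities by averaging.
        total = sum(cosine(doc["vector"], q) for q in query_vectors)
        scored.append((total / len(query_vectors), doc["id"]))
    scored.sort(reverse=True)
    return [doc_id for _, doc_id in scored[:n]]


docs = [
    {"id": "a", "vector": [1.0, 0.0], "lang": "en"},
    {"id": "b", "vector": [0.0, 1.0], "lang": "en"},
    {"id": "c", "vector": [1.0, 0.0], "lang": "de"},
]
query_vectors = [[1.0, 0.0], [1.0, 0.1]]
top = exact_multi_vector_search(docs, query_vectors, lambda d: d["lang"] == "en", n=2)
```

An ANN search avoids scoring every filtered document, at the price of approximate rather than exact rankings.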
+1 for the combination of ANN and filters! From what I understand from the current draft, the combination of ANN and filtering won't be supported yet and will only be explored in the distant future? |
Thanks for the feedback! I was too ambitious in listing all these extensions (like filtering) under "Phase 2". I changed the heading name to "Future Plans". We'll tackle them in their own dedicated GitHub issues. |
Thanks for the update @jtibshirani , will that issue (filtering) be linked as well in the main post when available? |
Adds a high-level guide for running an approximate or exact kNN search in Elasticsearch. Relates to #78473.
Thank you for the great work with ANN support. I could not agree more regarding @tholor's view on the ANN and filter. Filtering with ANN is among the powerful options that other databases lack. In my role as a data scientist, I feel this is a necessity every day. Hence, it would be more beneficial if the ANN +filter were a higher priority. |
In order to perform a kNN search on a `dense_vector` field, it must have `index: true` in its mapping. This commit clarifies the error message. Before the message was confusing, because the user likely didn't touch the `index` parameter and might not even be aware of it. It adds a note to the docs clarifying that when coming from 7.x, you must explicitly update `index: true` and reindex the vectors. Relates to #78473.
I opened #81788 to track work on supporting ANN with filtering (also linked under "Future Plans" in the description). From your comments, it sounds like filtering would be really useful and a high priority for you. I'm going to close out this issue, since we've merged the work required for basic ANN support. This is just a beginning -- we expect to iterate on and improve the feature through other GitHub issues. |
Adds a release highlight for the kNN search API. Relates to #78473 and #79013 ### Preview https://elasticsearch_83755.docs-preview.app.elstc.co/guide/en/elasticsearch/reference/8.0/release-highlights.html#_knn_search_api
I opened a new meta issue to track our follow-up work: #84324. |
Background
Currently Elasticsearch supports storing vectors through the `dense_vector` field type and using them when scoring documents. This allows users to perform an exact k-nearest neighbors (kNN) search by scanning all documents. This work builds on that functionality to support fast, approximate nearest neighbor search (ANN). The implementation will use Lucene's new ANN support, which is based on the HNSW algorithm. Since Lucene will ship ANN in its upcoming 9.0 release, this feature will only target Elasticsearch 8.x.
Our plan is to extend the `dense_vector` field type to support adding vectors to an ANN index. We'll then add a new REST endpoint focused on kNN search. This new endpoint will be marked 'experimental' in the first release, as we expect to make API improvements in response to feedback. At first the endpoint will only perform kNN, but we'll follow-up with support for filtering, hybrid retrieval, aggregations, and more. We are really looking forward to everyone's feedback, which will help define the feature and set its direction.
Implementation Plan
Phase 0: Help prepare Lucene's HNSW implementation
Phase 1: Basic ANN support
`dense_vector` field type to support ANN indexing
`dense_vector` to support indexing vectors #78491
`dense_vector` docs with kNN indexing options #80306
Future Plans: Improvements to functionality and performance