[RFC] Support Radius Search in k-NN #1483
Comments
Hey @junqiu-lei, this looks cool! A couple of questions: For faiss, when a distance is passed in for l2, what if the implementation uses the l2^2 ordering? Also, I imagine we will probably want to support both distance and score. Finally, OpenSearch already has a "min_score" parameter that seems to do just this: https://opensearch.org/docs/latest/api-reference/search/#url-parameters. Can we take that as a parameter and build around it?
Hi @jmazanec15, I much appreciate the feedback!
In FAISS, when L2 distance is used, the internal implementation relies on squared L2 distance (L2^2) for efficiency, as it avoids the square root operation. This doesn't affect the relative ordering of search results, since the nearest neighbors remain the same whether using L2 or L2^2.
Certainly, incorporating both distance and score is feasible, thanks to the ability to convert between these metrics. However, it would be helpful to understand the extent to which distance thresholds are preferred, considering scores often provide a more intuitive indication of relevance.
Good point calling out "min_score". It does serve as an effective filter with top-k queries. However, I would suggest integrating the radius query parameter in the same way as the k parameter, since "radius" and "k" represent distinct search methodologies. This keeps parameter handling clear and consistent.
But for the distance parameter, would this mean the user has to pass in l2^2 or l2?
I guess as a user who is familiar with the embedding space being generated, it may be more natural for me to reason about the distance than have to convert the distance to a score. Adding both does not need to be p0, but we should easily be able to extend it.
Good point. That makes sense to me.
In Faiss, the user needs to provide the distance parameter as l2^2.
Yes, I agree with that.
So in the interface, if we were to introduce both, what do you think it would look like?
In the interface, if we support both distance and score types, the names need to be clear and easy to understand. Maybe something like "radius_score" and "radius_distance", or "min_score" and "min_distance".
Yeah that makes sense. One minor thing: I would go with "min_score" and "radial_distance".
I'd also be interested in more details around the benchmarking plan.
Certainly, I'll provide more details on the benchmarking plan later.
I don't know if this will work. I think we need to add more detail on how we're limiting memory consumption for a very low threshold.
Closing this issue as this feature is going to be released in 2.14.
Overview
Following from #814, this document details the proposed enhancement of the OpenSearch k-NN plugin with a radius search feature, leveraging advancements in the Lucene and FAISS libraries. The enhancement broadens the plugin's capabilities beyond traditional k-nearest neighbors (k-NN) searches by introducing radius searches. A radius search lets users identify all points in a vector space that fall within a specified distance (or score) threshold of a query point, rather than a fixed number of nearest neighbors.
Libraries Background
FAISS
FAISS is renowned for efficient similarity search and clustering of dense vectors, and it supports radius searches natively. Its range_search API accepts a distance parameter and returns the points within that radius of the query, using either Inverted File (IVF) or Hierarchical Navigable Small World (HNSW) indexes.
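To make the radius semantics concrete, here is a minimal brute-force sketch in plain Python. It is not the FAISS implementation and does not call FAISS; the function names are hypothetical. It assumes the caller supplies a plain L2 radius, which is squared once so it can be compared against squared L2 distances, and it assumes a strict "less than" boundary.

```python
from typing import List, Sequence, Tuple


def squared_l2(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance (no square root taken)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def range_search_bruteforce(
    query: Sequence[float],
    vectors: List[Sequence[float]],
    radius_l2: float,
) -> List[Tuple[int, float]]:
    """Return (index, squared distance) for every vector inside the radius.

    radius_l2 is a plain L2 radius (an assumption of this sketch); it is
    squared here so the comparison happens entirely in squared-L2 space.
    Boundary handling (strict <) is also an assumption.
    """
    threshold = radius_l2 ** 2
    hits = []
    for i, v in enumerate(vectors):
        d = squared_l2(query, v)
        if d < threshold:
            hits.append((i, d))
    # Closest first, mirroring how nearest-neighbor results are usually ordered.
    return sorted(hits, key=lambda h: h[1])
```

Unlike a top-k search, the number of hits is not known in advance: widening the radius can return anything from zero vectors to the whole dataset.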
Lucene
Recent Lucene updates have introduced similarity-based vector search. Lucene's approach finds all vectors that score above a given similarity threshold by navigating the HNSW graph. Traversal continues until no better-scoring nodes are found or the best candidate's score falls below a specified traversal similarity at the lowest graph level.
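The traversal described above can be illustrated with a toy, single-layer graph search. This is a simplified sketch, not Lucene's code: the graph, the precomputed scores, and the names similarity_search, traversal_similarity and result_similarity are all assumptions made for illustration.

```python
import heapq
from typing import Dict, List, Tuple


def similarity_search(
    graph: Dict[int, List[int]],
    scores: Dict[int, float],
    entry: int,
    traversal_similarity: float,
    result_similarity: float,
) -> List[Tuple[int, float]]:
    """Collect nodes scoring >= result_similarity, best-first.

    scores maps each node to its (precomputed) similarity to the query.
    Exploration stops once the best unexplored candidate scores below
    traversal_similarity.
    """
    visited = {entry}
    candidates = [(-scores[entry], entry)]  # max-heap via negated scores
    results = []
    while candidates:
        neg_score, node = heapq.heappop(candidates)
        score = -neg_score
        if score < traversal_similarity:
            break  # nothing left that is worth expanding
        if score >= result_similarity:
            results.append((node, score))
        for neighbor in graph[node]:
            if neighbor not in visited:
                visited.add(neighbor)
                heapq.heappush(candidates, (-scores[neighbor], neighbor))
    return sorted(results, key=lambda r: -r[1])
```

In this sketch, setting traversal_similarity below result_similarity lets the search pass through mediocre nodes to reach good ones behind them; setting the two equal makes the search cheaper but can miss such nodes.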
Scope
Requirements
Radius Parameter
The radius search in k-NN will require a new parameter to define the maximum range of the query. Here are the options:
Option 1: Unified Parameter Across Engines
Pros
Cons
Sub-Option 1.1: Using "Score" as a Unified Parameter
Pros
Cons
Example Query and Result:
Sub-Option 1.2: Using "Distance" as a Unified Parameter (Proposing)
Pros
Cons
Example Query:
Option 2: Introduce Different Parameters Based on Engine
To accommodate the specific characteristics and capabilities of each engine (FAISS and Lucene), we can introduce new parameters that are tailored to each engine. This approach gives users the flexibility to leverage the unique features and optimizations of the underlying engine. The engine-specific parameters are described below:
FAISS-Specific Parameters
radius: This parameter is used with FAISS to specify the radius within which points are considered similar to the query vector. It directly translates to a distance in the vector space.
Lucene-Specific Parameters
similarity_threshold: This parameter is used with Lucene to specify a score threshold. Documents with a similarity score above this threshold are considered similar to the query vector. The score is typically derived from a similarity function or a distance metric.
Note: The following contents are based on option 1.1, using "Score" as a unified parameter.
High Level Query Workflow
Native Engine (Faiss) Radius Query
Lucene Engine Radius Query
Low-Level Implementation
Validate radius parameter from XContent
The radius parameter will be passed in from the XContentParser object. The radius value should be received like:
Create new KNNQueryBuilder constructor with validation
FAISS Integration
Translate radius to distance
We can derive the score → rawScore translation by inverting the existing rawScore → score translation for each space type in index/SpaceType.java.
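As a worked example of inverting the translation, the sketch below handles the L2 space type only. It assumes the L2 mapping is score = 1 / (1 + rawScore), with rawScore being the squared L2 distance; the function names are hypothetical and the other space types would each need their own inverse.

```python
def l2_raw_to_score(raw_score: float) -> float:
    """Assumed L2 mapping: squared distance -> score in (0, 1]."""
    if raw_score < 0:
        raise ValueError("squared L2 distance cannot be negative")
    return 1.0 / (1.0 + raw_score)


def l2_score_to_raw(score: float) -> float:
    """Inverse of the assumed mapping: score -> squared L2 distance."""
    if not 0.0 < score <= 1.0:
        raise ValueError("L2 score must be in (0, 1]")
    return 1.0 / score - 1.0
```

Under this assumption, a minimum score of 0.5 corresponds to a squared L2 distance of 1.0, and a higher minimum score always maps to a smaller radius.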
JNI Service
Add new method in JNI Service
Add new method in JNI FaissService wrapper
Lucene Integration
Lucene integration will be available with the release of Lucene 9.10, necessitating an update in OpenSearch and k-NN to incorporate this new feature.
Add new VectorSimilarityQuery in KNNQueryFactory
Others
Benchmarks
Benchmarks will be published once the core implementation is complete.
Limit the number of results returned
Incorporating radius search capabilities into OpenSearch brings with it the challenge of managing large result sets, especially when querying extensive datasets in some corner cases. The nature of radius search, designed to return all results within a defined threshold, poses a risk of generating a voluminous number of hits. This scenario could potentially impact cluster performance by demanding substantial computational and memory resources.
OpenSearch employs a safeguard to mitigate such risks: the index.max_result_window setting. This setting caps the maximum number of hits that can be returned in a single query response, with a default limit of 10,000 hits.
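The effect of such a cap on a radius search can be sketched as follows. This only illustrates the idea of bounding the result set; it does not show how OpenSearch or the k-NN plugin enforces the setting, and cap_radius_hits is a hypothetical helper.

```python
import heapq
from typing import List, Tuple

# Default stated above for index.max_result_window.
DEFAULT_MAX_RESULT_WINDOW = 10_000


def cap_radius_hits(
    hits: List[Tuple[str, float]],
    max_result_window: int = DEFAULT_MAX_RESULT_WINDOW,
) -> List[Tuple[str, float]]:
    """Keep at most max_result_window (doc_id, score) hits, best score first."""
    if max_result_window <= 0:
        raise ValueError("max_result_window must be positive")
    # nlargest avoids fully sorting a very large hit list.
    return heapq.nlargest(max_result_window, hits, key=lambda h: h[1])
```

Note that a cap applied after collection bounds the response size but not the memory used while collecting hits, which is the concern raised in the comments about very low thresholds.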
API Usage Stats
Along with the existing k-NN API usage stats, we can add a new metric to track the usage of radius search.