
[RFC] Support Radius Search in k-NN #1483

Closed
junqiu-lei opened this issue Feb 14, 2024 · 12 comments
Assignees
Labels
Features (Introduces a new unit of functionality that satisfies a requirement), k-NN, RFC (Request for comments)

Comments

@junqiu-lei
Member

junqiu-lei commented Feb 14, 2024

Overview

Following from #814, this document details the proposed enhancement of the OpenSearch k-NN plugin with a radius search feature, leveraging recent advancements in the Lucene and FAISS libraries. The enhancement broadens the plugin's capabilities beyond traditional k-nearest neighbors (k-NN) search by introducing radius search, which identifies all points in a vector space that fall within a specified distance (score) threshold of a query point, offering more flexibility and utility in search operations.

Libraries Background

FAISS

FAISS is renowned for its efficient similarity search and clustering of dense vectors, and it excels at radius search. Its range_search API accepts a distance parameter, enabling efficient identification of points within a specified radius using both Inverted File (IVF) and Hierarchical Navigable Small World (HNSW) indexes.

  • Feature Availability in k-NN
    • IVF: Supported
    • HNSW: Supported, following the recent integration of the k-NN Faiss update.
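To illustrate the semantics that range_search provides (a pure-Python sketch of the behavior, not FAISS code; `points`, `query`, and `radius` are hypothetical):

```python
def radius_search(points, query, radius):
    """Return (index, squared L2 distance) pairs for every point whose
    squared L2 distance to the query is within the given radius."""
    hits = []
    for i, p in enumerate(points):
        dist = sum((a - b) ** 2 for a, b in zip(p, query))
        if dist <= radius:
            hits.append((i, dist))
    # FAISS returns range-search hits unordered; sort here for readability
    return sorted(hits, key=lambda h: h[1])

points = [[4.5, 7.9], [5.6, 8.5], [0.0, 0.0]]
hits = radius_search(points, query=[4.0, 8.0], radius=1.0)
# only the first point lies within squared distance 1.0 of the query
```

Unlike top-k search, the result count is data-dependent: any number of points (or none) may fall inside the radius.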

Lucene

Recent Lucene updates have introduced the capability for similarity-based vector search. Lucene's approach finds all vectors that score above a given similarity threshold by navigating the HNSW graph, continuing until no better-scoring nodes are found or the best candidate's score falls below the specified traversal similarity at the lowest graph level.

  • Feature Availability in k-NN
    • HNSW: It will be available with the release of Lucene 9.10, necessitating an update in OpenSearch to incorporate this new feature.

Scope

  1. Design and implement an API in OpenSearch k-NN that supports radius search using both FAISS and Lucene.
  2. Ensure the feature is intuitive, efficient, and provides consistent results across both libraries.

Requirements

  1. Parameter Handling: The API should correctly interpret and handle parameters related to radius search, such as radius distance and similarity thresholds.
  2. Performance: The implementation should not significantly degrade query performance.
  3. Indexing and Output Consistency: The introduction of the radius search feature should preserve the existing indexing process and query output format.

Radius Parameter

Radius search in k-NN will require a new parameter to define the maximum range of a query. Here are the options:

Option 1: Unified Parameter Across Engines

Pros

  • Simplified User Interface: Introduces a single parameter across both engines (Faiss and Lucene), streamlining the API and enhancing user experience by reducing the learning curve.
  • Consistency Across Engines: Offers a consistent user experience, regardless of the underlying engine, promoting ease of use and implementation.

Cons

  • Mapping Complexity: Ensures accurate mapping from the unified parameter to engine-specific parameters, which may introduce complexity at the k-NN plugin level.
  • Potential Feature Limitation: Risks limiting access to engine-specific features or optimizations that might be available through native parameters, potentially underutilizing the engines' capabilities.

Sub-Option 1.1: Using "Score" as a Unified Parameter

Pros

  • Intuitiveness: In many search contexts, especially those familiar with search engines, scores are a more intuitive measure of relevance. Users may find it easier to understand a higher score as better.
  • Consistency with Lucene: Lucene inherently uses scoring for ranking documents. Using score as the unified parameter maintains consistency across Lucene-based and vector search functionalities, simplifying the understanding and application of search criteria.
  • Flexibility in Definition: Scores can be flexibly defined to incorporate a variety of factors, including similarity, document popularity, or custom metrics, providing a versatile tool for fine-tuning search results.

Cons

  • Ambiguity in Meaning: Without a clear, standardized definition, scores can be ambiguous. Different systems or algorithms might calculate scores in vastly different ways, potentially confusing users or complicating the interpretation of results.
  • Normalization and Comparison Issues: Scores often require normalization to be comparable across different datasets or query contexts. This can introduce complexity in ensuring fair and consistent ranking, especially when integrating results from multiple sources or search modes.

Example Query and Result:

GET /target-index-faiss/_search
{
  "query": {
    "knn": {
      "my_vector1": {
        "vector": [4, 8],
        "score": 0.6
      }
    }
  }
}
{
    "took": 10,
    "timed_out": false,
    "_shards": {
        "total": 3,
        "successful": 3,
        "skipped": 0,
        "failed": 0
    },
    "hits": {
        "total": {
            "value": 2,
            "relation": "eq"
        },
        "max_score": 0.7936508,
        "hits": [
            {
                "_index": "target-index-faiss",
                "_id": "12",
                "_score": 0.7936508,
                "_source": {
                    "my_vector1": [
                        4.5,
                        7.9
                    ],
                    "price": 7.5
                }
            },
            {
                "_index": "target-index-faiss",
                "_id": "8",
                "_score": 0.62111807,
                "_source": {
                    "my_vector1": [
                        5.6,
                        8.5
                    ],
                    "price": 18.7
                }
            }
        ]
    }
}
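For reference, the top hit's _score above is consistent with the plugin's scoring for the l2 space type, 1 / (1 + squared L2 distance); a quick sketch (the helper name is illustrative, not the plugin's API):

```python
def l2_score(query, point):
    # OpenSearch k-NN score for the l2 space type: 1 / (1 + squared L2 distance)
    dist_sq = sum((a - b) ** 2 for a, b in zip(query, point))
    return 1 / (1 + dist_sq)

# vector [4.5, 7.9] vs. query [4, 8]: squared distance 0.26
score = l2_score([4, 8], [4.5, 7.9])  # ≈ 0.7936508, matching the first hit
```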

Sub-Option 1.2: Using "Distance" as a Unified Parameter (Proposed)

Pros

  • Clarity and Precision: Distances offer a clear and precise measure of similarity, especially in vector search spaces. The concept of distance (e.g., Euclidean, cosine) is mathematically defined and universally understood in the context of similarity searches.
  • Direct Relevance to Vector Search: For vector searches, such as those performed by Faiss, distance directly corresponds to the query's intent—finding the nearest neighbors. It provides a straightforward, objective criterion for evaluating query results.
  • Easier to Standardize: Since distance metrics are well-defined mathematically, they can be more easily standardized across different systems, facilitating comparison and integration of search results from diverse sources.

Cons

  • Less Intuitive for Non-Technical Users: The concept of distance, while precise, may be less intuitive for users unfamiliar with vector space models or the specific metrics used (e.g., understanding the difference between Euclidean and cosine distances).
  • Potential Misalignment with User Expectations: In some contexts, the smallest distance does not always equate to the most "relevant" result, especially if other factors (like document freshness or popularity) are important. Users expecting a simple "higher is better" metric may find distances counterintuitive.
  • Complexity in Mixing Search Types: When integrating vector searches with traditional text searches, translating distances into a comparable metric that aligns with text search scores can introduce complexity, potentially requiring additional transformations or heuristics.

Example Query:

GET /target-index/_search
{
  "query": {
    "knn": {
      "target-field": {
        "vector": [1,2,3,4],
        "distance":5.0
      }
    }
  }
}

Option 2: Introduce Different Parameters Based on Engine

To accommodate the specific characteristics and capabilities of each engine (FAISS and Lucene), we can introduce parameters tailored to each engine. This approach gives users the flexibility to leverage the unique features and optimizations of the underlying engine. The engine-specific parameters are:

FAISS-Specific Parameters

radius: This parameter is used with FAISS to specify the radius within which points are considered similar to the query vector. It directly translates to a distance in the vector space.

Lucene-Specific Parameters

similarity_threshold: This parameter is used with Lucene to specify a score threshold. Documents with a similarity score above this threshold are considered similar to the query vector. The score is typically derived from a similarity function or a distance metric.
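For illustration, example queries under this option might look like the following (index names, field names, and values are hypothetical placeholders):

```
GET /faiss-index/_search
{
  "query": {
    "knn": {
      "my_vector": {
        "vector": [1.0, 2.0],
        "radius": 5.0
      }
    }
  }
}

GET /lucene-index/_search
{
  "query": {
    "knn": {
      "my_vector": {
        "vector": [1.0, 2.0],
        "similarity_threshold": 0.6
      }
    }
  }
}
```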

Note: The following contents are based on option 1.1, using "Score" as a unified parameter.

High Level Query Workflow

Native Engine (Faiss) Radius Query

[Diagram: radius-search-11.drawio]

Lucene Engine Radius Query

[Diagram: radius-search-10.drawio]

Low-Level Implementation

Validate the radius parameter from XContent

The radius parameter is parsed from the XContentParser object; the value is read as follows:

radius = (float) NumberFieldMapper.NumberType.FLOAT.parse(parser.objectBytes(), false);

Create new KNNQueryBuilder constructor with validation

public KNNQueryBuilder(String fieldName, float[] vector, float radius, QueryBuilder filter) {
    if (Strings.isNullOrEmpty(fieldName)) {
        throw new IllegalArgumentException("[" + NAME + "] requires fieldName");
    }
    if (vector == null) {
        throw new IllegalArgumentException("[" + NAME + "] requires query vector");
    }
    if (vector.length == 0) {
        throw new IllegalArgumentException("[" + NAME + "] query vector is empty");
    }
    if (radius > RADIUS_MAX || radius < RADIUS_MIN) {
        throw new IllegalArgumentException("[" + NAME + "] requires radius to be between " + RADIUS_MIN + " and " + RADIUS_MAX);
    }

    this.fieldName = fieldName;
    this.vector = vector;
    this.filter = filter;
    this.ignoreUnmapped = false;
    this.radius = radius;
}

FAISS Integration

Translate radius to distance

We can add a score → rawScore translation based on the existing rawScore → score translation for the different space types in index/SpaceType.java:

public enum SpaceType {
    L2("l2") {
        @Override
        public float rawScoreTranslation(float score) {
            return 1 / score - 1;
        }
    },
    COSINESIMIL("cosinesimil") {
        @Override
        public float rawScoreTranslation(float score) {
            return 1 / score - 1;
        }
    },
    L1("l1") {
        @Override
        public float rawScoreTranslation(float score) {
            return 1 / score - 1;
        }
    },
    LINF("linf") {
        @Override
        public float rawScoreTranslation(float score) {
            return 1 / score - 1;
        }
    },
    INNER_PRODUCT("innerproduct") {
        @Override
        public float rawScoreTranslation(float score) {
            return -score + 1;
        }
    },
    HAMMING_BIT("hammingbit") {
        @Override
        public float rawScoreTranslation(float score) {
            return 1 / score - 1;
        }
    };

    public abstract float rawScoreTranslation(float score);

    // ... existing fields and methods omitted
}
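As a worked example of this translation (a sketch; the helper names are illustrative, not the plugin's API), a query score threshold of 0.6 under L2 maps to a raw squared-distance threshold of 1/0.6 - 1 ≈ 0.667, and the existing raw-to-score direction inverts it exactly:

```python
def l2_score_to_raw(score):
    # score -> raw squared L2 distance, mirroring rawScoreTranslation for L2 above
    return 1 / score - 1

def l2_raw_to_score(raw):
    # existing raw distance -> score direction for the l2 space type
    return 1 / (1 + raw)

raw_threshold = l2_score_to_raw(0.6)  # ≈ 0.6667
```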

JNI Service

Add new method in JNI Service

public static KNNQueryResult[] radiusQueryIndex(long indexPointer, float[] queryVector, float radius, String engineName) {
    if (KNNEngine.FAISS.getName().equals(engineName)) {
        return FaissService.radiusQueryIndex(indexPointer, queryVector, radius);
    }
    throw new IllegalArgumentException("RadiusQueryIndex not supported for provided engine");
}

Add new method in JNI FaissService wrapper

jobjectArray knn_jni::faiss_wrapper::RangeSearch(knn_jni::JNIUtilInterface *jniUtil, JNIEnv *env, jlong indexPointerJ,
                                                 jfloatArray queryVectorJ, jfloat radiusJ) {
    if (queryVectorJ == nullptr) {
        throw std::runtime_error("Query Vector cannot be null");
    }

    auto *index = reinterpret_cast<faiss::IndexIDMap *>(indexPointerJ);

    if (index == nullptr) {
        throw std::runtime_error("Invalid pointer to index");
    }

    float *rawQueryvector = jniUtil->GetFloatArrayElements(env, queryVectorJ, nullptr);

    faiss::RangeSearchResult res(1);
    index->range_search(1, rawQueryvector, radiusJ, &res);
    // release the JVM copy of the query vector to avoid leaking it
    jniUtil->ReleaseFloatArrayElements(env, queryVectorJ, rawQueryvector, JNI_ABORT);

    int resultSize = res.lims[1];

    jclass resultClass = jniUtil->FindClass(env,"org/opensearch/knn/index/query/KNNQueryResult");
    jmethodID allArgs = jniUtil->FindMethod(env, "org/opensearch/knn/index/query/KNNQueryResult", "<init>");

    jobjectArray results = jniUtil->NewObjectArray(env, resultSize, resultClass, nullptr);

    jobject result;
    for(int i = 0; i < resultSize; ++i) {
        result = jniUtil->NewObject(env, resultClass, allArgs, res.labels[i], res.distances[i]);
        jniUtil->SetObjectArrayElement(env, results, i, result);
    }

    return results;
}

Lucene Integration

Lucene integration will be available with the release of Lucene 9.10, necessitating an update in OpenSearch and k-NN to incorporate this new feature.

Add new VectorSimilarityQuery in KNNQueryFactory

public static Query create(CreateQueryRequest createQueryRequest) {
    ......
    if (VectorDataType.BYTE == vectorDataType) {
        return new ByteVectorSimilarityQuery(fieldName, byteVector, radius);
    } else {
        return new FloatVectorSimilarityQuery(fieldName, floatVector, radius);
    }
}

Others

Benchmarks

Benchmarks will be published once the core implementation is completed.

Limit the number of results returned

Incorporating radius search capabilities into OpenSearch brings with it the challenge of managing large result sets, especially when querying extensive datasets in some corner cases. The nature of radius search, designed to return all results within a defined threshold, poses a risk of generating a voluminous number of hits. This scenario could potentially impact cluster performance by demanding substantial computational and memory resources.

OpenSearch employs a safeguard to mitigate such risks: the index.max_result_window setting. This setting caps the maximum number of hits that can be returned in a single query response, with a default limit of 10,000 hits.
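If a use case genuinely needs a larger cap, the setting can be adjusted per index. A hypothetical example (the value is illustrative; raising it increases memory pressure):

```
PUT /target-index/_settings
{
  "index": {
    "max_result_window": 20000
  }
}
```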

API Usage Stats

Along with the existing k-NN API usage stats, we can add a new metric to track the usage of radius search.

  • radius_query_requests: The number of requests made to the k-NN plugin for radius searches.
  • radius_query_errors: The number of errors encountered while processing radius search requests.
@junqiu-lei junqiu-lei added the Features and k-NN labels Feb 14, 2024
@junqiu-lei junqiu-lei self-assigned this Feb 14, 2024
@junqiu-lei junqiu-lei added RFC Request for comments and removed untriaged labels Feb 14, 2024
@jmazanec15
Member

Hey @junqiu-lei this looks cool! A couple questions: For faiss, when distance is passed in for l2, what if the implementation uses the l2^2 ordering? Also, I imagine we will probably want to support both distance and score.

Also, in Opensearch, there exists a "min_score" field that seems to do just this: https://opensearch.org/docs/latest/api-reference/search/#url-parameters. Can we take that as a parameter and build around it?

@junqiu-lei
Member Author

Hi @jmazanec15, much appreciated the feedback!

For faiss, when distance is passed in for l2, what if the implementation uses the l2^2 ordering?

In FAISS, when L2 distance is used, the internal implementation relies on squared L2 distance (L2^2) for efficiency, as it avoids the square root operation. This doesn't affect the relative ordering of search results, since the nearest neighbors remain the same whether using L2 or L2^2.
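The ordering claim is easy to check: squaring is monotonic for non-negative distances, so ranking by L2 and by L2^2 is identical (a quick sketch with made-up distances):

```python
import math

l2_squared = [0.26, 2.81, 80.0]          # hypothetical squared L2 distances
l2 = [math.sqrt(d) for d in l2_squared]  # the corresponding true L2 distances

rank_sq = sorted(range(3), key=lambda i: l2_squared[i])
rank_l2 = sorted(range(3), key=lambda i: l2[i])
assert rank_sq == rank_l2  # same neighbor ordering either way
```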

Also, I imagine we will probably want to support both distance and score.

Certainly, incorporating both distance and score is feasible, thanks to the ability to convert between these metrics. However, it would be helpful to understand the extent to which distance thresholds are preferred, considering scores often provide a more intuitive indication of relevance.

Also, in Opensearch, there exists a "min_score" field that seems to do just this: https://opensearch.org/docs/latest/api-reference/search/#url-parameters. Can we take that as a parameter and build around it?

Good point calling out "min_score"; it indeed serves as an effective filter for top-k queries. However, integrating the radius query parameter in a similar manner to the k parameter is advisable, as "radius" and "k" represent distinct search methodologies. This alignment ensures clarity and consistency in parameter handling.

@jmazanec15
Member

In FAISS, when L2 distance is used, the internal implementation relies on squared L2 distance (L2^2) for efficiency, as it avoids the square root operation. This doesn't affect the relative ordering of search results, since the nearest neighbors remain the same whether using L2 or L2^2.

But for distance parameter, would this mean user has to pass in l2^2 or l2?

Certainly, incorporating both distance and score is feasible, thanks to the ability to convert between these metrics. However, it would be helpful to understand the extent to which distance thresholds are preferred, considering scores often provide a more intuitive indication of relevance.

I guess as a user who is familiar with the embedding space being generated, it may be more natural for me to reason about the distance than have to convert the distance to a score. Adding both does not need to be p0, but we should easily be able to extend it.

Good point to call out "min_score", It indeed serves as an effective filter with top-k queries. However integrating the radius query parameter in a similar manner to the k parameter is advisable, as "radius" and "k" represent distinct search methodologies. This alignment ensures clarity and consistency in parameter handling.

Good point. That makes sense to me.

@junqiu-lei
Member Author

But for distance parameter, would this mean user has to pass in l2^2 or l2?

In Faiss, users need to provide the distance parameter as l2^2.

I guess as a user who is familiar with the embedding space being generated, it may be more natural for me to reason about the distance than have to convert the distance to a score. Adding both does not need to be p0, but we should easily be able to extend it.

Yes, I agree.

@jmazanec15
Member

So in the interface, if we were to introduce both, what do you think it would look like?

@junqiu-lei
Member Author

So in the interface, if we were to introduce both, what do you think it would look like?

In the interface, if we support both distance and score types, the names need to be clearly understandable; we could introduce something like "radius_score" and "radius_distance", or "min_score" and "min_distance".

@jmazanec15
Member

jmazanec15 commented Feb 20, 2024

Yeah that makes sense. One minor thing: I would go with "min_score" and "radial_distance"

@jmazanec15
Member

I'd also be interested in more details around the benchmarking plan.

@junqiu-lei
Member Author

I'd also be interested in more details around the benchmarking plan.

Certainly, I'll provide more details on the benchmarking plan later.

@jmazanec15
Member

Incorporating radius search capabilities into OpenSearch brings with it the challenge of managing large result sets, especially when querying extensive datasets in some corner cases. The nature of radius search, designed to return all results within a defined threshold, poses a risk of generating a voluminous number of hits. This scenario could potentially impact cluster performance by demanding substantial computational and memory resources.

OpenSearch employs a safeguard to mitigate such risks: the index.max_result_window setting. This setting caps the maximum number of hits that can be returned in a single query response, with a default limit of 10,000 hits.

I don't know if this will work. I think we need to add more detail on how we're limiting memory consumption for a very low threshold.

@junqiu-lei
Member Author

From my experiment on the Faiss engine with a 512 MB JVM heap, the default 10k response size limit doesn't cause an OOM issue.

| Engine | JVM heap size | Data size | Dimension | L2 square | Hit count | Size | Memory used (MB) |
|--------|---------------|-----------|-----------|-----------|-----------|---------|------------------|
| Lucene | 512MB | 0.5M | 128 | 700000 | 499986 | 1,000 | 31 |
| Lucene | 512MB | 0.5M | 128 | 700000 | 499986 | 10,000 | 108 |
| Lucene | 512MB | 0.5M | 128 | 700000 | 499986 | 50,000 | 115 |
| Lucene | 512MB | 0.5M | 128 | 700000 | 499986 | 100,000 | 154 |
| Lucene | 512MB | 0.5M | 128 | 700000 | 499986 | 200,000 | 280 |
| Lucene | 512MB | 0.5M | 128 | 700000 | 499986 | 300,000 | OOM |

With size set to 300,000, the OOM was mainly caused by the large size value in the request settings.


Test steps:

  • JVM size: export OPENSEARCH_JAVA_OPTS="-Xms512m -Xmx512m"

  • Index configuration

{
    "target_index_name": "target_index",
    "target_field_name": "target_field",
    "target_index_body": "indices/lucene-index.json",
    "target_index_primary_shards": 1,
    "target_index_dimension": 128,
    "target_index_space_type": "l2",

    "target_index_bulk_size": 100,
    "target_index_bulk_index_data_set_format": "hdf5",
    "target_index_bulk_index_data_set_path": "/tmp/sift-128-euclidean.hdf5",
    "target_index_bulk_indexing_clients": 10,

    "target_index_max_num_segments": 1,
    "target_index_force_merge_timeout": 600,
    "hnsw_ef_search": 100,
    "hnsw_ef_construction": 100,
    "target_index_num_vectors": 500000
}
  • Query with a low-threshold L2 squared distance that can match all docs
GET /target_index/_search?scroll=1m HTTP/1.1
Host: localhost:9200
Content-Type: application/json

{    
  "size": <changed in tests>,
  "query": {
    "knn": {
      "target_field": {
        "vector": [6.0,62.0,24.0,0.0,0.0,1.0,2.0,0.0,7.0,64.0,123.0,21.0,0.0,0.0,0.0,1.0,3.0,2.0,93.0,94.0,0.0,3.0,8.0,4.0,22.0,1.0,6.0,18.0,0.0,1.0,3.0,18.0,123.0,60.0,5.0,0.0,0.0,0.0,0.0,2.0,123.0,84.0,86.0,23.0,0.0,1.0,0.0,6.0,8.0,8.0,85.0,123.0,71.0,19.0,0.0,2.0,2.0,0.0,6.0,96.0,71.0,10.0,1.0,5.0,123.0,2.0,0.0,0.0,0.0,0.0,0.0,42.0,123.0,1.0,0.0,0.0,0.0,23.0,54.0,102.0,7.0,0.0,0.0,10.0,70.0,123.0,57.0,10.0,0.0,0.0,0.0,6.0,79.0,96.0,1.0,0.0,21.0,36.0,83.0,4.0,0.0,0.0,0.0,26.0,7.0,7.0,66.0,63.0,3.0,26.0,54.0,47.0,0.0,0.0,2.0,33.0,10.0,105.0,50.0,1.0,0.0,0.0,0.0,2.0,5.0,25.0,1.0,0.0],
        "distance": 700000 // this distance matches all docs
      }
    }
  }
}
  • Response example:
{
    "_scroll_id": "FGluY2x1ZGVfY29udGV4dF91dWlkDXF1ZXJ5QW5kRmV0Y2gBFndaNE5BczRqUVRHTUxnLXMyTlItWVEAAAAAAAAAfRZYQ1Bad1kzNlNTS3lMWWl3Y2JpOVB3",
    "took": 1215,
    "timed_out": false,
    "_shards": {
        "total": 1,
        "successful": 1,
        "skipped": 0,
        "failed": 0
    },
    "hits": {
        "total": {
            "value": 499986,
            "relation": "eq"
        },
        "max_score": 1.0,
        "hits": [...]
    }
}

@jmazanec15 @vamshin @navneet1v

@navneet1v navneet1v moved this from Backlog to Backlog (Hot) in Vector Search RoadMap Mar 19, 2024
@vamshin vamshin moved this from Backlog (Hot) to 2.14.0 in Vector Search RoadMap Apr 1, 2024
@junqiu-lei
Member Author

Closing this issue as this feature is going to be released in 2.14.

@github-project-automation github-project-automation bot moved this from 2.14.0 to ✅ Done in Vector Search RoadMap Apr 30, 2024