
[RFC] Support Radius Search in k-NN #1483

Closed
junqiu-lei opened this issue Feb 14, 2024 · 12 comments
Assignees
Labels
Features (Introduces a new unit of functionality that satisfies a requirement), k-NN, RFC (Request for comments)

Comments

@junqiu-lei
Member

junqiu-lei commented Feb 14, 2024

Overview

Following from #814, this document details the proposed enhancement of the OpenSearch k-NN plugin with a radius search feature, leveraging recent advancements in the Lucene and FAISS libraries. The enhancement broadens the plugin's capabilities beyond traditional k-nearest neighbors (k-NN) search by introducing radius search, which identifies all points in a vector space that fall within a specified distance (score) threshold of a query point, offering more flexibility and utility in search operations.

Libraries Background

FAISS

FAISS is renowned for its efficient similarity search and clustering of dense vectors, and it excels at radius search. Its range_search API accepts a distance parameter, enabling efficient identification of points within a specified radius using both Inverted File (IVF) and Hierarchical Navigable Small World (HNSW) indexes.

  • Feature Availability in k-NN
    • IVF: Supported
    • HNSW: Supported, following the recent integration of the k-NN Faiss update.
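To illustrate the semantics that range_search provides (a pure-Python sketch of the behavior, not FAISS code; `points`, `query`, and `radius` are hypothetical):

```python
def radius_search(points, query, radius):
    """Return (index, squared L2 distance) pairs for every point whose
    squared L2 distance to the query is within the given radius."""
    hits = []
    for i, p in enumerate(points):
        dist = sum((a - b) ** 2 for a, b in zip(p, query))
        if dist <= radius:
            hits.append((i, dist))
    # FAISS returns range-search hits unordered; sort here for readability
    return sorted(hits, key=lambda h: h[1])

points = [[4.5, 7.9], [5.6, 8.5], [0.0, 0.0]]
hits = radius_search(points, query=[4.0, 8.0], radius=1.0)
# only the first point lies within squared distance 1.0 of the query
```

Unlike top-k search, the result count is data-dependent: any number of points (or none) may fall inside the radius.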

Lucene

Recent Lucene updates have introduced the capability for similarity-based vector search. Lucene's approach finds all vectors that score above a given similarity threshold by navigating the HNSW graph, continuing until no better-scoring nodes are found or the best candidate's score falls below the specified traversal similarity at the lowest graph level.

  • Feature Availability in k-NN
    • HNSW: It will be available with the release of Lucene 9.10, necessitating an update in OpenSearch to incorporate this new feature.

Scope

  1. Design and implement an API in OpenSearch k-NN that supports radius search using both FAISS and Lucene.
  2. Ensure the feature is intuitive, efficient, and provides consistent results across both libraries.

Requirements

  1. Parameter Handling: The API should correctly interpret and handle parameters related to radius search, such as radius distance and similarity thresholds.
  2. Performance: The implementation should not significantly degrade query performance.
  3. Indexing and Output Consistency: The introduction of the radius search feature should preserve the existing indexing process and query output format.

Radius Parameter

Radius search in k-NN will require a new parameter to define the maximum range of a query. Here are the options:

Option 1: Unified Parameter Across Engines

Pros

  • Simplified User Interface: Introduces a single parameter across both engines (Faiss and Lucene), streamlining the API and enhancing user experience by reducing the learning curve.
  • Consistency Across Engines: Offers a consistent user experience, regardless of the underlying engine, promoting ease of use and implementation.

Cons

  • Mapping Complexity: Ensures accurate mapping from the unified parameter to engine-specific parameters, which may introduce complexity at the k-NN plugin level.
  • Potential Feature Limitation: Risks limiting access to engine-specific features or optimizations that might be available through native parameters, potentially underutilizing the engines' capabilities.

Sub-Option 1.1: Using "Score" as a Unified Parameter

Pros

  • Intuitiveness: In many search contexts, especially those familiar with search engines, scores are a more intuitive measure of relevance. Users may find it easier to understand a higher score as better.
  • Consistency with Lucene: Lucene inherently uses scoring for ranking documents. Using score as the unified parameter maintains consistency across Lucene-based and vector search functionalities, simplifying the understanding and application of search criteria.
  • Flexibility in Definition: Scores can be flexibly defined to incorporate a variety of factors, including similarity, document popularity, or custom metrics, providing a versatile tool for fine-tuning search results.

Cons

  • Ambiguity in Meaning: Without a clear, standardized definition, scores can be ambiguous. Different systems or algorithms might calculate scores in vastly different ways, potentially confusing users or complicating the interpretation of results.
  • Normalization and Comparison Issues: Scores often require normalization to be comparable across different datasets or query contexts. This can introduce complexity in ensuring fair and consistent ranking, especially when integrating results from multiple sources or search modes.

Example Query and Result:

GET /target-index-faiss/_search
{
  "query": {
    "knn": {
      "my_vector1": {
        "vector": [4, 8],
        "score": 0.6
      }
    }
  }
}
{
    "took": 10,
    "timed_out": false,
    "_shards": {
        "total": 3,
        "successful": 3,
        "skipped": 0,
        "failed": 0
    },
    "hits": {
        "total": {
            "value": 2,
            "relation": "eq"
        },
        "max_score": 0.7936508,
        "hits": [
            {
                "_index": "target-index-faiss",
                "_id": "12",
                "_score": 0.7936508,
                "_source": {
                    "my_vector1": [
                        4.5,
                        7.9
                    ],
                    "price": 7.5
                }
            },
            {
                "_index": "target-index-faiss",
                "_id": "8",
                "_score": 0.62111807,
                "_source": {
                    "my_vector1": [
                        5.6,
                        8.5
                    ],
                    "price": 18.7
                }
            }
        ]
    }
}
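For reference, the top hit's _score above is consistent with the plugin's scoring for the l2 space type, 1 / (1 + squared L2 distance); a quick sketch (the helper name is illustrative, not the plugin's API):

```python
def l2_score(query, point):
    # OpenSearch k-NN score for the l2 space type: 1 / (1 + squared L2 distance)
    dist_sq = sum((a - b) ** 2 for a, b in zip(query, point))
    return 1 / (1 + dist_sq)

# vector [4.5, 7.9] vs. query [4, 8]: squared distance 0.26
score = l2_score([4, 8], [4.5, 7.9])  # ≈ 0.7936508, matching the first hit
```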

Sub-Option 1.2: Using "Distance" as a Unified Parameter (Proposed)

Pros

  • Clarity and Precision: Distances offer a clear and precise measure of similarity, especially in vector search spaces. The concept of distance (e.g., Euclidean, cosine) is mathematically defined and universally understood in the context of similarity searches.
  • Direct Relevance to Vector Search: For vector searches, such as those performed by Faiss, distance directly corresponds to the query's intent—finding the nearest neighbors. It provides a straightforward, objective criterion for evaluating query results.
  • Easier to Standardize: Since distance metrics are well-defined mathematically, they can be more easily standardized across different systems, facilitating comparison and integration of search results from diverse sources.

Cons

  • Less Intuitive for Non-Technical Users: The concept of distance, while precise, may be less intuitive for users unfamiliar with vector space models or the specific metrics used (e.g., understanding the difference between Euclidean and cosine distances).
  • Potential Misalignment with User Expectations: In some contexts, the smallest distance does not always equate to the most "relevant" result, especially if other factors (like document freshness or popularity) are important. Users expecting a simple "higher is better" metric may find distances counterintuitive.
  • Complexity in Mixing Search Types: When integrating vector searches with traditional text searches, translating distances into a comparable metric that aligns with text search scores can introduce complexity, potentially requiring additional transformations or heuristics.

Example Query:

GET /target-index/_search
{
  "query": {
    "knn": {
      "target-field": {
        "vector": [1,2,3,4],
        "distance":5.0
      }
    }
  }
}

Option 2: Introduce Different Parameters Based on Engine

To accommodate the specific characteristics and capabilities of each engine (FAISS and Lucene), we can introduce parameters tailored to each engine. This approach gives users the flexibility to leverage the unique features and optimizations of the underlying engine. The engine-specific parameters are:

FAISS-Specific Parameters

radius: This parameter is used with FAISS to specify the radius within which points are considered similar to the query vector. It directly translates to a distance in the vector space.

Lucene-Specific Parameters

similarity_threshold: This parameter is used with Lucene to specify a score threshold. Documents with a similarity score above this threshold are considered similar to the query vector. The score is typically derived from a similarity function or a distance metric.
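For illustration, example queries under this option might look like the following (index names, field names, and values are hypothetical placeholders):

```
GET /faiss-index/_search
{
  "query": {
    "knn": {
      "my_vector": {
        "vector": [1.0, 2.0],
        "radius": 5.0
      }
    }
  }
}

GET /lucene-index/_search
{
  "query": {
    "knn": {
      "my_vector": {
        "vector": [1.0, 2.0],
        "similarity_threshold": 0.6
      }
    }
  }
}
```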

Note: The following contents are based on option 1.1, using "Score" as a unified parameter.

High Level Query Workflow

Native Engine (Faiss) Radius Query

[Diagram: radius-search-11.drawio]

Lucene Engine Radius Query

[Diagram: radius-search-10.drawio]

Low-Level Implementation

Validate the radius parameter from XContent

The radius parameter is parsed from the XContentParser object; the value is read as follows:

radius = (float) NumberFieldMapper.NumberType.FLOAT.parse(parser.objectBytes(), false);

Create new KNNQueryBuilder constructor with validation

public KNNQueryBuilder(String fieldName, float[] vector, float radius, QueryBuilder filter) {
    if (Strings.isNullOrEmpty(fieldName)) {
        throw new IllegalArgumentException("[" + NAME + "] requires fieldName");
    }
    if (vector == null) {
        throw new IllegalArgumentException("[" + NAME + "] requires query vector");
    }
    if (vector.length == 0) {
        throw new IllegalArgumentException("[" + NAME + "] query vector is empty");
    }
    if (radius > RADIUS_MAX || radius < RADIUS_MIN) {
        throw new IllegalArgumentException("[" + NAME + "] requires radius to be between " + RADIUS_MIN + " and " + RADIUS_MAX);
    }

    this.fieldName = fieldName;
    this.vector = vector;
    this.filter = filter;
    this.ignoreUnmapped = false;
    this.radius = radius;
}

FAISS Integration

Translate radius to distance

We can add a score → rawScore translation based on the existing rawScore → score translation for the different space types in index/SpaceType.java:

public enum SpaceType {
    L2("l2") {
        @Override
        public float rawScoreTranslation(float score) {
            return 1 / score - 1;
        }
    },
    COSINESIMIL("cosinesimil") {
        @Override
        public float rawScoreTranslation(float score) {
            return 1 / score - 1;
        }
    },
    L1("l1") {
        @Override
        public float rawScoreTranslation(float score) {
            return 1 / score - 1;
        }
    },
    LINF("linf") {
        @Override
        public float rawScoreTranslation(float score) {
            return 1 / score - 1;
        }
    },
    INNER_PRODUCT("innerproduct") {
        @Override
        public float rawScoreTranslation(float score) {
            return -score + 1;
        }
    },
    HAMMING_BIT("hammingbit") {
        @Override
        public float rawScoreTranslation(float score) {
            return 1 / score - 1;
        }
    };

    public abstract float rawScoreTranslation(float score);

    // ... existing fields and methods omitted
}
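As a worked example of this translation (a sketch; the helper names are illustrative, not the plugin's API), a query score threshold of 0.6 under L2 maps to a raw squared-distance threshold of 1/0.6 - 1 ≈ 0.667, and the existing raw-to-score direction inverts it exactly:

```python
def l2_score_to_raw(score):
    # score -> raw squared L2 distance, mirroring rawScoreTranslation for L2 above
    return 1 / score - 1

def l2_raw_to_score(raw):
    # existing raw distance -> score direction for the l2 space type
    return 1 / (1 + raw)

raw_threshold = l2_score_to_raw(0.6)  # ≈ 0.6667
```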

JNI Service

Add new method in JNI Service

public static KNNQueryResult[] radiusQueryIndex(long indexPointer, float[] queryVector, float radius, String engineName) {
    if (KNNEngine.FAISS.getName().equals(engineName)) {
        return FaissService.radiusQueryIndex(indexPointer, queryVector, radius);
    }
    throw new IllegalArgumentException("RadiusQueryIndex not supported for provided engine");
}

Add new method in JNI FaissService wrapper

jobjectArray knn_jni::faiss_wrapper::RangeSearch(knn_jni::JNIUtilInterface *jniUtil, JNIEnv *env, jlong indexPointerJ,
                                                 jfloatArray queryVectorJ, jfloat radiusJ) {
    if (queryVectorJ == nullptr) {
        throw std::runtime_error("Query Vector cannot be null");
    }

    auto *index = reinterpret_cast<faiss::IndexIDMap *>(indexPointerJ);

    if (index == nullptr) {
        throw std::runtime_error("Invalid pointer to index");
    }

    float *rawQueryvector = jniUtil->GetFloatArrayElements(env, queryVectorJ, nullptr);

    faiss::RangeSearchResult res(1);
    index->range_search(1, rawQueryvector, radiusJ, &res);
    // release the JVM copy of the query vector to avoid leaking it
    jniUtil->ReleaseFloatArrayElements(env, queryVectorJ, rawQueryvector, JNI_ABORT);

    int resultSize = res.lims[1];

    jclass resultClass = jniUtil->FindClass(env,"org/opensearch/knn/index/query/KNNQueryResult");
    jmethodID allArgs = jniUtil->FindMethod(env, "org/opensearch/knn/index/query/KNNQueryResult", "<init>");

    jobjectArray results = jniUtil->NewObjectArray(env, resultSize, resultClass, nullptr);

    jobject result;
    for(int i = 0; i < resultSize; ++i) {
        result = jniUtil->NewObject(env, resultClass, allArgs, res.labels[i], res.distances[i]);
        jniUtil->SetObjectArrayElement(env, results, i, result);
    }

    return results;
}

Lucene Integration

Lucene integration will be available with the release of Lucene 9.10, necessitating an update in OpenSearch and k-NN to incorporate this new feature.

Add new VectorSimilarityQuery in KNNQueryFactory

public static Query create(CreateQueryRequest createQueryRequest) {
    ......
    if (VectorDataType.BYTE == vectorDataType) {
        return new ByteVectorSimilarityQuery(fieldName, byteVector, radius);
    } else {
        return new FloatVectorSimilarityQuery(fieldName, floatVector, radius);
    }
}

Others

Benchmarks

Benchmarks will be published once the core implementation is completed.

Limit the number of results returned

Incorporating radius search capabilities into OpenSearch brings with it the challenge of managing large result sets, especially when querying extensive datasets in some corner cases. The nature of radius search, designed to return all results within a defined threshold, poses a risk of generating a voluminous number of hits. This scenario could potentially impact cluster performance by demanding substantial computational and memory resources.

OpenSearch employs a safeguard to mitigate such risks: the index.max_result_window setting. This setting caps the maximum number of hits that can be returned in a single query response, with a default limit of 10,000 hits.
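If a use case genuinely needs a larger cap, the setting can be adjusted per index. A hypothetical example (the value is illustrative; raising it increases memory pressure):

```
PUT /target-index/_settings
{
  "index": {
    "max_result_window": 20000
  }
}
```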

API Usage Stats

Along with the existing k-NN API usage stats, we can add a new metric to track the usage of radius search.

  • radius_query_requests: The number of requests made to the k-NN plugin for radius searches.
  • radius_query_errors: The number of errors encountered while processing radius search requests.
@junqiu-lei junqiu-lei added the Features and k-NN labels Feb 14, 2024
@junqiu-lei junqiu-lei self-assigned this Feb 14, 2024
@junqiu-lei junqiu-lei added RFC Request for comments and removed untriaged labels Feb 14, 2024
@jmazanec15
Member

Hey @junqiu-lei this looks cool! A couple questions: For faiss, when distance is passed in for l2, what if the implementation uses the l2^2 ordering? Also, I imagine we will probably want to support both distance and score.

Also, in Opensearch, there exists a "min_score" field that seems to do just this: https://opensearch.org/docs/latest/api-reference/search/#url-parameters. Can we take that as a parameter and build around it?

@junqiu-lei
Member Author

Hi @jmazanec15, much appreciated the feedback!

For faiss, when distance is passed in for l2, what if the implementation uses the l2^2 ordering?

In FAISS, when L2 distance is used, the internal implementation relies on squared L2 distance (L2^2) for efficiency, as it avoids the square root operation. This doesn't affect the relative ordering of search results, since the nearest neighbors remain the same whether using L2 or L2^2.
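The ordering claim is easy to check: squaring is monotonic for non-negative distances, so ranking by L2 and by L2^2 is identical (a quick sketch with made-up distances):

```python
import math

l2_squared = [0.26, 2.81, 80.0]          # hypothetical squared L2 distances
l2 = [math.sqrt(d) for d in l2_squared]  # the corresponding true L2 distances

rank_sq = sorted(range(3), key=lambda i: l2_squared[i])
rank_l2 = sorted(range(3), key=lambda i: l2[i])
assert rank_sq == rank_l2  # same neighbor ordering either way
```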

Also, I imagine we will probably want to support both distance and score.

Certainly, incorporating both distance and score is feasible, thanks to the ability to convert between these metrics. However, it would be helpful to understand the extent to which distance thresholds are preferred, considering scores often provide a more intuitive indication of relevance.

Also, in Opensearch, there exists a "min_score" field that seems to do just this: https://opensearch.org/docs/latest/api-reference/search/#url-parameters. Can we take that as a parameter and build around it?

Good point calling out "min_score"; it indeed serves as an effective filter for top-k queries. However, integrating the radius query parameter in a similar manner to the k parameter is advisable, as "radius" and "k" represent distinct search methodologies. This alignment ensures clarity and consistency in parameter handling.

@jmazanec15
Member

In FAISS, when L2 distance is used, the internal implementation relies on squared L2 distance (L2^2) for efficiency, as it avoids the square root operation. This doesn't affect the relative ordering of search results, since the nearest neighbors remain the same whether using L2 or L2^2.

But for distance parameter, would this mean user has to pass in l2^2 or l2?

Certainly, incorporating both distance and score is feasible, thanks to the ability to convert between these metrics. However, it would be helpful to understand the extent to which distance thresholds are preferred, considering scores often provide a more intuitive indication of relevance.

I guess as a user who is familiar with the embedding space being generated, it may be more natural for me to reason about the distance than have to convert the distance to a score. Adding both does not need to be p0, but we should easily be able to extend it.

Good point to call out "min_score", It indeed serves as an effective filter with top-k queries. However integrating the radius query parameter in a similar manner to the k parameter is advisable, as "radius" and "k" represent distinct search methodologies. This alignment ensures clarity and consistency in parameter handling.

Good point. That makes sense to me.

@junqiu-lei
Member Author

But for distance parameter, would this mean user has to pass in l2^2 or l2?

In Faiss, users need to provide the distance parameter as l2^2.

I guess as a user who is familiar with the embedding space being generated, it may be more natural for me to reason about the distance than have to convert the distance to a score. Adding both does not need to be p0, but we should easily be able to extend it.

Yes, I agree.

@jmazanec15
Member

So in the interface, if we were to introduce both, what do you think it would look like?

@junqiu-lei
Member Author

So in the interface, if we were to introduce both, what do you think it would look like?

In the interface, if we support both distance and score types, the names need to be clearly understandable; we could introduce something like "radius_score" and "radius_distance", or "min_score" and "min_distance".

@jmazanec15
Member

jmazanec15 commented Feb 20, 2024

Yeah that makes sense. One minor thing: I would go with "min_score" and "radial_distance"

@jmazanec15
Member

I'd also be interested in more details around the benchmarking plan.

@junqiu-lei
Member Author

I'd also be interested in more details around the benchmarking plan.

Certainly, I'll provide more details on the benchmarking plan later.

@jmazanec15
Member

Incorporating radius search capabilities into OpenSearch brings with it the challenge of managing large result sets, especially when querying extensive datasets in some corner cases. The nature of radius search, designed to return all results within a defined threshold, poses a risk of generating a voluminous number of hits. This scenario could potentially impact cluster performance by demanding substantial computational and memory resources.

OpenSearch employs a safeguard to mitigate such risks: the index.max_result_window setting. This setting caps the maximum number of hits that can be returned in a single query response, with a default limit of 10,000 hits.

I don't know if this will work. I think we need to add more detail on how we're limiting memory consumption for a very low threshold.

@junqiu-lei
Member Author

From my experiment on the Faiss engine with a 512 MB JVM heap, the default 10k response size limit doesn't cause an OOM issue.

| Engine | JVM heap size | Data size | Dimension | L2 square | Hit count | Size | Memory used (MB) |
|--------|---------------|-----------|-----------|-----------|-----------|---------|------------------|
| Lucene | 512MB | 0.5M | 128 | 700000 | 499986 | 1,000 | 31 |
| Lucene | 512MB | 0.5M | 128 | 700000 | 499986 | 10,000 | 108 |
| Lucene | 512MB | 0.5M | 128 | 700000 | 499986 | 50,000 | 115 |
| Lucene | 512MB | 0.5M | 128 | 700000 | 499986 | 100,000 | 154 |
| Lucene | 512MB | 0.5M | 128 | 700000 | 499986 | 200,000 | 280 |
| Lucene | 512MB | 0.5M | 128 | 700000 | 499986 | 300,000 | OOM |

With size set to 300,000, the OOM was mainly caused by the large size value in the request settings.


Test steps:

  • JVM size: export OPENSEARCH_JAVA_OPTS="-Xms512m -Xmx512m"

  • Index configuration

{
    "target_index_name": "target_index",
    "target_field_name": "target_field",
    "target_index_body": "indices/lucene-index.json",
    "target_index_primary_shards": 1,
    "target_index_dimension": 128,
    "target_index_space_type": "l2",

    "target_index_bulk_size": 100,
    "target_index_bulk_index_data_set_format": "hdf5",
    "target_index_bulk_index_data_set_path": "/tmp/sift-128-euclidean.hdf5",
    "target_index_bulk_indexing_clients": 10,

    "target_index_max_num_segments": 1,
    "target_index_force_merge_timeout": 600,
    "hnsw_ef_search": 100,
    "hnsw_ef_construction": 100,
    "target_index_num_vectors": 500000
}
  • Query with a low-threshold L2 squared distance that can match all docs
GET /target_index/_search?scroll=1m HTTP/1.1
Host: localhost:9200
Content-Type: application/json

{    
  "size": <changed in tests>,
  "query": {
    "knn": {
      "target_field": {
        "vector": [6.0,62.0,24.0,0.0,0.0,1.0,2.0,0.0,7.0,64.0,123.0,21.0,0.0,0.0,0.0,1.0,3.0,2.0,93.0,94.0,0.0,3.0,8.0,4.0,22.0,1.0,6.0,18.0,0.0,1.0,3.0,18.0,123.0,60.0,5.0,0.0,0.0,0.0,0.0,2.0,123.0,84.0,86.0,23.0,0.0,1.0,0.0,6.0,8.0,8.0,85.0,123.0,71.0,19.0,0.0,2.0,2.0,0.0,6.0,96.0,71.0,10.0,1.0,5.0,123.0,2.0,0.0,0.0,0.0,0.0,0.0,42.0,123.0,1.0,0.0,0.0,0.0,23.0,54.0,102.0,7.0,0.0,0.0,10.0,70.0,123.0,57.0,10.0,0.0,0.0,0.0,6.0,79.0,96.0,1.0,0.0,21.0,36.0,83.0,4.0,0.0,0.0,0.0,26.0,7.0,7.0,66.0,63.0,3.0,26.0,54.0,47.0,0.0,0.0,2.0,33.0,10.0,105.0,50.0,1.0,0.0,0.0,0.0,2.0,5.0,25.0,1.0,0.0],
        "distance": 700000 // this distance matches all docs
      }
    }
  }
}
  • Response example:
{
    "_scroll_id": "FGluY2x1ZGVfY29udGV4dF91dWlkDXF1ZXJ5QW5kRmV0Y2gBFndaNE5BczRqUVRHTUxnLXMyTlItWVEAAAAAAAAAfRZYQ1Bad1kzNlNTS3lMWWl3Y2JpOVB3",
    "took": 1215,
    "timed_out": false,
    "_shards": {
        "total": 1,
        "successful": 1,
        "skipped": 0,
        "failed": 0
    },
    "hits": {
        "total": {
            "value": 499986,
            "relation": "eq"
        },
        "max_score": 1.0,
        "hits": [...]
    }
}

@jmazanec15 @vamshin @navneet1v

@navneet1v navneet1v moved this from Backlog to Backlog (Hot) in Vector Search RoadMap Mar 19, 2024
@vamshin vamshin moved this from Backlog (Hot) to 2.14.0 in Vector Search RoadMap Apr 1, 2024
@junqiu-lei
Member Author

Closing this issue as this feature is going to be released in 2.14.

@github-project-automation github-project-automation bot moved this from 2.14.0 to ✅ Done in Vector Search RoadMap Apr 30, 2024