[RFC] Two-phased Search Re-score Design #1861
Labels
Features, Roadmap:Vector Database/GenAI, search-improvements, v2.17.0
Introduction
This document provides a design for the re-score portion of the disk-based two-phased ANN search project (see #1779).
Problem Statement
Quantization provides a significant reduction in memory consumption for vector search. This is vital to customers in order to control costs for their neural search systems. However, this comes at the cost of k-NN search accuracy, because the distances being computed are approximate. From a user perspective, this results in degraded search relevance.
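To make the accuracy trade-off concrete, the toy sketch below (NumPy; illustrative only, not the plugin's actual quantizer) compares the document ordering produced by full-precision L2 distance against the ordering produced by 1-bit quantization with Hamming distance. The two orderings can disagree, and that disagreement is exactly the relevance degradation users observe:

```python
import numpy as np

rng = np.random.default_rng(42)

query = rng.normal(size=8)
docs = rng.normal(size=(5, 8))

# Full-precision squared-L2 distances (the "true" ordering).
full = ((docs - query) ** 2).sum(axis=1)

# 1-bit (binary) quantization: keep only the sign of each dimension,
# then compare with Hamming distance. This is a crude approximation
# of the exact distances above.
q_bits = query > 0
d_bits = docs > 0
approx = (d_bits != q_bits).sum(axis=1)

print("full-precision ranking:", np.argsort(full))
print("quantized ranking:     ", np.argsort(approx))
```

Re-scoring recovers accuracy by recomputing `full` for the top candidates surfaced by the approximate ranking.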
In order to improve search accuracy while still reaping the benefits of the reduced memory consumption provided by quantization, a two-phased, disk-based approach can be taken, with a predictable increase in latency (the exact increase depends heavily on system/workload specifics, but we have seen increases of only tens of milliseconds in some cases).
From preliminary experiments, we have seen this work very well (see #1779 (comment)).
With that, we need to provide a way for users to execute this two-phased search approach in OpenSearch in an easy-to-use, yet efficient manner.
Requirements
Functional
Non-functional
Out of scope/Future scope
High Level Design
Proposed
Architecture
We are going to augment our existing k-NN query to support re-scoring logic. A user would specify whether re-scoring should be done and, if so, the factor by which to over-sample the quantized ANN index. Results from the ANN search will be merged across the segments, then re-scored and reduced to k overall.
Shard Two-Phased Search Flow
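The shard-level flow can be sketched as follows (illustrative Python, not plugin code; the segment methods `ann_search` and `full_vector` are assumptions standing in for the quantized index search and the full-precision vector read from storage):

```python
import heapq

import numpy as np


def two_phase_search(query, segments, k, oversample_factor):
    """Sketch of the proposed shard-level two-phase flow.

    Each segment is assumed to expose:
      - ann_search(query, n): n (doc_id, approx_score) pairs from the
        quantized ANN index
      - full_vector(doc_id):  the full-precision vector from storage
    """
    n = int(k * oversample_factor)

    # Phase 1: over-sampled approximate search on every segment, merged.
    candidates = []
    for seg in segments:
        for doc_id, _approx in seg.ann_search(query, n):
            candidates.append((seg, doc_id))

    # Phase 2: re-score candidates with full-precision distances and
    # reduce to the overall top k (smaller distance == better here).
    rescored = [
        (float(np.linalg.norm(seg.full_vector(doc_id) - query)), doc_id)
        for seg, doc_id in candidates
    ]
    return heapq.nsmallest(k, rescored)
```

Note that the oversampling happens per segment, while the final reduction to k happens once for the whole shard.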
Alternative 1: Use script-score functionality already available in query dsl to implement re-scoring.
Currently, re-score functionality can already be achieved via script-scoring and/or re-scoring with k-NN scripts. We could rely on these existing mechanisms to support the functionality.
Pros
Cons
Overall, this option was not selected mainly because the user experience is complex.
Alternative 2: Implement re-scoring at the segment level by modifying KNNScorer
Re-scoring could be applied to a segment’s quantized ANN search results before they are merged with the shard’s top results. The major implication is that performance would vary depending on how many segments there are.
Pros
Cons
Because segment count would have a heavy impact on performance, we are not going with this approach.
Access Pattern Optimization Strategy
The re-score process reads vectors from secondary storage into memory in order to recompute the distances. Reads from secondary storage are typically slow and can be a bottleneck. Thus, for optimal performance, secondary-storage IOPS should be minimized.
In general, there are two ways that IOPS can be minimized for re-scoring:
Initially, we will not explicitly implement any optimization strategy, for the following reasons:
In the future, we will explore this. However, given the successful performance observed in tests, it does not need to be done as a P0.
API Design
Important: the completely out-of-the-box, single-parameter API design is not covered here (see the mode parameter in the RFC); it will be covered in the future. This section focuses on the lower-level mechanisms for controlling re-scoring.
The API should:
With that, we propose the following API:
We are introducing 2 new optional parameters:
With this, users can still logically reason that k means the number of competitive results per segment. Additionally, the oversample_factor parameter is necessary because it allows us to set a default for the disk-based out-of-the-box experience, abstracting the underlying two-phase search algorithm from the user.
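A hypothetical request body for the proposed API is shown below as a Python dict; the exact parameter names and nesting are illustrative assumptions based on this RFC, not the final contract:

```python
# Hypothetical k-NN query with the proposed optional re-score parameters.
# Field name, vector values, and nesting are illustrative only.
knn_query = {
    "query": {
        "knn": {
            "my_vector_field": {
                "vector": [1.5, 2.5, 3.5],
                "k": 10,
                # New optional parameters proposed in this RFC: enable
                # re-scoring and control the per-segment oversampling.
                "rescore": {
                    "oversample_factor": 3.0,
                },
            }
        }
    }
}
```

With these values, each segment's ANN search would return `oversample_factor * k = 30` candidates, which are then re-scored with full-precision distances and reduced to the top 10.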
Radial Search Behavior
Re-score functionality will be partially available with radial search. There are a couple of problems with supporting re-scoring for the radial use case:
For (1), we will disable radial support for binary quantization. We may come back and try to find a way to solve this, but it would be a large effort that may not have a great return. We could provide functionality where we set the radial distance to the maximum and then apply the threshold on re-scoring; however, this would mean that ANN search performance would always be at its slowest when a radius is set, which would be a bad experience.
For (2), we will support radial search, and will filter on threshold during re-scoring. This could mean that if the distance approximations are significantly underestimated, we get 0 results. While this is not optimal, it will at least preserve the contract on the radial search problem — i.e. we will not be returning a result we know falls outside of the radius. We will give users a way to avoid this by using the oversample_factor parameter to expand the radius on the ANN index search.
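The proposed radial behavior can be sketched as follows (illustrative Python; the function and parameter names are assumptions, not the plugin's API). The ANN search radius is widened by oversample_factor so that points whose approximate distance overestimates the true distance are not missed, and the original threshold is then enforced on the exact, re-scored distances:

```python
import numpy as np


def radial_rescore(query, max_distance, oversample_factor,
                   ann_radial_search, full_vector):
    """Sketch of radial search with re-scoring.

    ann_radial_search(query, radius) is assumed to return candidate doc
    ids from the quantized index; full_vector(doc_id) is assumed to read
    the full-precision vector from storage.
    """
    # Widen the radius on the approximate search to compensate for
    # distance-approximation error.
    candidates = ann_radial_search(query, max_distance * oversample_factor)

    results = []
    for doc_id in candidates:
        exact = float(np.linalg.norm(full_vector(doc_id) - query))
        # Preserve the radial contract exactly: never return a doc we
        # know falls outside the requested radius.
        if exact <= max_distance:
            results.append((exact, doc_id))
    return sorted(results)
```

If the approximations underestimate distances badly enough, the filter can still leave zero results, which is the limitation described above.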
Full-precision ANN Index Behavior
Re-score functionality for full-precision ANN indices will be a no-op. Because the results are returned with full-precision scores, re-scoring them would just produce the same scores. That said, if oversampling is applied, each ANN search will return oversample_factor*k results, which will then be reduced to k results. This is mainly to ensure that logical consistency is maintained within the query API.
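A minimal sketch of that reduction step (illustrative; names are assumptions): for full-precision indices the exact scores are kept as-is and the oversampled result set is simply cut down to k.

```python
def reduce_full_precision(results, k):
    """For full-precision indices, re-scoring is a no-op: ANN scores are
    already exact, so the oversampled (doc_id, score) list is simply
    reduced to the top k (higher score == better in OpenSearch)."""
    return sorted(results, key=lambda r: r[1], reverse=True)[:k]
```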
Metrics
From a metrics perspective, in order to monitor performance, we will rely on existing metric providers/functionality.
Re-scoring is very IO intensive. It will require a significant number of IOPS in order to load the full-precision vectors into memory for each query. Thus, to monitor the health of the system, users will need to monitor existing disk/storage metrics. For example, if the backing storage device is EBS, it would be important to monitor the throughput and IOPS published by EBS. Additionally, node stats can be used to monitor fs and disk metrics via the OpenSearch API. We will not add any explicit new resource-monitoring functionality; this will be fairly system specific and will need to be handled by the monitoring done for the deployment environment.
In addition, for per-shard latency breakdowns, users can use the profile API to see where time is being spent in the query and troubleshoot slow queries. Specifically, the ANN search will be happening during the rewrite stage of the query, so large profile.shards.searches.rewrite_time values will be indicative of poor behavior.
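For illustration, here is a hypothetical, heavily abbreviated profile response and one way an operator might flag slow shards from it (field values are made up; real responses contain many more fields, and profile times are reported in nanoseconds):

```python
# Hypothetical (abbreviated) profile API response.
profile_response = {
    "profile": {
        "shards": [
            {"id": "[node1][index][0]", "searches": [{"rewrite_time": 52_000_000}]},
            {"id": "[node1][index][1]", "searches": [{"rewrite_time": 1_200_000}]},
        ]
    }
}

SLOW_NANOS = 10_000_000  # flag anything over 10 ms spent in rewrite

slow_shards = [
    shard["id"]
    for shard in profile_response["profile"]["shards"]
    for search in shard["searches"]
    if search["rewrite_time"] > SLOW_NANOS
]
print(slow_shards)
```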
[Feedback requested] Lastly, we could implement a debug flag for the k-NN query that would indicate that fine-grained query debugging/profiling should take place. This could help an operator get fine-grained performance specifics on the k-NN query without impacting the whole production system. It would also not add as much overhead as the profile API. The behavior would be that, with this flag enabled, we log debugging statements per shard on the time taken for each operation. This would help zero in on what is causing the performance issue.
Low Level Design
For the plugin, the k-NN search happens when the per-segment scorer is created (see KNNWeight.scorer). In order to implement the re-score functionality, we are going to modify this process so that the ANN search happens at the rewrite stage (similar to Lucene k-NN); we will then reduce the per-segment results and re-score them. The main benefits of this approach are:
See #1845 for more details. With this, the scorer that is built will only have k results and full precision distances.
IndexSearcher.search for k-NN with re-score (what search looks like at the shard level)
Note: the result reduction phase is oversimplified for the sake of brevity. In this phase, Lucene iterates through the scorers, collects the scores by calling scorer.score(), and then reduces. See the IndexSearcher code for details.
While we could have returned a scorer that lazily computes the scores when score() is called during the reduction step, we chose to recompute the scores before the scorer is created so that:
With this in mind, we will need to ensure that we only re-score live docs so that we do not unnecessarily re-score documents that are deleted.
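The live-docs check can be sketched as follows (illustrative Python; the function names are assumptions — in the actual plugin this corresponds to consulting the segment's live-docs bitset, as in Lucene's LeafReader.getLiveDocs, before reading vectors from storage):

```python
def rescore_live_docs(query, segment_hits, live_docs, full_vector, distance):
    """Re-score only documents that are still live in the segment.

    live_docs is a per-segment predicate over doc ids; following Lucene's
    convention, None means the segment has no deletions (all docs live).
    """
    out = []
    for doc_id in segment_hits:
        if live_docs is not None and not live_docs(doc_id):
            # Skip deleted docs: no wasted full-precision vector reads
            # from secondary storage.
            continue
        out.append((distance(full_vector(doc_id), query), doc_id))
    return out
```

This keeps the expensive storage reads proportional to the number of live candidates rather than the raw ANN result count.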
Future Work