[RFC] Two-phased Search Re-score Design #1861
Labels
Features, Roadmap:Vector Database/GenAI, search-improvements, v2.17.0
Introduction
This document provides a design for the re-score portion of the disk-based two-phased ANN search project (see #1779).
Problem Statement
Quantization provides a significant reduction in memory consumption for vector search. This is vital to customers in order to control costs for their neural search systems. However, this comes at the cost of k-NN search accuracy, because the distances being computed are approximate. From a user perspective, this results in degraded search relevance.
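To make the accuracy trade-off concrete, the toy sketch below (NumPy; illustrative only, not the plugin's actual quantizer) compares the document ordering produced by full-precision L2 distance against the ordering produced by 1-bit quantization with Hamming distance. The two orderings can disagree, and that disagreement is exactly the relevance degradation users observe:

```python
import numpy as np

rng = np.random.default_rng(42)

query = rng.normal(size=8)
docs = rng.normal(size=(5, 8))

# Full-precision squared-L2 distances (the "true" ordering).
full = ((docs - query) ** 2).sum(axis=1)

# 1-bit (binary) quantization: keep only the sign of each dimension,
# then compare with Hamming distance. This is a crude approximation
# of the exact distances above.
q_bits = query > 0
d_bits = docs > 0
approx = (d_bits != q_bits).sum(axis=1)

print("full-precision ranking:", np.argsort(full))
print("quantized ranking:     ", np.argsort(approx))
```

Re-scoring recovers accuracy by recomputing `full` for the top candidates surfaced by the approximate ranking.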
In order to improve search accuracy while still reaping the benefits of the reduced memory consumption provided by quantization, a two-phased, disk-based approach can be taken, with a predictable increase in latency (the exact increase depends heavily on system/workload specifics, but we have seen increases of only tens of milliseconds in some cases).
From preliminary experiments, we have seen this work very well (see #1779 (comment)).
With that, we need to provide a way for users to execute this two-phased search approach in OpenSearch in an easy-to-use, yet efficient manner.
Requirements
Functional
Non-functional
Out of scope/Future scope
High Level Design
Proposed
Architecture
We are going to augment our existing k-NN query to support re-scoring logic. A user would specify whether re-scoring should be done and, if so, the factor by which to over-sample the quantized ANN index. Results from the ANN search will be merged across the segments, then re-scored and reduced to k overall.
Shard Two-Phased Search Flow
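The shard-level flow can be sketched as follows (illustrative Python, not plugin code; the segment methods `ann_search` and `full_vector` are assumptions standing in for the quantized index search and the full-precision vector read from storage):

```python
import heapq

import numpy as np


def two_phase_search(query, segments, k, oversample_factor):
    """Sketch of the proposed shard-level two-phase flow.

    Each segment is assumed to expose:
      - ann_search(query, n): n (doc_id, approx_score) pairs from the
        quantized ANN index
      - full_vector(doc_id):  the full-precision vector from storage
    """
    n = int(k * oversample_factor)

    # Phase 1: over-sampled approximate search on every segment, merged.
    candidates = []
    for seg in segments:
        for doc_id, _approx in seg.ann_search(query, n):
            candidates.append((seg, doc_id))

    # Phase 2: re-score candidates with full-precision distances and
    # reduce to the overall top k (smaller distance == better here).
    rescored = [
        (float(np.linalg.norm(seg.full_vector(doc_id) - query)), doc_id)
        for seg, doc_id in candidates
    ]
    return heapq.nsmallest(k, rescored)
```

Note that the oversampling happens per segment, while the final reduction to k happens once for the whole shard.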
Alternative 1: Use script-score functionality already available in query dsl to implement re-scoring.
Currently, re-score functionality can already be achieved via script-scoring and/or re-scoring with k-NN scripts. We could rely on these existing mechanisms to support the functionality.
Pros
Cons
Overall, this option was not selected mainly because the user experience is complex.
Alternative 2: Implement re-scoring at the segment level by modifying KNNScorer
Re-scoring could be applied to a segment’s quantized ANN search results before they are merged with the shard’s top results. The major implication is that performance would vary depending on how many segments there are.
Pros
Cons
Because segment count would have a heavy impact on performance, we are not going with this approach.
Access Pattern Optimization Strategy
The re-score process reads vectors from secondary storage into memory in order to recompute the distances. Reads from secondary storage are typically slow and can be a bottleneck. Thus, for optimal performance, secondary-storage IOPS should be minimized.
In general, there are two ways that IOPS can be minimized for re-scoring:
Initially, we will not explicitly implement any optimization strategy, for the following reasons:
In the future, we will explore this. However, given the successful performance observed in tests, it does not need to be done as a P0.
API Design
Important: the completely out-of-the-box, single-parameter API design is not covered here (see the mode parameter in the RFC); it will be covered in the future. This section focuses on the lower-level mechanisms for controlling re-scoring.
The API should:
With that, we propose the following API:
We are introducing 2 new optional parameters:
With this, users can still logically reason that k means the number of competitive results per segment. Additionally, the oversample_factor parameter is necessary because it allows us to set a default for the disk-based out-of-the-box experience, abstracting the underlying two-phase search algorithm from the user.
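A hypothetical request body for the proposed API is shown below as a Python dict; the exact parameter names and nesting are illustrative assumptions based on this RFC, not the final contract:

```python
# Hypothetical k-NN query with the proposed optional re-score parameters.
# Field name, vector values, and nesting are illustrative only.
knn_query = {
    "query": {
        "knn": {
            "my_vector_field": {
                "vector": [1.5, 2.5, 3.5],
                "k": 10,
                # New optional parameters proposed in this RFC: enable
                # re-scoring and control the per-segment oversampling.
                "rescore": {
                    "oversample_factor": 3.0,
                },
            }
        }
    }
}
```

With these values, each segment's ANN search would return `oversample_factor * k = 30` candidates, which are then re-scored with full-precision distances and reduced to the top 10.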
Radial Search Behavior
Re-score functionality will be partially available with radial search. There are a couple of problems with supporting re-scoring for the radial use case:
For (1), we will disable radial support for binary quantization. We may come back and try to find a way to solve this, but it would be a large effort that may not have a great return. We could provide functionality where we set the radial distance to the maximum and then apply the threshold on re-scoring; however, this would mean that ANN search performance would always be at its slowest when a radius is set, which would be a bad experience.
For (2), we will support radial search, and will filter on threshold during re-scoring. This could mean that if the distance approximations are significantly underestimated, we get 0 results. While this is not optimal, it will at least preserve the contract on the radial search problem — i.e. we will not be returning a result we know falls outside of the radius. We will give users a way to avoid this by using the oversample_factor parameter to expand the radius on the ANN index search.
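The proposed radial behavior can be sketched as follows (illustrative Python; the function and parameter names are assumptions, not the plugin's API). The ANN search radius is widened by oversample_factor so that points whose approximate distance overestimates the true distance are not missed, and the original threshold is then enforced on the exact, re-scored distances:

```python
import numpy as np


def radial_rescore(query, max_distance, oversample_factor,
                   ann_radial_search, full_vector):
    """Sketch of radial search with re-scoring.

    ann_radial_search(query, radius) is assumed to return candidate doc
    ids from the quantized index; full_vector(doc_id) is assumed to read
    the full-precision vector from storage.
    """
    # Widen the radius on the approximate search to compensate for
    # distance-approximation error.
    candidates = ann_radial_search(query, max_distance * oversample_factor)

    results = []
    for doc_id in candidates:
        exact = float(np.linalg.norm(full_vector(doc_id) - query))
        # Preserve the radial contract exactly: never return a doc we
        # know falls outside the requested radius.
        if exact <= max_distance:
            results.append((exact, doc_id))
    return sorted(results)
```

If the approximations underestimate distances badly enough, the filter can still leave zero results, which is the limitation described above.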
Full-precision ANN Index Behavior
Re-score functionality for full-precision ANN indices will be a no-op. Because the results are returned with full-precision scores, re-scoring them would just produce the same scores. That said, if oversampling is applied, each ANN search will return oversample_factor*k results, which will then be reduced to k results. This is mainly to ensure that logical consistency is maintained within the query API.
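A minimal sketch of that reduction step (illustrative; names are assumptions): for full-precision indices the exact scores are kept as-is and the oversampled result set is simply cut down to k.

```python
def reduce_full_precision(results, k):
    """For full-precision indices, re-scoring is a no-op: ANN scores are
    already exact, so the oversampled (doc_id, score) list is simply
    reduced to the top k (higher score == better in OpenSearch)."""
    return sorted(results, key=lambda r: r[1], reverse=True)[:k]
```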
Metrics
From a metrics perspective, in order to monitor performance, we will rely on existing metric providers/functionality.
Re-scoring is very IO intensive. It will require a significant number of IOPS in order to load the full-precision vectors into memory for each query. Thus, to monitor the health of the system, users will need to monitor existing disk/storage metrics. For example, if the backing storage device is EBS, it would be important to monitor the throughput and IOPS published by EBS. Additionally, node stats can be used to monitor fs and disk metrics via the OpenSearch API. We will not add any explicit new resource-monitoring functionality; this will be fairly system specific and will need to be handled by the monitoring done for the deployment environment.
In addition, for per-shard latency breakdowns, users can use the profile API to see where time is being spent in the query and troubleshoot slow queries. Specifically, the ANN search will be happening during the rewrite stage of the query, so large profile.shards.searches.rewrite_time values will be indicative of poor behavior.
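For illustration, here is a hypothetical, heavily abbreviated profile response and one way an operator might flag slow shards from it (field values are made up; real responses contain many more fields, and profile times are reported in nanoseconds):

```python
# Hypothetical (abbreviated) profile API response.
profile_response = {
    "profile": {
        "shards": [
            {"id": "[node1][index][0]", "searches": [{"rewrite_time": 52_000_000}]},
            {"id": "[node1][index][1]", "searches": [{"rewrite_time": 1_200_000}]},
        ]
    }
}

SLOW_NANOS = 10_000_000  # flag anything over 10 ms spent in rewrite

slow_shards = [
    shard["id"]
    for shard in profile_response["profile"]["shards"]
    for search in shard["searches"]
    if search["rewrite_time"] > SLOW_NANOS
]
print(slow_shards)
```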
[Feedback requested] Lastly, we could implement a debug flag for the k-NN query that would indicate that fine-grained query debugging/profiling should take place. This could help an operator get fine-grained performance specifics on the k-NN query without impacting the whole production system. It would also not add as much overhead as the profile API. The behavior would be that, with this flag enabled, we log debugging statements per shard on the time taken for each operation. This would help zero in on what is causing the performance issue.
Low Level Design
For the plugin, the k-NN search happens when the per-segment scorer is created (see KNNWeight.scorer). In order to implement the re-score functionality, we are going to modify this process so that the ANN search happens at the rewrite stage (similar to Lucene k-NN); we will then reduce the per-segment results and re-score them. The main benefits of this approach are:
See #1845 for more details. With this, the scorer that is built will only have k results and full precision distances.
IndexSearcher.search for k-NN with re-score (what search looks like at the shard level)
Note: the result reduction phase is oversimplified for the sake of brevity. In this phase, Lucene iterates through the scorers, collects the scores by calling scorer.score(), and then reduces. See the IndexSearcher code for details.
While we could have returned a scorer that lazily computes the scores when score() is called during the reduction step, we chose to recompute the scores before the scorer is created so that:
With this in mind, we will need to ensure that we only re-score live docs so that we do not unnecessarily re-score documents that are deleted.
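The live-docs check can be sketched as follows (illustrative Python; the function names are assumptions — in the actual plugin this corresponds to consulting the segment's live-docs bitset, as in Lucene's LeafReader.getLiveDocs, before reading vectors from storage):

```python
def rescore_live_docs(query, segment_hits, live_docs, full_vector, distance):
    """Re-score only documents that are still live in the segment.

    live_docs is a per-segment predicate over doc ids; following Lucene's
    convention, None means the segment has no deletions (all docs live).
    """
    out = []
    for doc_id in segment_hits:
        if live_docs is not None and not live_docs(doc_id):
            # Skip deleted docs: no wasted full-precision vector reads
            # from secondary storage.
            continue
        out.append((distance(full_vector(doc_id), query), doc_id))
    return out
```

This keeps the expensive storage reads proportional to the number of live candidates rather than the raw ANN result count.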
Future Work