[FEATURE] Creating Vector data structures Greedily #1942
Comments
So for 1.i) the process will be
The idea is that, with #2007, there will be a speed-up in 2, and the overall build time will be reduced. The capability to disable graph creation entirely is an extreme option, meant for cases that need high-speed indexing, index rebuilds, etc. The next capability to be added on top of this feature is threshold-based graph builds. This will make greedy graph building usable for more general use cases, with search still possible when no graph is present.
In this comment we look at some options for exposing a setting, “approximate_threshold”, which controls when the vector data structure is built during flush and merge operations. If no value is provided at index creation, a default value decides whether the vector data structure is built during flush and merge operations. The possible values for this parameter are:
Proposal
1. Keep “approximate_threshold” as a dynamic index setting
In this approach, users can set this value per index. They can use the “Update Index Setting” API to update the value on a live index at any time. Example:
Pros:
Cons
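As a rough illustration of Proposal 1, the sketch below builds the request a client might send to update the threshold on a live index. It is not taken from the plugin: the helper name `build_settings_update` and the index name are hypothetical, the setting key is the one used in the experiment heading of this thread (the released name may differ), and `PUT /<index>/_settings` is assumed to be the update-settings endpoint.

```python
import json

# Setting key as written in this thread; the released name may differ.
SETTING = "index.knn.approximate_threshold"


def build_settings_update(index_name: str, threshold: int):
    """Return (method, path, body) for a hypothetical update-settings request."""
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(threshold, bool) or not isinstance(threshold, int):
        raise TypeError("threshold must be an integer")
    body = json.dumps({SETTING: threshold})
    return "PUT", f"/{index_name}/_settings", body


method, path, body = build_settings_update("my-knn-index", 15000)
print(method, path, body)
```

Because the setting is dynamic in this proposal, the same request could be issued again later with a different value, without re-creating the index.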
2. Keep “approximate_threshold” as a mapping parameter
In this approach, users set this value as a mapping parameter of the knn vector field type. They can use the “Update Mapping Setting” API to update it. Example:
Pros
Cons
But in order to update a mapping parameter, users have to pass the other, non-updatable parameters as part of the update mapping API request. In the above example, to update “
3. Combine 1 & 2
In this approach we provide this as both an index setting and a mapping parameter. The index setting acts as the global setting for every field, and users can override it for any specific field by updating the mapping parameter.
Pros
Cons
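The override rule in Proposal 3 can be sketched as a small lookup: a field-level value wins, otherwise the index-level setting applies, otherwise a default. Everything here is hypothetical (the function name, the dictionary shapes, and the key names are assumptions made for illustration, not the plugin's code).

```python
from typing import Mapping


def effective_threshold(
    index_settings: Mapping[str, int],
    field_mapping: Mapping[str, int],
    default: int,
) -> int:
    """Resolve the threshold for one knn vector field (illustrative only)."""
    # 1. A value set on the field mapping overrides everything else.
    field_value = field_mapping.get("approximate_threshold")
    if field_value is not None:
        return field_value
    # 2. Otherwise fall back to the index-wide setting, then to the default.
    return index_settings.get("index.knn.approximate_threshold", default)


index_settings = {"index.knn.approximate_threshold": 20000}
print(effective_threshold(index_settings, {"approximate_threshold": 5000}, 15000))
print(effective_threshold(index_settings, {}, 15000))
print(effective_threshold({}, {}, 15000))
```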
Finalized approach
Since there is no workaround that lets us use a mapping parameter and still provide a good user experience, we can first introduce this as an index setting (Proposal 1). If we receive substantial feedback from the community asking for a mapping parameter, we can introduce it in the next iteration (effectively Proposal 3).
Default value for Approximate threshold index setting
Objective
In this document we look at the experiments we conducted to compare the performance of a k-NN index with and without vector data structures. The metrics from these experiments help us find the crossover point below which vector search performance is not affected if the k-NN plugin uses exact search instead of approximate search. The motivation for identifying the right default value is mainly to reduce indexing time and/or CPU utilization during indexing, since we found that CPU utilization and total indexing time grow in direct proportion to graph building.
Experiment Configuration
If a setting is not specified, the default from the latest 2.x is used.
Cluster Configuration
Index Configuration
Workload Configuration
We ran two types of experiment to compare performance. In the first, we don’t create the necessary vector data structures during indexing and perform exact search in the knn search API; this is simulated by setting “
In the second, we create the required data structures during indexing and use them to perform approximate search in the knn search API (default behavior). This is simulated by setting “
Experiment 1 (index.knn.approximate_threshold = -1):
Indexing
Charts: P50 service time comparison, P90 service time comparison, max CPU utilization, max throughput (docs/sec)
Observation
As expected, if we don’t build the graph as part of the indexing process, indexing latency improves compared to the default setup, where we build the index while flushing segments.
P90 Service Time
CPU Utilization
Search
For search we conducted two sets of experiments, with 1 client and with 10 parallel clients, to capture the trend as more clients are used. Before running search, the indices were loaded into memory using the warm-up API.
Charts (client = 1): P50 service time, P90 service time, max CPU utilization, max throughput
Charts (client = 10): P90 service time, max CPU utilization
Observation
As expected, approximate search performs better as the number of vectors increases. In the above experiment, once the number of vectors grows above 50K, exact search performs poorly compared to approximate search. In the metrics below, a positive value represents an improvement and a negative value a degradation relative to the default behavior.
P90 Service Time (client = 1)
CPU Utilization (client = 1)
Conclusion
Based on the indexing and search comparison, we decided to go with 15K as the default threshold.
Closing this GH issue, as the feature will be launched with version 2.18 of OpenSearch.
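To make the threshold concrete, here is a minimal sketch of the decision it implies at flush or merge time. The function name is hypothetical. Treating -1 as "never build" follows Experiment 1 in this thread; comparing with `>=` and letting 0 mean "always build" are assumptions, since the list of possible values did not survive in this copy of the discussion.

```python
DEFAULT_APPROXIMATE_THRESHOLD = 15_000  # the 15K default chosen above


def should_build_graph(
    num_vectors: int, threshold: int = DEFAULT_APPROXIMATE_THRESHOLD
) -> bool:
    """Decide whether a segment gets a vector data structure (illustrative)."""
    if threshold < 0:
        # -1 disables graph creation; search falls back to exact search.
        return False
    # Assumption: build once the segment holds at least `threshold` vectors.
    return num_vectors >= threshold


for count in (1_000, 15_000, 50_000):
    print(count, should_build_graph(count))
print(should_build_graph(1_000_000, threshold=-1))
```

Under this reading, small segments produced by frequent flushes skip graph construction and are served by exact search, while large merged segments still get a graph.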
Description
As of version 2.13 of OpenSearch, whenever a segment is created we create the data structures required for vector search (graphs for the HNSW algorithm, buckets for the IVF algorithm, etc.). When segments get merged, unlike the inverted file index or BKDs, these data structures are not merged; we create them from scratch (true for the native engines, and for Lucene if there are deletes). Example: if we merge 2 segments with 1K documents each, the graphs created in both segments are ignored and a new graph with 2K documents is created. This wastes compute (building vector search data structures is very expensive) and slows down the build time of vector indices.
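The waste in the example above can be put into numbers with a rough sketch. It counts how many vectors are inserted into a graph in total, which is only a proxy for build cost, and it models a single merge of all segments; the function names are hypothetical.

```python
from typing import Sequence


def vectors_inserted_eager(segment_sizes: Sequence[int]) -> int:
    """A graph per flushed segment, then a fresh graph for the merged segment."""
    per_segment_builds = sum(segment_sizes)
    merged_rebuild = sum(segment_sizes)  # old graphs are discarded on merge
    return per_segment_builds + merged_rebuild


def vectors_inserted_deferred(segment_sizes: Sequence[int]) -> int:
    """A single graph, built only after indexing and merging are complete."""
    return sum(segment_sizes)


segments = [1_000, 1_000]
print(vectors_inserted_eager(segments))     # 1000 + 1000 + 2000
print(vectors_inserted_deferred(segments))  # 2000
```

With more merge rounds the eager count keeps growing, because every round rebuilds the graph for all vectors in the merged segment.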
Hence the idea is we should build these data structures greedily.
and merges of segments. We should only create vector data structures once the whole indexing and merging (including force merge, if possible) is completed. This ensures that we create the vector data structures only once. For how we can do this, refer to Appendix A of the GH issue: [META] [Build-Time] Improving Build time for Vector Indices #1599.
The capability to disable graph creation entirely is an extreme option, meant for cases that need high-speed indexing, index rebuilds, etc. The next capability to be added on top of this feature is threshold-based graph builds. This will make greedy graph building usable for more general use cases, with search still possible when no graph is present.