[FEATURE] Creating Vector data structures Greedily #1942

navneet1v · 2024-08-08T18:31:24Z

Description

As of version 2.13 of Opensearch, whenever a segment is created we create the data structures which are required to do vector search(aka graphs for HNSW algorithm, buckets for IVF algorithm etc.). When the segments gets merged unlink inverted file index, BKDs these data structures are not merged, rather we create them from scratch(true for native engines, and Lucene(if deletes are there)). Example: if we are merging 2 segments with 1k documents each, the graphs which are created in both the segments are ignored and a new graph with 2K documents will newly be created. This leads to waste of compute(as build vector search data structures is very expensive) and slows down the build time for Vector indices.

Hence the idea is we should build these data structures greedily.

Delaying Vector Data structures Creation:
1. For bulk ingestion the proposal is we should completely stop creating vector data structures during segment creation ~~and merges of segments~~. We should only create vector data structures once the whole indexing ~~and merges(including force merge if possible)~~ is completed. This will ensure that we are creating vector data structures once. Refer Appendix A on how we can do this of GH issue: [META] [Build-Time] Improving Build time for Vector Indices #1599.
2. If a user is constantly indexing and searching on data then not building graphs will lead to high latency + user might not have time to do force merge to trigger the graph creation, hence, we should build the capability where based on a certain threshold we can Opensearch can decide for which segment graph needs to be created. This will improve the ingestion speed. Ref this community request: [FEATURE] Separate refresh interval for KNN segment merge #1577

Having the capability to disable graph creation is extreme and will be used for cases where we need high speed indexing, index re-builds etc. On top of this feature next capability will be added is threshold based graph builds. This will ensure that this greed graph build based capability is used for more general use-cases with search also possible if graph not present.

Incremental graph creation: For native engines(Nmslib and Faiss) the vector data structures are build during segment creation(aka OS refresh), which leads to spikes in CPU and sometimes throttling due to high CPU(we saw this in various benchmarks). On the other time when segments are not getting created the CPU util stays very low. This leads uneven usage of CPU(basically a SAW tooth curve of CPU utilization is created) and users are not able to push more documents at Opensearch at steady state. Specifically for Streaming Ingestion use cases we can create graphs incrementally during ingestion(Lucene engine already implements this) this will ensure that we spread out the CPU utilization peak over a period of time and when OS refresh happens we already have graph created. Which will ensure that it is available for search for users hence reduce the Search after Indexing time.

References:

Meta Issue: [META] [Build-Time] Improving Build time for Vector Indices #1599
Community request: [FEATURE] Separate refresh interval for KNN segment merge #1577

navneet1v · 2024-08-08T18:35:55Z

With the changes that are done as part of #1938, and #1853 we will have the ground work to do the Incremental graph creation. Once these issues are resolved we can start working on this feature.

navneet1v · 2024-08-28T01:33:49Z

So for 1.i) the process will be

user disable the graph creation for the index and index all the vectors
user enables the graph creation for the index.
User now run force merge to 1 segment to ensure that all KNN DS are created.
Perform search.

The idea is with #2007, there will be speed up in 2, and overall there will be reduction in build time.

Having the capability to disable graph creation is extreme and will be used for cases where we need high speed indexing, index re-builds etc. On top of this feature next capability will be added is threshold based graph builds. This will ensure that this greed graph build based capability is used for more general use-cases with search also possible if graph not present.

VijayanB · 2024-10-10T06:51:27Z

In this comment we will be looking at some of the options on how to expose a setting “approximate_threshold” which controls when to build vector data structure during flush and merge operation. If no value is provided during index creation, we will use default value to decide whether to build vector data structure or not during flush and merge operations.

The possible values for this parameters are:

-1 => When users want to disable building vector data structure during indexing
0 => When users want to build vector data structure always ( this is the behavior till 2.17)
1.. INTEGER.MAX_VALUE - 2 => Will build vector data structure only when a segment contains more than this number of documents.

Proposal

1. Keep “approximate_threshold” as dynamic index setting

In this approach, users can set this value per index. They can use “Update Index Setting” API to update this value on live index at any time.

Example:

PUT my-test-knn-index/_settings
{
      "index.knn.advanced.approximate_threshold":-1
}

Pros:

It is easy to update setting by just passing only the property and the desired value.
This setting will be shared by all vector fields from same index. Users don’t have to update individual fields explicitly.
This is aligned with other property “index.knn.advanced.filtered_exact_search_threshold"

Cons

Different vector fields from same index cannot have different value.

2. Keep “approximate_threshold” as mapping parameter

In this approach, users can set this value as mapping parameter for knn vector field type. They can use “Update Mapping Setting” API to update this value.

Example:

PUT my-test-knn-index/_mappings
{
  "properties": {
    "my_vector1": {
      "type": "knn_vector",
      "space_type": "l2",
      "dimension": 2,
`      "approximate_threshold": -1,`
      "method": {
        "name": "hnsw",
        "engine": "nmslib",
        "parameters": {
          "ef_construction": 128,
          "m": 24
        }
      }
    }
  }
}

Pros

Different vector fields from same index can have different value. This gives more control to the users since they can tune this parameter individually at field level.
This works well if user is going to set it and forget during index creation step.

Cons

This doesn’t provide very good user experience if users want to update the value after index creation step. we would like to provide same experience as index setting, for example.

PUT my-test-knn-index/_mappings
{
  "properties": {
    "my_vector1": {
      "type": "knn_vector",
      "approximate_threshold": -1
    }
  }
}

But in order to update a mapping parameter, users has to pass other non-updatable parameters as part of update mapping api request. In above example, to update “approximate_threshold", we had to first use GET Setting API to get the non-updatable properties like type, space_type, dimension, methods etc and then use exact value in update mapping api to change ”approximate_threshold".

3. Combine 1 & 2

In this approach we will provide this as both index setting and mapping parameter, where index setting will be considered as global setting for every field, and, users can override this value for any specific field by updating as mapping parameter.

Pros

It is easy to update index setting by just passing only the property and the desired value.
The index setting will be shared by all vector fields from same index. Users don’t have to update individual fields explicitly.
This is aligned partially with other property “index.knn.advanced.filtered_exact_search_threshold"
Different vector fields from same index can have different value by overriding at field level. This gives more control to the users since they can tune this parameter individually at field level.

Cons

This doesn’t solve the problem of having to pass non-updatable field as part of update mapping api request. However, this will be only considered in a specific case where users had to override this value for certain fields. This significantly reduces the nuances for most of the use case where user either has an index with one vector field or they want to use same setting for every vector field.

Finalized approach

Since there is no work around to use as mapping parameter and provide better user experience at same time, we can first introduce this as “index setting”, (Proposal 1 ) and, if we receive substantial feedback on having it as mapping parameter from community, we can introduce mapping parameter in next iteration**(effectively Proposal 3).**

VijayanB · 2024-10-25T01:50:48Z

Default value for Approximate threshold index setting

Objective

In this document we will look into experiments that we conducted to compare the performance of k-NN index with and without vector data structures. The metrics from this experiments will help us to find crossing point where vector search performance is not affected if k-NN plugin uses exact search instead of Approximate search. The motivation for identifying right default value is mainly to reduce indexing time, and/or reduce cpu utilization during indexing, where we found out that graph building is directly proportional to high cpu utilization and total indexing time.

Experiment Configuration

if a setting is not specified, then consider default is used from latest 2.x

Cluster Configuration

Setting	Value
No of Data Node	3
No of cores	8
Memory	64
No of Cluster Manager Node	1
Storage	50 GB

Index Configuration

Setting	Value
No of primary	3
No of replicas	0
Engine	FAISS
Space Type	inner product
Dimension	768

Workload Configuration

Setting	Value
Bulk Size	100
Total Indexing Client	10
Search Clients	1 or 10
K	100
Total Search Query	1000

We ran two types of experiment to compare performance. First, we don’t create necessary vector data structure during indexing, and, will perform exact search during knn search api, this is simulated by setting “approximate_threshold” to -1.

Second, we create required data structure during and use those data structure to perform approximate search during knn search api ( default behavior ). This is simulated by setting “approximate_threshold” to 0

Experiment 1 (index.knn.approximate_threshold = -1 ):

Indexing

P50 Service time comparison

P90 Service time comparison

Max CPU Utilization

Max throughput (docs/secs)

Observation

As expected, if we don’t build graph as part of indexing process, the indexing latency improves compare to default set up where we build index while flushing segments.

P90 Service Time

Total Num Vectors	P90 latency with graph (ms)	P90 latency without graph(ms)	Improvement in percentage
1K	71	44	38.03
2k	63	42	33.33
5K	47	40	14.89
10K	55	43	21.82
15K	63	43	31.75
25K	48	40	16.67
50K	46	43.6	5.22
100K	63	42.6	32.38

CPU Utilization

Total Num Vectors	With Graph	Without Graph	Improvement in percentage
1K	10	1	90
2k	10	1	90
5K	9	6	33.33
10K	9	6	33.33
15K	13	6	53.85
25K	13	7	46.15
50K	26	20	23.08
100K	27	22	18.52

Search

For search we conducted two sets of experiment with 1 client and 10 parallel clients to capture trend when more clients are used. Also, before running search, indices are already available in memory using warm up api.

P50 Service Time (client = 1 )

P90 Service time (client = 1 )

Max Cpu Utilization (client = 1 )

Max throughput (client = 1 )

P90 Service time (client = 10)

Max CPU Utilization

Observation

As expected Approximate search performs better once number of vectors increases. In the above experiment, once the number of vectors grows above 50K, the exact search performs poor compared to approx search. In below metrics, positive represents improved behavior, negative means degradation from default behavior

P90 Service Time ( client = 1 )

Total Num Vectors	P90 latency with graph (ms)	P90 latency without graph (ms)	Difference (ms)
1K	4.5	4	0.5
2k	5	4.8	0.2
5K	5.5	5.4	0.1
10K	5.8	5.7	0.1
15K	5.9	5.8	0.1
25K	8.8	7.8	1
50K	10.8	11.1	-0.3
100K	13.2	22	-8.8

CPU Utilization ( client = 1 )

Total Num Vectors	With Graph	Without Graph	Difference
1K	5	5	0
2k	11	5	6
5K	12	7	5
10K	8	7	1
15K	10	7	3
25K	20	16	4
50K	30	25	5
100K	30	37.88	-7.88

Conclusion

Based on Indexing and Search comparison, decided to go with 15K as default threshold

navneet1v · 2024-10-31T17:51:45Z

Closing this GH issue as the feature will be launched with 2.18 version of Opensearch

navneet1v added untriaged enhancement labels Aug 8, 2024

navneet1v assigned VijayanB Aug 8, 2024

navneet1v removed the untriaged label Aug 8, 2024

navneet1v mentioned this issue Aug 8, 2024

[META] [Build-Time] Improving Build time for Vector Indices #1599

Open

7 tasks

navneet1v added the indexing-improvements This label should be attached to all the github issues which will help improving the indexing time. label Aug 8, 2024

navneet1v added this to Vector Search RoadMap Aug 8, 2024

github-project-automation bot moved this to Backlog in Vector Search RoadMap Aug 8, 2024

navneet1v added the Features Introduces a new unit of functionality that satisfies a requirement label Aug 8, 2024

VijayanB mentioned this issue Aug 27, 2024

Introduce new setting to configure when to build graph during segment creation #2007

Merged

5 tasks

This was referenced Sep 24, 2024

Add exact search if no native engine files are available #2136

Merged

Add integration test for binary vector values #2142

Merged

KNNIterators should support with and without filters #2155

Merged

This was referenced Oct 1, 2024

Add support for radial search in exact search #2174

Merged

Allow build graph greedily for quantization scenarios #2175

Merged

Add support to build vector data structures greedily and perform exact search when there are no engine files #2188

Merged

VijayanB added v2.18.0 Roadmap:Vector Database/GenAI Project-wide roadmap label labels Oct 4, 2024

VijayanB moved this from Backlog to 2.18.0 in Vector Search RoadMap Oct 4, 2024

VijayanB mentioned this issue Oct 9, 2024

[DOC] Vector Search now supports new advanced setting to improve indexing performance opensearch-project/documentation-website#8482

Closed

4 tasks

This was referenced Oct 22, 2024

Update approximate_threshold to 15K documents #2229

Merged

Select index settings based on cluster version #2236

Merged

navneet1v closed this as completed Oct 31, 2024

github-project-automation bot moved this from 2.18.0 to ✅ Done in Vector Search RoadMap Oct 31, 2024

navneet1v mentioned this issue Oct 31, 2024

[FEATURE] Separate refresh interval for KNN segment merge #1577

Closed

navneet1v mentioned this issue Nov 28, 2024

[RFC] : Boosting OpenSearch Vector Engine Performance using GPUs #2293

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[FEATURE] Creating Vector data structures Greedily #1942

[FEATURE] Creating Vector data structures Greedily #1942

navneet1v commented Aug 8, 2024 •

edited

Loading

navneet1v commented Aug 8, 2024

navneet1v commented Aug 28, 2024

VijayanB commented Oct 10, 2024

VijayanB commented Oct 25, 2024

navneet1v commented Oct 31, 2024

[FEATURE] Creating Vector data structures Greedily #1942

[FEATURE] Creating Vector data structures Greedily #1942

Comments

navneet1v commented Aug 8, 2024 • edited Loading

Description

navneet1v commented Aug 8, 2024

navneet1v commented Aug 28, 2024

VijayanB commented Oct 10, 2024

Proposal

1. Keep “approximate_threshold” as dynamic index setting

Example:

Pros:

Cons

2. Keep “approximate_threshold” as mapping parameter

Example:

Pros

Cons

3. Combine 1 & 2

Pros

Cons

Finalized approach

VijayanB commented Oct 25, 2024

Default value for Approximate threshold index setting

Objective

Experiment Configuration

Cluster Configuration

Index Configuration

Workload Configuration

Experiment 1 (index.knn.approximate_threshold = -1 ):

Indexing

P50 Service time comparison

P90 Service time comparison

Max CPU Utilization

Max throughput (docs/secs)

Observation

Search

P50 Service Time (client = 1 )

P90 Service time (client = 1 )

Max Cpu Utilization (client = 1 )

Max throughput (client = 1 )

P90 Service time (client = 10)

Max CPU Utilization

Observation

Conclusion

navneet1v commented Oct 31, 2024

navneet1v commented Aug 8, 2024 •

edited

Loading