-
Notifications
You must be signed in to change notification settings - Fork 126
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Introduce new setting to configure when to build graph during segment creation #2007
Introduce new setting to configure when to build graph during segment creation #2007
Conversation
@@ -65,6 +65,7 @@ public class KNNSettings { | |||
* Settings name | |||
*/ | |||
public static final String KNN_SPACE_TYPE = "index.knn.space_type"; | |||
public static final String KNN_BUILD_GRAPH = "index.knn.build_graph"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
public static final String KNN_BUILD_GRAPH = "index.knn.build_graph"; | |
public static final String KNN_BUILD_GRAPH = "index.knn.build_vector_datastructure"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ACK
@@ -97,6 +98,7 @@ public class KNNSettings { | |||
*/ | |||
public static final boolean KNN_DEFAULT_FAISS_AVX2_DISABLED_VALUE = false; | |||
public static final String INDEX_KNN_DEFAULT_SPACE_TYPE = "l2"; | |||
public static final boolean INDEX_KNN_DEFAULT_BUILD_GRAPH = true; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
rename this as per the new setting name I suggested on top.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ACK
private boolean getBuildGraphSetting() { | ||
// Default should be true, only explicit false should disable building graph for | ||
// native engine | ||
if (mapperService.isEmpty()) { | ||
return true; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if MapperService is empty the code should have crashed in line 77. I think we should get the mapperservice from line 77 and pass it in the function.
return true; | ||
} | ||
IndexSettings indexSettings = mapperService.get().getIndexSettings(); | ||
return indexSettings.getValue(KNNSettings.INDEX_KNN_BUILD_GRAPH_SETTING); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we should pass a default value if the setting is not found.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
only index that are created 2.17 and above will have the ability to pause building graph
Do we need a index version check in this? If an index is created in 2.16 and then when the cluster is upgraded this will still allow the feature right?
this(lucene99FlatVectorsFormat, true); | ||
} | ||
|
||
public NativeEngines990KnnVectorsFormat(final FlatVectorsFormat lucene99FlatVectorsFormat, boolean buildGraphSetting) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
public NativeEngines990KnnVectorsFormat(final FlatVectorsFormat lucene99FlatVectorsFormat, boolean buildGraphSetting) { | |
public NativeEngines990KnnVectorsFormat(final FlatVectorsFormat lucene99FlatVectorsFormat, boolean buildVectorDataStructure) { |
if (!buildGraph) { | ||
return; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we should have an info log here. Other wise we will never know if the vector data structures are created or not.
if (!buildGraph) { | ||
return; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same as above.
protected Settings buildKNNSettingsWithBuildGraphProperty(boolean buildGraph) { | ||
return Settings.builder() | ||
.put("number_of_shards", 1) | ||
.put("number_of_replicas", 1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why 1 replica? this will keep the cluster yellow. If you want to test with more than 1 shard do that.
@VijayanB What is behavior around search before, mixed and after this setting is set from false to true? Are some segments search via exact and others not? Also, why is this an index setting and not field mapping parameter? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@VijayanB The primary advantage seems to be reduce the CPU spike before merge, Are we seeing any improvements during force merge? Can you add the CPU usage of before and after with this PR? we probably need one for index and one for force merge
Overall can we have a summary of alternative solutions considered? I am a bit hesitant to add a user exposed setting which does not give the improvements out of the box.
return true; | ||
} | ||
IndexSettings indexSettings = mapperService.get().getIndexSettings(); | ||
return indexSettings.getValue(KNNSettings.INDEX_KNN_BUILD_GRAPH_SETTING); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
only index that are created 2.17 and above will have the ability to pause building graph
Do we need a index version check in this? If an index is created in 2.16 and then when the cluster is upgraded this will still allow the feature right?
I could not make a connection between the issue #1942 and this PR. How is this PR going to be used for the issue? |
@heemin32 with this change the solution builds the capability to completely stop the Vector DataStructure creation, which is Now the GH issue says that graph creation should be done even after force_merge, but as of now it cannot be done, because it requires coming with new API for doing merges. Plus the way force_merge works with 1 segment use-case is it doesn't build temporary segments and directly merges to 1 segment. So with this capability the idea would be if we are doing indexing once, we ingest data by disabling graph creation and then enable the setting. After that we do force_merge to 1 segment for best performance for index build. The next part of the issue will be to have a vector threshold based mechanism to see when to build the graph and when not to. To avoid confusion I added more details on the GH issue. |
* build_vector_datastructure - This parameter determines whether to build vector data structure for knn fields during indexing | ||
* and merging. | ||
*/ | ||
public static final Setting<Boolean> INDEX_KNN_BUILD_VECTOR_DATASTRUCTURE = Setting.boolSetting( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should we build this setting as int rather than bool, so that when we have threshold based graph build same setting can be used.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we can min value as -1 and max 1 to start with. and in next time we can increase the max value to be very high
if (!buildVectorDatastructure) { | ||
log.info("Skip building vector datastructure"); | ||
return; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we should add a codec level tests which will ensure that if this setting is true in reality graphs are not created.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Checking
As a part of incremental graph build approach, this PR will allow user to have ability to pause building graph before completing ingestion process. Before this feature, knn plugin will build graph every time during flush, which contributes to the high total indexing time whenever new segments are being created. If user is going to merge segments after ingestion, those graphs which are created during segment flush will be discarded and new graph will be built from scratch during force merge. This gives additional control to users who wishes to reduce indexing time if their use case doesn't require building graph on every segment creation step. As follow up, we will provide automated solution to decide whether to build graph or not based on number of vectors in given segment. We will introduce a threshold parameter and will let flush/merge decide whether to build graph or not accordingly. |
The primary advantage is to provide more control to users on when to build vector data structures like graph. In current approach, CPU utilization for indexing is high due to graph creation process which will be later discarded if force merge is executed after ingestion. This setting will allow users to disable graph building step while ingestion ( this will reduce CPU utilization, which in turn improve ingestion throughput), and later update setting before merge in order to build graph as part of merge step, just before search operation. |
d6a4a05
to
530dda8
Compare
db80d04
to
d1cfae5
Compare
Updated this PR's base to feature branch |
d1cfae5
to
5700d62
Compare
Previosuly we were not creating hnsw file on segment flush for faiss engine. After successfully integrating hnsw file creation, we forgot to update unit test. Here, we will confirm that required files are being created based on field type. Signed-off-by: Vijayan Balasubramanian <[email protected]>
39807c5
to
f39ee0a
Compare
Added new updatable index setting "build_vector_data_structure_threshold", which will be considered when to build braph or not for native engines. This is noop for lucene. This depends on use lucene format as prerequisite. We don't need to add flag since it is only enable if lucene format is already enabled. Signed-off-by: Vijayan Balasubramanian <[email protected]>
b53e6d3
to
95b6b89
Compare
KNNVectorValues<?> knnVectorValues = getVectorValues(vectorDataType, field.getDocsWithField(), field.getVectors()); | ||
if (totalLiveDocs == 0) { | ||
log.debug("[Flush] No live docs for field {}", fieldInfo.getName()); | ||
return; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The return
will break multi-field case since this is inside a for loop, I thought it was trivial enough to not add a test case, I was probably wrong. Maybe add List.of(Map.of(//some vectors), Map.emptyMap(), Map.of(//some vectors))
in the test case so we cover it (before/after) empty case in 1 case.
Same for line 107
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure. I will update the test case
} | ||
} | ||
|
||
public void testFlush_whenThresholdIsEqualToNumberOfVectors_thenNativeIndexWriterIsCalled() throws IOException { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we add a case where vectorSize > threshold and verify only threshold vectors are added?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added unit test
...a/org/opensearch/knn/index/codec/KNN990Codec/NativeEngines990KnnVectorsWriterFlushTests.java
Outdated
Show resolved
Hide resolved
95b6b89
to
32cff9b
Compare
Signed-off-by: Vijayan Balasubramanian <[email protected]>
32cff9b
to
4f5cc31
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
@@ -6,6 +6,7 @@ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/), | |||
|
|||
## [Unreleased 3.0](https://github.com/opensearch-project/k-NN/compare/2.x...HEAD) | |||
### Features | |||
* Introduce new setting to configure whether to build vector data structure or not during segment creation [#2007](https://github.com/opensearch-project/k-NN/pull/2007) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why this change log is added under unreleased 3.0?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch. I will add it to 2.x when i create PR to 2.x Currently this PR is raised against feature branch
@@ -66,6 +66,7 @@ public class KNNSettings { | |||
* Settings name | |||
*/ | |||
public static final String KNN_SPACE_TYPE = "index.knn.space_type"; | |||
public static final String INDEX_KNN_BUILD_VECTOR_DATA_STRUCTURE_THRESHOLD = "index.knn.build_vector_data_structure_threshold"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not a blocker, but I think we need to rename this setting. Need to brainstorm some ideas, but this is too verbose.
Some thoughts
- index.knn.exact.threshold
- index.ann.threshold
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@VijayanB lets take this as AI to conclude on the naming.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure.
c1c7040
into
opensearch-project:feature/build-vector-ds-greedily
… creation (opensearch-project#2007) * Fix native engine vector format test (opensearch-project#2103) Previosuly we were not creating hnsw file on segment flush for faiss engine. After successfully integrating hnsw file creation, we forgot to update unit test. Here, we will confirm that required files are being created based on field type. Signed-off-by: Vijayan Balasubramanian <[email protected]> * Introduce a setting to control whether to build graph Added new updatable index setting "build_vector_data_structure_threshold", which will be considered when to build braph or not for native engines. This is noop for lucene. This depends on use lucene format as prerequisite. We don't need to add flag since it is only enable if lucene format is already enabled. Signed-off-by: Vijayan Balasubramanian <[email protected]>
… creation (#2007) Added new updatable index setting "build_vector_data_structure_threshold", which will be considered when to build braph or not for native engines. This is noop for lucene. This depends on use lucene format as prerequisite. We don't need to add flag since it is only enable if lucene format is already enabled. Signed-off-by: Vijayan Balasubramanian <[email protected]>
… creation (#2007) Added new updatable index setting "build_vector_data_structure_threshold", which will be considered when to build braph or not for native engines. This is noop for lucene. This depends on use lucene format as prerequisite. We don't need to add flag since it is only enable if lucene format is already enabled. Signed-off-by: Vijayan Balasubramanian <[email protected]>
… creation (#2007) Added new updatable index setting "build_vector_data_structure_threshold", which will be considered when to build braph or not for native engines. This is noop for lucene. This depends on use lucene format as prerequisite. We don't need to add flag since it is only enable if lucene format is already enabled. Signed-off-by: Vijayan Balasubramanian <[email protected]>
… creation (opensearch-project#2007) Added new updatable index setting "build_vector_data_structure_threshold", which will be considered when to build braph or not for native engines. This is noop for lucene. This depends on use lucene format as prerequisite. We don't need to add flag since it is only enable if lucene format is already enabled. Signed-off-by: Vijayan Balasubramanian <[email protected]>
… creation (opensearch-project#2007) Added new updatable index setting "build_vector_data_structure_threshold", which will be considered when to build braph or not for native engines. This is noop for lucene. This depends on use lucene format as prerequisite. We don't need to add flag since it is only enable if lucene format is already enabled. Signed-off-by: Vijayan Balasubramanian <[email protected]>
… creation (opensearch-project#2007) Added new updatable index setting "build_vector_data_structure_threshold", which will be considered when to build braph or not for native engines. This is noop for lucene. This depends on use lucene format as prerequisite. We don't need to add flag since it is only enable if lucene format is already enabled. Signed-off-by: Vijayan Balasubramanian <[email protected]>
…t search when there are no engine files (#2188) * Introduce new setting to configure when to build graph during segment creation (#2007) Added new updatable index setting "build_vector_data_structure_threshold", which will be considered when to build braph or not for native engines. This is noop for lucene. This depends on use lucene format as prerequisite. We don't need to add flag since it is only enable if lucene format is already enabled. Signed-off-by: Vijayan Balasubramanian <[email protected]> * Add integration test for binary vector values (#2142) Signed-off-by: Vijayan Balasubramanian <[email protected]> * Allow build graph greedily for quantization scenarios (#2175) Previosuly we only added support to build greedily for non quantization scenario. In this commit, we can remove that constraint, however, we cannot skip writing quanitization state since it is required irrespective of type of search is executed later. Signed-off-by: Vijayan Balasubramanian <[email protected]> * Add exact search if no native engine files are available (#2136) * Add exact search if no engine files are in segments When graph is not available, plugin will return empty results. With this change, exact search will be performed when only no engine file is available in segment. We also don't need version check or feature flag because, option to not build vector data structure will only be available post 2.17. If an index is created using pre 2.17 version, segment will always have engine files and this feature will never be called during search. --------- Signed-off-by: Vijayan Balasubramanian <[email protected]> * Add support for radial search in exact search (#2174) * Add support for radial search in exact search When threshold value is set, knn plugin will not be creating graph. Hence, when search request is trigged during that time, exact search will return valid results. However, radial search was never included as part of exact search. This will break radial search when threshold is added and radial search is requested. In this commit, new method is introduced to accept min score and return documents that are greater than min score, similar to how radial search is performed by native engines. This search is independent of engine, but, radial search is supported only for FAISS engine out of all native engines. Signed-off-by: Vijayan Balasubramanian <[email protected]> --------- Signed-off-by: Vijayan Balasubramanian <[email protected]>
…t search when there are no engine files (#2188) * Introduce new setting to configure when to build graph during segment creation (#2007) Added new updatable index setting "build_vector_data_structure_threshold", which will be considered when to build braph or not for native engines. This is noop for lucene. This depends on use lucene format as prerequisite. We don't need to add flag since it is only enable if lucene format is already enabled. Signed-off-by: Vijayan Balasubramanian <[email protected]> * Add integration test for binary vector values (#2142) Signed-off-by: Vijayan Balasubramanian <[email protected]> * Allow build graph greedily for quantization scenarios (#2175) Previosuly we only added support to build greedily for non quantization scenario. In this commit, we can remove that constraint, however, we cannot skip writing quanitization state since it is required irrespective of type of search is executed later. Signed-off-by: Vijayan Balasubramanian <[email protected]> * Add exact search if no native engine files are available (#2136) * Add exact search if no engine files are in segments When graph is not available, plugin will return empty results. With this change, exact search will be performed when only no engine file is available in segment. We also don't need version check or feature flag because, option to not build vector data structure will only be available post 2.17. If an index is created using pre 2.17 version, segment will always have engine files and this feature will never be called during search. --------- Signed-off-by: Vijayan Balasubramanian <[email protected]> * Add support for radial search in exact search (#2174) * Add support for radial search in exact search When threshold value is set, knn plugin will not be creating graph. Hence, when search request is trigged during that time, exact search will return valid results. However, radial search was never included as part of exact search. This will break radial search when threshold is added and radial search is requested. In this commit, new method is introduced to accept min score and return documents that are greater than min score, similar to how radial search is performed by native engines. This search is independent of engine, but, radial search is supported only for FAISS engine out of all native engines. Signed-off-by: Vijayan Balasubramanian <[email protected]> --------- Signed-off-by: Vijayan Balasubramanian <[email protected]> (cherry picked from commit 5a56829)
…nd perform exact search when there are no engine files (#2201) * Add support to build vector data structures greedily and perform exact search when there are no engine files (#2188) * Introduce new setting to configure when to build graph during segment creation (#2007) Added new updatable index setting "build_vector_data_structure_threshold", which will be considered when to build braph or not for native engines. This is noop for lucene. This depends on use lucene format as prerequisite. We don't need to add flag since it is only enable if lucene format is already enabled. Signed-off-by: Vijayan Balasubramanian <[email protected]> * Add integration test for binary vector values (#2142) Signed-off-by: Vijayan Balasubramanian <[email protected]> * Allow build graph greedily for quantization scenarios (#2175) Previosuly we only added support to build greedily for non quantization scenario. In this commit, we can remove that constraint, however, we cannot skip writing quanitization state since it is required irrespective of type of search is executed later. Signed-off-by: Vijayan Balasubramanian <[email protected]> * Add exact search if no native engine files are available (#2136) * Add exact search if no engine files are in segments When graph is not available, plugin will return empty results. With this change, exact search will be performed when only no engine file is available in segment. We also don't need version check or feature flag because, option to not build vector data structure will only be available post 2.17. If an index is created using pre 2.17 version, segment will always have engine files and this feature will never be called during search. --------- Signed-off-by: Vijayan Balasubramanian <[email protected]> * Add support for radial search in exact search (#2174) * Add support for radial search in exact search When threshold value is set, knn plugin will not be creating graph. Hence, when search request is trigged during that time, exact search will return valid results. However, radial search was never included as part of exact search. This will break radial search when threshold is added and radial search is requested. In this commit, new method is introduced to accept min score and return documents that are greater than min score, similar to how radial search is performed by native engines. This search is independent of engine, but, radial search is supported only for FAISS engine out of all native engines. Signed-off-by: Vijayan Balasubramanian <[email protected]> --------- Signed-off-by: Vijayan Balasubramanian <[email protected]> (cherry picked from commit 5a56829) * Fix compilation issue due to package error Signed-off-by: Vijayan Balasubramanian <[email protected]> --------- Signed-off-by: Vijayan Balasubramanian <[email protected]> Co-authored-by: Vijayan Balasubramanian <[email protected]>
Description
Added new updatable index setting "build_vector_data_structure_threshold", which will be considered when to build relevant vector data structure or not for native engines before creating and merging segments. The default value is chosen to be 0 to make it backward compatible. This is applicable only for native engines, and, it is no-op for lucene.
We also don't need a feature flag since implementation is behind NativeEngines990KnnVectorsFormat. Hence, only index that are created 2.17 and above will have the ability to pause building graph while indexing or merging based on this updatable index setting.
In this PR, we introduced a setting to decide when to build the graph with min as -1 ( no graph will be built ) and Integer.MAX_Value - 2 ( as max value up to which user can provide ). Currently the default value is 0 ( build graph always ). We will be updating this value to a reasonable value ( greater than 0) to optimize when should we build graph if user didn't provide this value as next PR.
Similarly, during search, if no graph is available, we will call exact search for conssitency.
Related Issues
Part of #1942
Check List
--signoff
.By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.