Introduce new setting to configure when to build graph during segment creation #2007

Merged
1 change: 1 addition & 0 deletions CHANGELOG.md
@@ -6,6 +6,7 @@ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),

## [Unreleased 3.0](https://github.com/opensearch-project/k-NN/compare/2.x...HEAD)
### Features
* Introduce new setting to configure whether to build vector data structure or not during segment creation [#2007](https://github.com/opensearch-project/k-NN/pull/2007)
Member:

Why is this changelog entry added under Unreleased 3.0?

Member Author:

Good catch. I will add it to 2.x when I create the PR to 2.x. Currently this PR is raised against a feature branch.

### Enhancements
### Bug Fixes
### Infrastructure
20 changes: 20 additions & 0 deletions src/main/java/org/opensearch/knn/index/KNNSettings.java
@@ -66,6 +66,7 @@ public class KNNSettings {
* Settings name
*/
public static final String KNN_SPACE_TYPE = "index.knn.space_type";
public static final String INDEX_KNN_BUILD_VECTOR_DATA_STRUCTURE_THRESHOLD = "index.knn.build_vector_data_structure_threshold";
Member:

Not a blocker, but I think we need to rename this setting. Need to brainstorm some ideas, but this is too verbose.

Some thoughts:

  1. index.knn.exact.threshold
  2. index.ann.threshold

Collaborator:

@VijayanB let's take this as an action item to conclude on the naming.

Member Author:

Sure.

public static final String KNN_ALGO_PARAM_M = "index.knn.algo_param.m";
public static final String KNN_ALGO_PARAM_EF_CONSTRUCTION = "index.knn.algo_param.ef_construction";
public static final String KNN_ALGO_PARAM_EF_SEARCH = "index.knn.algo_param.ef_search";
@@ -92,6 +93,9 @@
*/
public static final boolean KNN_DEFAULT_FAISS_AVX2_DISABLED_VALUE = false;
public static final String INDEX_KNN_DEFAULT_SPACE_TYPE = "l2";
public static final Integer INDEX_KNN_DEFAULT_BUILD_VECTOR_DATA_STRUCTURE_THRESHOLD = 0;
public static final Integer INDEX_KNN_BUILD_VECTOR_DATA_STRUCTURE_THRESHOLD_MIN = -1;
public static final Integer INDEX_KNN_BUILD_VECTOR_DATA_STRUCTURE_THRESHOLD_MAX = Integer.MAX_VALUE - 2;
public static final String INDEX_KNN_DEFAULT_SPACE_TYPE_FOR_BINARY = "hamming";
public static final Integer INDEX_KNN_DEFAULT_ALGO_PARAM_M = 16;
public static final Integer INDEX_KNN_DEFAULT_ALGO_PARAM_EF_SEARCH = 100;
@@ -131,6 +135,21 @@ public class KNNSettings {
Setting.Property.Deprecated
);

/**
 * build_vector_data_structure_threshold - This parameter determines when to build the vector data structure for knn
 * fields during indexing and merging. Setting it to -1 (the minimum) skips building the graph entirely; for any other
 * value, the graph is built when the number of live docs in the segment is greater than or equal to this threshold.
 * Since the max number of documents in a segment can be Integer.MAX_VALUE - 1, this setting allows the threshold to
 * be up to one less than the max number of documents in a segment.
 */
public static final Setting<Integer> INDEX_KNN_BUILD_VECTOR_DATA_STRUCTURE_THRESHOLD_SETTING = Setting.intSetting(
INDEX_KNN_BUILD_VECTOR_DATA_STRUCTURE_THRESHOLD,
INDEX_KNN_DEFAULT_BUILD_VECTOR_DATA_STRUCTURE_THRESHOLD,
INDEX_KNN_BUILD_VECTOR_DATA_STRUCTURE_THRESHOLD_MIN,
INDEX_KNN_BUILD_VECTOR_DATA_STRUCTURE_THRESHOLD_MAX,
IndexScope,
Dynamic
);
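To make the threshold semantics concrete, here is a minimal standalone sketch (not part of the PR; `shouldBuildGraph` is a hypothetical helper) that mirrors the behavior described above: -1 disables graph builds entirely, while any other value builds the graph once a segment's live-doc count reaches the threshold.

```java
// Illustrative sketch only: models the semantics of
// index.knn.build_vector_data_structure_threshold as described above.
// shouldBuildGraph is a hypothetical helper, not plugin code.
public class ThresholdSemanticsDemo {
    public static boolean shouldBuildGraph(int threshold, int liveDocs) {
        if (threshold < 0) {
            // -1 (the minimum) disables building the vector data structure
            return false;
        }
        // otherwise, build once the segment has at least `threshold` live docs
        return liveDocs >= threshold;
    }

    public static void main(String[] args) {
        System.out.println(shouldBuildGraph(-1, 1_000_000)); // false: builds disabled
        System.out.println(shouldBuildGraph(0, 1));          // true: default 0 always builds
        System.out.println(shouldBuildGraph(500, 499));      // false: below threshold
        System.out.println(shouldBuildGraph(500, 500));      // true: threshold reached
    }
}
```

With the default of 0, behavior is unchanged from before the PR: every flush with live docs builds the graph.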

/**
* M - the number of bi-directional links created for every new element during construction.
* Reasonable range for M is 2-100. Higher M work better on datasets with high intrinsic
@@ -447,6 +466,7 @@ private Setting<?> getSetting(String key) {
public List<Setting<?>> getSettings() {
List<Setting<?>> settings = Arrays.asList(
INDEX_KNN_SPACE_TYPE,
INDEX_KNN_BUILD_VECTOR_DATA_STRUCTURE_THRESHOLD_SETTING,
INDEX_KNN_ALGO_PARAM_M_SETTING,
INDEX_KNN_ALGO_PARAM_EF_CONSTRUCTION_SETTING,
INDEX_KNN_ALGO_PARAM_EF_SEARCH_SETTING,
@@ -11,7 +11,9 @@
import org.apache.lucene.codecs.hnsw.FlatVectorScorerUtil;
import org.apache.lucene.codecs.lucene99.Lucene99FlatVectorsFormat;
import org.apache.lucene.codecs.perfield.PerFieldKnnVectorsFormat;
import org.opensearch.index.IndexSettings;
import org.opensearch.index.mapper.MapperService;
import org.opensearch.knn.index.KNNSettings;
import org.opensearch.knn.index.codec.KNN990Codec.NativeEngines990KnnVectorsFormat;
import org.opensearch.knn.index.codec.params.KNNScalarQuantizedVectorsFormatParams;
import org.opensearch.knn.index.codec.params.KNNVectorsFormatParams;
@@ -129,7 +131,21 @@ public KnnVectorsFormat getKnnVectorsFormatForField(final String field) {
}

private NativeEngines990KnnVectorsFormat nativeEngineVectorsFormat() {
int buildVectorDatastructureThreshold = getBuildVectorDatastructureThresholdSetting(mapperService.get());
return new NativeEngines990KnnVectorsFormat(
    new Lucene99FlatVectorsFormat(FlatVectorScorerUtil.getLucene99FlatVectorsScorer()),
    buildVectorDatastructureThreshold
);
}

private int getBuildVectorDatastructureThresholdSetting(final MapperService knnMapperService) {
final IndexSettings indexSettings = knnMapperService.getIndexSettings();
final Integer buildVectorDatastructureThreshold = indexSettings.getValue(
KNNSettings.INDEX_KNN_BUILD_VECTOR_DATA_STRUCTURE_THRESHOLD_SETTING
);
return buildVectorDatastructureThreshold != null
? buildVectorDatastructureThreshold
: KNNSettings.INDEX_KNN_DEFAULT_BUILD_VECTOR_DATA_STRUCTURE_THRESHOLD;
}

@Override
@@ -19,6 +19,7 @@
import org.apache.lucene.codecs.lucene99.Lucene99FlatVectorsFormat;
import org.apache.lucene.index.SegmentReadState;
import org.apache.lucene.index.SegmentWriteState;
import org.opensearch.knn.index.KNNSettings;

import java.io.IOException;

@@ -30,15 +31,20 @@ public class NativeEngines990KnnVectorsFormat extends KnnVectorsFormat {
/** The format for storing, reading, merging vectors on disk */
private static FlatVectorsFormat flatVectorsFormat;
private static final String FORMAT_NAME = "NativeEngines990KnnVectorsFormat";
private static int buildVectorDatastructureThreshold;

public NativeEngines990KnnVectorsFormat() {
    this(new Lucene99FlatVectorsFormat(new DefaultFlatVectorScorer()));
}

public NativeEngines990KnnVectorsFormat(final FlatVectorsFormat flatVectorsFormat) {
    this(flatVectorsFormat, KNNSettings.INDEX_KNN_DEFAULT_BUILD_VECTOR_DATA_STRUCTURE_THRESHOLD);
}

public NativeEngines990KnnVectorsFormat(final FlatVectorsFormat flatVectorsFormat, int buildVectorDatastructureThreshold) {
    super(FORMAT_NAME);
    NativeEngines990KnnVectorsFormat.flatVectorsFormat = flatVectorsFormat;
    NativeEngines990KnnVectorsFormat.buildVectorDatastructureThreshold = buildVectorDatastructureThreshold;
}

/**
@@ -48,7 +54,7 @@
*/
@Override
public KnnVectorsWriter fieldsWriter(final SegmentWriteState state) throws IOException {
return new NativeEngines990KnnVectorsWriter(state, flatVectorsFormat.fieldsWriter(state), buildVectorDatastructureThreshold);
}

/**
@@ -63,6 +69,12 @@

@Override
public String toString() {
return "NativeEngines990KnnVectorsFormat(name="
    + this.getClass().getSimpleName()
    + ", flatVectorsFormat="
    + flatVectorsFormat
    + ", buildVectorDatastructureThreshold="
    + buildVectorDatastructureThreshold
    + ")";
}
}
@@ -52,10 +52,16 @@ public class NativeEngines990KnnVectorsWriter extends KnnVectorsWriter {
private KNN990QuantizationStateWriter quantizationStateWriter;
private final List<NativeEngineFieldVectorsWriter<?>> fields = new ArrayList<>();
private boolean finished;
private final Integer buildVectorDataStructureThreshold;

public NativeEngines990KnnVectorsWriter(
    SegmentWriteState segmentWriteState,
    FlatVectorsWriter flatVectorsWriter,
    Integer buildVectorDataStructureThreshold
) {
this.segmentWriteState = segmentWriteState;
this.flatVectorsWriter = flatVectorsWriter;
this.buildVectorDataStructureThreshold = buildVectorDataStructureThreshold;
}

/**
@@ -83,24 +89,34 @@ public void flush(int maxDoc, final Sorter.DocMap sortMap) throws IOException {
final FieldInfo fieldInfo = field.getFieldInfo();
final VectorDataType vectorDataType = extractVectorDataType(fieldInfo);
int totalLiveDocs = getLiveDocs(getVectorValues(vectorDataType, field.getDocsWithField(), field.getVectors()));
if (totalLiveDocs == 0) {
    log.debug("[Flush] No live docs for field {}", fieldInfo.getName());
    continue;
}
KNNVectorValues<?> knnVectorValues = getVectorValues(vectorDataType, field.getDocsWithField(), field.getVectors());

final QuantizationState quantizationState = train(field.getFieldInfo(), knnVectorValues, totalLiveDocs);
// Will consider building vector data structure based on threshold only for non quantization indices
if (quantizationState == null && shouldSkipBuildingVectorDataStructure(totalLiveDocs)) {
    log.info(
        "Skip building vector data structure for field: {}, as liveDoc: {} is less than the threshold {} during flush",
        fieldInfo.name,
        totalLiveDocs,
        buildVectorDataStructureThreshold
    );
    continue;
}
final NativeIndexWriter writer = NativeIndexWriter.getWriter(fieldInfo, segmentWriteState, quantizationState);

knnVectorValues = getVectorValues(vectorDataType, field.getDocsWithField(), field.getVectors());

StopWatch stopWatch = new StopWatch().start();

writer.flushIndex(knnVectorValues, totalLiveDocs);

long time_in_millis = stopWatch.stop().totalTime().millis();
KNNGraphValue.REFRESH_TOTAL_TIME_IN_MILLIS.incrementBy(time_in_millis);
log.debug("Flush took {} ms for vector field [{}]", time_in_millis, fieldInfo.getName());
}
}

@@ -118,6 +134,16 @@

KNNVectorValues<?> knnVectorValues = getKNNVectorValuesForMerge(vectorDataType, fieldInfo, mergeState);
final QuantizationState quantizationState = train(fieldInfo, knnVectorValues, totalLiveDocs);
// Will consider building vector data structure based on threshold only for non quantization indices
if (quantizationState == null && shouldSkipBuildingVectorDataStructure(totalLiveDocs)) {
log.info(
"Skip building vector data structure for field: {}, as liveDoc: {} is less than the threshold {} during merge",
fieldInfo.name,
totalLiveDocs,
buildVectorDataStructureThreshold
);
return;
}
final NativeIndexWriter writer = NativeIndexWriter.getWriter(fieldInfo, segmentWriteState, quantizationState);

knnVectorValues = getKNNVectorValuesForMerge(vectorDataType, fieldInfo, mergeState);
@@ -240,4 +266,11 @@ private void initQuantizationStateWriterIfNecessary() throws IOException {
quantizationStateWriter.writeHeader(segmentWriteState);
}
}

private boolean shouldSkipBuildingVectorDataStructure(final long docCount) {
Collaborator (@shatejas, Sep 6, 2024):

This code path is being used for both merge and flush. Please merge from main; there is an operation name that needs to be considered (it should already be in the main branch) where this should always return false on merge.

Member Author:

@shatejas Are you suggesting that it should always return false on merge? If so, not necessarily: we don't want to build the graph during background merges either.

Collaborator:

I missed the background merge case. A few questions around background merges:

  1. When does it trigger?
  2. Is there a total indexing time latency impact if we allow it during background merge? If so, by how much?
  3. Is there a CPU impact if we allow background merges? What are the ballpark estimates for that?

Member Author (@VijayanB, Sep 13, 2024):

OpenSearch uses TieredMergePolicy to decide when to perform a merge. IIRC, segmentsPerTier controls how many segments are allowed per tier. When the segment count goes beyond that limit in a tier, it will run merges in the background. Choosing this value is generally a trade-off of indexing speed vs. searching speed. Having fewer segments improves search, since Lucene needs to look into fewer segments, but this increases indexing time because of more background merges. @navneet1v Please add more context if I missed anything. You can find the default values here: https://github.com/opensearch-project/OpenSearch/blob/main/server/src/main/java/org/opensearch/index/TieredMergePolicyProvider.java#L138

Collaborator:

Thanks for the link @VijayanB

  1. I am trying to grok why merges are behind this setting, hence the questions on indexing time and CPU usage. Keeping background merges enabled overall seems to improve the user experience in a few ways (please point out if I am missing something or incorrect):
  • Users can have a set-and-forget setting: once the user sets 0 or a threshold value, they don't have to change it for force merges and then change it back when they want to index again.
  • Better search experience: background merges may create some graphs, which will lower latencies if users choose to search during indexing, as it won't always hit exact search.
  • Force merge becomes optional: since background merges will create graphs, guidance on force merge can be optional, or enforced only for cases where the threshold is low/0.
  2. Can the default threshold be kept at 0 if merge is kept enabled and the trigger for merge is predictable? This is just a thought to see if it's possible, as this would give the maximum total indexing time benefit for all users.

Collaborator:

> I am trying to grok why merges are behind this setting, and thus the questions on indexing time and CPU usage. Keeping background merges enabled overall seems to better the user experience in a few ways

No, it is the correct experience. It just defeats the overall purpose of why this change is being added in the first place.

> Users can have a set and forget setting - once the user sets 0 or a threshold value they don't have to change it for force merges, and then change it back when they want to index again.

This is what threshold-based graph creation will do. If users want to do graph creation only during merges, they can keep the threshold high.

> Better search experience - Background merges may create some graphs, which will lower the latencies if users chose to search during indexing as it won't always hit exact search

If a user is looking to do that, it's better to use the threshold.

> Force merge optional - Since background merges will create graphs, guidance on force merge can be optional/enforced only for cases where threshold is low/0.

No, it is not optional, because even when you have small segments it's not necessary that merges will happen.

> Can the default threshold be kept 0 if merge is kept enabled and the trigger for merge is predictable?

The trigger for merge is very much dependent on indexing speed. You might not have merges for one whole day, but the moment you add one more doc to the index the merges may get triggered, so it really depends on the state of the shard, aka the Lucene index.

Since there is no formal documentation or wiki around merges, I would recommend reading this blog: https://blog.mikemccandless.com/2011/02/visualizing-lucenes-segment-merges.html which is an old blog from one of the Lucene maintainers and might help resolve some of the doubts you have related to merges.

Collaborator (@shatejas, Sep 14, 2024):

Understood, so it seems like we need the user to change the setting for force merge.

Collaborator:

Yes.

Member Author:

@shatejas Added unit test by rebasing from main.

if (buildVectorDataStructureThreshold < 0) {
return true;
}
return docCount < buildVectorDataStructureThreshold;
}
}
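To summarize the gating this writer applies during flush and merge, the following standalone sketch (hypothetical names, not the plugin's API; it assumes the behavior shown in the diff above) shows how the threshold interacts with quantization: fields with a quantization state always build their data structure, while non-quantized fields consult the threshold.

```java
// Illustrative sketch of the flush/merge gating shown above; class and
// method names are hypothetical, not the plugin's API.
public class BuildGateDemo {
    // Mirrors shouldSkipBuildingVectorDataStructure in the diff above.
    public static boolean shouldSkip(int threshold, long docCount) {
        if (threshold < 0) {
            return true; // -1 disables graph builds
        }
        return docCount < threshold;
    }

    // The threshold is consulted only when no quantization state exists.
    public static boolean willBuild(Object quantizationState, int threshold, long liveDocs) {
        if (quantizationState != null) {
            return true; // quantized fields bypass the threshold
        }
        return !shouldSkip(threshold, liveDocs);
    }

    public static void main(String[] args) {
        System.out.println(willBuild(new Object(), 1000, 500)); // true: quantized field
        System.out.println(willBuild(null, 1000, 500));         // false: below threshold
        System.out.println(willBuild(null, -1, 5_000_000));     // false: builds disabled
        System.out.println(willBuild(null, 0, 5));              // true: default threshold
    }
}
```

Note that per the review discussion above, the same gate applies on background merges, which is why users are advised to lower the threshold (or force merge) when they want graphs built.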