Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor Around Mapper and Mapping #1939

Merged
merged 5 commits into from
Aug 10, 2024
Merged
Show file tree
Hide file tree
Changes from 1 commit
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
3 changes: 2 additions & 1 deletion CHANGELOG.md
Original file line number Diff line number Diff line change
Expand Up @@ -31,4 +31,5 @@ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
* Refactor method structure and definitions [#1920](https://github.com/opensearch-project/k-NN/pull/1920)
* Refactor KNNVectorFieldType from KNNVectorFieldMapper to a separate class for better readability. [#1931](https://github.com/opensearch-project/k-NN/pull/1931)
* Generalize lib interface to return context objects [#1925](https://github.com/opensearch-project/k-NN/pull/1925)
* Move k search k-NN query to re-write phase of vector search query for Native Engines [#1877](https://github.com/opensearch-project/k-NN/pull/1877)
* Move k search k-NN query to re-write phase of vector search query for Native Engines [#1877](https://github.com/opensearch-project/k-NN/pull/1877)
* Restructure mappers to better handle null cases and avoid branching in parsing [#1939](https://github.com/opensearch-project/k-NN/pull/1939)
Original file line number Diff line number Diff line change
Expand Up @@ -13,6 +13,8 @@
import org.opensearch.knn.index.codec.params.KNNScalarQuantizedVectorsFormatParams;
import org.opensearch.knn.index.codec.params.KNNVectorsFormatParams;
import org.opensearch.knn.index.engine.KNNEngine;
import org.opensearch.knn.index.engine.KNNMethodContext;
import org.opensearch.knn.index.mapper.KNNMappingConfig;
import org.opensearch.knn.index.mapper.KNNVectorFieldType;

import java.util.Optional;
Expand Down Expand Up @@ -66,16 +68,19 @@ public KnnVectorsFormat getKnnVectorsFormatForField(final String field) {
);
return defaultFormatSupplier.get();
}
var type = (KNNVectorFieldType) mapperService.orElseThrow(
KNNVectorFieldType mappedFieldType = (KNNVectorFieldType) mapperService.orElseThrow(
() -> new IllegalStateException(
String.format("Cannot read field type for field [%s] because mapper service is not available", field)
)
).fieldType(field);
var params = type.getKnnMethodContext().getMethodComponentContext().getParameters();

if (type.getKnnMethodContext().getKnnEngine() == KNNEngine.LUCENE
&& params != null
&& params.containsKey(METHOD_ENCODER_PARAMETER)) {
KNNMappingConfig knnMappingConfig = mappedFieldType.getKnnMappingConfig();
KNNMethodContext knnMethodContext = knnMappingConfig.getKnnMethodContext()
.orElseThrow(() -> new IllegalArgumentException("KNN method context cannot be empty"));

var params = knnMethodContext.getMethodComponentContext().getParameters();

if (knnMethodContext.getKnnEngine() == KNNEngine.LUCENE && params != null && params.containsKey(METHOD_ENCODER_PARAMETER)) {
KNNScalarQuantizedVectorsFormatParams knnScalarQuantizedVectorsFormatParams = new KNNScalarQuantizedVectorsFormatParams(
params,
defaultMaxConnections,
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -21,7 +21,6 @@
import org.opensearch.knn.index.codec.transfer.VectorTransferByte;
import org.opensearch.knn.index.codec.transfer.VectorTransferFloat;
import org.opensearch.knn.jni.JNIService;
import org.opensearch.knn.index.SpaceType;
import org.opensearch.knn.index.codec.util.KNNCodecUtil;
import org.opensearch.knn.index.engine.KNNEngine;
import org.opensearch.knn.indices.Model;
Expand Down Expand Up @@ -57,7 +56,6 @@

import static org.apache.lucene.codecs.CodecUtil.FOOTER_MAGIC;
import static org.opensearch.knn.common.KNNConstants.MODEL_ID;
import static org.opensearch.knn.common.KNNConstants.PARAMETERS;
import static org.opensearch.knn.index.codec.util.KNNCodecUtil.buildEngineFileName;
import static org.opensearch.knn.index.codec.util.KNNCodecUtil.calculateArraySize;
import static org.opensearch.knn.index.engine.faiss.Faiss.FAISS_BINARY_INDEX_DESCRIPTION_PREFIX;
Expand Down Expand Up @@ -214,35 +212,21 @@ private void createKNNIndexFromTemplate(Model model, KNNCodecUtil.Pair pair, KNN

private void createKNNIndexFromScratch(FieldInfo fieldInfo, KNNCodecUtil.Pair pair, KNNEngine knnEngine, String indexPath)
throws IOException {
Map<String, Object> parameters = new HashMap<>();
Map<String, String> fieldAttributes = fieldInfo.attributes();
String parametersString = fieldAttributes.get(KNNConstants.PARAMETERS);

// parametersString will be null when legacy mapper is used
if (parametersString == null) {
parameters.put(KNNConstants.SPACE_TYPE, fieldAttributes.getOrDefault(KNNConstants.SPACE_TYPE, SpaceType.DEFAULT.getValue()));

String efConstruction = fieldAttributes.get(KNNConstants.HNSW_ALGO_EF_CONSTRUCTION);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

since we are removing this code and legacy field mapper, I understand that a new MethodFieldMapper will be used here. But how this change is BWC? because if in a field parametersString is not added in the old segments then even when we have changed the mapper it will not impact the fieldInfo of old segments.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But the fieldInfo of old segments will not be re-used to create new segments. We will need to use the fieldInfo that is created for new segments. This fieldInfo will be setup using the MethodFieldMapper which will specify the paramsString. So, I dont see an issue with BWC

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But the fieldInfo of old segments will not be re-used to create new segments

FieldInfo as per my understanding is at the Lucene index level(aka shard level) and not the lucene segment level. Please validate this, by doing a local test. If already done please let me know the steps, I will check on my side too.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is what I was referring to: https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/index/FieldInfo.java#L24-L29. FieldInfo is per-segment. I can look a little bit more into this.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

interesting. But can we do 1 validation to be 100% sure. because fieldInfo gets created per segment when we are searching. But during indexing are there checks which may prevent be triggered same field has different values.

Example, we cannot create a field with docvalues as false for 1 segment and for another segment make it true. So similarly can we check, for same field can we add 1 attribute and then next time we add another attribute.

Please ignore if we have done this check. Just want to ensure that we have handled/thought about all the cases.

Copy link
Member Author

@jmazanec15 jmazanec15 Aug 9, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@navneet1v

@jmazanec15 I thought of another use case and want to know how BWC is maintained here. If lets say 2 old segments are merged where parameters were null. What will happen in that case? Because none of the segments had the parameters value it can lead to exception as per my understanding. Whats your thought?

This case fails:

$ curl -X POST "localhost:9200/target_index/_forcemerge?max_num_segments=1&pretty"
{
  "_shards" : {
    "total" : 1,
    "successful" : 0,
    "failed" : 1,
    "failures" : [
      {
        "shard" : 0,
        "index" : "target_index",
        "status" : "INTERNAL_SERVER_ERROR",
        "reason" : {
          "type" : "i_o_exception",
          "reason" : "background merge hit exception: _0(9.12.0):c3:[diagnostics={source=flush, lucene.version=9.12.0, os.version=14.3.1, os.arch=x86_64, java.vendor=Oracle Corporation, os=Mac OS X, java.runtime.version=17.0.2+8-86, timestamp=1723228909394}]:[attributes={Lucene90StoredFieldsFormat.mode=BEST_SPEED}] :id=50yf5f1zuh78uivbrgbbmkg1 _2(9.12.0):c4:[diagnostics={source=flush, lucene.version=9.12.0, os.version=14.3.1, os.arch=x86_64, java.vendor=Oracle Corporation, os=Mac OS X, java.runtime.version=17.0.2+8-86, timestamp=1723228920684}]:[attributes={Lucene90StoredFieldsFormat.mode=BEST_SPEED}] :id=50yf5f1zuh78uivbrgbbmkg6 _1(9.12.0):c1:[diagnostics={source=flush, lucene.version=9.12.0, os.version=14.3.1, os.arch=x86_64, java.vendor=Oracle Corporation, os=Mac OS X, java.runtime.version=17.0.2+8-86, timestamp=1723228914087}]:[attributes={Lucene90StoredFieldsFormat.mode=BEST_SPEED}] :id=50yf5f1zuh78uivbrgbbmkg3 into _3 [maxNumSegments=1]",
          "caused_by" : {
            "type" : "runtime_exception",
            "reason" : "java.lang.IllegalStateException: Parameter string is not set when creating a KNN index. We should not enter this state. It is likely a bug with respect to removal of Legacy field mapper.",
            "caused_by" : {
              "type" : "illegal_state_exception",
              "reason" : "Parameter string is not set when creating a KNN index. We should not enter this state. It is likely a bug with respect to removal of Legacy field mapper."
            }
          }
        }
      }
    ]
  }
}

That being said, I think we probably need to leave the code as is in the KNN80DocValuesConsumer.

For field type/mapping, let me check which they use for initializing the merged segments fileinfo. If its always the latest one, then I think we can leave as is.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@navneet1v I think this is the key point:

     * <p>If the field already exists: 1) the provided FieldInfo's schema is checked against the
     * existing field and 2) the provided FieldInfo's attributes are added to the existing
     * FieldInfo's attributes.

So, what we can conclude is that the field info attributes will be a Union of the merged fields. So, we just need to keep the legacy checks in KNN80DocValuesConsumer but we can keep Mapper layer as is in this PR.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That being said, I think we probably need to leave the code as is in the KNN80DocValuesConsumer.

I am aligned on keeping the check in KNN80DocValuesConsumer and remove the checks from mapper.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we add that failure case in bwc test?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure @heemin32 added

Map<String, Object> algoParams = new HashMap<>();
if (efConstruction != null) {
algoParams.put(KNNConstants.METHOD_PARAMETER_EF_CONSTRUCTION, Integer.parseInt(efConstruction));
}

String m = fieldAttributes.get(KNNConstants.HNSW_ALGO_M);
if (m != null) {
algoParams.put(KNNConstants.METHOD_PARAMETER_M, Integer.parseInt(m));
}
parameters.put(PARAMETERS, algoParams);
} else {
parameters.putAll(
XContentHelper.createParser(
NamedXContentRegistry.EMPTY,
DeprecationHandler.THROW_UNSUPPORTED_OPERATION,
new BytesArray(parametersString),
MediaTypeRegistry.getDefaultMediaType()
).map()
throw new IllegalStateException(
"Parameter string is not set when creating a KNN index. We should not enter this state. It is likely a bug with respect to removal of Legacy field mapper."
);
}
Map<String, Object> parameters = new HashMap<>(
XContentHelper.createParser(
NamedXContentRegistry.EMPTY,
DeprecationHandler.THROW_UNSUPPORTED_OPERATION,
new BytesArray(parametersString),
MediaTypeRegistry.getDefaultMediaType()
).map()
);

// Update index description of Faiss for binary data type
if (KNNEngine.FAISS == knnEngine
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -9,7 +9,6 @@
import lombok.Getter;
import lombok.NonNull;
import lombok.Setter;
import org.opensearch.Version;
import org.opensearch.common.ValidationException;
import org.opensearch.core.common.io.stream.StreamInput;
import org.opensearch.core.common.io.stream.StreamOutput;
Expand All @@ -20,7 +19,6 @@
import org.opensearch.index.mapper.MapperParsingException;

import java.io.IOException;
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.stream.Collectors;
Expand All @@ -29,7 +27,6 @@
import org.opensearch.knn.training.VectorSpaceInfo;

import static org.opensearch.knn.common.KNNConstants.KNN_ENGINE;
import static org.opensearch.knn.common.KNNConstants.METHOD_HNSW;
import static org.opensearch.knn.common.KNNConstants.METHOD_PARAMETER_SPACE_TYPE;
import static org.opensearch.knn.common.KNNConstants.NAME;
import static org.opensearch.knn.common.KNNConstants.PARAMETERS;
Expand All @@ -42,21 +39,6 @@
@Getter
public class KNNMethodContext implements ToXContentFragment, Writeable {

private static KNNMethodContext defaultInstance = null;

/**
* This is used only for testing
* @return default KNNMethodContext for testing
*/
public static synchronized KNNMethodContext getDefault() {
if (defaultInstance == null) {
MethodComponentContext methodComponentContext = new MethodComponentContext(METHOD_HNSW, Collections.emptyMap());
methodComponentContext.setIndexVersion(Version.CURRENT);
defaultInstance = new KNNMethodContext(KNNEngine.DEFAULT, SpaceType.DEFAULT, methodComponentContext);
}
return defaultInstance;
}

@NonNull
private final KNNEngine knnEngine;
@NonNull
Expand Down
Original file line number Diff line number Diff line change
@@ -0,0 +1,97 @@
/*
* Copyright OpenSearch Contributors
* SPDX-License-Identifier: Apache-2.0
*/

package org.opensearch.knn.index.mapper;

import org.apache.lucene.document.FieldType;
import org.opensearch.Version;
import org.opensearch.common.Explicit;
import org.opensearch.knn.index.VectorDataType;

import java.util.Map;
import java.util.Optional;

/**
* Mapper used when you dont want to build an underlying KNN struct - you just want to
* store vectors as doc values
*/
public class FlatVectorFieldMapper extends KNNVectorFieldMapper {

private final PerDimensionValidator perDimensionValidator;

public static FlatVectorFieldMapper createFieldMapper(
String fullname,
String simpleName,
Map<String, String> metaValue,
VectorDataType vectorDataType,
Integer dimension,
MultiFields multiFields,
CopyTo copyTo,
Explicit<Boolean> ignoreMalformed,
boolean stored,
boolean hasDocValues,
Version indexCreatedVersion
) {
final KNNVectorFieldType mappedFieldType = new KNNVectorFieldType(fullname, metaValue, vectorDataType, new KNNMappingConfig() {
@Override
public Optional<Integer> getDimension() {
return Optional.of(dimension);
}
});
return new FlatVectorFieldMapper(
simpleName,
mappedFieldType,
multiFields,
copyTo,
ignoreMalformed,
stored,
hasDocValues,
indexCreatedVersion
);
}

private FlatVectorFieldMapper(
String simpleName,
KNNVectorFieldType mappedFieldType,
MultiFields multiFields,
CopyTo copyTo,
Explicit<Boolean> ignoreMalformed,
boolean stored,
boolean hasDocValues,
Version indexCreatedVersion
) {
super(simpleName, mappedFieldType, multiFields, copyTo, ignoreMalformed, stored, hasDocValues, indexCreatedVersion, null);
this.perDimensionValidator = selectPerDimensionValidator(vectorDataType);
this.fieldType = new FieldType(KNNVectorFieldMapper.Defaults.FIELD_TYPE);
this.fieldType.freeze();
}

private PerDimensionValidator selectPerDimensionValidator(VectorDataType vectorDataType) {
if (VectorDataType.BINARY == vectorDataType) {
return PerDimensionValidator.DEFAULT_BIT_VALIDATOR;
}

if (VectorDataType.BYTE == vectorDataType) {
return PerDimensionValidator.DEFAULT_BYTE_VALIDATOR;
}

return PerDimensionValidator.DEFAULT_FLOAT_VALIDATOR;
}

@Override
protected VectorValidator getVectorValidator() {
return VectorValidator.NOOP_VECTOR_VALIDATOR;
}

@Override
protected PerDimensionValidator getPerDimensionValidator() {
return perDimensionValidator;
}

@Override
protected PerDimensionProcessor getPerDimensionProcessor() {
return PerDimensionProcessor.NOOP_PROCESSOR;
}
}
Original file line number Diff line number Diff line change
@@ -0,0 +1,40 @@
/*
* Copyright OpenSearch Contributors
* SPDX-License-Identifier: Apache-2.0
*/

package org.opensearch.knn.index.mapper;

import org.opensearch.knn.index.engine.KNNMethodContext;

import java.util.Optional;

/**
* Class holds information about how the ANN indices are created. The design of this class ensures that we do not
* accidentally configure an index that has multiple ways it can be created. This class is immutable.
*/
public interface KNNMappingConfig {
/**
*
* @return Optional containing the modelId if created from model, otherwise empty
*/
default Optional<String> getModelId() {
return Optional.empty();
}

/**
*
* @return Optional containing the KNNMethodContext if created from method, otherwise empty
*/
default Optional<KNNMethodContext> getKnnMethodContext() {
return Optional.empty();
}

/**
*
* @return the dimension of the index; for model based indices, it will be null
*/
default Optional<Integer> getDimension() {
return Optional.empty();
}
}
Loading
Loading