
Refactor Around Mapper and Mapping #1939

Merged: jmazanec15 merged 5 commits from field-mapper-ref into opensearch-project:main on Aug 10, 2024

Conversation

jmazanec15 (Member)

Description

This refactors the current FieldMapper structure; the main motivation is that simplification will make it easier to make changes with confidence. A good amount of complexity has built up because there are multiple ways to create an index (LegacyFieldMapper, Model, Lucene, and Method) with different data types. I wanted to re-organize so that we can extend it more easily.

In general, this refactor does the following:

  1. It removes branching around encoder and space type during parsing. It does so by adding per-dimension validators and processors and creating implementations for the different types, then adding a couple of methods that can be implemented by the specific FieldMappers (see the sketch below).
  2. It encapsulates the dimension, knnMethodContext, and modelId behind a new class called ANNConfig. I'm not 100% sure about this name, so I would appreciate any suggestions. The main purpose of this class is to control access to this information from calling logic so that the branches are simpler and safer to handle. For example, there is no implicit need to deal with -1 dimensions.
  3. It removes the LegacyFieldMapper, uses the MethodFieldMapper instead, and builds the knnMethodContext from the settings. The LegacyFieldMapper is not really necessary, given that we can just use the information from the settings to build the knnMethodContext. This also eliminates some branching logic in the KNN80DocValuesConsumer.
  4. It creates a new FieldMapper called FlatFieldMapper (also open to suggestions). This FieldMapper is used when knn=false. It can be repurposed in the future for more direct control over exact search. It should be a separate class because it does not need a lot of the things that the other, ANN-specific mappings require, such as space types.

One note: I left out changing too much around the VectorDataType. I think it should eventually go in the ANNConfig class, but I will leave that for another time.
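As a rough illustration of the per-dimension validator idea in item 1 above (a sketch with illustrative names, not necessarily the exact interfaces in this PR), parsing can delegate per-value checks to a small strategy interface with one implementation per vector data type:

// Sketch only: names are illustrative, not necessarily the exact interfaces in the PR.
interface PerDimensionValidator {
    void validate(float value);
}

// Float vectors only need to reject non-finite values.
class FloatPerDimensionValidator implements PerDimensionValidator {
    @Override
    public void validate(float value) {
        if (Float.isNaN(value) || Float.isInfinite(value)) {
            throw new IllegalArgumentException("KNN vector values must be finite");
        }
    }
}

// Byte vectors must additionally be whole numbers within [-128, 127].
class BytePerDimensionValidator implements PerDimensionValidator {
    @Override
    public void validate(float value) {
        if (value < Byte.MIN_VALUE || value > Byte.MAX_VALUE || value != Math.floor(value)) {
            throw new IllegalArgumentException("Byte vector values must be whole numbers in [-128, 127]");
        }
    }
}

With something like this in place, the field mapper's parsing path only asks the field's validator to check each value, instead of branching on encoder and space type inline.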

Check List

  • Commits are signed per the DCO using --signoff.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@jmazanec15 jmazanec15 added the backport 2.x and Refactoring (improve the design, structure, and implementation while preserving its functionality) labels on Aug 8, 2024
@@ -251,62 +251,6 @@ public void testAddKNNBinaryField_fromScratch_nmslibCurrent() throws IOException
assertNotEquals(0, (long) KNNGraphValue.MERGE_TOTAL_SIZE_IN_BYTES.getValue());
}

public void testAddKNNBinaryField_fromScratch_nmslibLegacy() throws IOException {
jmazanec15 (Member Author):

Deleting this because the Codec doesn't need to be aware of the legacy mapper.

* @param modelId Model id used to create the index; null if not created from model
* @param dimension Dimension used to create the index; needs to be null for model-based indices
*/
public ANNConfig(ANNConfigType annConfigType, KNNMethodContext knnMethodContext, String modelId, Integer dimension) {
Collaborator:

Would using the builder pattern be better?

Another option is making each type of config a separate class with its own validation, and then creating them via a factory. That would make it clear which validation belongs to which config type.
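For reference, a minimal builder-style sketch of what this suggestion could look like (illustrative only; it uses the fields from the constructor quoted above and centralizes the model/dimension check in build()):

// Sketch of a builder for ANNConfig; validation is centralized in build().
public final class ANNConfig {
    private final KNNMethodContext knnMethodContext;
    private final String modelId;
    private final Integer dimension;

    private ANNConfig(Builder builder) {
        this.knnMethodContext = builder.knnMethodContext;
        this.modelId = builder.modelId;
        this.dimension = builder.dimension;
    }

    public static class Builder {
        private KNNMethodContext knnMethodContext;
        private String modelId;
        private Integer dimension;

        public Builder knnMethodContext(KNNMethodContext knnMethodContext) { this.knnMethodContext = knnMethodContext; return this; }
        public Builder modelId(String modelId) { this.modelId = modelId; return this; }
        public Builder dimension(Integer dimension) { this.dimension = dimension; return this; }

        public ANNConfig build() {
            // Per the javadoc above: dimension must be null for model-based indices.
            if (modelId != null && dimension != null) {
                throw new IllegalArgumentException("dimension must be null for model-based indices");
            }
            return new ANNConfig(this);
        }
    }
}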

Collaborator:

+1 on using the builder pattern to create the ANNConfig object. Whether to use the Lombok builder or a hand-written one, I would leave up to @jmazanec15.

> Another option is making each type of config a separate class with its own validation, and then creating them via a factory.

I would really like the validation to live in each object's own class, but one thing I am not sure about is whether a specific value of, say, KNNMethodContext impacts the ANNConfigType validations. That could get tricky. I would like to understand this more before we move to a more complex way of creating and validating objects.

Collaborator:

Another point: why are modelId and dimension not encapsulated in classes? When spaceType is added, will that become yet another parameter to this constructor? @jmazanec15, I would like to know your thoughts on this.

jmazanec15 (Member Author):

Dimension and modelId are direct top-level parameters. I want to hold off on wrapping them in anything for this PR; we can wrap them in the future if we need to.

And yes, after some thought, I am going to turn ANNConfig into KNNMappingConfig and implement it per FieldMapper. This should simplify things.

jmazanec15 (Member Author):

Switched ANNConfig to an interface, KNNMappingConfig, and removed ANNConfigType. Now the individual field mappers construct the KNNMappingConfig as they need to. I think this resolves these comments.
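A minimal sketch of what that interface-per-mapper shape might look like (names based on the discussion in this thread; the exact signatures in the PR may differ):

// Sketch: each FieldMapper supplies its own KNNMappingConfig implementation.
// Assumes java.util.Optional and the plugin's KNNMethodContext.
public interface KNNMappingConfig {
    // Present only for method-based mappings.
    default Optional<KNNMethodContext> getKnnMethodContext() {
        return Optional.empty();
    }

    // Present only for model-based mappings.
    default Optional<String> getModelId() {
        return Optional.empty();
    }

    // Dimension of the vectors; see the later discussion about making this non-optional.
    int getDimension();
}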

*
* @return Optional containing the modelId if created from model, otherwise empty
*/
public Optional<String> getModelId() {
Collaborator:

Why not have modelId as an Optional field instead of a String? Is this to prevent the value from being updated?

jmazanec15 (Member Author):

The main reason for returning an Optional was to force calling code to handle the null case. Otherwise, callers have to do their own null checks.

Collaborator:

Then why don't we just declare the field as

private Optional<String> modelId;

jmazanec15 (Member Author):

This is the message from the IntelliJ warning:

'Optional<String>' used as type for field 'modelId'
Inspection info: Reports any cases in which java.util.Optional<T>, java.util.OptionalDouble, java.util.OptionalInt, java.util.OptionalLong, or com.google.common.base.Optional are used as types for fields or parameters. Optional was designed to provide a limited mechanism for library method return types in which a clear way to represent "no result" was needed. Using a field with the java.util.Optional type is also problematic if the class needs to be Serializable, as java.util.Optional is not serializable.

> Optional was designed to provide a limited mechanism for library method return types in which a clear way to represent "no result" was needed.

I felt this is the main reason to keep the field a plain String.
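In other words, the pattern keeps the field a nullable String and exposes Optional only at the accessor boundary. A minimal sketch (hypothetical class name, for illustration only):

import java.util.Optional;

// Sketch: nullable field internally, Optional only where callers read it.
public class ExampleMappingConfig {
    private final String modelId; // null when the index was not created from a model

    public ExampleMappingConfig(String modelId) {
        this.modelId = modelId;
    }

    public Optional<String> getModelId() {
        return Optional.ofNullable(modelId);
    }
}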

maybeInitLazyVariables(modelMetadata);
}

private void maybeInitLazyVariables(ModelMetadata modelMetadata) {
Collaborator:

Shouldn't we just have a separate class and handle validation and processing there dynamically based on values in modelMetadata?

jmazanec15 (Member Author):

I'm not sure I understand the suggestion.

Collaborator:

Is there a specific reason for this atomic-based approach? Is this method called multiple times? Also, are these variables used multiple times, and is that why we need to keep them at the instance level?

Could it be

private static final Map<VectorDataType, PerDimensionValidator> VALIDATORS = ...;
VALIDATORS.get(vectorDataType).validate();

Or

private static final PerDimensionValidator perDimensionValidator;
perDimensionValidator.validate(vectorDataType);

jmazanec15 (Member Author):

The problem with the model field mapper is that it relies on the cluster state to store the model metadata for the model id. But during recovery, the cluster state is not available, so we cannot get the model metadata during field mapper creation. Related PR: #111. So the idea is that we have to set these once field parsing begins. The reason for the atomics is to ensure they get set exactly once, safely.
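As a rough sketch of the set-once pattern being described (illustrative only; createValidator is a hypothetical helper, and the exact fields initialized in the PR may differ):

// Sketch: resolve model-derived state lazily, the first time a document is parsed,
// because model metadata is not available from cluster state at mapper-creation time.
// Requires java.util.concurrent.atomic.AtomicReference.
private final AtomicReference<PerDimensionValidator> perDimensionValidator = new AtomicReference<>();

private void maybeInitLazyVariables(ModelMetadata modelMetadata) {
    // compareAndSet guarantees the validator is initialized at most once,
    // even if multiple parsing threads race here.
    perDimensionValidator.compareAndSet(null, createValidator(modelMetadata)); // createValidator is hypothetical
}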

navneet1v (Collaborator) left a comment:

Thanks for doing the refactoring, @jmazanec15. While I was reviewing, a new update got pushed, so I am not sure whether all of my comments are still valid. If things are already fixed, please let me know and I will take another pass at the code.

High-level comment:

  1. Some complex conditions were added in the KNNVectorFieldMapper; I would love to see them simplified into something that is easy to read and understand.

CHANGELOG.md (review comment outdated; resolved)

// because to get the cluster metadata, you need access to the cluster state. Because this code is
// sometimes used to initialize the cluster state/update cluster state, we cannot get the state here
// safely. So, we are unable to validate the model. The model gets validated during ingestion.
boolean isResolvedNull = resolvedKNNMethodContext == null;
Collaborator:

Can we add a comment here explaining what resolvedKNNMethodContext means?

// case, the input resolvedKNNMethodContext will be null and the settings wont exist (so flat mapper should
// be used). Otherwise, we need to check the setting.
boolean isSettingPresent = KNNSettings.IS_KNN_INDEX_SETTING.exists(context.indexSettings());
if (isResolvedNull && (!isSettingPresent || !KNNSettings.IS_KNN_INDEX_SETTING.get(context.indexSettings()))) {
Collaborator:

Can we simplify this check? It's really hard to follow and will be hard to make sense of after a few months.

jmazanec15 (Member Author):

Let me try. It is a little tricky to simplify because of an upstream issue where merge does not have access to the settings (see https://github.com/opensearch-project/OpenSearch/blob/main/server/src/main/java/org/opensearch/index/mapper/ParametrizedFieldMapper.java#L130 and opendistro-for-elasticsearch/k-NN#288).
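One way to keep the behavior but make the intent readable would be to push the check behind a descriptively named helper. A sketch, assuming the quoted condition stays exactly as-is (BuilderContext stands in for whatever context object exposes indexSettings() at the call site):

// Sketch: same logic as the quoted condition, behind a name that states the intent.
private boolean shouldUseFlatMapper(KNNMethodContext resolvedKNNMethodContext, BuilderContext context) {
    if (resolvedKNNMethodContext != null) {
        return false;
    }
    // During merge the index settings may be unavailable (upstream limitation),
    // so a missing setting is treated the same as index.knn = false.
    boolean isSettingPresent = KNNSettings.IS_KNN_INDEX_SETTING.exists(context.indexSettings());
    return !isSettingPresent || !KNNSettings.IS_KNN_INDEX_SETTING.get(context.indexSettings());
}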

Collaborator:

Please take a shot at it. The condition may look easy to read right now, but in a few months it will be hard. Alternatively, add extensive Javadoc so that it is easy to understand.


src/test/java/org/opensearch/knn/KNNTestCase.java (review comment outdated; resolved)
* @param vectorDataType data type of the vector
* @param annConfig configuration context for the ANN index
*/
public KNNVectorFieldType(String name, Map<String, String> meta, VectorDataType vectorDataType, ANNConfig annConfig) {
Collaborator:

Suggested change
public KNNVectorFieldType(String name, Map<String, String> meta, VectorDataType vectorDataType, ANNConfig annConfig) {
public KNNVectorFieldType(String name, Map<String, String> metadata, VectorDataType vectorDataType, ANNConfig annConfig) {

public final class ANNConfig {

@Getter
private final ANNConfigType annConfigType;
Collaborator:

Rather than a type here, is it worth simplifying this to a boolean requiresTraining()?

jmazanec15 (Member Author):

There is a third case, the flat mapper when knn=false, where the field is neither model-based nor method-based. I am working on cleaning this up a little bit.

Comment on lines 21 to 23
private final KNNMethodContext knnMethodContext;
private final String modelId;
private final Integer dimension;
Collaborator:

  • KNNMethodContext here is dense, with layers inside it. For instance, in the model-based case the KNNEngine sits inside the model metadata, while otherwise it is one layer up; this forces if/else checks just to get the KNNEngine. Unwrapping KNNMethodContext and storing the specific values would isolate the different cases, so the associated logic runs when the values are set rather than on every get.

  • Is it worth considering specific configurations with hard types derived from ANNConfig? There is a case here where, if modelId is present, then dimension is -1. We could model it so that the two fields never live in the same implementation. I am not sure how much this will help.

Note: please check backward compatibility with the approaches suggested.

jmazanec15 (Member Author):

> KNNMethodContext here is dense, with layers inside it. For instance, in the model-based case the KNNEngine sits inside the model metadata, while otherwise it is one layer up; this forces if/else checks just to get the KNNEngine.

I prefer to keep it simple for now and encapsulate the three values from the mapping that can have null or null-like behavior. I think this is something we can circle back to in the future.

> Is it worth considering specific configurations with hard types derived from ANNConfig? There is a case here where, if modelId is present, then dimension is -1.

Yes, I think I agree with this. I'm going to switch ANNConfig to KNNMappingConfig and make it an interface.

@jmazanec15 jmazanec15 force-pushed the field-mapper-ref branch 2 times, most recently from 9de1292 to b46d203, on August 8, 2024 at 22:41
Refactors FieldMapper logic. It removes the LegacyFieldMapper and
replaces it with a FlatFieldMapper. The FlatFieldMapper's role is to
create fields that do not build ANN indices.

Additionally, it puts dimension, model_id, and knn_method_context in a
new ANNConfig class and adds some safety checks around accessing them.
This should make calling logic easier to handle.

Lastly, it cleans up the parsing so that there isn't encoder parsing
directly in the KNNVectorFieldMapper.

Signed-off-by: John Mazanec <[email protected]>
}
int expectedDimensions = knnVectorFieldType.getKnnMappingConfig()
.getDimension()
.orElseGet(
heemin32 (Collaborator), Aug 9, 2024:

Can we make knnMappingConfig read the dimension from the model? This code is mixing concerns: getDimension is meaningless for a model-based config, and getDimensionFromModelId is meaningless for a non-model config, but the code does not convey that, and an engineer will not be able to infer that internal logic from this code alone.

jmazanec15 (Member Author):

I moved getDimension to be non-optional.
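A sketch of how the model-backed mapping could resolve that (illustrative only; the lazy supplier and the dimension accessor on ModelMetadata are assumptions, not necessarily what the PR does):

// Sketch: the model-backed config resolves dimension from model metadata,
// so callers never see an Optional or a -1 sentinel for dimension.
// Uses java.util.Optional and java.util.function.Supplier.
public class ModelMappingConfig implements KNNMappingConfig {
    private final String modelId;
    private final Supplier<ModelMetadata> modelMetadataSupplier; // resolved lazily, per the recovery discussion above

    public ModelMappingConfig(String modelId, Supplier<ModelMetadata> modelMetadataSupplier) {
        this.modelId = modelId;
        this.modelMetadataSupplier = modelMetadataSupplier;
    }

    @Override
    public Optional<String> getModelId() {
        return Optional.of(modelId);
    }

    @Override
    public int getDimension() {
        return modelMetadataSupplier.get().getDimension(); // assumes ModelMetadata exposes the dimension
    }
}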

@jmazanec15 jmazanec15 requested a review from heemin32 August 9, 2024 01:39
@jmazanec15 jmazanec15 force-pushed the field-mapper-ref branch 2 times, most recently from 5e61e3f to a85ea01, on August 9, 2024 at 03:42
heemin32 previously approved these changes Aug 9, 2024
Signed-off-by: John Mazanec <[email protected]>
@jmazanec15 jmazanec15 requested a review from heemin32 August 9, 2024 19:38
@jmazanec15 jmazanec15 merged commit 2cd57e8 into opensearch-project:main Aug 10, 2024
30 checks passed
opensearch-trigger-bot bot pushed a commit that referenced this pull request Aug 10, 2024
Refactors FieldMapper logic. It removes the LegacyFieldMapper and
replaces it with a FlatFieldMapper. The FlatFieldMapper's role is to
create fields that do not build ANN indices.

Additionally, it puts dimension, model_id, and knn_method_context in a
new KNNMappingConfig class and adds some safety checks around
accessing them. This should make calling logic easier to handle.

Lastly, it cleans up the parsing so that there isn't encoder parsing
directly in the KNNVectorFieldMapper.

Signed-off-by: John Mazanec <[email protected]>
(cherry picked from commit 2cd57e8)
akashsha1 pushed a commit to akashsha1/k-NN that referenced this pull request Sep 16, 2024
(Same commit message as the backport above; additionally signed off by Akash Shankaran.)