
Refactor Around Mapper and Mapping #1939

Merged: jmazanec15 merged 5 commits from field-mapper-ref into opensearch-project:main on Aug 10, 2024

Conversation

jmazanec15 (Member)

Description

This refactors the current FieldMapper structure; the main motivation is that simplification will make it easier to make changes with confidence. A good amount of complexity has built up because there are multiple ways to create an index (LegacyFieldMapper, Model, Lucene, and Method) with different data types. I wanted to re-organize so that we can extend it more easily.

In general, this refactor does the following:

  1. It removes branching around encoder and space type during parsing. It does so by adding per-dimension validators and processors and creating implementations for the different types, then adding a couple of methods that can be implemented by the specific FieldMappers (see the sketch below).
  2. It encapsulates the dimension, knnMethodContext, and modelId behind a new class called ANNConfig. I'm not 100% sure about this name, so I would appreciate any suggestions. The main purpose of this class is to control access to this information from calling logic so that the branches are simpler and safer to handle. For example, there is no implicit need to deal with -1 dimensions.
  3. It removes the LegacyFieldMapper, uses the MethodFieldMapper instead, and builds the knnMethodContext from the settings. The LegacyFieldMapper is not really necessary, given that we can just use the information from the settings to build the knnMethodContext. This also eliminates some branching logic in the KNN80DocValuesConsumer.
  4. It creates a new FieldMapper called FlatFieldMapper (also open to suggestions). This FieldMapper is used when knn=false. It can be repurposed in the future for more direct control over exact search. It should be a separate class because it does not need a lot of the things that the other, ANN-specific mappings require, such as space types.

One note: I left out changing too much around the VectorDataType. I think it should eventually go in the ANNConfig class, but I will leave that for another time.
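As a rough illustration of the per-dimension validator idea in item 1 above (a sketch with illustrative names, not necessarily the exact interfaces in this PR), parsing can delegate per-value checks to a small strategy interface with one implementation per vector data type:

// Sketch only: names are illustrative, not necessarily the exact interfaces in the PR.
interface PerDimensionValidator {
    void validate(float value);
}

// Float vectors only need to reject non-finite values.
class FloatPerDimensionValidator implements PerDimensionValidator {
    @Override
    public void validate(float value) {
        if (Float.isNaN(value) || Float.isInfinite(value)) {
            throw new IllegalArgumentException("KNN vector values must be finite");
        }
    }
}

// Byte vectors must additionally be whole numbers within [-128, 127].
class BytePerDimensionValidator implements PerDimensionValidator {
    @Override
    public void validate(float value) {
        if (value < Byte.MIN_VALUE || value > Byte.MAX_VALUE || value != Math.floor(value)) {
            throw new IllegalArgumentException("Byte vector values must be whole numbers in [-128, 127]");
        }
    }
}

With something like this in place, the field mapper's parsing path only asks the field's validator to check each value, instead of branching on encoder and space type inline.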

Check List

  • Commits are signed per the DCO using --signoff.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@jmazanec15 jmazanec15 added the backport 2.x and Refactoring (improve the design, structure, and implementation while preserving its functionality) labels on Aug 8, 2024
@@ -251,62 +251,6 @@ public void testAddKNNBinaryField_fromScratch_nmslibCurrent() throws IOException
assertNotEquals(0, (long) KNNGraphValue.MERGE_TOTAL_SIZE_IN_BYTES.getValue());
}

public void testAddKNNBinaryField_fromScratch_nmslibLegacy() throws IOException {
jmazanec15 (Member Author):

Deleting this because the Codec doesn't need to be aware of the legacy mapper.

* @param modelId Model id used to create the index; null if not created from model
* @param dimension Dimension used to create the index; needs to be null for model-based indices
*/
public ANNConfig(ANNConfigType annConfigType, KNNMethodContext knnMethodContext, String modelId, Integer dimension) {
Collaborator:

Would using the builder pattern be better?

Another option is making each type of config a separate class with its own validation, and then creating them via a factory. That would make it clear which validation belongs to which config type.
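For reference, a minimal builder-style sketch of what this suggestion could look like (illustrative only; it uses the fields from the constructor quoted above and centralizes the model/dimension check in build()):

// Sketch of a builder for ANNConfig; validation is centralized in build().
public final class ANNConfig {
    private final KNNMethodContext knnMethodContext;
    private final String modelId;
    private final Integer dimension;

    private ANNConfig(Builder builder) {
        this.knnMethodContext = builder.knnMethodContext;
        this.modelId = builder.modelId;
        this.dimension = builder.dimension;
    }

    public static class Builder {
        private KNNMethodContext knnMethodContext;
        private String modelId;
        private Integer dimension;

        public Builder knnMethodContext(KNNMethodContext knnMethodContext) { this.knnMethodContext = knnMethodContext; return this; }
        public Builder modelId(String modelId) { this.modelId = modelId; return this; }
        public Builder dimension(Integer dimension) { this.dimension = dimension; return this; }

        public ANNConfig build() {
            // Per the javadoc above: dimension must be null for model-based indices.
            if (modelId != null && dimension != null) {
                throw new IllegalArgumentException("dimension must be null for model-based indices");
            }
            return new ANNConfig(this);
        }
    }
}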

Collaborator:

+1 on using the builder pattern to create the ANNConfig object. Whether to use the Lombok builder or a hand-written one, I would leave up to @jmazanec15.

> Another option is making each type of config a separate class with its own validation, and then creating them via a factory.

I would really like the validation to live in each object's own class, but one thing I am not sure about is whether a specific value of, say, KNNMethodContext impacts the ANNConfigType validations. That could get tricky. I would like to understand this more before we move to a more complex way of creating and validating objects.

Collaborator:

Another point: why are modelId and dimension not encapsulated in classes? When spaceType is added, will that become yet another parameter to this constructor? @jmazanec15, I would like to know your thoughts on this.

jmazanec15 (Member Author):

Dimension and modelId are direct top-level parameters. I want to hold off on wrapping them in anything for this PR; we can wrap them in the future if we need to.

And yes, after some thought, I am going to turn ANNConfig into KNNMappingConfig and implement it per FieldMapper. This should simplify things.

jmazanec15 (Member Author):

Switched ANNConfig to an interface, KNNMappingConfig, and removed ANNConfigType. Now the individual field mappers construct the KNNMappingConfig as they need to. I think this resolves these comments.
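A minimal sketch of what that interface-per-mapper shape might look like (names based on the discussion in this thread; the exact signatures in the PR may differ):

// Sketch: each FieldMapper supplies its own KNNMappingConfig implementation.
// Assumes java.util.Optional and the plugin's KNNMethodContext.
public interface KNNMappingConfig {
    // Present only for method-based mappings.
    default Optional<KNNMethodContext> getKnnMethodContext() {
        return Optional.empty();
    }

    // Present only for model-based mappings.
    default Optional<String> getModelId() {
        return Optional.empty();
    }

    // Dimension of the vectors; see the later discussion about making this non-optional.
    int getDimension();
}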

*
* @return Optional containing the modelId if created from model, otherwise empty
*/
public Optional<String> getModelId() {
Collaborator:

Why not have modelId as an Optional field instead of a String? Is this to prevent the value from being updated?

jmazanec15 (Member Author):

The main reason for returning an Optional was to force calling code to handle the null case. Otherwise, callers have to do their own null checks.

Collaborator:

Then why don't we just declare the field as

private Optional<String> modelId;

jmazanec15 (Member Author):

This is the message from the IntelliJ warning:

'Optional<String>' used as type for field 'modelId'
Inspection info: Reports any cases in which java.util.Optional<T>, java.util.OptionalDouble, java.util.OptionalInt, java.util.OptionalLong, or com.google.common.base.Optional are used as types for fields or parameters. Optional was designed to provide a limited mechanism for library method return types in which a clear way to represent "no result" was needed. Using a field with the java.util.Optional type is also problematic if the class needs to be Serializable, as java.util.Optional is not serializable.

> Optional was designed to provide a limited mechanism for library method return types in which a clear way to represent "no result" was needed.

I felt this is the main reason to keep the field a plain String.
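In other words, the pattern keeps the field a nullable String and exposes Optional only at the accessor boundary. A minimal sketch (hypothetical class name, for illustration only):

import java.util.Optional;

// Sketch: nullable field internally, Optional only where callers read it.
public class ExampleMappingConfig {
    private final String modelId; // null when the index was not created from a model

    public ExampleMappingConfig(String modelId) {
        this.modelId = modelId;
    }

    public Optional<String> getModelId() {
        return Optional.ofNullable(modelId);
    }
}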

maybeInitLazyVariables(modelMetadata);
}

private void maybeInitLazyVariables(ModelMetadata modelMetadata) {
Collaborator:

Shouldn't we just have a separate class and handle validation and processing there dynamically based on values in modelMetadata?

jmazanec15 (Member Author):

I'm not sure I understand the suggestion.

Collaborator:

Is there a specific reason for this atomic-based approach? Is this method called multiple times? Also, are these variables used multiple times, and is that why we need to keep them at the instance level?

Could it be

private static final Map<VectorDataType, PerDimensionValidator> VALIDATORS = ...;
VALIDATORS.get(vectorDataType).validate();

Or

private static final PerDimensionValidator perDimensionValidator;
perDimensionValidator.validate(vectorDataType);

jmazanec15 (Member Author):

The problem with the model field mapper is that it relies on the cluster state to store the model metadata for the model id. But during recovery, the cluster state is not available, so we cannot get the model metadata during field mapper creation. Related PR: #111. So the idea is that we have to set these once field parsing begins. The reason for the atomics is to ensure they get set exactly once, safely.
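As a rough sketch of the set-once pattern being described (illustrative only; createValidator is a hypothetical helper, and the exact fields initialized in the PR may differ):

// Sketch: resolve model-derived state lazily, the first time a document is parsed,
// because model metadata is not available from cluster state at mapper-creation time.
// Requires java.util.concurrent.atomic.AtomicReference.
private final AtomicReference<PerDimensionValidator> perDimensionValidator = new AtomicReference<>();

private void maybeInitLazyVariables(ModelMetadata modelMetadata) {
    // compareAndSet guarantees the validator is initialized at most once,
    // even if multiple parsing threads race here.
    perDimensionValidator.compareAndSet(null, createValidator(modelMetadata)); // createValidator is hypothetical
}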

navneet1v (Collaborator) left a comment:

Thanks for doing the refactoring, @jmazanec15. While I was reviewing, a new update got pushed, so I am not sure whether all of my comments are still valid. If things are already fixed, please let me know and I will take another pass at the code.

High-level comment:

  1. Some complex conditions were added in the KNNVectorFieldMapper; I would love to see them simplified into something that is easy to read and understand.

CHANGELOG.md (review comment outdated; resolved)

// because to get the cluster metadata, you need access to the cluster state. Because this code is
// sometimes used to initialize the cluster state/update cluster state, we cannot get the state here
// safely. So, we are unable to validate the model. The model gets validated during ingestion.
boolean isResolvedNull = resolvedKNNMethodContext == null;
Collaborator:

Can we add a comment here explaining what resolvedKNNMethodContext means?

// case, the input resolvedKNNMethodContext will be null and the settings wont exist (so flat mapper should
// be used). Otherwise, we need to check the setting.
boolean isSettingPresent = KNNSettings.IS_KNN_INDEX_SETTING.exists(context.indexSettings());
if (isResolvedNull && (!isSettingPresent || !KNNSettings.IS_KNN_INDEX_SETTING.get(context.indexSettings()))) {
Collaborator:

Can we simplify this check? It's really hard to follow and will be hard to make sense of after a few months.

jmazanec15 (Member Author):

Let me try. It is a little tricky to simplify because of an upstream issue where merge does not have access to the settings (see https://github.com/opensearch-project/OpenSearch/blob/main/server/src/main/java/org/opensearch/index/mapper/ParametrizedFieldMapper.java#L130 and opendistro-for-elasticsearch/k-NN#288).
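One way to keep the behavior but make the intent readable would be to push the check behind a descriptively named helper. A sketch, assuming the quoted condition stays exactly as-is (BuilderContext stands in for whatever context object exposes indexSettings() at the call site):

// Sketch: same logic as the quoted condition, behind a name that states the intent.
private boolean shouldUseFlatMapper(KNNMethodContext resolvedKNNMethodContext, BuilderContext context) {
    if (resolvedKNNMethodContext != null) {
        return false;
    }
    // During merge the index settings may be unavailable (upstream limitation),
    // so a missing setting is treated the same as index.knn = false.
    boolean isSettingPresent = KNNSettings.IS_KNN_INDEX_SETTING.exists(context.indexSettings());
    return !isSettingPresent || !KNNSettings.IS_KNN_INDEX_SETTING.get(context.indexSettings());
}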

Collaborator:

Please take a shot at it. The condition may look easy to read right now, but in a few months it will be hard. Alternatively, add extensive Javadoc so that it is easy to understand.


src/test/java/org/opensearch/knn/KNNTestCase.java (review comment outdated; resolved)
* @param vectorDataType data type of the vector
* @param annConfig configuration context for the ANN index
*/
public KNNVectorFieldType(String name, Map<String, String> meta, VectorDataType vectorDataType, ANNConfig annConfig) {
Collaborator:

Suggested change
public KNNVectorFieldType(String name, Map<String, String> meta, VectorDataType vectorDataType, ANNConfig annConfig) {
public KNNVectorFieldType(String name, Map<String, String> metadata, VectorDataType vectorDataType, ANNConfig annConfig) {

public final class ANNConfig {

@Getter
private final ANNConfigType annConfigType;
Collaborator:

Rather than a type here, is it worth simplifying this to a boolean requiresTraining()?

jmazanec15 (Member Author):

There is a third case, the flat mapper when knn=false, where the field is neither model-based nor method-based. I am working on cleaning this up a little bit.

Comment on lines 21 to 23
private final KNNMethodContext knnMethodContext;
private final String modelId;
private final Integer dimension;
Collaborator:

  • KNNMethodContext here is dense, with layers inside it. For instance, in the model-based case the KNNEngine sits inside the model metadata, while otherwise it is one layer up; this forces if/else checks just to get the KNNEngine. Unwrapping KNNMethodContext and storing the specific values would isolate the different cases, so the associated logic runs when the values are set rather than on every get.

  • Is it worth considering specific configurations with hard types derived from ANNConfig? There is a case here where, if modelId is present, then dimension is -1. We could model it so that the two fields never live in the same implementation. I am not sure how much this will help.

Note: please check backward compatibility with the approaches suggested.

jmazanec15 (Member Author):

> KNNMethodContext here is dense, with layers inside it. For instance, in the model-based case the KNNEngine sits inside the model metadata, while otherwise it is one layer up; this forces if/else checks just to get the KNNEngine.

I prefer to keep it simple for now and encapsulate the three values from the mapping that can have null or null-like behavior. I think this is something we can circle back to in the future.

> Is it worth considering specific configurations with hard types derived from ANNConfig? There is a case here where, if modelId is present, then dimension is -1.

Yes, I think I agree with this. I'm going to switch ANNConfig to KNNMappingConfig and make it an interface.

@jmazanec15 jmazanec15 force-pushed the field-mapper-ref branch 2 times, most recently from 9de1292 to b46d203, on August 8, 2024 at 22:41
Refactors FieldMapper logic. It removes the LegacyFieldMapper and
replaces it with a FlatFieldMapper. The FlatFieldMapper's role is to
create fields that do not build ANN indices.

Additionally, it puts dimension, model_id, and knn_method_context in a
new ANNConfig class and adds some safety checks around accessing them.
This should make calling logic easier to handle.

Lastly, it cleans up the parsing so that there isn't encoder parsing
directly in the KNNVectorFieldMapper.

Signed-off-by: John Mazanec <[email protected]>
}
int expectedDimensions = knnVectorFieldType.getKnnMappingConfig()
.getDimension()
.orElseGet(
heemin32 (Collaborator), Aug 9, 2024:

Can we make knnMappingConfig read the dimension from the model? This code is mixing concerns: getDimension is meaningless for a model-based config, and getDimensionFromModelId is meaningless for a non-model config, but the code does not convey that, and an engineer will not be able to infer that internal logic from this code alone.

jmazanec15 (Member Author):

I moved getDimension to be non-optional.
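A sketch of how the model-backed mapping could resolve that (illustrative only; the lazy supplier and the dimension accessor on ModelMetadata are assumptions, not necessarily what the PR does):

// Sketch: the model-backed config resolves dimension from model metadata,
// so callers never see an Optional or a -1 sentinel for dimension.
// Uses java.util.Optional and java.util.function.Supplier.
public class ModelMappingConfig implements KNNMappingConfig {
    private final String modelId;
    private final Supplier<ModelMetadata> modelMetadataSupplier; // resolved lazily, per the recovery discussion above

    public ModelMappingConfig(String modelId, Supplier<ModelMetadata> modelMetadataSupplier) {
        this.modelId = modelId;
        this.modelMetadataSupplier = modelMetadataSupplier;
    }

    @Override
    public Optional<String> getModelId() {
        return Optional.of(modelId);
    }

    @Override
    public int getDimension() {
        return modelMetadataSupplier.get().getDimension(); // assumes ModelMetadata exposes the dimension
    }
}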

@jmazanec15 jmazanec15 requested a review from heemin32 August 9, 2024 01:39
@jmazanec15 jmazanec15 force-pushed the field-mapper-ref branch 2 times, most recently from 5e61e3f to a85ea01, on August 9, 2024 at 03:42
heemin32 previously approved these changes Aug 9, 2024
Signed-off-by: John Mazanec <[email protected]>
@jmazanec15 jmazanec15 requested a review from heemin32 August 9, 2024 19:38
@jmazanec15 jmazanec15 merged commit 2cd57e8 into opensearch-project:main Aug 10, 2024
30 checks passed
opensearch-trigger-bot bot pushed a commit that referenced this pull request Aug 10, 2024
Refactors FieldMapper logic. It removes the LegacyFieldMapper and
replaces it with a FlatFieldMapper. The FlatFieldMapper's role is to
create fields that do not build ANN indices.

Additionally, it puts dimension, model_id, and knn_method_context in a
new KNNMappingConfig class and adds some safety checks around
accessing them. This should make calling logic easier to handle.

Lastly, it cleans up the parsing so that there isn't encoder parsing
directly in the KNNVectorFieldMapper.

Signed-off-by: John Mazanec <[email protected]>
(cherry picked from commit 2cd57e8)
akashsha1 pushed a commit to akashsha1/k-NN that referenced this pull request Sep 16, 2024
(Same commit message as the backport above; additionally signed off by Akash Shankaran.)