Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix map type validation issue in processors #687

Merged
merged 12 commits into from
Jun 5, 2024

Conversation

zane-neo
Copy link
Collaborator

@zane-neo zane-neo commented Apr 11, 2024

Description

When user uses map type configuration in processors, the validation will validate all other fields instead of only the configured fields, sometimes if other fields has unsupported types, user will get exception which is not expected.

Multiple processors uses similar validation logic but code are duplicated, in this PR a new Util is been added to extract the duplicated code to a common place and reused by different processors.

Issues Resolved

opensearch-project/ml-commons#2309

Check List

  • New functionality includes testing.
    • All tests pass
  • New functionality has been documented.
    • New functionality has javadoc added
  • Commits are signed as per the DCO using --signoff

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Copy link
Contributor

@chishui chishui left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the work to refactor and reduce duplicate code

CHANGELOG.md Outdated
@@ -21,6 +21,7 @@ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
- Allowing execution of hybrid query on index alias with filters ([#670](https://github.com/opensearch-project/neural-search/pull/670))
### Bug Fixes
- Add support for request_cache flag in hybrid query ([#663](https://github.com/opensearch-project/neural-search/pull/663))
- Fix may type validation issue in multiple pipeline processors ([#661](https://github.com/opensearch-project/neural-search/pull/661))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo: Fix "map"

@@ -107,12 +109,12 @@ public IngestDocument execute(IngestDocument ingestDocument) throws Exception {
public void execute(IngestDocument ingestDocument, BiConsumer<IngestDocument, Exception> handler) {
try {
validateEmbeddingFieldsValue(ingestDocument);
Map<String, Object> ProcessMap = buildMapWithProcessorKeyAndOriginalValue(ingestDocument);
List<String> inferenceList = createInferenceList(ProcessMap);
Map<String, Object> processMap = buildMapWithTargetKeyAndOriginalValue(ingestDocument);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍, I also wanted to change it

import java.util.Map;
import java.util.Objects;

public class ProcessorDocumentUtils {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add javadoc for better readability


public class ProcessorDocumentUtils {

public static long getMaxDepth(Map<String, Object> sourceAndMetadataMap, ClusterService clusterService, Environment environment) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same, java doc for public function

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This methd is only used for depth check parameter in function validateMapTypeValue. How about making this method private?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Make sense, I was trying to reduce the method's parameter so split them to different methods, but making it a private method is the correct direction.

final String sourceKey,
final Map<String, Object> sourceValue,
final Object fieldMap,
final int depth,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why depth is int but maxDepth is long?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

long is from OpenSearch Core(I don't think it make sense BTW), but I've changed the int to long to fit it.


private static void validateDepth(String sourceKey, int depth, long maxDepth) {
if (depth > maxDepth) {
throw new IllegalArgumentException("map type field [" + sourceKey + "] reached max depth limit, cannot process it");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"reaches"?

boolean allowEmpty
) {
validateDepth(sourceKey, depth, maxDepth);
if (sourceValue == null || sourceValue.isEmpty()) return;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: CollectionUtils.isEmpty() can check both

throw new IllegalArgumentException("list type field [" + sourceKey + "] has non string value, cannot process it");
} else {
if (element == null) {
throw new IllegalArgumentException("list type field [" + sourceKey + "] has null, cannot process it");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

when you check if (firstNonNullElement instanceof List) and else if (firstNonNullElement instanceof Map), element could also be null, but no exception thrown, the logic seems inconsistent. Why we check both firstNonNullElement and element, it's a bit confusing

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Make sense, I'll change this part to follow a consistent principle: for list type, we don't allow null element in it.

import java.util.Map;
import java.util.Objects;

public class ProcessorDocumentUtils {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we add UT for this class?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In fact, these code are extracted from old code and there're already a lot of tests to cover these code, from the code coverage ran from local, all lines are covered already.

@yuye-aws
Copy link
Member

Happy to see that this PR extracts all validation function into file src/main/java/org/opensearch/neuralsearch/util/ProcessorDocumentUtils.java? Currently, all the functions except getMaxDepth is about validation, can we simply rename this file to DocumentValidator?

@yuye-aws
Copy link
Member

We still have parsing methods like buildMapWithTargetKeyAndOriginalValue and createInferenceList. Can we also extract them into another class named "DocumentParser"?


public class ProcessorDocumentUtils {

public static long getMaxDepth(Map<String, Object> sourceAndMetadataMap, ClusterService clusterService, Environment environment) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This methd is only used for depth check parameter in function validateMapTypeValue. How about making this method private?

@@ -600,7 +619,7 @@ public void testExecute_withFixedTokenLength_andFieldMapNestedMap_thenFail() {
() -> processor.execute(ingestDocument)
);
assertEquals(
String.format(Locale.ROOT, "map type field [%s] has non-string type, cannot process it", INPUT_NESTED_FIELD_KEY),
"[body] configuration doesn't match actual value type, configuration type is: java.lang.String, actual value type is: java.util.ImmutableCollections$Map1",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please use String.format()

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The error message is deterministic not dynamic, I don't think we need to use string.format. Also, use plain text is a little bit more straightforward than string.format.

Comment on lines +660 to +921
assertEquals(
"[body] configuration doesn't match actual value type, configuration type is: java.lang.String, actual value type is: com.google.common.collect.RegularImmutableMap",
illegalArgumentException.getMessage()
);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

String.fromat()

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same above.

@@ -630,15 +649,18 @@ public void testExecute_withFixedTokenLength_andFieldMapNestedMap_sourceDataList
}

@SneakyThrows
public void testExecute_withFixedTokenLength_andSourceDataListWithHybridType_thenSucceed() {
public void testExecute_withFixedTokenLength_andSourceDataListWithHybridType_thenFail() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why this test case should fail?

Copy link
Collaborator Author

@zane-neo zane-neo Apr 15, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Current code we didn't have bi-directional validation of document source and configuration map. A map type configuration can't fit both string and map type value, e.g.

{
  "field_map": {
    "a": {
      "b": {
        "c": "c_target"
      }
    }
  }
}

Can not fit to both(case 1)

{
  "a": {
    "b": {
      "c": "text to embedding/chunk"
    }
  }
}

and(case 2)

{
  "a": {
    "b": "text to embedding/chunk"
  }
}

For case 2, text_embedding and text_image_embedding processor needs to build a temporary map with target key, and b's type will be treated as map based on configuration and then cause class cast exception.
For text chunking, current code doesn't throw exception but the result is not correct, based on user's configuration, c is the target chunking field but the actual chunking field in this case is b, and the generated result is:

{
  "c_target": "b's chunking result"
}

Which is unexpected.

So now we enforced the bi-directional to avoid this case, and the base principle we'll follow in the future is: list type always have same type/data structure object in it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree with list type always have same type/data structure object in it. However, it maybe a breaking change for not supporting hybrid list. Users previously pass in hybrid input in a list will now receive an error. Can we avoid throwing an error the user?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Breaking change means an expected behavior changed, like above mentioned, this in fact is a bug since the result is not what user expected, by adding this validation user can realize the issue and fix their inputs.

@zane-neo
Copy link
Collaborator Author

Happy to see that this PR extracts all validation function into file src/main/java/org/opensearch/neuralsearch/util/ProcessorDocumentUtils.java? Currently, all the functions except getMaxDepth is about validation, can we simply rename this file to DocumentValidator?

No, this file will be adding more and more common methods like parsing data structure etc, so will name as what it is now.

@zane-neo
Copy link
Collaborator Author

We still have parsing methods like buildMapWithTargetKeyAndOriginalValue and createInferenceList. Can we also extract them into another class named "DocumentParser"?

We will supports more complex cases in the future and do this extraction that time, we'll put these common code in the new class added this time instead of creating another class.

@yuye-aws
Copy link
Member

We still have parsing methods like buildMapWithTargetKeyAndOriginalValue and createInferenceList. Can we also extract them into another class named "DocumentParser"?

We will supports more complex cases in the future and do this extraction that time, we'll put these common code in the new class added this time instead of creating another class.

Makes sense.

@zane-neo
Copy link
Collaborator Author

The BWC test failure is not caused by this change and there's already an issue tracking it: #688

@vibrantvarun
Copy link
Member

Why BWC tests are failing in the build with the same issue?

@yuye-aws
Copy link
Member

This PR looks good to me

@vibrantvarun
Copy link
Member

Why are the BWC tests failing @yuye-aws ?

@yuye-aws
Copy link
Member

yuye-aws commented May 1, 2024

Hi @vibrantvarun ! The BWC tests now get passed. What's our next step plan to merge this PR?

@zane-neo zane-neo force-pushed the fix-map-type-validation branch from 84cc5ce to bb82974 Compare May 27, 2024 08:14
Copy link
Member

@martin-gaievski martin-gaievski left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall code looks good, few minor comments for formatting and style:

  • correct all error message formatting for exceptions, use String.format
  • check for single liners if, please use full form with a method inside if in curly braces

allowEmpty
);
} else if (!(nextSourceValue instanceof String)) {
throw new IllegalArgumentException("map type field [" + key + "] is neither string nor nested type, cannot process it");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please use String.format with params instead of direct strings concatenation for exception error message. You have used it in this class line 65-71

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why default() locale? We typically use Locale.ROOT, it refers to empty location rather then picking up locale of OS JVM process on data node.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Signed-off-by: zane-neo <[email protected]>
@zane-neo zane-neo force-pushed the fix-map-type-validation branch from 57306bf to 85cd1ec Compare June 4, 2024 01:15
Signed-off-by: zane-neo <[email protected]>
@zane-neo zane-neo merged commit 54ac672 into opensearch-project:main Jun 5, 2024
70 checks passed
@zane-neo zane-neo added the backport 2.x Label will add auto workflow to backport PR to 2.x branch label Jun 5, 2024
opensearch-trigger-bot bot pushed a commit that referenced this pull request Jun 5, 2024
* fix map type validation issue in processors

Signed-off-by: zane-neo <[email protected]>

* fix test failures on main branch

Signed-off-by: zane-neo <[email protected]>

* Fix potential NPE issue in chunking processor; add changee log

Signed-off-by: zane-neo <[email protected]>

* Fix failure tests

Signed-off-by: zane-neo <[email protected]>

* Address comments and add one more UT to cover uncovered line

Signed-off-by: zane-neo <[email protected]>

* Address comments

Signed-off-by: zane-neo <[email protected]>

* Add more UTs

Signed-off-by: zane-neo <[email protected]>

* fix failure ITs

Signed-off-by: zane-neo <[email protected]>

* Add public method with default depth parameter value

Signed-off-by: zane-neo <[email protected]>

* rebase latest code

Signed-off-by: zane-neo <[email protected]>

* address comments

Signed-off-by: zane-neo <[email protected]>

* address comment

Signed-off-by: zane-neo <[email protected]>

---------

Signed-off-by: zane-neo <[email protected]>
(cherry picked from commit 54ac672)
zhichao-aws pushed a commit that referenced this pull request Jun 5, 2024
* fix map type validation issue in processors

Signed-off-by: zane-neo <[email protected]>

* fix test failures on main branch

Signed-off-by: zane-neo <[email protected]>

* Fix potential NPE issue in chunking processor; add changee log

Signed-off-by: zane-neo <[email protected]>

* Fix failure tests

Signed-off-by: zane-neo <[email protected]>

* Address comments and add one more UT to cover uncovered line

Signed-off-by: zane-neo <[email protected]>

* Address comments

Signed-off-by: zane-neo <[email protected]>

* Add more UTs

Signed-off-by: zane-neo <[email protected]>

* fix failure ITs

Signed-off-by: zane-neo <[email protected]>

* Add public method with default depth parameter value

Signed-off-by: zane-neo <[email protected]>

* rebase latest code

Signed-off-by: zane-neo <[email protected]>

* address comments

Signed-off-by: zane-neo <[email protected]>

* address comment

Signed-off-by: zane-neo <[email protected]>

---------

Signed-off-by: zane-neo <[email protected]>
(cherry picked from commit 54ac672)

Co-authored-by: zane-neo <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport 2.x Label will add auto workflow to backport PR to 2.x branch
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants