fix map type validation issue in processors #687

zane-neo · 2024-04-11T11:31:23Z

Description

When a user uses a map-type configuration in processors, the validation checks all other fields in the document instead of only the configured fields. If those other fields have unsupported types, the user gets an unexpected exception.

Multiple processors use similar validation logic, but the code is duplicated. This PR adds a new util class that extracts the duplicated code into a common place, where different processors can reuse it.

Issues Resolved

opensearch-project/ml-commons#2309

Check List

New functionality includes testing.
- All tests pass
New functionality has been documented.
- New functionality has javadoc added
Commits are signed as per the DCO using --signoff

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

chishui

Thanks for the work to refactor and reduce duplicate code

chishui · 2024-04-12T08:23:54Z

CHANGELOG.md

@@ -21,6 +21,7 @@ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
 - Allowing execution of hybrid query on index alias with filters ([#670](https://github.com/opensearch-project/neural-search/pull/670))
 ### Bug Fixes
 - Add support for request_cache flag in hybrid query ([#663](https://github.com/opensearch-project/neural-search/pull/663))
+- Fix may type validation issue in multiple pipeline processors ([#661](https://github.com/opensearch-project/neural-search/pull/661))


typo: Fix "map"

chishui · 2024-04-12T08:26:13Z

src/main/java/org/opensearch/neuralsearch/processor/InferenceProcessor.java

@@ -107,12 +109,12 @@ public IngestDocument execute(IngestDocument ingestDocument) throws Exception {
    public void execute(IngestDocument ingestDocument, BiConsumer<IngestDocument, Exception> handler) {
        try {
            validateEmbeddingFieldsValue(ingestDocument);
-            Map<String, Object> ProcessMap = buildMapWithProcessorKeyAndOriginalValue(ingestDocument);
-            List<String> inferenceList = createInferenceList(ProcessMap);
+            Map<String, Object> processMap = buildMapWithTargetKeyAndOriginalValue(ingestDocument);


👍, I also wanted to change it

chishui · 2024-04-12T08:34:49Z

src/main/java/org/opensearch/neuralsearch/util/ProcessorDocumentUtils.java

+import java.util.Map;
+import java.util.Objects;
+
+public class ProcessorDocumentUtils {


Please add javadoc for better readability

chishui · 2024-04-12T08:35:27Z

src/main/java/org/opensearch/neuralsearch/util/ProcessorDocumentUtils.java

+
+public class ProcessorDocumentUtils {
+
+    public static long getMaxDepth(Map<String, Object> sourceAndMetadataMap, ClusterService clusterService, Environment environment) {


same, java doc for public function

This method is only used for the depth check parameter in function validateMapTypeValue. How about making this method private?

Makes sense. I was trying to reduce the number of method parameters, so I split the logic into different methods, but making it a private method is the right direction.

chishui · 2024-04-12T08:39:36Z

src/main/java/org/opensearch/neuralsearch/util/ProcessorDocumentUtils.java

+        final String sourceKey,
+        final Map<String, Object> sourceValue,
+        final Object fieldMap,
+        final int depth,


Why is depth an int but maxDepth a long?

The long comes from OpenSearch Core (I don't think it makes sense, BTW), but I've changed the int to long to match it.
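The depth check discussed here can be sketched as follows. This is only an illustration, not the PR's code: the class name `DepthCheckSketch` and the recursive `validate` helper are hypothetical, and `maxDepth` is passed in directly instead of being read from cluster settings. Declaring `depth` as a long keeps the comparison in a single type.

```java
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch; class and method names are illustrative, not from the PR.
final class DepthCheckSketch {

    // Throws when the current nesting depth exceeds the configured limit.
    static void validateDepth(String sourceKey, long depth, long maxDepth) {
        if (depth > maxDepth) {
            throw new IllegalArgumentException(
                String.format(Locale.ROOT, "map type field [%s] reached max depth limit, cannot process it", sourceKey)
            );
        }
    }

    // Walks nested maps and checks the depth at every level.
    @SuppressWarnings("unchecked")
    static void validate(String sourceKey, Map<String, Object> value, long depth, long maxDepth) {
        validateDepth(sourceKey, depth, maxDepth);
        for (Map.Entry<String, Object> entry : value.entrySet()) {
            if (entry.getValue() instanceof Map) {
                validate(entry.getKey(), (Map<String, Object>) entry.getValue(), depth + 1, maxDepth);
            }
        }
    }

    public static void main(String[] args) {
        Map<String, Object> doc = Map.of("a", Map.of("b", Map.of("c", "text")));
        validate("root", doc, 1, 3); // three levels of nesting, within the assumed limit of 3
        System.out.println("validation passed");
    }
}
```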

chishui · 2024-04-12T08:40:24Z

src/main/java/org/opensearch/neuralsearch/util/ProcessorDocumentUtils.java

+
+    private static void validateDepth(String sourceKey, int depth, long maxDepth) {
+        if (depth > maxDepth) {
+            throw new IllegalArgumentException("map type field [" + sourceKey + "] reached max depth limit, cannot process it");


chishui · 2024-04-12T08:45:33Z

src/main/java/org/opensearch/neuralsearch/util/ProcessorDocumentUtils.java

+        boolean allowEmpty
+    ) {
+        validateDepth(sourceKey, depth, maxDepth);
+        if (sourceValue == null || sourceValue.isEmpty()) return;


nit: CollectionUtils.isEmpty() can check both

chishui · 2024-04-12T09:16:58Z

src/main/java/org/opensearch/neuralsearch/util/ProcessorDocumentUtils.java

+                throw new IllegalArgumentException("list type field [" + sourceKey + "] has non string value, cannot process it");
+            } else {
+                if (element == null) {
+                    throw new IllegalArgumentException("list type field [" + sourceKey + "] has null, cannot process it");


When you check if (firstNonNullElement instanceof List) and else if (firstNonNullElement instanceof Map), element could also be null, but no exception is thrown, so the logic seems inconsistent. Why do we check both firstNonNullElement and element? It's a bit confusing.

Makes sense. I'll change this part to follow a consistent principle: for list types, we don't allow null elements.
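A minimal sketch of that principle is shown below, assuming a flat list of strings. The class name `ListValidationSketch` is hypothetical, and the real method in the PR also has to handle nested lists and maps, which this sketch leaves out. The error messages are the ones from the diff above.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical sketch; not the PR's implementation.
final class ListValidationSketch {

    // Rejects any null or non-string element, so every element is checked by the same rule.
    static void validateListTypeValue(String sourceKey, List<?> values) {
        for (Object element : values) {
            if (element == null) {
                throw new IllegalArgumentException(
                    String.format(Locale.ROOT, "list type field [%s] has null, cannot process it", sourceKey)
                );
            }
            if (!(element instanceof String)) {
                throw new IllegalArgumentException(
                    String.format(Locale.ROOT, "list type field [%s] has non string value, cannot process it", sourceKey)
                );
            }
        }
    }

    public static void main(String[] args) {
        validateListTypeValue("tags", List.of("a", "b")); // passes
        try {
            validateListTypeValue("tags", Arrays.asList("a", null));
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```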

chishui · 2024-04-12T09:18:25Z

src/main/java/org/opensearch/neuralsearch/util/ProcessorDocumentUtils.java

+import java.util.Map;
+import java.util.Objects;
+
+public class ProcessorDocumentUtils {


Can we add UT for this class?

In fact, this code is extracted from existing code, and there are already a lot of tests covering it. According to the code coverage run locally, all lines are already covered.

yuye-aws · 2024-04-14T14:54:45Z

Happy to see that this PR extracts all validation functions into the file src/main/java/org/opensearch/neuralsearch/util/ProcessorDocumentUtils.java. Currently, all the functions except getMaxDepth are about validation. Can we simply rename this file to DocumentValidator?

yuye-aws · 2024-04-14T14:58:09Z

We still have parsing methods like buildMapWithTargetKeyAndOriginalValue and createInferenceList. Can we also extract them into another class named "DocumentParser"?

yuye-aws · 2024-04-14T14:56:14Z

src/main/java/org/opensearch/neuralsearch/util/ProcessorDocumentUtils.java

+
+public class ProcessorDocumentUtils {
+
+    public static long getMaxDepth(Map<String, Object> sourceAndMetadataMap, ClusterService clusterService, Environment environment) {


This method is only used for the depth check parameter in function validateMapTypeValue. How about making this method private?

yuye-aws · 2024-04-14T14:59:45Z

src/test/java/org/opensearch/neuralsearch/processor/TextChunkingProcessorTests.java

@@ -600,7 +619,7 @@ public void testExecute_withFixedTokenLength_andFieldMapNestedMap_thenFail() {
            () -> processor.execute(ingestDocument)
        );
        assertEquals(
-            String.format(Locale.ROOT, "map type field [%s] has non-string type, cannot process it", INPUT_NESTED_FIELD_KEY),
+            "[body] configuration doesn't match actual value type, configuration type is: java.lang.String, actual value type is: java.util.ImmutableCollections$Map1",


Please use String.format()

The error message is deterministic, not dynamic, so I don't think we need String.format. Also, plain text is a little more straightforward than String.format.

yuye-aws · 2024-04-14T15:00:38Z

src/test/java/org/opensearch/neuralsearch/processor/TextChunkingProcessorTests.java

+        assertEquals(
+            "[body] configuration doesn't match actual value type, configuration type is: java.lang.String, actual value type is: com.google.common.collect.RegularImmutableMap",
+            illegalArgumentException.getMessage()
+        );


String.format()

Same as above.

yuye-aws · 2024-04-14T15:05:35Z

src/test/java/org/opensearch/neuralsearch/processor/TextChunkingProcessorTests.java

@@ -630,15 +649,18 @@ public void testExecute_withFixedTokenLength_andFieldMapNestedMap_sourceDataList
    }

    @SneakyThrows
-    public void testExecute_withFixedTokenLength_andSourceDataListWithHybridType_thenSucceed() {
+    public void testExecute_withFixedTokenLength_andSourceDataListWithHybridType_thenFail() {


Why should this test case fail?

The current code doesn't have bi-directional validation between the document source and the configuration map. A map-type configuration can't fit both string-type and map-type values. For example, this configuration:

{ "field_map": { "a": { "b": { "c": "c_target" } } } }

cannot fit both case 1:

{ "a": { "b": { "c": "text to embedding/chunk" } } }

and case 2:

{ "a": { "b": "text to embedding/chunk" } }

For case 2, the text_embedding and text_image_embedding processors need to build a temporary map with the target key; b is treated as a map based on the configuration, which then causes a class cast exception.
For text chunking, the current code doesn't throw an exception, but the result is incorrect: based on the user's configuration, c is the target chunking field, but the field actually chunked in this case is b, and the generated result is:

{ "c_target": "b's chunking result" }

which is unexpected.

So we now enforce bi-directional validation to avoid this case, and the basic principle we'll follow in the future is: a list always contains objects of the same type/data structure.
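The check described above might look roughly like the following sketch. It is an assumption-laden illustration rather than the PR's code: the class `FieldMapValidationSketch` and its `validate` method are hypothetical, only map and non-map values are distinguished, and list handling is omitted. Only keys present in the configuration are inspected, so unrelated fields in the document are never validated. The message format mirrors the one asserted in the updated tests.

```java
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch; not the PR's implementation.
final class FieldMapValidationSketch {

    // Compares each configured key against the source document and rejects map/non-map mismatches.
    @SuppressWarnings("unchecked")
    static void validate(Map<String, Object> fieldMap, Map<String, Object> source) {
        for (Map.Entry<String, Object> config : fieldMap.entrySet()) {
            Object actual = source.get(config.getKey());
            if (actual == null) {
                continue; // the configured field is absent from this document
            }
            boolean configIsMap = config.getValue() instanceof Map;
            boolean actualIsMap = actual instanceof Map;
            if (configIsMap != actualIsMap) {
                throw new IllegalArgumentException(
                    String.format(
                        Locale.ROOT,
                        "[%s] configuration doesn't match actual value type, configuration type is: %s, actual value type is: %s",
                        config.getKey(),
                        config.getValue().getClass().getName(),
                        actual.getClass().getName()
                    )
                );
            }
            if (configIsMap) {
                validate((Map<String, Object>) config.getValue(), (Map<String, Object>) actual);
            }
        }
    }
}
```

With the configuration from the example, case 1 passes, while case 2 fails at key b because the configuration expects a map and the document holds a string.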

I agree that a list should always contain objects of the same type/data structure. However, dropping support for hybrid lists may be a breaking change: users who previously passed hybrid input in a list will now receive an error. Can we avoid throwing an error to the user?

A breaking change means that an expected behavior has changed. As mentioned above, this is in fact a bug, since the result is not what the user expected. With this validation added, users can notice the issue and fix their inputs.

zane-neo · 2024-04-15T07:57:51Z

Happy to see that this PR extracts all validation functions into the file src/main/java/org/opensearch/neuralsearch/util/ProcessorDocumentUtils.java. Currently, all the functions except getMaxDepth are about validation. Can we simply rename this file to DocumentValidator?

No. More and more common methods, such as ones for parsing data structures, will be added to this file, so I'll keep the current name.

zane-neo · 2024-04-15T07:59:46Z

We still have parsing methods like buildMapWithTargetKeyAndOriginalValue and createInferenceList. Can we also extract them into another class named "DocumentParser"?

We will support more complex cases in the future and do this extraction at that time. We'll put the common code in the class added in this PR instead of creating another class.

yuye-aws · 2024-04-15T08:36:48Z

We still have parsing methods like buildMapWithTargetKeyAndOriginalValue and createInferenceList. Can we also extract them into another class named "DocumentParser"?

We will support more complex cases in the future and do this extraction at that time. We'll put the common code in the class added in this PR instead of creating another class.

Makes sense.

zane-neo · 2024-04-15T10:28:49Z

The BWC test failure is not caused by this change and there's already an issue tracking it: #688

vibrantvarun · 2024-04-17T05:41:29Z

Why are the BWC tests failing in the build with the same issue?

yuye-aws · 2024-04-17T05:43:25Z

This PR looks good to me

vibrantvarun · 2024-04-17T21:35:20Z

Why are the BWC tests failing @yuye-aws ?

yuye-aws · 2024-05-01T14:11:23Z

Hi @vibrantvarun! The BWC tests are passing now. What's the next step to get this PR merged?

martin-gaievski

Overall the code looks good. A few minor comments on formatting and style:

- Correct the formatting of all exception error messages: use String.format.
- Check for single-line if statements; please use the full form, with the body inside curly braces.

martin-gaievski · 2024-06-03T15:40:34Z

src/main/java/org/opensearch/neuralsearch/util/ProcessorDocumentUtils.java

+                        allowEmpty
+                    );
+                } else if (!(nextSourceValue instanceof String)) {
+                    throw new IllegalArgumentException("map type field [" + key + "] is neither string nor nested type, cannot process it");


Please use String.format with parameters instead of direct string concatenation for the exception error message. You have already used it in this class, lines 65-71.

Why the default() locale? We typically use Locale.ROOT, which refers to an empty (neutral) locale rather than picking up the locale of the OS or JVM process on the data node.
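For illustration, the two style requests could be combined as in the sketch below. The class `ErrorMessageSketch` and its method are hypothetical; only the message text comes from the diff above. `String.format(Locale, String, Object...)` is the standard JDK overload, and the if statement uses the full form with braces.

```java
import java.util.Locale;

// Hypothetical sketch; only the message text is taken from the diff.
final class ErrorMessageSketch {

    // Throws unless the value is a String, building the message with an explicit neutral locale.
    static void requireString(String key, Object value) {
        if (!(value instanceof String)) {
            throw new IllegalArgumentException(
                String.format(Locale.ROOT, "map type field [%s] is neither string nor nested type, cannot process it", key)
            );
        }
    }

    public static void main(String[] args) {
        requireString("title", "some text"); // passes
        try {
            requireString("title", 42);
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```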

* fix map type validation issue in processors
* fix test failures on main branch
* Fix potential NPE issue in chunking processor; add changee log
* Fix failure tests
* Address comments and add one more UT to cover uncovered line
* Address comments
* Add more UTs
* fix failure ITs
* Add public method with default depth parameter value
* rebase latest code
* address comments
* address comment

Signed-off-by: zane-neo <[email protected]>
(cherry picked from commit 54ac672)
Co-authored-by: zane-neo <[email protected]>

zane-neo requested review from heemin32, navneet1v, VijayanB, vamshin, jmazanec15, naveentatikonda, junqiu-lei, martin-gaievski, sean-zheng-amazon, model-collapse, ylwu-amzn, jngz-es, vibrantvarun and zhichao-aws as code owners April 11, 2024 11:31

chishui reviewed Apr 12, 2024

yuye-aws reviewed Apr 14, 2024

yuye-aws mentioned this pull request May 9, 2024

[BUG] Incorrect validation logic for map type in xxxProcessor #739

Closed

zane-neo force-pushed the fix-map-type-validation branch from 73596ed to af24a54 Compare May 24, 2024 01:59

martin-gaievski reviewed May 25, 2024

zhichao-aws reviewed May 27, 2024

zane-neo added 10 commits May 27, 2024 15:57

fix map type validation issue in processors

69dcd7f

Signed-off-by: zane-neo <[email protected]>

fix test failures on main branch

2ff657d

Signed-off-by: zane-neo <[email protected]>

Fix potential NPE issue in chunking processor; add changee log

cd6d7fb

Signed-off-by: zane-neo <[email protected]>

Fix failure tests

f579c64

Signed-off-by: zane-neo <[email protected]>

Address comments and add one more UT to cover uncovered line

3e5797a

Signed-off-by: zane-neo <[email protected]>

Address comments

3d75919

Signed-off-by: zane-neo <[email protected]>

Add more UTs

af531a5

Signed-off-by: zane-neo <[email protected]>

fix failure ITs

2a56672

Signed-off-by: zane-neo <[email protected]>

Add public method with default depth parameter value

2a44a4d

Signed-off-by: zane-neo <[email protected]>

rebase latest code

bb82974

Signed-off-by: zane-neo <[email protected]>

zane-neo force-pushed the fix-map-type-validation branch from 84cc5ce to bb82974 Compare May 27, 2024 08:14

zhichao-aws approved these changes May 27, 2024

martin-gaievski reviewed Jun 3, 2024

address comments

85cd1ec

Signed-off-by: zane-neo <[email protected]>

zane-neo force-pushed the fix-map-type-validation branch from 57306bf to 85cd1ec Compare June 4, 2024 01:15

address comment

ab35ae3

Signed-off-by: zane-neo <[email protected]>

martin-gaievski approved these changes Jun 5, 2024

zane-neo merged commit 54ac672 into opensearch-project:main Jun 5, 2024
70 checks passed

zane-neo added the backport 2.x label Jun 5, 2024

opensearch-trigger-bot bot mentioned this pull request Jun 5, 2024

[Backport 2.x] fix map type validation issue in processors #773

Merged

zane-neo mentioned this pull request Oct 9, 2024

[BUG] error on complex types list type field [category] has empty string, cannot process it #678

Closed

