-
Notifications
You must be signed in to change notification settings - Fork 76
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix map type validation issue in processors #687
fix map type validation issue in processors #687
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the work to refactor and reduce duplicate code
CHANGELOG.md
Outdated
@@ -21,6 +21,7 @@ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/), | |||
- Allowing execution of hybrid query on index alias with filters ([#670](https://github.com/opensearch-project/neural-search/pull/670)) | |||
### Bug Fixes | |||
- Add support for request_cache flag in hybrid query ([#663](https://github.com/opensearch-project/neural-search/pull/663)) | |||
- Fix may type validation issue in multiple pipeline processors ([#661](https://github.com/opensearch-project/neural-search/pull/661)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
typo: Fix "map"
@@ -107,12 +109,12 @@ public IngestDocument execute(IngestDocument ingestDocument) throws Exception { | |||
public void execute(IngestDocument ingestDocument, BiConsumer<IngestDocument, Exception> handler) { | |||
try { | |||
validateEmbeddingFieldsValue(ingestDocument); | |||
Map<String, Object> ProcessMap = buildMapWithProcessorKeyAndOriginalValue(ingestDocument); | |||
List<String> inferenceList = createInferenceList(ProcessMap); | |||
Map<String, Object> processMap = buildMapWithTargetKeyAndOriginalValue(ingestDocument); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍, I also wanted to change it
import java.util.Map; | ||
import java.util.Objects; | ||
|
||
public class ProcessorDocumentUtils { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add javadoc for better readability
|
||
public class ProcessorDocumentUtils { | ||
|
||
public static long getMaxDepth(Map<String, Object> sourceAndMetadataMap, ClusterService clusterService, Environment environment) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same, java doc for public function
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This methd is only used for depth check parameter in function validateMapTypeValue
. How about making this method private?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Make sense, I was trying to reduce the method's parameter so split them to different methods, but making it a private method is the correct direction.
final String sourceKey, | ||
final Map<String, Object> sourceValue, | ||
final Object fieldMap, | ||
final int depth, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why depth is int
but maxDepth is long
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
long is from OpenSearch Core(I don't think it make sense BTW), but I've changed the int to long to fit it.
|
||
private static void validateDepth(String sourceKey, int depth, long maxDepth) { | ||
if (depth > maxDepth) { | ||
throw new IllegalArgumentException("map type field [" + sourceKey + "] reached max depth limit, cannot process it"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"reaches"?
boolean allowEmpty | ||
) { | ||
validateDepth(sourceKey, depth, maxDepth); | ||
if (sourceValue == null || sourceValue.isEmpty()) return; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: CollectionUtils.isEmpty()
can check both
throw new IllegalArgumentException("list type field [" + sourceKey + "] has non string value, cannot process it"); | ||
} else { | ||
if (element == null) { | ||
throw new IllegalArgumentException("list type field [" + sourceKey + "] has null, cannot process it"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
when you check if (firstNonNullElement instanceof List)
and else if (firstNonNullElement instanceof Map)
, element
could also be null, but no exception thrown, the logic seems inconsistent. Why we check both firstNonNullElement and element, it's a bit confusing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Make sense, I'll change this part to follow a consistent principle: for list type, we don't allow null element in it.
import java.util.Map; | ||
import java.util.Objects; | ||
|
||
public class ProcessorDocumentUtils { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we add UT for this class?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In fact, these code are extracted from old code and there're already a lot of tests to cover these code, from the code coverage ran from local, all lines are covered already.
Happy to see that this PR extracts all validation function into file |
We still have parsing methods like |
|
||
public class ProcessorDocumentUtils { | ||
|
||
public static long getMaxDepth(Map<String, Object> sourceAndMetadataMap, ClusterService clusterService, Environment environment) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This methd is only used for depth check parameter in function validateMapTypeValue
. How about making this method private?
@@ -600,7 +619,7 @@ public void testExecute_withFixedTokenLength_andFieldMapNestedMap_thenFail() { | |||
() -> processor.execute(ingestDocument) | |||
); | |||
assertEquals( | |||
String.format(Locale.ROOT, "map type field [%s] has non-string type, cannot process it", INPUT_NESTED_FIELD_KEY), | |||
"[body] configuration doesn't match actual value type, configuration type is: java.lang.String, actual value type is: java.util.ImmutableCollections$Map1", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please use String.format()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The error message is deterministic not dynamic, I don't think we need to use string.format. Also, use plain text is a little bit more straightforward than string.format.
assertEquals( | ||
"[body] configuration doesn't match actual value type, configuration type is: java.lang.String, actual value type is: com.google.common.collect.RegularImmutableMap", | ||
illegalArgumentException.getMessage() | ||
); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
String.fromat()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same above.
@@ -630,15 +649,18 @@ public void testExecute_withFixedTokenLength_andFieldMapNestedMap_sourceDataList | |||
} | |||
|
|||
@SneakyThrows | |||
public void testExecute_withFixedTokenLength_andSourceDataListWithHybridType_thenSucceed() { | |||
public void testExecute_withFixedTokenLength_andSourceDataListWithHybridType_thenFail() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why this test case should fail?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Current code we didn't have bi-directional validation of document source and configuration map. A map type configuration can't fit both string and map type value, e.g.
{
"field_map": {
"a": {
"b": {
"c": "c_target"
}
}
}
}
Can not fit to both(case 1)
{
"a": {
"b": {
"c": "text to embedding/chunk"
}
}
}
and(case 2)
{
"a": {
"b": "text to embedding/chunk"
}
}
For case 2, text_embedding and text_image_embedding processor needs to build a temporary map with target key, and b
's type will be treated as map based on configuration and then cause class cast exception.
For text chunking, current code doesn't throw exception but the result is not correct, based on user's configuration, c
is the target chunking field but the actual chunking field in this case is b
, and the generated result is:
{
"c_target": "b's chunking result"
}
Which is unexpected.
So now we enforced the bi-directional to avoid this case, and the base principle we'll follow in the future is: list type always have same type/data structure object in it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree with list type always have same type/data structure object in it
. However, it maybe a breaking change for not supporting hybrid list. Users previously pass in hybrid input in a list will now receive an error. Can we avoid throwing an error the user?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Breaking change means an expected behavior changed, like above mentioned, this in fact is a bug since the result is not what user expected, by adding this validation user can realize the issue and fix their inputs.
No, this file will be adding more and more common methods like parsing data structure etc, so will name as what it is now. |
We will supports more complex cases in the future and do this extraction that time, we'll put these common code in the new class added this time instead of creating another class. |
Makes sense. |
The BWC test failure is not caused by this change and there's already an issue tracking it: #688 |
Why BWC tests are failing in the build with the same issue? |
This PR looks good to me |
Why are the BWC tests failing @yuye-aws ? |
Hi @vibrantvarun ! The BWC tests now get passed. What's our next step plan to merge this PR? |
73596ed
to
af24a54
Compare
src/main/java/org/opensearch/neuralsearch/util/ProcessorDocumentUtils.java
Show resolved
Hide resolved
src/main/java/org/opensearch/neuralsearch/util/ProcessorDocumentUtils.java
Show resolved
Hide resolved
Signed-off-by: zane-neo <[email protected]>
Signed-off-by: zane-neo <[email protected]>
Signed-off-by: zane-neo <[email protected]>
Signed-off-by: zane-neo <[email protected]>
Signed-off-by: zane-neo <[email protected]>
Signed-off-by: zane-neo <[email protected]>
Signed-off-by: zane-neo <[email protected]>
Signed-off-by: zane-neo <[email protected]>
Signed-off-by: zane-neo <[email protected]>
Signed-off-by: zane-neo <[email protected]>
84cc5ce
to
bb82974
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall code looks good, few minor comments for formatting and style:
- correct all error message formatting for exceptions, use
String.format
- check for single liners
if
, please use full form with a method insideif
in curly braces
src/main/java/org/opensearch/neuralsearch/util/ProcessorDocumentUtils.java
Outdated
Show resolved
Hide resolved
allowEmpty | ||
); | ||
} else if (!(nextSourceValue instanceof String)) { | ||
throw new IllegalArgumentException("map type field [" + key + "] is neither string nor nested type, cannot process it"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please use String.format
with params instead of direct strings concatenation for exception error message. You have used it in this class line 65-71
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why default() locale? We typically use Locale.ROOT, it refers to empty location rather then picking up locale of OS JVM process on data node.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
src/main/java/org/opensearch/neuralsearch/util/ProcessorDocumentUtils.java
Outdated
Show resolved
Hide resolved
Signed-off-by: zane-neo <[email protected]>
57306bf
to
85cd1ec
Compare
Signed-off-by: zane-neo <[email protected]>
* fix map type validation issue in processors Signed-off-by: zane-neo <[email protected]> * fix test failures on main branch Signed-off-by: zane-neo <[email protected]> * Fix potential NPE issue in chunking processor; add changee log Signed-off-by: zane-neo <[email protected]> * Fix failure tests Signed-off-by: zane-neo <[email protected]> * Address comments and add one more UT to cover uncovered line Signed-off-by: zane-neo <[email protected]> * Address comments Signed-off-by: zane-neo <[email protected]> * Add more UTs Signed-off-by: zane-neo <[email protected]> * fix failure ITs Signed-off-by: zane-neo <[email protected]> * Add public method with default depth parameter value Signed-off-by: zane-neo <[email protected]> * rebase latest code Signed-off-by: zane-neo <[email protected]> * address comments Signed-off-by: zane-neo <[email protected]> * address comment Signed-off-by: zane-neo <[email protected]> --------- Signed-off-by: zane-neo <[email protected]> (cherry picked from commit 54ac672)
* fix map type validation issue in processors Signed-off-by: zane-neo <[email protected]> * fix test failures on main branch Signed-off-by: zane-neo <[email protected]> * Fix potential NPE issue in chunking processor; add changee log Signed-off-by: zane-neo <[email protected]> * Fix failure tests Signed-off-by: zane-neo <[email protected]> * Address comments and add one more UT to cover uncovered line Signed-off-by: zane-neo <[email protected]> * Address comments Signed-off-by: zane-neo <[email protected]> * Add more UTs Signed-off-by: zane-neo <[email protected]> * fix failure ITs Signed-off-by: zane-neo <[email protected]> * Add public method with default depth parameter value Signed-off-by: zane-neo <[email protected]> * rebase latest code Signed-off-by: zane-neo <[email protected]> * address comments Signed-off-by: zane-neo <[email protected]> * address comment Signed-off-by: zane-neo <[email protected]> --------- Signed-off-by: zane-neo <[email protected]> (cherry picked from commit 54ac672) Co-authored-by: zane-neo <[email protected]>
Description
When user uses map type configuration in processors, the validation will validate all other fields instead of only the configured fields, sometimes if other fields has unsupported types, user will get exception which is not expected.
Multiple processors uses similar validation logic but code are duplicated, in this PR a new Util is been added to extract the duplicated code to a common place and reused by different processors.
Issues Resolved
opensearch-project/ml-commons#2309
Check List
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.