Improve processing OpenSearch data types. Fix using subfields for text type. #299

Merged
merged 13 commits into from
Aug 21, 2023

Conversation

Yury-Fridlyand

@Yury-Fridlyand Yury-Fridlyand commented Jul 19, 2023

Description

See doc for technical details: https://github.com/Bit-Quill/opensearch-project-sql/blob/dev-add-text-type/docs/dev/text-type.md
See also this comment describing some changes.

Issues Resolved

  1. Pass full types (instances of OpenSearchDataType) through the :core module instead of simplified ones (ExprCoreType).
    This unblocks access to important mapping info such as text fields or date formats. This info is required to build proper DSL queries to OpenSearch.
  2. Use text fields instead of hardcoded keyword subfield name.

Check List

  • New functionality includes testing.
    • All tests pass, including unit test, integration test and doctest
  • New functionality has been documented.
    • New functionality has javadoc added
    • New functionality has user manual doc added
  • Commits are signed per the DCO using --signoff

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

}
return fieldName;
// Pick first field. What to do if there are multiple fields?

Can you pass in the type like how it was with convertTextToKeyword and map that type by finding it in the list of fields?

Author

Is it needed?
This function is no longer static, so different types can override it if needed. That way we can avoid adding a new function like convertXXXtoYYY in the future.

Right. But how would it know which type to convert to? For example, doing aggregation on text with mapping

"textColumn": {
  "type": "text",
  "fields": {
    "date": {
      "type": "date"
    },
    "keyword": {
      "type": "keyword"
    }
  }
}

will do aggregation on textColumn.date. What would be expected here?

I'm with @GumpacG on this.

The keyword field is a convention in OpenSearch that means "first bit of the text", and the conversion is "ok, I guess" for legacy's sake, but in general picking the first field would lead to unexpected results that depend on the mapping.

On the other hand, if fielddata is set, then it is safe to use the textColumn field itself here.

Author

It is possible to aggregate on dates inside a text field.
I changed to find a string subfield if present in 8b0671c.
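
The behavior discussed in this thread can be sketched as follows. This is a minimal illustration under stated assumptions, not the code from commit 8b0671c: the class `TextFieldSketch`, its `fields` map (subfield name to mapping type), and the rule that only `keyword` counts as a string subfield are all hypothetical. Only the method name `convertFieldForSearchQuery` comes from the diff under review.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical stand-in for a text type; not the PR's actual class. */
final class TextFieldSketch {
  // Subfield name -> mapping type, in mapping order (e.g. "date" -> "date").
  private final Map<String, String> fields;

  TextFieldSketch(Map<String, String> fields) {
    this.fields = new LinkedHashMap<>(fields);
  }

  /** Returns the field name to use in an aggregation or term-level query. */
  String convertFieldForSearchQuery(String fieldName) {
    return fields.entrySet().stream()
        // Assumption: only "keyword" counts as a string subfield here.
        .filter(entry -> "keyword".equals(entry.getValue()))
        .map(entry -> fieldName + "." + entry.getKey())
        .findFirst()
        // No string subfield: use the text field itself.
        .orElse(fieldName);
  }

  public static void main(String[] args) {
    Map<String, String> fields = new LinkedHashMap<>();
    fields.put("date", "date");
    fields.put("keyword", "keyword");
    // The date subfield comes first in the mapping but is skipped.
    System.out.println(new TextFieldSketch(fields).convertFieldForSearchQuery("textColumn"));
  }
}
```

With the mapping quoted earlier in the thread, this sketch selects `textColumn.keyword` rather than `textColumn.date`. When no string subfield exists it returns the bare field name, and the query then depends on fielddata being enabled for that field.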

Signed-off-by: Yury-Fridlyand <[email protected]>
@Yury-Fridlyand Yury-Fridlyand marked this pull request as ready for review July 27, 2023 17:16

@acarbonetto acarbonetto left a comment

For docs/dev/img/type-hierarchy-tree-final.png:

  1. Can you split STRING into TEXT and KEYWORD?
  2. Can you align DATE and TIME?
  3. I'm not sure you want STRING --> DATE/TIME/DATETIME/TIMESTAMP, since it's a very specific set of strings that convert. I think that conversion is 'special' and doesn't need to be defined here.

docs/dev/query-type-conversion.md

## Final type hierarchy scheme

![Most relevant type hierarchy](img/type-hierarchy-tree-final.png)

There's only STRING in that listing. Should we specify TEXT vs KEYWORD there?

Author

There is no TEXT nor KEYWORD in ExprCoreType.

}

public int hashCode() {
return 42 + exprCoreType.hashCode();

Would this be considered a magic number that should be defined as a constant?

Why is this override necessary?

Author

This is needed to make OpenSearchExprValueFactory::typeActionMap work properly. Without that override, it always falls through to

throw new IllegalStateException(
String.format(
"Unsupported type: %s for value: %s.", type.typeName(), content.objectValue()));

This could be simplified to always return 0 (or any other constant), which would force an equals check every time.
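
A minimal sketch of why the override matters, assuming that `typeActionMap` is a hash-based map keyed by type (the thread does not show its declaration). `HashCodeSketch`, `FullType`, and `lookup` are hypothetical names. The point is the general Java contract: objects that compare equal must return the same hash code, otherwise a `HashMap` lookup can miss.

```java
import java.util.HashMap;
import java.util.Map;

final class HashCodeSketch {
  /** Hypothetical full type: equality looks at the core type only. */
  static final class FullType {
    final String coreType; // plays the role of exprCoreType
    final String format;   // extra mapping info, ignored by equals()

    FullType(String coreType, String format) {
      this.coreType = coreType;
      this.format = format;
    }

    @Override
    public boolean equals(Object o) {
      return o instanceof FullType && coreType.equals(((FullType) o).coreType);
    }

    @Override
    public int hashCode() {
      // Must agree with equals(). HashMap compares hash codes first, so with
      // the default identity hash it may never get as far as calling equals().
      return 42 + coreType.hashCode();
    }
  }

  /** Mirrors the lookup-or-throw shape quoted above. */
  static String lookup(Map<FullType, String> actions, FullType type) {
    String action = actions.get(type);
    if (action == null) {
      throw new IllegalStateException("Unsupported type: " + type.coreType);
    }
    return action;
  }

  public static void main(String[] args) {
    Map<FullType, String> actions = new HashMap<>();
    actions.put(new FullType("TIMESTAMP", ""), "parse timestamp");
    // A different instance with a different format still finds the entry.
    System.out.println(lookup(actions, new FullType("TIMESTAMP", "epoch_millis")));
  }
}
```

Returning a constant, as suggested above, would also satisfy the contract. It sends every key to the same bucket, so each lookup is decided by `equals` alone, at the cost of slower lookups in large maps.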


## Solution

The solution is to provide to `:core` non simplified types, but full types. Those objects should be fully compatible with `ExprCoreType` and implement all required APIs to allow `:core` to manipulate with built-in functions. Once those type objects are returned back to `:opensearch`, it can get all required information to build the correct search request.

non simplified types: enum
full types: Objects
right?
Can we just say that?

Author

simplified types: enum
full types: Objects

Previously, full types were converted to enums before being passed from :opensearch to :core. With my changes, full types are passed from :opensearch to :core, and :core uses an API call to convert them to an enum value wherever it is needed (to pick the proper function signature).
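
The flow described here can be sketched roughly as below. Everything in the block is hypothetical: `CoreType` stands in for ExprCoreType, `DateType` for a full type, and `toCoreType()` for the API call mentioned in the comment, whose real name is not given in this thread.

```java
final class FullTypeSketch {
  /** Simplified types: a plain enum, standing in for ExprCoreType. */
  enum CoreType { STRING, TIMESTAMP }

  /** Hypothetical common interface; the method name is an assumption. */
  interface Type {
    CoreType toCoreType();
  }

  /** Full type: an object that also carries mapping details. */
  static final class DateType implements Type {
    final String format;

    DateType(String format) {
      this.format = format;
    }

    @Override
    public CoreType toCoreType() {
      return CoreType.TIMESTAMP;
    }
  }

  /** ":core" side: only needs the enum value to pick a function signature. */
  static String resolveSignature(String functionName, Type argument) {
    return functionName + "(" + argument.toCoreType() + ")";
  }

  /** ":opensearch" side: gets the same object back and can read the mapping details. */
  static String formatForRequest(Type type) {
    return type instanceof DateType ? ((DateType) type).format : "";
  }

  public static void main(String[] args) {
    Type type = new DateType("epoch_millis");
    System.out.println(resolveSignature("year", type));
    System.out.println(formatForRequest(type));
  }
}
```

The object survives the round trip through the core layer unchanged, so mapping details such as the date format are still available when the search request is built.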

OpenSearchDataType.of(MappingType.GeoPoint)),
() -> assertNotEquals(OpenSearchDataType.of(MappingType.GeoPoint),
OpenSearchDataType.of(MappingType.Ip)),
() -> assertEquals(OpenSearchDataType.of(STRING), OpenSearchDataType.of(STRING)),

I don't understand the purpose of this test

Author

c for coverage
I had to add 4 tests to satisfy JaCoCo's caprice for line 42 and 4 more tests for line 43

if (mappingType != null && other.mappingType != null) {
return mappingType.equals(other.mappingType) && exprCoreType.equals(other.exprCoreType);

Signed-off-by: Yury-Fridlyand <[email protected]>

@MaxKsyunz MaxKsyunz left a comment

Have you considered creating one class hierarchy for types? Instead of ExprCoreTypes being enum values, make them classes and derive from each other as appropriate.

Singleton instances can still be used for types that do not have parameters, like ints, keyword, etc.
This would simplify a lot of the type comparison logic.

@@ -21,7 +20,8 @@ public interface ExprType {
  * Is compatible with other types.
  */
 default boolean isCompatible(ExprType other) {
-if (this.equals(other)) {
+// Do double direction check with `equals`, because a derived class may override it
+if (this.equals(other) || other.equals(this)) {

By definition, if this.equals(other) then other.equals(this) must be true.

Do we have ExprTypes for which this is necessary? If yes, the problem is there.

Author

other may be an instance of OpenSearchDataType, which has more complex comparison logic.
I have an idea how to fix it, will do soon.

Author

Fixed in b04a92e.
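
The asymmetry raised in this thread can be reproduced with a small sketch. It assumes a hypothetical `FullType` whose `equals` also accepts a bare enum constant, and it does not show how commit b04a92e resolved the issue. An enum's `equals` is final and compares identity, so the reverse comparison can never succeed.

```java
final class EqualsSymmetrySketch {
  /** Stand-in for the enum side; enum equals() is identity and cannot be overridden. */
  enum CoreType { STRING }

  /** Hypothetical full type with a more permissive equals(). */
  static final class FullType {
    final CoreType core;

    FullType(CoreType core) {
      this.core = core;
    }

    @Override
    public boolean equals(Object o) {
      if (o instanceof FullType) {
        return core == ((FullType) o).core;
      }
      // Also treats the bare enum constant as equal: this breaks symmetry.
      return core == o;
    }

    @Override
    public int hashCode() {
      return core.hashCode();
    }
  }

  /** The double-direction check from the diff, which hides the asymmetry. */
  static boolean isCompatible(Object a, Object b) {
    return a.equals(b) || b.equals(a);
  }

  public static void main(String[] args) {
    FullType full = new FullType(CoreType.STRING);
    System.out.println(full.equals(CoreType.STRING));        // true
    System.out.println(CoreType.STRING.equals(full));        // false
    System.out.println(isCompatible(CoreType.STRING, full)); // true
  }
}
```

Checking both directions makes the call site work, but the `equals` contract in `java.lang.Object` requires symmetry, so collections and other library code that call `equals` in only one direction may still behave inconsistently.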

@@ -36,8 +36,8 @@ public void test_numeric_data_types() throws IOException {
 schema("byte_number", "byte"),
 schema("double_number", "double"),
 schema("float_number", "float"),
-schema("half_float_number", "float"),
-schema("scaled_float_number", "double"));
+schema("half_float_number", "half_float"),

Why is this change necessary?

Author

This is caused by the changes described in #299 (comment)

Comment on lines +54 to +55
schema("object_value", "object"),
schema("nested_value", "nested"),

Why is this change necessary?

Author

This is caused by the changes described in #299 (comment)

@@ -56,19 +56,18 @@ public void typeof_opensearch_types() throws IOException {
 + " | fields `double`, `long`, `integer`, `byte`, `short`, `float`, `half_float`, `scaled_float`",
 TEST_INDEX_DATATYPE_NUMERIC));
 verifyDataRows(response,
-rows("DOUBLE", "LONG", "INTEGER", "BYTE", "SHORT", "FLOAT", "FLOAT", "DOUBLE"));
+rows("DOUBLE", "LONG", "INTEGER", "BYTE", "SHORT", "FLOAT", "HALF_FLOAT", "SCALED_FLOAT"));

How does this relate to adding text type?

Author

This is caused by the changes described in #299 (comment)

Comment on lines +97 to +100
 () -> assertEquals("TIMESTAMP", defaultDateType.typeName()),
 () -> assertEquals("TIME", timeDateType.typeName()),
 () -> assertEquals("DATE", dateDateType.typeName()),
-() -> assertEquals("DATE", datetimeDateType.typeName())
+() -> assertEquals("TIMESTAMP", datetimeDateType.typeName())

This looks very unrelated to adding text type.

Author

Before

  1. OpenSearchDateType (OSDT) was converted to a simplified type when passed to the :core module. More precisely, the ExprCoreType stored inside the OSDT was extracted. Values were DATE/TIME/etc.
  2. ExprCoreType names were used to build the schema in QueryResponse, which was later serialized and sent to the user. The legacyTypeName method of ExprType was used for SQL responses and typeName for PPL ones.

In the middle

  1. OSDT isn't converted
  2. Same methods of OSDT return mappingType which is always date regardless of detected ExprCoreType for this field.

Finally

  1. -//-
  2. OpenSearchDateType overrides these methods to return ExprCoreType
  3. No changes for a user!
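
The three stages above can be illustrated with a minimal sketch. The classes `DataType` and `DateType` and the upper-casing of the mapping type name are assumptions made for illustration, not the actual OpenSearchDataType or OpenSearchDateType code.

```java
import java.util.Locale;

final class TypeNameSketch {
  /** Stand-in for ExprCoreType. */
  enum CoreType { FLOAT, DATE, TIME, TIMESTAMP }

  /** Hypothetical full type: reports the name of its mapping type by default. */
  static class DataType {
    final String mappingType;
    final CoreType coreType;

    DataType(String mappingType, CoreType coreType) {
      this.mappingType = mappingType;
      this.coreType = coreType;
    }

    String typeName() {
      return mappingType.toUpperCase(Locale.ROOT);
    }
  }

  /** Hypothetical date type: the mapping type is always "date", so it reports the detected core type. */
  static final class DateType extends DataType {
    DateType(CoreType detected) {
      super("date", detected);
    }

    @Override
    String typeName() {
      return coreType.name();
    }
  }

  public static void main(String[] args) {
    // "In the middle": the mapping type name wins, whatever was detected.
    System.out.println(new DataType("date", CoreType.TIME).typeName());
    // "Finally": the override restores the name users saw before.
    System.out.println(new DateType(CoreType.TIME).typeName());
    // Other types now expose their mapping type name.
    System.out.println(new DataType("half_float", CoreType.FLOAT).typeName());
  }
}
```

In this sketch only the date type needs the override, because it is the only case where one mapping type covers several core types.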

Comment on lines +68 to +80
/**
* Perform field name conversion if needed before inserting it into a search query.
*/
default String convertFieldForSearchQuery(String fieldName) {
return fieldName;
}

/**
* Perform value conversion if needed before inserting it into a search query.
*/
default Object convertValueForSearchQuery(ExprValue value) {
return value.value();
}

It'd be more appropriate for these to be on OpenSearchDataType since they are specific to how we communicate with OpenSearch.

Author

Unfortunately, where it is used, ExprType is referenced. I'd like to avoid excessive refactoring there.

public class LikeQuery extends LuceneQuery {
@Override
public QueryBuilder doBuild(String fieldName, ExprType fieldType, ExprValue literal) {
String field = OpenSearchTextType.convertTextToKeyword(fieldName, fieldType);

Any ideas how to do it gracefully?

@@ -163,8 +208,8 @@ public static OpenSearchDataType of(MappingType mappingType, Map<String, Object>
case Ip: return OpenSearchIpType.of();
case Date:
// Default date formatter is used when "" is passed as the second parameter
String format = (String) innerMap.getOrDefault("format", "");
return OpenSearchDateType.of(format);
return innerMap.isEmpty() ? OpenSearchDateType.of()

Why change this?

Author

This simplifies creation of OpenSearchDateType. A format string passes a number of checks even when it is empty.

return fieldName + ".keyword";
@Override
public String convertFieldForSearchQuery(String fieldName) {
if (fields.size() == 0) {

In this case the user will end up with OpenSearch error about not being able to aggregate on text. Do I get that right?

Author

Yes!

@Yury-Fridlyand
Author

For docs/dev/img/type-hierarchy-tree-final.png:

...

@acarbonetto
I listed all possible widenings of the current ExprCoreType values, according to the code.

Signed-off-by: Yury-Fridlyand <[email protected]>
@Yury-Fridlyand Yury-Fridlyand changed the title from "Add text type." to "Improve processing OpenSearch data types. Fix using subfields for text type." Aug 15, 2023
@Yury-Fridlyand Yury-Fridlyand merged commit 96285bf into integ-add-text-type Aug 21, 2023
18 of 20 checks passed
@Yury-Fridlyand Yury-Fridlyand deleted the dev-add-text-type branch August 21, 2023 17:45