-
Notifications
You must be signed in to change notification settings - Fork 0
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improve processing OpenSearch data types. Fix using subfields for text
type.
#299
Changes from 7 commits
2136ae8
bb9f30e
33370ff
5b2a659
ccf9138
0339320
e885a44
93e9a34
d3b66ac
5232ad2
8b0671c
b04a92e
efcb86e
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
GumpacG marked this conversation as resolved.
Show resolved
Hide resolved
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -9,7 +9,7 @@ Type conversion means conversion from one data type to another which has two asp | |
1. Whether the conversion is implicit or explicit (implicit conversion is often called coercion) | ||
2. Whether the data is converted within the family or reinterpreted as another data type outside | ||
|
||
It’s common that strong typed language only supports little implicit conversions and no data reinterpretation. While languages with weak typing allows many implicit conversions and flexible reinterpretation. | ||
It's common that strong typed language only supports little implicit conversions and no data reinterpretation. While languages with weak typing allows many implicit conversions and flexible reinterpretation. | ||
|
||
### 1.2 Problem Statement | ||
|
||
|
@@ -30,8 +30,8 @@ However, more general conversions for non-numeric types are missing, such as con | |
|
||
The common use case and motivation include: | ||
|
||
1. *User-friendly*: Although it doesn’t matter for application or BI tool which can always follow the strict grammar rule, it’s more friendly and accessible to human by implicit type conversion, ex. `date > DATE('2020-06-01') => date > '2020-06-01'` | ||
2. *Schema-on-read*: More importantly, implicit conversion from string is required for schema on read (stored as raw string on write and extract field(s) on read), ex. `regex ‘...’ | abs(a)` | ||
1. *User-friendly*: Although it doesn't matter for application or BI tool which can always follow the strict grammar rule, it's more friendly and accessible to human by implicit type conversion, ex. `date > DATE('2020-06-01') => date > '2020-06-01'` | ||
2. *Schema-on-read*: More importantly, implicit conversion from string is required for schema on read (stored as raw string on write and extract field(s) on read), ex. `regex ‘...' | abs(a)` | ||
Yury-Fridlyand marked this conversation as resolved.
Show resolved
Hide resolved
|
||
|
||
### 2.2 Functionalities | ||
|
||
|
@@ -50,7 +50,7 @@ Future: | |
|
||
### 3.1 Type Precedence | ||
|
||
Type precedence determines the direction of conversion when fields involved in an expression has different type from resolved signature. Before introducing it into our type system, let’s check how an expression is resolved to a function implementation and why type precedence is required. | ||
Type precedence determines the direction of conversion when fields involved in an expression has different type from resolved signature. Before introducing it into our type system, let's check how an expression is resolved to a function implementation and why type precedence is required. | ||
|
||
``` | ||
Compiling time: | ||
|
@@ -60,7 +60,7 @@ Compiling time: | |
Function builder: returns equal(DOUBLE, DOUBLE) impl | ||
``` | ||
|
||
Now let’s follow the same idea to add support for conversion from `BOOLEAN` to `STRING`. Because all boolean values can be converted to a string (in other word string is “wider”), String type is made the parent of Boolean. However, this leads to wrong semantic as the following expression `false = ‘FALSE’` for example: | ||
Now let's follow the same idea to add support for conversion from `BOOLEAN` to `STRING`. Because all boolean values can be converted to a string (in other word string is 'wider'), String type is made the parent of Boolean. However, this leads to wrong semantic as the following expression `false = 'FALSE'` for example: | ||
|
||
``` | ||
Compiling time: | ||
|
@@ -74,17 +74,18 @@ Runtime: | |
Evaluation result: *false* | ||
``` | ||
|
||
Therefore type precedence is supposed to be defined based on semantic expected rather than intuitive “width” of type. Now let’s reverse the direction and make Boolean the parent of String type. | ||
Therefore type precedence is supposed to be defined based on semantic expected rather than intuitive 'width' of type. Now let's reverse the direction and make Boolean the parent of String type. | ||
|
||
![New type hierarchy](img/type-hierarchy-tree-with-implicit-cast.png) | ||
Note: type hierarchy structure shown on the picture below was implemented in [#166](https://github.com/opensearch-project/sql/pull/166), but was changed later. | ||
|
||
``` | ||
Compiling time: | ||
Expression: false = 'FALSE' | ||
Unresolved signature: equal(BOOL, STRING) | ||
Resovled signature: equal(BOOL, BOOL) | ||
Function builder: 1) returns equal(BOOL, cast_to_bool(STRING)) impl | ||
2) returns equal(BOOL, BOOL) impl | ||
1) returns equal(BOOL, BOOL) impl | ||
Yury-Fridlyand marked this conversation as resolved.
Show resolved
Hide resolved
|
||
Runtime: | ||
equal impl: false.equals(cast_to_bool('FALSE')) | ||
cast_to_bool impl: Boolean.valueOf('FALSE') | ||
|
@@ -147,15 +148,15 @@ public enum ExprCoreType implements ExprType { | |
DOUBLE(FLOAT), | ||
|
||
STRING(UNDEFINED), | ||
BOOLEAN(STRING), // PR: change STRING's parent to BOOLEAN | ||
BOOLEAN(STRING), // #166 changes: STRING's parent to BOOLEAN | ||
Yury-Fridlyand marked this conversation as resolved.
Show resolved
Hide resolved
|
||
|
||
/** | ||
* Date. | ||
*/ | ||
TIMESTAMP(UNDEFINED), | ||
DATE(UNDEFINED), | ||
TIME(UNDEFINED), | ||
DATETIME(UNDEFINED), | ||
DATE(STRING), // #171 changes: datetime types parent to STRING | ||
GumpacG marked this conversation as resolved.
Show resolved
Hide resolved
|
||
TIME(STRING), | ||
DATETIME(STRING, DATE, TIME), // #1196 changes: extend DATETIME and TIMESTAMP parent list | ||
TIMESTAMP(STRING, DATETIME), | ||
INTERVAL(UNDEFINED), | ||
|
||
STRUCT(UNDEFINED), | ||
|
@@ -170,3 +171,12 @@ As with examples in section 3.1, the implementation is: | |
1. Define all possible conversions in CAST function family. | ||
2. Define implicit conversions by type hierarchy tree (auto implicit cast from child to parent) | ||
3. During compile time, wrap original function builder by a new one which cast arguments to target type. | ||
|
||
## Final type hierarchy scheme | ||
|
||
![Most relevant type hierarchy](img/type-hierarchy-tree-final.png) | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. There's only STRING in that listing. Should we specify TEXT vs KEYWORD there? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. There is no |
||
|
||
## References | ||
* [#166](https://github.com/opensearch-project/sql/pull/166): Add automatic `STRING` -> `BOOLEAN` cast | ||
* [#171](https://github.com/opensearch-project/sql/pull/171): Add automatic `STRING` -> `DATE`/`TIME`/`DATETIME`/`TIMESTAMP` cast | ||
* [#1196](https://github.com/opensearch-project/sql/pull/1196): Add automatic casts between datetime types `DATE`/`TIME`/`DATETIME` -> `DATE`/`TIME`/`DATETIME`/`TIMESTAMP` |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,45 @@ | ||
## History overview | ||
|
||
* [odfe#620](https://github.com/opendistro-for-elasticsearch/sql/pull/620) - added two types for `text` (just text and text with fields), querying and string functions are supported, but searching is not. | ||
* [odfe#682](https://github.com/opendistro-for-elasticsearch/sql/pull/682) and [odfe#730](https://github.com/opendistro-for-elasticsearch/sql/pull/730) - added support for querying (in 682) and aggregation (730); if `text` has sub-fields, `.keyword` prefix is added to the field name regardless of the sub-field names. | ||
* [#1314](https://github.com/opensearch-project/sql/pull/1314) and [#1664](https://github.com/opensearch-project/sql/pull/1664) - changed format of storing `text` type; it is one type now, subfield information is stored, but not yet used. | ||
* Current changes (no PR number yet) - correctly resolve sub-field name if a `text` field has only one subfield. Fixes [#1112](https://github.com/opensearch-project/sql/issues/1112) and [#1038](https://github.com/opensearch-project/sql/issues/1038). | ||
|
||
## Further changes | ||
|
||
* Support search for text sub-fields ([#1113](https://github.com/opensearch-project/sql/issues/1113)). | ||
* Support multiple sub-fields for text ([#1887](https://github.com/opensearch-project/sql/issues/1887)). | ||
* Support non-default date formats for search queries ([#1847](https://github.com/opensearch-project/sql/issues/1847)). Fix for this bug depends on the current changes. | ||
|
||
## Problem statement | ||
|
||
`:opensearch` module parses index mapping and builds instances of `OpenSearchDataType` (a base class), but ships simplified types (`ExprCoreType` - a enum) to `:core` module, because `:core` uses `ExprCoreType`s to resolve functions. | ||
|
||
```mermaid | ||
sequenceDiagram | ||
participant core as :core | ||
participant opensearch as :opensearch | ||
|
||
core ->>+ opensearch : resolve types | ||
opensearch ->>- core : simplified types | ||
note over core: preparing query | ||
core ->> opensearch : translate request into datasource DSL | ||
``` | ||
|
||
Later, `:core` returns to `:opensearch` the DSL request with types stored. Since types were simplified, all mapping information is lost. Adding new `TEXT` entry to `ExprCoreType` enum is not enough, because `ExprCoreType` is datasource agnostic and can't store any specific mapping info. | ||
|
||
## Solution | ||
|
||
The solution is to provide to `:core` non simplified types, but full types. Those objects should be fully compatible with `ExprCoreType` and implement all required APIs to allow `:core` to manipulate with built-in functions. Once those type objects are returned back to `:opensearch`, it can get all required information to build the correct search request. | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. non simplified types: There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. simplified types: Before full types were converted to enums before passing from |
||
|
||
1. Pass full (non simplified) types to and through `:core`. | ||
2. Update `OpenSearchDataType` (and inheritors if needed) to be comparable with `ExprCoreType`. | ||
3. Update `:core` to do proper comparison (use `.equals` instead of `==`). | ||
5. Update `:opensearch` to use the mapping information received from `:core` and properly build the search query. | ||
|
||
## Type Schema | ||
|
||
| JDBC type | `ExprCoreType` | `OpenSearchDataType` | OpenSearch type | | ||
| --- | --- | --- | --- | | ||
| `VARCHAR`/`CHAR` | `STRING` | -- | `keyword` | | ||
| `LONGVARCHAR`/`TEXT` | `STRING` | `OpenSearchTextType` | `text` | |
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -36,8 +36,8 @@ public void test_numeric_data_types() throws IOException { | |
schema("byte_number", "byte"), | ||
schema("double_number", "double"), | ||
schema("float_number", "float"), | ||
schema("half_float_number", "float"), | ||
schema("scaled_float_number", "double")); | ||
schema("half_float_number", "half_float"), | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Why is this change necessary? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This caused by changes described in #299 (comment) |
||
schema("scaled_float_number", "scaled_float")); | ||
} | ||
|
||
@Test | ||
|
@@ -51,8 +51,8 @@ public void test_nonnumeric_data_types() throws IOException { | |
schema("binary_value", "binary"), | ||
schema("date_value", "timestamp"), | ||
schema("ip_value", "ip"), | ||
schema("object_value", "struct"), | ||
schema("nested_value", "array"), | ||
schema("object_value", "object"), | ||
schema("nested_value", "nested"), | ||
Comment on lines
+54
to
+55
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Why is this change necessary? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This caused by changes described in #299 (comment) |
||
schema("geo_point_value", "geo_point")); | ||
} | ||
|
||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -56,19 +56,18 @@ public void typeof_opensearch_types() throws IOException { | |
+ " | fields `double`, `long`, `integer`, `byte`, `short`, `float`, `half_float`, `scaled_float`", | ||
TEST_INDEX_DATATYPE_NUMERIC)); | ||
verifyDataRows(response, | ||
rows("DOUBLE", "LONG", "INTEGER", "BYTE", "SHORT", "FLOAT", "FLOAT", "DOUBLE")); | ||
rows("DOUBLE", "LONG", "INTEGER", "BYTE", "SHORT", "FLOAT", "HALF_FLOAT", "SCALED_FLOAT")); | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. How does this relate to adding There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This caused by changes described in #299 (comment) |
||
|
||
response = executeQuery(String.format("source=%s | eval " | ||
+ "`text` = typeof(text_value), `date` = typeof(date_value)," | ||
+ "`boolean` = typeof(boolean_value), `object` = typeof(object_value)," | ||
+ "`keyword` = typeof(keyword_value), `ip` = typeof(ip_value)," | ||
+ "`binary` = typeof(binary_value), `geo_point` = typeof(geo_point_value)" | ||
// TODO activate this test once `ARRAY` type supported, see ExpressionAnalyzer::isTypeNotSupported | ||
//+ ", `nested` = typeof(nested_value)" | ||
+ " | fields `text`, `date`, `boolean`, `object`, `keyword`, `ip`, `binary`, `geo_point`", | ||
+ "`binary` = typeof(binary_value), `geo_point` = typeof(geo_point_value)," | ||
+ "`nested` = typeof(nested_value)" | ||
+ " | fields `text`, `date`, `boolean`, `object`, `keyword`, `ip`, `binary`, `geo_point`, `nested`", | ||
TEST_INDEX_DATATYPE_NONNUMERIC)); | ||
verifyDataRows(response, | ||
rows("TEXT", "TIMESTAMP", "BOOLEAN", "OBJECT", "KEYWORD", | ||
"IP", "BINARY", "GEO_POINT")); | ||
"IP", "BINARY", "GEO_POINT", "NESTED")); | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It'd be more appropriate for these to be on
OpenSearchDataType
since they are specific to how we communicate with OpenSearch.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unfortunately, where it is used,
ExprType
is referenced. I'd like to avoid excessive refactoring there.opensearch-project-sql/opensearch/src/main/java/org/opensearch/sql/opensearch/storage/script/filter/lucene/LikeQuery.java
Lines 16 to 19 in 6c3744e
Any ideas how to do it gracefully?