Improve processing OpenSearch data types. Fix using subfields for text type. #299
core/src/main/java/org/opensearch/sql/expression/operator/convert/TypeCastOperator.java
    }
    return fieldName;
    // Pick first field. What to do if there are multiple fields?
Can you pass in the type like how it was with convertTextToKeyword and map that type by finding it in the list of fields?
Is it needed?
This function is no longer static, so different types can override it if needed. That way we avoid creating a new function like convertXXXtoYYY in the future.
Right. But how would it know which type to convert to? For example, doing an aggregation on a text field with the mapping
"textColumn": {
  "type": "text",
  "fields": {
    "date": {
      "type": "date"
    },
    "keyword": {
      "type": "keyword"
    }
  }
}
will do the aggregation on textColumn.date. What would be expected here?
I'm with @GumpacG on this.
The keyword field is a convention in OpenSearch to mean "first bit of the text", and the conversion is "ok, I guess" for legacy's sake, but in general picking the first field would lead to unexpected results that depend on the mapping.
On the other hand, if fielddata is set, then it is safe to use the textColumn field in this place.
It is possible to aggregate on dates inside a text field.
I changed it to look for a string subfield, if present, in 8b0671c.
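To make the subfield question above concrete, here is a minimal sketch of one possible selection rule: prefer a subfield mapped as keyword and otherwise fall back to the bare field name. The class `SubfieldPicker`, the method `fieldForAggregation`, and the plain `Map<String, String>` of subfield name to mapping type are all assumptions made for illustration; this is not the code from 8b0671c.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical helper for illustration only; not the PR's implementation.
final class SubfieldPicker {

  /**
   * Returns "field.subfield" for the first subfield mapped as "keyword",
   * or the bare field name when the text field has no such subfield.
   */
  static String fieldForAggregation(String fieldName, Map<String, String> subfieldTypes) {
    Optional<String> keywordSubfield = subfieldTypes.entrySet().stream()
        .filter(entry -> "keyword".equals(entry.getValue()))
        .map(Map.Entry::getKey)
        .findFirst();
    return keywordSubfield.map(sub -> fieldName + "." + sub).orElse(fieldName);
  }

  public static void main(String[] args) {
    // Same shape as the mapping quoted in the thread: a date and a keyword subfield.
    Map<String, String> fields = new LinkedHashMap<>();
    fields.put("date", "date");
    fields.put("keyword", "keyword");
    System.out.println(fieldForAggregation("textColumn", fields));
  }
}
```

With a rule like this, the result no longer depends on the order of subfields in the mapping, which was the concern with picking the first field.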
Signed-off-by: Yury-Fridlyand <[email protected]>
554a460 to e885a44
Signed-off-by: Yury-Fridlyand <[email protected]>
For docs/dev/img/type-hierarchy-tree-final.png:
- Can you split STRING into TEXT and KEYWORD?
- Can you align DATE and TIME?
- I'm not sure you want STRING --> DATE/TIME/DATETIME/TIMESTAMP, since it's a very specific set of strings that convert. I think that conversion is 'special' and doesn't need to be defined here.
core/src/main/java/org/opensearch/sql/expression/operator/convert/TypeCastOperator.java
## Final type hierarchy scheme

![Most relevant type hierarchy](img/type-hierarchy-tree-final.png)
There's only STRING in that listing. Should we specify TEXT vs KEYWORD there?
There is no TEXT nor KEYWORD in ExprCoreType.
  }

  public int hashCode() {
    return 42 + exprCoreType.hashCode();
Would this be considered a magic number that should be defined as a constant?
Maybe https://xkcd.com/221/
Why is this override necessary?
This is needed to make OpenSearchExprValueFactory::typeActionMap work properly. Without that override, the lookup always falls through to
Lines 208 to 210 in 5232ad2
throw new IllegalStateException(
    String.format(
        "Unsupported type: %s for value: %s.", type.typeName(), content.objectValue()));
This could be simplified to always return 0 (or any other constant), which forces the equals check on every lookup.
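As a rough illustration of why the override matters for a map lookup, the sketch below uses a hypothetical `TypeKey` class as a stand-in for the data type. A hash map only calls `equals` on keys whose hash codes match, so keys that should compare equal must share a hash code. The fallback branch in `equals` is an assumption; only the first branch mirrors the snippet quoted elsewhere in this review.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a data type used as a map key; not the PR's class.
final class TypeKey {
  private final String mappingType; // null when only the core type is known
  private final String coreType;

  TypeKey(String mappingType, String coreType) {
    this.mappingType = mappingType;
    this.coreType = coreType;
  }

  @Override
  public boolean equals(Object o) {
    if (this == o) {
      return true;
    }
    if (!(o instanceof TypeKey)) {
      return false;
    }
    TypeKey other = (TypeKey) o;
    if (mappingType != null && other.mappingType != null) {
      return mappingType.equals(other.mappingType) && coreType.equals(other.coreType);
    }
    // Assumed fallback: compare core types when a mapping type is missing.
    return coreType.equals(other.coreType);
  }

  @Override
  public int hashCode() {
    // Uses only the field that every pair of equal keys is guaranteed to share.
    return coreType.hashCode();
  }

  static String lookup(Map<TypeKey, String> actions, TypeKey key) {
    return actions.getOrDefault(key, "unsupported");
  }

  public static void main(String[] args) {
    Map<TypeKey, String> actions = new HashMap<>();
    actions.put(new TypeKey(null, "STRING"), "parse-string");
    System.out.println(lookup(actions, new TypeKey("keyword", "STRING")));
  }
}
```

Note that an `equals` of this shape is not transitive (two keys with different mapping types both equal a key with none), so it is only a sketch of the lookup behaviour, not a recommended design.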
opensearch/src/main/java/org/opensearch/sql/opensearch/data/type/OpenSearchDataType.java
## Solution

The solution is to provide to `:core` non simplified types, but full types. Those objects should be fully compatible with `ExprCoreType` and implement all required APIs to allow `:core` to manipulate with built-in functions. Once those type objects are returned back to `:opensearch`, it can get all required information to build the correct search request.
non simplified types: enum
full types: Objects
right?
Can we just say that?
simplified types: enum
full types: Objects
Before, full types were converted to enums before being passed from :opensearch to :core. With my changes, full types are passed from :opensearch to :core, and :core uses an API call to convert them to an enum value wherever it is needed (to pick the proper function signature).
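The flow described in this reply can be sketched roughly as follows. Every name here (`CoreType`, `FieldType`, `DateFieldType`, `CoreModule`, `StorageModule`) is hypothetical and only stands in for the roles of `ExprCoreType`, the full type objects, `:core` and `:opensearch`; the real classes and signatures in the PR may differ.

```java
// All names below are hypothetical; they only mirror the idea in the thread.
enum CoreType { STRING, TIMESTAMP }

// A "full type": an object that can still report its simplified enum value.
interface FieldType {
  CoreType coreType();
}

final class DateFieldType implements FieldType {
  private final String format;

  DateFieldType(String format) {
    this.format = format;
  }

  @Override
  public CoreType coreType() {
    return CoreType.TIMESTAMP;
  }

  String format() {
    return format;
  }
}

// Plays the role of the core module: it only needs the enum to pick a signature.
final class CoreModule {
  static String resolveSignature(String functionName, FieldType argument) {
    return functionName + "(" + argument.coreType() + ")";
  }
}

// Plays the role of the storage module: it gets the full object back,
// so mapping details such as the date format are still available.
final class StorageModule {
  static String describeQuery(FieldType type) {
    if (type instanceof DateFieldType) {
      return "range query with format " + ((DateFieldType) type).format();
    }
    return "plain query";
  }
}

final class TypeFlowDemo {
  public static void main(String[] args) {
    FieldType type = new DateFieldType("yyyy-MM-dd");
    System.out.println(CoreModule.resolveSignature("year", type));
    System.out.println(StorageModule.describeQuery(type));
  }
}
```

The point of the design is that the conversion to an enum happens on demand inside the core side, so nothing is lost before the type object travels back to the storage side.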
opensearch/src/main/java/org/opensearch/sql/opensearch/data/type/OpenSearchDataType.java
            OpenSearchDataType.of(MappingType.GeoPoint)),
        () -> assertNotEquals(OpenSearchDataType.of(MappingType.GeoPoint),
            OpenSearchDataType.of(MappingType.Ip)),
        () -> assertEquals(OpenSearchDataType.of(STRING), OpenSearchDataType.of(STRING)),
I don't understand the purpose of this test
C for coverage.
I had to add 4 tests to satisfy JaCoCo's caprice for line 42 and 4 more tests for line 43
Lines 42 to 43 in 5232ad2
if (mappingType != null && other.mappingType != null) {
  return mappingType.equals(other.mappingType) && exprCoreType.equals(other.exprCoreType);
opensearch/src/test/java/org/opensearch/sql/opensearch/data/type/OpenSearchDataTypeTest.java
Signed-off-by: Yury-Fridlyand <[email protected]>
Have you considered creating one class hierarchy for types? Instead of ExprCoreTypes being enum values, make them classes and derive from each other as appropriate.
Singleton instances can still be used for types that do not have parameters, like ints, keyword, etc.
This would simplify a lot of the type comparison logic.
@@ -21,7 +20,8 @@ public interface ExprType {
    * Is compatible with other types.
    */
   default boolean isCompatible(ExprType other) {
-    if (this.equals(other)) {
+    // Do double direction check with `equals`, because a derived class may override it
+    if (this.equals(other) || other.equals(this)) {
By definition, if this.equals(other) then other.equals(this) must be true.
Do we have ExprTypes for which this is necessary? If yes, the problem is there.
other may be an instance of OpenSearchDataType, which has more complex comparison logic.
I have an idea how to fix it; I will do it soon.
Fixed in b04a92e.
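For readers following the symmetry argument in this thread, here is a small sketch of how a subclass that tightens `equals` ends up with an asymmetric comparison, and how a double-direction check hides it. `LooseType`, `StrictType` and `Compat` are hypothetical names; this shows the general pitfall and is not the code changed in b04a92e.

```java
// Hypothetical classes illustrating an asymmetric equals; not the PR's code.
class LooseType {
  final String name;

  LooseType(String name) {
    this.name = name;
  }

  @Override
  public boolean equals(Object o) {
    // Accepts any LooseType (including subclasses) with the same name.
    return o instanceof LooseType && ((LooseType) o).name.equals(name);
  }

  @Override
  public int hashCode() {
    return name.hashCode();
  }
}

class StrictType extends LooseType {
  final String format;

  StrictType(String name, String format) {
    super(name);
    this.format = format;
  }

  @Override
  public boolean equals(Object o) {
    // Only accepts another StrictType with the same name and format.
    return o instanceof StrictType
        && super.equals(o)
        && ((StrictType) o).format.equals(format);
  }
}

final class Compat {
  static boolean oneWay(Object a, Object b) {
    return a.equals(b);
  }

  // The double-direction check from the diff above, in isolation.
  static boolean bothWays(Object a, Object b) {
    return a.equals(b) || b.equals(a);
  }

  public static void main(String[] args) {
    LooseType loose = new LooseType("DATE");
    StrictType strict = new StrictType("DATE", "epoch_millis");
    System.out.println(loose.equals(strict));
    System.out.println(strict.equals(loose));
    System.out.println(bothWays(strict, loose));
  }
}
```

The double-direction check makes the result independent of argument order, but the underlying `equals` still breaks the symmetry requirement of `Object.equals`, which is the reviewer's point that the problem should be fixed in the type classes.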
@@ -36,8 +36,8 @@ public void test_numeric_data_types() throws IOException {
         schema("byte_number", "byte"),
         schema("double_number", "double"),
         schema("float_number", "float"),
-        schema("half_float_number", "float"),
-        schema("scaled_float_number", "double"));
+        schema("half_float_number", "half_float"),
Why is this change necessary?
This is caused by the changes described in #299 (comment)
        schema("object_value", "object"),
        schema("nested_value", "nested"),
Why is this change necessary?
This is caused by the changes described in #299 (comment)
@@ -56,19 +56,18 @@ public void typeof_opensearch_types() throws IOException {
             + " | fields `double`, `long`, `integer`, `byte`, `short`, `float`, `half_float`, `scaled_float`",
         TEST_INDEX_DATATYPE_NUMERIC));
     verifyDataRows(response,
-        rows("DOUBLE", "LONG", "INTEGER", "BYTE", "SHORT", "FLOAT", "FLOAT", "DOUBLE"));
+        rows("DOUBLE", "LONG", "INTEGER", "BYTE", "SHORT", "FLOAT", "HALF_FLOAT", "SCALED_FLOAT"));
How does this relate to adding text type?
This is caused by the changes described in #299 (comment)
         () -> assertEquals("TIMESTAMP", defaultDateType.typeName()),
         () -> assertEquals("TIME", timeDateType.typeName()),
         () -> assertEquals("DATE", dateDateType.typeName()),
-        () -> assertEquals("DATE", datetimeDateType.typeName())
+        () -> assertEquals("TIMESTAMP", datetimeDateType.typeName())
This looks very unrelated to adding text type.
Before
- OpenSearchDateType was converted to a simplified type when passed to the :core module. Actually, ExprCoreType was extracted from OSDT, where it is stored. Values were DATE/TIME/etc.
- ExprCoreType names were used to build the schema in QueryResponse, which was serialized later and sent to the user.
- The legacyTypeName method of ExprType was used for SQL responses and typeName for PPL ones.
In the middle
- OSDT isn't converted
- The same methods of OSDT return mappingType, which is always date regardless of the detected ExprCoreType for this field.
Finally
- Same as above: OSDT isn't converted
- OpenSearchDateType overrides these methods to return ExprCoreType
- No changes for a user!
  /**
   * Perform field name conversion if needed before inserting it into a search query.
   */
  default String convertFieldForSearchQuery(String fieldName) {
    return fieldName;
  }

  /**
   * Perform value conversion if needed before inserting it into a search query.
   */
  default Object convertValueForSearchQuery(ExprValue value) {
    return value.value();
  }
It'd be more appropriate for these to be on OpenSearchDataType since they are specific to how we communicate with OpenSearch.
Unfortunately, where it is used, ExprType is referenced. I'd like to avoid excessive refactoring there.
Lines 16 to 19 in 6c3744e
public class LikeQuery extends LuceneQuery {
  @Override
  public QueryBuilder doBuild(String fieldName, ExprType fieldType, ExprValue literal) {
    String field = OpenSearchTextType.convertTextToKeyword(fieldName, fieldType);
Any ideas how to do it gracefully?
opensearch/src/main/java/org/opensearch/sql/opensearch/data/type/OpenSearchDataType.java
@@ -163,8 +208,8 @@ public static OpenSearchDataType of(MappingType mappingType, Map<String, Object>
       case Ip: return OpenSearchIpType.of();
       case Date:
         // Default date formatter is used when "" is passed as the second parameter
-        String format = (String) innerMap.getOrDefault("format", "");
-        return OpenSearchDateType.of(format);
+        return innerMap.isEmpty() ? OpenSearchDateType.of()
Why change this?
This simplifies creation of OpenSearchDateType. A format string passes a number of checks even when it is empty.
-    return fieldName + ".keyword";
+  @Override
+  public String convertFieldForSearchQuery(String fieldName) {
+    if (fields.size() == 0) {
In this case the user will end up with OpenSearch error about not being able to aggregate on text. Do I get that right?
Yes!
@acarbonetto
Signed-off-by: Yury-Fridlyand <[email protected]>
Description
See doc for technical details: https://github.com/Bit-Quill/opensearch-project-sql/blob/dev-add-text-type/docs/dev/text-type.md
See also this comment describing some changes.
Issues Resolved
- Pass full types (OpenSearchDataType) through the :core module instead of simplified ones (ExprCoreType). This unblocks access to important mapping info such as text fields or date formats. This info is required to build proper DSL queries to OpenSearch.
- [...] keyword subfield name.
Check List
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.