[FEATURE] Add new data type for text #1038

Yury-Fridlyand · 2022-11-04T19:00:36Z

Is your feature request related to a problem?

The SQL plugin doesn't distinguish between the text and keyword data types. OpenSearch supports aggregation on keyword fields, and on text fields only when they have fielddata enabled and/or fields (sub-fields) defined.

It is possible to aggregate on a keyword field, or on a text field that meets those conditions:

opensearchsql> select sum(int0) from calcs GROUP BY str0;
fetched rows / total rows = 3/3
+-------------+
| sum(int0)   |
|-------------|
| 1           |
| 18          |
| 49          |
+-------------+

But it is impossible to aggregate on a plain text field:

opensearchsql> select gender, count(firstname) from bank-with-null-values group by gender;
TransportError(500, 'SearchPhaseExecutionException', {'error': {'type': 'SearchPhaseExecutionException', 'reason': 'Error occurred in OpenSearch engine: all shards failed', 'details': 'Shard[0]: java.lang.IllegalArgumentException: Text fields are not optimised for operations that require per-document field data like aggregations and sorting, so these operations are disabled by default. Please use a keyword field instead. Alternatively, set fielddata=true on [gender] in order to load field data by uninverting the inverted index. Note that this can use significant memory.\n\nFor more details, please send request for Json format to see the raw response from OpenSearch engine.'}, 'status': 503})

Existing mapping

| JDBC type | `ExprCoreType` | `OpenSearchDataType` | OpenSearch type |
| --- | --- | --- | --- |
| `VARCHAR` | `STRING` | `OPENSEARCH_TEXT_KEYWORD` | `keyword` |
| `VARCHAR` | `STRING` | `OPENSEARCH_TEXT` | `text` |

Sample OpenSearch mappings that are available for aggregation:

sql/integ-test/src/test/resources/correctness/opensearch_dashboards_sample_data_flights.json

Lines 25 to 27 in b56edc7

    "DestCityName": {
      "type": "keyword"
    },

sql/integ-test/src/test/resources/correctness/opensearch_dashboards_sample_data_flights.json

Lines 61 to 69 in b56edc7

    "Origin": {
      "type": "text",
      "fields": {
        "keyword": {
          "type": "keyword",
          "ignore_above": 256
        }
      }
    },

sql/integ-test/src/test/resources/indexDefinitions/account_index_mapping.json

Lines 12 to 21 in b56edc7

    "firstname": {
      "type": "text",
      "fielddata": true,
      "fields": {
        "keyword": {
          "type": "keyword",
          "ignore_above": 256
        }
      }
    },

Not available for aggregation:

sql/integ-test/src/test/resources/indexDefinitions/bank_with_null_values_index_mapping.json

Lines 16 to 18 in b56edc7

    "gender": {
      "type": "text"
    },

What solution would you like?

Have two different data types that are mapped to different JDBC/ODBC types.

| JDBC type | `ExprCoreType` | `OpenSearchDataType` | OpenSearch type |
| --- | --- | --- | --- |
| `VARCHAR`/`CHAR` | `STRING` | `OPENSEARCH_KEYWORD` | `keyword`, `text` with `fielddata`, `text` with `fields` |
| `LONGVARCHAR`/`TEXT` | `TEXT` | `OPENSEARCH_TEXT` | `text` without `fielddata` and `fields` |
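
As a rough illustration of the mapping proposed in the table above, the sketch below classifies a string field by its mapping properties. `ProposedTypeMapping` and `FieldMapping` are hypothetical names invented for this example and are not part of the plugin; only `java.sql.JDBCType` is a real JDK type.

```java
import java.sql.JDBCType;

// Hypothetical sketch of the proposed classification; none of these names exist in the plugin.
final class ProposedTypeMapping {

  /** Minimal, illustrative view of one field in an index mapping. */
  static final class FieldMapping {
    final String type;          // OpenSearch type, e.g. "keyword" or "text"
    final boolean fielddata;    // true if "fielddata": true is set
    final boolean hasSubFields; // true if a "fields" section is present

    FieldMapping(String type, boolean fielddata, boolean hasSubFields) {
      this.type = type;
      this.fielddata = fielddata;
      this.hasSubFields = hasSubFields;
    }
  }

  /** keyword, or text with fielddata and/or fields, can be aggregated; plain text cannot. */
  static boolean isAggregatable(FieldMapping field) {
    if ("keyword".equals(field.type)) {
      return true;
    }
    return "text".equals(field.type) && (field.fielddata || field.hasSubFields);
  }

  /** Aggregatable string fields map to VARCHAR, plain text maps to LONGVARCHAR. */
  static JDBCType jdbcTypeFor(FieldMapping field) {
    if (!"keyword".equals(field.type) && !"text".equals(field.type)) {
      throw new IllegalArgumentException("Not a string field: " + field.type);
    }
    return isAggregatable(field) ? JDBCType.VARCHAR : JDBCType.LONGVARCHAR;
  }

  public static void main(String[] args) {
    // Mirrors the "gender" and "firstname" mappings quoted earlier in this issue.
    System.out.println(jdbcTypeFor(new FieldMapping("text", false, false)));
    System.out.println(jdbcTypeFor(new FieldMapping("text", true, true)));
  }
}
```

With this split, a BI tool could tell from the JDBC type alone whether a GROUP BY on the column is expected to work.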

What alternatives have you considered?

N/A

Do you have any additional context?

Opened on behalf of @kylepbit
Ref:

sql/sql-jdbc/src/main/java/org/opensearch/jdbc/types/OpenSearchType.java

Lines 59 to 61 in b56edc7

    KEYWORD(JDBCType.VARCHAR, String.class, 256, 0, false),
    TEXT(JDBCType.VARCHAR, String.class, Integer.MAX_VALUE, 0, false),
    STRING(JDBCType.VARCHAR, String.class, Integer.MAX_VALUE, 0, false),

sql/opensearch/src/main/java/org/opensearch/sql/opensearch/data/type/OpenSearchDataType.java

Lines 25 to 46 in b56edc7

    /**
     * OpenSearch Text. Rather than cast text to other types (STRING), leave it alone to prevent
     * cast_to_string(OPENSEARCH_TEXT).
     * Ref: https://www.elastic.co/guide/en/elasticsearch/reference/current/text.html
     */
    OPENSEARCH_TEXT(Collections.singletonList(STRING), "string") {
      @Override
      public boolean shouldCast(ExprType other) {
        return false;
      }
    },
    /**
     * OpenSearch multi-fields which has text and keyword.
     * Ref: https://www.elastic.co/guide/en/elasticsearch/reference/current/multi-fields.html
     */
    OPENSEARCH_TEXT_KEYWORD(Arrays.asList(STRING, OPENSEARCH_TEXT), "string") {
      @Override
      public boolean shouldCast(ExprType other) {
        return false;
      }
    },

sql/core/src/main/java/org/opensearch/sql/data/type/ExprCoreType.java

Lines 44 to 47 in b56edc7

    /**
     * String.
     */
    STRING(UNDEFINED),

penghuo · 2022-11-07T16:49:24Z

So the OpenSearch storage engine would support a TEXT data type? And do you propose adding a TEXT data type to the core engine?

Yury-Fridlyand · 2022-11-07T20:30:44Z

Yes and Yes.

MaxKsyunz · 2022-11-07T20:47:35Z

Aggregating on text is disabled by default because of performance implications.

The recommended approach is to keep a copy of the raw string.

JDBC has a CLOB (character large object) data type that looks more appropriate for text fields.

@kylepbit is there another use case besides aggregation that did not work as expected?

Yury-Fridlyand · 2022-11-08T02:19:37Z

Another problem I just ran into, caused by the same issue:

SELECT COUNT(*) FROM account WHERE address LIKE '% Street';

returns 0.
The address field is text, so it is not searchable this way. LIKE builds a WildcardQuery under the hood, which works on keywords only.

sql/integ-test/src/test/resources/indexDefinitions/account_index_mapping.json

Lines 8 to 11 in b56edc7

    "address": {
      "type": "text",
      "fielddata": true
    },

UPD

SELECT address, address LIKE '% Street' FROM account;

works and returns valid results, which is very confusing.
In fact, when LIKE appears in the SELECT list it is executed in memory (in the SQL plugin itself), where it can match against text too.

acarbonetto · 2022-11-09T21:51:59Z

The crux of the problem is that we expose OpenSearch text fields as VARCHARs to BI tools, which makes them indistinguishable from keyword fields, because those are also exposed as VARCHARs. But the two field types behave differently and produce different results.
BI tooling doesn't necessarily need to know how and where the two field types work, but it needs to be able to tell the two types apart.
Exposing text fields as the MySQL text type (or long varchar) would accomplish this.

acarbonetto · 2022-11-21T19:34:12Z

The LIKE command (which calls WILDCARD on the OpenSearch side) automatically converts between text and keyword types by using the text field's .keyword sub-field (assuming it exists). By contrast, the WILDCARD function in the SQL language (which also calls the WILDCARD query on the OpenSearch side) does not convert automatically between text and keyword types (@GumpacG to confirm).

It needs to be well documented which cases are flexible (and convert automatically) and which cases expect the function to work with a specific field type (text or keyword). Integration tests are needed to support this.

see: #1032

Yury-Fridlyand · 2023-07-21T02:12:52Z

Answering my own question posted in #1038 (comment):
The query is incorrect. The valid one is:

SELECT count(*) FROM account WHERE address LIKE '%Street';

Nothing there is related to the text support implementation.

Why doesn't it work on a text field?

Because a wildcard query is a term-level query and doesn't apply any analyzer at query time. It treats the entire query string as one pattern. Here the query is matched against a text field that was processed by the standard analyzer at index time, which split the text into multiple terms and indexed them separately. Hence the pattern works when it targets a single term and fails when it spans multiple terms.
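
To make the difference concrete, here is a deliberately simplified model, not OpenSearch or plugin code. It assumes the analyzer only lowercases and splits on non-alphanumeric characters, and it lowercases the pattern too so that case does not distract from the tokenization point. All class and method names are hypothetical, and the sample address is made up.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Simplified model of term-level wildcard matching vs. in-memory LIKE; all names are hypothetical.
final class WildcardOnTextSketch {

  /** Crude stand-in for the standard analyzer: lowercase, then split on non-alphanumerics. */
  static List<String> analyze(String text) {
    return Arrays.stream(text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+"))
        .filter(token -> !token.isEmpty())
        .collect(Collectors.toList());
  }

  /** Translates a SQL LIKE pattern ('%' and '_') into a regex over lowercased input. */
  static Pattern likeToRegex(String like) {
    StringBuilder regex = new StringBuilder();
    for (char c : like.toLowerCase(Locale.ROOT).toCharArray()) {
      if (c == '%') {
        regex.append(".*");
      } else if (c == '_') {
        regex.append('.');
      } else {
        regex.append(Pattern.quote(String.valueOf(c)));
      }
    }
    return Pattern.compile(regex.toString());
  }

  /** Term-level behaviour: the whole pattern must match one single indexed term. */
  static boolean matchesAnyTerm(String fieldValue, String like) {
    Pattern pattern = likeToRegex(like);
    return analyze(fieldValue).stream().anyMatch(term -> pattern.matcher(term).matches());
  }

  /** In-memory behaviour: the pattern is matched against the whole original value. */
  static boolean matchesWholeValue(String fieldValue, String like) {
    return likeToRegex(like).matcher(fieldValue.toLowerCase(Locale.ROOT)).matches();
  }

  public static void main(String[] args) {
    String address = "123 Main Street"; // made-up sample value
    System.out.println(matchesAnyTerm(address, "% Street"));    // pattern contains a space
    System.out.println(matchesAnyTerm(address, "%Street"));     // pattern fits one term
    System.out.println(matchesWholeValue(address, "% Street")); // in-memory LIKE
  }
}
```

In this model no indexed term ever contains a space, so a pattern with a space in it cannot match in the WHERE clause, while the same pattern matches when evaluated in memory against the full string.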

Full explanation is available on hivemind: https://stackoverflow.com/a/72084568.

Yury-Fridlyand · 2023-07-21T02:14:10Z

RFC for `text` type support

History overview

odfe#620 - added two types for text (just text and text with fields), querying and string functions are supported, but searching is not.
odfe#682 and odfe#730 - added support for querying (in 682) and aggregation (730); if text has sub-fields, a .keyword suffix is added to the field name regardless of the sub-field names.
#1314 and #1664 - changed format of storing text type; it is one type now, subfield information is stored, but not yet used.
Proposed changes - correctly resolve sub-field name if a text field has only one subfield. This fixes #1112 and #1038.
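
A minimal sketch of the difference between the historical behaviour and the proposed one, under stated assumptions: sub-fields are modelled as a plain name-to-type map, the sub-field type is ignored, and fields with zero or several sub-fields are left untouched because multiple sub-fields are tracked separately in #1887. The class and method names are hypothetical and do not come from the plugin.

```java
import java.util.Map;

// Hypothetical illustration of sub-field name resolution; not the plugin's implementation.
final class SubFieldResolutionSketch {

  /** Historical behaviour: append ".keyword" whenever any sub-field exists. */
  static String legacyName(String field, Map<String, String> subFields) {
    return subFields.isEmpty() ? field : field + ".keyword";
  }

  /** Proposed behaviour: with exactly one sub-field, use that sub-field's actual name. */
  static String resolvedName(String field, Map<String, String> subFields) {
    if (subFields.size() == 1) {
      return field + "." + subFields.keySet().iterator().next();
    }
    // Assumption of this sketch: zero or several sub-fields are out of scope here.
    return field;
  }

  public static void main(String[] args) {
    // A text field whose only sub-field is named "raw" rather than "keyword".
    Map<String, String> subFields = Map.of("raw", "keyword");
    System.out.println(legacyName("city", subFields));   // points at a sub-field that does not exist
    System.out.println(resolvedName("city", subFields)); // points at the real sub-field
  }
}
```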

Further changes

Support search for text sub-fields (#1113).
Support multiple sub-fields for text (#1887).
Support non-default date formats for search queries (#1847). Fix for this bug depends on the current changes.

Problem statement

The :opensearch module parses the index mapping and builds instances of OpenSearchDataType (a base class), but ships simplified types (ExprCoreType, an enum) to the :core module, because :core uses ExprCoreType values to resolve functions.

sequenceDiagram
  participant core as :core
  participant opensearch as :opensearch
  
  core ->>+ opensearch : resolve types
  opensearch ->>- core : simplified types
  note over core: preparing query
  core ->> opensearch : translate request into datasource DSL

Later, :core hands the DSL request back to :opensearch with the types stored in it. Since the types were simplified, all mapping information is lost. Adding a new TEXT entry to the ExprCoreType enum is not enough, because ExprCoreType is datasource-agnostic and can't store any datasource-specific mapping info.

Solution

The solution is to provide :core with full types rather than simplified ones. Those objects should be fully compatible with ExprCoreType and implement all required APIs so that :core can use them with built-in functions. Once those type objects are returned to :opensearch, it can get all the information required to build the correct search request.

Pass full (non-simplified) types to and through :core.
Update OpenSearchDataType (and inheritors if needed) to be comparable with ExprCoreType.
Update :core to do proper comparison (use .equals instead of ==).
Update :opensearch to use the mapping information received from :core and properly build the search query.
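
The steps above can be sketched roughly as follows. This is not the PoC code: `ExprType`, `CoreType`, and `TextType` here are stripped-down, hypothetical stand-ins defined inside the example, meant only to show why `.equals` lets a mapping-aware type pass for STRING while `==` does not.

```java
import java.util.Map;

// Stand-in types for illustration only; they are not the plugin's real classes.
final class FullTypeSketch {

  interface ExprType {
    String typeName();
  }

  /** Datasource-agnostic core types, modelled as an enum. */
  enum CoreType implements ExprType {
    STRING,
    INTEGER;

    @Override
    public String typeName() {
      return name();
    }
  }

  /** Datasource-specific type that keeps mapping info but still behaves like STRING. */
  static final class TextType implements ExprType {
    private final Map<String, String> subFields; // sub-field name -> sub-field type

    TextType(Map<String, String> subFields) {
      this.subFields = Map.copyOf(subFields);
    }

    Map<String, String> subFields() {
      return subFields;
    }

    @Override
    public String typeName() {
      return CoreType.STRING.typeName();
    }

    @Override
    public boolean equals(Object other) {
      if (this == other) {
        return true;
      }
      if (other instanceof TextType) {
        return subFields.equals(((TextType) other).subFields);
      }
      return other == CoreType.STRING; // comparable with the core type
    }

    @Override
    public int hashCode() {
      return CoreType.STRING.hashCode(); // stays consistent with equals above
    }
  }

  /** What :core would do after the change: compare with equals. */
  static boolean isString(ExprType type) {
    return type.equals(CoreType.STRING);
  }

  /** What an identity comparison does: the full type is not recognised. */
  static boolean isStringByIdentity(ExprType type) {
    return type == CoreType.STRING;
  }

  public static void main(String[] args) {
    ExprType address = new TextType(Map.of("keyword", "keyword"));
    System.out.println(isString(address));
    System.out.println(isStringByIdentity(address));
  }
}
```

One caveat of this approach: an enum's `equals` is final in Java, so `CoreType.STRING.equals(textType)` stays false in this sketch. The comparison only works when it is called on the datasource-specific type, which is something a real implementation would need to handle consistently.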

Code PoC

Available on Bit-Quill#299

Type schema

| JDBC type | `ExprCoreType` | `OpenSearchDataType` | OpenSearch type |
| --- | --- | --- | --- |
| `VARCHAR`/`CHAR` | `STRING` | -- | `keyword` |
| `LONGVARCHAR`/`TEXT` | `STRING` | `OpenSearchTextType` | `text` |

Yury-Fridlyand added the enhancement and untriaged labels Nov 4, 2022

penghuo removed the untriaged label Nov 7, 2022

MaxKsyunz added the bi-tooling label Nov 9, 2022

MaxKsyunz added the data-correctness label Nov 9, 2022

Yury-Fridlyand mentioned this issue Nov 14, 2022
    [Bug] Support Multi-fields in WHERE Conditions #1074 (Open)

penghuo added the Priority-High label Nov 14, 2022

MitchellGale mentioned this issue Nov 23, 2022
    Text and Keyword aggregation integration tests Bit-Quill/opensearch-project-sql#176 (Draft, 6 tasks)

acarbonetto mentioned this issue Nov 24, 2022
    SQL ingestion and ops - standup #791 (Open)

This was referenced Nov 26, 2022
    [FEATURE] Support non-keyword fields for text #1112 (Open)
    [FEATURE] Support search for fields in text #1113 (Open)
    Rework on OpenSearchDataType: parse, store and use mapping information Bit-Quill/opensearch-project-sql#180 (Merged)

This was referenced Jan 27, 2023
    [BUG] SQL and PPL responses different types for the same things #1296 (Open)
    Rework on OpenSearchDataType: parse, store and use mapping information #1314 (Merged)

Yury-Fridlyand mentioned this issue Jul 19, 2023
    [FEATURE] Support search for non-keyword fields for text #1887 (Open)

Yury-Fridlyand mentioned this issue Jul 31, 2023
    [BUG] DESCRIBE query returns incorrect data for text fields #259 (Closed)
