Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FEATURE] Add new data type for text #1038

Open
Yury-Fridlyand opened this issue Nov 4, 2022 · 8 comments
Open

[FEATURE] Add new data type for text #1038

Yury-Fridlyand opened this issue Nov 4, 2022 · 8 comments

Comments

@Yury-Fridlyand
Copy link
Collaborator

Is your feature request related to a problem?

SQL plugin doesn't distinguish between text and keyword data types. OpenSearch supports aggregation on keywords and texts with fielddata and/or fields.

It is possible to aggregate on keyword or text (conditions apply)

opensearchsql> select sum(int0) from calcs GROUP BY str0;
fetched rows / total rows = 3/3
+-------------+
| sum(int0)   |
|-------------|
| 1           |
| 18          |
| 49          |
+-------------+

But impossible to aggregate on general text:

opensearchsql> select gender, count(firstname) from bank-with-null-values group by gender;
TransportError(500, 'SearchPhaseExecutionException', {'error': {'type': 'SearchPhaseExecutionException', 'reason': 'Error occurred in OpenSearch engine: all shards failed', 'details': 'Shard[0]: java.lang.IllegalArgumentException: Text fields are not optimised for operations that require per-document field data like aggregations and sorting, so these operations are disabled by default. Please use a keyword field instead. Alternatively, set fielddata=true on [gender] in order to load field data by uninverting the inverted index. Note that this can use significant memory.\n\nFor more details, please send request for Json format to see the raw response from OpenSearch engine.'}, 'status': 503})

Existing mapping

JDBC type ExprCoreType OpenSearchDataType OpenSearch type
VARCHAR STRING OPENSEARCH_TEXT_KEYWORD keyword
VARCHAR STRING OPENSEARCH_TEXT text

See OpenSearch mapping samples available for aggregation:


"Origin": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
},

"firstname": {
"type": "text",
"fielddata": true,
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
},

Not available for aggregation:

What solution would you like?

Have 2 different data types which are mapped to different JDBC/ODBC types.

JDBC type ExprCoreType OpenSearchDataType OpenSearch type
VARCHAR/CHAR STRING OPENSEARCH_KEYWORD keyword
text with fielddata
text with fields
LONGVARCHAR/TEXT TEXT OPENSEARCH_TEXT text without fielddata and fields

What alternatives have you considered?

N/A

Do you have any additional context?

Opened on behalf of @kylepbit
Ref:

  1. KEYWORD(JDBCType.VARCHAR, String.class, 256, 0, false),
    TEXT(JDBCType.VARCHAR, String.class, Integer.MAX_VALUE, 0, false),
    STRING(JDBCType.VARCHAR, String.class, Integer.MAX_VALUE, 0, false),
  2. /**
    * OpenSearch Text. Rather than cast text to other types (STRING), leave it alone to prevent
    * cast_to_string(OPENSEARCH_TEXT).
    * Ref: https://www.elastic.co/guide/en/elasticsearch/reference/current/text.html
    */
    OPENSEARCH_TEXT(Collections.singletonList(STRING), "string") {
    @Override
    public boolean shouldCast(ExprType other) {
    return false;
    }
    },
    /**
    * OpenSearch multi-fields which has text and keyword.
    * Ref: https://www.elastic.co/guide/en/elasticsearch/reference/current/multi-fields.html
    */
    OPENSEARCH_TEXT_KEYWORD(Arrays.asList(STRING, OPENSEARCH_TEXT), "string") {
    @Override
    public boolean shouldCast(ExprType other) {
    return false;
    }
    },
  3. /**
    * String.
    */
    STRING(UNDEFINED),
@Yury-Fridlyand Yury-Fridlyand added enhancement New feature or request untriaged labels Nov 4, 2022
@penghuo penghuo removed the untriaged label Nov 7, 2022
@penghuo
Copy link
Collaborator

penghuo commented Nov 7, 2022

OpenSearch storage engine support TEXT data type? Do you propose add TEXT data type in core engine?

@Yury-Fridlyand
Copy link
Collaborator Author

Yes and Yes.

@MaxKsyunz
Copy link
Collaborator

MaxKsyunz commented Nov 7, 2022

Aggregating on text is disabled by default because of performance implications.

The recommended approach is to keep a copy of the raw string.

JDBC has CLOB (character large object) data type that looks more appropriate for text fields.

@kylepbit is another use case besides aggregation that did not work as expected?

@Yury-Fridlyand
Copy link
Collaborator Author

Yury-Fridlyand commented Nov 8, 2022

Another trouble I just met caused by the same issue:

SELECT COUNT(*) FROM account WHERE address LIKE '% Street';

returns 0.
address field is text - not searchable. LIKE builds WildcardQuery under the hood, which works on keywords only.

"address": {
"type": "text",
"fielddata": true
},

UPD

SELECT address, address LIKE '% Street' FROM account;

works and returns valid results. This confuses a lot.
Actually, in SELECT LIST function LIKE is executed in memory (in SQL plugin itself), which can do search in text too.

@acarbonetto
Copy link
Collaborator

The crux of the problem is that we expose opensearch text fields as VARCHARs to BI tools, which is indistinguishable from keyword fields, because both are also exposed as VARCHARs. But the two field types behave differently, and produce different results.
BI tooling doesn't necessarily need to know how and where the two field types work, but need to be able to distinguish the two types apart.
Exposing text fields as the MySQL text type (or long varchar) would accomplish this.

@acarbonetto
Copy link
Collaborator

The LIKE command (which calls WILDCARD on the OpenSearch side) does an automatic conversion between text and keyword types by using the text .keyword nested field (assuming it exists). Whereas the WILDCARD function in the SQL language (which also calls the WILDCARD query on the OpenSearch side) does not do an automatic conversion between text and keyword types (@GumpacG to confirm).

It needs to be well documented which cases will be flexible (and automatically convert) and which cases expect the function to work with a specific field type (text or keyword). IT tests are needed to support this.

see: #1032

@Yury-Fridlyand
Copy link
Collaborator Author

Answering my own question posted in #1038 (comment):
The query is incorrect. The valid one is:

SELECT count(*) FROM account WHERE address LIKE '%Street';

Nothing there related to text support implementation.

Why it is not working on text type field ?

Because wildcard query is the term level query and it don't apply any analyzer at query time. It will consider your entire query as one pattern. You are matching query on text type field which used standard analyzer at indexing time and token your text to multiple terms and index hence it is working for one term and not multiple term.

Full explanation is available on hivemind: https://stackoverflow.com/a/72084568.

@Yury-Fridlyand
Copy link
Collaborator Author

Yury-Fridlyand commented Jul 21, 2023

RFC for text type support

History overview

  • odfe#620 - added two types for text (just text and text with fields), querying and string functions are supported, but searching is not.
  • odfe#682 and odfe#730 - added support for querying (in 682) and aggregation (730); if text has sub-fields, .keyword prefix is added to the field name regardless of the sub-field names.
  • #1314 and #1664 - changed format of storing text type; it is one type now, subfield information is stored, but not yet used.
  • Proposed changes - correctly resolve sub-field name if a text field has only one subfield. This fixes #1112 and #1038.

Further changes

  • Support search for text sub-fields (#1113).
  • Support multiple sub-fields for text (#1887).
  • Support non-default date formats for search queries (#1847). Fix for this bug depends on the current changes.

Problem statement

:opensearch module parses index mapping and builds instances of OpenSearchDataType (a base class), but ships simplified types (ExprCoreType - a enum) to :core module, because :core uses ExprCoreTypes to resolve functions.

sequenceDiagram
  participant core as :core
  participant opensearch as :opensearch
  
  core ->>+ opensearch : resolve types
  opensearch ->>- core : simplified types
  note over core: preparing query
  core ->> opensearch : translate request into datasource DSL
Loading

Later, :core returns to :opensearch the DSL request with types stored. Since types were simplified, all mapping information is lost. Adding new TEXT entry to ExprCoreType enum is not enough, because ExprCoreType is datasource agnostic and can't store any specific mapping info.

Solution

The solution is to provide to :core non simplified types, but full types. Those objects should be fully compatible with ExprCoreType and implement all required APIs to allow :core to manipulate with built-in functions. Once those type objects are returned back to :opensearch, it can get all required information to build the correct search request.

  1. Pass full (non simplified) types to and through :core.
  2. Update OpenSearchDataType (and inheritors if needed) to be comparable with ExprCoreType.
  3. Update :core to do proper comparison (use .equals instead of ==).
  4. Update :opensearch to use the mapping information received from :core and properly build the search query.

Code PoC

Available on Bit-Quill#299

Type schema

JDBC type ExprCoreType OpenSearchDataType OpenSearch type
VARCHAR/CHAR STRING -- keyword
LONGVARCHAR/TEXT STRING OpenSearchTextType text

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants