semantic_text as string type in ES|QL - support for functions and operators #115243

Merged (14 commits) on Nov 1, 2024

@ioanatia (Contributor) commented Oct 21, 2024

tracked in #115103
Related: #114334, which disallows functions from returning TEXT values. We need to keep an eye on that one, as it will likely be merged first.

Description

This adds support for semantic_text as a string type in ES|QL and checks the support for functions and operators (https://www.elastic.co/guide/en/elasticsearch/reference/master/esql-functions-operators.html).

The process was:

  • make semantic_text a string type:
    public static boolean isString(DataType t) {
        if (EsqlCorePlugin.SEMANTIC_TEXT_FEATURE_FLAG.isEnabled() && t == SEMANTIC_TEXT) {
            return true;
        }
        return t == KEYWORD || t == TEXT;
    }
  • run the tests and fix any failures
  • for each function/operator that supports keyword/text (docs), write CSV tests that use it with semantic_text; fix any failures
  • for each function/operator that supports semantic_text, ensure we have test coverage when the arguments are of type semantic_text
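The first step above can be sketched in isolation. This is a minimal, self-contained illustration and not the actual Elasticsearch code: the `DataType` enum is reduced to four members, and the `semanticTextEnabled` field is a hypothetical stand-in for `EsqlCorePlugin.SEMANTIC_TEXT_FEATURE_FLAG.isEnabled()`.

```java
public class IsStringSketch {
    // Hypothetical, reduced stand-in for the ES|QL DataType enum.
    enum DataType { KEYWORD, TEXT, SEMANTIC_TEXT, INTEGER }

    // Hypothetical stand-in for EsqlCorePlugin.SEMANTIC_TEXT_FEATURE_FLAG.isEnabled().
    static boolean semanticTextEnabled = true;

    // semantic_text counts as a string type only while the feature flag is on.
    static boolean isString(DataType t) {
        if (semanticTextEnabled && t == DataType.SEMANTIC_TEXT) {
            return true;
        }
        return t == DataType.KEYWORD || t == DataType.TEXT;
    }

    public static void main(String[] args) {
        for (DataType t : DataType.values()) {
            System.out.println(t + " -> " + isString(t));
        }
    }
}
```

Because every caller goes through `isString`, flipping the flag off makes semantic_text fall back to being treated as a non-string type everywhere at once, which is presumably why the check lives here.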

Next

This PR covers only function/operator support. The following are not covered by tests here, but will be in a follow-up:

  • the STATS command:
    • aggregate functions for STATS: docs
    • grouping functions for STATS e.g. BUCKET docs
  • support for all commands (from docs) e.g.:
    • STATS
    • DISSECT
    • ENRICH
    • GROK
    • LOOKUP
    • INLINESTATS

If preferred I can make the rest of the changes here to ensure the support for commands too, but I figured it might be easier to review this in 2 parts.

@ioanatia added the >non-issue, Team:Analytics, and :Analytics/ES|QL labels Oct 21, 2024
@@ -123,7 +123,7 @@ protected static void bytesRefs(
Function<DataType, DataType> expectedDataType,
BiFunction<Integer, Stream<BytesRef>, Matcher<Object>> matcher
) {
- for (DataType type : new DataType[] { DataType.KEYWORD, DataType.TEXT, DataType.IP, DataType.VERSION }) {
+ for (DataType type : new DataType[] { DataType.KEYWORD, DataType.TEXT, DataType.IP, DataType.VERSION, DataType.SEMANTIC_TEXT }) {

@ioanatia (Contributor Author) commented Oct 22, 2024

For the mv functions, I only modified the tests that explicitly tested keyword/text fields so that they also test semantic_text. Most of the mv function tests needed no changes because I modified the bytesRefs method in the parent abstract class.

mv_concat and mv_zip do not use bytesRefs, but use the DataType.isString method to test only string types, so they implicitly test semantic_text:

for (DataType fieldType : DataType.types()) {
    if (DataType.isString(fieldType) == false) {
        continue;
    }
    for (DataType delimType : DataType.types()) {
        if (DataType.isString(delimType) == false) {
            continue;
        }
        for (int l = 1; l < 10; l++) {
            int length = l;
            suppliers.add(new TestCaseSupplier(fieldType + "/" + l + " " + delimType, List.of(fieldType, delimType), () -> {

for (DataType leftType : DataType.types()) {
    if (leftType != DataType.NULL && DataType.isString(leftType) == false) {
        continue;
    }
    for (DataType rightType : DataType.types()) {
        if (rightType != DataType.NULL && DataType.isString(rightType) == false) {
            continue;
        }

The rest of the mv functions that take string params use bytesRefs and did not need modifications:

bytesRefs(cases, "mv_count", "MvCount", t -> DataType.INTEGER, (size, values) -> equalTo(Math.toIntExact(values.count())));

bytesRefs(cases, "mv_dedupe", "MvDedupe", (size, values) -> getMatcher(values));

bytesRefs(cases, "mv_first", "MvFirst", Function.identity(), (size, values) -> equalTo(values.findFirst().get()));

bytesRefs(cases, "mv_last", "MvLast", Function.identity(), (size, values) -> equalTo(values.reduce((f, s) -> s).get()));

bytesRefs(cases, "mv_max", "MvMax", (size, values) -> equalTo(values.max(Comparator.naturalOrder()).get()));

bytesRefs(cases, "mv_min", "MvMin", (size, values) -> equalTo(values.min(Comparator.naturalOrder()).get()));
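Why a single change in the parent class covers all of these tests can be sketched as follows. This is a simplified illustration under stated assumptions: `bytesRefs` here is a hypothetical reduction of the real helper that only records one case name per type, and the type array mirrors the one in the diff above.

```java
import java.util.ArrayList;
import java.util.List;

public class SharedTypeLoopSketch {
    // Reduced stand-in for the ES|QL DataType enum.
    enum DataType { KEYWORD, TEXT, IP, VERSION, SEMANTIC_TEXT }

    // Mirrors the array in the diff: SEMANTIC_TEXT was appended to the existing types.
    static final DataType[] BYTES_REF_TYPES = {
        DataType.KEYWORD, DataType.TEXT, DataType.IP, DataType.VERSION, DataType.SEMANTIC_TEXT
    };

    // Hypothetical simplification of bytesRefs(...): registers one case per type.
    static void bytesRefs(List<String> cases, String functionName) {
        for (DataType type : BYTES_REF_TYPES) {
            cases.add(functionName + "(" + type + ")");
        }
    }

    // Each function test calls the shared helper without naming any types itself.
    static List<String> allCases() {
        List<String> cases = new ArrayList<>();
        for (String fn : List.of("mv_count", "mv_first", "mv_last")) {
            bytesRefs(cases, fn);
        }
        return cases;
    }

    public static void main(String[] args) {
        allCases().forEach(System.out::println);
    }
}
```

Since the callers never enumerate types, adding one entry to the shared array yields a new semantic_text case for every function that goes through the helper.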

@@ -217,8 +217,8 @@ public static Iterable<Object[]> parameters() {
}

private static String typeErrorString =
"boolean, cartesian_point, cartesian_shape, datetime, date_nanos, double, geo_point, geo_shape, integer, ip, keyword, long, text, "
Contributor Author

For the comparison operator tests, I did not need to make any other changes because they already use TestCaseSupplier.stringCases:

suppliers.addAll(
    TestCaseSupplier.stringCases(
        Object::equals,
        (lhsType, rhsType) -> "EqualsKeywordsEvaluator[lhs=Attribute[channel=0], rhs=Attribute[channel=1]]",
        List.of(),
        DataType.BOOLEAN
    )
);

The same applies to the rest of the comparison predicates:

suppliers.addAll(
    TestCaseSupplier.stringCases(
        (l, r) -> ((BytesRef) l).compareTo((BytesRef) r) >= 0,
        (lhsType, rhsType) -> "GreaterThanOrEqualKeywordsEvaluator[lhs=Attribute[channel=0], rhs=Attribute[channel=1]]",
        List.of(),
        DataType.BOOLEAN
    )
);

suppliers.addAll(
    TestCaseSupplier.stringCases(
        (l, r) -> ((BytesRef) l).compareTo((BytesRef) r) > 0,
        (lhsType, rhsType) -> "GreaterThanKeywordsEvaluator[lhs=Attribute[channel=0], rhs=Attribute[channel=1]]",
        List.of(),
        DataType.BOOLEAN
    )
);

suppliers.addAll(
    TestCaseSupplier.stringCases(
        (l, r) -> ((BytesRef) l).compareTo((BytesRef) r) <= 0,
        (lhsType, rhsType) -> "LessThanOrEqualKeywordsEvaluator[lhs=Attribute[channel=0], rhs=Attribute[channel=1]]",
        List.of(),
        DataType.BOOLEAN
    )
);

suppliers.addAll(
    TestCaseSupplier.stringCases(
        (l, r) -> ((BytesRef) l).compareTo((BytesRef) r) < 0,
        (lhsType, rhsType) -> "LessThanKeywordsEvaluator[lhs=Attribute[channel=0], rhs=Attribute[channel=1]]",
        List.of(),
        DataType.BOOLEAN
    )
);

suppliers.addAll(
    TestCaseSupplier.stringCases(
        (l, r) -> false == l.equals(r),
        (lhsType, rhsType) -> "NotEqualsKeywordsEvaluator[lhs=Attribute[channel=0], rhs=Attribute[channel=1]]",
        List.of(),
        DataType.BOOLEAN
    )
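The expected-value lambdas above compare BytesRef values, which is why the same evaluator can serve keyword, text, and semantic_text alike: the comparison works on the bytes, not on the declared type. As a rough standalone sketch (an approximation, not the test code itself), the ordering can be emulated with an unsigned byte-wise comparison of UTF-8 bytes; `compareUtf8` and the helper names are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class ByteOrderComparisonSketch {
    // Unsigned, byte-wise comparison of the UTF-8 encodings of two strings.
    static int compareUtf8(String l, String r) {
        return Arrays.compareUnsigned(
            l.getBytes(StandardCharsets.UTF_8),
            r.getBytes(StandardCharsets.UTF_8)
        );
    }

    // Same shape as the lambdas passed to stringCases above.
    static boolean greaterThanOrEqual(String l, String r) {
        return compareUtf8(l, r) >= 0;
    }

    static boolean lessThan(String l, String r) {
        return compareUtf8(l, r) < 0;
    }

    public static void main(String[] args) {
        System.out.println(lessThan("apple", "banana"));
        System.out.println(greaterThanOrEqual("live long", "live long"));
        // A non-ASCII character sorts after ASCII because its bytes are compared unsigned.
        System.out.println(compareUtf8("\u00e9", "z") > 0);
    }
}
```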

TestCaseSupplier uses DataType.stringTypes(), which contains semantic_text:

public static List<TestCaseSupplier> stringCases(
    BinaryOperator<Object> expected,
    BiFunction<DataType, DataType, String> evaluatorToString,
    List<String> warnings,
    DataType expectedType
) {
    List<TypedDataSupplier> lhsSuppliers = new ArrayList<>();
    List<TypedDataSupplier> rhsSuppliers = new ArrayList<>();
    List<TestCaseSupplier> suppliers = new ArrayList<>();
    for (DataType type : DataType.stringTypes()) {
        lhsSuppliers.addAll(stringCases(type));
        rhsSuppliers.addAll(stringCases(type));
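The snippet above is cut off before the left and right suppliers are combined. Assuming they are paired as a cross product (an assumption; the rest of the method is not shown in this thread), the resulting coverage can be sketched like this, with a hypothetical `stringTypes()` that returns a list:

```java
import java.util.ArrayList;
import java.util.List;

public class StringCasesSketch {
    // Reduced stand-in for the ES|QL DataType enum.
    enum DataType { KEYWORD, TEXT, SEMANTIC_TEXT }

    // Hypothetical stand-in for DataType.stringTypes().
    static List<DataType> stringTypes() {
        return List.of(DataType.KEYWORD, DataType.TEXT, DataType.SEMANTIC_TEXT);
    }

    // Assumed pairing: every left-hand string type against every right-hand string type.
    static List<String> typePairs() {
        List<String> pairs = new ArrayList<>();
        for (DataType lhs : stringTypes()) {
            for (DataType rhs : stringTypes()) {
                pairs.add(lhs + " vs " + rhs);
            }
        }
        return pairs;
    }

    public static void main(String[] args) {
        typePairs().forEach(System.out::println);
    }
}
```

Under that assumption, adding semantic_text to the string types grows the pair count from 4 to 9, including mixed pairs such as keyword against semantic_text, without touching any operator test.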

@@ -173,3 +173,946 @@ host:keyword | semantic_text_field:semantic_text
"host2" | all we have to decide is what to do with the time that is given to us
"host3" | be excellent to each other
;

case
Contributor Author

3 | null
;

convertToBool
Contributor Author

3 | null
;

concat
Contributor Author

tests for String functions

live long and prosper
;

mvAppend
Contributor Author

3 | null
;

equalityWithConstant
Contributor Author

tests for Binary operators - the ones that support string types

3 | null
;

isNull
Contributor Author

tests for ES|QL predicates

@ioanatia changed the title from "semantic_text as string type in ES|QL" to "semantic_text as string type in ES|QL - support for functions and operators" Oct 22, 2024
@ioanatia ioanatia mentioned this pull request Oct 19, 2024
15 tasks
@ioanatia
Contributor Author

@elasticmachine update branch

@ioanatia ioanatia marked this pull request as ready for review October 22, 2024 15:11
@elasticsearchmachine
Collaborator

Pinging @elastic/es-analytical-engine (Team:Analytics)

@not-napoleon (Member) left a comment

Just a few thoughts from someone who's been working on data types for the past couple of months. Hopefully adding the new type hasn't been too difficult. Please feel free to ping me directly if you have questions.

@@ -369,6 +370,9 @@ public static DataType commonType(DataType left, DataType right) {
if (left == TEXT || right == TEXT) {
return TEXT;
}
+ if (left == SEMANTIC_TEXT || right == SEMANTIC_TEXT) {
+ return TEXT;

Member

Note that if the two types are the same, we always return that type, so this is saying that only in case of mixed type operations, we expect SEMANTIC_TEXT to cast to TEXT. It seems a little odd to me that KEYWORD == SEMANTIC_TEXT would have a common type of TEXT, but maybe that's expected? If so, I think a comment would help here.

Contributor Author

I changed this so that the common type is KEYWORD in the end.

Contributor

I'm wondering whether we should do the same for TEXT (i.e., return KEYWORD). That is not part of this PR, but worth considering.
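The outcome of this thread can be sketched as below. This is a hedged illustration rather than the merged code: the thread confirms only that identical types return themselves and that the mixed-type case involving SEMANTIC_TEXT ends up as KEYWORD. How TEXT combined with SEMANTIC_TEXT resolves is not stated here, so checking SEMANTIC_TEXT before TEXT is an assumption, and `commonStringType` is a hypothetical name.

```java
public class CommonTypeSketch {
    // Reduced stand-in for the ES|QL DataType enum, string types only.
    enum DataType { KEYWORD, TEXT, SEMANTIC_TEXT }

    static DataType commonStringType(DataType left, DataType right) {
        // Identical types always resolve to themselves.
        if (left == right) {
            return left;
        }
        // Assumption: the SEMANTIC_TEXT check takes precedence in mixed-type operations.
        if (left == DataType.SEMANTIC_TEXT || right == DataType.SEMANTIC_TEXT) {
            return DataType.KEYWORD;
        }
        // Existing behavior from the diff: TEXT mixed with KEYWORD resolves to TEXT.
        if (left == DataType.TEXT || right == DataType.TEXT) {
            return DataType.TEXT;
        }
        // Not reachable with this reduced enum; kept so the method always returns.
        return DataType.KEYWORD;
    }

    public static void main(String[] args) {
        System.out.println(commonStringType(DataType.KEYWORD, DataType.SEMANTIC_TEXT));
        System.out.println(commonStringType(DataType.SEMANTIC_TEXT, DataType.SEMANTIC_TEXT));
        System.out.println(commonStringType(DataType.KEYWORD, DataType.TEXT));
    }
}
```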

@carlosdelest (Member) left a comment

This LGTM - great work on supporting this new data type! 💯

Left a question about the data type to use when combining with other types

@@ -99,7 +100,7 @@ private static TestCaseSupplier supplier(String name, DataType type, int length,

private static void add(List<TestCaseSupplier> suppliers, String name, int length, Supplier<String> valueSupplier) {
Map<Integer, List<List<DataType>>> permutations = new HashMap<Integer, List<List<DataType>>>();
- List<DataType> supportedDataTypes = List.of(DataType.KEYWORD, DataType.TEXT);
+ List<DataType> supportedDataTypes = List.of(DataType.KEYWORD, DataType.TEXT, DataType.SEMANTIC_TEXT);

Member

Suggested change
- List<DataType> supportedDataTypes = List.of(DataType.KEYWORD, DataType.TEXT, DataType.SEMANTIC_TEXT);
+ List<DataType> supportedDataTypes = List.of(DataType.stringTypes());
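One detail worth checking when applying a suggestion like this is the return type of `stringTypes()`, which this thread does not show. If it returns an array, `List.of(...)` spreads it into a flat list; if it returns a `List`, `List.of(...)` would wrap it in a one-element list of lists, and assigning the result directly (or using `List.copyOf`) would be the flat equivalent. The sketch below assumes the `List` case with a hypothetical `stringTypes()`:

```java
import java.util.List;

public class StringTypesListSketch {
    // Reduced stand-in for the ES|QL DataType enum.
    enum DataType { KEYWORD, TEXT, SEMANTIC_TEXT }

    // Hypothetical stand-in; assumed here to return a List rather than an array.
    static List<DataType> stringTypes() {
        return List.of(DataType.KEYWORD, DataType.TEXT, DataType.SEMANTIC_TEXT);
    }

    // List.of applied to a List yields a list containing that one list.
    static List<List<DataType>> wrapped() {
        return List.of(stringTypes());
    }

    // List.copyOf (or the list itself) keeps the elements flat.
    static List<DataType> flat() {
        return List.copyOf(stringTypes());
    }

    public static void main(String[] args) {
        System.out.println(wrapped().size());
        System.out.println(flat().size());
    }
}
```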

@ioanatia ioanatia closed this Oct 29, 2024
@ioanatia ioanatia reopened this Oct 29, 2024
@ioanatia ioanatia requested a review from not-napoleon October 29, 2024 21:30
@astefan (Contributor) commented Oct 30, 2024

@craigtaverner do you have time to take a look at this one?

@craigtaverner (Contributor) left a comment

LGTM, although I'd prefer new lines containing SEMANTIC_TEXT to be inserted immediately after the corresponding TEXT lines rather than later in the code. Keeping SEMANTIC_TEXT close to TEXT feels better. This also applies to blocks of code.

@@ -43,7 +43,8 @@ public class NotEquals extends EsqlBinaryComparison implements Negatable<EsqlBin
Map.entry(DataType.KEYWORD, NotEqualsKeywordsEvaluator.Factory::new),
Map.entry(DataType.TEXT, NotEqualsKeywordsEvaluator.Factory::new),
Map.entry(DataType.VERSION, NotEqualsKeywordsEvaluator.Factory::new),
- Map.entry(DataType.IP, NotEqualsKeywordsEvaluator.Factory::new)
+ Map.entry(DataType.IP, NotEqualsKeywordsEvaluator.Factory::new),
+ Map.entry(DataType.SEMANTIC_TEXT, NotEqualsKeywordsEvaluator.Factory::new)

Contributor

I think this line should be after the TEXT line.
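A small sketch of the pattern under review: several byte-based types map to the same evaluator, and the SEMANTIC_TEXT entry sits right after TEXT as requested. Two things here are assumptions: that the entries feed a `Map.ofEntries(...)` call (only the `Map.entry` lines are visible in the diff), and the use of plain strings in place of the evaluator factories. With `Map.ofEntries`, the order of entries in the source does not affect lookups, so the placement is purely about readability.

```java
import java.util.Map;

public class EvaluatorMapSketch {
    // Reduced stand-in for the ES|QL DataType enum.
    enum DataType { KEYWORD, TEXT, SEMANTIC_TEXT, VERSION, IP, INTEGER }

    // Strings stand in for the evaluator factories; only byte-based types are modeled.
    static final Map<DataType, String> NOT_EQUALS_EVALUATORS = Map.ofEntries(
        Map.entry(DataType.KEYWORD, "NotEqualsKeywordsEvaluator"),
        Map.entry(DataType.TEXT, "NotEqualsKeywordsEvaluator"),
        Map.entry(DataType.SEMANTIC_TEXT, "NotEqualsKeywordsEvaluator"),
        Map.entry(DataType.VERSION, "NotEqualsKeywordsEvaluator"),
        Map.entry(DataType.IP, "NotEqualsKeywordsEvaluator")
    );

    // Looks up the evaluator for a type, failing loudly for types not in the map.
    static String evaluatorFor(DataType type) {
        String evaluator = NOT_EQUALS_EVALUATORS.get(type);
        if (evaluator == null) {
            throw new IllegalArgumentException("no evaluator registered for " + type);
        }
        return evaluator;
    }

    public static void main(String[] args) {
        System.out.println(evaluatorFor(DataType.TEXT));
        System.out.println(evaluatorFor(DataType.SEMANTIC_TEXT));
    }
}
```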

@@ -38,7 +38,8 @@ public class LessThanOrEqual extends EsqlBinaryComparison implements Negatable<E
Map.entry(DataType.KEYWORD, LessThanOrEqualKeywordsEvaluator.Factory::new),
Map.entry(DataType.TEXT, LessThanOrEqualKeywordsEvaluator.Factory::new),
Map.entry(DataType.VERSION, LessThanOrEqualKeywordsEvaluator.Factory::new),
- Map.entry(DataType.IP, LessThanOrEqualKeywordsEvaluator.Factory::new)
+ Map.entry(DataType.IP, LessThanOrEqualKeywordsEvaluator.Factory::new),
+ Map.entry(DataType.SEMANTIC_TEXT, LessThanOrEqualKeywordsEvaluator.Factory::new)

Contributor

I think this line should be after the TEXT line.

@ioanatia added the auto-backport and v8.17.0 labels Oct 31, 2024
@ioanatia
Contributor Author

@elasticmachine update branch

@ioanatia ioanatia merged commit 0a5b1c6 into elastic:main Nov 1, 2024
16 checks passed
@ioanatia ioanatia deleted the semantic_text_as_string branch November 1, 2024 06:18
@elasticsearchmachine
Collaborator

💔 Backport failed

The backport operation could not be completed due to the following error:

An unexpected error occurred when attempting to backport this PR.

You can use sqren/backport to manually backport by running backport --upstream elastic/elasticsearch --pr 115243

jfreden pushed a commit to jfreden/elasticsearch that referenced this pull request Nov 4, 2024
…rators (elastic#115243)

* fix tests

* Add CSV tests

* Add function tests

* Refactor tests

* spotless

* Use DataType.stringTypes() where possible

* Add tests for conditional functions and expressions

* Fix tests after merge

* Reorder semantic_text evaluators and tests

* Re-ordered two more places for SEMANTIC_TEXT after TEXT

---------

Co-authored-by: Elastic Machine
Co-authored-by: Craig Taverner
ioanatia added a commit to ioanatia/elasticsearch that referenced this pull request Nov 5, 2024
…rators (elastic#115243)

ioanatia added a commit that referenced this pull request Nov 5, 2024
* semantic_text as string type in ES|QL - support for functions and operators (#115243)

* Fix release tests for semantic_text (#116202)

---------

Co-authored-by: Elastic Machine
Co-authored-by: Craig Taverner
Labels
:Analytics/ES|QL, auto-backport, backport pending, >non-issue, Team:Analytics, v8.17.0, v9.0.0

8 participants