ES|QL: better management of exact subfields for TEXT fields #103510

luigidellaquila · 2023-12-18T14:16:53Z

For a mapping like:

"text": {
    "type": "text",
    "fields": {
        "raw": {
            "type": "keyword"
        }
    }
}

and a query like

FROM idx | sort text | where text == "foo" | keep text

the ES|QL optimizer can decide to use the subfield text.raw rather than the original text field, for performance reasons that include:

fetch performance
ability to push sort and filter operations down to Lucene

There are cases, though, where the subfield value is not exactly the same as the original one (e.g. because of ignore_above or normalizer).
This PR improves the detection of these cases, so that the subfield is used only when it is really accurate.
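
To make the failure mode concrete, here is a toy model of a keyword subfield that skips values longer than ignore_above and applies an optional normalizer. It is only a sketch: KeywordSubfieldModel and all of its methods are hypothetical names, not Elasticsearch APIs, and the length check is simplified to String.length().

```java
import java.util.Optional;
import java.util.function.UnaryOperator;

/** Toy model of a keyword subfield; hypothetical names, not Elasticsearch code. */
final class KeywordSubfieldModel {
    private final int ignoreAbove;                   // Integer.MAX_VALUE means "no limit"
    private final UnaryOperator<String> normalizer;  // null means "no normalizer"

    KeywordSubfieldModel(int ignoreAbove, UnaryOperator<String> normalizer) {
        this.ignoreAbove = ignoreAbove;
        this.normalizer = normalizer;
    }

    /** Value the subfield would hold for a given original text value, if any. */
    Optional<String> indexedValue(String original) {
        if (original.length() > ignoreAbove) {
            return Optional.empty();                 // the value is skipped entirely
        }
        return Optional.of(normalizer == null ? original : normalizer.apply(original));
    }

    /** True only when the subfield is guaranteed to mirror the original value. */
    boolean mirrorsOriginal() {
        return ignoreAbove == Integer.MAX_VALUE && normalizer == null;
    }

    /** Evaluates `field == literal` against the subfield instead of the original value. */
    boolean equalsViaSubfield(String original, String literal) {
        return indexedValue(original).map(literal::equals).orElse(false);
    }
}
```

In this model, a subfield with ignore_above set to 2 answers false for equalsViaSubfield("foo", "foo") even though the original value matches, which is the kind of wrong result the stricter detection is meant to avoid.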

luigidellaquila · 2023-12-18T14:18:12Z

...in/esql/src/main/java/org/elasticsearch/xpack/esql/planner/EsPhysicalOperationProviders.java

@@ -69,9 +68,6 @@ public final PhysicalOperation fieldExtractPhysicalOperation(FieldExtractExec fi
        List<ValuesSourceReaderOperator.FieldInfo> fields = new ArrayList<>();
        int docChannel = source.layout.get(sourceAttr.id()).channel();
        for (Attribute attr : fieldExtractExec.attributesToExtract()) {
-            if (attr instanceof FieldAttribute fa && fa.getExactInfo().hasExact()) {
-                attr = fa.exactAttribute();


This is not needed; TextFieldMapper will take care of fetching values from the exact subfield.

elasticsearchmachine · 2023-12-18T15:49:32Z

Pinging @elastic/es-ql (Team:QL)

elasticsearchmachine · 2023-12-18T15:49:32Z

Hi @luigidellaquila, I've created a changelog YAML for you.

elasticsearchmachine · 2023-12-18T15:49:32Z

Pinging @elastic/elasticsearch-esql (:Query Languages/ES|QL)

costin

LGTM - left some minute comments. 👍 for the tests.
Please pass this by Nik as well.

costin · 2023-12-20T02:05:58Z

server/src/main/java/org/elasticsearch/index/mapper/TextFieldMapper.java

@@ -937,9 +937,15 @@ public boolean isAggregatable() {
            return fielddata;
        }

+        public boolean isSyntheticSourceDelegateIdentical() {


canUseSyntheticSourceDelegate

costin · 2023-12-21T03:48:19Z

...in/esql/src/main/java/org/elasticsearch/xpack/esql/optimizer/LocalPhysicalPlanOptimizer.java

+
+    public static boolean isPushableFieldAttribute(Expression exp, Function<FieldAttribute, Boolean> hasIdenticalDelegate) {
+        if (exp instanceof FieldAttribute fa && fa.getExactInfo().hasExact() && isAggregatable(fa)) {
+            return fa.dataType().equals(DataTypes.TEXT) ? hasIdenticalDelegate.apply(fa) : true;


return fa.dataType().equals(DataTypes.TEXT) && hasIdenticalDelegate.test(fa)

I think that wouldn't be right for non-TEXT fields, since it would always return false for them, but return fa.dataType() != DataTypes.TEXT || hasIdenticalDelegate.test(fa) should be.
Or, folding in the outer condition: return exp instanceof FieldAttribute fa && fa.getExactInfo().hasExact() && isAggregatable(fa) && (fa.dataType() != DataTypes.TEXT || hasIdenticalDelegate.test(fa));
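
The difference between the three return expressions in this thread can be checked with a small stand-alone sketch. The DataType enum and Field class below are hypothetical stand-ins for DataTypes and FieldAttribute, reduced to what the comparison needs.

```java
import java.util.function.Predicate;

/** Toy comparison of the return expressions discussed above; types are hypothetical stand-ins. */
final class PushableCheckVariants {
    enum DataType { TEXT, KEYWORD }

    /** Minimal stand-in for a field attribute: a type plus a flag describing its delegate. */
    static final class Field {
        final DataType type;
        final boolean identicalDelegate;

        Field(DataType type, boolean identicalDelegate) {
            this.type = type;
            this.identicalDelegate = identicalDelegate;
        }
    }

    /** Form from the diff: ternary on the data type. */
    static boolean ternary(Field f, Predicate<Field> hasIdenticalDelegate) {
        return f.type == DataType.TEXT ? hasIdenticalDelegate.test(f) : true;
    }

    /** First suggestion: conjunction. This rejects every non-TEXT field. */
    static boolean conjunction(Field f, Predicate<Field> hasIdenticalDelegate) {
        return f.type == DataType.TEXT && hasIdenticalDelegate.test(f);
    }

    /** Follow-up suggestion: disjunction. This agrees with the ternary for all inputs. */
    static boolean disjunction(Field f, Predicate<Field> hasIdenticalDelegate) {
        return f.type != DataType.TEXT || hasIdenticalDelegate.test(f);
    }
}
```

For a KEYWORD field the ternary and the disjunction both return true, while the conjunction returns false, so only the disjunction is a drop-in replacement.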

costin · 2023-12-21T03:49:50Z

...in/esql/src/main/java/org/elasticsearch/xpack/esql/optimizer/LocalPhysicalPlanOptimizer.java

@@ -302,10 +322,9 @@ protected PhysicalPlan rule(TopNExec topNExec) {
            return plan;
        }

-        private boolean canPushDownOrders(List<Order> orders) {
+        private boolean canPushDownOrders(List<Order> orders, Function<FieldAttribute, Boolean> hasIdenticalDelegate) {


Better to use java.util.function.Predicate
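
As a brief illustration of why Predicate tends to read better than Function<T, Boolean> for a yes/no check, here is a sketch using plain strings; the class and method names are made up for the example.

```java
import java.util.function.Function;
import java.util.function.Predicate;

/** Toy contrast between Function<T, Boolean> and Predicate<T>; names are illustrative only. */
final class PredicateVsFunction {
    /** Function<String, Boolean> boxes its result and offers no boolean combinators. */
    static boolean viaFunction(String s, Function<String, Boolean> check) {
        return check.apply(s); // auto-unboxing: fails with NullPointerException if the function returns null
    }

    /** Predicate<String> returns a primitive boolean. */
    static boolean viaPredicate(String s, Predicate<String> check) {
        return check.test(s);
    }

    /** Predicates compose directly with and(), or() and negate(). */
    static Predicate<String> nonEmptyAndAtMost(int maxLength) {
        Predicate<String> nonEmpty = s -> s.isEmpty() == false;
        Predicate<String> shortEnough = s -> s.length() <= maxLength;
        return nonEmpty.and(shortEnough);
    }
}
```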

costin · 2023-12-21T03:55:36Z

...in/esql/src/main/java/org/elasticsearch/xpack/esql/optimizer/LocalPhysicalPlanOptimizer.java

@@ -101,9 +104,9 @@ protected List<Batch<PhysicalPlan>> rules(boolean optimizeForEsSource) {
        esSourceRules.add(new ReplaceAttributeSourceWithDocId());

        if (optimizeForEsSource) {
-            esSourceRules.add(new PushTopNToSource());
+            esSourceRules.add(new PushTopNToSource(context()));


Since these two rules have similar logic, it's worth making the rules non-static, moving hasIdenticalDelegate to the outer class as a private method, and having the rule classes refer to that.
canPushDown will still have to use the static signature; however, it should be less bureaucratic to access the SearchStats.

Scratch the above - use the same pattern as PushStatsToSource, that is, change the rule to be a ParameterizedOptimizerRule so the context is passed in per method call instead of through the constructor.
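
The two ways of handing context to a rule can be contrasted with a toy sketch. Everything below is hypothetical: the types only imitate the shape of the idea, plans are plain strings, and the sketch makes no claim about the real signatures of ParameterizedOptimizerRule or PushStatsToSource.

```java
/** Toy illustration only; these types do not mirror the real Elasticsearch classes. */
final class RuleContextStyles {
    /** Stand-in for whatever per-execution information a rule needs. */
    static final class Context {
        final boolean fieldIsIndexed;

        Context(boolean fieldIsIndexed) {
            this.fieldIsIndexed = fieldIsIndexed;
        }
    }

    /** Style 1: the context is captured once, in the constructor. */
    static final class ConstructorRule {
        private final Context context;

        ConstructorRule(Context context) {
            this.context = context;
        }

        String apply(String plan) {
            return context.fieldIsIndexed ? "pushed(" + plan + ")" : plan;
        }
    }

    /** Style 2: the rule holds no context; it receives one on every call. */
    abstract static class ParameterizedRule<P, C> {
        abstract P apply(P plan, C context);
    }

    static final class PushRule extends ParameterizedRule<String, Context> {
        @Override
        String apply(String plan, Context context) {
            return context.fieldIsIndexed ? "pushed(" + plan + ")" : plan;
        }
    }
}
```

With the second style a single rule instance can be reused across calls that carry different contexts, and the rule list can be built without having a context in hand.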


luigidellaquila · 2023-12-27T11:20:55Z

@nik9000 when you have a moment, can you please have a quick look?

elasticsearchmachine · 2024-01-02T19:49:03Z

Pinging @elastic/es-analytics-geo (Team:Analytics)

nik9000 · 2024-01-04T13:45:09Z

server/src/test/java/org/elasticsearch/index/mapper/TextFieldMapperTests.java

@@ -1190,6 +1190,18 @@ protected Function<Object, Object> loadBlockExpected() {
        return v -> ((BytesRef) v).utf8ToString();
    }

+    protected boolean nullLoaderExpected(MapperService mapper, String fieldName) {


Probably want @Override for this one.

Maybe we should move this logic into loadBlockExpected and add the required parameters.

It makes sense

My only doubt is: should I consider MapperTestCase.loadBlockExpected() a public API? I know it's in tests, but these are standard tests for field mappers, and this change will likely break the codebase of third-party plugins implementing custom data types.

WDYT?

Nah, it's fine to change MapperTestCase - we do it all the time. I think I did it last week.

Plugins indeed might use it for mapped field types, but if they want to upgrade they'll have to handle all kinds of changes there. We're really not stable here. In some sense that's good because their upgrades will make them think about things like synthetic _source. But in other senses it's bad because it'll break compilation. But, at least for now, we break.

Good to know, I'll go with that then.

Thanks!

I'd switch this method to private now that it's only used locally.

nik9000 · 2024-01-04T13:49:49Z

server/src/main/java/org/elasticsearch/index/mapper/TextFieldMapper.java

+            return syntheticSourceDelegate != null
+                && syntheticSourceDelegate.ignoreAbove() == Integer.MAX_VALUE
+                && syntheticSourceDelegate.hasNormalizer() == false;
+        }


A synthetic _source delegate will never have a normalizer. It might have ignore_above though. If it does I think it's still safe to use it for fetching but it won't help at the moment.

This method should probably have a name like canUseSyntheticSourceDelegateForQuerying.

Oh! There's no use in using the delegate for querying if the field isn't indexed. Maybe check that?

And! I'm not 100% sure block loading works for synthetic source keyword fields with doc values disabled. We should fall back to the originalName() stored field. If we don't, that's a bug.

I'm not sure I understand the exact use case here. I tried a few cases with KEYWORD fields alone and with TEXT fields with KEYWORD subfields, but every time I try to disable doc_values, I get an error at index creation time like field .. doesn't support synthetic source unless it is stored or has a sub-field of type [keyword] with doc values or stored and without a normalizer.

Do you have an example?

Sorry - it looks like you need all of doc_values: false, stored: true, and ignore_above: 12, or something like that. I think it's probably just an issue to file and run down later.
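
Putting the review points from this thread together, the querying check could look roughly like the sketch below. KeywordDelegate and canUseForQuerying are hypothetical names, and the exact condition that was eventually merged is not shown in this conversation, so treat this as an assumption about its shape rather than the actual code.

```java
/** Sketch of the querying check discussed above; hypothetical types, not the merged code. */
final class DelegateQueryCheck {
    /** Stand-in for the settings of the keyword delegate behind a text field. */
    static final class KeywordDelegate {
        final int ignoreAbove;
        final boolean hasNormalizer;
        final boolean indexed;

        KeywordDelegate(int ignoreAbove, boolean hasNormalizer, boolean indexed) {
            this.ignoreAbove = ignoreAbove;
            this.hasNormalizer = hasNormalizer;
            this.indexed = indexed;
        }
    }

    /** True when a query on the delegate is expected to behave like a query on the text field. */
    static boolean canUseForQuerying(KeywordDelegate delegate) {
        return delegate != null
            && delegate.indexed                           // querying a non-indexed field gains nothing
            && delegate.ignoreAbove == Integer.MAX_VALUE  // no values were skipped
            && delegate.hasNormalizer == false;           // values were not rewritten
    }
}
```

The normalizer clause mirrors the diff above; per the first comment in this thread it may be redundant for a synthetic _source delegate.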

nik9000 · 2024-01-04T13:50:36Z

test/framework/src/main/java/org/elasticsearch/index/mapper/MapperTestCase.java

@@ -1348,6 +1348,10 @@ public String parentField(String field) {
        }
    }

+    protected boolean nullLoaderExpected(MapperService mapper, String loaderFieldName) {


If we keep it this'll need javadoc. There are tons of extensions to this class so it's super worth having something.

I think we can remove it and move the logic into loadBlockExpected()

nik9000 · 2024-01-04T13:52:26Z

...in/esql/src/main/java/org/elasticsearch/xpack/esql/optimizer/LocalPhysicalPlanOptimizer.java

@@ -405,4 +416,15 @@ private Tuple<List<Attribute>, List<Stat>> pushableStats(AggregateExec aggregate
        }
    }

+    public static boolean hasIdenticalDelegate(FieldAttribute attr, SearchStats stats) {


I feel like this should return the MappedFieldType for the identical delegate. Or something like that. Some identifier.

It's possible, but it would require a bit of refactoring in the pushdown rules to really take advantage of the returned MappedFieldType, and it's not a trivial change. I think we can consider it for a follow-up

Cool. We just have to make sure the sub-field it finds is the right one. In case there is more than one candidate. Evil, I know.

We are safe in this sense: if there are two subfields, neither is used, even if one of them is good. (It's QL logic and not optimal, but in this case it makes our life easier.)
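
The "single candidate or nothing" behaviour described in the last reply can be sketched as follows. The Subfield class and exactSubfield method are hypothetical, and restricting candidates to keyword subfields is an assumption made for the example, not something stated in the thread.

```java
import java.util.List;
import java.util.Optional;
import java.util.stream.Collectors;

/** Toy version of the selection rule; hypothetical names, not the QL implementation. */
final class ExactSubfieldChoice {
    static final class Subfield {
        final String name;
        final String type;

        Subfield(String name, String type) {
            this.name = name;
            this.type = type;
        }
    }

    /** Returns the keyword subfield only when it is the sole candidate; otherwise empty. */
    static Optional<Subfield> exactSubfield(List<Subfield> subfields) {
        List<Subfield> candidates = subfields.stream()
            .filter(s -> "keyword".equals(s.type))   // assumption: only keyword subfields qualify
            .collect(Collectors.toList());
        // With zero or several candidates there is no unambiguous choice, so none is used.
        return candidates.size() == 1 ? Optional.of(candidates.get(0)) : Optional.empty();
    }
}
```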

luigidellaquila · 2024-01-09T12:39:44Z

@elasticmachine run elasticsearch-ci/eql-correctness

luigidellaquila added 3 commits December 14, 2023 11:04

Improve precision when using exact subfields

1d062a2

Add tests

81b9af6

Merge branch 'main' into esql/exact_subfields

81fd0c8

luigidellaquila requested review from costin and nik9000 December 18, 2023 14:16

elasticsearchmachine added needs:triage Requires assignment of a team area label v8.13.0 labels Dec 18, 2023

luigidellaquila added >bug :Analytics/ES|QL AKA ESQL labels Dec 18, 2023

elasticsearchmachine added Team:QL (Deprecated) Meta label for query languages team and removed needs:triage Requires assignment of a team area label labels Dec 18, 2023

Update docs/changelog/103510.yaml

d51155e

costin approved these changes Dec 21, 2023

luigidellaquila added 5 commits December 22, 2023 11:05

Implement review suggestions

389251d

Merge branch 'main' into esql/exact_subfields

8031d34

Fix merge

e8803b9

Merge remote-tracking branch 'luigidellaquila/esql/exact_subfields' i…

aa7e362

…nto esql/exact_subfields

Fix tests

597ab2a

luigidellaquila added 3 commits December 29, 2023 14:46

Merge branch 'main' into esql/exact_subfields

dcca601

Fix FieldMapper tests (again)

bc0c910

Merge branch 'main' into esql/exact_subfields

6eb7c09

wchaparro added the Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) label Jan 2, 2024

elasticsearchmachine removed the Team:QL (Deprecated) Meta label for query languages team label Jan 2, 2024

nik9000 reviewed Jan 4, 2024

luigidellaquila added 2 commits January 8, 2024 09:55

Implement review suggestions

ee197aa

Fix checkstyle

2cc0cb1

nik9000 approved these changes Jan 8, 2024

luigidellaquila added 2 commits January 9, 2024 13:25

Merge branch 'main' into esql/exact_subfields

ea1bd1a

Private method

df9375c

Fix checks for non indexed but stored subfields

940f406

nik9000 approved these changes Jan 9, 2024

luigidellaquila merged commit 089435c into elastic:main Jan 9, 2024
15 checks passed

craigtaverner mentioned this pull request Nov 8, 2024

Use SearchStats instead of field.isAggregatable in data node planning #115744

Merged


