Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ES|QL: better management of exact subfields for TEXT fields #103510

Merged
merged 17 commits into from
Jan 9, 2024

Conversation

luigidellaquila
Copy link
Contributor

Fixes #99899

For a mapping like:

"field": {
    "type": "text",
    "fields": {
        "raw":{
            "type":"keyword",
        }
    }
},

and a query like

FROM idx | sort text | where text == "foo" | keep text

ESQL optimizer can decide to use the subfield text.raw rather than the original text field, for performance reasons that include:

  • fetch performance
  • ability to push sort and filter operations down to Lucene

There are cases though when the subfield value is not exactly the same as the original one (eg. because of ignore_above or `normalizer).
This PR improves the detection of these cases, so that the subfield is used only when it's really accurate.

@elasticsearchmachine elasticsearchmachine added needs:triage Requires assignment of a team area label v8.13.0 labels Dec 18, 2023
@@ -69,9 +68,6 @@ public final PhysicalOperation fieldExtractPhysicalOperation(FieldExtractExec fi
List<ValuesSourceReaderOperator.FieldInfo> fields = new ArrayList<>();
int docChannel = source.layout.get(sourceAttr.id()).channel();
for (Attribute attr : fieldExtractExec.attributesToExtract()) {
if (attr instanceof FieldAttribute fa && fa.getExactInfo().hasExact()) {
attr = fa.exactAttribute();
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not needed, TextFieldMapper will take care of fetching values from the exact subfield

@elasticsearchmachine elasticsearchmachine added Team:QL (Deprecated) Meta label for query languages team and removed needs:triage Requires assignment of a team area label labels Dec 18, 2023
@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-ql (Team:QL)

@elasticsearchmachine
Copy link
Collaborator

Hi @luigidellaquila, I've created a changelog YAML for you.

@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/elasticsearch-esql (:Query Languages/ES|QL)

Copy link
Member

@costin costin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM - left some minute comments. 👍 for the tests.
Please pass this by Nik as well.

@@ -937,9 +937,15 @@ public boolean isAggregatable() {
return fielddata;
}

public boolean isSyntheticSourceDelegateIdentical() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

canUseSyntheticSourceDelegate


public static boolean isPushableFieldAttribute(Expression exp, Function<FieldAttribute, Boolean> hasIdenticalDelegate) {
if (exp instanceof FieldAttribute fa && fa.getExactInfo().hasExact() && isAggregatable(fa)) {
return fa.dataType().equals(DataTypes.TEXT) ? hasIdenticalDelegate.apply(fa) : true;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

return fa.dataType().equals(DataTypes.TEXT) && hasIdenticalDelegate.test(fa)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that wouldn't be right for non-TEXT, but return fa.dataType() != DataTypes.TEXT || hasIdenticalDelegate.test(fa) should be.
Or return exp instanceof FieldAttribute fa && fa.getExactInfo().hasExact() && isAggregatable(fa) && (fa.dataType() != DataTypes.TEXT || hasIdenticalDelegate.test(fa));.

@@ -302,10 +322,9 @@ protected PhysicalPlan rule(TopNExec topNExec) {
return plan;
}

private boolean canPushDownOrders(List<Order> orders) {
private boolean canPushDownOrders(List<Order> orders, Function<FieldAttribute, Boolean> hasIdenticalDelegate) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Better to use java.util.function.Predicate

@@ -101,9 +104,9 @@ protected List<Batch<PhysicalPlan>> rules(boolean optimizeForEsSource) {
esSourceRules.add(new ReplaceAttributeSourceWithDocId());

if (optimizeForEsSource) {
esSourceRules.add(new PushTopNToSource());
esSourceRules.add(new PushTopNToSource(context()));
Copy link
Member

@costin costin Dec 21, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since these two rules have similar logic it's worth making the rules non-static, move hasIdenticalDelegate as a private method to the outer class and have the classes refer to that.
canPushDown will still have to use the static signature however it should be less bureaucratic to access the SearchStats.

Copy link
Member

@costin costin Dec 22, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Scratch the above - use the same pattern as PushStatsToSource that is change the rule to be a ParameterizedOptimizerRule so the context will be passed in per method instead inside the constructor.

@luigidellaquila
Copy link
Contributor Author

@nik9000 when you have a moment, can you please have a quick look?

@wchaparro wchaparro added the Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) label Jan 2, 2024
@elasticsearchmachine elasticsearchmachine removed the Team:QL (Deprecated) Meta label for query languages team label Jan 2, 2024
@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-analytics-geo (Team:Analytics)

@@ -1190,6 +1190,18 @@ protected Function<Object, Object> loadBlockExpected() {
return v -> ((BytesRef) v).utf8ToString();
}

protected boolean nullLoaderExpected(MapperService mapper, String fieldName) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably want @Override for this one.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we should move this logic into loadBlockExpected and add the required parameters.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It makes sense

My only doubt is: should I consider MapperTestCase.loadBlockExpected() a public API? I know it's in tests, but these are standard tests for field mappers, and this change will likely break the codebase of third party plugins implementing custom data types.

WDYT?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nah, it's fine to change MapperTestCase - we do it all the time. I think I did it last week.

Plugins indeed might use it for mapped field types, but if they want to upgrade they'll have to handle all kinds of changes there. We're really not stable here. In some sense that's good because their upgrades will make them think about things like synthetic _source. But in other senses it's bad because it'll break compilation. But, at least for now, we break.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good to know, I'll go with that then.

Thanks!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i'd switch this method to private now that it's only used locally.

return syntheticSourceDelegate != null
&& syntheticSourceDelegate.ignoreAbove() == Integer.MAX_VALUE
&& syntheticSourceDelegate.hasNormalizer() == false;
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A synthetic _source delegate will never have a normalizer. It might have ignore_above though. If it does I think it's still safe to use it for fetching but it won't help at the moment.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This method should probably have a name like canUseSyntheticSourceDelegateForQuerying.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh! There's no use in using the delegate for querying if the field isn't indexed. maybe check that?

And! I'm not 100% sure block loading works for synthetic source keyword fields with doc values disabled. We should fall to the originalName() stored field. If we don't that's a bug.

Copy link
Contributor Author

@luigidellaquila luigidellaquila Jan 8, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And! I'm not 100% sure block loading works for synthetic source keyword fields with doc values disabled. We should fall to the originalName() stored field. If we don't that's a bug.

I'm not sure I understand the exact use case here. I tried a few cases with KEYWORD fields alone and with TEXT fields with KEYWORD subfields, but every time I try to disable doc_values, I get an error at index creation time like field .. doesn't support synthetic source unless it is stored or has a sub-field of type [keyword] with doc values or stored and without a normalizer.

Do you have an example?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry -it looks like you need both doc_values: false, stored: true, ignore_above: 12 or something. I think it's probably just an issue to file and run down later.

@@ -1348,6 +1348,10 @@ public String parentField(String field) {
}
}

protected boolean nullLoaderExpected(MapperService mapper, String loaderFieldName) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we keep it this'll need javadoc. There are tons of extensions to this class so it's super worth having something.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can remove it and move the logic into loadBlockExpected()

@@ -405,4 +416,15 @@ private Tuple<List<Attribute>, List<Stat>> pushableStats(AggregateExec aggregate
}
}

public static boolean hasIdenticalDelegate(FieldAttribute attr, SearchStats stats) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel like this should return the MappedFieldType for the identical delegate. Or something like that. Some identifier.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's possible, but it would require a bit of refactoring in the pushdown rules to really take advantage of the returned MappedFieldType, and it's not a trivial change. I think we can consider it for a follow-up

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool. We just have to make sure the sub-field it finds is the right one. In case there is more than one candidate. Evil, I know.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We are safe in this sense, if there are two subfields, none is used, even if one of them is good (it's QL logic, not optimal, but in this case it makes our life easier)

@@ -1190,6 +1190,18 @@ protected Function<Object, Object> loadBlockExpected() {
return v -> ((BytesRef) v).utf8ToString();
}

protected boolean nullLoaderExpected(MapperService mapper, String fieldName) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i'd switch this method to private now that it's only used locally.

@@ -405,4 +416,15 @@ private Tuple<List<Attribute>, List<Stat>> pushableStats(AggregateExec aggregate
}
}

public static boolean hasIdenticalDelegate(FieldAttribute attr, SearchStats stats) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool. We just have to make sure the sub-field it finds is the right one. In case there is more than one candidate. Evil, I know.

@luigidellaquila
Copy link
Contributor Author

@elasticmachine run elasticsearch-ci/eql-correctness

@luigidellaquila luigidellaquila merged commit 089435c into elastic:main Jan 9, 2024
15 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Analytics/ES|QL AKA ESQL >bug Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) v8.13.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

ESQL: better management of exact subfields
6 participants