-
Notifications
You must be signed in to change notification settings - Fork 25k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ES|QL: better management of exact subfields for TEXT fields #103510
ES|QL: better management of exact subfields for TEXT fields #103510
Conversation
@@ -69,9 +68,6 @@ public final PhysicalOperation fieldExtractPhysicalOperation(FieldExtractExec fi | |||
List<ValuesSourceReaderOperator.FieldInfo> fields = new ArrayList<>(); | |||
int docChannel = source.layout.get(sourceAttr.id()).channel(); | |||
for (Attribute attr : fieldExtractExec.attributesToExtract()) { | |||
if (attr instanceof FieldAttribute fa && fa.getExactInfo().hasExact()) { | |||
attr = fa.exactAttribute(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is not needed, TextFieldMapper
will take care of fetching values from the exact subfield
Pinging @elastic/es-ql (Team:QL) |
Hi @luigidellaquila, I've created a changelog YAML for you. |
Pinging @elastic/elasticsearch-esql (:Query Languages/ES|QL) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM - left some minute comments. 👍 for the tests.
Please pass this by Nik as well.
@@ -937,9 +937,15 @@ public boolean isAggregatable() { | |||
return fielddata; | |||
} | |||
|
|||
public boolean isSyntheticSourceDelegateIdentical() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
canUseSyntheticSourceDelegate
|
||
public static boolean isPushableFieldAttribute(Expression exp, Function<FieldAttribute, Boolean> hasIdenticalDelegate) { | ||
if (exp instanceof FieldAttribute fa && fa.getExactInfo().hasExact() && isAggregatable(fa)) { | ||
return fa.dataType().equals(DataTypes.TEXT) ? hasIdenticalDelegate.apply(fa) : true; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
return fa.dataType().equals(DataTypes.TEXT) && hasIdenticalDelegate.test(fa)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that wouldn't be right for non-TEXT, but return fa.dataType() != DataTypes.TEXT || hasIdenticalDelegate.test(fa)
should be.
Or return exp instanceof FieldAttribute fa && fa.getExactInfo().hasExact() && isAggregatable(fa) && (fa.dataType() != DataTypes.TEXT || hasIdenticalDelegate.test(fa));
.
@@ -302,10 +322,9 @@ protected PhysicalPlan rule(TopNExec topNExec) { | |||
return plan; | |||
} | |||
|
|||
private boolean canPushDownOrders(List<Order> orders) { | |||
private boolean canPushDownOrders(List<Order> orders, Function<FieldAttribute, Boolean> hasIdenticalDelegate) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Better to use java.util.function.Predicate
@@ -101,9 +104,9 @@ protected List<Batch<PhysicalPlan>> rules(boolean optimizeForEsSource) { | |||
esSourceRules.add(new ReplaceAttributeSourceWithDocId()); | |||
|
|||
if (optimizeForEsSource) { | |||
esSourceRules.add(new PushTopNToSource()); | |||
esSourceRules.add(new PushTopNToSource(context())); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since these two rules have similar logic it's worth making the rules non-static, move hasIdenticalDelegate as a private method to the outer class and have the classes refer to that.
canPushDown will still have to use the static signature however it should be less bureaucratic to access the SearchStats
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Scratch the above - use the same pattern as PushStatsToSource
that is change the rule to be a ParameterizedOptimizerRule
so the context will be passed in per method instead inside the constructor.
@nik9000 when you have a moment, can you please have a quick look? |
Pinging @elastic/es-analytics-geo (Team:Analytics) |
@@ -1190,6 +1190,18 @@ protected Function<Object, Object> loadBlockExpected() { | |||
return v -> ((BytesRef) v).utf8ToString(); | |||
} | |||
|
|||
protected boolean nullLoaderExpected(MapperService mapper, String fieldName) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably want @Override
for this one.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe we should move this logic into loadBlockExpected
and add the required parameters.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It makes sense
My only doubt is: should I consider MapperTestCase.loadBlockExpected()
a public API? I know it's in tests, but these are standard tests for field mappers, and this change will likely break the codebase of third party plugins implementing custom data types.
WDYT?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nah, it's fine to change MapperTestCase
- we do it all the time. I think I did it last week.
Plugins indeed might use it for mapped field types, but if they want to upgrade they'll have to handle all kinds of changes there. We're really not stable here. In some sense that's good because their upgrades will make them think about things like synthetic _source. But in other senses it's bad because it'll break compilation. But, at least for now, we break.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good to know, I'll go with that then.
Thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i'd switch this method to private
now that it's only used locally.
return syntheticSourceDelegate != null | ||
&& syntheticSourceDelegate.ignoreAbove() == Integer.MAX_VALUE | ||
&& syntheticSourceDelegate.hasNormalizer() == false; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A synthetic _source delegate will never have a normalizer. It might have ignore_above
though. If it does I think it's still safe to use it for fetching but it won't help at the moment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This method should probably have a name like canUseSyntheticSourceDelegateForQuerying
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh! There's no use in using the delegate for querying if the field isn't indexed. maybe check that?
And! I'm not 100% sure block loading works for synthetic source keyword fields with doc values disabled. We should fall to the originalName()
stored field. If we don't that's a bug.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And! I'm not 100% sure block loading works for synthetic source keyword fields with doc values disabled. We should fall to the originalName() stored field. If we don't that's a bug.
I'm not sure I understand the exact use case here. I tried a few cases with KEYWORD fields alone and with TEXT fields with KEYWORD subfields, but every time I try to disable doc_values, I get an error at index creation time like field .. doesn't support synthetic source unless it is stored or has a sub-field of type [keyword] with doc values or stored and without a normalizer
.
Do you have an example?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry -it looks like you need both doc_values: false, stored: true, ignore_above: 12
or something. I think it's probably just an issue to file and run down later.
@@ -1348,6 +1348,10 @@ public String parentField(String field) { | |||
} | |||
} | |||
|
|||
protected boolean nullLoaderExpected(MapperService mapper, String loaderFieldName) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we keep it this'll need javadoc. There are tons of extensions to this class so it's super worth having something.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we can remove it and move the logic into loadBlockExpected()
@@ -405,4 +416,15 @@ private Tuple<List<Attribute>, List<Stat>> pushableStats(AggregateExec aggregate | |||
} | |||
} | |||
|
|||
public static boolean hasIdenticalDelegate(FieldAttribute attr, SearchStats stats) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I feel like this should return the MappedFieldType
for the identical delegate. Or something like that. Some identifier.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's possible, but it would require a bit of refactoring in the pushdown rules to really take advantage of the returned MappedFieldType
, and it's not a trivial change. I think we can consider it for a follow-up
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cool. We just have to make sure the sub-field it finds is the right one. In case there is more than one candidate. Evil, I know.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We are safe in this sense, if there are two subfields, none is used, even if one of them is good (it's QL logic, not optimal, but in this case it makes our life easier)
@@ -1190,6 +1190,18 @@ protected Function<Object, Object> loadBlockExpected() { | |||
return v -> ((BytesRef) v).utf8ToString(); | |||
} | |||
|
|||
protected boolean nullLoaderExpected(MapperService mapper, String fieldName) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i'd switch this method to private
now that it's only used locally.
@@ -405,4 +416,15 @@ private Tuple<List<Attribute>, List<Stat>> pushableStats(AggregateExec aggregate | |||
} | |||
} | |||
|
|||
public static boolean hasIdenticalDelegate(FieldAttribute attr, SearchStats stats) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cool. We just have to make sure the sub-field it finds is the right one. In case there is more than one candidate. Evil, I know.
@elasticmachine run elasticsearch-ci/eql-correctness |
Fixes #99899
For a mapping like:
and a query like
ESQL optimizer can decide to use the subfield
text.raw
rather than the originaltext
field, for performance reasons that include:There are cases though when the subfield value is not exactly the same as the original one (eg. because of
ignore_above
or `normalizer).This PR improves the detection of these cases, so that the subfield is used only when it's really accurate.