-
Notifications
You must be signed in to change notification settings - Fork 24.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
SQL: Implement FIRST/LAST aggregate functions #37936
Conversation
FIRST and LAST can be used with one argument and work similarly to MIN and MAX but they are implemented using a Top Hits aggregation and therefore can also operate on keyword fields. When a second argument is provided then they return the first/last value of the first arg when its values are ordered ascending/descending (respectively) by the values of the second argument. Currently because of the usage of a Top Hits aggregation FIRST and LAST cannot be used in the HAVING clause of a GROUP BY query to filter on the results of the aggregation. Closes: elastic#35639
Pinging @elastic/es-search |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good overall however there some things that need improving.
In particular the rewrite optimization I don't think is complete and probably needs a follow-up PR.
|
||
.Synopsis: | ||
[source, sql] | ||
-------------------------------------------------- |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think for optional arguments, it would be nice to add an example with FIRST
just with the first argument.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You mean here in the synopsis? because we have an example here: https://github.com/elastic/elasticsearch/pull/37936/files/1be7ba194c950ab175ca0e35d927ab0b54306be2#diff-f71f0c2cd27202241aec808b033b8510R166
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, for synopsis.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@costin our documentation, until now, lists one version of a function in synopsis where the optional arguments are between square brackets. See https://github.com/elastic/elasticsearch/blob/master/docs/reference/sql/functions/string.asciidoc#locate and https://github.com/elastic/elasticsearch/blob/master/docs/reference/sql/functions/math.asciidoc#round (and others). Listing all versions of a function (with/without optional params) imo would look a bit crowded and not-so-clean and sleek. In this particular case of FIRST/FIRST_VALUE/LAST/LAST_VALUE means we need to list four versions of the same function in the synopsis and I personally believe it doesn't look that great.
Do you believe that, maybe, it is not clear that parameter is optional? (the square brackets and the description saying "optional")
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's not the optional parameter but the method name that I'd like to be exemplified through a query to make the function name more obvious.
It might just be easier to create a separate section for the alias, mentioning that it's an alias for ...
This can be done in a separate PR though since it's just documentation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
.Synopsis: | ||
[source, sql] | ||
-------------------------------------------------- | ||
FIRST(field_name<1>, sort_by_field_name<2>) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sort_by_field_name -> ordering field
|
||
*Input*: | ||
|
||
<1> a field name |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
<1> target field for the aggregation
<2> optional field used for ordering
|
||
[cols="<,<"] | ||
|=== | ||
s|a|b |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please use indentation to have the table columns aligned.
@@ -113,6 +113,129 @@ Returns the total number of _distinct non-null_ values in input values. | |||
include-tagged::{sql-specs}/docs.csv-spec[aggCountDistinct] | |||
-------------------------------------------------- | |||
|
|||
[[sql-functions-aggs-first]] | |||
===== `FIRST/FIRST_VALUE` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Additionally please add an example for FIRST_VALUE
.
return parameters().isEmpty() ? null : parameters().get(0); | ||
} | ||
|
||
public DataType sortFieldDataType() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this method is needed - one can always say sortField().dataType()
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did that to avoid the caller checking if the sortField()
is null.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All expressions return only one datatype - of their result, adding a new method is confusing. It's simple to do the null check - in fact, it's is mandatory for sorting since one cannot just add a null, null directive (if the sort field would be null).
super(source, field, sortField != null ? Collections.singletonList(sortField) : Collections.emptyList()); | ||
} | ||
|
||
public Expression sortField() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Any reason why sort is used instead of order/orderby?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not really, can change.
AggregationBuilder toBuilder() { | ||
// Sort missing values (NULLs) as last to get the first/last non-null value | ||
List<SortBuilder<?>> sortBuilderList; | ||
if (sortField != null) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The code here is redundant since the field ordering code is duplicated.
if (sortField!=null) {
sort.add(fieldSortBuilder(sortField);
}
sort.add(fieldSortBuilder(fieldName());
// Sort missing values (NULLs) as last to get the first/last non-null value | ||
List<SortBuilder<?>> sortBuilderList; | ||
if (sortField != null) { | ||
sortBuilderList = Arrays.asList( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To increase readability please use one method per line.
@@ -240,4 +243,8 @@ public static DataType fromTypeName(String esType) { | |||
return DataType.UNSUPPORTED; | |||
} | |||
} | |||
|
|||
public String format() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not okay conceptually and implementation wise since the base class is aware of its child type.
Make format() return DocValueFieldsContext by default and override DateTime and Date to return DATE_PARSE_FORMA - basic overriding.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But this is an enum, I don't get it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let me rephrase this - a dataType can have different formats. Currently we store that not in the DataType
but rather inside the DateEsField
as it can vary from field to field.
Ironically due to our usage of field_caps this information is not accessible anymore however it might be in the future.
DateEsField allows a custom format to be specified though I'm not sure this is something we need to use necessarily.
If the format could be used on DateEsField
I would opt for that, if not, it can stay as is in the DataType
but the two classes need to be aligned so the format is defined and stored only in one place.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The getFormats()
on DateEsField
is currently not used anywhere so I'll remove it from there to avoid any confusion.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking good. Left few minor comments.
Also, I would like to see tests for all values NULL. Something like SELECT FIRST(whatever) FROM bla WHERE whatever IS NULL GROUP BY X
.
Also, tests that catch the error generated when FIRST/LAST/MIN/MAX is used in HAVING.
.Synopsis: | ||
[source, sql] | ||
-------------------------------------------------- | ||
FIRST(field_name<1>, sort_by_field_name<2>) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If sort_by_field_name
is optional, I think you need to put it between square brackets to remain consistent with the rest of the documentation.
|
||
.Description: | ||
|
||
When only one argument is provided it returns the first **non-NULL** value across input values in the field |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"When only the first argument is provided".... (only the second one is optional).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The first non-null value is returned regardless of the presence of the optional argument.
I would rephrase this to "Returns the first non-NULL value (if it exists) of the input column sorted by the sort_column. If sort_column is missing, the field_name column is used for sorting`. or something along those lines.
----------------------------------------------------------- | ||
|
||
[NOTE] | ||
`FIRST` cannot be used in to create a filter in a `HAVING` clause of a `GROUP BY` query. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think "in" is not needed here?
Also, worth putting this as a limitation in the docs? (meaning, in the dedicated Limitations page)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FIRST
cannot be used in a HAVING
clause (no need to specify GROUP BY
as that's understood and also might be implicit).
.Synopsis: | ||
[source, sql] | ||
-------------------------------------------------- | ||
LAST(field_name<1>, sort_by_field_name<2>) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same as above for optional fields and square brackets.
|
||
.Description: | ||
|
||
It's the inverse of <<sql-functions-aggs-first>>. When only one argument is provided it returns the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"first argument" instead of "one argument".
...rc/main/java/org/elasticsearch/xpack/sql/execution/search/extractor/TopHitsAggExtractor.java
Show resolved
Hide resolved
|
||
import static org.elasticsearch.common.logging.LoggerMessageFormat.format; | ||
|
||
public abstract class TopHits extends AggregateFunction { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please, add a comment describing what the class is used for.
List<SortBuilder<?>> sortBuilderList; | ||
if (sortField != null) { | ||
sortBuilderList = Arrays.asList( | ||
new FieldSortBuilder(sortField).order(sortOrder).missing(Sort.Missing.LAST.position()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And maybe import static
for LAST
would make the code a bit more readable.
} | ||
TopHitsAgg that = (TopHitsAgg) o; | ||
return Objects.equals(sortField, that.sortField) && | ||
sortOrder == that.sortOrder && |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Picky, and I apologize, but please put the &&
at start of the line. I think this is the consistent approach for this in code overall.
@@ -161,6 +284,9 @@ Returns the minimum value across input values in the field `field_name`. | |||
include-tagged::{sql-specs}/docs.csv-spec[aggMin] | |||
-------------------------------------------------- | |||
|
|||
[NOTE] | |||
`MIN` on a field of type <<text, `text`>> or <<keyword, `keyword`>> is translated into <<sql-functions-aggs-first>>. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And I'm assuming MIN/MAX on keywords cannot be used in HAVING
s...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, that's why I made the link here. Do you think I should add the info about HAVING here too?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, please.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Left a new batch of comments.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Left some minor comments.
FIRST_VALUE(field_name<1>) | ||
|
||
FIRST(field_name<1>, ordering_field_name<2>) | ||
FIRST_VALUE(field_name<1>, ordering_field_name<2>) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Square brackets for optional fields, please.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But now we list both with 1 arg and 2args, do you think is still necessary?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we do it like this (all versions listed - with/without the optional parameter), our documentation will be inconsistent: https://github.com/elastic/elasticsearch/blob/master/docs/reference/sql/functions/string.asciidoc#locate. If I would have to choose between the two approaches, I would go with [ ]
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In my opinion you either have one signature with the 2nd arg in square brackets or you list the 2 variants. Personally I prefer the one signature and the square brackets. @costin what do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I opt for the signature with the second arguments in square brackets.
|
||
*Input*: | ||
|
||
<1> target field for the aggregation |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"aggregation" refers to technical implementation in the background. My personal approach would be not to expose this, but try to explain what the field is used for from the end-user perspective. Again, personal preference.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not necessarily - for the aggregate function (aggregation is ES terminology, aggregate is SQL).
LAST_VALUE(field_name<1>) | ||
|
||
LAST(field_name<1>, ordering_field_name<2>) | ||
LAST_VALUE(field_name<1>, ordering_field_name<2>) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Square brackets for optional fields, please.
@@ -669,7 +669,7 @@ public void testTopHitsAggregationWithTwoArgs() { | |||
assertTrue(eqe.output().get(0).dataType() == DataType.KEYWORD); | |||
assertThat(eqe.queryContainer().aggs().asAggBuilder().toString().replaceAll("\\s+", ""), | |||
endsWith("\"top_hits\":{\"from\":0,\"size\":1,\"version\":false,\"seq_no_primary_term\":false," + | |||
"\"explain\":false,\"docvalue_fields\":[{\"field\":\"keyword\",\"format\":\"use_field_mapping\"}]," + | |||
"\"explain\":false,\"docvalue_fields\":[{\"field\":\"keyword\"}]," + | |||
"\"sort\":[{\"int\":{\"order\":\"desc\",\"missing\":\"_last\",\"unmapped_type\":\"integer\"}}," + | |||
"{\"keyword\":{\"order\":\"desc\",\"missing\":\"_last\",\"unmapped_type\":\"keyword\"}}]}}}}}")); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I cannot wrap my head around missing
option for top_hits in the context of first/last/min/max... It's about how null
/missing values are treated when sorting: https://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-sort.html#_missing_values. Is it ok to keep missing
with the default (_last
), or I'm missing something, or there are no tests that have missing: _first
scenario included...?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After I clicked "add comment" I realized these are about first/last non-NULL values. So yeah, we don't care about missing
and we like the fact that the default is _last
.
FIRST and LAST can be used with one argument and work similarly to MIN and MAX but they are implemented using a Top Hits aggregation and therefore can also operate on keyword fields. When a second argument is provided then they return the first/last value of the first arg when its values are ordered ascending/descending (respectively) by the values of the second argument. Currently because of the usage of a Top Hits aggregation FIRST and LAST cannot be used in the HAVING clause of a GROUP BY query to filter on the results of the aggregation. Closes: #35639
Backported to |
…ersion * elastic/master: Do not set up NodeAndClusterIdStateListener in test (elastic#38110) ML: better handle task state race condition (elastic#38040) Soft-deletes policy should always fetch latest leases (elastic#37940) Handle scheduler exceptions (elastic#38014) Minor logging improvements (elastic#38084) Fix Painless void return bug (elastic#38046) Update PutFollowAction serialization post-backport (elastic#37989) fix a few versionAdded values in ElasticsearchExceptions (elastic#37877) Reenable BWC tests after backport of elastic#37899 (elastic#38093) Mute failing test Mute failing test Fail start on obsolete indices documentation (elastic#37786) SQL: Implement FIRST/LAST aggregate functions (elastic#37936) Un-mute NoMasterNodeIT.testNoMasterActionsWriteMasterBlock remove unused parser fields in RemoteResponseParsers
Although the translation rule was implemented in the `Optimizer`, the rule was not added in the list of rules to be executed. Relates to elastic#41195 Follows elastic#37936
Although the translation rule was implemented in the `Optimizer`, the rule was not added in the list of rules to be executed. Relates to elastic#41195 Follows elastic#37936
FIRST and LAST can be used with one argument and work similarly to MIN
and MAX but they are implemented using a Top Hits aggregation and
therefore can also operate on keyword fields. When a second argument is
provided then they return the first/last value of the first arg when its
values are ordered ascending/descending (respectively) by the values of
the second argument. Currently because of the usage of a Top Hits
aggregation FIRST and LAST cannot be used in the HAVING clause of a
GROUP BY query to filter on the results of the aggregation.
Closes: #35639