Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add EARLIEST_BY/LATEST_BY to make EARLIEST/LATEST function signatures less ambiguous #12145

Merged
merged 6 commits into from
Jan 12, 2022

Conversation

clintropolis
Copy link
Member

Description

Follow-up to #11949, this PR splits explicit time column functions into standalone EARLIEST_BY and LATEST_BY SQL functions to avoid the method signatures being ambiguous and dependent on the column types of the inputs.

Prior to this PR, something like latest(x, 10) could either translate to "latest" value of x with max bytes of 10 if a string, but if x was a number it would treat the 10 as a timestamp, so instead of a validation exception like would happen prior to #11949, it would allow it but then explode with strange cast exceptions due to unintended behavior on the users part.

Splitting this out into separate functions makes it much less likely for the user to issue the incorrect query.


This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.

|`EARLIEST(expr, maxBytesPerString)`|Like `EARLIEST(expr)`, but for strings. The `maxBytesPerString` parameter determines how much aggregation space to allocate per string. Strings longer than this limit will be truncated. This parameter should be set as low as possible, since high values will lead to wasted memory.|`null` if `druid.generic.useDefaultValueForNull=false`, otherwise `''`|
|`EARLIEST(expr, maxBytesPerString, timeColumn)`|Like `EARLIEST(expr, timeColumn)`, but for strings. The `maxBytesPerString` parameter determines how much aggregation space to allocate per string. Strings longer than this limit will be truncated. This parameter should be set as low as possible, since high values will lead to wasted memory.|`null` if `druid.generic.useDefaultValueForNull=false`, otherwise `''`|
|`EARLIEST_BY(expr, timeColumn)`|Returns the earliest value of `expr`, which must be numeric. Earliest value is defined as the value first encountered with the minimum overall value of time column of all values being aggregated.|`null` if `druid.generic.useDefaultValueForNull=false`, otherwise `0`|
|`EARLIEST_BY(expr, maxBytesPerString, timeColumn)`|Like `EARLIEST_BY(expr, timeColumn)`, but for strings. The `maxBytesPerString` parameter determines how much aggregation space to allocate per string. Strings longer than this limit will be truncated. This parameter should be set as low as possible, since high values will lead to wasted memory.|`null` if `druid.generic.useDefaultValueForNull=false`, otherwise `''`|
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With EARLIEST_BY I'd suggest we swap the location of timeColumn and maxBytesPerString. Given that timeColumn is non-optional on the _BY items, we want the optional parameter to be at the end of the signature.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

agreed, changed

Copy link
Contributor

@gianm gianm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM other than the one comment.

OperandTypes.sequence(
"'" + aggregatorType.name() + "(expr, timeColumn)'\n",
OperandTypes.ANY,
OperandTypes.NUMERIC
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO we should allow either NUMERIC or TIMESTAMP here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, thinking about it more, IMO we should only allow TIMESTAMP. That's consistent with what our other time functions do.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

changed, and this found a bug when i had to switch to column that could actually be converted to TIMESTAMP (a long), we weren't correctly checking if the time column selector had null values, so I've fixed up the native aggregators to handle this case.

@@ -70,7 +70,6 @@
</option>
<option name="IGNORE_FIELDS_USED_IN_MULTIPLE_METHODS" value="true" />
</inspection_tool>
<inspection_tool class="FieldMayBeFinal" enabled="true" level="WARNING" enabled_by_default="true" />
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Was this change (& the others in this file) intentional?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no heh, will fix

@clintropolis clintropolis merged commit f2ce769 into apache:master Jan 12, 2022
@clintropolis clintropolis deleted the earliest-latest-ambiguous branch January 12, 2022 11:48
@abhishekagarwal87 abhishekagarwal87 added this to the 0.23.0 milestone May 11, 2022
sachinsagare pushed a commit to pinterest/druid that referenced this pull request Oct 28, 2022
… less ambiguous (apache#12145)

* add EARLIEST_BY/LATEST_BY to make EARLIEST/LATEST function signatures unambiguous

* switcheroo

* EARLIEST_BY/LATEST_BY use timestamp instead of numeric types, update docs

* revert unintended change

* fix docs

* fix docs better

(cherry picked from commit f2ce769)
Signed-off-by: ssagare <[email protected]>
sachinsagare pushed a commit to sachinsagare/druid that referenced this pull request Nov 3, 2022
… less ambiguous (apache#12145)

* add EARLIEST_BY/LATEST_BY to make EARLIEST/LATEST function signatures unambiguous

* switcheroo

* EARLIEST_BY/LATEST_BY use timestamp instead of numeric types, update docs

* revert unintended change

* fix docs

* fix docs better

Signed-off-by: ssagare <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants