Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add Post Aggregators for Tuple Sketches #13819

Conversation

anshu-makkar
Copy link
Contributor

@anshu-makkar anshu-makkar commented Feb 17, 2023

Description

Added New Post Aggregators for Tuple Sketches

Release note

New: You can now do the following operations with TupleSketches in Post Aggregation Step

  • Get the Sketch Output as Base64 String
  • Provide a constant Tuple Sketch in post-aggregation step that can be used in Set Operations
  • Get the Estimated Value(Sum) of Summary/Metrics Objects associated with Tuple Sketch

Key changed/added classes in this PR
Added
  • ArrayOfDoublesSketchToEncodedStringPostAggregator.java
  • ArrayOfDoublesSketchToMetricsEstimatePostAggregator.java
  • ArrayOfDoublesSketchConstantPostAggregator.java
  • ArrayOfDoublesSketchToMetricsEstimatePostAggregatorTest.java
  • ArrayOfDoublesSketchToEncodedStringPostAggregatorTest.java
Modified
  • AggregatorUtil.java
  • ArrayOfDoublesSketchModule.java

This PR has:

  • been self-reviewed.
  • a release note entry in the PR description.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • been tested in a test Druid cluster.

@anshu-makkar
Copy link
Contributor Author

@techdocsmith Hello, if you could help me tag a reviewer for these new features.

@techdocsmith
Copy link
Contributor

Thanks for the contribution @anshu-makkar . I'm not sure who should do this review--I focus mostly on markdown :D . @abhishekagarwal87 do you have a recommendation?

@abhishekagarwal87
Copy link
Contributor

@anshu-makkar - Thank you for your contribution. can you please address build failures? You also need to update docs - https://github.com/apache/druid/blob/master/docs/development/extensions-core/datasketches-tuple.md#post-aggregators

@anshu-makkar
Copy link
Contributor Author

@anshu-makkar - Thank you for your contribution. can you please address build failures? You also need to update docs - https://github.com/apache/druid/blob/master/docs/development/extensions-core/datasketches-tuple.md#post-aggregators

@abhishekagarwal87 : Yes I will work on adding the documentation, and address the build failures.

Any other comments if you have I can address those.

@abhishekagarwal87
Copy link
Contributor

@anshu-makkar - are you going to update the code later? You have resolved the comments.

@anshu-makkar
Copy link
Contributor Author

anshu-makkar commented Feb 21, 2023

@anshu-makkar - are you going to update the code later? You have resolved the comments.

Hi @abhishekagarwal87 I have modified the code and addressed all the comments.

@anshu-makkar
Copy link
Contributor Author

@abhishekagarwal87 : Request to you review if there are more changes required.


### Estimated metrics values for each column of ArrayOfDoublesSketch

This post aggregator returns a list of estimated values(sum) from a given ArrayOfDoublesSketch. The result is _N_ double values, where _N_ is the number of double values kept in the sketch per key.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
This post aggregator returns a list of estimated values(sum) from a given ArrayOfDoublesSketch. The result is _N_ double values, where _N_ is the number of double values kept in the sketch per key.
This post aggregator returns a list of estimated values(sum of distincts) from a given ArrayOfDoublesSketch. The result is _N_ double values, where _N_ is the number of double values kept in the sketch per key.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this needs an overall rephrase, basically with each distinct value, there is a array of doubles, with this post-aggregator we want to provide a double array with sum.

For E.g.

Key_1, {1.0, 2.0}
Key_2, {2.0, 5.0}

Output from Post-aggregator is {3.0, 7.0}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@techdocsmith @vtlim Can you help rephrase this

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This explanation works. Though it does need to be called out that result is an estimated sum and not accurate sum.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could This post aggregator returns a list of estimated values(sum) from a given ArrayOfDoublesSketch. be more like This post aggregator returns a list of estimated sum for each metric value from a given ArrayOfDoublesSketch. ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, just saw this. The example isn't clear to me; is the sum being added row-wise or column-wise?

For this example,

Key_1, {1.0, 3.0}
Key_2, {2.0, 5.0}

Would the result be {3.0, 8.0} or {4.0, 7.0}?

Commented a suggestion in https://github.com/apache/druid/pull/13819/files?file-filters%5B%5D=.md&show-viewed-files=true#r1120990146. It would be great to include this example in the docs as well.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Result would be {3.0, 8.0}

Copy link
Member

@rohangarg rohangarg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@anshu-makkar : Thanks a lot for the changes! I have left some comments on the core changes, please let me know your thoughts on them.

Copy link
Member

@rohangarg rohangarg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for addressing the review @anshu-makkar! 👍 I have done a second pass over the changes and left some comments. Request you to please take a look at it.


### Estimated metrics values for each column of ArrayOfDoublesSketch

This post aggregator returns a list of estimated values(sum) from a given ArrayOfDoublesSketch. The result is _N_ double values, where _N_ is the number of double values kept in the sketch per key.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could This post aggregator returns a list of estimated values(sum) from a given ArrayOfDoublesSketch. be more like This post aggregator returns a list of estimated sum for each metric value from a given ArrayOfDoublesSketch. ?

@anshu-makkar
Copy link
Contributor Author

@abhishekagarwal87 @rohangarg Thanks for taking time to review this. I will work on remaining review comment to add a test case to ArrayOfDoublesSketchAggregationTest

frankgrimes97 added a commit to frankgrimes97/druid that referenced this pull request Feb 28, 2023
Based on Druid native query support changes being added on:
  apache#13819
frankgrimes97 added a commit to frankgrimes97/druid that referenced this pull request Feb 28, 2023
Based on Druid native query support changes being added on:
  apache#13819
@abhishekagarwal87
Copy link
Contributor

thanks @anshu-makkar. Please let us know once you think it's ready for review. I think we are almost there.

@anshu-makkar
Copy link
Contributor Author

@abhishekagarwal87 : I have added the tests as suggested for native query testing for newly added post-aggs.

Add description of post-aggregator with example

Co-authored-by: Victoria Lim <[email protected]>
Copy link
Contributor

@abhishekagarwal87 abhishekagarwal87 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

some minor comments. LGTM otherwise.

Copy link
Member

@vtlim vtlim left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM for docs

@anshu-makkar
Copy link
Contributor Author

Request to merge, since we have another development activity on top of this to add SQL support to Tuple Sketches

@abhishekagarwal87 abhishekagarwal87 merged commit a10e415 into apache:master Mar 3, 2023
@abhishekagarwal87
Copy link
Contributor

Merged. Thank you for your contribution @anshu-makkar

frankgrimes97 added a commit to frankgrimes97/druid that referenced this pull request Mar 3, 2023
Includes tests which exercise the use-case outlined on:
  apache#13819
frankgrimes97 added a commit to frankgrimes97/druid that referenced this pull request Mar 6, 2023
Includes tests which exercise the use-case outlined on:
  apache#13819
frankgrimes97 added a commit to frankgrimes97/druid that referenced this pull request Mar 6, 2023
Includes tests which exercise the use-case outlined on:
  apache#13819
frankgrimes97 added a commit to frankgrimes97/druid that referenced this pull request Mar 6, 2023
Includes tests which exercise the use-case outlined on:
  apache#13819
@frankgrimes97 frankgrimes97 mentioned this pull request Mar 6, 2023
4 tasks
317brian pushed a commit to 317brian/druid that referenced this pull request Mar 10, 2023
You can now do the following operations with TupleSketches in Post Aggregation Step

Get the Sketch Output as Base64 String
Provide a constant Tuple Sketch in post-aggregation step that can be used in Set Operations
Get the Estimated Value(Sum) of Summary/Metrics Objects associated with Tuple Sketch
abhishekagarwal87 pushed a commit that referenced this pull request Mar 28, 2023
This PR is a follow-up to #13819 so that the Tuple sketch functionality can be used in SQL for both ingestion using Multi-Stage Queries (MSQ) and also for analytic queries against Tuple sketch columns.
@clintropolis clintropolis added this to the 26.0 milestone Apr 10, 2023
gianm added a commit to gianm/druid that referenced this pull request Jun 5, 2024
The version originally added in apache#13819 was missing an annotation for
the "value" property. Fixes apache#16539.

Line endings for ArrayOfDoublesSketchConstantPostAggregator.java are changed
from \r\n to \n.

Adds a serde test, and improves various other datasketches post-aggregator
serde tests to deserialize into PostAggregator. This verifies that the type
information is set up correctly.
gianm added a commit that referenced this pull request Jun 6, 2024
* Fix serde for ArrayOfDoublesSketchConstantPostAggregator.

The version originally added in #13819 was missing an annotation for
the "value" property. Fixes #16539.

Line endings for ArrayOfDoublesSketchConstantPostAggregator.java are changed
from \r\n to \n.

Adds a serde test, and improves various other datasketches post-aggregator
serde tests to deserialize into PostAggregator. This verifies that the type
information is set up correctly.

* Fix excessive imports.

* Fix equals, hashCode.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants