-
Notifications
You must be signed in to change notification settings - Fork 3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Apache DataSketches plugin for Trino #6643
Apache DataSketches plugin for Trino #6643
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for working on this. Can we have some tests for like validating the result when we hit it to Trino. And a docs describing the functions would also be helpful.
...ino-datasketches/src/main/java/io/trino/plugin/datasketches/state/SketchStateSerializer.java
Outdated
Show resolved
Hide resolved
plugin/trino-datasketches/src/main/java/io/trino/plugin/datasketches/theta/Estimate.java
Outdated
Show resolved
Hide resolved
return GroupedSketchState.class; | ||
} | ||
|
||
public static class GroupedSketchState |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since they are more details to capture, can we move it to a new file ? WDYT ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think keeping the structure consistent with other state factories would make more sense. Ex. HyperLogLog, LongApproximateMostFrequent, NumericHistogram etc.
plugin/trino-datasketches/src/main/java/io/trino/plugin/datasketches/theta/Union.java
Show resolved
Hide resolved
Please change the commit message to Trino |
b1d0670
to
6fec5ed
Compare
6fec5ed
to
52fc9b9
Compare
636a2f5
to
f814770
Compare
Apologies for the delay in getting back to this. I have addressed all the comments and added the doc as well. Thanks for reviewing the PR, do let me know if it needs more changes. |
@@ -0,0 +1,36 @@ | |||
====================== | |||
DataSketches Connector |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this plugin is set of functions as Teradata functions and Machine learning functions. I would move to functions
directory.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have made the suggested changes to the doc. Please let me know if the there is any issue with sphinx syntax or anything needs to be changed
plugin/trino-datasketches/src/main/java/io/trino/plugin/datasketches/theta/SketchUDFPlugin.java
Outdated
Show resolved
Hide resolved
plugin/trino-datasketches/src/main/java/io/trino/plugin/datasketches/theta/Estimate.java
Outdated
Show resolved
Hide resolved
union.update(sketch2); | ||
Sketch unionResult = union.getResult(); | ||
|
||
assertEquals(unionResult.getEstimate(), estimate, 0); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you also add tests using Trino queries? You can refer to TestMLQueries
or any other function tests.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will look into this as soon as I get some time.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looked into TestMLQueries and adding test cases using Trino queries, but, for that, we will either need to support the creation of datasketches in Presto or register Hive UDFs for test cases for inserting Sketch in the database in binary format. We can add this when we add support for the creation of Sketches in Presto.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As we talked in Slack, I assume it's testable without such supports. We can prepare Sketches data in Java library and then cast to varbinary format at this time.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
BytesWritable
from hadoop turned out to be the key to insert the sketches in binary format in the database. Added the test case.
plugin/trino-datasketches/src/main/java/io/trino/plugin/datasketches/theta/Union.java
Outdated
Show resolved
Hide resolved
f814770
to
8a09d22
Compare
127131b
to
a638a5e
Compare
a638a5e
to
b772682
Compare
@ebyhr, Do let me know if any more changes are needed to this. |
👋 @ShashwatArghode - this PR has become inactive. If you're still interested in working on it, please let us know, and we can try to get reviewers to help with that. We're working on closing out old and inactive PRs, so if you're too busy or this has too many merge conflicts to be worth picking back up, we'll be making another pass to close it out in a few weeks. |
Hi @bitsondatadev, |
public interface SketchState | ||
extends AccumulatorState | ||
{ | ||
Slice getSketch(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
#14290 added a Theta sketch aggregation function: IcebergThetaSketchForStats
I forgot about this PR (and it is still waiting for "syntax review"), but anyway that function is for Iceberg plugin's internal use. Also this happens to be sketch building aggregation function that's not being added here.
Anyway, i was able to use UpdateSketch
directly in the state class.
For sketch union aggregation, the state could probably hold directly onto an org.apache.datasketches.theta.Union
object
private UnionWithParams() {} | ||
|
||
@InputFunction | ||
public static void input(@AggregationState SketchState state, @SqlType(StandardTypes.VARBINARY) Slice inputValue, @SqlType(StandardTypes.INTEGER) Integer normEntries, @SqlType(StandardTypes.BIGINT) Long seed) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What does normEntries mean?
{ | ||
state.setNominalEntries(normEntries); | ||
state.setSeed(seed); | ||
state.setSketch(inputValue); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can this overwrite the state/result after previous input
function invocation?
DataSketches functions | ||
---------------------- | ||
|
||
.. function:: theta_sketch_union(sketches) -> sketch |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
there are two variants of theta_sketch_union
, one having parameters. The parameters need to be documented.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
also, should the docs say this is an aggregation function?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
-> sketch
there is no "sketch" type. should this be "varbinary"?
I was just skimming. Not reviewing until the syntax is reviewed (#6643 (comment)) |
Hi @findepi , could I ask what open concerns are there on the syntax? I couldn't find relevant context from the comment above. |
We are interested in this as well and can provide any help required to take this forward. |
👋 @ShashwatArghode - this PR is inactive and doesn't seem to be under development. If you'd like to continue work on this at any point in the future, feel free to re-open. |
Plugin to query Apache Datasketches (https://datasketches.apache.org/docs/Theta/ThetaSketchFramework.html). This includes merge and estimate function for theta sketch.