-
Notifications
You must be signed in to change notification settings - Fork 3.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
KLL sketch #12498
KLL sketch #12498
Conversation
Very cool! Thanks @AlexanderSaydakov! Has this code been used in a production environment? Any gotches we should know about? There appears to be a legitimate issue in
|
This code is included in our datasketches-java 3.2.0 release dated 27 Apr 2022. There are no gotches we know about, yet :) |
No, this code has not been used yet. I am not aware of any gotchas. |
FYI I haven't had a chance to take a look at this but it is in my queue. If anyone gets to review it before I do then thanks! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for your patience @AlexanderSaydakov. I'm nearly prepared to merge this one. Other than the comments on docs and merging factory, my biggest question is whether the array return from postaggs is going to play well with usage in expressions. (For example: expression virtual columns in an outer query, or an expression post-aggregator in this same query.) @clintropolis would you be able to take a look at this specific part of it?
public static final ColumnType STRING_ARRAY = new ColumnType(ValueType.ARRAY, null, STRING); | ||
public static final ColumnType LONG_ARRAY = new ColumnType(ValueType.ARRAY, null, LONG); | ||
public static final ColumnType DOUBLE_ARRAY = new ColumnType(ValueType.ARRAY, null, DOUBLE); | ||
public static final ColumnType FLOAT_ARRAY = new ColumnType(ValueType.ARRAY, null, FLOAT); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@clintropolis what are your thoughts on the implications of adding float array? Do we need to update any other code?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Float arrays are allowed to exist in the type system, we just didn't have anything producing them so I didn't make a constant for it. The type system even allows nested arrays if druid.expressions.allowNestedArrays
is true (but it is off by default), so there should be no problem with it. I will ensure that float[]
is handled correctly in the expression system when I make the other adjustments.
|type|This String should be "KllFloatsSketch" or "KllDoublesSketch"|yes| | ||
|name|A String for the output (result) name of the calculation.|yes| | ||
|fieldName|A String for the name of the input field (can contain sketches or raw numeric values).|yes| | ||
|k|Parameter that determines the accuracy and size of the sketch. Higher k means higher accuracy but more space to store sketches. Must be from 8 to 65535.|no, defaults to 200| |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a table available of k -> space required and expected accuracy? Would be great to provide some additional guidance to users about this. It's the most common question I get about using sketches in Druid.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am working on adding such a table on our web site, and I will refer to it here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks!
|name|A String for the output (result) name of the calculation.|yes| | ||
|fieldName|A String for the name of the input field (can contain sketches or raw numeric values).|yes| | ||
|k|Parameter that determines the accuracy and size of the sketch. Higher k means higher accuracy but more space to store sketches. Must be from 8 to 65535.|no, defaults to 200| | ||
|maxStreamLength|This parameter defines the number of items presented to each sketch before it might, in the context of a BufferAggregator, grow larger than a preallocated memory region and need to move on heap. Ideally just a few sketches should grow that large.|no, defaults to 1000000000| |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Users won't necessarily know what BufferAggregators are, since they're an implementation detail. Suggested alternate wording:
This parameter defines the number of items that can be presented to each sketch before it may need to move from off-heap to on-heap memory. This is relevant to query types that use off-heap memory, including
[TopN](../../querying/topnquery.md)
and[GroupBy](../../querying/groupbyquery.md)
. Ideally, should be set high enough such that most sketches can stay off-heap.
This raises the question in users' minds: why not set it to some huge value like Long.MAX_VALUE
? I guess that the downside of setting maxStreamLength
higher is increased off-heap memory requirements. Is there a table or formula available about the relationship between maxStreamLength
and required off-heap memory?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, I like your suggestion to reword. And again we need a table of sketch sizes, which I will provide.
@@ -56,7 +56,7 @@ The result of the aggregation is a DoublesSketch that is the union of all sketch | |||
|name|A String for the output (result) name of the calculation.|yes| | |||
|fieldName|A String for the name of the input field (can contain sketches or raw numeric values).|yes| | |||
|k|Parameter that determines the accuracy and size of the sketch. Higher k means higher accuracy but more space to store sketches. Must be a power of 2 from 2 to 32768. See [accuracy information](https://datasketches.apache.org/docs/Quantiles/OrigQuantilesSketch) in the DataSketches documentation for details.|no, defaults to 128| | |||
|maxStreamLength|This parameter is a temporary solution to avoid a [known issue](https://github.com/apache/druid/issues/11544). It may be removed in a future release after the bug is fixed. This parameter defines the maximum number of items to store in each sketch. If a sketch reaches the limit, the query can throw `IllegalStateException`. To workaround this issue, increase the maximum stream length. See [accuracy information](https://datasketches.apache.org/docs/Quantiles/OrigQuantilesSketch) in the DataSketches documentation for how many bytes are required per stream length.|no, defaults to 1000000000| |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, good call. Documenting something as a temporary solution is a great way to ensure it sticks around forever 🙂
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, the parameter was introduced to temporarily work around a bug, but after some discussion it was decided that it might be useful regardless
return new KllDoublesSketchMergeAggregatorFactory( | ||
getName(), | ||
Math.max(getK(), ((KllDoublesSketchAggregatorFactory) other).getK()), | ||
getMaxStreamLength() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's nicer for this stuff to be symmetric; would it make sense to use max here?
Like, Math.max(getMaxStreamLength(), ((KllDoublesSketchAggregatorFactory) other).getMaxStreamLength())
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, I think it makes sense.
@Override | ||
public boolean canVectorize(ColumnInspector columnInspector) | ||
{ | ||
return true; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🚀 🚀 🚀
if (sketch.isEmpty()) { | ||
final double[] cdf = new double[splitPoints.length + 1]; | ||
Arrays.fill(cdf, Double.NaN); | ||
return cdf; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@clintropolis is double[]
a valid return class for something declared as DOUBLE_ARRAY
? (Or should it be Double[]
or Object[]
?) I didn't find javadocs where we codify this. Ideally it should be linked from ColumnType or ValueType.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think double[]
is ok to return here (it actually isn't currently, if the output of this is fed into an expression without a round-trip through Jackson, but this I think is a flaw and I'm going to fix that). When I fix the issue with double[]
I will also update the javadocs for ColumnType
to indicate the allowed return types.
{ | ||
final KllDoublesSketch sketch = (KllDoublesSketch) field.compute(combinedAggregators); | ||
if (sketch.isEmpty()) { | ||
final double[] quantiles = new double[fractions.length]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@clintropolis similar question here about DOUBLE_ARRAY
and double[]
?
return new KllFloatsSketchMergeAggregatorFactory( | ||
getName(), | ||
Math.max(getK(), ((KllFloatsSketchAggregatorFactory) other).getK()), | ||
getMaxStreamLength() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similar comment to KllDoublesSketchAggregatorFactory: It's nicer for this stuff to be symmetric; would it make sense to use max here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just to let you know, Alex is on vacation and won't be back until Aug 15th.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for letting us know. We took long enough to review this, so we can certainly wait a week to continue the conversation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM after CI passes!
The remaining issue in phase 1 of CI is the docs spellcheck. The following terms are not in the dictionary: KLL, KllFloatsSketch, KllDoublesSketch, PMF, CDF, maxStreamLength, and toString (https://app.travis-ci.com/github/apache/druid/jobs/580267197).
@AlexanderSaydakov you can address these by adding them to the dictionary (the website/.spelling
file).
Two test runs failed with the below segfault in Judging by the fact it's In looking at the code, I suspect it's due to a use-after-free due to returning an object from Similar code in other sketch helpers creates on-heap copies (hll uses
|
So close! There's a conflict in ColumnType, probably from #12914, which was a response to the discussion in this PR about better supporting the arrays returned by the KLL postaggs. I took the liberty of resolving the conflict using GitHub's "resolve conflict" feature. |
Hmm. Another crash.
|
I suspect the additional crash was due to a similar problem, but for unions. I pushed 7e117c7 to the branch which will copy those too. @AlexanderSaydakov if there's a better way to do this, please let me know (or just change it). |
Looks fine to me. Sorry, I missed that part. |
Looks like the latest build passed tests. I'll merge it. Thanks @AlexanderSaydakov ! |
This is a module to support new KllFloatsSketch and KllDoublesSketch from Apache DataSketches