Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

KLL sketch #12498

Merged
merged 13 commits into from
Aug 27, 2022
Merged

KLL sketch #12498

merged 13 commits into from
Aug 27, 2022

Conversation

AlexanderSaydakov
Copy link
Contributor

This is a module to support new KllFloatsSketch and KllDoublesSketch from Apache DataSketches

@gianm
Copy link
Contributor

gianm commented May 9, 2022

Very cool! Thanks @AlexanderSaydakov!

Has this code been used in a production environment? Any gotches we should know about?

There appears to be a legitimate issue in KllDoublesSketchAggregatorTest.buildingSketchesAtQueryTime when SQL-compatible null handling is enabled: https://app.travis-ci.com/github/apache/druid/jobs/569163704

[ERROR] org.apache.druid.query.aggregation.datasketches.kll.KllFloatsSketchAggregatorTest.buildingSketchesAtQueryTime[groupByConfig = v2, vectorize = true]
[ERROR]   Run 1: KllFloatsSketchAggregatorTest.buildingSketchesAtQueryTime:367 expected:<377> but was:<355>
[ERROR]   Run 2: KllFloatsSketchAggregatorTest.buildingSketchesAtQueryTime:367 expected:<377> but was:<355>
[ERROR]   Run 3: KllFloatsSketchAggregatorTest.buildingSketchesAtQueryTime:367 expected:<377> but was:<355>
[ERROR]   Run 4: KllFloatsSketchAggregatorTest.buildingSketchesAtQueryTime:367 expected:<377> but was:<355>

@leerho
Copy link
Contributor

leerho commented May 9, 2022

This code is included in our datasketches-java 3.2.0 release dated 27 Apr 2022. There are no gotches we know about, yet :)

@AlexanderSaydakov
Copy link
Contributor Author

@gianm

Has this code been used in a production environment? Any gotches we should know about?

No, this code has not been used yet. I am not aware of any gotchas.
I believe I fixed that error in the test you pointed out.

@gianm
Copy link
Contributor

gianm commented May 26, 2022

FYI I haven't had a chance to take a look at this but it is in my queue. If anyone gets to review it before I do then thanks!

Copy link
Contributor

@gianm gianm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your patience @AlexanderSaydakov. I'm nearly prepared to merge this one. Other than the comments on docs and merging factory, my biggest question is whether the array return from postaggs is going to play well with usage in expressions. (For example: expression virtual columns in an outer query, or an expression post-aggregator in this same query.) @clintropolis would you be able to take a look at this specific part of it?

public static final ColumnType STRING_ARRAY = new ColumnType(ValueType.ARRAY, null, STRING);
public static final ColumnType LONG_ARRAY = new ColumnType(ValueType.ARRAY, null, LONG);
public static final ColumnType DOUBLE_ARRAY = new ColumnType(ValueType.ARRAY, null, DOUBLE);
public static final ColumnType FLOAT_ARRAY = new ColumnType(ValueType.ARRAY, null, FLOAT);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@clintropolis what are your thoughts on the implications of adding float array? Do we need to update any other code?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Float arrays are allowed to exist in the type system, we just didn't have anything producing them so I didn't make a constant for it. The type system even allows nested arrays if druid.expressions.allowNestedArrays is true (but it is off by default), so there should be no problem with it. I will ensure that float[] is handled correctly in the expression system when I make the other adjustments.

|type|This String should be "KllFloatsSketch" or "KllDoublesSketch"|yes|
|name|A String for the output (result) name of the calculation.|yes|
|fieldName|A String for the name of the input field (can contain sketches or raw numeric values).|yes|
|k|Parameter that determines the accuracy and size of the sketch. Higher k means higher accuracy but more space to store sketches. Must be from 8 to 65535.|no, defaults to 200|
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a table available of k -> space required and expected accuracy? Would be great to provide some additional guidance to users about this. It's the most common question I get about using sketches in Druid.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am working on adding such a table on our web site, and I will refer to it here

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

|name|A String for the output (result) name of the calculation.|yes|
|fieldName|A String for the name of the input field (can contain sketches or raw numeric values).|yes|
|k|Parameter that determines the accuracy and size of the sketch. Higher k means higher accuracy but more space to store sketches. Must be from 8 to 65535.|no, defaults to 200|
|maxStreamLength|This parameter defines the number of items presented to each sketch before it might, in the context of a BufferAggregator, grow larger than a preallocated memory region and need to move on heap. Ideally just a few sketches should grow that large.|no, defaults to 1000000000|
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Users won't necessarily know what BufferAggregators are, since they're an implementation detail. Suggested alternate wording:

This parameter defines the number of items that can be presented to each sketch before it may need to move from off-heap to on-heap memory. This is relevant to query types that use off-heap memory, including [TopN](../../querying/topnquery.md) and [GroupBy](../../querying/groupbyquery.md). Ideally, should be set high enough such that most sketches can stay off-heap.

This raises the question in users' minds: why not set it to some huge value like Long.MAX_VALUE? I guess that the downside of setting maxStreamLength higher is increased off-heap memory requirements. Is there a table or formula available about the relationship between maxStreamLength and required off-heap memory?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, I like your suggestion to reword. And again we need a table of sketch sizes, which I will provide.

@@ -56,7 +56,7 @@ The result of the aggregation is a DoublesSketch that is the union of all sketch
|name|A String for the output (result) name of the calculation.|yes|
|fieldName|A String for the name of the input field (can contain sketches or raw numeric values).|yes|
|k|Parameter that determines the accuracy and size of the sketch. Higher k means higher accuracy but more space to store sketches. Must be a power of 2 from 2 to 32768. See [accuracy information](https://datasketches.apache.org/docs/Quantiles/OrigQuantilesSketch) in the DataSketches documentation for details.|no, defaults to 128|
|maxStreamLength|This parameter is a temporary solution to avoid a [known issue](https://github.com/apache/druid/issues/11544). It may be removed in a future release after the bug is fixed. This parameter defines the maximum number of items to store in each sketch. If a sketch reaches the limit, the query can throw `IllegalStateException`. To workaround this issue, increase the maximum stream length. See [accuracy information](https://datasketches.apache.org/docs/Quantiles/OrigQuantilesSketch) in the DataSketches documentation for how many bytes are required per stream length.|no, defaults to 1000000000|
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, good call. Documenting something as a temporary solution is a great way to ensure it sticks around forever 🙂

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, the parameter was introduced to temporarily work around a bug, but after some discussion it was decided that it might be useful regardless

return new KllDoublesSketchMergeAggregatorFactory(
getName(),
Math.max(getK(), ((KllDoublesSketchAggregatorFactory) other).getK()),
getMaxStreamLength()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's nicer for this stuff to be symmetric; would it make sense to use max here?

Like, Math.max(getMaxStreamLength(), ((KllDoublesSketchAggregatorFactory) other).getMaxStreamLength())

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, I think it makes sense.

@Override
public boolean canVectorize(ColumnInspector columnInspector)
{
return true;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🚀 🚀 🚀

if (sketch.isEmpty()) {
final double[] cdf = new double[splitPoints.length + 1];
Arrays.fill(cdf, Double.NaN);
return cdf;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@clintropolis is double[] a valid return class for something declared as DOUBLE_ARRAY? (Or should it be Double[] or Object[]?) I didn't find javadocs where we codify this. Ideally it should be linked from ColumnType or ValueType.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think double[] is ok to return here (it actually isn't currently, if the output of this is fed into an expression without a round-trip through Jackson, but this I think is a flaw and I'm going to fix that). When I fix the issue with double[] I will also update the javadocs for ColumnType to indicate the allowed return types.

{
final KllDoublesSketch sketch = (KllDoublesSketch) field.compute(combinedAggregators);
if (sketch.isEmpty()) {
final double[] quantiles = new double[fractions.length];
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@clintropolis similar question here about DOUBLE_ARRAY and double[]?

return new KllFloatsSketchMergeAggregatorFactory(
getName(),
Math.max(getK(), ((KllFloatsSketchAggregatorFactory) other).getK()),
getMaxStreamLength()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similar comment to KllDoublesSketchAggregatorFactory: It's nicer for this stuff to be symmetric; would it make sense to use max here?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to let you know, Alex is on vacation and won't be back until Aug 15th.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for letting us know. We took long enough to review this, so we can certainly wait a week to continue the conversation.

Copy link
Contributor

@gianm gianm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM after CI passes!

The remaining issue in phase 1 of CI is the docs spellcheck. The following terms are not in the dictionary: KLL, KllFloatsSketch, KllDoublesSketch, PMF, CDF, maxStreamLength, and toString (https://app.travis-ci.com/github/apache/druid/jobs/580267197).

@AlexanderSaydakov you can address these by adding them to the dictionary (the website/.spelling file).

@gianm
Copy link
Contributor

gianm commented Aug 24, 2022

Two test runs failed with the below segfault in KllDoublesSketchAggregatorTest (first time) and KllFloatsSketchAggregatorTest (second time).

Judging by the fact it's BaseWritableMemoryImpl.getNativeOrderedLong, it could be an out-of-bound access or a use-after-free. I tried running the test case in a loop in my IDE, and unfortunately wasn't able to repro this.

In looking at the code, I suspect it's due to a use-after-free due to returning an object from KllSketchBuildBufferAggregatorHelper.get that references the underlying buffer. That's used to implement BufferAggregator.get and VectorAggregator.get, which by contract aren't permitted to return objects that reference the underlying buffer. (It may be freed, or reused, while those get'ed objects still exist.)

Similar code in other sketch helpers creates on-heap copies (hll uses copy(), quantiles uses compact()). Could we do the same thing here?

#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x00007f69d737e924, pid=21102, tid=21103
#
# JRE version: OpenJDK Runtime Environment (11.0.2+9) (build 11.0.2+9)
# Java VM: OpenJDK 64-Bit Server VM (11.0.2+9, mixed mode, tiered, compressed oops, g1 gc, linux-amd64)
# Problematic frame:
# J 14439 c1 org.apache.datasketches.memory.internal.BaseWritableMemoryImpl.getNativeOrderedLong(J)J (38 bytes) @ 0x00007f69d737e924 [0x00007f69d737e7a0+0x0000000000000184]
#
# No core dump will be written. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again
#
# An error report file with more information is saved as:
# /home/travis/build/apache/druid/extensions-core/datasketches/hs_err_pid21102.log
Could not load hsdis-amd64.so; library not loadable; PrintAssembly is disabled
#
# If you would like to submit a bug report, please visit:
#   http://bugreport.java.com/bugreport/crash.jsp
#

@gianm
Copy link
Contributor

gianm commented Aug 25, 2022

So close! There's a conflict in ColumnType, probably from #12914, which was a response to the discussion in this PR about better supporting the arrays returned by the KLL postaggs. I took the liberty of resolving the conflict using GitHub's "resolve conflict" feature.

@gianm
Copy link
Contributor

gianm commented Aug 25, 2022

Hmm. Another crash.

#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x00007f193747fdac, pid=21257, tid=0x00007f194d6bd700
#
# JRE version: OpenJDK Runtime Environment (8.0_292-b10) (build 1.8.0_292-8u292-b10-0ubuntu1~16.04.1-b10)
# Java VM: OpenJDK 64-Bit Server VM (25.292-b10 mixed mode linux-amd64 compressed oops)
# Problematic frame:
# J 14078 C1 org.apache.datasketches.memory.internal.BaseWritableMemoryImpl.getNativeOrderedLong(J)J (38 bytes) @ 0x00007f193747fdac [0x00007f193747fc20+0x18c]
#
# Failed to write core dump. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again
#
# An error report file with more information is saved as:
# /home/travis/build/apache/druid/extensions-core/datasketches/hs_err_pid21257.log
# If you would like to submit a bug report, please visit:
#   http://bugreport.java.com/bugreport/crash.jsp
#

@gianm
Copy link
Contributor

gianm commented Aug 26, 2022

I suspect the additional crash was due to a similar problem, but for unions. I pushed 7e117c7 to the branch which will copy those too. @AlexanderSaydakov if there's a better way to do this, please let me know (or just change it).

@AlexanderSaydakov
Copy link
Contributor Author

I suspect the additional crash was due to a similar problem, but for unions. I pushed 7e117c7 to the branch which will copy those too. @AlexanderSaydakov if there's a better way to do this, please let me know (or just change it).

Looks fine to me. Sorry, I missed that part.

@gianm
Copy link
Contributor

gianm commented Aug 27, 2022

Looks like the latest build passed tests. I'll merge it. Thanks @AlexanderSaydakov !

@gianm gianm merged commit 7e2371b into master Aug 27, 2022
@gianm gianm deleted the kll_sketch branch August 27, 2022 04:19
@kfaraz kfaraz added this to the 25.0 milestone Nov 22, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants