Add SpectatorHistogram extension #15340
CodeQL found more than 10 potential problems in the proposed changes. Check the Files changed tab for more details.
Thanks for the PR! This looks like a really cool addition to Druid. I'm going through the PR, and will add comments after it's done.
142ff7f to 9f9a27b
Thanks for the contribution, @bsyk!
I have not reviewed the code in depth, and have mostly read through the documentation.
From skimming the code, I think the extension is added in a safe way, and can be improved on in future patches.
Some things that could be done in the future:
- Add vectorized implementations of the aggregator so that queries can take advantage of vectorization.
- Add SQL bindings.
docs/configuration/extensions.md
Outdated
| gce-extensions | GCE Extensions | [link](../development/extensions-contrib/gce-extensions.md) |
| prometheus-emitter | Exposes [Druid metrics](../operations/metrics.md) for Prometheus server collection (https://prometheus.io/) | [link](../development/extensions-contrib/prometheus.md) |
| kubernetes-overlord-extensions | Support for launching tasks in k8s without Middle Managers | [link](../development/extensions-contrib/k8s-jobs.md) |
| druid-spectator-histogram | Support for efficient approximate percentile queries | [link](../development/extensions-contrib/spectator-histogram.md) |
Can you update your editor to undo the formatting changes to this table, please?
https://github.com/apache/druid/blob/master/dev/druid_intellij_formatting.xml#L77-L80 - This was recently added to the druid_intellij_formatting.xml file, so if you re-import it, the formatter should no longer update the tables when you edit them.
data-sketches (depending on data-set, see limitations below).

## Limitations
* Supports positive numeric values within the range of [0, 2^53). Negatives are
I think it would be good to call out that decimals are not supported - when I first read numeric values, I just assumed that decimals were supported, but the druid summit talk mentions those are not supported.
* Fixed buckets with increasing bucket widths. Relative accuracy is maintained,
but absolute accuracy reduces with larger values.
Can you explain the accuracy tradeoff here vs other sketch implementations?
I don't understand what "absolute accuracy reduces with larger values" means. Maybe an example in the docs will help clear it up.
I think that sort of information will be helpful for users to decide which sketch implementation to use for their use case.
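To illustrate the phrase in question, here is a minimal sketch that assumes hypothetical power-of-two buckets. This is not the bucket layout the extension uses, and the class and method names are made up for the example. In such a scheme each bucket is as wide as its lower bound, so the width relative to the values it holds stays constant while the absolute width grows with the value.

```java
// Hypothetical power-of-two bucketing, used only to illustrate the tradeoff.
// This is NOT the bucket layout the extension actually uses.
final class BucketWidthSketch
{
  // Lower bound of the bucket containing value: the largest power of two <= value.
  static long bucketLowerBound(long value)
  {
    if (value <= 0) {
      throw new IllegalArgumentException("value must be positive in this sketch");
    }
    return Long.highestOneBit(value);
  }

  // In this scheme the bucket [2^k, 2^(k+1)) is exactly as wide as its lower bound.
  static long bucketWidth(long value)
  {
    return bucketLowerBound(value);
  }

  // Bucket width divided by the bucket's lower bound; the same for every bucket.
  static double relativeWidth(long value)
  {
    return (double) bucketWidth(value) / bucketLowerBound(value);
  }
}
```

Under this assumption, a value of 10 lands in a bucket 8 wide, while a value of 1,000,000 lands in a bucket 524,288 wide, yet both buckets have the same width relative to their lower bound.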
}
```

> Note: It's more efficient to request multiple percentiles in a single query
nit: Given this note, would it be a nicer UX if the extension did not provide a way to get a single percentile? If users want to get a single percentile, they could pass in an array with one element.
I don't have a strong opinion on this, so if you think having both functions is better, that's fine with me too.
Where only a single percentile is wanted, often median or 95th, it's slightly nicer to get a single value back, rather than having to extract from an array in the results.
Also not a strong opinion.
Is the note misleading? It's trying to say, "if you want multiple percentiles from the same underlying metric, then ask for them all at once, rather than as separate metrics". 1 query being more efficient than 2.
// This will prevent class casting exceptions if trying to query with sum rather
// than explicitly as a SpectatorHistogram
//
// The SpectatorHistorgram is a Number. That number is of intValue(),
SpectatorHistogram
public void configure(Binder binder)
{
  registerSerde();
  //TODO: samarth this probably needs to be added for sql
This TODO could either be removed or reworded as a plain comment until SQL support is added.
import java.util.BitSet;
import java.util.Objects;

public class NullableOffsetsHeader implements Serializer
nit: A javadoc for this class as a small summary of its usages would help make the code more readable
}
final SpectatorHistogramAggregatorFactory that = (SpectatorHistogramAggregatorFactory) o;

//TODO: samarth should we check for equality of contents in count arrays?
Also to be resolved
This was left over from an earlier implementation and no longer relevant.
89d8b15 to 8f604d1
private void add(Object key, Number value)
{
  if (key instanceof String) {
    this.add(Integer.parseInt((String) key), value.longValue());
CodeQL code scanning notice: Missing catch of NumberFormatException
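One way to address this notice, as a minimal sketch: catch the `NumberFormatException` at the parse site and rethrow it with the offending key in the message. The class below is a hypothetical stand-in for the histogram's key handling, not the extension's code, and the choice to rethrow rather than skip the entry is an assumption.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for the histogram's key handling; names are illustrative.
final class BucketKeyParsing
{
  private final Map<Integer, Long> counts = new HashMap<>();

  void add(Object key, Number value)
  {
    final int bucket;
    if (key instanceof String) {
      try {
        bucket = Integer.parseInt((String) key);
      }
      catch (NumberFormatException e) {
        // Rethrow with context so a malformed key is easy to trace.
        throw new IllegalArgumentException("Invalid bucket key [" + key + "]", e);
      }
    } else if (key instanceof Number) {
      bucket = ((Number) key).intValue();
    } else {
      throw new IllegalArgumentException("Unsupported bucket key: " + key);
    }
    // Accumulate the count for this bucket.
    counts.merge(bucket, value.longValue(), Long::sum);
  }

  long get(int bucket)
  {
    return counts.getOrDefault(bucket, 0L);
  }
}
```

Since `NumberFormatException` already extends `IllegalArgumentException`, the wrapping changes only the message and keeps the original exception as the cause.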
}
// Treat as long number, if it looks like a number
if (Character.isDigit((objectString).charAt(0))) {
  return Long.parseLong((String) object);
CodeQL code scanning notice: Missing catch of NumberFormatException
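The digit check on the first character does not guarantee that the whole string parses: an input such as "12abc" passes the check and still throws. A minimal sketch of a lenient parse follows; the helper name and the fall-back-to-null behavior are assumptions, not what the extension does.

```java
// Hypothetical helper; the name and the null fallback are assumptions.
final class LenientLongParsing
{
  // Returns the parsed value, or null when the string is not a valid long.
  static Long tryParseLong(String s)
  {
    // Guard against null and empty input before inspecting the first character.
    if (s == null || s.isEmpty() || !Character.isDigit(s.charAt(0))) {
      return null;
    }
    try {
      return Long.parseLong(s);
    }
    catch (NumberFormatException e) {
      // Starts with a digit but is not a valid long, e.g. "12abc" or an out-of-range value.
      return null;
    }
  }
}
```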
Assert.assertEquals(1, results.size());
Map<String, ColumnAnalysis> columns = results.get(0).getColumns();
Assert.assertNotNull(columns.get("histogram"));
Assert.assertEquals("spectatorHistogramTimer", columns.get("histogram").getType());
CodeQL code scanning notice (test): Deprecated method or constructor invocation: ColumnAnalysis.getType
Assert.assertEquals(1, results.size());
Map<String, ColumnAnalysis> columns = results.get(0).getColumns();
Assert.assertNotNull(columns.get("histogram"));
Assert.assertEquals("spectatorHistogramDistribution", columns.get("histogram").getType());
CodeQL code scanning notice (test): Deprecated method or constructor invocation: ColumnAnalysis.getType
byte[] bytes = histogram.toBytes();
int keySize = Short.BYTES;
int valSize = 0;
Assert.assertEquals("Should compact small values within key bytes", 5 * (keySize + valSize), bytes.length);
CodeQL code scanning warning (test): Result of multiplication cast to wider type (int multiplication)
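For context on this warning, a minimal sketch of what it guards against: an `int` multiplication is evaluated in 32 bits and only afterwards widened to `long`, presumably here because the `assertEquals` overload being resolved takes `long` arguments. The class and method names below are illustrative only.

```java
// Illustrative only: shows why CodeQL flags int multiplication that is widened afterwards.
final class WideningMultiplication
{
  // int * int is evaluated in 32 bits and only then widened, so it can overflow first.
  static long multiplyThenWiden(int count, int bytesPerEntry)
  {
    return count * bytesPerEntry;
  }

  // Widening one operand first makes the multiplication itself 64-bit.
  static long widenThenMultiply(int count, int bytesPerEntry)
  {
    return (long) count * bytesPerEntry;
  }
}
```

With the small constants in these tests the product cannot overflow, so the warning looks benign here; casting one operand to `long` would silence it.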
byte[] bytes = histogram.toBytes();
int keySize = Short.BYTES;
int valSize = Short.BYTES;
Assert.assertEquals("Should compact medium values to short", 5 * (keySize + valSize), bytes.length);
CodeQL code scanning warning (test): Result of multiplication cast to wider type (int multiplication)
byte[] bytes = histogram.toBytes();
int keySize = Short.BYTES;
int valSize = Integer.BYTES;
Assert.assertEquals("Should compact larger values to integer", 5 * (keySize + valSize), bytes.length);
CodeQL code scanning warning (test): Result of multiplication cast to wider type (int multiplication)
byte[] bytes = histogram.toBytes();
int keySize = Short.BYTES;
int valSize = Long.BYTES;
Assert.assertEquals("Should not compact larger values", 5 * (keySize + valSize), bytes.length);
CodeQL code scanning warning (test): Result of multiplication cast to wider type (int multiplication)
byte[] bytes = histogram.toBytes();
int keySize = Short.BYTES;
Assert.assertEquals("Should not compact larger values", (5 * keySize) + 0 + 2 + 4 + 8 + 8, bytes.length);
CodeQL code scanning warning (test): Result of multiplication cast to wider type (int multiplication)
byte[] bytes = histogram.toBytes();
int keySize = Short.BYTES;
Assert.assertEquals("Should compact", (8 * keySize) + 0 + 1 + 1 + 2 + 2 + 4 + 4 + 8, bytes.length);
CodeQL code scanning warning (test): Result of multiplication cast to wider type (int multiplication)
8f604d1 to 2e13c64
Thanks for addressing the comments! I've gone through the changes, and they seem to be fine. The build failures look unrelated.
Left one more comment to try to explain the opinionated behavior. Otherwise LGTM for docs.
Fantastic. Thanks for all your suggestions, they've made the docs a lot more clear.
@suneet-s @vtlim @adarshsanjeev
Cleanup comments
so that we support being queried as a Number using longSum or doubleSum aggregators as well as a histogram. When queried as a Number, we're returning the count of entries in the histogram.
Co-authored-by: Victoria Lim <[email protected]>
Co-authored-by: Victoria Lim <[email protected]>
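A minimal sketch of the behavior the "Cleanup comments" commit message describes, using a hypothetical stand-in class rather than the extension's own SpectatorHistogram: a histogram that extends `Number` and reports its total entry count, so that a sum-style aggregator calling `longValue()` or `doubleValue()` gets a number instead of failing a cast. The bucket map and method names are assumptions.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in, not the extension's SpectatorHistogram class.
final class CountingHistogram extends Number
{
  private final Map<Integer, Long> bucketCounts = new HashMap<>();

  // Record count entries in the given bucket.
  void add(int bucket, long count)
  {
    bucketCounts.merge(bucket, count, Long::sum);
  }

  // Queried as a Number, the histogram reports the total count of its entries.
  @Override
  public long longValue()
  {
    long total = 0;
    for (long count : bucketCounts.values()) {
      total += count;
    }
    return total;
  }

  @Override
  public int intValue()
  {
    return (int) longValue();
  }

  @Override
  public double doubleValue()
  {
    return (double) longValue();
  }

  @Override
  public float floatValue()
  {
    return (float) longValue();
  }
}
```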
87f2516 to 78e0235
Since this is already approved, it is probably fine to make the changes I suggested in a follow-up PR.
@Override
public Comparator<Double> getComparator()
{
  return Doubles::compare;
This doesn't seem like the correct comparator if the output type is a double array. ColumnType has a comparator available: type.getNullableStrategy() if there might be nulls, or type.getStrategy() if not. That should work.
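To make the mismatch concrete, here is a plain-Java sketch of what a comparator over double arrays has to do, in contrast to comparing single `Double` values. It does not use the Druid type strategy mentioned above, which would be the preferred fix; the class name and the nulls-first ordering are assumptions for illustration.

```java
import java.util.Comparator;

// Illustrative only; the real fix would use the comparator supplied by the column type.
final class DoubleArrayComparison
{
  // Nulls sort first; non-null arrays compare element by element, then by length.
  static int compare(double[] a, double[] b)
  {
    if (a == b) {
      return 0;
    }
    if (a == null) {
      return -1;
    }
    if (b == null) {
      return 1;
    }
    final int common = Math.min(a.length, b.length);
    for (int i = 0; i < common; i++) {
      final int c = Double.compare(a[i], b[i]);
      if (c != 0) {
        return c;
      }
    }
    return Integer.compare(a.length, b.length);
  }

  static final Comparator<double[]> COMPARATOR = DoubleArrayComparison::compare;
}
```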
@Override
public ValueType getType()
{
  return ValueType.COMPLEX;
}

@Override
public ValueType getFinalizedType()
{
  return ValueType.COMPLEX;
}
These methods are deprecated and will be removed at some point; please implement getIntermediateType and getResultType instead.
@bsyk
Description
Adds a contrib extension providing support for SpectatorHistogram, a fast, small alternative to data-sketches or T-Digest for computing approximate percentiles.
See documentation included in the PR for more details.
This PR has: