-
Notifications
You must be signed in to change notification settings - Fork 3.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
HLLSketchMerge aggregator failing for some metrics after upgrade to v0.18 #9736
Comments
Hi @scrawfor, would you be able to upload one of the specific segments that has this problem? It sounds like it might be a backwards compatibility issue in DataSketches, but having a copy of the actual sketch image would help debug. /cc @AlexanderSaydakov for fyi |
By the way, Druid 0.16 used DataSketches 0.13.4, and Druid 0.18 uses DataSketches 1.2.0-incubating. |
Unfortunately I can't upload it. I can try to recreate the issue with generic data, but I'm not sure how productive that would be. Are there any other debugging steps I could take? I tried to load a 0.17.1 historical server to see if it occurred in that release as well, but I ran into issues with the authentication extension and was unable to get the node started. |
@scrawfor Perhaps you could extract and upload just the HLL column? It only contains sketches of data, so it's less sensitive than the entire segment. A good way to do it is to unzip the segment and look at the
You could extract it by running:
What I'm after is a binary image of a specific sketch that exhibits the problem — I think once we have that it should be easier to find and fix it. If it is possible to reproduce this on some test data that would be great too. |
Do we have a stack trace that shows where in the sketch code this occurs? |
It would also be helpful to know in what kind of operation: update or merge, getResult(), getEstimate(), etc. |
The exception thrown is "java.util.concurrent.ExecutionException". Our sketches are not thread-safe with the single exception of our Theta Sketch, which has concurrent configuration options. So if there is more than one thread touching the HLL sketch, all bets are off. I can't explain why it doesn't fail in Druid 0.16, unless your threading model changed between 0.16 and 0.18. If concurrency is the problem, the simplest & fastest way to fix this would be to put a synchronized wrapper around the sketch. |
@gianm I'll extract it tomorrow. Thanks for the instructions. @leerho It's a merge aggregator. {
"type" : "HLLSketchMerge",
"name" : "unique_views_hll",
"fieldName" : "unique_views_hll"
} Here is a stack strace
|
Thanks for the stack trace. It is very helpful. I think I may have a clue what may be happening. The specific exception thrown by the HLL union operator is:
This occurs in only one place, which is the union.update(HllSketch input) method. It is detecting that a special flag is set in the input sketch that is only ever set during a union operation. Thus the incoming sketch had to be an image of a union operator and not that of a streaming updatable sketch nor a streaming compact sketch. This flag is used to detect that the internal state of the union data structure is not finalized. It is only finalized when you call
Using a serialized union operator as input to a union merge operation is not legal and it never has been. The example code also illustrates the use of union.getResult(type), and that example code has been there for several years. It is only with version 1.2.0 that this special case is properly detected. There still might be a problem in the HLL code, but I would appreciate it if you could check your usage code and see if my hunch is correct. Lee. |
Also, please check that these sketches are only touched by one thread at a time. The "concurrent" exception that was thrown makes me nervous :) |
@gianm I pulled out all sensitive info and re-indexed our data, so I'm attaching the full segment. The offending metric is Also, I did some tests with other query types and found timeseries and topN queries were successful. @leerho I'll have to leave that to others more familiar with the druid extension to comment on, but looking at the code it does seem like a lock is acquired. |
@scrawfor Thanks for the upload. Do you have an example of a query that exhibits the problem? |
Sure. {
"dataSource" : "hlltest",
"queryType" : "groupBy",
"intervals" : [ "2020-04-06/2020-04-07" ],
"granularity" : "ALL",
"aggregations" : [{
"type" : "HLLSketchMerge",
"name" : "unique_views_hll",
"fieldName" : "unique_views_hll"
}],
"limitSpec" : {
"type" : "default",
"limit" : 50000,
"columns" : [ {
"dimension" : "unique_views_hll",
"direction" : "descending",
"dimensionOrder" : "alphanumeric"
} ]
},
"dimensions" : []
} |
Thanks for the segment and example query, I can reproduce this issue in the debugger with group by query and can confirm that it doesn't seem to affect timeseries queries, which means it is likely an issue with
This is an unfortunate presentation issue, that's not the actual exception on the historical, but the side effect of the broker parallel merging catching an error on an individual query. I'll see if I can try to improve this in the future to make it less confusing. |
@clintropolis Thanks for confirming. I did find that the I tried to reissue the query this evening, but it's using
|
How was that segment with sketches created? |
@AlexanderSaydakov Using native batch indexer, local Firehose, built the sketch over a string column. Here's the metric spec. {
"type": "HLLSketchBuild",
"name": "unique_views_hll",
"fieldName": "view_id",
"lgK": 14,
"tgtHllType": "HLL_4",
"round": false
}, |
using what version of Druid? |
0.16.0 |
The exception occurs on this line on i think the 7th call to |
We have a fix! @clintropolis @scrawfor @gianm @AlexanderSaydakov With clues from @clintropolis and @scrawfor @AlexanderSaydakov was able to reproduce the bug with his knowledge of how the aggregator works. And from that I was able to locate the bug, which was my fault. I put in a check for a flag where there did not need to be one. So it was actually throwing an unnecessary exception. We will be going over this part of the code carefully, adding unit tests and preparing for a new release. Due to the dual 72 hour release cycles this will take a week or so. Thank you for your patience! Lee. |
Is there a schema file somewhere that describes the layout of the hll_segment.zip file that @scrawfor attached to this issue. I'd like to write a parser so that in the future if we need to pull out just the sketches I can do it more efficiently. I was able to find some of the sketches by hand, but it is a lot of work :) |
@gianm described how to get the raw column out of the segment in this comment by finding the position information in the From the druid package directory:
which will spit out something like this:
|
I ran some tests, and it appears that downgrading |
I don't see why not. I looked at our release notes for 1.2.0, and there is nothing that might affect Druid, just no HLL union speed improvement for now. |
@clintropolis @gianm |
apache/datasketches-java#308 - Looks like this is the PR with the fix. |
Yes. But I would wait for the full release.
See
https://lists.apache.org/thread.html/r6a69c6689ec303fc2df83ea483b87166b0b1d14422a2803b881b87ef%40%3Cdev.datasketches.apache.org%3E
…On Wed, Apr 29, 2020 at 9:15 AM Suneet Saldanha ***@***.***> wrote:
apache/datasketches-java#308
<apache/datasketches-java#308> - Looks
like this is the PR with the fix.
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#9736 (comment)>, or
unsubscribe
<https://github.com/notifications/unsubscribe-auth/ADCXRQS432M7JNHEVF6CJVTRPBHCRANCNFSM4MNQLJ5Q>
.
|
Folks, As a result, I have developed a small tool that takes the output of the dump_segment_tool , and extracts the sketches as binary files. This allows us to easily examine the details of individual sketches with methods already available in the DataSketches library. Hopefully, this will make debugging issues involving sketches in Druid much easier and faster. The question is where should we put this tool so others can use it? Obviously it makes assumptions about Druid's segment structure and Druid's Dump-Segment tool. It doesn't make sense to put it in the DataSketches library as it is specific to Druid. I'd be glad to submit a PR and add it to druid/services/src/main/java/org/apache/druid/cli directory. Or perhaps it should be added to the druid/extensions-core/datasketches/src/main/java/org/apache/druid/query/aggregation/datasketches folder. Please advise. Lee. |
@leerho, would the work you did make sense as an additional option on the DumpSegment tool? If so that seems like the most natural place. |
Could we set up a separate issue for this discussion? It is a bit off topic for this bug. |
We announced a new release on May 7th that fixes this issue on [email protected]. |
Please provide a detailed title (e.g. "Broker crashes when using TopN query with Bound filter" instead of just "Broker crashes").
Affected Version
v0.18 (upgraded from 0.16.0)
Description
The HLLSketchMerge aggregator is failing for some of our metrics after upgrading to druid 0.18.0. Reverting back to 0.16.0 fixes the issue. I have isolated specific segments where the issue occurs, moved those segments back to our 0.16 historical and have been successfully able to query the same metric.
Re-indexing data does not seem to fix the issue.
Error Message.
The text was updated successfully, but these errors were encountered: