Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CLICKHOUSE-3547 streaming histogram aggregation #2521

Merged
merged 22 commits into from
Jul 11, 2018

Conversation

ssmike
Copy link
Contributor

@ssmike ssmike commented Jun 16, 2018

I hereby agree to the terms of the CLA available at: https://yandex.ru/legal/cla/?lang=en

@ssmike ssmike force-pushed the CLICKHOUSE-3547 branch 4 times, most recently from 242371c to a9333b3 Compare June 19, 2018 12:53
UInt32 bins_count;

#define READ(VAL, PARAM) \
VAL = applyVisitor(FieldVisitorConvertToNumber<decltype(VAL)>(), PARAM);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Macro is used for single case. Why don't write bins_count = applyVisitor(FieldVisitorConvertToNumber<UInt32>(), params[0]);?

{
if (params.size() != 1)
{
throw Exception("Function " + name + " requires only bins count");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's better to add ErrorCode.
It's not obvious that bins count is a parameter (not an argument), and we must specify it. For example:

:) select histogram(number) from (select * from system.numbers limit 20);

SELECT histogram(number)
FROM 
(
    SELECT *
    FROM system.numbers 
    LIMIT 20
) 

Received exception from server (version 1.1.54386):
Code: 0. DB::Exception: Received from localhost:9004, ::1. DB::Exception: Function histogram requires only bins count.

next[size] = 0;
previous[0] = size;

using QueueItem = std::pair<Mean, int>;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we use int here? max_bins and size are UInt32.

{
if (!points)
init(arena);
points[size++] = {value, weight};
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We will use foreign memory if max_bins == 0.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

max_bins=0 is strange case. I think function should accept only positive bin count.

{
throw Exception("Function " + name + " requires only bins count");
}
assertUnary(name, arguments);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also we need to check that argument is numeric.

@ssmike ssmike force-pushed the CLICKHOUSE-3547 branch from 3b559fc to 0188538 Compare June 22, 2018 18:29
@alexey-milovidov
Copy link
Member

Simple test for performance and precision:

SELECT arrayJoin(histogram(10)(number)) AS res, res.3 / (res.2 - res.1) AS density, bar(density * 1000, 0, 1000, 80) AS bar FROM numbers(1000000000)

it's about 8.5 million rows/sec per single core.
The bottleneck is compress function and, consequently, priority_queue.

@alexey-milovidov
Copy link
Member

SELECT 
    CounterID, 
    count(), 
    histogramIf(10)(SendTiming, SendTiming != -1)
FROM hits_1000m_single 
GROUP BY CounterID
ORDER BY count() DESC
LIMIT 20

Looks like the results are wrong:

232261 │ 12148031 │ [(0,2.2250738585072014e-308,12148031)]

@alexey-milovidov
Copy link
Member

:) SELECT histogram(10)(-2.0)

SELECT histogram(10)(-2.)

┌─histogram(10)(-2.)───────────────┐
│ [(-2,2.2250738585072014e-308,1)] │
└──────────────────────────────────┘

@alexey-milovidov
Copy link
Member

@alexey-milovidov
Copy link
Member

alexey-milovidov commented Jul 4, 2018

The values don't look real on this query:

┌─CounterID─┬──count()─┬─quantilesTimingIf(0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1)(SendTiming, greater(SendTiming, 0))─┬─histogramIf(10)(SendTiming, greater(SendTiming, 0))──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│    114208 │ 56057344 │ [1,78,112,149,193,248,322,433,621,1072,30000]                                                            │ [(1,20215729.61420258,14308657.75),(20215729.61420258,52954898.79166667,2044103.125),(52954898.79166667,82019633.16666667,3.875),(82019633.16666667,111713231.25,1.375),(111713231.25,150951663.75,1.75),(150951663.75,239427425,1.125),(239427425,326503053,1),(326503053,472743336.5,1),(472743336.5,2076058731.5,1),(2076058731.5,2227775,1)]

@alexey-milovidov
Copy link
Member

TODO: allocate all state inplace, get rid of WeightedValue * pointer.
Look for uniqUpTo aggregate function as an example.

@alexey-milovidov
Copy link
Member

The type of parameter is not checked:

SELECT CounterID, count(), histogramIf(-1)(SendTiming, SendTiming > 0) FROM hits_1000m_single GROUP BY CounterID ORDER BY count() DESC LIMIT 20
Memory limit (for query) exceeded: would use 128.02 GiB

@alexey-milovidov
Copy link
Member

As we preallocate memory for maximum size of histogram, better to limit its size by 256 (or maybe 1000).

@alexey-milovidov
Copy link
Member

Incorrect error message:

:) SELECT histogram(0.5)(1)

SELECT histogram(0.5)(1)

Received exception from server (version 1.1.54388):
Code: 36. DB::Exception: Received from localhost:9000, ::1. DB::Exception: Bin count should be positive.

@alexey-milovidov
Copy link
Member

This should throw an exception:
SELECT histogram(1.5)(1)

@ssmike
Copy link
Contributor Author

ssmike commented Jul 5, 2018

SELECT topK(0.2)(number) FROM (SELECT * FROM system.numbers LIMIT 50)

Received exception from server (version 1.1.54387):
Code: 69. DB::Exception: Received from localhost:9000, ::1. DB::Exception: Parameter 0 is illegal for aggregate function topK.

¯\_(ツ)_/¯

@ssmike
Copy link
Contributor Author

ssmike commented Jul 5, 2018

Ok, I've fixed limits. But I don't understand what you expect to see in histogram for one value.

@ssmike ssmike force-pushed the CLICKHOUSE-3547 branch from 9440175 to f598e6d Compare July 5, 2018 09:52
@ssmike ssmike force-pushed the CLICKHOUSE-3547 branch from a9ddff4 to 0418f17 Compare July 5, 2018 13:28
@ssmike ssmike force-pushed the CLICKHOUSE-3547 branch from 21da475 to 0418f17 Compare July 6, 2018 08:17
@ssmike ssmike force-pushed the CLICKHOUSE-3547 branch 2 times, most recently from efbe03f to 877acef Compare July 8, 2018 13:55
@alexey-milovidov alexey-milovidov merged commit 9793ae9 into ClickHouse:master Jul 11, 2018
alexey-milovidov added a commit that referenced this pull request Jul 11, 2018
@fizerkhan
Copy link

fizerkhan commented Jul 26, 2018

I am confused, how to use it for following scenerio

I have table with two columns such as route name and response time as duration. I want to see the histogram of response time with maximum duration of 5 second with 10 buckets.

is this possible?

histogram(5, 10)(duration)

@alexey-milovidov
Copy link
Member

alexey-milovidov commented Jul 26, 2018

histogramIf(10)(duration, duration <= 5)

Here we applied -If combinator.
Read more: https://clickhouse.yandex/docs/en/query_language/agg_functions/combinators/

@fizerkhan
Copy link

fizerkhan commented Jul 27, 2018

@alexey-milovidov Awesome. That makes sense.

Let's say If don't use the -If combinator, then how the behaviour will be

histogram(10)(duration)

Here I can understand that 10 ranges will come, but what is the ceiling limit?

@alexey-milovidov
Copy link
Member

Histogram will span between actual minimum and maximum of data value.

@fizerkhan
Copy link

@alexey-milovidov Great. Thanks

@nanohayder
Copy link

Hi,

is it possible to histogram return fixed size ranges within the number of buckets specified for example 0-5,5-10,10-15,... that would be very helpful

Thanks

@fizerkhan
Copy link

fizerkhan commented Feb 21, 2019

@alexey-milovidov We have the following table

count  UInt64
avg   Float32

We could like something like

select sum(count), histogramIf(10)(avg, avg <= 20) from summingtable

We would like to do get both the sum of count and the histogram together

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants