Add AggregateFunctionRetention #2887

sundy-li · 2018-08-17T12:31:11Z

I hereby agree to the terms of the CLA available at: https://yandex.ru/legal/cla/?lang=en

This implements the aggregate function retention, which can be used for user retention rate analysis.

sundy-li · 2018-08-17T12:36:01Z

Currently there are two main SQL approaches to calculating retention, but both run very slowly.
1. Use groupUniqArrayIf

SELECT 
    r1, 
    r2
FROM 
(
    SELECT 
        groupUniqArrayIf(uid, date = '2018-08-13') AS users1, 
        groupUniqArrayIf(uid, date = '2018-08-14') AS users2, 
        length(users1) AS r1, 
        length(arrayIntersect(users1, users2)) AS r2
    FROM events 
    WHERE date IN ('2018-08-13', '2018-08-14')
);

This takes 70 seconds on my machine to process a dataset of 3 billion rows.

2. Use arrayJoin

SELECT
    countIf(1, delta = 0) AS r1,
    countIf(1, delta = 1) AS r2
FROM
(
    SELECT
        uid,
        toDate('2018-08-13') AS firstday,
        arraySort(groupUniqArray(date)) AS tdates,
        arrayFilter(x -> has(tdates, firstday), tdates) AS fdays,
        arrayMap(x -> (x - firstday), fdays) AS days,
        arrayJoin(days) AS delta
    FROM events
    WHERE date IN ('2018-08-13', '2018-08-14')
    GROUP BY uid
);

This takes 30 seconds on my machine to process the same dataset.

But we could use the retention function to make it much faster.

SELECT 
    sum(x[1]) AS r1, 
    sum(x[2]) AS r2
FROM 
(
    SELECT 
        retention(date = '2018-08-13', date = '2018-08-14') AS x
    FROM events 
    WHERE date in('2018-08-13','2018-08-14')
    GROUP BY uid
);

It takes only 6 seconds to get the result.
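For readers who want to check the semantics, here is a minimal Python sketch of what the query above computes. It assumes that retention(cond1, cond2, ...) yields, per group, an array whose first element is 1 when cond1 matched any row and whose i-th element is 1 when both cond1 and cond_i matched. The helper names retention and retention_counts are illustrative only and are not part of ClickHouse.

```python
from collections import defaultdict


def retention(rows, conditions):
    """Model of the aggregate for one group (assumed semantics, see lead-in)."""
    # hit[i] is True when condition i matched at least one row of the group.
    hit = [any(cond(row) for row in rows) for cond in conditions]
    # Element 0 reports the first condition; later elements also require it.
    return [int(hit[0])] + [int(hit[0] and h) for h in hit[1:]]


def retention_counts(events, days):
    """Equivalent of GROUP BY uid followed by sum(x[1]), sum(x[2]), ..."""
    by_uid = defaultdict(list)
    for uid, date in events:
        by_uid[uid].append(date)
    conditions = [lambda d, day=day: d == day for day in days]
    totals = [0] * len(days)
    for dates in by_uid.values():
        flags = retention(dates, conditions)
        totals = [t + f for t, f in zip(totals, flags)]
    return totals


# Hypothetical toy data: "a" visits on both days, "b" only on the first,
# "c" only on the second.
events = [
    ("a", "2018-08-13"),
    ("a", "2018-08-14"),
    ("a", "2018-08-13"),
    ("b", "2018-08-13"),
    ("c", "2018-08-14"),
]
print(retention_counts(events, ["2018-08-13", "2018-08-14"]))  # [2, 1]
```

Each uid needs only one flag per condition, which may be part of why the aggregate is cheaper than building arrays of dates or user ids.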

Related issue: #2120

alexey-milovidov · 2018-08-20T04:29:04Z

What about the following query?

SELECT 
    uniq(uid) AS r1, 
    uniqIf(uid, date = '2018-08-14') AS r2
FROM events
WHERE date IN ('2018-08-13', '2018-08-14')
AND uid IN (SELECT uid FROM events WHERE date = '2018-08-13')

sundy-li · 2018-08-20T06:10:57Z

I understand that Yandex.Metrica uses the uniq function to count unique visitors, but it has some disadvantages.
The uniq function returns approximate results, whereas retention returns exact results.
uniq needs GLOBAL IN to process distributed queries.
Last but not least, retention can be up to 10 times faster than uniq on my new datasets.

Here are some detailed test results on my new datasets.

SELECT
    count(),
    uniq(uid),
    uniqExact(uid)
FROM events
WHERE partition = '2018-08-13';
┌───count()─┬─uniq(uid)─┬─uniqExact(uid)─┐
│ 246055188 │  11633890 │       11661510 │
└───────────┴───────────┴────────────────┘

1 rows in set. Elapsed: 2.945 sec. Processed 246.06 million rows, 2.46 GB (83.56 million rows/s., 835.59 MB/s.) 
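To put the approximation in perspective, the relative error of uniq against uniqExact can be worked out from the two counts reported above. This is only arithmetic on the posted numbers; the variable names are illustrative.

```python
# Counts copied from the result table above.
approx = 11_633_890  # uniq(uid)
exact = 11_661_510   # uniqExact(uid)

# Relative error of the approximate count with respect to the exact one.
rel_error = abs(approx - exact) / exact
print(f"absolute difference: {exact - approx}")
print(f"relative error: {rel_error:.4%}")
```

The difference is 27,620 users, roughly 0.24% of the exact count.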



SELECT 
    uniq(uid) AS r1, 
    uniqIf(uid, partition = '2018-08-14') AS r2
FROM events
WHERE (partition IN ('2018-08-13', '2018-08-14')) AND (uid GLOBAL IN 
(
    SELECT uid
    FROM events
    WHERE partition = '2018-08-13'
));

┌──────r1─┬──────r2─┐
│ 8759444 │ 1760035 │
└─────────┴─────────┘

1 rows in set. Elapsed: 21.723 sec. Processed 9.87 billion rows, 86.33 GB (454.41 million rows/s., 3.97 GB/s.) 




SELECT 
    r1, 
    r2
FROM 
(
    SELECT 
        groupUniqArrayIf(uid, partition = '2018-08-13') AS users1, 
        groupUniqArrayIf(uid, partition = '2018-08-14') AS users2, 
        length(users1) AS r1, 
        length(arrayIntersect(users1, users2)) AS r2
    FROM events
    WHERE (partition IN ('2018-08-13', '2018-08-14'))
);
┌──────r1─┬──────r2─┐
│ 8765610 │ 1755812 │
└─────────┴─────────┘

1 rows in set. Elapsed: 7.499 sec. Processed 2.44 billion rows, 24.40 GB (325.38 million rows/s., 3.25 GB/s.) 





SELECT 
    sum(r[1]) as r1, 
    sum(r[2]) as r2
FROM 
(
    SELECT 
        uid, 
        retention(partition = '2018-08-13', partition = '2018-08-14') AS r
    FROM events
    WHERE (partition IN ('2018-08-13', '2018-08-14'))
    GROUP BY uid
);

┌──────r1─┬──────r2─┐
│ 8765610 │ 1755812 │
└─────────┴─────────┘

1 rows in set. Elapsed: 2.692 sec. Processed 2.44 billion rows, 24.40 GB (906.21 million rows/s., 9.06 GB/s.)

alexey-milovidov · 2018-08-20T06:39:38Z

dbms/src/AggregateFunctions/AggregateFunctionRetention.h

+        auto & offsets_to = static_cast<ColumnArray &>(to).getOffsets();
+
+        const bool first_flag = this->data(place).events.test(0);
+        data_to.insert(first_flag ? Field(static_cast<UInt64>(1)) : Field(static_cast<UInt64>(0)));


You can gain more performance if you cast array elements to ColumnUInt8 and access its data directly.

sundy-li · 2018-08-20

OK, nice advice.

amosbird · 2018-08-20T08:49:45Z

Hmm, isn't this doable by CRDT tricks? A simple test shows that the plain SQL routine outperforms this UDAF.

dell123 :) SELECT count(x), countIf(y, abs(y) < 1 ) FROM (SELECT sum(1) x, sum( (date = '2018-08-06') * 2 - 1 ) / x  y  FROM retention_test WHERE date IN ('2018-08-06', '2018-08-08') GROUP BY uid);

SELECT
    count(x),
    countIf(y, abs(y) < 1)
FROM
(
    SELECT
        sum(1) AS x,
        sum(((date = '2018-08-06') * 2) - 1) / x AS y
    FROM retention_test
    WHERE date IN ('2018-08-06', '2018-08-08')
    GROUP BY uid
)

┌─count(x)─┬─countIf(y, less(abs(y), 1))─┐
│  8000000 │                     6000000 │
└──────────┴─────────────────────────────┘

1 rows in set. Elapsed: 1.178 sec. Processed 19.00 million rows, 114.00 MB (16.13 million rows/s., 96.77 MB/s.)

dell123 :) SELECT sum(r[1]) as r1, sum(r[2]) as r2 FROM (SELECT uid, retention(date = '2018-08-06', date = '2018-08-08') AS r FROM retention_test WHERE date IN ('2018-08-06', '2018-08-08') GROUP BY uid);

SELECT
    sum(r[1]) AS r1,
    sum(r[2]) AS r2
FROM
(
    SELECT
        uid,
        retention(date = '2018-08-06', date = '2018-08-08') AS r
    FROM retention_test
    WHERE date IN ('2018-08-06', '2018-08-08')
    GROUP BY uid
)

┌──────r1─┬──────r2─┐
│ 8000000 │ 6000000 │
└─────────┴─────────┘

1 rows in set. Elapsed: 1.523 sec. Processed 19.00 million rows, 114.00 MB (12.48 million rows/s., 74.87 MB/s.)

zhang2014 · 2018-08-20T09:22:09Z

> Hmm, isn't this doable by CRDT tricks? a simple test shows that the plain SQL routine outperforms this UDAF.

👍 Great idea, but it gets more and more complicated as you add states.

sundy-li · 2018-08-20T09:28:25Z

@amosbird That is really smart... 👍

But count(x) gives the wrong answer: it also counts the visitors who appear only on 2018-08-08.

amosbird · 2018-08-20T10:43:05Z

@sundy-li Nice catch. The fix is trivial: just replace count(x) with countIf(y, y > -1).
@zhang2014 #1646 would be one way to alleviate the complication, or else the umbrella issue #11.
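The signed-average trick and its fix can be modelled in a few lines of Python. This is a sketch under the assumption that the input is already restricted to exactly two dates, as the WHERE clause does. Each row contributes +1 on the first day and -1 otherwise, so the per-user average y is 1 for users seen only on the first day, -1 for users seen only on the second day, and strictly between for users seen on both. The function names are hypothetical.

```python
from collections import defaultdict


def signed_averages(events, first_day):
    """Per uid: y = sum(2 * (date == first_day) - 1) / count()."""
    acc = defaultdict(lambda: [0, 0])  # uid -> [row count x, signed sum]
    for uid, date in events:
        state = acc[uid]
        state[0] += 1
        state[1] += 1 if date == first_day else -1
    return {uid: s / x for uid, (x, s) in acc.items()}


def signed_average_counts(events, first_day):
    ys = list(signed_averages(events, first_day).values())
    naive_r1 = len(ys)                     # count(x): every uid in either day
    r1 = sum(1 for y in ys if y > -1)      # countIf(y, y > -1): seen on first day
    r2 = sum(1 for y in ys if abs(y) < 1)  # countIf(y, abs(y) < 1): seen on both
    return naive_r1, r1, r2


# Hypothetical toy data: "a" on both days, "b" first day only, "c" second day only.
events = [
    ("a", "2018-08-06"),
    ("a", "2018-08-08"),
    ("b", "2018-08-06"),
    ("c", "2018-08-08"),
]
print(signed_average_counts(events, "2018-08-06"))  # (3, 2, 1)
```

With this toy data count(x) reports 3 because it includes user "c", while the corrected condition y > -1 gives 2. Extending the encoding to three or more days needs a different scheme, which is the complication raised earlier in the thread.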

add AggregateFunctionRetention

63d7497

sundy-li force-pushed the retention branch from 4b4b079 to 63d7497 Compare August 18, 2018 12:40

Update AggregateFunctionRetention.h

89ee237

alexey-milovidov merged commit bce94da into ClickHouse:master Aug 20, 2018

alexey-milovidov reviewed Aug 20, 2018

alexey-milovidov added a commit that referenced this pull request Aug 23, 2018

Fixed wrong code #2887

a2674d4

alexey-milovidov added a commit that referenced this pull request Aug 23, 2018

Updated test #2887

d95e2be

alexey-milovidov added a commit that referenced this pull request Aug 23, 2018

Improvement #2887

246f194

ClownfishYang mentioned this pull request Jan 6, 2021

Quickly finding the element values in the array range follows some logic and is monotone. #18780

Closed
