Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add AggregateFunctionRetention #2887

Merged
merged 2 commits into from
Aug 20, 2018

Conversation

sundy-li
Copy link
Contributor

I hereby agree to the terms of the CLA available at: https://yandex.ru/legal/cla/?lang=en

This implements an AggregateFunction retention, which could be used for user retention rate analysis.

@sundy-li
Copy link
Contributor Author

sundy-li commented Aug 17, 2018

Currently there are two main sql scripts that can calculate the retention, but they run very slowly.
1.use groupUniqArrayIf

SELECT 
    r1, 
    r2
FROM 
(
    SELECT 
        groupUniqArrayIf(uid, date = '2018-08-13') AS users1, 
        groupUniqArrayIf(uid, date = '2018-08-14') AS users2, 
        length(users1) AS r1, 
        length(arrayIntersect(users1, users2)) AS r2
    FROM events 
 WHERE date in('2018-08-13','2018-08-14')
);

Which costs 70 seconds on my machine to process 3 billion dataset.

2.use arrayJoin

SELECT 
    countIf(1, detal = 0) AS r1, 
    countIf(1, detal = 1) AS r2
FROM 
(
    SELECT 
        uid, 
        toDate('2018-08-13') AS firstday, 
        arraySort(groupUniqArray(date)) AS tdates, 
        arrayFilter(x -> has(tdates, firstday), tdates) AS fdays,
        arrayMap(x -> (x - firstday), fdays) AS days, 
        arrayJoin(days) AS detal
    FROM events 
    WHERE date in('2018-08-13','2018-08-14')
    GROUP BY uid
);

Which costs 30 seconds on my machine to process same dataset.

But we could use retention function to make it much faster.

SELECT 
    sum(x[1]) AS r1, 
    sum(x[2]) AS r2
FROM 
(
    SELECT 
    	retention(date = '2018-08-13', date = '2018-08-14') as x
    FROM events 
    WHERE date in('2018-08-13','2018-08-14')
    GROUP BY uid
);

It only cost 6 seconds to get the result.

Related issues #2120

@alexey-milovidov
Copy link
Member

alexey-milovidov commented Aug 20, 2018

What about the following query?

SELECT 
    uniq(uid) AS r1, 
    uniqIf(uid, date = '2018-08-14') AS r2
FROM events
WHERE date IN ('2018-08-13', '2018-08-14')
AND uid IN (SELECT uid FROM events WHERE date = '2018-08-13')

@sundy-li
Copy link
Contributor Author

sundy-li commented Aug 20, 2018

  1. I got that yandex.Metrica use uniq function to get the unique visitors, but it has some disadvantages.
  2. uniq function returns approximate results, yet retention returns the exactly results.
  3. uniq needs global in to process distribute queries.
  4. Last but not least, retention could be up to 10 times faster than uniq on my new datasets.

Here is some detail test results on my new datasets.

SELECT 
 count(),
    uniq(uid), 
    uniqExact(uid)
FROM events
WHERE partition = '2018-08-13';
┌───count()─┬─uniq(uid)─┬─uniqExact(uid)─┐
│ 246055188 │          11633890 │               11661510 │
└───────────┴───────────────────┴────────────────────────┘

1 rows in set. Elapsed: 2.945 sec. Processed 246.06 million rows, 2.46 GB (83.56 million rows/s., 835.59 MB/s.) 



SELECT 
    uniq(uid) AS r1, 
    uniqIf(uid, partition = '2018-08-14') AS r2
FROM events
WHERE (partition IN ('2018-08-13', '2018-08-14')) AND (uid GLOBAL IN 
(
    SELECT uid
    FROM events
    WHERE partition = '2018-08-13'
));

┌──────r1─┬──────r2─┐
│ 8759444 │ 1760035 │
└─────────┴─────────┘

1 rows in set. Elapsed: 21.723 sec. Processed 9.87 billion rows, 86.33 GB (454.41 million rows/s., 3.97 GB/s.) 




SELECT 
    r1, 
    r2
FROM 
(
    SELECT 
        groupUniqArrayIf(uid, partition = '2018-08-13') AS users1, 
        groupUniqArrayIf(uid, partition = '2018-08-14') AS users2, 
        length(users1) AS r1, 
        length(arrayIntersect(users1, users2)) AS r2
    FROM events
 WHERE (partition IN ('2018-08-13', '2018-08-14'))
);
┌──────r1─┬──────r2─┐
│ 8765610 │ 1755812 │
└─────────┴─────────┘

1 rows in set. Elapsed: 7.499 sec. Processed 2.44 billion rows, 24.40 GB (325.38 million rows/s., 3.25 GB/s.) 





SELECT 
    sum(r[1]) as r1, 
    sum(r[2]) as r2
FROM 
(
    SELECT 
        uid, 
        retention(partition = '2018-08-13', partition = '2018-08-14') AS r
    FROM events
    WHERE (partition IN ('2018-08-13', '2018-08-14'))
    GROUP BY uid
);

┌──────r1─┬──────r2─┐
│ 8765610 │ 1755812 │
└─────────┴─────────┘

1 rows in set. Elapsed: 2.692 sec. Processed 2.44 billion rows, 24.40 GB (906.21 million rows/s., 9.06 GB/s.)

@alexey-milovidov alexey-milovidov merged commit bce94da into ClickHouse:master Aug 20, 2018
auto & offsets_to = static_cast<ColumnArray &>(to).getOffsets();

const bool first_flag = this->data(place).events.test(0);
data_to.insert(first_flag ? Field(static_cast<UInt64>(1)) : Field(static_cast<UInt64>(0)));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can gain more performance if you cast array elements to ColumnUInt8 and access its data directly.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, nice advice.

@amosbird
Copy link
Collaborator

Hmm, isn't this doable by CRDT tricks? a simple test shows that the plain SQL routine outperforms this UDAF.

dell123 :) SELECT count(x), countIf(y, abs(y) < 1 ) FROM (SELECT sum(1) x, sum( (date = '2018-08-06') * 2 - 1 ) / x  y  FROM retention_test WHERE date IN ('2018-08-06', '2018-08-08') GROUP BY uid);

SELECT
    count(x),
    countIf(y, abs(y) < 1)
FROM
(
    SELECT
        sum(1) AS x,
        sum(((date = '2018-08-06') * 2) - 1) / x AS y
    FROM retention_test
    WHERE date IN ('2018-08-06', '2018-08-08')
    GROUP BY uid
)

┌─count(x)─┬─countIf(y, less(abs(y), 1))─┐
│  8000000 │                     6000000 │
└──────────┴─────────────────────────────┘

1 rows in set. Elapsed: 1.178 sec. Processed 19.00 million rows, 114.00 MB (16.13 million rows/s., 96.77 MB/s.)

dell123 :) SELECT sum(r[1]) as r1, sum(r[2]) as r2 FROM (SELECT uid, retention(date = '2018-08-06', date = '2018-08-08') AS r FROM retention_test WHERE date IN ('2018-08-06', '2018-08-08') GROUP BY uid);

SELECT
    sum(r[1]) AS r1,
    sum(r[2]) AS r2
FROM
(
    SELECT
        uid,
        retention(date = '2018-08-06', date = '2018-08-08') AS r
    FROM retention_test
    WHERE date IN ('2018-08-06', '2018-08-08')
    GROUP BY uid
)

┌──────r1─┬──────r2─┐
│ 8000000 │ 6000000 │
└─────────┴─────────┘

1 rows in set. Elapsed: 1.523 sec. Processed 19.00 million rows, 114.00 MB (12.48 million rows/s., 74.87 MB/s.)

@zhang2014
Copy link
Contributor

Hmm, isn't this doable by CRDT tricks? a simple test shows that the plain SQL routine outperforms this UDAF.

👍 Great idea, But it gets more and more complicated as you add states.

@sundy-li
Copy link
Contributor Author

sundy-li commented Aug 20, 2018

@amosbird That is really smart... 👍

But count(x) is wrong answer, it includes the visitors only in 2018-08-08

@amosbird
Copy link
Collaborator

@sundy-li nice catch. the fix is trivial. Just replace count(x) with countIf(y, y > -1)
@zhang2014 #1646 would be one way to alleviate the complication. Or else this umbrella issue #11

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants