Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add aggregate functions IntersectionsMax and IntersectionsMaxPos #2012

Merged
merged 2 commits into from
Mar 14, 2018

Conversation

furmur
Copy link
Contributor

@furmur furmur commented Mar 8, 2018

I hereby agree to the terms of the CLA available at: https://yandex.ru/legal/cla/?lang=en

new aggregate functions:

IntersectionsMax(start_column ,end_column)
IntersectionsMaxPos(start_column ,end_column)

returns maximum count of the intersected intervals defined by start_column and end_column values

if start_column value is NULL or 0 then interval will be skipped
if end_column value is NULL or 0 then interval is considered to have no end (interval (3,0) in example below)

IntersectionsMaxPos in addition returns position where the maximum of intersected intervals found

typical application for this functions (and why they were implemented) is to count maximum active calls for the certain time frame by stored calls detailed records with timestamps for the calls start/end

example:

:) CREATE DATABASE test;
:) CREATE TABLE test(start Integer, end Integer) engine = Memory;
:) INSERT INTO test(start,end) VALUES (1,3),(2,7),(3,0),(4,7),(5,8);

intervals in the table after the insert:

1 2 3 4 5 6 7 8 9
------------------>
1---3
  2---------7
    3-------------
      4-----7
        5-----8
------------------>
1 2 3 3 4 4 4 2 1  //intersections count for each point

functions output:

:) SELECT IntersectionsMax(start,end) FROM test;

SELECT IntersectionsMax(start, end)
FROM test

┌─IntersectionsMax(start, end)─┐
│                            4 │
└──────────────────────────────┘

1 rows in set. Elapsed: 0.003 sec.

:) SELECT IntersectionsMaxPos(start,end) FROM test;

SELECT IntersectionsMaxPos(start, end)
FROM test 

┌─IntersectionsMaxPos(start, end)─┐
│                               4 │
│                               5 │
└─────────────────────────────────┘

2 rows in set. Elapsed: 0.003 sec. 

alexey-milovidov added a commit that referenced this pull request Mar 14, 2018
@alexey-milovidov
Copy link
Member

alexey-milovidov commented Mar 14, 2018

Few points:

IntersectionsMax

Currently we name our functions in camelCase, not CamelCase.
Also I think that maxIntersections sounds better than intersectionsMax.

if start_column value is NULL

NULL values are skipped before being passed to aggregate functions.

if start_column value is NULL or 0 then interval will be skipped
if end_column value is NULL or 0 then interval is considered to have no end (interval (3,0) in example below)

It looks very specific to reserve zero value as something special...
Also I don't like assymetry, but I understand the business logic.
Maybe we can totally remove the support for half-open intervals?

Motivation:

  • it's not possible to have calls/sessions with infinite duration;
  • you can pass some very big value as end of interval instead;

example:
:) CREATE DATABASE test;
:) CREATE TABLE test(start Integer, end Integer) engine = Memory;

BTW, this example is incorrect, because you don't use database test in second query.
CREATE TABLE test.test will be correct.

SELECT IntersectionsMaxPos(start, end)
FROM test
┌─IntersectionsMaxPos(start, end)─┐
│ 4 │
│ 5 │
└─────────────────────────────────┘

Don't understand how the aggregate function can return two rows.

@alexey-milovidov
Copy link
Member

Missing documentation in code (.h file).
Missing test.

@alexey-milovidov
Copy link
Member

alexey-milovidov commented Mar 14, 2018

    String getName() const override
    {
        return "IntersectionsMax";
    }

this is wrong according that you have two different functions.

@alexey-milovidov
Copy link
Member

The function IntersectionsMaxPos should be named maxIntersectionsPosition according the fact that we have position function, not pos.

@alexey-milovidov
Copy link
Member

The user could expect that the function will work for floats and signed numbers.

@alexey-milovidov
Copy link
Member

    void insertResultInto(ConstAggregateDataPtr place, IColumn & to) const override
    {
        auto & ret = static_cast<ColumnUInt64 &>(to).getData();
        ret.push_back(data(place).max());
        if (return_position)
            ret.push_back(data(place).max_pos());
    }

That's absolutely incorrect: you cannot insert two values in a column as a result.

@alexey-milovidov
Copy link
Member

Missing performance test.
Missing algorithm description inside the code.

alexey-milovidov added a commit that referenced this pull request Mar 14, 2018
@alexey-milovidov alexey-milovidov self-assigned this Mar 14, 2018
@alexey-milovidov alexey-milovidov merged commit cbab87f into ClickHouse:master Mar 14, 2018
@alexey-milovidov
Copy link
Member

I've slightly changed the algorithm. Please take a look.

@furmur
Copy link
Contributor Author

furmur commented Mar 15, 2018

thank you for such a lot of fixes.

i see that algorithm was completely changed.
i used rb-tree to keep array sorted and to really aggregate data instead of adding points to the plain array.
your algorithm sorts accumulated array right in the insertResultInto and simply counts points. it's more straightforward but is also correct.

i'm not sure which approach is better, so i would wait till build in the master will be fixed and compare them.
if my approach would show better results then i'll try to explain it for others, otherwise let's leave your implementation.

@alexey-milovidov
Copy link
Member

alexey-milovidov commented Mar 15, 2018

My algorithm will eat more memory if there are many intervals that start/end at the same second.
It's expected, for example, if we accumulating data for one day into single aggregation state, and we have more than about 100 000 intervals per day. That's quite natural.

We can simply improve it later by adding "compress" function that will sort an array and accumulate duplicate values (or do batch insertion into already sorted part of array); and we can call this function when array becomes large (and also before serialization of the state to save network traffic).

Also my algorithm will be worse if you have small amount of duplicated intervals. For example, millions of rows with just a few unique intervals.

Using plain array should be better for the following reasons:

  • if you sort array at once, it is more efficient than using tree (incremental sorting);
  • the memory for std::map is allocated separately (one allocation per node, about 24 bytes) and these allocations are not accounted by MemoryTracker (memory limits for query won't work);
  • BTW, we can switch to radix sort (see RadixSort) for big enough arrays.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants