MultiVolnitsky added with tests and some benchmark, many multiFunctions are added to support multistring search #4053

Merged
merged 9 commits into from
Jan 16, 2019


danlark1
Contributor

MultiVolnitsky added with tests and some benchmark, many multiFunctions are added to support multistring search

I hereby agree to the terms of the CLA available at: https://yandex.ru/legal/cla/?lang=en

For changelog.

Category:

  • New Feature

Short description (up to few sentences):

Added a multi-searcher that looks for multiple constant strings in a large haystack. Added the functions {multiPosition,multiSearch,firstMatch}{'', UTF8, CaseInsensitive, CaseInsensitiveUTF8}.

Detailed description:

Here are the benchmarks. Searching for only a few strings is slightly slower, but the improvement in the other cases is substantial.

{
    "min_time": 0.601000,
    "query": "select count(position(URL, 'yandex')), count(position(URL, 'google')) FROM hits_100m_single"
},
{
    "min_time": 0.664000,
    "query": "select count(multiPosition(URL, ['yandex', 'google'])) FROM hits_100m_single"
},
{
    "min_time": 1.389000,
    "query": "select count(match(URL, 'yandex|google')) FROM hits_100m_single"
},
{
    "min_time": 0.713000,
    "query": "select sum(match(URL, 'yandex')), sum(match(URL, 'google')), sum(match(URL, 'yahoo')), sum(match(URL, 'pikabu')) FROM hits_100m_single"
},
{
    "min_time": 0.580000,
    "query": "select sum(multiSearch(URL, ['yandex', 'google', 'yahoo', 'pikabu'])) from hits_100m_single"
},
{
    "min_time": 1.557000,
    "query": "select sum(match(URL, 'yandex|google|yahoo|pikabu')) FROM hits_100m_single"
},
{
    "min_time": 0.595000,
    "query": "select sum(match(URL, 'yandex')), sum(match(URL, 'google')), sum(match(URL, 'http')) FROM hits_100m_single"
},
{
    "min_time": 0.508000,
    "query": "select sum(multiSearch(URL, ['yandex', 'google', 'http'])) from hits_100m_single"
},
{
    "min_time": 0.592000,
    "query": "select sum(match(URL, 'yandex|google|http')) FROM hits_100m_single"
},
{
    "min_time": 0.765000,
    "query": "select sum(match(URL, 'yandex')), sum(match(URL, 'google')), sum(match(URL, 'facebook')), sum(match(URL, 'wikipedia')), sum(match(URL, 'reddit')) FROM hits_100m_single"
},
{
    "min_time": 0.567000,
    "query": "select sum(multiSearch(URL, ['yandex', 'google', 'facebook', 'wikipedia', 'reddit'])) from hits_100m_single"
},
{
    "min_time": 1.634000,
    "query": "select sum(match(URL, 'yandex|google|facebook|wikipedia|reddit')) FROM hits_100m_single"
},
{
    "min_time": 0.627000,
    "query": "select sum(firstMatch(URL, ['yandex', 'google', 'http', 'facebook', 'google'])) from hits_100m_single"
}

This was done as the first part of my diploma at the Faculty of Computer Science.

@danlark1
Contributor Author

danlark1 commented Jan 15, 2019

Some details I should mention.

MultiVolnitsky stores in the hash table the index of a needle together with a position inside that needle. After that, the simple version of the algorithm is used.

Three functions were added:
multiPosition(haystack, [needle_1, ..., needle_n]) -- finds the positions of all needles and returns them as an array; 0 is returned for a needle that was not found.
multiSearch(haystack, [needle_1, ..., needle_n]) -- returns 1 if some needle matches the haystack, 0 otherwise.
firstMatch(haystack, [needle_1, ..., needle_n]) -- returns the first index (starting from 1) of a needle that matches the haystack.
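To make the intended results concrete, here is a naive reference model of the three functions in Python. It is only a sketch of the semantics described above, not the code from this PR: the `*_ref` names are hypothetical, positions are assumed to be 1-based byte positions as in position(), firstMatch is read as "smallest index of a needle that occurs", and returning 0 when nothing matches is an assumption.

```python
def multi_position_ref(haystack: bytes, needles: list[bytes]) -> list[int]:
    """1-based position of the first occurrence of each needle, 0 if absent."""
    # bytes.find returns -1 when the needle is missing, so +1 maps that to 0.
    return [haystack.find(needle) + 1 for needle in needles]


def multi_search_ref(haystack: bytes, needles: list[bytes]) -> int:
    """1 if at least one needle occurs in the haystack, 0 otherwise."""
    return int(any(needle in haystack for needle in needles))


def first_match_ref(haystack: bytes, needles: list[bytes]) -> int:
    """Smallest 1-based needle index that occurs in the haystack.

    Returning 0 when no needle occurs is an assumption of this sketch.
    """
    for index, needle in enumerate(needles, start=1):
        if needle in haystack:
            return index
    return 0
```

A model like this can serve as an oracle when testing a faster implementation against random haystacks.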

multiPosition() becomes faster than several separate position() calls once there are at least 3-4 strings. The multi version needs some additional checks that cannot be dropped (whether there are enough bytes to compare, and whether we stay in range when subtracting the position taken from the hash table), and it needs two additional variables to store the answer.

multiSearch is much faster even with only 1-2 strings to search for -- there are fewer variables and the code is faster.

firstMatch is a bit slower than multiSearch but faster than multiPosition, since all we need is to modify the multiSearch algorithm to take the minimum index. cmovle instructions are pretty cool and minimize branching.

All the functions become extremely fast when the minimum size of all the strings is large (4 is the smallest string size for which MultiVolnitsky is used; if the minimum size is, say, 10, the search is much faster). This is because we step through the haystack by the minimum size to find all the needles and, as a consequence, use less CPU.
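The ideas above (a hash table holding needle index and offset, the two range checks, and a step derived from the shortest needle) can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the implementation from this PR: it uses a dict of 2-byte n-grams instead of a fixed-size hash table, steps by the minimum length minus one so that every occurrence covers a probed bigram, and falls back to a plain search for needles shorter than 4 bytes. The names `multi_position` and `MIN_NEEDLE_SIZE` are hypothetical.

```python
from collections import defaultdict

MIN_NEEDLE_SIZE = 4  # threshold quoted in the comment above


def multi_position(haystack: bytes, needles: list[bytes]) -> list[int]:
    """1-based position of the first occurrence of each needle, 0 if absent."""
    if not needles:
        return []
    min_len = min(len(needle) for needle in needles)
    if min_len < MIN_NEEDLE_SIZE:
        # Fallback for short needles (an assumption of this sketch).
        return [haystack.find(needle) + 1 for needle in needles]

    # bigram -> [(needle index, offset of that bigram inside the needle)]
    table = defaultdict(list)
    for index, needle in enumerate(needles):
        for offset in range(len(needle) - 1):
            table[needle[offset:offset + 2]].append((index, offset))

    result = [0] * len(needles)
    # An occurrence of any needle contains at least min_len - 1 consecutive
    # bigram positions, so probing every `step` bytes cannot skip over it.
    step = min_len - 1
    for pos in range(step - 1, len(haystack) - 1, step):
        for index, offset in table.get(haystack[pos:pos + 2], ()):
            needle = needles[index]
            start = pos - offset
            # The two checks that the multi version cannot drop:
            # do not go before the haystack start or past its end.
            if start < 0 or start + len(needle) > len(haystack):
                continue
            if haystack[start:start + len(needle)] == needle:
                # Keep the leftmost occurrence seen for this needle.
                if result[index] == 0 or start + 1 < result[index]:
                    result[index] = start + 1
    return result
```

With a larger minimum needle length the loop probes fewer haystack positions, which is the effect described above. A multiSearch-style variant could return as soon as any candidate verifies instead of filling the whole result array.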

@danlark1 danlark1 changed the title MultiVolnitsky added with tests and some benchmark, many multiFunctio… MultiVolnitsky added with tests and some benchmark, many multiFunctions are added to support multistring search Jan 15, 2019
@danlark1
Contributor Author

After some optimizations, benchmarks are like this:

{
    "min_time": 0.589000,
    "query": "select count(position(URL, 'yandex')), count(position(URL, 'google')) FROM hits_100m_single"
},
{
    "min_time": 0.599000,
    "query": "select count(multiPosition(URL, ['yandex', 'google'])) FROM hits_100m_single"
},
{
    "min_time": 1.370000,
    "query": "select count(match(URL, 'yandex|google')) FROM hits_100m_single"
},
{
    "min_time": 0.669000,
    "query": "select sum(match(URL, 'yandex')), sum(match(URL, 'google')), sum(match(URL, 'yahoo')), sum(match(URL, 'pikabu')) FROM hits_100m_single"
},
{
    "min_time": 0.550000,
    "query": "select sum(multiSearch(URL, ['yandex', 'google', 'yahoo', 'pikabu'])) from hits_100m_single"
},
{
    "min_time": 1.544000,
    "query": "select sum(match(URL, 'yandex|google|yahoo|pikabu')) FROM hits_100m_single"
},
{
    "min_time": 0.579000,
    "query": "select sum(match(URL, 'yandex')), sum(match(URL, 'google')), sum(match(URL, 'http')) FROM hits_100m_single"
},
{
    "min_time": 0.495000,
    "query": "select sum(multiSearch(URL, ['yandex', 'google', 'http'])) from hits_100m_single"
},
{
    "min_time": 0.573000,
    "query": "select sum(match(URL, 'yandex|google|http')) FROM hits_100m_single"
},
{
    "min_time": 0.721000,
    "query": "select sum(match(URL, 'yandex')), sum(match(URL, 'google')), sum(match(URL, 'facebook')), sum(match(URL, 'wikipedia')), sum(match(URL, 'reddit')) FROM hits_100m_single"
},
{
    "min_time": 0.550000,
    "query": "select sum(multiSearch(URL, ['yandex', 'google', 'facebook', 'wikipedia', 'reddit'])) from hits_100m_single"
},
{
    "min_time": 1.672000,
    "query": "select sum(match(URL, 'yandex|google|facebook|wikipedia|reddit')) FROM hits_100m_single"
},
{
    "min_time": 0.588000,
    "query": "select sum(firstMatch(URL, ['yandex', 'google', 'http', 'facebook', 'google'])) from hits_100m_single"
}
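For a quick read of these numbers, the ratios against multiSearch can be computed directly. The sketch below only copies min_time values from the list above; the group labels and the `speedups` helper are hypothetical names introduced for illustration.

```python
# min_time values in seconds, copied from the post-optimization results above.
runs = {
    "yandex, google, yahoo, pikabu": {
        "separate match calls": 0.669,
        "multiSearch": 0.550,
        "match with alternation": 1.544,
    },
    "yandex, google, http": {
        "separate match calls": 0.579,
        "multiSearch": 0.495,
        "match with alternation": 0.573,
    },
    "yandex, google, facebook, wikipedia, reddit": {
        "separate match calls": 0.721,
        "multiSearch": 0.550,
        "match with alternation": 1.672,
    },
}


def speedups(times: dict[str, float], baseline: str = "multiSearch") -> dict[str, float]:
    """How many times slower each alternative is than the baseline."""
    return {
        name: round(seconds / times[baseline], 2)
        for name, seconds in times.items()
        if name != baseline
    }


for needle_set, times in runs.items():
    print(needle_set, speedups(times))
```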

@alexey-milovidov alexey-milovidov merged commit 2cbd126 into ClickHouse:master Jan 16, 2019