Performance tests for hash functions #3918

Merged Dec 25, 2018 (2 commits)

filimonov (Contributor) commented Dec 24, 2018

I hereby agree to the terms of the CLA available at: https://yandex.ru/legal/cla/?lang=en

For changelog. Remove if this is non-significant change.

Category (leave one):

  • Build/Testing/Packaging Improvement

Short description (up to few sentences):

Performance tests for hash functions (#3905).

Detailed description (optional):

Summary in table form (tested on a workstation with an i5-2400 CPU @ 3.10GHz and 8 GB of RAM).

javaHash and hiveHash have very poor quality (a lot of collisions even on a small dataset).

| hash | year published | empty1 | empty4 | 10char1 | 10char4 | 1024char1 | 1024char4 |
|---|---|---|---|---|---|---|---|
| cityHash64 | 2011 | 801.9 | 2,219.4 | 328.7 | 968.8 | 16.1 | 37.1 |
| farmHash64 | 2014 | 814.3 | 1,928.3 | 318.1 | 923.1 | 19.1 | 39.0 |
| metroHash64 | 2015 | 658.6 | 1,883.4 | 293.7 | 939.0 | 18.3 | 38.9 |
| murmurHash2_32 | 2008 | 920.2 | 2,438.1 | 297.5 | 877.4 | 15.5 | 33.9 |
| murmurHash2_64 | 2008 | 798.5 | 1,707.0 | 293.9 | 955.5 | 15.9 | 38.2 |
| murmurHash3_32 | 2012 | 739.0 | 2,033.2 | 252.9 | 784.6 | 13.7 | 32.8 |
| murmurHash3_64 | 2012 | 448.2 | 1,612.5 | 230.6 | 716.5 | 16.3 | 37.2 |
| murmurHash3_128 | 2012 | 537.3 | 1,316.9 | 237.0 | 747.5 | 16.1 | 37.4 |
| javaHash | ? | 1,326.7 | 2,174.9 | 274.2 | 844.9 | 5.2 | 20.6 |
| hiveHash | ? | 1,305.0 | 2,790.2 | 276.2 | 860.1 | 5.2 | 20.8 |
| xxHash32 | 2012 | 807.2 | 1,800.8 | 273.9 | 865.5 | 16.2 | 33.8 |
| xxHash64 | 2012 | 711.5 | 1,998.0 | 263.8 | 859.8 | 18.7 | 39.9 |

Legend:

Numbers are MB per second (higher is better)
empty1 - empty string in one thread (hash init / finalize costs)
empty4 - empty string in 4 threads (hash init / finalize costs)
10char1 - 10-char string (numeric) in one thread
10char4 - 10-char string (numeric) in 4 threads
1024char1 - 1024-char string ("Lorem ipsum...") in one thread
1024char4 - 1024-char string ("Lorem ipsum...") in 4 threads
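
The remark about javaHash collisions can be illustrated with a small sketch. It assumes that javaHash follows the well-known Java `String.hashCode` recurrence (h = 31*h + c over 32-bit integers) and only considers ASCII input; `java_hash` and `count_collisions` are hypothetical helper names used for illustration, not ClickHouse functions.

```python
from itertools import product
from string import ascii_letters


def java_hash(s: str) -> int:
    """Java String.hashCode recurrence: h = 31*h + c, wrapped to signed 32 bits.

    Assumption: input is ASCII, so each character is one code unit.
    """
    h = 0
    for ch in s:
        h = (31 * h + ord(ch)) & 0xFFFFFFFF
    # Reinterpret the unsigned 32-bit value as a signed Java int.
    return h - 0x100000000 if h >= 0x80000000 else h


def count_collisions(strings) -> int:
    """Number of inputs whose hash value was already produced by another input."""
    strings = list(strings)
    return len(strings) - len({java_hash(s) for s in strings})


# All two-letter strings over A-Z and a-z: a very small dataset.
two_letter = ["".join(p) for p in product(ascii_letters, repeat=2)]

print(java_hash("Aa"), java_hash("BB"))  # a classic colliding pair
print(count_collisions(two_letter), "collisions among", len(two_letter), "strings")
```

Because the multiplier is 31, any pair of adjacent characters (a, b) collides with (a + 1, b - 31), which is why collisions show up even in tiny datasets of short strings.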

blinkov (Contributor) commented Dec 24, 2018

This table could be published in the docs, but at the moment it lacks a description of, at least, the hardware used.

@filimonov filimonov closed this Dec 24, 2018
@filimonov filimonov reopened this Dec 24, 2018
filimonov (Contributor, Author) commented Dec 24, 2018

> This table could be published in the docs, but at the moment it lacks a description of, at least, the hardware used.

I was checking it on a quite old workstation. Maybe it would be better to check it on a better CPU, and also with some extra metrics.

The results are also quite confusing, because I was expecting to get GB/sec even for long strings. See:
https://aras-p.info/blog/2016/08/09/More-Hash-Function-Tests/
https://github.com/Cyan4973/xxHash#benchmarks

Maybe I have some issue with compilation flags (I was experimenting a bit).

alexey-milovidov (Member) commented Dec 24, 2018

Testing with a set of constant-length strings is not representative, because it makes the branch predictor too happy. It is better to test on a real dataset with a varied distribution of string lengths.

The numbers in MB/sec look like MB of source data per second, where the source data is something like the system.numbers table. This means that they are not really MB/sec of hashed data: divided by 8, they give MHashes/sec.
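
The conversion described above can be sketched as follows. This is an interpretation, not something confirmed by the test harness: it assumes each source row is an 8-byte UInt64 (as in system.numbers) and that exactly one string of the stated length is hashed per row. The function names are hypothetical.

```python
def mb_per_sec_to_mhashes(reported_mb_per_sec: float, bytes_per_row: int = 8) -> float:
    """Convert reported MB/s of source data into millions of hashes per second.

    Assumption: every source row occupies `bytes_per_row` bytes and
    produces exactly one hash call.
    """
    return reported_mb_per_sec / bytes_per_row


def string_mb_per_sec(mhashes_per_sec: float, string_len: int) -> float:
    """Estimate MB/s of actual string data hashed, given the string length."""
    return mhashes_per_sec * string_len


# cityHash64, single thread, figures taken from the table in this thread.
empty1 = mb_per_sec_to_mhashes(801.9)    # empty strings
long1 = mb_per_sec_to_mhashes(16.1)      # 1024-char strings

print(f"empty1:    {empty1:.1f} MHashes/s")
print(f"1024char1: {long1:.2f} MHashes/s, "
      f"about {string_mb_per_sec(long1, 1024):.0f} MB/s of string data")
```

Under these assumptions the 1024-char case corresponds to roughly 2 GB/s of string data, which is closer to the throughput expected for long strings.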

> expecting to get GB/sec even for long strings

For long (kilobyte) strings, you can easily get tens of gigabytes per second on a single CPU core.
Example: https://github.com/Bulat-Ziganshin/FARSH
But testing on long strings is totally irrelevant for ClickHouse.

You can find a test of hash functions for hash tables with string keys inside our repository.
(It was evaluated on multiple representative datasets, and the winner is ClickHouse's own hash function, which is intentionally not exported as an SQL function to avoid the possibility of hash interference.)

@alexey-milovidov alexey-milovidov merged commit 3687980 into ClickHouse:master Dec 25, 2018
filimonov (Contributor, Author) commented Dec 31, 2018

> The numbers in MB/sec look like MB of source data per second, where the source data is something like the system.numbers table. This means that they are not really MB/sec of hashed data: divided by 8, they give MHashes/sec.

Yep. It looks like that is the reason for those weird results.

> Testing with a set of constant-length strings is not representative, because it makes the branch predictor too happy. It is better to test on a real dataset with a varied distribution of string lengths.

Yes, it gives only a rough estimate of the speed. But to make reproducible performance tests possible, some 'predefined' dataset should be published.
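
A minimal sketch of what a reproducible test on mixed-length strings could look like: the dataset is generated from a fixed seed, and throughput is computed from the bytes actually hashed rather than from the source row size. `zlib.crc32` is only a stand-in for the hash function under test, and the helper names and length bounds are assumptions made for illustration.

```python
import random
import time
import zlib


def make_dataset(n: int, min_len: int = 1, max_len: int = 64, seed: int = 42) -> list:
    """Generate n printable-ASCII byte strings with lengths in [min_len, max_len].

    A fixed seed makes the dataset identical on every run.
    """
    rng = random.Random(seed)
    alphabet = [chr(c) for c in range(0x20, 0x7F)]
    return [
        "".join(rng.choices(alphabet, k=rng.randint(min_len, max_len))).encode("ascii")
        for _ in range(n)
    ]


def throughput(hash_fn, data) -> tuple:
    """Return (MB/s of string data, millions of hashes per second)."""
    total_bytes = sum(len(s) for s in data)
    start = time.perf_counter()
    for s in data:
        hash_fn(s)
    elapsed = max(time.perf_counter() - start, 1e-9)  # guard against a zero interval
    return total_bytes / elapsed / 1e6, len(data) / elapsed / 1e6


data = make_dataset(100_000)
mb_s, mhashes_s = throughput(zlib.crc32, data)  # crc32 is a placeholder hash
print(f"{mb_s:.1f} MB/s, {mhashes_s:.2f} MHashes/s")
```

Reporting both figures avoids the ambiguity discussed above, since MB/s and hashes per second diverge as soon as string lengths vary.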

filimonov (Contributor, Author) commented Jan 2, 2019

https://gist.githubusercontent.com/alexey-milovidov/811ce0a62cc142227e4910e525c06116/raw/dfa39ccdf9f475ccc8abd6395de959f978cbb7b2/datasets_example.txt

I can run benchmarks, or I can rewrite the tests to use those tables. Should I?

BTW: some synthetic data can also be used, which is useful for creating data with the needed properties.
I was playing with something like this:

CREATE TABLE ascii_random_data1
ENGINE = MergeTree
PARTITION BY tuple()
ORDER BY tuple() AS
WITH
    -- 1024 random bytes with values in the range 0x20..0x7F
    arrayStringConcat(
        arrayMap(x -> reinterpretAsString(toUInt8(rand(x) % 96 + 0x20)), range(1024))
    ) AS str1024,
    -- keep a prefix of random length, 512 to 1023 characters
    substring(str1024, 1, 512 + rand() % 512) AS str
SELECT str FROM numbers(5000000);

It looks like farmHash64 gives the best (or almost the best) performance in most of the checked cases.
In single-threaded mode xxHash64 is as good as farmHash64 for long strings (or a bit better), but about 40% slower for short strings (<8 chars).
In multi-threaded mode xxHash64 is 5-20% behind the winner (which is usually also farmHash64).

P.S. There is also https://github.com/rurban/smhasher

alexey-milovidov (Member) commented Jan 2, 2019

> I can run benchmarks, or I can rewrite the tests to use those tables. Should I?

This is worth doing. Tests with the
MobilePhoneModel, PageCharset, Params, URLDomain, UTMSource, Referer, URL, Title
columns should be representative.

This dataset will soon become our official dataset for benchmarks.

BTW, we also have an isolated test for hash function performance in hash tables: https://github.com/yandex/ClickHouse/blob/master/dbms/src/Interpreters/tests/hash_map_string_3.cpp
(This is not the same as raw performance, because it depends on a combination of performance and quality.)

We also have a presentation about hash functions in ClickHouse:
https://www.youtube.com/watch?v=EoX82TEz2sQ
(it also covers quality evaluation)

alexey-milovidov (Member) commented Jan 2, 2019

> It looks like farmHash64 gives the best (or almost the best) performance in most of the checked cases.

This is quite interesting, because I was not sure that we had enabled CPU dispatching in FarmHash in our build. BTW, FarmHash is intentionally not stable across different CPUs (the user should not save the result anywhere or use the hash as a key; that's why we don't document it, to avoid misuse). There is a variant of FarmHash named "FingerprintHash" that is portable.

PS. If you want to go further, you can also consider adding HighwayHash https://github.com/google/highwayhash as an alternative to SipHash.

halayli commented Feb 8, 2019

I would also consider the t1ha hash family. t1ha_aes performs particularly well on CPUs with AES support.

https://github.com/leo-yuriev/t1ha


4 participants