Performance tests for hash functions #3918
This table could be published in the docs, but at the moment it lacks a description of at least the hardware used.
I checked it on quite an old workstation. Maybe it would be better to check it on a better CPU, and with some extra metrics. The results are also quite confusing, because I was expecting to get GB/sec even for long strings. Maybe I have some issue with compilation flags (I was experimenting a bit).
Testing with a set of constant-length strings is not representative, because it makes the branch predictor too happy. It is better to test on a real dataset with a varied distribution of string lengths. The numbers in MB/sec look like MB of source data per second, where the source data is something like the system.numbers table. That means this is not MB/sec of hashed data; it is a hash rate, and dividing the number by 8 gives MHashes/sec.
For long (kilobyte) strings, you can easily get tens of gigabytes per second on a single CPU core. You can find a test of hash functions for hash tables with string keys inside our repository.
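The unit mix-up described above can be sketched with a little arithmetic. This is only an illustration, assuming each source row is an 8-byte UInt64 (as in system.numbers) and that MB means 10^6 bytes; the function names and the sample figure are hypothetical and not taken from the results table.

```python
# Assumption: each source row is an 8-byte UInt64, as in system.numbers.
ROW_BYTES = 8


def mhashes_per_sec(reported_mb_per_sec: float, row_bytes: int = ROW_BYTES) -> float:
    """Convert a 'MB of source data per second' figure into millions of hashes per second."""
    return reported_mb_per_sec / row_bytes


def string_mb_per_sec(mhashes: float, string_len: int) -> float:
    """MB/sec of actual string bytes hashed, given a hash rate and a fixed string length."""
    return mhashes * string_len


if __name__ == "__main__":
    reported = 400.0  # hypothetical figure, not taken from the table
    rate = mhashes_per_sec(reported)
    print(rate, string_mb_per_sec(rate, 1024))
```

Under these assumptions, a modest-looking figure for 1024-character strings corresponds to a much larger byte throughput, which is consistent with the expectation of GB/sec for long strings.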
Yep. Looks like that is the reason for those weird results.
Yes, it gives quite a rough estimate of the speed. But to make reproducible performance tests possible, some 'predefined' dataset should be published.
I can make benchmarks or rewrite those tests to use those tables. Should I? BTW, synthetic data can also be used; it is useful for creating data with the required properties.

CREATE TABLE ascii_random_data1
ENGINE=MergeTree
PARTITION BY tuple()
ORDER BY tuple() AS
WITH
    arrayStringConcat(
        arrayMap( x -> reinterpretAsString( toUInt8( rand(x) % 96 + 0x20 ) ), range( 1024 ) )
    ) AS str1024,
    substring(str1024, 1, 512 + rand() % 512 ) AS str
SELECT str FROM numbers(5000000);

It looks like

P.S. There is also https://github.com/rurban/smhasher
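For readers who want to try the same idea outside the database, here is a rough sketch of what the CREATE TABLE statement above generates: strings of 512 to 1023 characters with codes in the range 0x20..0x7F. The function name and the seeding are assumptions made for this illustration, and the random sequence will not match what ClickHouse's rand() produces.

```python
import random


def random_ascii_strings(count: int, seed: int = 0) -> list:
    """Generate strings similar to the rows of ascii_random_data1 (illustrative only)."""
    rng = random.Random(seed)  # seeded so the dataset is reproducible
    rows = []
    for _ in range(count):
        # 1024 characters with codes 0x20..0x7F, mirroring rand(x) % 96 + 0x20
        # (the range includes 0x7F, DEL, exactly as the SQL expression does)
        base = "".join(chr(rng.randrange(96) + 0x20) for _ in range(1024))
        # keep a prefix of 512..1023 characters, mirroring 512 + rand() % 512
        length = 512 + rng.randrange(512)
        rows.append(base[:length])
    return rows


if __name__ == "__main__":
    sample = random_ascii_strings(3, seed=42)
    print([len(s) for s in sample])
```

A fixed seed is used so that two runs produce the same strings, which matters if the data is meant to serve as a repeatable benchmark input.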
This is worth doing. Tests with This dataset will soon become our official dataset for benchmarks. BTW, we also have an isolated test for hash function performance in hash tables: https://github.com/yandex/ClickHouse/blob/master/dbms/src/Interpreters/tests/hash_map_string_3.cpp We also have a presentation about hash functions in ClickHouse:
This is quite interesting, because I was not sure that we had enabled CPU dispatching for FarmHash in our build. BTW, FarmHash is intentionally not stable across different CPUs: the user should not save the result anywhere or use the hash as a key. That's why we don't document it, to avoid misuse. There is a variant of FarmHash named "FingerprintHash" that is portable. PS. If you want to go further, you can also consider adding HighwayHash https://github.com/google/highwayhash as an alternative to SipHash.
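The distinction between a hash that may vary between machines and a portable fingerprint can be illustrated with a small sketch. This is an analogy only: it does not use FarmHash or its fingerprint variant, but a truncated SHA-256 from the Python standard library, chosen here because its output is fixed by its specification. The function name is hypothetical.

```python
import hashlib


def portable_fingerprint64(data: bytes) -> int:
    """Return a 64-bit value that is the same on every machine and in every run.

    Stand-in for a portable fingerprint; NOT FarmHash. A value like this is
    safe to store or to use as a key, unlike a hash whose result may depend
    on the CPU it was computed on.
    """
    digest = hashlib.sha256(data).digest()
    # Take the first 8 bytes with an explicit byte order so the integer
    # does not depend on the platform's endianness.
    return int.from_bytes(digest[:8], "big")


if __name__ == "__main__":
    print(hex(portable_fingerprint64(b"abc")))
```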
I would also consider
I hereby agree to the terms of the CLA available at: https://yandex.ru/legal/cla/?lang=en
For changelog. Remove if this is non-significant change.
Category (leave one):
Short description (up to few sentences):
Performance tests for hash functions ( #3905 ).
Detailed description (optional):
Summary in table form (tested on an i5-2400 CPU @ 3.10GHz workstation with 8 GB RAM)
javaHash and hiveHash have very poor quality (a lot of collisions even on a small dataset).
Legend:
number is MB per second (higher is better)
empty1 - empty string in one thread (hash init / finalize costs)
empty4 - empty string in 4 threads (hash init / finalize costs)
10char1 - 10 char string (numeric) in one thread
10empty4 - 10 char string (numeric) in 4 threads
1024char1 - 1024 char string ("Lorem ipsum...") in one thread
1024char4 - 1024 char string ("Lorem ipsum...") in 4 threads
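The three payload shapes in the legend can be reproduced in a minimal single-threaded timing harness. This is a sketch under stated assumptions: zlib.crc32 is only a stand-in for a hash function and is not one of the functions measured in this pull request, the bench helper and the CASES names are hypothetical, and the 4-thread variants are not covered.

```python
import time
import zlib


def bench(hash_fn, payload: bytes, iterations: int = 100_000):
    """Return (MHashes/sec, MB/sec of payload bytes) for repeated hashing of one payload."""
    start = time.perf_counter()
    for _ in range(iterations):
        hash_fn(payload)
    # Guard against a zero interval on very fast runs.
    elapsed = max(time.perf_counter() - start, 1e-9)
    mhashes = iterations / elapsed / 1e6
    mb = iterations * len(payload) / elapsed / 1e6
    return mhashes, mb


# Payloads modelled on the legend: empty, 10 numeric chars, 1024 chars of text.
CASES = {
    "empty": b"",
    "10char": b"0123456789",
    "1024char": (b"Lorem ipsum " * 86)[:1024],
}

if __name__ == "__main__":
    for name, payload in CASES.items():
        mhashes, mb = bench(zlib.crc32, payload)
        print(f"{name}: {mhashes:.2f} MHashes/sec, {mb:.2f} MB/sec")
```

Reporting both rates side by side makes the empty-string case readable: its byte throughput is zero by construction, so only the hash rate reflects the init and finalize cost.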