find optimum hash data structure used for kvs object cache #474
Comments
Just updated the results above for zhashx, as I had not configured it to use the 20 byte raw SHA1 as stated.
This is really cool! Do we know what we'd like to optimize, though? To my eyes it seems like our good ol' lsd hash is a good balance of speed and small size, but is a fixed-size hash right out? It would be interesting to also compare to the various implementations in Google's sparsehash, but those are C++, if you could stomach it. Also, a quick search turned up this post with some other benchmarks, and a comment there mentions concurrencykit, which has another hash table implementation. Apologies if you already looked at these options.
Cool! It seems a no-brainer to go from zhash to zhashx, as it reduces memory overhead by 2x without a performance impact. The lsd hash gives another 2x memory reduction, but it comes at the cost of roughly 2x longer lookups... @grondo makes good points... I am curious what typical KVS workloads are, and would be. At first glance it seems we are much more insert-performance bound (and lsd hashing seems pretty good for that reason?), but it would be good to see some stats on it.
Not sure it is relevant, but as an alternative to the Judy radix tree I saw some mention of HAT-trie as a space- and cache-efficient alternative, and it was interesting enough to mention here. I don't actually know anything about this data structure, so I'm not even sure it could be used as a drop-in replacement, or if there are C implementations out there, but it is benchmarked along with other associative array implementations here:
Thanks for the references! With the backing store from #471, the KVS will only use the hash as a cache of recently used objects, so a fixed-size hash is feasible if we actively manage the number of elements.
Just updated the results above with a hat-trie test. Maybe not all that compelling, though it has good space efficiency.
Nice work @garlick! The google {sparse,dense}hash options would be nice to look at if not too complicated, but the other thing that comes to mind is either using JudyS or a nest of 3 JudyLs. I would expect the JudyL nest to come out as the densest, except perhaps for google sparsehash or maybe the hat-trie, but I'm not sure how the performance would change. If only JudyL had a settable fixed-depth variant it would be easier, and probably faster, but there is no such beast, unfortunately. One other one that might be worth a glance is GCC's core hash management stuff from libiberty. It's old, it's low level, and very, very old-style C, but it's fast compared to most I've seen written in C.
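Purely to illustrate the "nest of 3 JudyLs" idea, here is a minimal sketch assuming a 64-bit Word_t and an 8+8+4 byte split of the digest; the function names and the split are my own assumptions, not an existing API:

    #include <Judy.h>
    #include <stdint.h>
    #include <string.h>

    /* Sketch only: split a 20-byte SHA1 digest into 8 + 8 + 4 byte words and
     * use each word as the index into one level of nested JudyL arrays.
     * The value slot of each outer level holds the inner JudyL array pointer. */
    static void nest3_insert (Pvoid_t *root, const uint8_t digest[20], void *obj)
    {
        Word_t k1 = 0, k2 = 0, k3 = 0;
        PWord_t pv;

        memcpy (&k1, digest, 8);
        memcpy (&k2, digest + 8, 8);
        memcpy (&k3, digest + 16, 4);

        JLI (pv, *root, k1);               /* level 1: slot holds level-2 array */
        JLI (pv, *(Pvoid_t *)pv, k2);      /* level 2: slot holds level-3 array */
        JLI (pv, *(Pvoid_t *)pv, k3);      /* level 3: slot holds the object    */
        *pv = (Word_t)obj;
    }

    static void *nest3_lookup (Pvoid_t root, const uint8_t digest[20])
    {
        Word_t k1 = 0, k2 = 0, k3 = 0;
        PWord_t pv;

        memcpy (&k1, digest, 8);
        memcpy (&k2, digest + 8, 8);
        memcpy (&k3, digest + 16, 4);

        JLG (pv, root, k1);
        if (pv)
            JLG (pv, *(Pvoid_t *)pv, k2);
        if (pv)
            JLG (pv, *(Pvoid_t *)pv, k3);
        return pv ? (void *)*pv : NULL;
    }

Usage would be to declare Pvoid_t root = NULL; and pass &root to nest3_insert(); whether this actually beats JudyHS here would have to be measured.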
One side note on this. I've been thinking about flux data structures for a while, and regardless of what we pick, what do people think about having a flux container interface that we can swap the back-end implementation out on? For this, something like...
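To make that concrete, one possible shape for such a facade is sketched below; all names are hypothetical and nothing like this exists yet:

    #include <stddef.h>

    /* Hypothetical flux container facade: an opaque handle so the backing
     * implementation (zhashx, lsd hash, Judy, ...) can be swapped without
     * touching callers.  All names here are illustrative, not an existing API. */
    typedef struct flux_hash flux_hash_t;

    /* size_hint may be ignored by back-ends that grow dynamically. */
    flux_hash_t *flux_hash_create (size_t size_hint);
    void flux_hash_destroy (flux_hash_t *h);

    /* Keys are arbitrary byte strings, e.g. 20-byte raw SHA1 digests. */
    int flux_hash_insert (flux_hash_t *h, const void *key, size_t keylen, void *val);
    void *flux_hash_lookup (flux_hash_t *h, const void *key, size_t keylen);
    void flux_hash_delete (flux_hash_t *h, const void *key, size_t keylen);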
So, I grabbed the benchmark and tested it with a few more containers. These were run on my laptop, so take the memory numbers with a grain of salt until I can get this on a Linux box with an allocator I can control to check the sizes. I say this especially because sparsehash is coming out so large, and dense_hash so small... that really can't happen... Anyway, the tests cover LSD, judy (which is by far the winner for ordered or C-API containers here), zhash, zhashx, the C++ STL map, and a cross product of C++ unordered_map, google::dense_hash_map, and google::sparse_hash_map with the hash functions boost::hash, cityhash64, cityhash32, and taking the first 8 bytes as a 64-bit key like the lsd comparator does. The hash functions are indicated by suffix, the maps by prefix. Also, I did a round giving the C++ hash tables the same initial size as LSD is given; it makes a huge difference in insert performance. Pre-sized: (as LSD is)
Empty:
Overall, the dense_hash_map with the google CityHash64 and pre-allocation is the fastest and most capable, and judy is the densest, and the fastest with a C API that doesn't require a fixed size. I'm suspicious of the memory numbers though, so I may have to try this again... Anyway, if anyone wants this C++ version, let me know and I'll pop it up in a branch somewhere.
Oh, by the way, very nice on the sqlite test @garlick, you hit everything. To my surprise, even eliding the rowid from the table made it worse, so that's a best-case baseline.
One more set. These numbers are all from hype; the memory usage is from jemalloc and represents the exact number of total allocated bytes, with the delta from the previous measurement after it in the parens. Added in judys, which is a judysl container, in addition to judyhs, plus C++ map, unordered map, sparsehash, and densehash variants, all with boost hash, city hash, city hash 32, and lsd-like hash options. The ones prefixed with "i" are informed that they are likely to receive at least 1024*1024*8 items at construction, which makes a huge difference for the dynamically expanding hashes.
As previously, the judy array is best for space and pretty darn good for speed. LSD is great for both size and speed, but if we up the number of elements that is likely to degrade as collisions stack up. The densehash is best for performance overall when initialized, but uses almost 2x more space than most of the others. Interestingly enough, the technique of just lopping off the first few characters as a hash works quite well, and the string-centric cityhash variants are pretty close behind.
Thanks for doing this @trws. Could you push the test somewhere so I can have a look?
Absolutely, give me a few minutes to tie it into the build system and push the deps to spack so they're easy to grab, and I'll pop it up in a branch. It's 99% the same benchmark, just with a templated C++ hash-table test and jemalloc hooks for memory usage checking. If you want, it actually can do much more detailed profiling than this; it's the same setup I used to get the KVS memory trace before.
Ok, the updated version is up on a branch in my fork here. The autoconf/automake setup is probably not entirely complete; I was just building it directly with g++ and setting the paths because I was pulling in a number of external packages, some of which do not have pkg-config in their installs. The extra dependencies for this version are:
Oh, and the actual test was run with this:

    for HT in zhash zhashx judy judys lsd map {,i}{u,d,s}map{,c,c32,lsd} ; do
        LD_LIBRARY_PATH=~/programs/lib ./hashtest $HT | tee -a hash-table-times.out
    done
This is a performance test for various hash containers, used to investigate possible alternatives to zhash_t in the content cache and KVS object cache. Some results are captured in issue flux-framework#474.
The hash container used to map SHA1 hash strings to objects in the KVS was chosen with little research during prototyping and should be reexamined. The current container expects keys to be strings (so we have to convert 20 byte digests to 41 byte null-terminated strings), and it applies its own internal hash function even though a cryptographic hash (the SHA1) has already been computed.
I wrote a simple test that creates 16M unique objects and computes their SHA1s, then inserts them into a hash, then looks them all up again. The time to create and insert (and the resulting RSS increase), and the time to look everything up, are reported. I tried five different hash containers; here are some results for each:
zhash - exactly as currently implemented in the KVS. 40 byte SHA1 hex strings are used as keys, which are copied internally by the hash, and its internal hash function is used.
Insert: 13.34s (2,096,148 Kbytes used)
Lookup: 4.16s
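For reference, the conversion the string-keyed zhash forces looks roughly like this; sha1_digest_to_hex and the cache_* wrappers are illustrative helpers, not existing flux functions:

    #include <czmq.h>
    #include <stdint.h>
    #include <stdio.h>

    /* Illustrative helper: expand a 20-byte binary SHA1 digest into the
     * 40-character hex string (plus NUL) that a string-keyed hash requires. */
    static void sha1_digest_to_hex (const uint8_t digest[20], char hex[41])
    {
        for (int i = 0; i < 20; i++)
            sprintf (&hex[i * 2], "%02x", digest[i]);
        hex[40] = '\0';
    }

    static int cache_insert (zhash_t *h, const uint8_t digest[20], void *obj)
    {
        char key[41];
        sha1_digest_to_hex (digest, key);
        return zhash_insert (h, key, obj);  /* key is duplicated and rehashed internally */
    }

    static void *cache_lookup (zhash_t *h, const uint8_t digest[20])
    {
        char key[41];
        sha1_digest_to_hex (digest, key);
        return zhash_lookup (h, key);
    }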
zhashx is in newer czmq and includes methods to override key duplication, destruction, comparison, and the hash function. Thus I was able to circumvent key duplication and allow the 20 byte raw SHA1 digests to be used as keys, and override the hash function to leverage the cryptographic hashing we already did by generating the SHA1, simply returning the first four bytes of the digest as the integer hash value.
Insert: 13.21s (1,048,588 Kbytes used)
Lookup: 4.26s
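A minimal sketch of that zhashx setup, using czmq's zhashx_set_key_* overrides; the hasher and comparator bodies here are my own illustration:

    #include <czmq.h>
    #include <stdint.h>
    #include <string.h>

    #define SHA1_DIGEST_LEN 20

    /* The key is already a cryptographic hash, so just take its first
     * four bytes as the table's integer hash value. */
    static size_t digest_hasher (const void *key)
    {
        uint32_t h;
        memcpy (&h, key, sizeof (h));
        return h;
    }

    static int digest_comparator (const void *key1, const void *key2)
    {
        return memcmp (key1, key2, SHA1_DIGEST_LEN);
    }

    static zhashx_t *digest_hash_create (void)
    {
        zhashx_t *h = zhashx_new ();
        zhashx_set_key_hasher (h, digest_hasher);
        zhashx_set_key_comparator (h, digest_comparator);
        zhashx_set_key_duplicator (h, NULL);   /* don't copy the 20-byte keys... */
        zhashx_set_key_destructor (h, NULL);   /* ...and don't free them either  */
        return h;
    }

Insert and lookup then pass the raw digest pointer directly, e.g. zhashx_insert (h, digest, obj) and zhashx_lookup (h, digest).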
lsd hash (@dun's fixed-size hash) allocated with a size of 8M slots. Same strategy as with zhashx: use 20 byte raw SHA1 digests as keys, use first four bytes of digest as integer hash value.
Insert: 8.25s (459,020 Kbytes used)
Lookup: 9.20s
JudyHS array using 20 byte raw SHA1 digests as index, and its internal hash function.
Insert: 8.5s (1,173,020 Kbytes used)
Lookup: 7.37s
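For comparison, JudyHS usage looks roughly like this, assuming the JHSI/JHSG convenience macros from Judy.h; the wrapper names are illustrative:

    #include <Judy.h>
    #include <stdint.h>

    /* JudyHS indexes by arbitrary-length byte strings, so the 20-byte raw
     * digest can be used directly; the array itself is just a NULL pointer. */
    static void judyhs_insert (Pvoid_t *array, const uint8_t digest[20], void *obj)
    {
        PWord_t pv;
        JHSI (pv, *array, (void *)digest, 20);   /* pv points at the value slot */
        *pv = (Word_t)obj;
    }

    static void *judyhs_lookup (Pvoid_t array, const uint8_t digest[20])
    {
        PWord_t pv;
        JHSG (pv, array, (void *)digest, 20);
        return pv ? (void *)*pv : NULL;
    }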
hat-trie using 20 byte raw SHA1 digests as keys and internal (murmur) hash function (N.B. this implementation is little endian only):
Insert: 16.49s (650,960 Kbytes used)
Lookup: 7.73s
I realize this is very anecdotal and that other dimensions could be probed. I just wanted to get a rough idea of how these containers performed.
Any suggestions for other containers that might be a good fit here?
sophia for comparison. This is the proposed db for object persistence and so gives an idea of the cost of a "cache miss":
Insert: 12.84s (2,097,168 Kbytes used)
Lookup: 306.4s
sqlite3 with synchronous+journal disabled, single huge transaction for bulk insert, 20 byte raw SHA1 as primary key, object represented as blob:
Insert: 248.74s (2,668 Kbytes used)
Lookup: 135.5s
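A rough sketch of the kind of sqlite3 setup described above; the table and column names are made up for illustration:

    #include <sqlite3.h>

    /* Sketch: trade durability for speed and batch all inserts into one
     * transaction, with the raw 20-byte digest as a blob primary key. */
    static int store_open (sqlite3 **dbp)
    {
        if (sqlite3_open ("objects.db", dbp) != SQLITE_OK)
            return -1;
        sqlite3_exec (*dbp, "PRAGMA synchronous = OFF", NULL, NULL, NULL);
        sqlite3_exec (*dbp, "PRAGMA journal_mode = OFF", NULL, NULL, NULL);
        sqlite3_exec (*dbp,
            "CREATE TABLE IF NOT EXISTS objects (hash BLOB PRIMARY KEY, data BLOB)",
            NULL, NULL, NULL);
        return 0;
    }

    static void store_bulk_insert (sqlite3 *db)
    {
        sqlite3_exec (db, "BEGIN", NULL, NULL, NULL);
        /* ...sqlite3_bind_blob()/sqlite3_step()/sqlite3_reset() on a prepared
         * "INSERT INTO objects VALUES (?,?)" statement, once per object... */
        sqlite3_exec (db, "COMMIT", NULL, NULL, NULL);
    }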