find optimum hash data structure used for kvs object cache #474
Comments
Just updated the results above for zhashx, as I had not configured it to use the 20 byte raw SHA1 as stated.
This is really cool! Do we know what we'd like to optimize, though? To my eyes it seems like our good ol' lsd hash is a good balance of speed and small size, but is a fixed-size hash right out? It would be interesting to also compare to the various implementations in Google's sparsehash, but those are C++, if you could stomach it. Also, a quick search turned up this post with some other benchmarks, and a comment there mentions concurrencykit, which has another hash table implementation. Apologies if you already looked at these options.
Cool! It seems a no-brainer to go from zhash to zhashx, as it reduces memory overhead by 2x without a performance impact. The lsd hash gives another 2x memory reduction, but it comes at the cost of roughly 2x longer lookups... @grondo makes good points... I am curious what typical KVS workloads are, and would be. At first glance it seems we are much more insert-performance bound (and lsd hashing seems pretty good for that reason?), but it would be good to see some stats on it.
Not sure it is relevant, but as an alternative to the Judy radix tree I saw some mention of HAT-trie as a space- and cache-efficient alternative, and it was interesting enough to mention here. I don't actually know anything about this data structure, so I'm not even sure it could be used as a drop-in replacement, or if there are C implementations out there, but it is benchmarked along with other associative array implementations here:
Thanks for the references! With the backing store from #471, the KVS will only use the hash as a cache of recently used objects, so a fixed-size hash is feasible if we actively manage the number of elements.
Just updated the results above with a hat-trie test. Maybe not all that compelling, though it has good space efficiency.
Nice work @garlick! The google {sparse,dense}hash options would be nice to look at if not too complicated, but the other thing that comes to mind is either using JudyS or a nest of 3 JudyLs. I would expect the JudyL nest to come out as the densest, except perhaps for google sparsehash or maybe the hat-trie, but I'm not sure how the performance would change. If only JudyL had a settable fixed-depth variant it would be easier, and probably faster, but there is no such beast, unfortunately. One other one that might be worth a glance is GCC's core hash management stuff from libiberty. It's old, it's low level, and very, very old-style C, but it's fast compared to most I've seen written in C.
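Purely to illustrate the "nest of 3 JudyLs" idea, here is a minimal sketch assuming a 64-bit Word_t and an 8+8+4 byte split of the digest; the function names and the split are my own assumptions, not an existing API:

    #include <Judy.h>
    #include <stdint.h>
    #include <string.h>

    /* Sketch only: split a 20-byte SHA1 digest into 8 + 8 + 4 byte words and
     * use each word as the index into one level of nested JudyL arrays.
     * The value slot of each outer level holds the inner JudyL array pointer. */
    static void nest3_insert (Pvoid_t *root, const uint8_t digest[20], void *obj)
    {
        Word_t k1 = 0, k2 = 0, k3 = 0;
        PWord_t pv;

        memcpy (&k1, digest, 8);
        memcpy (&k2, digest + 8, 8);
        memcpy (&k3, digest + 16, 4);

        JLI (pv, *root, k1);               /* level 1: slot holds level-2 array */
        JLI (pv, *(Pvoid_t *)pv, k2);      /* level 2: slot holds level-3 array */
        JLI (pv, *(Pvoid_t *)pv, k3);      /* level 3: slot holds the object    */
        *pv = (Word_t)obj;
    }

    static void *nest3_lookup (Pvoid_t root, const uint8_t digest[20])
    {
        Word_t k1 = 0, k2 = 0, k3 = 0;
        PWord_t pv;

        memcpy (&k1, digest, 8);
        memcpy (&k2, digest + 8, 8);
        memcpy (&k3, digest + 16, 4);

        JLG (pv, root, k1);
        if (pv)
            JLG (pv, *(Pvoid_t *)pv, k2);
        if (pv)
            JLG (pv, *(Pvoid_t *)pv, k3);
        return pv ? (void *)*pv : NULL;
    }

Usage would be to declare Pvoid_t root = NULL; and pass &root to nest3_insert(); whether this actually beats JudyHS here would have to be measured.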
One side note on this. I've been thinking about flux data structures for a while, and regardless of what we pick, what do people think about having a flux container interface that we can swap the back-end implementation out on? For this, something like...
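To make that concrete, one possible shape for such a facade is sketched below; all names are hypothetical and nothing like this exists yet:

    #include <stddef.h>

    /* Hypothetical flux container facade: an opaque handle so the backing
     * implementation (zhashx, lsd hash, Judy, ...) can be swapped without
     * touching callers.  All names here are illustrative, not an existing API. */
    typedef struct flux_hash flux_hash_t;

    /* size_hint may be ignored by back-ends that grow dynamically. */
    flux_hash_t *flux_hash_create (size_t size_hint);
    void flux_hash_destroy (flux_hash_t *h);

    /* Keys are arbitrary byte strings, e.g. 20-byte raw SHA1 digests. */
    int flux_hash_insert (flux_hash_t *h, const void *key, size_t keylen, void *val);
    void *flux_hash_lookup (flux_hash_t *h, const void *key, size_t keylen);
    void flux_hash_delete (flux_hash_t *h, const void *key, size_t keylen);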
So, I grabbed the benchmark and tested it with a few more containers. These were run on my laptop, so take the memory numbers with a grain of salt until I can get this on a Linux box with an allocator I can control to check the sizes. I say this especially because sparsehash is coming out so large, and dense_hash so small... that really can't happen... Anyway, the tests cover LSD, judy (which is by far the winner for ordered or C-API containers here), zhash, zhashx, the C++ STL map, and a cross product of C++ unordered_map, google::dense_hash_map, and google::sparse_hash_map with the hash functions boost::hash, cityhash64, cityhash32, and taking the first 8 bytes as a 64-bit key like the lsd comparator does. The hash functions are indicated by suffix, the maps by prefix. Also, I did a round giving the C++ hash tables the same initial size as LSD is given; it makes a huge difference in insert performance. Pre-sized: (as LSD is)
Empty:
Overall, the dense_hash_map with the google CityHash64 and pre-allocation is the fastest and most capable, and judy is the densest, and the fastest with a C API that doesn't require a fixed size. I'm suspicious of the memory numbers though, so I may have to try this again... Anyway, if anyone wants this C++ version, let me know and I'll pop it up in a branch somewhere.
Oh, by the way, very nice on the sqlite test @garlick, you hit everything. To my surprise, even eliding the rowid from the table made it worse, so that's a best-case baseline.
One more set. These numbers are all from hype; the memory usage is from jemalloc and represents the exact number of total allocated bytes, with the delta from the previous measurement after it in the parens. Added in judys, which is a judysl container, in addition to judyhs, plus C++ map, unordered map, sparsehash, and densehash variants, all with boost hash, city hash, city hash 32, and lsd-like hash options. The ones prefixed with "i" are informed that they are likely to receive at least 1024*1024*8 items at construction, which makes a huge difference for the dynamically expanding hashes.
As previously, the judy array is best for space and pretty darn good for speed. LSD is great for both size and speed, but if we up the number of elements that is likely to degrade as collisions stack up. The densehash is best for performance overall when initialized, but uses almost 2x more space than most of the others. Interestingly enough, the technique of just lopping off the first few characters as a hash works quite well, and the string-centric cityhash variants are pretty close behind.
Thanks for doing this @trws. Could you push the test somewhere so I can have a look?
Absolutely, give me a few minutes to tie it into the build system and push the deps to spack so they're easy to grab, and I'll pop it up in a branch. It's 99% the same benchmark, just with a templated C++ hash-table test and jemalloc hooks for memory usage checking. If you want, it actually can do much more detailed profiling than this; it's the same setup I used to get the KVS memory trace before.
Ok, the updated version is up on a branch in my fork here. The autoconf/automake setup is probably not entirely complete; I was just building it directly with g++ and setting the paths because I was pulling in a number of external packages, some of which do not have pkg-config in their installs. The extra dependencies for this version are:
Oh, and the actual test was run with this:

    for HT in zhash zhashx judy judys lsd map {,i}{u,d,s}map{,c,c32,lsd} ; do
        LD_LIBRARY_PATH=~/programs/lib ./hashtest $HT | tee -a hash-table-times.out
    done
This is a performance test for various hash containers, used to investigate possible alternatives to zhash_t in the content cache and KVS object cache. Some results are captured in issue flux-framework#474.
The hash container used to map SHA1 hash strings to objects in the KVS was chosen with little research during prototyping and should be reexamined. The current container expects keys to be strings (so we have to convert 20 byte digests to 41 byte null-terminated strings), and it applies its own internal hash function even though a cryptographic hash (the SHA1) has already been computed.
I wrote a simple test that creates 16M unique objects and computes their SHA1s, then inserts them into a hash, then looks them all up again. The time to create and insert (and the resulting RSS increase), and the time to look everything up, are reported. I tried five different hash containers; here are some results for each:
zhash - exactly as currently implemented in the KVS. 40 byte SHA1 hex strings are used as keys, which are copied internally by the hash, and its internal hash function is used.
Insert: 13.34s (2,096,148 Kbytes used)
Lookup: 4.16s
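For reference, the conversion the string-keyed zhash forces looks roughly like this; sha1_digest_to_hex and the cache_* wrappers are illustrative helpers, not existing flux functions:

    #include <czmq.h>
    #include <stdint.h>
    #include <stdio.h>

    /* Illustrative helper: expand a 20-byte binary SHA1 digest into the
     * 40-character hex string (plus NUL) that a string-keyed hash requires. */
    static void sha1_digest_to_hex (const uint8_t digest[20], char hex[41])
    {
        for (int i = 0; i < 20; i++)
            sprintf (&hex[i * 2], "%02x", digest[i]);
        hex[40] = '\0';
    }

    static int cache_insert (zhash_t *h, const uint8_t digest[20], void *obj)
    {
        char key[41];
        sha1_digest_to_hex (digest, key);
        return zhash_insert (h, key, obj);  /* key is duplicated and rehashed internally */
    }

    static void *cache_lookup (zhash_t *h, const uint8_t digest[20])
    {
        char key[41];
        sha1_digest_to_hex (digest, key);
        return zhash_lookup (h, key);
    }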
zhashx is in newer czmq and includes methods to override key duplication, destruction, comparison, and the hash function. Thus I was able to circumvent key duplication and allow the 20 byte raw SHA1 digests to be used as keys, and override the hash function to leverage the cryptographic hashing we already did by generating the SHA1, simply returning the first four bytes of the digest as the integer hash value.
Insert: 13.21s (1,048,588 Kbytes used)
Lookup: 4.26s
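A minimal sketch of that zhashx setup, using czmq's zhashx_set_key_* overrides; the hasher and comparator bodies here are my own illustration:

    #include <czmq.h>
    #include <stdint.h>
    #include <string.h>

    #define SHA1_DIGEST_LEN 20

    /* The key is already a cryptographic hash, so just take its first
     * four bytes as the table's integer hash value. */
    static size_t digest_hasher (const void *key)
    {
        uint32_t h;
        memcpy (&h, key, sizeof (h));
        return h;
    }

    static int digest_comparator (const void *key1, const void *key2)
    {
        return memcmp (key1, key2, SHA1_DIGEST_LEN);
    }

    static zhashx_t *digest_hash_create (void)
    {
        zhashx_t *h = zhashx_new ();
        zhashx_set_key_hasher (h, digest_hasher);
        zhashx_set_key_comparator (h, digest_comparator);
        zhashx_set_key_duplicator (h, NULL);   /* don't copy the 20-byte keys... */
        zhashx_set_key_destructor (h, NULL);   /* ...and don't free them either  */
        return h;
    }

Insert and lookup then pass the raw digest pointer directly, e.g. zhashx_insert (h, digest, obj) and zhashx_lookup (h, digest).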
lsd hash (@dun's fixed-size hash) allocated with a size of 8M slots. Same strategy as with zhashx: use 20 byte raw SHA1 digests as keys, use first four bytes of digest as integer hash value.
Insert: 8.25s (459,020 Kbytes used)
Lookup: 9.20s
JudyHS array using 20 byte raw SHA1 digests as index, and its internal hash function.
Insert: 8.5s (1,173,020 Kbytes used)
Lookup: 7.37s
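For comparison, JudyHS usage looks roughly like this, assuming the JHSI/JHSG convenience macros from Judy.h; the wrapper names are illustrative:

    #include <Judy.h>
    #include <stdint.h>

    /* JudyHS indexes by arbitrary-length byte strings, so the 20-byte raw
     * digest can be used directly; the array itself is just a NULL pointer. */
    static void judyhs_insert (Pvoid_t *array, const uint8_t digest[20], void *obj)
    {
        PWord_t pv;
        JHSI (pv, *array, (void *)digest, 20);   /* pv points at the value slot */
        *pv = (Word_t)obj;
    }

    static void *judyhs_lookup (Pvoid_t array, const uint8_t digest[20])
    {
        PWord_t pv;
        JHSG (pv, array, (void *)digest, 20);
        return pv ? (void *)*pv : NULL;
    }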
hat-trie using 20 byte raw SHA1 digests as keys and internal (murmur) hash function (N.B. this implementation is little endian only):
Insert: 16.49s (650,960 Kbytes used)
Lookup: 7.73s
I realize this is very anecdotal and that other dimensions could be probed. I just wanted to get a rough idea of how these containers performed.
Any suggestions for other containers that might be a good fit here?
sophia for comparison. This is the proposed db for object persistence and so gives an idea of the cost of a "cache miss":
Insert: 12.84s (2,097,168 Kbytes used)
Lookup: 306.4s
sqlite3 with synchronous+journal disabled, single huge transaction for bulk insert, 20 byte raw SHA1 as primary key, object represented as blob:
Insert: 248.74s (2,668 Kbytes used)
Lookup: 135.5s
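A rough sketch of the kind of sqlite3 setup described above; the table and column names are made up for illustration:

    #include <sqlite3.h>

    /* Sketch: trade durability for speed and batch all inserts into one
     * transaction, with the raw 20-byte digest as a blob primary key. */
    static int store_open (sqlite3 **dbp)
    {
        if (sqlite3_open ("objects.db", dbp) != SQLITE_OK)
            return -1;
        sqlite3_exec (*dbp, "PRAGMA synchronous = OFF", NULL, NULL, NULL);
        sqlite3_exec (*dbp, "PRAGMA journal_mode = OFF", NULL, NULL, NULL);
        sqlite3_exec (*dbp,
            "CREATE TABLE IF NOT EXISTS objects (hash BLOB PRIMARY KEY, data BLOB)",
            NULL, NULL, NULL);
        return 0;
    }

    static void store_bulk_insert (sqlite3 *db)
    {
        sqlite3_exec (db, "BEGIN", NULL, NULL, NULL);
        /* ...sqlite3_bind_blob()/sqlite3_step()/sqlite3_reset() on a prepared
         * "INSERT INTO objects VALUES (?,?)" statement, once per object... */
        sqlite3_exec (db, "COMMIT", NULL, NULL, NULL);
    }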