
Optimized HashMap. RawTable exposes a safe interface. #15720

Closed
pczarn wants to merge 5 commits

Conversation

@pczarn (Contributor) commented Jul 16, 2014

First, a benchmark of the original hashmap implementation on an Intel i3.

find_existing    ... 54450 ns/iter (+/- 5524)
find_nonexisting ... 52472 ns/iter (+/- 1672)
find_pop_insert  ...   262 ns/iter (+/- 3)
hashmap_as_queue ...   148 ns/iter (+/- 0)
insert           ...   183 ns/iter (+/- 6)
new_drop         ...   200 ns/iter (+/- 52)
new_insert_drop  ...   317 ns/iter (+/- 104)

bucket_distance and pop_internal

A branchless implementation of the probe count calculation. Previously, a branch merely guarded against unsigned underflow. Since the capacity is a power of two, we can ignore underflow and simply return the difference modulo the capacity. Later on, we rely on the fact that an index argument >= capacity is acceptable.
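
For illustration, a minimal sketch of such a branchless distance calculation, assuming a power-of-two capacity (written in present-day Rust syntax; the names are illustrative, not the PR's exact ones):

    fn probe_distance(index: usize, ideal_index: usize, capacity: usize) -> usize {
        // Wrapping subtraction followed by a mask yields the distance modulo
        // the capacity, so underflow needs no special-case branch.
        index.wrapping_sub(ideal_index) & (capacity - 1)
    }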

pop_internal returns V instead of Option<V>.

No more low-hanging fruit...

iteration over buckets

Let's use external iterators for their greatest advantage in Rust: avoiding bounds checks and indexing. Thus {Empty,Full}Index becomes {Empty,Full}Bucket.
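
As a small, generic illustration of why iterators help here (not code from the PR):

    fn sum_indexed(v: &[u64]) -> u64 {
        let mut s = 0;
        for i in 0..v.len() {
            s += v[i]; // indexing may carry a bounds check on each access
        }
        s
    }

    fn sum_iterated(v: &[u64]) -> u64 {
        v.iter().sum() // the iterator walks the slice without per-item checks
    }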

Growing is done with an optimized reinsertion algorithm based on insert_hashed_ordered. HashMap::new() returns a table with the capacity set to 0. Hash maps that never have an element inserted won't allocate, as measured by bench::new_drop.

Removed two pointers from RawTable, as they can be recalculated once per iteration rather than on every indexing operation. The RawTable struct is now as small as Vec, at 24 bytes.

Reduced code duplication between swap and insert_hashed_nocheck by creating a new method, insert_or_replace_with, that accepts a simple closure. Moreover, robin_hood will most likely get inlined.
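
A hypothetical sketch of the shared-helper pattern, with illustrative names and signature rather than the PR's exact code: both entry points delegate to one helper and differ only in the closure applied when a value already exists.

    fn insert_or_replace_with<V>(slot: &mut Option<V>, new: V, on_existing: impl FnOnce(&mut V, V)) {
        match slot {
            Some(old) => on_existing(old, new), // key already present: let the caller decide
            None => *slot = Some(new),          // empty slot: plain insert
        }
    }

    fn main() {
        let mut slot = Some(1);
        // swap-like behavior: overwrite the old value.
        insert_or_replace_with(&mut slot, 2, |old, new| *old = new);
        assert_eq!(slot, Some(2));
    }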

find_existing    ... 45556 ns/iter (+/- 1510)
find_nonexisting ... 43812 ns/iter (+/- 1212)
find_pop_insert  ...   224 ns/iter (+/- 9)
hashmap_as_queue ...   134 ns/iter (+/- 0)
insert           ...   183 ns/iter (+/- 14)
new_drop         ...   118 ns/iter (+/- 2)
new_insert_drop  ...   241 ns/iter (+/- 29)

safe interface

This is possible thanks to a lot of prior work by @nikomatsakis! Relevant commit: nikomatsakis@2fcb95b

To eliminate the double-take issue, we must tie a bucket pointer to a reference to the table within a private structure. Only a bucket that holds a unique reference can be updated. GapThenFull uses a similar strategy to encapsulate two consecutive buckets and a single reference. Some of HashMap's methods became functions parameterized over mutability.
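
A minimal sketch of the idea, using illustrative names and a simplified table rather than the PR's exact types: the reference kind M held by the bucket determines what it may do, and only buckets holding &mut can write.

    struct RawTable<K, V> {
        pairs: Vec<Option<(K, V)>>,
    }

    struct FullBucket<M> {
        idx: usize,
        table: M, // either &RawTable<K, V> or &mut RawTable<K, V>
    }

    impl<'t, K, V> FullBucket<&'t RawTable<K, V>> {
        // Shared access: reading only.
        fn read(&self) -> &(K, V) {
            self.table.pairs[self.idx].as_ref().unwrap()
        }
    }

    impl<'t, K, V> FullBucket<&'t mut RawTable<K, V>> {
        // Unique access: the bucket may update the entry in place.
        fn replace(&mut self, k: K, v: V) -> Option<(K, V)> {
            self.table.pairs[self.idx].replace((k, v))
        }
    }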

We could use an uninit reference or drop trickery to replace one zeroing of a bucket's hash per shift of GapThenFull with a single memory access. The former language feature is not planned and the latter seems inefficient.
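
For context, a minimal sketch of the shift in question (illustrative names, not the PR's code): the full bucket's contents move back into the gap, and the vacated bucket's hash is zeroed to mark it empty.

    fn shift_into_gap<K, V>(hashes: &mut [u64], pairs: &mut [Option<(K, V)>], gap: usize, full: usize) {
        hashes[gap] = hashes[full];
        hashes[full] = 0; // the per-shift zeroing of the hash mentioned above
        let moved = pairs[full].take();
        pairs[gap] = moved;
    }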

find_existing    ... 48131 ns/iter (+/- 3469)
find_nonexisting ... 48518 ns/iter (+/- 1785)
find_pop_insert  ...   238 ns/iter (+/- 5)
hashmap_as_queue ...   131 ns/iter (+/- 2)
insert           ...   217 ns/iter (+/- 6)
new_drop         ...   120 ns/iter (+/- 11)
new_insert_drop  ...   238 ns/iter (+/- 25)

split hashmap.rs into separate files

The total number of lines approaches 3000.

final refactoring

Standalone robin_hood.

I realized that insert_or_update_with can call insert_or_replace_with directly.

Inlining has now started happening in the microbenchmarks for some reason. It looks like search_hashed_generic is quite small:

find_existing    ... 51137 ns/iter (+/- 8663)
find_nonexisting ... 17286 ns/iter (+/- 2961)
find_pop_insert  ...   198 ns/iter (+/- 4)
hashmap_as_queue ...   134 ns/iter (+/- 1)
insert           ...   220 ns/iter (+/- 5)
new_drop         ...   121 ns/iter (+/- 11)
new_insert_drop  ...   234 ns/iter (+/- 6)

conclusion

Further optimizations are increasingly difficult; they would usually make the code unreadable or require new language features.

It's possible to build a cache-aware HashMap on top of a vector of SoAs, such as Vec<([u64, ..N], [K, ..N], [V, ..N])>. The possibility of in-place growth is the only performance advantage of this approach. Unfortunately, the increase in iteration complexity outweighs the benefits of storage reuse.
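
In present-day syntax, such a layout would look roughly like the following; N = 8 is an arbitrary illustrative choice.

    const N: usize = 8;
    // Each group stores N hashes, N keys and N values contiguously.
    type Group<K, V> = ([u64; N], [K; N], [V; N]);
    type Table<K, V> = Vec<Group<K, V>>;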

Documentation is not finished. Most methods need more comments and tests.

The Robin Hood hashing scheme was introduced in #12081.

cc @cgaebel

}
}

fn as_mut_ptrs(&self) -> RawBucket<K, V> {
Contributor

This function looks pretty similar to calculate_offsets. Can you try to share as much code as possible? It was pretty subtle for me to get right in the first place, and bugs in this type of code can be pretty catastrophic.

Contributor Author

I'll try to reuse the code or write a comprehensive test, at least.

calculate_offsets returns more than we need here. I found it difficult to control its inlining. This function is called millions of times from rustc. It's an important part of every search.

Member

This method has mut in the name, but takes &self: would taking &mut self be more appropriate? (Is that even possible?)

@pczarn (Contributor Author) commented Jul 17, 2014

Had to write a branchless bucket.next() to keep the search code both efficient and safe. It speeds up LLVM passes: ...
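
A minimal, generic sketch of a branchless wrap-around step, assuming a power-of-two capacity (illustrative only; the PR's actual implementation may differ):

    fn next_index(index: usize, capacity: usize) -> usize {
        // Adding one and masking wraps around the table without a comparison,
        // keeping the hot search loop free of branches.
        (index + 1) & (capacity - 1)
    }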

@huonw (Member) commented Jul 17, 2014

Are you sure that that is from this patch? stage0 can be quite old, and thus missing other optimisations that have happened in the meantime. (The "correct" comparison would be two stage1s or (preferably) stage2s, one with the patch, one without.)

@pczarn (Contributor Author) commented Jul 17, 2014

That's right, let's compare two stage1s.

stage1 before the patch
time: 0.216 s   parsing
time: 0.010 s   gated feature checking
time: 0.000 s   crate injection
time: 0.034 s   configuration 1
time: 0.001 s   plugin loading
time: 0.000 s   plugin registration
time: 0.487 s   expansion
time: 0.038 s   configuration 2
time: 0.036 s   maybe building test harness
time: 0.000 s   prelude injection
time: 0.044 s   assigning node ids and indexing ast
time: 0.003 s   checking that all macro invocations are gone
time: 0.008 s   external crate/lib resolution
time: 0.007 s   language item collection
time: 0.095 s   resolution
time: 0.004 s   lifetime resolution
time: 0.000 s   looking for entry point
time: 0.004 s   looking for plugin registrar
time: 0.006 s   freevar finding
time: 0.011 s   region resolution
time: 0.004 s   loop checking
time: 0.008 s   stability index
time: 0.024 s   type collecting
time: 0.007 s   variance inference
time: 0.065 s   coherence checking
time: 1.416 s   type checking
time: 0.005 s   check static items
time: 0.013 s   const marking
time: 0.004 s   const checking
time: 0.041 s   privacy checking
time: 0.007 s   intrinsic checking
time: 0.006 s   effect checking
time: 0.039 s   match checking
time: 0.016 s   liveness checking
time: 0.139 s   borrow checking
time: 0.031 s   kind checking
time: 0.006 s   reachability checking
time: 0.020 s   death checking
time: 0.106 s   lint checking
time: 0.000 s   resolving dependency formats
time: 2.257 s   translation
  time: 0.356 s llvm function passes
  time: 6.334 s llvm module passes
  time: 3.113 s codegen passes
time: 9.950 s   LLVM passes
  time: 0.096 s running linker
time: 0.870 s   linking

real    0m16.262s
user    0m15.487s
sys 0m0.343s

stage1 with the patch
time: 0.226 s   parsing
time: 0.010 s   gated feature checking
time: 0.000 s   crate injection
time: 0.036 s   configuration 1
time: 0.001 s   plugin loading
time: 0.000 s   plugin registration
time: 0.496 s   expansion
time: 0.040 s   configuration 2
time: 0.038 s   maybe building test harness
time: 0.000 s   prelude injection
time: 0.044 s   assigning node ids and indexing ast
time: 0.003 s   checking that all macro invocations are gone
time: 0.009 s   external crate/lib resolution
time: 0.007 s   language item collection
time: 0.091 s   resolution
time: 0.004 s   lifetime resolution
time: 0.000 s   looking for entry point
time: 0.004 s   looking for plugin registrar
time: 0.007 s   freevar finding
time: 0.010 s   region resolution
time: 0.004 s   loop checking
time: 0.007 s   stability index
time: 0.023 s   type collecting
time: 0.007 s   variance inference
time: 0.066 s   coherence checking
time: 1.392 s   type checking
time: 0.005 s   check static items
time: 0.015 s   const marking
time: 0.004 s   const checking
time: 0.042 s   privacy checking
time: 0.007 s   intrinsic checking
time: 0.006 s   effect checking
time: 0.041 s   match checking
time: 0.016 s   liveness checking
time: 0.135 s   borrow checking
time: 0.033 s   kind checking
time: 0.007 s   reachability checking
time: 0.020 s   death checking
time: 0.101 s   lint checking
time: 0.000 s   resolving dependency formats
time: 2.226 s   translation
  time: 0.352 s llvm function passes
  time: 6.361 s llvm module passes
  time: 3.167 s codegen passes
time: 10.025 s  LLVM passes
  time: 0.101 s running linker
time: 0.749 s   linking

real    0m16.169s
user    0m15.493s
sys 0m0.347s

pczarn referenced this pull request in nikomatsakis/rust Jul 17, 2014
// 1111111b
// Then AND with the capacity: & 1000000b
// ------------
// 1000000b
Member

"... and it's zero at all other times."

Contributor Author

Done.

@pczarn (Contributor Author) commented Jul 19, 2014

This is nearly ready for a merge. I'm going to check a few details and document them.

Note to self: check that the map never reaches its capacity.

@pczarn (Contributor Author) commented Jul 24, 2014

The most recent build has failed on a test for a recent LLVM bug, #15793. Oddly, only comments have changed between builds.

@pczarn (Contributor Author) commented Jul 26, 2014

The build passes once again. Everything is resolved so far, and this is ready for a merge.

I hope I didn't accidentally remove anything during rebase.

@arthurprs (Contributor)

I was running some JSON tests today and got these results:

libserialize::json + TreeMap (default): 21805 ns/iter (+/- 949)
libserialize::json + HashMap (old):     25397 ns/iter (+/- 815)
libserialize::json + HashMap (new):     23245 ns/iter (+/- 795)

So it's an improvement, but it still doesn't match the JSON decoder using TreeMap.
I tried a few microbenchmarks inserting 9 small strings into both maps, and the new HashMap is slightly faster (~5%). So the ~7% slower JSON decoder sounds funky to me.

I don't have the expertise to check the reason, but someone more experienced may find it worth a look.

Code


extern crate time;
extern crate test;
extern crate serialize;

use serialize::json;
use std::rand::Rng;
use std::rand;
use test::Bencher;
use std::collections::{TreeMap, HashMap};

pub struct TestStruct1  {
    data_int1: int,
    data_int2: int,
    data_int3: int,
    data_str1: String,
    data_str2: String,
    data_str3: String,
    data_map: Option<Vec<Box<TestStruct1>>>,
    data_vector: Vec<u8>,
    data_vector_s: Vec<String>,
}

#[bench]
fn bench_decode(b: &mut Bencher) {
    let mut object = TestStruct1
         {data_int1: -999, data_int2: 999, data_int3: 9999, data_str1:"toto".to_string(), data_str2:"toto".to_string(), data_str3:"toto".to_string(),
         data_vector:vec![2,3,4,5], data_vector_s:vec!["hi".to_string(), "mom".to_string()], data_map:None};

    object = TestStruct1
         {data_int1: -999, data_int2: 999, data_int3: 9999,  data_str1:"toto".to_string(), data_str2:"toto".to_string(), data_str3:"toto".to_string(),
         data_vector:vec![2,3,4,5], data_vector_s:vec!["hi".to_string(), "mom".to_string()], data_map:Some(vec![box object])};

    object = TestStruct1
         {data_int1: -999, data_int2: 999, data_int3: 9999, data_str1:"toto".to_string(), data_str2:"toto".to_string(), data_str3:"toto".to_string(),
         data_vector:vec![2,3,4,5], data_vector_s:vec!["hi".to_string(), "mom".to_string()], data_map:Some(vec![box object])};

     // Serialize using `json::encode`
    let encoded = json::encode(&object);

    b.iter( || {
        let _ = json::from_str(encoded.as_slice());
    });
}


#[bench]
fn bench_hashmap_insert(b: &mut Bencher) {

    let v0: Vec<int> = rand::task_rng().gen_iter::<int>().take(9).collect();

    let mut v: Vec<String> = Vec::new();

    for i in v0.iter() {
        v.push("test_".to_string() + i.to_string());
    }

    b.iter(|| {
        let mut m = HashMap::new();

        for i_s in v.iter() {
            m.insert(i_s.clone(), i_s);
        }
    });
}



#[bench]
fn bench_treemap_insert(b: &mut Bencher) {


    let v0: Vec<int> = rand::task_rng().gen_iter::<int>().take(9).collect();

    let mut v: Vec<String> = Vec::new();

    for i in v0.iter() {
        v.push("test_".to_string() +  i.to_string());
    }

    b.iter(|| {
        let mut m = TreeMap::new();

        for i_s in v.iter() {
            m.insert(i_s.clone(), i_s);
        }
    });
}


JSON used:

{
    "data_int1": -999,
    "data_int2": 999,
    "data_int3": 9999,
    "data_str1": "toto",
    "data_str2": "toto",
    "data_str3": "toto",
    "data_map": [
        {
            "data_int1": -999,
            "data_int2": 999,
            "data_int3": 9999,
            "data_str1": "toto",
            "data_str2": "toto",
            "data_str3": "toto",
            "data_map": [
                {
                    "data_int1": -999,
                    "data_int2": 999,
                    "data_int3": 9999,
                    "data_str1": "toto",
                    "data_str2": "toto",
                    "data_str3": "toto",
                    "data_map": null,
                    "data_vector": [
                        2,
                        3,
                        4,
                        5
                    ],
                    "data_vector_s": [
                        "hi",
                        "mom"
                    ]
                }
            ],
            "data_vector": [
                2,
                3,
                4,
                5
            ],
            "data_vector_s": [
                "hi",
                "mom"
            ]
        }
    ],
    "data_vector": [
        2,
        3,
        4,
        5
    ],
    "data_vector_s": [
        "hi",
        "mom"
    ]
}

@pczarn (Contributor Author) commented Aug 1, 2014

@arthurprs Benchmarking insertion is not reliable since it depends on the allocator. Does serialization create an empty map for every structure? The difference would be best measured with a profiler.

@arthurprs (Contributor)

I tried to, but since I have little experience with Rust internals and C-ish profilers, I couldn't identify the reason.

The object-building code is below; it's pretty simple. The only change I made was replacing TreeMap with HashMap.


pub enum Json {
    Number(f64),
    String(String),
    Boolean(bool),
    List(List),
    Object(Object),
    Null,
}

pub type Object = HashMap<String, Json>;


    fn build_object(&mut self) -> Result<Json, BuilderError> {
        self.bump();

        let mut values = HashMap::new();

        loop {
            match self.token {
                Some(ObjectEnd) => { return Ok(Object(values)); }
                Some(Error(e)) => { return Err(e); }
                None => { break; }
                _ => {}
            }
            let key = match self.parser.stack().top() {
                Some(Key(k)) => { k.to_string() }
                _ => { fail!("invalid state"); }
            };
            match self.build_value() {
                Ok(value) => { values.insert(key, value); }
                Err(e) => { return Err(e); }
            }
            self.bump();
        }
        return self.parser.error(EOFWhileParsingObject);
    }

@alexcrichton (Member)

Closing due to inactivity, but feel free to reopen with a rebase!

@nikomatsakis (Contributor)

Argh! @pczarn I completely forgot about this pull request -- if you do rebase, please ping me and I will try to review promptly.
