
Optimized HashMap. RawTable exposes a safe interface. #15720

Closed
pczarn wants to merge 5 commits

Conversation

@pczarn (Contributor) commented Jul 16, 2014

First, a benchmark of the original hashmap implementation on an Intel i3.

find_existing    ... 54450 ns/iter (+/- 5524)
find_nonexisting ... 52472 ns/iter (+/- 1672)
find_pop_insert  ...   262 ns/iter (+/- 3)
hashmap_as_queue ...   148 ns/iter (+/- 0)
insert           ...   183 ns/iter (+/- 6)
new_drop         ...   200 ns/iter (+/- 52)
new_insert_drop  ...   317 ns/iter (+/- 104)

bucket_distance and pop_internal

A branchless implementation of the probe count calculation. Previously, a branch merely guarded against unsigned underflow. Since the capacity is a power of two, we can ignore underflow and simply return the difference modulo the capacity. Later on, we rely on the fact that an index argument >= capacity is acceptable.
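
For illustration, a minimal sketch of such a branchless distance calculation, assuming a power-of-two capacity (written in present-day Rust syntax; the names are illustrative, not the PR's exact ones):

    fn probe_distance(index: usize, ideal_index: usize, capacity: usize) -> usize {
        // Wrapping subtraction followed by a mask yields the distance modulo
        // the capacity, so underflow needs no special-case branch.
        index.wrapping_sub(ideal_index) & (capacity - 1)
    }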

pop_internal returns V instead of Option<V>.

No more low-hanging fruit...

iteration over buckets

Let's use external iterators for their greatest advantage in Rust: avoiding bounds checks and indexing. Thus {Empty,Full}Index becomes {Empty,Full}Bucket.
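
As a small, generic illustration of why iterators help here (not code from the PR):

    fn sum_indexed(v: &[u64]) -> u64 {
        let mut s = 0;
        for i in 0..v.len() {
            s += v[i]; // indexing may carry a bounds check on each access
        }
        s
    }

    fn sum_iterated(v: &[u64]) -> u64 {
        v.iter().sum() // the iterator walks the slice without per-item checks
    }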

Growing is done with an optimized reinsertion algorithm based on insert_hashed_ordered. HashMap::new() returns a table with the capacity set to 0. Hash maps that never have an element inserted won't allocate, as measured by bench::new_drop.

Removed two pointers from RawTable, as they can be recalculated once per iteration rather than on every indexing operation. The RawTable struct is now as small as Vec, at 24 bytes.

Reduced code duplication between swap and insert_hashed_nocheck by creating a new method, insert_or_replace_with, that accepts a simple closure. Moreover, robin_hood will most likely get inlined.
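
A hypothetical sketch of the shared-helper pattern, with illustrative names and signature rather than the PR's exact code: both entry points delegate to one helper and differ only in the closure applied when a value already exists.

    fn insert_or_replace_with<V>(slot: &mut Option<V>, new: V, on_existing: impl FnOnce(&mut V, V)) {
        match slot {
            Some(old) => on_existing(old, new), // key already present: let the caller decide
            None => *slot = Some(new),          // empty slot: plain insert
        }
    }

    fn main() {
        let mut slot = Some(1);
        // swap-like behavior: overwrite the old value.
        insert_or_replace_with(&mut slot, 2, |old, new| *old = new);
        assert_eq!(slot, Some(2));
    }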

find_existing    ... 45556 ns/iter (+/- 1510)
find_nonexisting ... 43812 ns/iter (+/- 1212)
find_pop_insert  ...   224 ns/iter (+/- 9)
hashmap_as_queue ...   134 ns/iter (+/- 0)
insert           ...   183 ns/iter (+/- 14)
new_drop         ...   118 ns/iter (+/- 2)
new_insert_drop  ...   241 ns/iter (+/- 29)

safe interface

This is possible thanks to a lot of prior work by @nikomatsakis! Relevant commit: nikomatsakis@2fcb95b

To eliminate the double-take issue, we must tie a bucket pointer to a reference to the table within a private structure. Only a bucket that holds a unique reference can be updated. GapThenFull uses a similar strategy to encapsulate two consecutive buckets and a single reference. Some of HashMap's methods became functions parameterized over mutability.
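
A minimal sketch of the idea, using illustrative names and a simplified table rather than the PR's exact types: the reference kind M held by the bucket determines what it may do, and only buckets holding &mut can write.

    struct RawTable<K, V> {
        pairs: Vec<Option<(K, V)>>,
    }

    struct FullBucket<M> {
        idx: usize,
        table: M, // either &RawTable<K, V> or &mut RawTable<K, V>
    }

    impl<'t, K, V> FullBucket<&'t RawTable<K, V>> {
        // Shared access: reading only.
        fn read(&self) -> &(K, V) {
            self.table.pairs[self.idx].as_ref().unwrap()
        }
    }

    impl<'t, K, V> FullBucket<&'t mut RawTable<K, V>> {
        // Unique access: the bucket may update the entry in place.
        fn replace(&mut self, k: K, v: V) -> Option<(K, V)> {
            self.table.pairs[self.idx].replace((k, v))
        }
    }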

We could use an uninit reference or drop trickery to replace one zeroing of a bucket's hash per shift of GapThenFull with a single memory access. The former language feature is not planned and the latter seems inefficient.
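
For context, a minimal sketch of the shift in question (illustrative names, not the PR's code): the full bucket's contents move back into the gap, and the vacated bucket's hash is zeroed to mark it empty.

    fn shift_into_gap<K, V>(hashes: &mut [u64], pairs: &mut [Option<(K, V)>], gap: usize, full: usize) {
        hashes[gap] = hashes[full];
        hashes[full] = 0; // the per-shift zeroing of the hash mentioned above
        let moved = pairs[full].take();
        pairs[gap] = moved;
    }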

find_existing    ... 48131 ns/iter (+/- 3469)
find_nonexisting ... 48518 ns/iter (+/- 1785)
find_pop_insert  ...   238 ns/iter (+/- 5)
hashmap_as_queue ...   131 ns/iter (+/- 2)
insert           ...   217 ns/iter (+/- 6)
new_drop         ...   120 ns/iter (+/- 11)
new_insert_drop  ...   238 ns/iter (+/- 25)

split hashmap.rs into separate files

The total number of lines approaches 3000.

final refactoring

Standalone robin_hood.

I realized that insert_or_update_with can call insert_or_replace_with directly.

Inlining has now started happening in the microbenchmarks for some reason. It looks like search_hashed_generic is quite small:

find_existing    ... 51137 ns/iter (+/- 8663)
find_nonexisting ... 17286 ns/iter (+/- 2961)
find_pop_insert  ...   198 ns/iter (+/- 4)
hashmap_as_queue ...   134 ns/iter (+/- 1)
insert           ...   220 ns/iter (+/- 5)
new_drop         ...   121 ns/iter (+/- 11)
new_insert_drop  ...   234 ns/iter (+/- 6)

conclusion

Further optimizations are increasingly difficult; they would usually make the code unreadable or require new language features.

It's possible to build a cache-aware HashMap on top of a vector of SoAs, such as Vec<([u64, ..N], [K, ..N], [V, ..N])>. The possibility of in-place growth is the only performance advantage of this approach. Unfortunately, the increase in iteration complexity outweighs the benefits of storage reuse.
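
In present-day syntax, such a layout would look roughly like the following; N = 8 is an arbitrary illustrative choice.

    const N: usize = 8;
    // Each group stores N hashes, N keys and N values contiguously.
    type Group<K, V> = ([u64; N], [K; N], [V; N]);
    type Table<K, V> = Vec<Group<K, V>>;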

Documentation is not finished. Most methods need more comments and tests.

The Robin Hood hashing scheme was introduced in #12081.

cc @cgaebel

}
}

fn as_mut_ptrs(&self) -> RawBucket<K, V> {
Contributor

This function looks pretty similar to calculate_offsets. Can you try to share as much code as possible? It was pretty subtle for me to get right in the first place, and bugs in this type of code can be pretty catastrophic.

Contributor Author

I'll try to reuse the code or write a comprehensive test, at least.

calculate_offsets returns more than we need here. I found it difficult to control its inlining. This function is called millions of times from rustc. It's an important part of every search.

Member

This method has mut in the name, but takes &self: would taking &mut self be more appropriate? (Is that even possible?)

@pczarn (Contributor Author) commented Jul 17, 2014

Had to write a branchless bucket.next() to keep the search code both efficient and safe. It speeds up LLVM passes: ...
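
A minimal, generic sketch of a branchless wrap-around step, assuming a power-of-two capacity (illustrative only; the PR's actual implementation may differ):

    fn next_index(index: usize, capacity: usize) -> usize {
        // Adding one and masking wraps around the table without a comparison,
        // keeping the hot search loop free of branches.
        (index + 1) & (capacity - 1)
    }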

@huonw (Member) commented Jul 17, 2014

Are you sure that that is from this patch? stage0 can be quite old, and thus missing other optimisations that have happened in the meantime. (The "correct" comparison would be two stage1s or (preferably) stage2s, one with the patch, one without.)

@pczarn (Contributor Author) commented Jul 17, 2014

That's right, let's compare two stage1s.

stage1 before the patch
time: 0.216 s   parsing
time: 0.010 s   gated feature checking
time: 0.000 s   crate injection
time: 0.034 s   configuration 1
time: 0.001 s   plugin loading
time: 0.000 s   plugin registration
time: 0.487 s   expansion
time: 0.038 s   configuration 2
time: 0.036 s   maybe building test harness
time: 0.000 s   prelude injection
time: 0.044 s   assigning node ids and indexing ast
time: 0.003 s   checking that all macro invocations are gone
time: 0.008 s   external crate/lib resolution
time: 0.007 s   language item collection
time: 0.095 s   resolution
time: 0.004 s   lifetime resolution
time: 0.000 s   looking for entry point
time: 0.004 s   looking for plugin registrar
time: 0.006 s   freevar finding
time: 0.011 s   region resolution
time: 0.004 s   loop checking
time: 0.008 s   stability index
time: 0.024 s   type collecting
time: 0.007 s   variance inference
time: 0.065 s   coherence checking
time: 1.416 s   type checking
time: 0.005 s   check static items
time: 0.013 s   const marking
time: 0.004 s   const checking
time: 0.041 s   privacy checking
time: 0.007 s   intrinsic checking
time: 0.006 s   effect checking
time: 0.039 s   match checking
time: 0.016 s   liveness checking
time: 0.139 s   borrow checking
time: 0.031 s   kind checking
time: 0.006 s   reachability checking
time: 0.020 s   death checking
time: 0.106 s   lint checking
time: 0.000 s   resolving dependency formats
time: 2.257 s   translation
  time: 0.356 s llvm function passes
  time: 6.334 s llvm module passes
  time: 3.113 s codegen passes
time: 9.950 s   LLVM passes
  time: 0.096 s running linker
time: 0.870 s   linking

real    0m16.262s
user    0m15.487s
sys 0m0.343s

stage1 with the patch
time: 0.226 s   parsing
time: 0.010 s   gated feature checking
time: 0.000 s   crate injection
time: 0.036 s   configuration 1
time: 0.001 s   plugin loading
time: 0.000 s   plugin registration
time: 0.496 s   expansion
time: 0.040 s   configuration 2
time: 0.038 s   maybe building test harness
time: 0.000 s   prelude injection
time: 0.044 s   assigning node ids and indexing ast
time: 0.003 s   checking that all macro invocations are gone
time: 0.009 s   external crate/lib resolution
time: 0.007 s   language item collection
time: 0.091 s   resolution
time: 0.004 s   lifetime resolution
time: 0.000 s   looking for entry point
time: 0.004 s   looking for plugin registrar
time: 0.007 s   freevar finding
time: 0.010 s   region resolution
time: 0.004 s   loop checking
time: 0.007 s   stability index
time: 0.023 s   type collecting
time: 0.007 s   variance inference
time: 0.066 s   coherence checking
time: 1.392 s   type checking
time: 0.005 s   check static items
time: 0.015 s   const marking
time: 0.004 s   const checking
time: 0.042 s   privacy checking
time: 0.007 s   intrinsic checking
time: 0.006 s   effect checking
time: 0.041 s   match checking
time: 0.016 s   liveness checking
time: 0.135 s   borrow checking
time: 0.033 s   kind checking
time: 0.007 s   reachability checking
time: 0.020 s   death checking
time: 0.101 s   lint checking
time: 0.000 s   resolving dependency formats
time: 2.226 s   translation
  time: 0.352 s llvm function passes
  time: 6.361 s llvm module passes
  time: 3.167 s codegen passes
time: 10.025 s  LLVM passes
  time: 0.101 s running linker
time: 0.749 s   linking

real    0m16.169s
user    0m15.493s
sys 0m0.347s

pczarn referenced this pull request in nikomatsakis/rust Jul 17, 2014
// 1111111b
// Then AND with the capacity: & 1000000b
// ------------
// 1000000b
Member

"... and it's zero at all other times."

Contributor Author

Done.

@pczarn (Contributor Author) commented Jul 19, 2014

This is nearly ready for a merge. I'm going to check a few details and document them.

Note to self: check that the map never reaches its capacity.

@pczarn (Contributor Author) commented Jul 24, 2014

The most recent build has failed on a test for a recent LLVM bug, #15793. Oddly, only comments have changed between builds.

@pczarn (Contributor Author) commented Jul 26, 2014

The build passes once again. Everything is resolved so far, and this is ready for a merge.

I hope I didn't accidentally remove anything during rebase.

@arthurprs (Contributor)

I was running some JSON tests today and got these results:

libserialize::json + TreeMap (default): 21805 ns/iter (+/- 949)
libserialize::json + HashMap (old):     25397 ns/iter (+/- 815)
libserialize::json + HashMap (new):     23245 ns/iter (+/- 795)

So it's an improvement, but it still doesn't match the JSON decoder using TreeMap.
I tried a few microbenchmarks inserting 9 small strings into both maps, and the new HashMap is slightly faster (~5%). So the ~7% slower JSON decoder sounds funky to me.

I don't have the expertise to check the reason, but someone more experienced may find it worth a look.

Code


extern crate time;
extern crate test;
extern crate serialize;

use serialize::json;
use std::rand::Rng;
use std::rand;
use test::Bencher;
use std::collections::{TreeMap, HashMap};

pub struct TestStruct1  {
    data_int1: int,
    data_int2: int,
    data_int3: int,
    data_str1: String,
    data_str2: String,
    data_str3: String,
    data_map: Option<Vec<Box<TestStruct1>>>,
    data_vector: Vec<u8>,
    data_vector_s: Vec<String>,
}

#[bench]
fn bench_decode(b: &mut Bencher) {
    let mut object = TestStruct1
         {data_int1: -999, data_int2: 999, data_int3: 9999, data_str1:"toto".to_string(), data_str2:"toto".to_string(), data_str3:"toto".to_string(),
         data_vector:vec![2,3,4,5], data_vector_s:vec!["hi".to_string(), "mom".to_string()], data_map:None};

    object = TestStruct1
         {data_int1: -999, data_int2: 999, data_int3: 9999,  data_str1:"toto".to_string(), data_str2:"toto".to_string(), data_str3:"toto".to_string(),
         data_vector:vec![2,3,4,5], data_vector_s:vec!["hi".to_string(), "mom".to_string()], data_map:Some(vec![box object])};

    object = TestStruct1
         {data_int1: -999, data_int2: 999, data_int3: 9999, data_str1:"toto".to_string(), data_str2:"toto".to_string(), data_str3:"toto".to_string(),
         data_vector:vec![2,3,4,5], data_vector_s:vec!["hi".to_string(), "mom".to_string()], data_map:Some(vec![box object])};

     // Serialize using `json::encode`
    let encoded = json::encode(&object);

    b.iter( || {
        let _ = json::from_str(encoded.as_slice());
    });
}


#[bench]
fn bench_hashmap_insert(b: &mut Bencher) {

    let v0: Vec<int> = rand::task_rng().gen_iter::<int>().take(9).collect();

    let mut v: Vec<String> = Vec::new();

    for i in v0.iter() {
        v.push("test_".to_string() + i.to_string());
    }

    b.iter(|| {
        let mut m = HashMap::new();

        for i_s in v.iter() {
            m.insert(i_s.clone(), i_s);
        }
    });
}



#[bench]
fn bench_treemap_insert(b: &mut Bencher) {


    let v0: Vec<int> = rand::task_rng().gen_iter::<int>().take(9).collect();

    let mut v: Vec<String> = Vec::new();

    for i in v0.iter() {
        v.push("test_".to_string() +  i.to_string());
    }

    b.iter(|| {
        let mut m = TreeMap::new();

        for i_s in v.iter() {
            m.insert(i_s.clone(), i_s);
        }
    });
}


JSON used:

{
    "data_int1": -999,
    "data_int2": 999,
    "data_int3": 9999,
    "data_str1": "toto",
    "data_str2": "toto",
    "data_str3": "toto",
    "data_map": [
        {
            "data_int1": -999,
            "data_int2": 999,
            "data_int3": 9999,
            "data_str1": "toto",
            "data_str2": "toto",
            "data_str3": "toto",
            "data_map": [
                {
                    "data_int1": -999,
                    "data_int2": 999,
                    "data_int3": 9999,
                    "data_str1": "toto",
                    "data_str2": "toto",
                    "data_str3": "toto",
                    "data_map": null,
                    "data_vector": [
                        2,
                        3,
                        4,
                        5
                    ],
                    "data_vector_s": [
                        "hi",
                        "mom"
                    ]
                }
            ],
            "data_vector": [
                2,
                3,
                4,
                5
            ],
            "data_vector_s": [
                "hi",
                "mom"
            ]
        }
    ],
    "data_vector": [
        2,
        3,
        4,
        5
    ],
    "data_vector_s": [
        "hi",
        "mom"
    ]
}

@pczarn (Contributor Author) commented Aug 1, 2014

@arthurprs Benchmarking insertion is not reliable since it depends on the allocator. Does serialization create an empty map for every structure? The difference would be best measured with a profiler.

@arthurprs (Contributor)

I tried to, but since I have little experience with Rust internals and C-ish profilers, I couldn't identify the reason.

The object-building code is below; it's pretty simple. The only change I made was replacing TreeMap with HashMap.


pub enum Json {
    Number(f64),
    String(String),
    Boolean(bool),
    List(List),
    Object(Object),
    Null,
}

pub type Object = HashMap<String, Json>;


    fn build_object(&mut self) -> Result<Json, BuilderError> {
        self.bump();

        let mut values = HashMap::new();

        loop {
            match self.token {
                Some(ObjectEnd) => { return Ok(Object(values)); }
                Some(Error(e)) => { return Err(e); }
                None => { break; }
                _ => {}
            }
            let key = match self.parser.stack().top() {
                Some(Key(k)) => { k.to_string() }
                _ => { fail!("invalid state"); }
            };
            match self.build_value() {
                Ok(value) => { values.insert(key, value); }
                Err(e) => { return Err(e); }
            }
            self.bump();
        }
        return self.parser.error(EOFWhileParsingObject);
    }

@alexcrichton (Member)

Closing due to inactivity, but feel free to reopen with a rebase!

@nikomatsakis (Contributor)

Argh! @pczarn I completely forgot about this pull request -- if you do rebase, please ping me and I will try to review promptly.
