Map feature comparison #2

Open
xacrimon opened this issue Feb 28, 2020 · 73 comments

Comments

@xacrimon

Worth noting: some implementations provide different features and guarantees, such as being able to share reference guards across threads, which is useful for async. I think that all maps included in the benchmark should be compared in features and guarantees. This can be a short table and doesn't have to be any sizable writeup. Just something to help a user choose an implementation that provides the features and guarantees they need.

@xacrimon
Author

xacrimon commented Feb 28, 2020

Also, I should report this: the read and write guards that CHashMap uses are UB to send across threads, yet they are marked Send + Sync. Well, file a RustSec advisory rather, since CHashMap is unmaintained.

@jonhoo
Owner

jonhoo commented Feb 28, 2020

Ah, you mean if we start listing concurrent maps in the README on this repo? Yes, I agree that that should probably be noted!

@xacrimon
Author

Exactly.

@xacrimon
Author

I'm going to start to put together some benchmarks for different concurrent map implementations here https://git.nebulanet.cc/Acrimon/conc-map-bench and list differences.

@xacrimon
Author

xacrimon commented Feb 29, 2020

Alright. I've written map adapters for a bunch of concurrent hashmaps and locked maps from std, available at the git repository above. I have only run this on my i7-7700HQ laptop and it doesn't look good for flurry. At 8 CPUs it's about equal to a RwLock'ed std HashMap. Something feels off with it.

@jonhoo
Owner

jonhoo commented Mar 1, 2020

That's really interesting — do you have plots anywhere (throughput on y, #cpus on x)? Are you using a git dependency? And crucially, are you re-using guards/HashMapRefs, or are you getting a new guard/ref for every operation?

@jonhoo
Owner

jonhoo commented Mar 1, 2020

@domenicquirl ^

@xacrimon
Author

xacrimon commented Mar 1, 2020

I haven't made plots yet since I haven't made any automatic system for it. Going to make some plots soon. You can trivially replicate this though by cloning the repository and running cargo bench. I am using the released version from crates.io. I am not reusing guards since that probably isn't what your average user will do. I do not reuse them for the contrie concurrent map crate either. And it manages much much better than flurry. So reusing guards is clearly not the issue.

@domenicquirl

domenicquirl commented Mar 1, 2020

I didn't recently run our benchmarks since I'm currently in the middle of the tree bins, but if I recall correctly the last time I checked (after the Moved optimization), there was a significant difference when re-using guards (on my machine).

Now, we probably should have reasonable performance when guards are not re-used, which already seemed like it was not so much the case from our own benchmarks. Running the mentioned bustle benchmarks locally, flurry performed the worst of all maps by far, up to an order of magnitude worse. I'm surprised you saw comparable performance to RwLock<HashMap> if you didn't re-use guards.

Out of interest, I added an adapter loosely based on our bustle branch which holds one guard for each entire run. With this, local performance becomes very much competitive with the other maps. While we can certainly debate about what the default behaviour of our users will be (or, for that matter, what API we should mainly expose), I conjecture that guards are indeed an issue. I do not know about how or why exactly this differs from Contrie.

@xacrimon
Author

xacrimon commented Mar 1, 2020

Hmm, alright. Contrie seems to pin on every operation and is very competitive. I modeled these benchmarks on how the average user is likely to use the map, which is what matters and what should be optimized for. So yeah, Flurry probably needs a good amount of love to improve the speed when guards aren't reused.

@jonhoo
Owner

jonhoo commented Mar 2, 2020

My guess is that we (for some reason) generate significantly more garbage than contrie. That shouldn't be the case, and I don't think there's a fundamental reason why we would, but it would explain that difference. Guards essentially become more expensive proportionally to how frequently you generate garbage.

@xacrimon I just released flurry 0.2 to crates.io, which should have the Moved optimization which makes a significant difference. Probably still not enough to "fix" everything, but at least it makes headway.

@xacrimon
Author

xacrimon commented Mar 2, 2020

That seems like a weird EBR implementation if that happens. Going to update the benchmark and run it on my desktop today then. I am not super familiar with crossbeam-epoch though.

@domenicquirl

Worth noting that bustle initializes the maps with a reasonably high initial capacity by default if I understand correctly, so there are likely not that many moves occurring that would create Moved to be optimized.

@xacrimon
Author

xacrimon commented Mar 2, 2020 via email

@xacrimon
Author

xacrimon commented Mar 3, 2020

Updating flurry to 0.2 helps a bit but it is nowhere near as fast as others.

@jonhoo
Owner

jonhoo commented Mar 3, 2020

When you say "as fast", what are you measuring? I am more interested in the scaling as the number of threads goes up than the performance at lower core counts. The latter is just sort of "constant" overhead that can be fixed, whereas the former indicates a contention problem that may indicate an algorithmic problem with the datastructure.

@xacrimon
Author

xacrimon commented Mar 3, 2020

Contrie and the DashMap variants scale better per thread. They see close to ideal scaling while flurry seems to have a flatter scaling. At least on my machine.

@jonhoo
Owner

jonhoo commented Mar 3, 2020

Interesting.. I mean, it could be that Java's ConcurrentHashMap just doesn't scale that well :p In some sense, that was part of what I wanted to find out by porting it. It seems more likely that the garbage collection is at fault though, and that we need to dig into why that doesn't scale for flurry.

@domenicquirl

I agree. Just to have some exemplary numbers, here's the insert-heavy workload after the version bump locally:

-- Contrie
25165824 operations across 1 thread(s) in 17.4047412s; time/op = 691ns
25165824 operations across 2 thread(s) in 9.1185288s; time/op = 361ns
25165824 operations across 3 thread(s) in 5.0007643s; time/op = 198ns
25165824 operations across 4 thread(s) in 3.9588848s; time/op = 157ns
25165824 operations across 5 thread(s) in 3.675621s; time/op = 145ns
25165824 operations across 6 thread(s) in 3.5683745s; time/op = 141ns
25165824 operations across 7 thread(s) in 3.5062784s; time/op = 139ns
25165824 operations across 8 thread(s) in 3.5366413s; time/op = 140ns

-- Flurry
25165824 operations across 1 thread(s) in 48.9622421s; time/op = 1.945µs
25165824 operations across 2 thread(s) in 33.5371731s; time/op = 1.332µs
25165824 operations across 3 thread(s) in 21.5174031s; time/op = 854ns
25165824 operations across 4 thread(s) in 17.6734076s; time/op = 701ns
25165824 operations across 5 thread(s) in 15.4645651s; time/op = 614ns
25165824 operations across 6 thread(s) in 14.0445763s; time/op = 557ns
25165824 operations across 7 thread(s) in 13.3347464s; time/op = 529ns
25165824 operations across 8 thread(s) in 12.8026602s; time/op = 507ns

-- DashMapV3
25165824 operations across 1 thread(s) in 3.4451529s; time/op = 136ns
25165824 operations across 2 thread(s) in 2.0076125s; time/op = 79ns
25165824 operations across 3 thread(s) in 1.5233589s; time/op = 59ns
25165824 operations across 4 thread(s) in 1.2849448s; time/op = 50ns
25165824 operations across 5 thread(s) in 1.204213s; time/op = 47ns
25165824 operations across 6 thread(s) in 1.0978293s; time/op = 42ns
25165824 operations across 7 thread(s) in 1.0250471s; time/op = 39ns
25165824 operations across 8 thread(s) in 959.9066ms; time/op = 38ns

And flurry with only one guard:

25165824 operations across 1 thread(s) in 8.1154286s; time/op = 321ns
25165824 operations across 2 thread(s) in 7.1846062s; time/op = 285ns
25165824 operations across 3 thread(s) in 8.3408941s; time/op = 330ns
25165824 operations across 4 thread(s) in 2.8672222s; time/op = 113ns
25165824 operations across 5 thread(s) in 2.152728s; time/op = 85ns
25165824 operations across 6 thread(s) in 1.9390374s; time/op = 76ns
25165824 operations across 7 thread(s) in 1.7498945s; time/op = 68ns
25165824 operations across 8 thread(s) in 1.5876595s; time/op = 62ns
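
To put the scaling in numbers, the 1-thread and 8-thread wall times above can be turned into speedup and throughput figures. This is only a small sketch over the values printed above; the function names are made up for illustration.

```rust
/// Total operations per run, as reported in the benchmark output above.
const OPS: f64 = 25_165_824.0;

/// Speedup of the 8-thread run relative to the 1-thread run.
fn speedup(secs_1_thread: f64, secs_8_threads: f64) -> f64 {
    secs_1_thread / secs_8_threads
}

/// Throughput in million operations per second.
fn mops_per_sec(secs: f64) -> f64 {
    OPS / secs / 1e6
}

/// (name, 1-thread seconds, 8-thread seconds), copied from the runs above.
fn runs() -> [(&'static str, f64, f64); 4] {
    [
        ("Contrie", 17.4047412, 3.5366413),
        ("Flurry (guard per op)", 48.9622421, 12.8026602),
        ("DashMapV3", 3.4451529, 0.9599066),
        ("Flurry (one guard)", 8.1154286, 1.5876595),
    ]
}

fn main() {
    for (name, t1, t8) in runs() {
        println!(
            "{name:<22} speedup 1->8: {:.2}x, throughput at 8 threads: {:.1} Mops/s",
            speedup(t1, t8),
            mops_per_sec(t8)
        );
    }
}
```

On these figures the guard-per-operation flurry run speeds up about 3.8x from 1 to 8 threads, versus about 5.1x when one guard is held for the whole run.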

@jonhoo
Owner

jonhoo commented Mar 3, 2020

Yeah, that certainly looks a lot like we're generating some unnecessary garbage that ends up hampering scalability. It also suggests that crossbeam-epoch does not handle large amounts of garbage produced across many cores very well, giving us even more of an incentive to avoid generating it in the first place. I'm curious whether the flurry model fundamentally requires more allocations (and thus garbage) than contrie, or whether it is just a matter of the code not being sufficiently carefully written.

@xacrimon
Author

xacrimon commented Mar 3, 2020

Contrie looks like it's doing about one allocation per insert in the average case. It uses no caching object pool either. I haven't studied the flurry code closely but it is definitely allocating more. On another note, I discovered an issue in bustle. It instantly panics on prefill due to an erroneous assert. I also need to add prefill to the read and update workloads in my suite since that changes performance significantly due to probing schemes and a deeper tree in contrie.

@jonhoo
Owner

jonhoo commented Mar 3, 2020

Interesting, what assert is erroneous? The code is pretty closely modeled after the libcuckoo benchmark, including the asserts.

I'm not sure I follow why you would need to prefill differently in the read/update workloads. The prefill should always be happening regardless of the underlying workload.

@xacrimon
Author

xacrimon commented Mar 3, 2020

The assert at line 309 I believe. The default prefill is 0. If I raise that to 0.5 I get dramatically different performance. After some metrics gathering dashmap is doing a lot more probing than without explicit prefill.

@jonhoo
Owner

jonhoo commented Mar 3, 2020

That assert should definitely be correct as long as the assumption holds that the keys are distinct, as the documentation specifies. The whole benchmark relies on that assumption.

The default prefill should be 75% of initial capacity, not 0..?

@xacrimon
Author

xacrimon commented Mar 3, 2020

pub fn new(threads: usize, mix: Mix) -> Self {
    Self {
        mix,
        initial_cap_log2: 25,
        prefill_f: 0.0,
        ops_f: 0.75,
        threads,
        seed: None,
    }
}

@xacrimon
Author

xacrimon commented Mar 3, 2020

The benchmark runs fine if prefill is off.

@jonhoo
Owner

jonhoo commented Mar 3, 2020

Oh, you're right, I was thinking of ops_f, not prefill_f. And yes, looks like the assertion is inverted. Will fix that shortly!

@jonhoo
Owner

jonhoo commented Mar 3, 2020

Fix published in 0.4.

@xacrimon
Author

xacrimon commented Mar 3, 2020

Tests don't lie :P. Thanks for fixing it on short notice though. Much appreciated. I will update the benchmarks. Prefill more accurately models real-world scenarios, so it only makes sense to use it.

@jonhoo
Owner

jonhoo commented Mar 3, 2020

:p

It's not clear to me that "real world" scenarios will always be prefilled. I think it makes sense to test both. A common use-case for example is to have a bunch of threads collectively fill up some shared map with data as they process, in which case there really is no prefill phase. That's when things like cooperative resizing (which ConcurrentHashMap has) become important.
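
As a rough illustration of that no-prefill pattern, here is a sketch using the RwLock-wrapped std HashMap baseline mentioned earlier in the thread. The thread count, key layout, and the function name `fill_concurrently` are placeholders, not anything taken from the benchmark suite.

```rust
use std::collections::HashMap;
use std::sync::{Arc, RwLock};
use std::thread;

/// Spawns `threads` workers that each insert `per_thread` distinct keys
/// into one shared map, starting from an empty map (no prefill phase).
fn fill_concurrently(threads: u64, per_thread: u64) -> HashMap<u64, u64> {
    let map = Arc::new(RwLock::new(HashMap::new()));

    let handles: Vec<_> = (0..threads)
        .map(|t| {
            let map = Arc::clone(&map);
            thread::spawn(move || {
                for i in 0..per_thread {
                    // Keys are disjoint across threads: t * per_thread + i.
                    let key = t * per_thread + i;
                    // Every insert takes the write lock, so writers serialize.
                    map.write().unwrap().insert(key, key * 2);
                }
            })
        })
        .collect();

    for h in handles {
        h.join().unwrap();
    }

    // All workers are done, so this is the only remaining Arc.
    Arc::try_unwrap(map).unwrap().into_inner().unwrap()
}

fn main() {
    let map = fill_concurrently(4, 1_000);
    println!("inserted {} entries", map.len());
}
```

With a single lock every growth of the table happens under one writer, which is the case cooperative resizing is meant to improve on.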

@xacrimon
Author

Ah, I'll do a read only benchmark too and I will also collect some profiling data and send it so we can figure out what is eating time.

@xacrimon
Author

xacrimon commented Apr 20, 2020

I think you were right. Here is a flamegraph of the entire benchmark. Mix was the same as before. The flurry part is on the mix -> FlurryTable slab in the graph. All dependencies are updated.

@jonhoo
Owner

jonhoo commented Apr 20, 2020

Yeah, that certainly doesn't look great. Seems like almost all the time is spent on the garbage collection. Which is surprising to me since reads shouldn't generate garbage. This is why I think a read-only mix might be instructive, to see if crossbeam-epoch also has this overhead when no garbage is generated (like if it's just from tracking the epochs). Also cc @domenicquirl. I wonder if it might be worth raising this as an issue on crossbeam-epoch with this image:
[Image: 2020-04-20-092109_1920x1080_scrot]

@xacrimon
Author

xacrimon commented Apr 21, 2020

Something is up. I put together a read-only benchmark for flurry and it seems to generate gigabytes of garbage in seconds on my machine. So you are definitely generating a lot of garbage. Tracking epochs is pretty fast; contrie also uses crossbeam-epoch and is within 3x of DashMap. I get pretty much the same graph as above in my read-only benchmark.

@jonhoo
Owner

jonhoo commented Apr 21, 2020

That's crazy, and should definitely not be happening. In a read-only benchmark, we shouldn't be generating any garbage. I'm currently kind of swamped, but maybe @domenicquirl can take a look?

@domenicquirl

I'm in a similar situation at the moment, I fear. I've had some forced free time due to Covid-19, but now everything is heading back online, except more chaotic. I've been following the discussion through mail, and this does look wrong. If this is an issue with us and not crossbeam, it's very likely the reason for a lot of the performance discrepancy. Depending on how much progress I make with other things during the week, I'll take a look on the weekend; I'm definitely interested in finding out where this comes from.

@domenicquirl

domenicquirl commented May 14, 2020

Finally found some time to look at this. I was interested most in the "gigabytes of garbage" for a start, so I ran valgrind's DHAT on a small test with two concurrent threads reading from a pre-initialized map.

To check, I first had the test use only one guard per thread, and I got only the initial creation of the map as significant allocations (plus some allocations from the test binary), split into the fast and the slow path in HashMap::put. Reading itself (HashMap::get) did not re-allocate anything.

Then I switched to calling HashMap::guard for every get. The same allocations exist for map initialization, but a whole lot of allocations are now caused by creating and dropping guards. Raw data for the two DHAT nodes:

AP 1.1.1.1.1.1/2 (3 children) {
  Total:     4,148,144 bytes (48.26%, 30,566.74/Minstr) in 2,002 blocks (29.02%, 14.75/Minstr), avg size 2,072 bytes, avg lifetime 267,613.74 instrs (0.2% of program duration)
  At t-gmax: 4,144 bytes (2.29%) in 2 blocks (0.08%), avg size 2,072 bytes
  At t-end:  10,360 bytes (59.84%) in 5 blocks (55.56%), avg size 2,072 bytes
  Reads:     4,351,992 bytes (39.87%, 32,068.85/Minstr), 1.05/byte
  Writes:    4,276,264 bytes (31.25%, 31,510.82/Minstr), 1.03/byte
  Allocated at {
    ^1: 0x48397B3: malloc (in /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_dhat-amd64-linux.so)
    ^2: 0x28C80B: alloc::alloc::alloc (alloc.rs:80)
    ^3: 0x28CD53: <alloc::alloc::Global as core::alloc::AllocRef>::alloc (alloc.rs:174)
    ^4: 0x28C76C: alloc::alloc::exchange_malloc (alloc.rs:268)
    ^5: 0x2864B9: new<crossbeam_epoch::sync::queue::Node<crossbeam_epoch::internal::SealedBag>> (boxed.rs:174)
    ^6: 0x2864B9: crossbeam_epoch::atomic::Owned<T>::new (atomic.rs:664)
    ^7: 0x28DB25: crossbeam_epoch::sync::queue::Queue<T>::push (queue.rs:91)
    ^8: 0x28372D: crossbeam_epoch::internal::Global::push_bag (internal.rs:269)
    #9: 0x284CE0: crossbeam_epoch::internal::Local::finalize (internal.rs:576)
    #10: 0x284762: crossbeam_epoch::internal::Local::unpin (internal.rs:514)
    #11: 0x28C6C3: <crossbeam_epoch::guard::Guard as core::ops::drop::Drop>::drop (guard.rs:423)
    #12: 0x287F79: core::ptr::drop_in_place (mod.rs:178)
  }
}

AP 1.1.1.2/2 (3 children) {
  Total:     4,212,208 bytes (49%, 31,038.81/Minstr) in 2,002 blocks (29.02%, 14.75/Minstr), avg size 2,104 bytes, avg lifetime 202,202.7 instrs (0.15% of program duration)
  At t-gmax: 6,312 bytes (3.49%) in 3 blocks (0.12%), avg size 2,104 bytes
  At t-end:  6,312 bytes (36.46%) in 3 blocks (33.33%), avg size 2,104 bytes
  Reads:     5,028,320 bytes (46.06%, 37,052.56/Minstr), 1.19/byte
  Writes:    8,968,904 bytes (65.54%, 66,089.83/Minstr), 2.13/byte
  Allocated at {
    ^1: 0x48397B3: malloc (in /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_dhat-amd64-linux.so)
    ^2: 0x28C80B: alloc::alloc::alloc (alloc.rs:80)
    ^3: 0x28CD53: <alloc::alloc::Global as core::alloc::AllocRef>::alloc (alloc.rs:174)
    ^4: 0x28C76C: alloc::alloc::exchange_malloc (alloc.rs:268)
    #5: 0x286439: new<crossbeam_epoch::internal::Local> (boxed.rs:174)
    #6: 0x286439: crossbeam_epoch::atomic::Owned<T>::new (atomic.rs:664)
    #7: 0x283D9C: crossbeam_epoch::internal::Local::register (internal.rs:388)
    #8: 0x28C1CD: crossbeam_epoch::collector::Collector::register (collector.rs:39)
    #9: 0x1F8455: flurry::map::HashMap<K,V,S>::guard (map.rs:369)
  }
} 

Note that absolute numbers are in relation to 1000 map elements, so 2000 calls to get across both threads.
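
As a quick sanity check on those two nodes, the block counts and average sizes can be multiplied out and divided by the number of reads. This is just arithmetic over the figures printed above; the struct and function names are made up for the example.

```rust
/// Figures copied from the two DHAT nodes above.
struct Site {
    name: &'static str,
    blocks: u64,
    avg_size: u64,
    total_bytes: u64,
}

fn sites() -> [Site; 2] {
    [
        Site { name: "queue node (SealedBag)", blocks: 2_002, avg_size: 2_072, total_bytes: 4_148_144 },
        Site { name: "Local (register)", blocks: 2_002, avg_size: 2_104, total_bytes: 4_212_208 },
    ]
}

/// Bytes allocated per `get` if the total is spread over `gets` calls.
fn bytes_per_get(total_bytes: u64, gets: u64) -> f64 {
    total_bytes as f64 / gets as f64
}

fn main() {
    let gets = 2_000; // 1000 elements read by each of 2 threads
    let mut sum = 0;
    for s in sites() {
        // blocks * avg size reproduces the reported total exactly
        assert_eq!(s.blocks * s.avg_size, s.total_bytes);
        println!("{}: {:.0} bytes per get", s.name, bytes_per_get(s.total_bytes, gets));
        sum += s.total_bytes;
    }
    println!("both sites: {:.0} bytes per get", bytes_per_get(sum, gets));
}
```

So each call to get comes with roughly one queue node and one Local allocation, a bit over 4 KB in total, and the two sites together account for about 97% of all bytes allocated in the test (48.26% + 49%).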

@xacrimon
Author

I might actually know why this happens. Guards may on drop trigger a flush of the local garbage list and it looks like it doesn't check if it's empty and swaps it regardless.

@xacrimon
Author

Sometimes it's also moved onto the global queue which allocates a new thread local bag of garbage.
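
To show what I mean, here is a toy model of that hypothesis. It is not crossbeam-epoch's actual code: `Bag`, `GlobalQueue`, and both flush functions are invented for the example, and the only thing it demonstrates is how an unconditional swap turns read-only operations into queue pushes.

```rust
use std::mem;

/// Toy stand-in for a thread-local bag of deferred garbage.
#[derive(Default)]
struct Bag {
    items: Vec<u64>,
}

/// Toy stand-in for the collector's global queue of sealed bags.
#[derive(Default)]
struct GlobalQueue {
    sealed: Vec<Bag>,
    /// Number of queue nodes "allocated" by pushes.
    pushes: usize,
}

impl GlobalQueue {
    fn push(&mut self, bag: Bag) {
        self.pushes += 1;
        self.sealed.push(bag);
    }
}

/// Hypothesized behavior: swap out and push the bag on every flush.
fn flush_always(local: &mut Bag, global: &mut GlobalQueue) {
    global.push(mem::take(local));
}

/// Alternative: leave an empty bag in place and push nothing.
fn flush_if_nonempty(local: &mut Bag, global: &mut GlobalQueue) {
    if !local.items.is_empty() {
        global.push(mem::take(local));
    }
}

/// Simulates `reads` read-only operations, each ending in one flush,
/// and returns how many bags reached the global queue.
fn simulate(reads: usize, flush: fn(&mut Bag, &mut GlobalQueue)) -> usize {
    let mut local = Bag::default();
    let mut global = GlobalQueue::default();
    for _ in 0..reads {
        // A read defers no garbage, so the bag is still empty here.
        flush(&mut local, &mut global);
    }
    global.pushes
}

fn main() {
    println!("always:      {} pushes", simulate(2_000, flush_always));
    println!("if nonempty: {} pushes", simulate(2_000, flush_if_nonempty));
}
```

In this model the unconditional variant pushes once per read, while the checked variant pushes nothing for a read-only workload.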

@jonhoo
Owner

jonhoo commented May 14, 2020

Wow, that would be really bad. This smells like a bug if it allocates even if there is no garbage...

@domenicquirl

I'm also a bit confused that we spend so much time/memory allocating epoch LocalHandles. I know that we have a collector tied to each map so we can't just use epoch::pin() against the global collector's local handle (as Contrie seems to do from a quick look at their repo). But do we need to register a new handle every time we pin? You added most of the collector safety measures I think Jon, so you probably have a better idea about that than I do, but it seems like a lot. At least the default handle is re-used all the time, so I don't understand what these handles would be useful for if you cannot do the same in other contexts as well.

@xacrimon
Author

xacrimon commented May 14, 2020

Pretty sure you don't need to grab new handles.

@jonhoo
Owner

jonhoo commented May 14, 2020

> At least the default handle is re-used all the time

Hmm, that doesn't seem right. LocalHandle isn't Send/Sync, so you can't really re-use it except in a thread-local. It's true that we could cache the handle in a thread local ourselves, though we'd have to be very careful to make sure it works correctly even if the user is accessing multiple maps on the same thread. The idea behind HashMap::pin was that the application can choose how they wish to cache the handles, rather than us doing it for them. That does give a Guard, not a LocalHandle though, so maybe we're missing an intermediate there?

@xacrimon
Author

Why are we using a custom collector anyway? I don't see a benefit compared to the global static one.

@domenicquirl

It is in TLS, yeah, but that's where the epoch::pin() calls go. So we can't have just one handle per map. But accessing different maps should be covered by checking the guard passed to all the public interface methods, shouldn't it? Since locals are tied to their collector, you couldn't have guards from one local accessing a different map (@xacrimon this is also why we have different collectors, see https://github.com/jonhoo/flurry/blob/master/src/map.rs#L3541). This would mean you'd need to have a local per map per thread however. That is for continuing to hand out guards. We could hand out the local, but the performance impact is so big if you then use guards, I still think we should aspire to have the "regular" case with guards perform reasonably well and not rely on the user to choose the correct API to be efficient.
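
A toy sketch of the kind of guard check I mean, with one collector per map. All types here are invented stand-ins rather than flurry's or crossbeam's, and the toy returns an error on a mismatch purely to keep the example simple.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

static NEXT_ID: AtomicUsize = AtomicUsize::new(0);

/// Toy collector, identified by a unique id.
struct Collector {
    id: usize,
}

impl Collector {
    fn new() -> Self {
        Collector { id: NEXT_ID.fetch_add(1, Ordering::Relaxed) }
    }

    /// Hands out a guard that remembers which collector it came from.
    fn pin(&self) -> Guard {
        Guard { collector_id: self.id }
    }
}

struct Guard {
    collector_id: usize,
}

/// Toy map that owns its own collector, as described above.
struct Map {
    collector: Collector,
    value: u32,
}

impl Map {
    fn new(value: u32) -> Self {
        Map { collector: Collector::new(), value }
    }

    fn guard(&self) -> Guard {
        self.collector.pin()
    }

    /// Rejects guards that were pinned against a different collector.
    fn get(&self, guard: &Guard) -> Result<u32, &'static str> {
        if guard.collector_id != self.collector.id {
            return Err("guard belongs to a different map's collector");
        }
        Ok(self.value)
    }
}

fn main() {
    let (map1, map2) = (Map::new(1), Map::new(2));
    let guard1 = map1.guard();
    println!("{:?}", map1.get(&guard1)); // own guard
    println!("{:?}", map2.get(&guard1)); // foreign guard
}
```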

@jonhoo
Owner

jonhoo commented May 14, 2020

> Why are we using a custom collector anyway? I don't see a benefit compared to the global static one.

The biggest reason is memory reclamation. By using a collector, you can guarantee that all memory is dropped when the data structure is dropped. This also, in theory, can allow the removal of the 'static bound on keys and values, which only needs to be there when using the global collector.

> But accessing different maps should be covered by checking the guard passed to all the public interface methods, shouldn't it?

Are you thinking that if the guard doesn't match, then we update the thread local? That could work... Cleanup will get complicated though, because you need to also clear that thread local, or we'll potentially hold on to remaining garbage forever. It could also lead to really weird performance profiles, where if you do lookups into two maps, one after the other, then you'd get the slow path every time.

> the performance impact is so big if you then use guards, I still think we should aspire to have the "regular" case with guards perform reasonably well and not rely on the user to choose the correct API to be efficient.

I'm not sure I fully follow what you're saying here. I agree that ideally performance should be good "by default", without the user having to do anything weird. I don't quite know how we do this without either moving back to just using the global collector, or doing the aforementioned "optimistic caching" of the LocalHandle.

@domenicquirl

I was only thinking so far as that the check would cover the case where you try to access with a guard from a wrong local. I don't think I like the idea of caching only for one map, a pattern like

map1.get(key1, guard1);
map2.get(key2, guard2);
map2.get(key3, guard3);

is common enough that this would be a bad idea for the reasons you outlined.

Is there a good way to store multiple handles in TLS dynamically, so we could store locals for each map separately for each thread? Independently, a third option would be to make HashMapRef the default (/only) way to interact with a map, which has the advantage that we could get rid of the checks altogether since we always know the guard comes from our map.

Ideally, probably crossbeam would store the thread for which a local gets registered together with the local itself, and then check if it needs to create a new one or if an existing one can be reused?

@jonhoo
Owner

jonhoo commented May 14, 2020

> I was only thinking so far as that the check would cover the case where you try to access with a guard from a wrong local.

Yup, the checks would indeed catch this, but the user would probably be quite confused if their use sometimes ended up with an error due to caching.

> I don't think I like the idea of caching only for one map, a pattern like ... is common enough that this would be a bad idea for the reasons you outlined.

Completely agree.

> Is there a good way to store multiple handles in TLS dynamically

We'd have to store a HashMap in the thread local. Which is doable, but also a little awkward.

> Independently, a third option would be to make HashMapRef the default (/only) way to interact with a map

This would get rid of the check, true, but it would not solve the problem — users would still then either have to manually keep the HashMapRef around, or they would be calling HashMap::pin over and over, resulting in the same slowdown as today.

> Ideally, probably crossbeam would store the thread for which a local gets registered together with the local itself, and then check if it needs to create a new one or if an existing one can be reused?

Hmm, this would just make LocalHandle Send, but I'm not sure it would otherwise solve the issue?

I wonder what the picture looks like if a LocalHandle didn't sync empty changes. I think that could have a huge impact. A simple allocation usually doesn't matter that much.

If it turns out that it really is creating a LocalHandle that causes the issue, then my take here is that we either need to commit fully to the global collector, or do a thread-local cache, or simply tell the user that they should re-use HashMapRef if they want performance. None of these are ideal.

@domenicquirl

> We'd have to store a HashMap in the thread local. Which is doable, but also a little awkward.

Why would we want to have the entire map be local to each thread? Then where would the common data be?

> Hmm, this would just make LocalHandle Send, but I'm not sure it would otherwise solve the issue?

I'm not thinking about moving handles or safety checks, but about registering local handles and obtaining guards. If we had a list of locals with associated thread, then if a thread wants to obtain a new guard we could check that list for the given thread to see if we already registered a local for this thread. If so, we use that local to pin and get a guard; only otherwise do we actually register a new local with the collector.

> I wonder what the picture looks like if a LocalHandle didn't sync empty changes. I think that could have a huge impact. A simple allocation usually doesn't matter that much.

I agree it is likely that has more performance impact. However, given that having to do this "simple allocation" for every read accounted for almost half of the program's memory usage, it's still something I would like to avoid. Especially if having a new handle for each guard() is not necessary.

@jonhoo
Owner

jonhoo commented May 14, 2020

> Why would we want to have the entire map be local to each thread?

Ah, sorry, that wasn't what I meant. What I meant was that you'd store a std::collections::HashMap<SomeFlurryMapIdentifier, LocalHandle> in TLS.
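
Roughly, and only as a sketch with stand-in types: `MapId`, `Handle`, `register`, and `with_handle` below are hypothetical, and `Handle` is a placeholder that is `!Send` like LocalHandle rather than the real thing.

```rust
use std::cell::RefCell;
use std::collections::HashMap;
use std::rc::Rc;
use std::sync::atomic::{AtomicUsize, Ordering};

/// Hypothetical identifier for one flurry map instance.
type MapId = usize;

/// Stand-in for a per-thread collector handle. `Rc` makes it `!Send`,
/// mirroring the constraint discussed above.
struct Handle {
    map: MapId,
    _not_send: Rc<()>,
}

/// Counts how often the expensive registration step runs.
static REGISTRATIONS: AtomicUsize = AtomicUsize::new(0);

fn register(map: MapId) -> Handle {
    REGISTRATIONS.fetch_add(1, Ordering::SeqCst);
    Handle { map, _not_send: Rc::new(()) }
}

thread_local! {
    /// One cached handle per map, private to the current thread.
    static HANDLES: RefCell<HashMap<MapId, Handle>> = RefCell::new(HashMap::new());
}

/// Runs `f` with this thread's handle for `map`, registering on first use only.
/// `f` must not call `with_handle` again, since the cache is borrowed mutably.
fn with_handle<R>(map: MapId, f: impl FnOnce(&Handle) -> R) -> R {
    HANDLES.with(|cell| {
        let mut cache = cell.borrow_mut();
        let handle = cache.entry(map).or_insert_with(|| register(map));
        f(handle)
    })
}

fn main() {
    for _ in 0..1_000 {
        with_handle(1, |h| assert_eq!(h.map, 1));
        with_handle(2, |h| assert_eq!(h.map, 2));
    }
    println!("registrations: {}", REGISTRATIONS.load(Ordering::SeqCst));
}
```

Alternating between two maps stays on the cached path here, but note that nothing in this sketch removes an entry when a map is dropped, which is the cleanup problem mentioned earlier.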

@domenicquirl

Oh, ok. Then it's just the other way round from me thinking about having something like a std::collections::HashMap<ThreadId, LocalHandle> in our map, or in the crossbeam Global.

@jonhoo
Owner

jonhoo commented May 14, 2020

Oh, no, that won't work, since LocalHandle isn't Send.

@domenicquirl

Not directly. There is a list of locals in the Global, they are added when you register handles (accessed through an Arc). These are internal Local objects, not LocalHandles, and they live behind a Shared, but it should be possible to have a thread Id there. Seeing that, maybe this could then be used in such a way, but I'm also not familiar enough with crossbeam's implementation to actually know that.
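
Just to sketch the lookup I have in mind, using a toy collector: `Local` here is a plain stand-in that is `Send`, and every name is invented. Whether crossbeam's internal list of locals could support something like this is exactly the part I don't know.

```rust
use std::collections::HashMap;
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{Arc, Mutex};
use std::thread::{self, ThreadId};

/// Stand-in for the collector's internal per-thread record.
struct Local {
    owner: ThreadId,
}

/// Toy collector that remembers which thread each local was registered for.
#[derive(Default)]
struct Collector {
    locals: Mutex<HashMap<ThreadId, Arc<Local>>>,
    registrations: AtomicUsize,
}

impl Collector {
    /// Returns the calling thread's local, registering one only if none exists.
    fn local_for_current_thread(&self) -> Arc<Local> {
        let me = thread::current().id();
        let mut locals = self.locals.lock().unwrap();
        Arc::clone(locals.entry(me).or_insert_with(|| {
            self.registrations.fetch_add(1, Ordering::SeqCst);
            Arc::new(Local { owner: me })
        }))
    }
}

/// Pins `pins_per_thread` times on each of `threads` threads and returns
/// the number of registrations that were needed.
fn run(threads: usize, pins_per_thread: usize) -> usize {
    let collector = Arc::new(Collector::default());
    let workers: Vec<_> = (0..threads)
        .map(|_| {
            let c = Arc::clone(&collector);
            thread::spawn(move || {
                for _ in 0..pins_per_thread {
                    let local = c.local_for_current_thread();
                    assert_eq!(local.owner, thread::current().id());
                }
            })
        })
        .collect();
    for w in workers {
        w.join().unwrap();
    }
    collector.registrations.load(Ordering::SeqCst)
}

fn main() {
    println!("registrations: {}", run(4, 1_000));
}
```

The mutex in this toy would itself be a point of contention on every pin, so a real version would need a cheaper lookup.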

@jonhoo
Owner

jonhoo commented May 14, 2020

Oh, I see, you're proposing changing the implementation of crossbeam-epoch! I was thinking of ways to make changes in flurry specifically. May be worth raising this on the crossbeam issue tracker actually, for a "how do we do this". And specifically call out the issue around synchronizing on empty.

@xacrimon
Author

I've updated deps for the benchmark now and I have changed flurry to use HashMapRef handles. Since I've released a v4 DashMap release candidate, I believe the implementations should be pretty stable. So whenever you can, you've got the green light from me to run them.

@domenicquirl

Did either of you actually get back to crossbeam on this? I did a bit more digging with different variations of global/local collector and guard source checks. (I even tried to put together a version using flize over crossbeam, but couldn't find a good solution for the epoch::unprotected() cases.) https://github.com/jonhoo/flurry/blob/9a443cbd393c1605eb6df302c0d7531d70eb7be6/src/map.rs#L369 (obtaining guards) is by far the biggest difference-maker in overall benchmarking results. Basically no matter the surrounding implementation, if at this spot we do epoch::pin() we are very much competitive in everything but single-thread in-/upserts. In all cases where this goes through some_collector.register(), we end up with the absurd performance difference we have seen already.

@domenicquirl

Also @xacrimon, I tried running the updated benchmark as well, but some dependency has a cmake build script that wouldn't run under Windows. It didn't matter for my evaluations, and I have a Linux machine I can switch to when I want to look at this more, but I wanted to mention it in case you are not yet aware.

@jonhoo
Owner

jonhoo commented Sep 19, 2020

@domenicquirl Interesting, flize doesn't have a way to (unsafely) get a reference when you know no-one else has one? That seems odd.
