sync: Map has poor performance #28938
cc @bcmills
For anyone looking, the code is actually located here. I'm not sure how valid these benchmarks are, but here is the effect of removing the remainder operation.
Benchcmp before/after:
Additionally, I suggest that anyone investigating examine how these benchmarks are implemented and whether they are even valid: b.N is being updated by multiple goroutines, for instance.
Of course, after doing this, I noticed the data race in the benchmarks and ran them with the race detector. So all bets are off.
I am fairly certain the tests are valid. b.N is not being updated - it is only being read by multiple Go routines - and the actual testing method does not exit until the Go routines complete, hence the WaitGroup. Still, the "multi" benchmarks are not even the ones referenced by this bug - the non-multi tests use the single Go routine provided by the testing harness. The reason the mod is there is to limit the number of entries in the map - a cap. Yes, 'unshared' has a data race when run under multi - but since the map is pre-populated and doesn't change in size during a read, it is a benign data race (which is also why it doesn't run the multi put tests - otherwise an exception would be thrown). And @as, your "changed tests" are not valid: there is no way a test WITH locks (lock) should outperform the same test WITHOUT locks (unshared) - especially by being 4x faster! So something is wrong... maybe if you linked to the code I could point it out. To make the point stronger, look at the performance difference between the non-multi "sync" results and "lock" and "unshared".
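For readers unfamiliar with the harness being described, a minimal sketch of the pattern (not the issue's actual benchmark code; the goroutine count, key range, and helper names are illustrative) looks like this - b.N is only read by the worker goroutines, and the WaitGroup keeps the benchmark function alive until they finish:

```go
package maps_test

import (
	"math/rand"
	"sync"
	"testing"
)

// benchMultiGet fans b.N lookups out over several goroutines.
// Illustrative only: the map is assumed to be pre-populated with
// keys in [0, 1<<20), so every Load hits.
func benchMultiGet(b *testing.B, m *sync.Map, goroutines int) {
	var wg sync.WaitGroup
	for g := 0; g < goroutines; g++ {
		wg.Add(1)
		go func(seed int64) {
			defer wg.Done()
			r := rand.New(rand.NewSource(seed))
			for i := 0; i < b.N/goroutines; i++ { // b.N is read, never written
				m.Load(r.Intn(1 << 20))
			}
		}(int64(g))
	}
	wg.Wait() // the benchmark does not return until all goroutines complete
}
```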
The "unshared" benchmarks still contain the remainder operation. I should have pointed that out -- I didn't remove it for all of your types. This was originally to illustrate how the use of high cycle cost division instructions are dominating the CPU time in the benchmark itself. There is no such thing as a benign data race. It is counterproductive to further discuss a racy benchmark (which is why I did not publish any code changes). |
@as I'm sorry but you are not correct on a couple of counts:
BenchmarkMain/unshared.get-8 20000000 91.8 ns/op
BenchmarkMain/unshared.put-8 20000000 95.7 ns/op
BenchmarkMain/unshared.putget-8 10000000 174 ns/op
BenchmarkMain/unshared.multiget-8 20000000 89.9 ns/op
BenchmarkMain/lock.get-8 20000000 112 ns/op
BenchmarkMain/lock.put-8 10000000 123 ns/op
BenchmarkMain/lock.putget-8 10000000 231 ns/op
BenchmarkMain/lock.multiget-8 10000000 144 ns/op
BenchmarkMain/lock.multiput-8 5000000 305 ns/op
BenchmarkMain/lock.multiputget-8 3000000 495 ns/op
BenchmarkMain/sync.get-8 5000000 287 ns/op
BenchmarkMain/sync.put-8 5000000 340 ns/op
BenchmarkMain/sync.putget-8 2000000 748 ns/op
BenchmarkMain/sync.multiget-8 3000000 407 ns/op
BenchmarkMain/sync.multiput-8 5000000 354 ns/op
BenchmarkMain/sync.multiputget-8 2000000 773 ns/op
@as your code changes are probably incorrect, since if you remove the mod/and, there is a very good chance that the random read will look for a value that is not in the map, which would be very fast... that is why you are seeing the "improved results". I think you should be less hostile in the future. Edit: this is especially true since the Go map uses a 'top hash' as a quick check - many of the reads would terminate quickly.
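To make the mask discussion concrete, here is a hypothetical illustration (the constant and function names are not from the linked repository): with the mask, every random key falls inside the pre-populated range; without it, almost every lookup is a miss, which the runtime map can reject very quickly, so the benchmark appears faster while measuring something different.

```go
package maps_test

import "math/rand"

const keyMask = 1<<20 - 1 // illustrative cap: keys stay in the pre-populated range [0, 2^20)

func maskedKey(r *rand.Rand) int   { return r.Int() & keyMask } // always hits an existing entry
func unmaskedKey(r *rand.Rand) int { return r.Int() }           // almost always a miss
```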
@as lastly, your "racy" assessment is incorrect as well - if you examined the code, it is the use of the "sink" to avoid dead-code elimination (not a race in using the map) - which, by the way, most of the internal Go benchmark tests don't do, and they are invalid because of it... only some have been fixed to use a package-level sink.
@as I would kindly ask that you delete your comments, as they are invalid and only add noise to this issue.
@robaho A program with a data race in it is not a valid Go program. Even if the data races don't impact your benchmark results (which may be possible), I suggest you fix them, just to make your benchmarks valid Go code.
@ALTree The data race is only the setting of the sink in the benchmark code - it is debatable whether the race detector should even report this as a data race, since the sink is only there to prevent dead-code removal. Still, I went ahead and changed it to an atomic.Value, so I assume it will be OK now. Also, I moved the "masking" into the test harness and out of the core code to make it even simpler.
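The sink pattern being described is roughly the following (a sketch, not the exact benchmark code): publishing the loaded value through a package-level variable keeps the compiler from treating the lookup as dead code, and routing the store through atomic.Value keeps the race detector quiet when several goroutines write it.

```go
package maps_test

import "sync/atomic"

// sink prevents dead-code elimination of the benchmarked lookups.
// Using atomic.Value (rather than a plain package variable) makes the
// concurrent writes race-free. Sketch only.
var sink atomic.Value

func consume(v int) {
	sink.Store(v)
}
```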
After all of the changes, the results still show an issue:
BenchmarkMain/unshared.get-8 20000000 85.5 ns/op
BenchmarkMain/unshared.put-8 20000000 92.3 ns/op
BenchmarkMain/unshared.putget-8 10000000 177 ns/op
BenchmarkMain/unshared.multiget-8 20000000 88.2 ns/op
BenchmarkMain/lock.get-8 10000000 109 ns/op
BenchmarkMain/lock.put-8 10000000 125 ns/op
BenchmarkMain/lock.putget-8 5000000 233 ns/op
BenchmarkMain/lock.multiget-8 10000000 158 ns/op
BenchmarkMain/lock.multiput-8 5000000 293 ns/op
BenchmarkMain/lock.multiputget-8 3000000 571 ns/op
BenchmarkMain/sync.get-8 5000000 299 ns/op
BenchmarkMain/sync.put-8 5000000 344 ns/op
BenchmarkMain/sync.putget-8 2000000 678 ns/op
BenchmarkMain/sync.multiget-8 5000000 375 ns/op
BenchmarkMain/sync.multiput-8 5000000 337 ns/op
BenchmarkMain/sync.multiputget-8 2000000 707 ns/op
After rewriting the benchmark in a way I could understand, I obtained the results below. My variables of interest are:
With just one concurrent writer, the sync map already beats the lock.
Why are these results so different from the other benchmark?
@as They are not that different - your results agree with mine. Look at these two lines in your results:
BenchmarkMap/0Writer/Sync/Rand/Maskfffff-4 3000000 509 ns/op
BenchmarkMap/0Writer/Lock/Rand/Maskfffff-4 10000000 183 ns/op
The Lock map is 2x faster. This should never happen. In a "read only" test, the sync.Map should behave with the properties of a standard map, so that should be the bound, with the "lock" version higher (lower performance) since it has the locking overhead.
@as Actually, it starts even earlier:
BenchmarkMap/0Writer/Sync/Rand/Maskfff-4 20000000 80.6 ns/op
BenchmarkMap/0Writer/Sync/Rand/Maskffff-4 5000000 343 ns/op
BenchmarkMap/0Writer/Sync/Rand/Maskfffff-4 3000000 509 ns/op
BenchmarkMap/0Writer/Lock/Rand/Maskfff-4 50000000 30.3 ns/op
BenchmarkMap/0Writer/Lock/Rand/Maskffff-4 20000000 121 ns/op
BenchmarkMap/0Writer/Lock/Rand/Maskfffff-4 10000000 183 ns/op
In all of these cases, the lock is faster than the sync.
@as your results don't seem consistent even within a category, as in:
BenchmarkMap/0Writer/Sync/Rand/Maskfffff-4 3000000 509 ns/op
BenchmarkMap/1Writer/Sync/Rand/Maskfffff-4 5000000 398 ns/op
BenchmarkMap/2Writer/Sync/Rand/Maskfffff-4 5000000 403 ns/op
I don't think that adding writers should make the reader perform faster... I think this is only possible if the writers are putting more elements into the L1/L2/L3 cache... unless something in the read-only workload is causing mass cache evictions - maybe an interaction between atomic.Value and the GC? Not sure, but there is definitely a problem here. An easy way to test this is to use multiple readers only... my "multi-get" tests that, and it still shows the "incorrect" results.
@as As an aside, I'm sorry that you needed to rewrite my tests - the latest version I posted is far easier to follow and maintain - simpler, I think, than even yours, but to each his own. I'm happy you were able to duplicate my findings.
I'm not sure I understand the benchmarks. It's not clear to me that they are testing the case that sync.Map is designed for.
They are, IMO. The 'sync' map should not perform worse than 'lock' in a primarily-read mode (in this case it is read-only, but you could sprinkle a few writes in there if you want). The implementation of sync.Map is an atomic read of the value (the standard map), then an access of the standard map. So the only performance difference between sync.Map and the std map in a read-only scenario should be the cost of the atomic read, and in a read-only environment this cost is negligible as the value will already be in the cache. There is definitely something wrong. It is why I included the "equivalent" Java tests - in a read-only (or rarely-written) workload, sync.Map should perform the same as an unsynchronized std map - that is the point of lock-free data structures.
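A greatly simplified sketch of the read path described above (the real sync.Map in sync/map.go additionally stores *entry values and atomically loads a pointer per key, which comes up later in this thread): the reader performs one atomic load of a read-only view and then an ordinary map lookup.

```go
package maps

import "sync/atomic"

// simpleSyncMap is a toy model of the read path only; it is not sync.Map.
type roMap struct {
	m map[int]int // never mutated after being published
}

type simpleSyncMap struct {
	read atomic.Value // holds an roMap
}

func (s *simpleSyncMap) Load(key int) (int, bool) {
	ro, _ := s.read.Load().(roMap) // one atomic load of the read-only view...
	v, ok := ro.m[key]             // ...then an ordinary, unlocked map lookup
	return v, ok
}
```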
@ianlancetaylor the tests by @as exercise "read only" access in the 0Writer cases, which demonstrate the problem. His tests also show that the reader gets faster as writers are added - which should not be the case according to the documentation and intended usage of sync.Map either...
The point of sync.Map is concurrent access from multiple goroutines; a benchmark driven by a single goroutine does not exercise that case.
@ianlancetaylor that is what my "multi-get" test does, and it shows the same problem - but even still, a single reader and no writer should not perform worse than a lock-based solution - it is a single atomic read. If it does, there is something wrong with atomic.Value.
I guess the powers that be can just ignore this, and claim this is the way it is supposed to work, but I've been working with lock-free structures for quite a while, and this is not a traditional performance profile.
Modified the test so that
The behavior is expected and conforms to the documentation. Results
However, there is an interesting (but likely irrelevant) observation: a large access window (i.e., the entire 2^20 keyspace) results in a 2x slower sync.Map when that access is random (and there are no writers and only one reader). However, when the access is sequential, we see performance on par with the lock.
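The two access patterns being compared could be generated like this (a hypothetical sketch; `mask` bounds the key window, and the names are not from the benchmarked code):

```go
package maps_test

import "math/rand"

// randKeys jumps around the key window: for a large mask this defeats the
// CPU caches, which is the case where sync.Map falls behind.
func randKeys(r *rand.Rand, mask int) func() int {
	return func() int { return r.Int() & mask }
}

// incrKeys walks the window sequentially, a cache-friendly (but unrealistic
// for a hash map) access pattern.
func incrKeys(mask int) func() int {
	i := 0
	return func() int { i = (i + 1) & mask; return i }
}
```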
@as that is exactly the problem... but also, it is not on par for incr - it is 30% slower. That doesn't matter much, though, because a map is NOT designed to be read incrementally - that is an invalid use case - it is designed for random access, so being 3x slower than when using locks means a broken implementation somewhere. 30% is a problem for a common structure; 300% is a severe issue.
What no one seems to accept is that even 1/8 writes is far too many - a common data structure in most HFT and HPC systems is a lock-free "read-only" concurrent cache (i.e. very, very few writes). It should be on par with an unsynchronized map when doing reads (which is why I also included the Java ConcurrentHashMap examples), but still allow some updates.
I think your tests show that it doesn't match the documentation, to quote:
// The Map type is optimized for two common use cases: (1) when the entry for a given
// key is only ever written once but read many times, as in caches that only grow,
// or ...
So clearly this implementation is not optimized for goal 1, and your tests with 1/8 writes are not testing goal 1 either. My tests are designed to test goal 1; that is why the map is pre-populated, and the 'get' tests are the most important. I really can't see how multiple people are arguing against this - it is almost as if you are afraid to admit there is an underlying problem, and it is easier to argue that "well, it works according to the design", but even that is an invalid argument.
I don't think you are benchmarking your shareshard. robaho/go-concurrency-test@aed8afc
@as correct, bummer... I knew it was too good to be true... still trying to figure out where the slowness is coming from... I originally wrote a different "shared shard" with my own atomic.Value holding a standard map, and it worked well for reads, but the writes were incredibly slow. ** I deleted the comment with timings as it was noise.
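The "atomic.Value holding a standard map" approach mentioned here is essentially copy-on-write; a sketch (under the assumption of a mutex serializing writers - not the author's deleted code) shows why its reads are cheap while its writes are very slow: every Store copies the entire map.

```go
package maps

import (
	"sync"
	"sync/atomic"
)

type cowMap struct {
	mu sync.Mutex   // serializes writers only
	v  atomic.Value // holds a map[int]int that readers use without locking
}

func (c *cowMap) Load(k int) (int, bool) {
	m, _ := c.v.Load().(map[int]int)
	val, ok := m[k]
	return val, ok
}

func (c *cowMap) Store(k, val int) {
	c.mu.Lock()
	defer c.mu.Unlock()
	old, _ := c.v.Load().(map[int]int)
	next := make(map[int]int, len(old)+1)
	for key, v := range old { // the whole map is copied on every write
		next[key] = v
	}
	next[k] = val
	c.v.Store(next)
}
```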
OK, so I think I figured out why the performance of sync.Map is poor in this case. Because Map uses the standard map implementation, which isn't designed for CAS/concurrency, it has no choice but to add an extra layer of indirection in order to perform the CAS on an individual key/value pair. As the map working set increases, this level of indirection adds hundreds of nanoseconds to each lookup, since the entry must be loaded (map lookup), then the pointer loaded atomically from the entry, then the pointer dereferenced into the value - whereas the standard map needs only the first map retrieval. There is the 3x slowness. This is why the newly created "shared sharded" map, which is just multiple map[int]int shards, performs the same as the lock map for concurrent reads and outperforms it for concurrent writes (the shard lookup will always be in the cache, followed by a single map lookup). Here are the performance numbers.
With 2 concurrent Go routines (multi):
BenchmarkMain/shareshard.get-8 20000000 112 ns/op
BenchmarkMain/shareshard.put-8 10000000 127 ns/op
BenchmarkMain/shareshard.putget-8 10000000 239 ns/op
BenchmarkMain/shareshard.multiget-8 10000000 123 ns/op
BenchmarkMain/shareshard.multiput-8 10000000 147 ns/op
BenchmarkMain/shareshard.multiputget-8 5000000 294 ns/op
and with 8 concurrent Go routines (multi):
BenchmarkMain/shareshard.get-8 20000000 114 ns/op
BenchmarkMain/shareshard.put-8 10000000 130 ns/op
BenchmarkMain/shareshard.putget-8 5000000 252 ns/op
BenchmarkMain/shareshard.multiget-8 10000000 188 ns/op
BenchmarkMain/shareshard.multiput-8 5000000 284 ns/op
BenchmarkMain/shareshard.multiputget-8 2000000 846 ns/op
These are far superior to the standard "lock" and "sync" maps, but are still 20% slower in the common read-only case - which is why I am going to write a shared intmap that uses CAS at the table-row level... stay tuned.
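For reference, a sharded lock map along the lines described - each shard is an ordinary map guarded by its own RWMutex, with the shard chosen from the key - might look like the sketch below. The shard count and the assumption of non-negative int keys are illustrative; the benchmarked code lives in the linked repository and may differ.

```go
package maps

import "sync"

const numShards = 16 // assumption: any power of two works

type shard struct {
	mu sync.RWMutex
	m  map[int]int
}

type shardedMap struct {
	shards [numShards]shard
}

func newShardedMap() *shardedMap {
	s := &shardedMap{}
	for i := range s.shards {
		s.shards[i].m = make(map[int]int)
	}
	return s
}

// shardFor assumes non-negative keys; the low bits pick the shard.
func (s *shardedMap) shardFor(k int) *shard { return &s.shards[k&(numShards-1)] }

func (s *shardedMap) Load(k int) (int, bool) {
	sh := s.shardFor(k)
	sh.mu.RLock()
	v, ok := sh.m[k]
	sh.mu.RUnlock()
	return v, ok
}

func (s *shardedMap) Store(k, v int) {
	sh := s.shardFor(k)
	sh.mu.Lock()
	sh.m[k] = v
	sh.mu.Unlock()
}
```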
The "shared intmap" shows the best true concurrent performance so far: with 2 Go routines: BenchmarkMain/sharedint.get-8 20000000 61.4 ns/op BenchmarkMain/sharedint.put-8 10000000 166 ns/op BenchmarkMain/sharedint.putget-8 10000000 199 ns/op BenchmarkMain/sharedint.multiget-8 20000000 64.0 ns/op BenchmarkMain/sharedint.multiput-8 10000000 170 ns/op BenchmarkMain/sharedint.multiputget-8 10000000 218 ns/op with 8 Go routines: BenchmarkMain/sharedint.get-8 20000000 59.0 ns/op BenchmarkMain/sharedint.put-8 10000000 167 ns/op BenchmarkMain/sharedint.putget-8 10000000 198 ns/op BenchmarkMain/sharedint.multiget-8 20000000 116 ns/op BenchmarkMain/sharedint.multiput-8 5000000 233 ns/op BenchmarkMain/sharedint.multiputget-8 5000000 404 ns/op which suggest that the performance problems with sync.Map are fundamental. I believe the only way to fix would be to write a much lower level sync.Map that mimicked the internal map but that supported concurrency behavior similar to Java's ConcurrentHashMap - probably the most often used structure in Java when developing concurrent applications. The current sync.Map implementation that leverages the existing internal map is not going to work since it was not designed for concurrent access & modification. Writing your own concurrent maps is also very cumbersome, as the internal "hash code" functionality is not available, which means all hashing code needs to be duplicated. |
Lastly, I wrote a true CAS shared intmap that is fully concurrent (although fixed size, no removals, etc.). The performance numbers:
with 2 Go routines:
BenchmarkMain/sharedint.get-8 30000000 53.5 ns/op
BenchmarkMain/sharedint.put-8 10000000 126 ns/op
BenchmarkMain/sharedint.putget-8 10000000 198 ns/op
BenchmarkMain/sharedint.multiget-8 30000000 56.3 ns/op
BenchmarkMain/sharedint.multiput-8 10000000 128 ns/op
BenchmarkMain/sharedint.multiputget-8 10000000 208 ns/op
with 8 Go routines:
BenchmarkMain/sharedint.get-8 30000000 54.1 ns/op
BenchmarkMain/sharedint.put-8 10000000 128 ns/op
BenchmarkMain/sharedint.putget-8 10000000 202 ns/op
BenchmarkMain/sharedint.multiget-8 20000000 97.2 ns/op
BenchmarkMain/sharedint.multiput-8 10000000 158 ns/op
BenchmarkMain/sharedint.multiputget-8 5000000 224 ns/op
The performance numbers show what is possible with a true fully concurrent map. There is almost no degradation as the concurrency rises (most in this case can be attributed to the test machine only having 4 real cores, 4 hyper-threaded). The sync.Map implementation needs to be updated to use a similar methodology, leveraging the internal map/hash support functions. As it is now, a "sharded lock map" is a better solution than sync.Map for read-heavy use cases, as the "shared intmap" would require lots of custom code for general usage.
For summary analysis, here are the complete timings using 8 Go routines during (multi):
BenchmarkMain/unshared.get-8 20000000 87.5 ns/op
BenchmarkMain/unshared.put-8 20000000 97.6 ns/op
BenchmarkMain/unshared.putget-8 10000000 179 ns/op
BenchmarkMain/unshared.multiget-8 10000000 131 ns/op
BenchmarkMain/lock.get-8 20000000 111 ns/op
BenchmarkMain/lock.put-8 10000000 125 ns/op
BenchmarkMain/lock.putget-8 10000000 235 ns/op
BenchmarkMain/lock.multiget-8 5000000 379 ns/op
BenchmarkMain/lock.multiput-8 1000000 2485 ns/op
BenchmarkMain/lock.multiputget-8 1000000 2401 ns/op
BenchmarkMain/sync.get-8 5000000 294 ns/op
BenchmarkMain/sync.put-8 5000000 317 ns/op
BenchmarkMain/sync.putget-8 2000000 699 ns/op
BenchmarkMain/sync.multiget-8 3000000 504 ns/op
BenchmarkMain/sync.multiput-8 3000000 593 ns/op
BenchmarkMain/sync.multiputget-8 1000000 1035 ns/op
BenchmarkMain/channel.get-8 2000000 955 ns/op
BenchmarkMain/channel.put-8 3000000 564 ns/op
BenchmarkMain/channel.putget-8 1000000 1460 ns/op
BenchmarkMain/channel.multiget-8 200000 6345 ns/op
BenchmarkMain/channel.multiput-8 300000 4755 ns/op
BenchmarkMain/channel.multiputget-8 200000 9575 ns/op
BenchmarkMain/shard.get-8 20000000 100 ns/op
BenchmarkMain/shard.put-8 20000000 101 ns/op
BenchmarkMain/shard.putget-8 10000000 197 ns/op
BenchmarkMain/shard.multiget-8 10000000 146 ns/op
BenchmarkMain/shareshard.get-8 20000000 112 ns/op
BenchmarkMain/shareshard.put-8 10000000 128 ns/op
BenchmarkMain/shareshard.putget-8 5000000 241 ns/op
BenchmarkMain/shareshard.multiget-8 10000000 178 ns/op
BenchmarkMain/shareshard.multiput-8 5000000 276 ns/op
BenchmarkMain/shareshard.multiputget-8 2000000 879 ns/op
BenchmarkMain/intmap.get-8 20000000 105 ns/op
BenchmarkMain/intmap.put-8 10000000 212 ns/op
BenchmarkMain/intmap.putget-8 5000000 299 ns/op
BenchmarkMain/intmap.multiget-8 10000000 183 ns/op
BenchmarkMain/intmap.multiput-8 5000000 258 ns/op
BenchmarkMain/intmap.multiputget-8 3000000 456 ns/op
BenchmarkMain/intmap2.get-8 30000000 53.7 ns/op
BenchmarkMain/intmap2.put-8 10000000 130 ns/op
BenchmarkMain/intmap2.putget-8 10000000 200 ns/op
BenchmarkMain/intmap2.multiget-8 20000000 99.4 ns/op
BenchmarkMain/intmap2.multiput-8 10000000 151 ns/op
BenchmarkMain/intmap2.multiputget-8 10000000 228 ns/op
BenchmarkMain/sharedint.get-8 30000000 54.5 ns/op
BenchmarkMain/sharedint.put-8 10000000 127 ns/op
BenchmarkMain/sharedint.putget-8 10000000 206 ns/op
BenchmarkMain/sharedint.multiget-8 20000000 96.1 ns/op
BenchmarkMain/sharedint.multiput-8 10000000 152 ns/op
BenchmarkMain/sharedint.multiputget-8 10000000 221 ns/op
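A minimal sketch of the kind of fixed-size, CAS-per-slot int map being described - open addressing, no deletion, no resizing, with 32-bit keys and values packed into one word so a slot can be claimed and filled in a single CompareAndSwap. This is only an approximation under those stated assumptions; the author's actual code is in the linked repository.

```go
package maps

import "sync/atomic"

const tableSize = 1 << 21 // power of two, sized well above the expected entry count (assumption)

// casIntMap is an open-addressed, fixed-size map. Key 0 is reserved as the
// "empty slot" marker, and the table is assumed never to fill.
// Allocate with new(casIntMap).
type casIntMap struct {
	slots [tableSize]uint64 // high 32 bits: key, low 32 bits: value
}

func pack(k, v uint32) uint64 { return uint64(k)<<32 | uint64(v) }

func (m *casIntMap) Put(k, v uint32) {
	if k == 0 {
		panic("key 0 is reserved as the empty-slot marker in this sketch")
	}
	for i := int(k) & (tableSize - 1); ; i = (i + 1) & (tableSize - 1) {
		for {
			cur := atomic.LoadUint64(&m.slots[i])
			if cur != 0 && uint32(cur>>32) != k {
				break // slot owned by another key: probe the next slot
			}
			if atomic.CompareAndSwapUint64(&m.slots[i], cur, pack(k, v)) {
				return // claimed an empty slot, or updated this key's slot
			}
			// lost a CAS race on this slot: re-read it and try again
		}
	}
}

func (m *casIntMap) Get(k uint32) (uint32, bool) {
	for i := int(k) & (tableSize - 1); ; i = (i + 1) & (tableSize - 1) {
		cur := atomic.LoadUint64(&m.slots[i])
		if cur == 0 {
			return 0, false // reached an empty slot: key is absent
		}
		if uint32(cur>>32) == k {
			return uint32(cur), true
		}
	}
}
```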
I don't really understand what a thumbs-down means in this context. Why is that even an option on bug reports? If you disagree with the conclusions, then offer an alternative. I stand by them. The only way to fix sync.Map requires a different data structure, with reuse of the standard map internals.
Parity with the built-in map wouldn't tell us much: would it mean that |
Have you run benchstat on these numbers? It will help. |
That's #21031. Also note that your benchmarks add one more layer of pointer indirection when using |
Also note that much of the overhead here is cache contention, and N=4, as you are using for benchmarking, is low enough that the constant factors likely still dominate. (The original
See #21195. |
Sharding is one of several possible solutions proposed in #21035. |
@bcmills thanks for your input; it seems the Go team has a good handle on the issues. I'm not sure about the cache contention claim - I tested with 8x - and I'm more concerned with the "read only" workload, but that is minor. My bad for not finding these issues before filing this one - I just figured that for such an important, common data structure a lot of these would have already been addressed, so I was looking for more esoteric causes. I would suggest that the internal Go map benchmarks be enhanced to test larger maps, where cache locality and the number of indirections surface as a performance problem, as shown here.
I guess this can be closed as a duplicate given @bcmills' references. Given the lack of generics, I think the cited issues should be given higher priority - this is just too important a data structure to exhibit such poor performance.
Thanks. Closing as a duplicate of #21031, #21035, and #15292. If you're interested in concurrent data structures, you're certainly welcome to take a run at those. (Personally, I use |
@bcmills as I mentioned earlier, large "mainly read" shared maps are an important structure in many financial applications. As a final closing to this, I'll report the standard cases - idiomatic locks in Go, sync.Map, and Java's ConcurrentHashMap - with 6x concurrency for multi, and a 2^20 map with full range:
BenchmarkMain/lock.get-8 20000000 110 ns/op
BenchmarkMain/lock.put-8 10000000 129 ns/op
BenchmarkMain/lock.putget-8 5000000 240 ns/op
BenchmarkMain/lock.multiget-8 5000000 328 ns/op
BenchmarkMain/lock.multiput-8 1000000 1549 ns/op
BenchmarkMain/lock.multiputget-8 1000000 1819 ns/op
BenchmarkMain/sync.get-8 5000000 287 ns/op
BenchmarkMain/sync.put-8 5000000 327 ns/op
BenchmarkMain/sync.putget-8 2000000 706 ns/op
BenchmarkMain/sync.multiget-8 3000000 486 ns/op
BenchmarkMain/sync.multiput-8 3000000 535 ns/op
BenchmarkMain/sync.multiputget-8 2000000 957 ns/op
Benchmark (arg) Mode Cnt Score Error Units
TestJavaCache.Test0Get concurrent avgt 5 72.301 ± 2.706 ns/op
TestJavaCache.Test2Put concurrent avgt 5 168.011 ± 12.797 ns/op
TestJavaCache.Test3PutGet concurrent avgt 5 293.249 ± 31.050 ns/op
TestJavaCache.Test4MultiGet concurrent avgt 5 121.128 ± 5.543 ns/op
TestJavaCache.Test5MultiPut concurrent avgt 5 261.543 ± 14.179 ns/op
TestJavaCache.Test6MultiPutGet concurrent avgt 5 497.539 ± 34.685 ns/op
What version of Go are you using (go version)?
Does this issue reproduce with the latest release?
Yes
What operating system and processor architecture are you using (go env)?
darwin, osx, core i7
What did you do?
Please see the full test analysis here, but in summary: the performance of sync.Map is far worse than expected, even with read-only workloads.
Note, all of the maps are pre-populated, so get will always succeed, and put will always replace an existing element.
The "multi" above is when 2 Go routines are concurrently accessing the map.
What did you expect to see?
The performance of sync.Map for reads should be nearly equal to that of the standard built-in map. This is the demonstrated behavior of the similar Java ConcurrentHashMap.
What did you see instead?
sync.Map performance is nearly 3x slower in all cases, including when compared to using an RW mutex to control access to a shared standard map, even with uncontended usage.
Since the sync.Map implementation is lock-free, a 'read operation' should cost at most a single extra volatile read - to load the map reference. This points to a possible problem with atomic.Value?