
feat: Support concurrency for IAVL and fix Racing conditions #805

Merged
merged 6 commits into cosmos:master on Aug 23, 2023

Conversation

mattverse
Contributor

cref: #696

This PR fixes the problem stated above by replacing the maps that cause race conditions with sync.Map, so that access is thread safe.
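
To illustrate the shape of the change (a minimal sketch with hypothetical type and field names, not the actual iavl types): a plain map read and written from multiple goroutines can trigger Go's fatal "concurrent map read and map write" error, while sync.Map handles that synchronization internally.

package main

import (
	"fmt"
	"sync"
)

// unsafeCache sketches the old layout: a bare map with no synchronization.
// Concurrent reads and writes on it can crash the process.
type unsafeCache struct {
	additions map[string][]byte
}

// safeCache sketches the new layout: sync.Map is safe for concurrent use
// by multiple goroutines without additional locking.
type safeCache struct {
	additions sync.Map // string -> []byte
}

func main() {
	c := safeCache{}
	c.additions.Store("fastnode-key", []byte("value"))
	if v, ok := c.additions.Load("fastnode-key"); ok {
		fmt.Printf("%s\n", v)
	}
}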

Tested this on various chains' nodes that were having the concurrency issue by applying the patch manually, and confirmed that after applying this patch the concurrency issue no longer occurs. (Special shout-out to @keplr-wallet for helping out with testing!)

Also confirmed and tested on a testnet node with the following procedure:

  • Have node running
  • Spam it with tx bots and query bots
  • Previously the node would crash within 1, at most 2, minutes with the race-condition logs mentioned above

@mattverse mattverse requested a review from a team as a code owner August 7, 2023 09:01
@mattverse mattverse changed the title Support concurrency for IAVL and fix Racing conditions feat: Support concurrency for IAVL and fix Racing conditions Aug 7, 2023
Member

@p0mvn p0mvn left a comment


Great work! Only minor comments.

Do we have a good understanding of the performance changes under normal conditions? It might be useful to run the benchmarks before and after this change and compare them with benchstat.
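
For reference, a typical benchstat workflow might look like the following (package paths and file names are placeholders; benchstat is the tool from golang.org/x/perf):

# on master, record the baseline
go test -run='^$' -bench=. -count=10 ./... > old.txt
# on this branch, record the new numbers
go test -run='^$' -bench=. -count=10 ./... > new.txt
# compare the two runs
go install golang.org/x/perf/cmd/benchstat@latest
benchstat old.txt new.txt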

		// If next fast node from disk is to be removed, skip it.
		iter.fastIterator.Next()
		iter.Next()
		return
	}

	nextUnsavedKey := iter.unsavedFastNodesToSort[iter.nextUnsavedNodeIdx]
-	nextUnsavedNode := iter.unsavedFastNodeAdditions[string(nextUnsavedKey)]
+	nextUnsavedNodeVal, _ := iter.unsavedFastNodeAdditions.Load(nextUnsavedKey)
Member


We should probably check whether the second return value was set to true before proceeding

@@ -196,7 +203,8 @@ func (iter *UnsavedFastIterator) Next() {
	// if only unsaved nodes are left, we can just iterate
	if iter.nextUnsavedNodeIdx < len(iter.unsavedFastNodesToSort) {
		nextUnsavedKey := iter.unsavedFastNodesToSort[iter.nextUnsavedNodeIdx]
-		nextUnsavedNode := iter.unsavedFastNodeAdditions[string(nextUnsavedKey)]
+		nextUnsavedNodeVal, _ := iter.unsavedFastNodeAdditions.Load(nextUnsavedKey)
Member


Might be worthwhile checking the second return value here as well
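
A minimal sketch of the suggested check (the valid field and the exact recovery behaviour are hypothetical; the real iterator may need different handling):

nextUnsavedNodeVal, ok := iter.unsavedFastNodeAdditions.Load(nextUnsavedKey)
if !ok {
	// The key is missing from the sync.Map (e.g. removed concurrently);
	// bail out instead of dereferencing a nil value below.
	iter.valid = false
	return
}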

Member

@tac0turtle tac0turtle left a comment


amazing changes! do you know if there were any performance regressions by using sync.Map?

@mattverse
Contributor Author

mattverse commented Aug 7, 2023

Oh sweet! I'll get back tmrw with the benchstats @p0mvn suggested

@yihuang
Collaborator

yihuang commented Aug 9, 2023

https://medium.com/@nikolay.dimitrov/comparing-go-performance-of-mutex-sync-map-and-channels-for-syncing-a-map-of-counters-61810c201e17

I just ran this benchmark script locally, and I got results like this:

Start measuring Mutex ...
Finished measuring Mutex, time taken: 928.267951ms

Start measuring sync.Map ...
Finished measuring sync.Map, time taken: 3.284460863s

Start measuring Channels ...
Finished measuring Channels, time taken: 3.524194982s

So it seems sync.Map is slower than a mutex-protected normal map; how about we simply add a good old mutex to the mutable tree?
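
For a more apples-to-apples comparison, a benchstat-friendly pair of benchmarks over a mixed read/write workload might look like this (the synthetic keys and the 90/10 read/write split are assumptions, not the iavl access pattern):

package cache_test

import (
	"strconv"
	"sync"
	"testing"
)

// BenchmarkMutexMap runs concurrent readers and writers against an RWMutex-guarded map.
func BenchmarkMutexMap(b *testing.B) {
	var mu sync.RWMutex
	m := make(map[string]int)
	b.RunParallel(func(pb *testing.PB) {
		i := 0
		for pb.Next() {
			key := strconv.Itoa(i % 1024)
			if i%10 == 0 { // ~10% writes
				mu.Lock()
				m[key] = i
				mu.Unlock()
			} else { // ~90% reads
				mu.RLock()
				_ = m[key]
				mu.RUnlock()
			}
			i++
		}
	})
}

// BenchmarkSyncMap runs the same mixed workload against sync.Map.
func BenchmarkSyncMap(b *testing.B) {
	var m sync.Map
	b.RunParallel(func(pb *testing.PB) {
		i := 0
		for pb.Next() {
			key := strconv.Itoa(i % 1024)
			if i%10 == 0 {
				m.Store(key, i)
			} else {
				m.Load(key)
			}
			i++
		}
	})
}

Running both with -count=10 on each branch and feeding the outputs to benchstat, as suggested above, would give a comparison closer to the actual read-heavy workload.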

@mattverse
Contributor Author

@yihuang Previously I attempted to use good old mutexes instead of sync.Map (previous attempt here: mattverse@34d4565), but I changed to sync.Map for a couple of reasons.

  1. Using mutexes gives us the overhead of having to manage where we lock and unlock manually. The code becomes very dirty, and since we are locking and unlocking by hand we might miss the parts that actually cause concurrency issues. In my previous attempt, although I added mutex locks and unlocks in the places where I thought the concurrency issue was happening, it did not seem to solve the problem (although it did relieve some of it). I suspect this is because I missed locking and unlocking in unknown parts of the code.

  2. The given bench test does not seem fair for comparing sync.Map and mutexes. sync.Map performs especially well with fewer writes and more reads, whereas the given bench test writes iteratively, which would indeed produce better bench stats for mutexes.

lmk your thoughts on this!

@cool-develope
Collaborator

I am still confused about the parallel use case of IAVL; parallel writing will break the tree structure, so the root hash will not be deterministic. Actually, IAVL doesn't provide concurrency now.
@yihuang, do you think it is a real issue?

@yihuang
Collaborator

yihuang commented Aug 11, 2023

I am still confused about the parallel use case of IAVL; parallel writing will break the tree structure, so the root hash will not be deterministic. Actually, IAVL doesn't provide concurrency now.

@yihuang, do you think it is a real issue?

I guess it's concurrent reading and writing, still single writer but multiple readers?

@tac0turtle
Member

I am still confused about the parallel use case of IAVL; parallel writing will break the tree structure, so the root hash will not be deterministic. Actually, IAVL doesn't provide concurrency now. @yihuang, do you think it is a real issue?

This is due to the fastnode system: there is an issue with reading and writing that causes a concurrent hashmap error. This PR aims to fix that issue. It's less about using the IAVL tree in parallel.

@yihuang
Collaborator

yihuang commented Aug 11, 2023

My concern is consistency: nodedb uses a mutex to protect some map operations, but here we use sync.Map. Shouldn't we use a consistent approach to concurrency?

@cool-develope
Collaborator

My concern is consistency: nodedb uses a mutex to protect some map operations, but here we use sync.Map. Shouldn't we use a consistent approach to concurrency?

yeah, I think RWMutex is more reasonable in this case
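
A minimal sketch of what the RWMutex approach would look like (struct and field names are illustrative, not the actual mutable tree fields):

package main

import "sync"

// fastNodeCache guards a plain map with a sync.RWMutex, mirroring how
// nodedb already protects its internal maps.
type fastNodeCache struct {
	mtx       sync.RWMutex
	additions map[string][]byte
}

func (c *fastNodeCache) Get(key string) ([]byte, bool) {
	c.mtx.RLock()
	defer c.mtx.RUnlock()
	v, ok := c.additions[key]
	return v, ok
}

func (c *fastNodeCache) Set(key string, value []byte) {
	c.mtx.Lock()
	defer c.mtx.Unlock()
	if c.additions == nil {
		c.additions = make(map[string][]byte)
	}
	c.additions[key] = value
}

func main() {
	c := &fastNodeCache{}
	c.Set("k", []byte("v"))
	_, _ = c.Get("k")
}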

@mattverse
Contributor Author

mattverse commented Aug 13, 2023

ok, if that's the consensus, I'm down to change it to use sync.RWMutex and re-open PR!

@tac0turtle
Member

@mattverse any chance you can make the new pr? otherwise we can do it

@mattverse
Contributor Author

@tac0turtle Code is ready, currently testing, give me 1~2 more days!

@mattverse
Contributor Author

mattverse commented Aug 21, 2023

I'm half skeptical it would work; applying the RWMutex was a pain, and a lot of places caused deadlocks.
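
For context, the classic way an RWMutex retrofit deadlocks: a method takes the write lock and then calls a helper that tries to take the read lock on the same (non-reentrant) mutex. A minimal sketch, illustrative only and not the actual iavl code:

package main

import "sync"

type tree struct {
	mtx sync.RWMutex
	m   map[string]int
}

// get acquires the read lock itself.
func (t *tree) get(k string) int {
	t.mtx.RLock()
	defer t.mtx.RUnlock()
	return t.m[k]
}

// set acquires the write lock and then calls get, which blocks forever
// waiting for the read lock while set still holds the write lock.
func (t *tree) set(k string, v int) {
	t.mtx.Lock()
	defer t.mtx.Unlock()
	_ = t.get(k) // self-deadlock: sync.RWMutex is not reentrant
	t.m[k] = v
}

func main() {
	t := &tree{m: map[string]int{}}
	t.set("a", 1) // this call hangs
}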

@mattverse
Contributor Author

Yep, getting an error:

{"level":"info","module":"consensus","commit_round":0,"height":2277484,"commit":"CC3E91B29F5D0820397C70A0EA459E5DA2AEC9F83B006597BAB2D7FA79B00A29","proposal":{},"time":"2023-08-21T11:08:01Z","message":"commit is for a block we do not know about; set ProposalBlock=nil"}
{"level":"info","module":"consensus","hash":"CC3E91B29F5D0820397C70A0EA459E5DA2AEC9F83B006597BAB2D7FA79B00A29","height":2277484,"time":"2023-08-21T11:08:01Z","message":"received complete proposal block"}

state.go:1563 +0x2ff\ngithub.com/tendermint/tendermint/consensus.(*State).handleCompleteProposal(0xc000f1bc00, 0xc0013cca00?)\n\tgithub.com/tendermint/[email protected]/consensus/state.go:1942 +0x399\ngithub.com/tendermint/tendermint/consensus.(*State).handleMsg(0xc000f1bc00, {{0x2a9ea40, 0xc0099e38a8}, {0xc00316f650, 0x28}})\n\tgithub.com/tendermint/[email protected]/consensus/state.go:834 +0x1b7\ngithub.com/tendermint/tendermint/consensus.(*State).receiveRoutine(0xc000f1bc00, 0x0)\n\tgithub.com/tendermint/[email protected]/consensus/state.go:760 +0x3f0\ncreated by github.com/tendermint/tendermint/consensus.(*State).OnStart\n\tgithub.com/tendermint/[email protected]/consensus/state.go:379 +0x12d\n","time":"2023-08-21T11:08:01Z","message":"CONSENSUS FAILURE!!!"}
{"level":"info","module":"consensus","wal":"/root/.osmosisd/data/cs.wal/wal","impl":{"Logger":{}},"msg":{},"time":"2023-08-21T11:08:01Z","message":"service stop"}
{"level":"info","module":"consensus","wal":"/root/.osmosisd/data/cs.wal/wal","impl":{"Logger":{},"ID":"group:U/panic.go:884 +0x213\ngithub.com/tendermint/tendermint/consensus.(*State).finalizeCommit(0xc000f1bc00, 0x22c06c)\n\tgithub.com/tendermint/[email protected]/consensus/state.go:1594 +0x1105\ngithub.com/tendermint/tendermint/consensus.(*State).tryFinalizeCommit(0xc000f1bc00, 0x22c06c)\n\tgithub.com/tendermint/[email protected]/consensus/state.go:1563 +0x2ff\ngithub.com/tendermint/tendermint/consensus.(*State).handleCompleteProposal(0xc000f1bc00, 0xc0013cca00?)\n\tgithub.com/tendermint/[email protected]/consensus/state.go:1942 +0x399\ngithub.com/tendermint/tendermint/consensus.(*State).handleMsg(0xc000f1bc00, {{0x2a9ea40, 0xc0099e38a8}, {0xc00316f650, 0x28}})\n\tgithub.com/tendermint/[email protected]/consensus/state.go:834 +0x1b7\ngithub.com/tendermint/tendermint/consensus.(*State).receiveRoutine(0xc000f1bc00, 0x0)\n\tgithub.com/tendermint/[email protected]/consensus/state.go:760 +0x3f0\ncreated by github.com/tendermint/tendermint/consensus.(*State).OnStart\n\tgithub.com/tendermint/[email protected]/consensus/state.go:379 +0x12d\n","time":"2023-08-21T11:08:01Z","message":"CONSENSUS FAILURE!!!"}
{"level":"info","module":"consensus","wal":"/root/.osmosisd/data/cs.wal/wal","impl":{"Logger":{}},"msg":{},"time":"2023-08-21T11:08:01Z","message":"service stop"}
{"level":"info","module":"consensus","wal":"/root/.osmosisd/data/cs.wal/wal","impl":{"Logger":{},"ID":"group:UL0yDN8TXC0W:/root/.osmosisd/data/cs.wal/wal","Head":{"ID":"UL0yDN8TXC0W:/root/.osmosisd/data/cs.wal/wal","Path":"/root/.osmosisd/data/cs.wal/wal"},"Dir":"/root/.osmosisd/data/cs.wal"},"msg":{},"time":"2023-08-21T11:08:01Z","message":"service stop"}
{"level":"info","module":"p2p","peer":{"id":"0e6f924a3afe15996bc3f2612a9dd3bcf1c64212","ip":"222.106.187.14","port":53400},"impl":"Peer{MConn{222.106.187.14:53400} 0e6f924a3afe15996bc3f2612a9dd3bcf1c64212 out}","msg":{},"time":"2023-08-21T11:08:01Z","message":"service start"}
{"level":"info","module":"p2p","peer":{"id":"0e6f924a3afe15996bc3f2612a9dd3bcf1c64212","ip":"222.106.187.14","port":53400},"impl":"MConn{222.106.187.14:53400}","msg":{},"time"

https://github.com/mattverse/iavl/tree/mattverse/rwmutex-iavl This was the attempt I made; I would be more than happy if you guys want to take it over from this point! Lmk of any updates.

@tac0turtle
Member

I'd say for the current IAVL this PR is fine as is. In IAVL 2.0 we will remove the fastnode system, so it's only present for a little while. @cool-develope @yihuang thoughts?

@cool-develope
Collaborator

I don't think the above error is related to RWMutex or sync.Map; it looks like parallel writing is being attempted and breaking determinism.
But yeah, I agree with the temporary solution if this PR resolves the issues.

@yihuang
Collaborator

yihuang commented Aug 22, 2023

I'd say for the current IAVL this PR is fine as is. In IAVL 2.0 we will remove the fastnode system, so it's only present for a little while. @cool-develope @yihuang thoughts?

no problem, since it's simpler.

@tac0turtle tac0turtle merged commit ba6beb1 into cosmos:master Aug 23, 2023
6 of 7 checks passed
@tac0turtle
Member

@Mergifyio backport release/v1.x.x

@mergify
Contributor

mergify bot commented Aug 23, 2023

backport release/v1.x.x

✅ Backports have been created

mergify bot pushed a commit that referenced this pull request Aug 23, 2023
tac0turtle pushed a commit that referenced this pull request Aug 23, 2023
@mattverse
Contributor Author

@tac0turtle Can we get this in for sdk v0.50?

@tac0turtle
Member

Yup, already backported. We will test it on existing chains for a bit, then cut the final release alongside the SDK Eden release.

@mattverse
Contributor Author

mattverse commented Aug 23, 2023

Sweet, should I also create subsequent PRs for old major versions so that it's compatible with old SDK versions as well? It's not state-breaking, so it should be patchable.

@tac0turtle
Member

we would need to backport it to v0.20 for older versions. We can do this and bump the old ones.

@mattverse
Contributor Author

mattverse commented Aug 23, 2023

OK, so the first step would be to backport to v0.20, and then we'd be able to bump the older ones.
Do you want me to handle the merge conflicts for backporting it to v0.20?

Member

@tac0turtle tac0turtle left a comment


@mattverse seems like this version differs from the original a little; it is causing some deadlocks in testing within the SDK.

@mattverse
Contributor Author

mattverse commented Aug 24, 2023

@tac0turtle wdym by "original"? The one without the changes? If so, which version of the SDK are you testing with? My node seemed fine.

@tac0turtle
Member

Sorry, different issue. git bisect led to this first, but it was wrong.
