Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: don't lock in-memory shares storage for database ops #1778

Open
wants to merge 4 commits into
base: stage
Choose a base branch
from

Conversation

anatolie-ssv
Copy link

@anatolie-ssv anatolie-ssv commented Oct 7, 2024

The approach taken here is to minimize the time spent with a locked shares map.
By separately locking the badger instance we allow access to the in-memory shares while the "heavy lifting" disk access is being executed.

@anatolie-ssv anatolie-ssv self-assigned this Oct 7, 2024
@anatolie-ssv anatolie-ssv marked this pull request as draft October 7, 2024 14:34
@anatolie-ssv anatolie-ssv changed the title Fix: don't lock in-memory shares storage for database ops fix: don't lock in-memory shares storage for database ops Oct 7, 2024
@anatolie-ssv
Copy link
Author

@moshe-blox I'm looking at ways to move the interaction with the database in a separate goroutine.
Right now the problem that I'm facing is the inability to propagate the potential database error from the goroutine to the method updating the in-memory shares map. So if there is an error from badger db we would find out about it only after we've made changes to the shares map. I'm not really sure how to get around this issue.

registry/storage/shares.go Outdated Show resolved Hide resolved
registry/storage/shares.go Outdated Show resolved Hide resolved
@iurii-ssv
Copy link
Contributor

iurii-ssv commented Oct 9, 2024

@moshe-blox I'm looking at ways to move the interaction with the database in a separate goroutine. Right now the problem that I'm facing is the inability to propagate the potential database error from the goroutine to the method updating the in-memory shares map. So if there is an error from badger db we would find out about it only after we've made changes to the shares map. I'm not really sure how to get around this issue.

One pattern I've seen for this is:

  • instead of returning an error (from func(s) like Save for example) you return chan error
  • Save implementation will pass this chan error to some background worker go-routine to either close it (important) or send an error through this channel
  • the caller of Save now has to receive from channel if he intends to handle the potential error (that might eventually come on this chan error), he can also handle it by spinning a go-routine of his own that will listen on this chan error and maybe log the result or something

It does complicate sharesStorage API and usage quite a bit (although, synchronous API can be created in the form of a wrapper on top of asynchronous one - for those callers who don't want to deal with chan error and just outright prefer wait for error instead), the question is whether there are real callers that can benefit from this ?

So if there is an error from badger db we would find out about it only after we've made changes to the shares map.

Save implementation itself can spin up a go-routine that will listen on chan error (similar construction to what I've mentioned above), and upon receiving error from background DB go-routine it can handle it however is considered meaningful (eg. it can remove it from in-memory map) before relaying this error to the original caller of Save (who will maybe re-try Save call after a while).

It doesn't really change the underlying assumptions this PR introduces - which is the consistency between in-memory data and DB data becomes eventual (meaning, you can call Save, and then another caller might call Range in parallel that will apply some fn to the newly saved object, and then Save will actually fail to store data in DB, so that in the end it turns out Range caller observed object that was never meant to be). Unless the implementation code is arranged in ad hoc manner (which is not always possible, tricky to implement correctly, and probably warrants some comments in the code that call it out) where for every update-operation we try to update DB first and only then update in-memory state - so that every read-operation can only read data after it has already been stored in DB).

@moshe-blox
Copy link
Contributor

moshe-blox commented Oct 10, 2024

@anatolie-ssv oh that's right during event sync we can't ignore DB errors, i forgot!

@iurii-ssv returning <-chan error is an interesting idea to simulate async/await behavior, with that said i like Go's approach of blocking by default and delegating async to the caller, who can simply spawn a goroutine if they wish, because it keeps the API trivial as you noted, and allows the caller to avoid unnecessary channels when they do need to block.

i think going for a pure solution might be too complex, because what does it mean if a database save fail — should we undo the changes we've done in the map?

maybe we should try to target the specific issue we have and witnessed causing a slowdown in the real world — that updating of thousands of share metadatas completely locks the storage for many seconds at a time

i'm not aware of other points of pressure around shares than that (so far!)

WDYT?

@iurii-ssv
Copy link
Contributor

iurii-ssv commented Oct 10, 2024

i'm not aware of other points of pressure around shares than that (so far!)

Yeah, that's what I was getting at. If the performance of fully blocking implementation is sufficient - there is no need to go for the harder solution.

with that said i like Go's approach of blocking by default and delegating async to the caller, who can simply spawn a goroutine if they wish, because it keeps the API trivial as you noted, and allows the caller to avoid unnecessary channels when they do need to block.

Yup, the simpler the better. I was just describing a potential solution (with it's tradeoffs) if we need to go down this path sometime in the future, because @anatolie-ssv mentioned he might want to consider it.

@moshe-blox
Copy link
Contributor

moshe-blox commented Oct 10, 2024

@iurii-ssv i've been discussing with @anatolie-ssv and indeed we're throwing away my original proposal, and instead expanding on his initial implementation with a few improvements (such as promoting the db mutex to being a parent lock of the in-memory mutex), which he's still playing around with to confirm it makes sense

hopefully @anatolie-ssv can fill you in on the details or present it once it's ready :)

@anatolie-ssv anatolie-ssv force-pushed the fix/lockless-share-storage-2 branch 2 times, most recently from 5ad0a20 to 688454c Compare October 11, 2024 14:41
@anatolie-ssv anatolie-ssv marked this pull request as ready for review October 13, 2024 04:46
@anatolie-ssv anatolie-ssv force-pushed the fix/lockless-share-storage-2 branch 3 times, most recently from 161c1f3 to ef25bf0 Compare October 14, 2024 10:16
Copy link
Contributor

@iurii-ssv iurii-ssv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, but it seems we lock dbmu for longer times than necessary (I assume it's only goal is to manage access to sharesStorage.db).

Comment on lines 61 to 62
dbmu sync.Mutex // parent lock for in-memory mutex
mu sync.RWMutex
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps more detailed descriptions wouldn't hurt (because it's not quite clear what parent really means here):

        // dbMtx is used to synchronize access to db
	dbMtx           sync.Mutex
	// sMtx is used to synchronize access to shares
	sMtx             sync.RWMutex

Copy link
Contributor

@moshe-blox moshe-blox Oct 15, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i'd suggest renaming to something like storageMtx and memoryMtx because it explains the fundamental difference between the two

registry/storage/shares.go Outdated Show resolved Hide resolved
@anatolie-ssv anatolie-ssv force-pushed the fix/lockless-share-storage-2 branch 6 times, most recently from d53f013 to 3257ec3 Compare October 17, 2024 09:57
Comment on lines +238 to +244
if err := s.validatorStore.handleSharesUpdated(updateShares...); err != nil {
return err
}

if err := s.validatorStore.handleSharesAdded(addShares...); err != nil {
return err
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would be nice to wrap errors here (because they come from different code branches might be hard to debug without it):

		if err := s.validatorStore.handleSharesUpdated(updateShares...); err != nil {
			return fmt.Erorrf("handleSharesUpdated: %w", err)
		}

		if err := s.validatorStore.handleSharesAdded(addShares...); err != nil {
			return fmt.Erorrf("handleSharesAdded: %w", err)
		}

@@ -210,48 +210,57 @@ func (s *sharesStorage) Save(rw basedb.ReadWriter, shares ...*types.SSVShare) er
}
}

s.mu.Lock()
defer s.mu.Unlock()
s.logger.Debug("save validators to shares storage", zap.Int("count", len(shares)))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: save validators to shares storage -> save validator shares to shares storage

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants