Protect cache integrity during reads #258
Conversation
Pull Request Test Coverage Report for Build 1426280710
💛 - Coveralls
Force-pushed from b65581e to dab66f1
/assign @dave-tucker
Should we immediately return an error if the client is used while reconnecting, and keep track of that with its own guarded flag?
client/client.go (Outdated)
}
startTime := time.Now()
// TODO(trozet) make this an exponential backoff
for !isCacheConsistent(db) || time.Since(startTime) > 20*time.Second {
oops this should be less than 20 sec
How would that work? Ovn-kube would be responsible for retrying the operation? We block on transactions until we are connected and ready to send. That timeout looks like it would be 10 seconds. Even if we attempt to block and wait for it, should we still keep track of it and return an error on timeout? Isn't a 20 second timeout too much?
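Purely as an illustration of the fix being discussed (comparing elapsed time with less-than instead of greater-than, plus the exponential backoff mentioned in the TODO), here is a minimal sketch. The helper name, its signature, and the backoff values are hypothetical, not the library's actual code:

```go
package client

import (
	"fmt"
	"time"
)

// waitForCacheConsistent is a hypothetical helper: keep polling while the
// cache is inconsistent, give up once the elapsed time exceeds the timeout,
// and back off exponentially between checks. isConsistent stands in for
// isCacheConsistent(db) from the diff above.
func waitForCacheConsistent(isConsistent func() bool, timeout time.Duration) error {
	start := time.Now()
	backoff := 10 * time.Millisecond
	for !isConsistent() {
		if time.Since(start) > timeout {
			return fmt.Errorf("timed out after %v waiting for a consistent cache", timeout)
		}
		time.Sleep(backoff)
		if backoff < time.Second {
			backoff *= 2
		}
	}
	return nil
}
```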
@trozet rather than hard-coding a timeout here, it would be better to propagate a …
We currently have the …
This is what I meant. Yes, it's an additional mutex, but at least it is very focused and specific.
We have a constant that defines the timeout as 10 secs, and I think there was a plan to bring it back down. Maybe we can use it for this as well. I guess that if we are not going to handle the error on the ovn-k side, returning an error makes no sense. Should we add a log if we are waiting here, so that if we see gaps in the logs we know this is the reason? What I keep thinking, though, is that we will eventually have controllers that handle events; for each event there is one transaction, and upon any error that transaction is not committed and the event is retried from the beginning. Then we can handle errors more gracefully.
@jcaamano deferredUpdates (plus the release of the cacheMutex lock) is kind of our signal that the client has reconnected and the monitor is completely set up. I know that is not as obvious as if it had its own lock/name, but I prefer not to add another mutex (there are already too many in here). I can add some comments around isCacheConsistent to better explain why deferredUpdates works?
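To make the "deferredUpdates as the signal, cacheMutex as the guard" idea concrete, here is a rough sketch under simplified assumptions. The struct, field types, and method bodies are illustrative only and do not mirror the real client.go:

```go
package client

import "sync"

// database is a simplified stand-in for the client's internal db state.
type database struct {
	cacheMutex      sync.RWMutex
	deferredUpdates []interface{} // non-nil/non-empty while updates are still being replayed after a reconnect
	cache           map[string]interface{}
}

// isCacheConsistent: deferredUpdates doubles as the "reconnect finished"
// signal, so an empty slice means the cache is safe to read.
func (db *database) isCacheConsistent() bool {
	return len(db.deferredUpdates) == 0
}

// get takes the read side of cacheMutex, so a concurrent reconnect, which
// holds the write side while it resets the cache, cannot race with the read.
func (db *database) get(key string) (interface{}, bool) {
	db.cacheMutex.RLock()
	defer db.cacheMutex.RUnlock()
	v, ok := db.cache[key]
	return v, ok
}
```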
Force-pushed from dab66f1 to 9b92eae
OK, updated and added the missing docs.
Force-pushed from 9b92eae to da66d76
LGTM, just needs a rebase.
Force-pushed from 378c86f to 91770f8
During an invalid cache state, the client will disconnect, then attempt
to reconnect to each endpoint. During this process reads against the cache
are not prohibited, which means a client could be reading stale data
from the cache.
During reconnect we use the meta db.cacheMutex (not the cache mutex) to
control resetting the db's cache. This patch leverages that to guard
reads of the cache with the same mutex. Additionally, it tries to
ensure that the cache will be in a consistent state when the read takes
place. The db.cacheMutex is not held for the entire reconnect process,
so we need to make some attempt to wait for a signal that a reconnect is
complete: a best-effort attempt to give the client an accurate cache
read.
Signed-off-by: Tim Rozet [email protected]
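As a small, self-contained illustration of the behavior the commit message describes, using a made-up fakeDB rather than the real client: a read waits, best effort and up to a deadline, for the cache to become consistent again, then serves the value under the same mutex the reconnect path holds while resetting it.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// fakeDB is a toy stand-in for the client's cache state; it is not the
// library's API.
type fakeDB struct {
	mu         sync.RWMutex
	consistent bool
	data       map[string]string
}

// read polls until the cache is consistent or the deadline passes, taking the
// read side of the mutex so a concurrent "reconnect" cannot reset the cache
// underneath it.
func (db *fakeDB) read(key string, timeout time.Duration) (string, error) {
	deadline := time.Now().Add(timeout)
	for {
		db.mu.RLock()
		if db.consistent {
			v := db.data[key]
			db.mu.RUnlock()
			return v, nil
		}
		db.mu.RUnlock()
		if time.Now().After(deadline) {
			return "", fmt.Errorf("cache still inconsistent after %v", timeout)
		}
		time.Sleep(50 * time.Millisecond)
	}
}

func main() {
	db := &fakeDB{data: map[string]string{"lsp": "port1"}}

	// Simulate a reconnect finishing 200ms from now.
	go func() {
		time.Sleep(200 * time.Millisecond)
		db.mu.Lock()
		db.consistent = true
		db.mu.Unlock()
	}()

	v, err := db.read("lsp", time.Second)
	fmt.Println(v, err)
}
```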