Protect cache integrity during reads #258

Merged
merged 1 commit into ovn-org:main on Nov 5, 2021

Conversation

trozet
Contributor

@trozet trozet commented Nov 3, 2021

During an invalid cache state, the client will disconnect, then attempt to
reconnect to each endpoint. During this process, reads against the cache
are not prohibited, which means a client could be reading stale data from
the cache.

During reconnect we use the meta db.cacheMutex (not the cache's own mutex)
to control resetting the db's cache. This patch leverages that same mutex
to guard reads of the cache. Additionally, it tries to ensure that the
cache is in a consistent state when the read takes place. The db.cacheMutex
is not held for the entire reconnect process, so we make some attempt to
wait for a signal that the reconnect is complete: a best-effort attempt to
give the client an accurate cache read.

Signed-off-by: Tim Rozet [email protected]
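
To illustrate the guard described in the commit message, here is a minimal sketch in Go: a read takes a read lock on the same db.cacheMutex that the reconnect path uses while resetting the cache. The db type, field names, and get helper below are assumptions for illustration, not the actual libovsdb client code.

```go
package main

import (
	"fmt"
	"sync"
)

// db mirrors the idea in the description: a meta cacheMutex that the
// reconnect path holds while resetting the cache. Field names are
// illustrative only, not the real libovsdb internals.
type db struct {
	cacheMutex sync.RWMutex
	cache      map[string]string
}

// get takes a read lock on the same mutex the reconnect logic would use to
// reset the cache, so a read can never interleave with a cache reset.
func (d *db) get(key string) (string, bool) {
	d.cacheMutex.RLock()
	defer d.cacheMutex.RUnlock()
	v, ok := d.cache[key]
	return v, ok
}

func main() {
	d := &db{cache: map[string]string{"lsp1": "uuid-1"}}
	if v, ok := d.get("lsp1"); ok {
		fmt.Println("read from cache:", v)
	}
}
```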

@coveralls

coveralls commented Nov 3, 2021

Pull Request Test Coverage Report for Build 1426280710

  • 39 of 43 (90.7%) changed or added relevant lines in 2 files are covered.
  • 1 unchanged line in 1 file lost coverage.
  • Overall coverage increased (+0.09%) to 72.983%

Changes Missing Coverage:
  client/client.go — 37 of 41 changed/added lines covered (90.24%)

Files with Coverage Reduction:
  client/client.go — 1 new missed line (65.05%)

Totals:
  Change from base Build 1425962558: +0.09%
  Covered Lines: 4198
  Relevant Lines: 5752

💛 - Coveralls

@trozet
Contributor Author

trozet commented Nov 3, 2021

/assign @dave-tucker

@jcaamano
Collaborator

jcaamano commented Nov 4, 2021

Should we immediately return an error if the client is used while reconnecting, and keep track of that with its own guarded flag?
Even if we attempt to block and wait for it, should we still keep track of it and return an error on timeout?
Isn't a 20-second timeout too much?

client/client.go Outdated
}
startTime := time.Now()
// TODO(trozet) make this an exponential backoff
for !isCacheConsistent(db) || time.Since(startTime) > 20*time.Second {
trozet (Contributor Author)


oops this should be less than 20 sec
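
A hedged reading of that fix: the loop should keep waiting only while the cache is inconsistent and the elapsed time is still under the 20-second budget, which also means joining the conditions with `&&` rather than `||`. The cacheConsistent stand-in and the poll interval below are assumptions, not the merged code.

```go
package main

import (
	"fmt"
	"time"
)

// cacheConsistent stands in for the real consistency check; in the patch it
// is tied to the client's deferred-updates state.
var cacheConsistent = func() bool { return false }

// waitForCacheConsistent sketches the corrected loop: keep polling while the
// cache is inconsistent AND we are still under the timeout, i.e. the
// comparison flips to "<" and the conditions combine with "&&".
func waitForCacheConsistent(timeout time.Duration) bool {
	startTime := time.Now()
	// TODO(trozet) make this an exponential backoff
	for !cacheConsistent() && time.Since(startTime) < timeout {
		time.Sleep(50 * time.Millisecond)
	}
	return cacheConsistent()
}

func main() {
	fmt.Println("consistent:", waitForCacheConsistent(200*time.Millisecond))
}
```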

@trozet
Contributor Author

trozet commented Nov 4, 2021

> Should we immediately return an error if the client is used while reconnecting, and keep track of that with its own guarded flag?

How would that work? Ovn-kube would be responsible for retrying the operation? We block on transactions until we are connected and ready to send. That timeout looks like it would be 10 seconds.

> Even if we attempt to block and wait for it, should we still keep track of it and return an error on timeout?

Probably. Or at least warn that the cache may be inconsistent. Most of our ovnkube handling isn't smart enough to retry the operation again later. Maybe that is something we need to fix in ovnk.

> Isn't a 20-second timeout too much?

I just did connection + monitor timeout. I'm open to changing it. What do you think is more reasonable?

@dave-tucker
Collaborator

@trozet rather than hard-coding a timeout here, it would be better to propagate a ctx context.Context through the Get and List functions and down into waitForCacheConsistent. The caller can then implement whatever retry/backoff strategy they like.
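
A sketch of that suggestion, assuming a context-aware helper; the signature, ticker interval, and error wording below are illustrative, not the merged implementation. A Get or List wrapper would pass its ctx straight through.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// cacheConsistent stands in for the real consistency check; in the patch it
// is driven by the client's deferred-updates state.
var cacheConsistent = func() bool { return false }

// waitForCacheConsistent blocks until the cache looks consistent or the
// caller's context is done, so the caller (e.g. Get or List) chooses the
// timeout or retry/backoff strategy instead of a hard-coded 20 seconds.
func waitForCacheConsistent(ctx context.Context) error {
	ticker := time.NewTicker(50 * time.Millisecond)
	defer ticker.Stop()
	for {
		if cacheConsistent() {
			return nil
		}
		select {
		case <-ctx.Done():
			return fmt.Errorf("cache may be inconsistent: %w", ctx.Err())
		case <-ticker.C:
		}
	}
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 200*time.Millisecond)
	defer cancel()
	fmt.Println(waitForCacheConsistent(ctx))
}
```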

@dave-tucker
Collaborator

> Should we immediately return an error if the client is used while reconnecting, and keep track of that with its own guarded flag?

We currently have the rpcMutex, which is held by the reconnect logic. This stops any rpc calls from being made until after the connection is established. Unfortunately, it also prevents us from checking whether we're connected until after the mutex has been acquired, as the connection test is rpcClient != nil. We could add a separate bool to track connection state and return an error to the client earlier, but that also needs to be mutex-protected 😢
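
For reference, a minimal sketch of that option: a small connected flag guarded by its own mutex, separate from rpcMutex, so callers can fail fast while a reconnect is in progress. All names below are illustrative, not the real libovsdb client.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// client sketches tracking connection state separately from rpcClient, so a
// caller can return an error early instead of blocking on rpcMutex. Field
// names are illustrative assumptions.
type client struct {
	connMutex sync.RWMutex
	connected bool
}

func (c *client) setConnected(v bool) {
	c.connMutex.Lock()
	defer c.connMutex.Unlock()
	c.connected = v
}

// checkConnected lets a caller fail fast while reconnecting, without having
// to acquire the (potentially held) rpcMutex first.
func (c *client) checkConnected() error {
	c.connMutex.RLock()
	defer c.connMutex.RUnlock()
	if !c.connected {
		return errors.New("client is not connected")
	}
	return nil
}

func main() {
	c := &client{}
	fmt.Println(c.checkConnected()) // client is not connected
	c.setConnected(true)
	fmt.Println(c.checkConnected()) // <nil>
}
```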

@jcaamano
Collaborator

jcaamano commented Nov 4, 2021

> Should we immediately return an error if the client is used while reconnecting, and keep track of that with its own guarded flag?

> We currently have the rpcMutex, which is held by the reconnect logic. This stops any rpc calls from being made until after the connection is established. Unfortunately, it also prevents us from checking whether we're connected until after the mutex has been acquired, as the connection test is rpcClient != nil. We could add a separate bool to track connection state and return an error to the client earlier, but that also needs to be mutex-protected 😢

This is what I meant. Yes, it's an additional mutex, but at least it is very focused and specific.

@jcaamano
Collaborator

jcaamano commented Nov 4, 2021

> Should we immediately return an error if the client is used while reconnecting, and keep track of that with its own guarded flag?

> How would that work? Ovn-kube would be responsible for retrying the operation? We block on transactions until we are connected and ready to send. That timeout looks like it would be 10 seconds.

> Even if we attempt to block and wait for it, should we still keep track of it and return an error on timeout? Probably. Or at least warn that the cache may be inconsistent. Most of our ovnkube handling isn't smart enough to retry the operation again later. Maybe that is something we need to fix in ovnk.

> Isn't a 20-second timeout too much? I just did connection + monitor timeout. I'm open to changing it. What do you think is more reasonable?

We have a constant that defines the timeout as 10 seconds, and I think there was a plan to bring it back down. Maybe we can use it for this as well.

I guess that if we are not going to handle the error on the ovn-k side, returning an error makes no sense. Should we add a log line when we are waiting here, so that if we see gaps in the logs we know this is the reason?

What I keep thinking, though, is that we will eventually have controllers that handle events, where each event maps to one transaction; on any error the transaction is not committed and the event is retried from the beginning. Then we can handle errors more gracefully.

@trozet
Contributor Author

trozet commented Nov 4, 2021

@jcaamano deferredUpdates (plus the release of the cacheMutex lock) is kind of our signal that the client has reconnected and the monitor is completely set up. I know that is not as obvious as if it had its own lock/name, but I prefer not to add another mutex (there are already too many in here). I can add some comments around isCacheConsistent to better explain why deferredUpdates works?
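
To make that signal concrete, a hedged sketch of what an isCacheConsistent check along those lines could look like; the field names and the pending-updates representation are assumptions, not the actual implementation.

```go
package main

import (
	"fmt"
	"sync"
)

// database sketches the state described above: cacheMutex is held while the
// reconnect path resets the cache, and deferredUpdates holds monitor updates
// that arrived before the initial monitor reply was fully processed. Names
// are illustrative, not the real libovsdb internals.
type database struct {
	cacheMutex      sync.RWMutex
	deferredUpdates []interface{}
}

// isCacheConsistent treats "no deferred updates while holding cacheMutex" as
// the signal that reconnect has finished and the monitor is fully set up, so
// a read taken now sees a settled cache.
func isCacheConsistent(db *database) bool {
	db.cacheMutex.RLock()
	defer db.cacheMutex.RUnlock()
	return len(db.deferredUpdates) == 0
}

func main() {
	db := &database{deferredUpdates: []interface{}{"pending row update"}}
	fmt.Println("consistent:", isCacheConsistent(db)) // false until drained
	db.deferredUpdates = nil
	fmt.Println("consistent:", isCacheConsistent(db)) // true
}
```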

@trozet
Contributor Author

trozet commented Nov 4, 2021

OK, updated and added the missing docs.

Collaborator

@dave-tucker dave-tucker left a comment


LGTM, just needs a rebase.

@dave-tucker dave-tucker merged commit 34a572b into ovn-org:main Nov 5, 2021
@dave-tucker dave-tucker added the fix label Nov 6, 2021