Protect cache integrity during reads #258

Merged
merged 1 commit into ovn-org:main on Nov 5, 2021

Conversation

trozet
Contributor

@trozet trozet commented Nov 3, 2021

During an invalid cache state, the client will disconnect, then attempt to
reconnect to each endpoint. During this process, reads against the cache
are not prohibited, which means a client could be reading stale data from
the cache.

During reconnect we use the meta db.cacheMutex (not the cache's own mutex)
to control resetting the db's cache. This patch leverages that same mutex
to guard reads of the cache. Additionally, it tries to ensure that the
cache is in a consistent state when the read takes place. The db.cacheMutex
is not held for the entire reconnect process, so we make some attempt to
wait for a signal that the reconnect is complete: a best-effort attempt to
give the client an accurate cache read.

Signed-off-by: Tim Rozet [email protected]
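
To illustrate the guard described in the commit message, here is a minimal sketch in Go: a read takes a read lock on the same db.cacheMutex that the reconnect path uses while resetting the cache. The db type, field names, and get helper below are assumptions for illustration, not the actual libovsdb client code.

```go
package main

import (
	"fmt"
	"sync"
)

// db mirrors the idea in the description: a meta cacheMutex that the
// reconnect path holds while resetting the cache. Field names are
// illustrative only, not the real libovsdb internals.
type db struct {
	cacheMutex sync.RWMutex
	cache      map[string]string
}

// get takes a read lock on the same mutex the reconnect logic would use to
// reset the cache, so a read can never interleave with a cache reset.
func (d *db) get(key string) (string, bool) {
	d.cacheMutex.RLock()
	defer d.cacheMutex.RUnlock()
	v, ok := d.cache[key]
	return v, ok
}

func main() {
	d := &db{cache: map[string]string{"lsp1": "uuid-1"}}
	if v, ok := d.get("lsp1"); ok {
		fmt.Println("read from cache:", v)
	}
}
```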

@coveralls

coveralls commented Nov 3, 2021

Pull Request Test Coverage Report for Build 1426280710

  • 39 of 43 (90.7%) changed or added relevant lines in 2 files are covered.
  • 1 unchanged line in 1 file lost coverage.
  • Overall coverage increased (+0.09%) to 72.983%

Changes Missing Coverage:
  client/client.go — 37 of 41 changed/added lines covered (90.24%)

Files with Coverage Reduction:
  client/client.go — 1 new missed line (65.05%)

Totals:
  Change from base Build 1425962558: +0.09%
  Covered Lines: 4198
  Relevant Lines: 5752

💛 - Coveralls

@trozet
Contributor Author

trozet commented Nov 3, 2021

/assign @dave-tucker

@jcaamano
Collaborator

jcaamano commented Nov 4, 2021

Should we immediately return an error if the client is used while reconnecting, and keep track of that with its own guarded flag?
Even if we attempt to block and wait for it, should we still keep track of it and return an error on timeout?
Isn't a 20-second timeout too much?

client/client.go Outdated
}
startTime := time.Now()
// TODO(trozet) make this an exponential backoff
for !isCacheConsistent(db) || time.Since(startTime) > 20*time.Second {
trozet (Contributor Author)


oops this should be less than 20 sec
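
A hedged reading of that fix: the loop should keep waiting only while the cache is inconsistent and the elapsed time is still under the 20-second budget, which also means joining the conditions with `&&` rather than `||`. The cacheConsistent stand-in and the poll interval below are assumptions, not the merged code.

```go
package main

import (
	"fmt"
	"time"
)

// cacheConsistent stands in for the real consistency check; in the patch it
// is tied to the client's deferred-updates state.
var cacheConsistent = func() bool { return false }

// waitForCacheConsistent sketches the corrected loop: keep polling while the
// cache is inconsistent AND we are still under the timeout, i.e. the
// comparison flips to "<" and the conditions combine with "&&".
func waitForCacheConsistent(timeout time.Duration) bool {
	startTime := time.Now()
	// TODO(trozet) make this an exponential backoff
	for !cacheConsistent() && time.Since(startTime) < timeout {
		time.Sleep(50 * time.Millisecond)
	}
	return cacheConsistent()
}

func main() {
	fmt.Println("consistent:", waitForCacheConsistent(200*time.Millisecond))
}
```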

@trozet
Contributor Author

trozet commented Nov 4, 2021

> Should we immediately return an error if the client is used while reconnecting, and keep track of that with its own guarded flag?

How would that work? Ovn-kube would be responsible for retrying the operation? We block on transactions until we are connected and ready to send. That timeout looks like it would be 10 seconds.

> Even if we attempt to block and wait for it, should we still keep track of it and return an error on timeout?

Probably. Or at least warn that the cache may be inconsistent. Most of our ovnkube handling isn't smart enough to retry the operation again later. Maybe that is something we need to fix in ovnk.

> Isn't a 20-second timeout too much?

I just did connection + monitor timeout. I'm open to changing it. What do you think is more reasonable?

@dave-tucker
Collaborator

@trozet rather than hard-coding a timeout here, it would be better to propagate a ctx context.Context through the Get and List functions and down into waitForCacheConsistent. The caller can then implement whatever retry/backoff strategy they like.
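
A sketch of that suggestion, assuming a context-aware helper; the signature, ticker interval, and error wording below are illustrative, not the merged implementation. A Get or List wrapper would pass its ctx straight through.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// cacheConsistent stands in for the real consistency check; in the patch it
// is driven by the client's deferred-updates state.
var cacheConsistent = func() bool { return false }

// waitForCacheConsistent blocks until the cache looks consistent or the
// caller's context is done, so the caller (e.g. Get or List) chooses the
// timeout or retry/backoff strategy instead of a hard-coded 20 seconds.
func waitForCacheConsistent(ctx context.Context) error {
	ticker := time.NewTicker(50 * time.Millisecond)
	defer ticker.Stop()
	for {
		if cacheConsistent() {
			return nil
		}
		select {
		case <-ctx.Done():
			return fmt.Errorf("cache may be inconsistent: %w", ctx.Err())
		case <-ticker.C:
		}
	}
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 200*time.Millisecond)
	defer cancel()
	fmt.Println(waitForCacheConsistent(ctx))
}
```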

@dave-tucker
Collaborator

> Should we immediately return an error if the client is used while reconnecting, and keep track of that with its own guarded flag?

We currently have the rpcMutex, which is held by the reconnect logic. This stops any rpc calls from being made until after the connection is established. Unfortunately, it also prevents us from checking whether we're connected until after the mutex has been acquired, as the connection test is rpcClient != nil. We could add a separate bool to track connection state and return an error to the client earlier, but that also needs to be mutex-protected 😢
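
For reference, a minimal sketch of that option: a small connected flag guarded by its own mutex, separate from rpcMutex, so callers can fail fast while a reconnect is in progress. All names below are illustrative, not the real libovsdb client.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// client sketches tracking connection state separately from rpcClient, so a
// caller can return an error early instead of blocking on rpcMutex. Field
// names are illustrative assumptions.
type client struct {
	connMutex sync.RWMutex
	connected bool
}

func (c *client) setConnected(v bool) {
	c.connMutex.Lock()
	defer c.connMutex.Unlock()
	c.connected = v
}

// checkConnected lets a caller fail fast while reconnecting, without having
// to acquire the (potentially held) rpcMutex first.
func (c *client) checkConnected() error {
	c.connMutex.RLock()
	defer c.connMutex.RUnlock()
	if !c.connected {
		return errors.New("client is not connected")
	}
	return nil
}

func main() {
	c := &client{}
	fmt.Println(c.checkConnected()) // client is not connected
	c.setConnected(true)
	fmt.Println(c.checkConnected()) // <nil>
}
```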

@jcaamano
Collaborator

jcaamano commented Nov 4, 2021

> Should we immediately return an error if the client is used while reconnecting, and keep track of that with its own guarded flag?

> We currently have the rpcMutex, which is held by the reconnect logic. This stops any rpc calls from being made until after the connection is established. Unfortunately, it also prevents us from checking whether we're connected until after the mutex has been acquired, as the connection test is rpcClient != nil. We could add a separate bool to track connection state and return an error to the client earlier, but that also needs to be mutex-protected 😢

This is what I meant. Yes, it's an additional mutex, but at least it is very focused and specific.

@jcaamano
Collaborator

jcaamano commented Nov 4, 2021

> Should we immediately return an error if the client is used while reconnecting, and keep track of that with its own guarded flag?

> How would that work? Ovn-kube would be responsible for retrying the operation? We block on transactions until we are connected and ready to send. That timeout looks like it would be 10 seconds.

> Even if we attempt to block and wait for it, should we still keep track of it and return an error on timeout? Probably. Or at least warn that the cache may be inconsistent. Most of our ovnkube handling isn't smart enough to retry the operation again later. Maybe that is something we need to fix in ovnk.

> Isn't a 20-second timeout too much? I just did connection + monitor timeout. I'm open to changing it. What do you think is more reasonable?

We have a constant that defines the timeout as 10 seconds, and I think there was a plan to bring it back down. Maybe we can use it for this as well.

I guess that if we are not going to handle the error on the ovn-k side, returning an error makes no sense. Should we add a log line when we are waiting here, so that if we see gaps in the logs we know this is the reason?

What I keep thinking, though, is that we will eventually have controllers that handle events, where each event maps to one transaction; on any error the transaction is not committed and the event is retried from the beginning. Then we can handle errors more gracefully.

@trozet
Contributor Author

trozet commented Nov 4, 2021

@jcaamano deferredUpdates (plus the release of the cacheMutex lock) is kind of our signal that the client has reconnected and the monitor is completely set up. I know that is not as obvious as if it had its own lock/name, but I prefer not to add another mutex (there are already too many in here). I can add some comments around isCacheConsistent to better explain why deferredUpdates works?
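
To make that signal concrete, a hedged sketch of what an isCacheConsistent check along those lines could look like; the field names and the pending-updates representation are assumptions, not the actual implementation.

```go
package main

import (
	"fmt"
	"sync"
)

// database sketches the state described above: cacheMutex is held while the
// reconnect path resets the cache, and deferredUpdates holds monitor updates
// that arrived before the initial monitor reply was fully processed. Names
// are illustrative, not the real libovsdb internals.
type database struct {
	cacheMutex      sync.RWMutex
	deferredUpdates []interface{}
}

// isCacheConsistent treats "no deferred updates while holding cacheMutex" as
// the signal that reconnect has finished and the monitor is fully set up, so
// a read taken now sees a settled cache.
func isCacheConsistent(db *database) bool {
	db.cacheMutex.RLock()
	defer db.cacheMutex.RUnlock()
	return len(db.deferredUpdates) == 0
}

func main() {
	db := &database{deferredUpdates: []interface{}{"pending row update"}}
	fmt.Println("consistent:", isCacheConsistent(db)) // false until drained
	db.deferredUpdates = nil
	fmt.Println("consistent:", isCacheConsistent(db)) // true
}
```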

@trozet
Contributor Author

trozet commented Nov 4, 2021

OK, updated and added the missing docs.

Collaborator

@dave-tucker dave-tucker left a comment


LGTM, just needs a rebase.

@dave-tucker dave-tucker merged commit 34a572b into ovn-org:main Nov 5, 2021
@dave-tucker dave-tucker added the fix label Nov 6, 2021