-
Notifications
You must be signed in to change notification settings - Fork 39.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Serve watch without resourceVersion from cache and introduce a WatchFromStorageWithoutResourceVersion feature gate to allow serving watch from storage #123935
Conversation
a89f73c
to
e74151b
Compare
staging/src/k8s.io/apiserver/pkg/storage/cacher/cacher_whitebox_test.go
Outdated
Show resolved
Hide resolved
e74151b
to
c586bca
Compare
@serathius @liggitt we want this in 1.30 and backports all the way back to 1.27 right? |
/priority critical-urgent |
/sig etcd |
c586bca
to
206b10f
Compare
yup :-/ |
206b10f
to
c13ddd4
Compare
c13ddd4
to
dde396d
Compare
/retest |
0e722fc
to
b692ace
Compare
Yey! Got through test issues! cc @liggitt Note; I have updated getWatchCacheResourceVersion and it's test to make sure it passes and treated TestWatchSemantics test as authoritative to Watch behavior. |
func (c *Cacher) getWatchCacheResourceVersion(ctx context.Context, parsedWatchResourceVersion uint64, opts storage.ListOptions) (uint64, error) { | ||
if len(opts.ResourceVersion) != 0 { | ||
return parsedWatchResourceVersion, nil | ||
} | ||
// legacy case | ||
if !utilfeature.DefaultFeatureGate.Enabled(features.WatchFromStorageWithoutResourceVersion) && opts.SendInitialEvents == nil && opts.ResourceVersion == "" { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we add it here?
If we're already at this point, we will be serving watch from cache. So the problem we're facing [being watches on etcd overloading etcd] is not a problem here. I don't think this is needed here - I would revert changes to this function.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The refactors for WatchList have changed the default behavior of watch cache, that was not a problem was previously this response was served from etcd, however to pass the tests (WatchSemantics and conformance test) I need to update this function.
Reason is that getWatchCacheResourceVersion returns a ResourceVersion to which the watch cache must be synchronized to
. Without this change getWatchCacheResourceVersion
will call GetCurrentResourceVersionFromStorage
which is meant for consistent read, and wait for such revision.
With my change I will return zero
, which means serve the data available from cache, which is consistent with the old behavior.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK - we will need to update tests when we will switch FG default, but that's fine.
So the tests should effectively be [if FG set, then 100, otherwise 0].
So maybe please add a short comment for those tests that when we switch FG, those should be reverted to 100.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added the comment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have a hard time following what this bit does, but at least the 1.27 / 1.28 / 1.29 backports won't need this and will be a straight gating of the etcd pass-through, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes - this bit will not be part of backports
/hold |
b692ace
to
efb2abd
Compare
…romStorageWithoutResourceVersion feature gate to allow serving watch from storage.
efb2abd
to
0130072
Compare
/lgtm /hold cancel |
LGTM label has been added. Git tree hash: 1ee9a24abc8f29c9648098950fee946f000bcadb
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: hzxuzhonghu, liggitt, serathius, wojtek-t The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
/retest |
defer featuregatetesting.SetFeatureGateDuringTest(t, utilfeature.DefaultFeatureGate, features.WatchFromStorageWithoutResourceVersion, false)() | ||
store, terminate := testSetupWithEtcdAndCreateWrapper(t) | ||
t.Cleanup(terminate) | ||
storagetesting.RunWatchSemantics(context.TODO(), t, store) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this doubling of the RunWatchSemantics test is responsible for the increased timeout rate in https://testgrid.k8s.io/sig-release-master-blocking#ci-kubernetes-unit&width=20 and #123850
…1.28 Cherry-pick of #123935: Serve watch without resourceVersion from cache and introduce a WatchFromStorageWithoutResourceVersion feature gate to allow serving watch from storage.
…1.27 Cherry-pick of #123935: Serve watch without resourceVersion from cache and introduce a WatchFromStorageWithoutResourceVersion feature gate to allow serving watch from storage.
/kind bug
With discovery of etcd watch correctness under the extreme cases of watch congestion (#123072 (comment)), #123532 is no longer a sufficient mitigation.
Even with watch cache protected, direct watches to etcd will still suffer loss of events and cause etcd memory blot. Rolling forward with improvements to etcd watch like etcd-io/etcd#17555 might be more riskly as will cause watch compaction affect new code paths.
Serving watches directly from etcd was reintroduced in 1.27 as a simple correctness fix #115096, however we didn't anticipate such big impact on stability and correctness of watch.
I propose to revert the #115096 and leave a WatchFromStorageWithoutResourceVersion feature flag to get the old behavior. At least for as long we implement
ConsistentWatchFromCache
as part of KEP-2340