KEP-4192: Move Storage Version Migrator in-tree #4240
Conversation
nilekhc
commented
Sep 25, 2023
- One-line PR description: Move Storage Version Migrator in-tree.
- Issue link: Move Storage Version Migrator in-tree #4192
- Other comments:
When the Migrator controller is integrated in-tree, it will leverage the `Streaming List` approach to obtain and monitor resources while executing storage version migrations, as opposed to using the resource-intensive `chunked list` method. In cases where, for any reason, the client cannot establish a streaming watch connection with the API server, it will fall back to the standard `chunked list` method, retaining the older LIST/WATCH semantics.

- _Open Question_:
  - Does `Streaming List` support inconsistent lists? Currently, with chunked lists, we receive inconsistent lists. We need to determine if we can achieve the same with Streaming List. Depending on the outcome, we may consider removing the [ContinueToken](https://github.com/kubernetes-sigs/kube-storage-version-migrator/blob/60dee538334c2366994c2323c0db5db8ab4d2838/pkg/apis/migration/v1alpha1/types.go#L63) from the API.
@deads2k @jpbetz @wojtek-t @p0lyn0mial in an attempt to verify this, I hacked in continue token support to streaming watch. While that implementation is obviously not what we would ship, would y'all be okay with the alpha SVM code simply using streaming watch and assuming it will work (i.e. for large resource sets it will fail to migrate if the watch cache or etcd are unable to provide a consistent list over the slow migration loop that SVM performs). Assuming that streaming watch gives lexicographical ordering of events (it appears to, though I do not know if that is on purpose), it can support inconsistent continue tokens, at least if the caller has a way to bypass the watch cache. @wojtek-t @p0lyn0mial is that something y'all would be willing to support as part of the streaming watch KEP?
Unfortunately, streaming doesn't support pagination (passing a continuation token) because the watchcache doesn't have built-in support for this mechanism.
I know that @wojtek-t considered adding support (as part of a new KEP) to the watchcache and even discussed it at a meeting, but I don’t know the current status.
As for using streaming in general, we plan to enable this feature for some component in this iteration. So, we really don't know if this mechanism works. We are about to find out.
I personally don't have anything against using the new mechanism for the alpha version of SVM.
> Assuming that streaming watch gives lexicographical ordering of events

mhm, streaming as in "give me a consistent list (snapshot) of data as a stream" gives you a random order. The main use case is that you collect the data received from the server in some local store and only then operate on it (you don’t process individual events).

mhm, watching in general gives you a deterministic order (but not lexicographical) - the events are delivered in the order in which they occurred.
Assuming you cannot use an informer (I haven't checked), please check out https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/client-go/tools/cache/reflector.go#L330 to see how you could implement fallback logic.
At some point I also wanted to change client-go's LIST method to follow streaming, but I haven't looked into it yet.
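If it helps, here is a minimal Go sketch of that fallback idea (illustrative only, not the eventual SVM code): try a streaming watch-list via `SendInitialEvents`, and fall back to a chunked LIST with `Limit`/`Continue` if the streaming request cannot be established.

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/apimachinery/pkg/watch"
	"k8s.io/client-go/dynamic"
	"k8s.io/utils/ptr"
)

// listWithFallback tries a streaming watch-list first and falls back to a
// chunked LIST if the streaming request cannot be established.
func listWithFallback(ctx context.Context, client dynamic.Interface, gvr schema.GroupVersionResource) error {
	// Streaming list: a WATCH that replays the current state as synthetic
	// ADDED events before switching to ordinary watch events.
	w, err := client.Resource(gvr).Watch(ctx, metav1.ListOptions{
		SendInitialEvents:    ptr.To(true),
		ResourceVersionMatch: metav1.ResourceVersionMatchNotOlderThan,
		AllowWatchBookmarks:  true,
	})
	if err == nil {
		defer w.Stop()
		for ev := range w.ResultChan() {
			if ev.Type == watch.Bookmark {
				// Real code should check for the "k8s.io/initial-events-end"
				// annotation; the first bookmark is used here for brevity.
				return nil
			}
			fmt.Printf("streamed event: %v\n", ev.Type)
		}
		return nil
	}

	// Fallback: classic chunked LIST using Limit/Continue.
	opts := metav1.ListOptions{Limit: 500}
	for {
		list, err := client.Resource(gvr).List(ctx, opts)
		if err != nil {
			return err
		}
		fmt.Printf("listed %d items\n", len(list.Items))
		if list.GetContinue() == "" {
			return nil
		}
		opts.Continue = list.GetContinue()
	}
}
```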
FWIW - I opened kubernetes/kubernetes#120897 to fix the ordering issue for watch-cache, so once that merges, the ordering problem is solved. But I'm still far from being convinced about supporting continue token...
> What I don't like is supporting continueToken for resuming watchlist, because the semantics of this would be strange

Yeah, I'm not crazy about doing something like this. It makes resumption subtly semantically different, and I don't really like the idea of giving people foot guns.
This is a tough problem, but I'd be disappointed if it prevented us from making forward progress in 1.29.
How about we stick with paginated list for alpha and add a beta graduation criteria line item like "Resolve API server memory utilization problems with paginated list by enabling watch-list to handle large lists (either via an 'inconsistent pagination token' mechanism or via some other approach)"?
I'm ok with deferring to beta, but I see having a predictable and limited memory footprint as critical at scale. The list OOMing problem, even with low hundreds of items, is a serious problem.
I definitely agree that resolving it should block Beta.
But for Alpha, I'm with Joe - I suggest proceeding with paginated lists to unblock progress.
Can we mark it explicitly as
### APIs to move

#### We will move the following [APIs](https://github.com/kubernetes-sigs/kube-storage-version-migrator/blob/60dee538334c2366994c2323c0db5db8ab4d2838/pkg/apis/migration/v1alpha1/types.go) in-tree:

- `v1alpha1` of `storageversionmigrations.migration.k8s.io`
Would have expected some apiserver-related group.
- Automatic storage version migration for CRDs
- Make it easy for Kubernetes developers to drop old API schemas by guaranteeing that storage migration is run automatically on SV hash changes (should this also be on a timer or API server identity?)
- Automated Storage Version Migration via the hash exposed by the `StorageVersion` API
I would say that we generally want to get to this point where the migration happens automatically.
But making it a follow-up is probably reasonable for the sake of making progress.
Right, we all agree that we need this to happen automatically, but we have tried to keep this KEP small so that we don't stall like the previous efforts.
}

// Describes the state of a migration at a certain point.
type MigrationCondition struct {
Can we (for the sake of unification) use metav1.Condition instead:
https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/apimachinery/pkg/apis/meta/v1/types.go#L1497
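For illustration, a hypothetical sketch of what the status could look like if the bespoke condition type were swapped for the shared `metav1.Condition` (type and field names here are illustrative, not the final API):

```go
package v1alpha1

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// StorageVersionMigrationStatus is a hypothetical shape of the status if the
// bespoke MigrationCondition type were replaced with the shared metav1.Condition.
type StorageVersionMigrationStatus struct {
	// Conditions follow the standard Kubernetes convention: each entry carries
	// type, status, reason, message, observedGeneration, and lastTransitionTime.
	// +optional
	// +listType=map
	// +listMapKey=type
	Conditions []metav1.Condition `json:"conditions,omitempty"`
}
```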
@wojtek-t Thanks for the review. I have updated it.
When the Migrator controller is integrated in-tree, it will leverage the `Streaming List` approach to obtain and monitor resources while executing storage version migrations, as opposed to using the resource-intensive `chunked list` method. In cases where, for any reason, the client cannot establish a streaming watch connection with the API server, it will fall back to the standard `chunked list` method, retaining the older LIST/WATCH semantics.

- _Open Question_:
  - Does `Streaming List` support inconsistent lists? Currently, with chunked lists, we receive inconsistent lists. We need to determine if we can achieve the same with Streaming List. Depending on the outcome, we may consider removing the [ContinueToken](https://github.com/kubernetes-sigs/kube-storage-version-migrator/blob/60dee538334c2366994c2323c0db5db8ab4d2838/pkg/apis/migration/v1alpha1/types.go#L63) from the API.
There are a couple of different things mentioned here that we're touching on.

> Assuming that streaming watch gives lexicographical ordering of events

This is only true for the etcd implementation (because the list from etcd is returned in order).
This is not true for the watchcache implementation as of now:
https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/apiserver/pkg/storage/cacher/watch_cache_interval.go#L120-L150
The fix would be trivial though - it just requires adding sorting there [and we should probably do that].

> it can support inconsistent continue tokens

My problem with that is that I'm not sure what that would effectively mean from the API perspective.
The way watch-list currently works is that:

1. it lists all items [providing guarantees about freshness based on input parameters]
2. [if requested] it returns a bookmark showing that the "list" operation finished
3. it continues to watch for items from that point [as part of the same request]

If we decide to support a continuation token, (3) no longer makes sense to me [we started the list from the middle of the collection, but the watch is actually watching the whole collection].
[I guess we would have to stop after point (2) then, but that still seems a bit sloppy...]
We also don't return a continuation token when list-watch fails in the middle of the listing phase [though we can probably do that].

> at least if the caller has a way to bypass the watch cache.

If the caller bypasses the watchcache, then the gain from watch-list is zero, because the etcd implementation is doing a regular list from etcd. [We basically needed to ensure that the API is supported if the watch cache is disabled, so we have both watchcache and etcd implementations, but the assumption is that the watchcache implementation is the one that will generally be used, and this is the one that actually provides real benefits.]
That said, the watchcache doesn't support pagination...

> would y'all be okay with the alpha SVM code simply using streaming watch and assuming it will work

That sounds reasonable to me for Alpha...
... but we need some solution for Beta.
And here my problem is that, based on what I wrote above, I'm not sure that supporting continuation for watch-list is the right path forward. And if not, we would need a fallback to regular paginated lists [and if so, maybe we should just use paginated lists?]
// The resource that is being migrated. The migrator sends requests to
// the endpoint serving the resource.
// Immutable.
Resource GroupVersionResource `json:"resource"`
When reading this it sounded like this is the GVR being migrated "from". Is that correct? Should we clarify further?
I updated the comment and the field name as well. PTAL. I'm open to other suggestions as well. 😄
// The storage version hash to avoid races.
StorageVersionHash string `json:"storageVersionHash,omitempty"`
Sorry if this is obvious, but what's this for?
I added it to resolve - https://github.com/kubernetes-sigs/kube-storage-version-migrator/blob/60dee538334c2366994c2323c0db5db8ab4d2838/pkg/apis/migration/v1alpha1/types.go#L64-L65
If I understand this correctly, we can use the hash to validate whether the resource still needs migration.
However, upon further examination, I realize that in the out-of-tree SVM migration, the Migration Resource is created automatically, making it easier to populate this field. With the current implementation, we are asking the user to create the Migration Resource. This may not provide the best user experience, as users would need to know the appropriate hash. WDYT?
cc @enj
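For context, the hash in question is the one the API server already advertises per resource through discovery (`metav1.APIResource.StorageVersionHash`). A rough sketch of how a controller (or a user-facing tool) could look it up to decide whether a recorded migration is stale; the helper name is illustrative:

```go
package main

import (
	"fmt"

	"k8s.io/client-go/discovery"
)

// currentStorageVersionHash returns the storage version hash the API server
// advertises for a resource via discovery. Comparing it with the hash recorded
// on a migration request is one way to detect that the migration is stale.
func currentStorageVersionHash(dc discovery.DiscoveryInterface, groupVersion, resource string) (string, error) {
	resources, err := dc.ServerResourcesForGroupVersion(groupVersion)
	if err != nil {
		return "", err
	}
	for _, r := range resources.APIResources {
		if r.Name == resource {
			// StorageVersionHash is populated when the StorageVersionHash
			// feature is enabled on the API server.
			return r.StorageVersionHash, nil
		}
	}
	return "", fmt.Errorf("resource %q not found in %s", resource, groupVersion)
}
```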
### [Controller](https://github.com/kubernetes-sigs/kube-storage-version-migrator/tree/60dee538334c2366994c2323c0db5db8ab4d2838/pkg/controller) to move

#### [Migrator Controller](https://github.com/kubernetes-sigs/kube-storage-version-migrator/tree/60dee538334c2366994c2323c0db5db8ab4d2838/pkg/migrator)

Currently, the Storage Version Migrator comprises two controllers: the `Trigger` controller and the `Migrator` controller. The Trigger controller performs resource discovery, identifying supported resources with the preferred server version every `10 minutes`. Subsequently, the Trigger controller creates the `StorageVersionMigration` resource to initiate the migration process. The Migrator controller then picks up this resource and executes the actual migration.
Is this design level-triggered or edge-triggered? If edge-triggered, how do we ensure that we don't miss (fail to observe) storage version changes?
When transitioning the Storage Version Migrator in-tree, we will move only the Migrator controller, as a component of KCM. The creation of the Migration resource will be deferred to the user, instead of being triggered automatically.
With this approach, we are requiring users to be more proactive and knowledgeable about when they need storage migration. The automatic trigger for Storage Migration is deferred to future enhancements. In this KEP, we are scoping the work to having the controller and feature in-tree so that we can further enhance it.
We have listed it as one of the unclear goals in the KEP.
### UNCLEAR Goals and/or Non-Goals
- Automatic storage version migration for CRDs
- Make it easy for Kubernetes developers to drop old API schemas by guaranteeing that storage migration is run automatically on SV hash changes (should this also be on a timer or API server identity?)
- Automated Storage Version Migration via the hash exposed by the `StorageVersion` API
Okay, I think this answers my question. Can we make it super clear in the KEP that there will be no automatic triggering? I somehow missed that when first reading it.
In the Proposal section, I updated this:
**[UNCLEAR]** Create a new controller that watches the storage version API and automatically triggers migrations based on changes (to the SV API)
to:
For Alpha, we will not trigger automatic storage version migration, and it will be deferred to the user.
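Since triggering is left to the user for Alpha, a user (or a script) would create the migration object directly. A rough sketch using the dynamic client and the out-of-tree `migration.k8s.io/v1alpha1` group referenced in this KEP; the final in-tree group, kind, and fields may differ:

```go
package main

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
)

// createMigration sketches how a user could request a storage version
// migration for secrets, given that automatic triggering is deferred.
func createMigration(ctx context.Context, client dynamic.Interface) error {
	gvr := schema.GroupVersionResource{
		Group:    "migration.k8s.io",
		Version:  "v1alpha1",
		Resource: "storageversionmigrations",
	}
	svm := &unstructured.Unstructured{Object: map[string]interface{}{
		"apiVersion": "migration.k8s.io/v1alpha1",
		"kind":       "StorageVersionMigration",
		"metadata":   map[string]interface{}{"generateName": "secrets-migration-"},
		"spec": map[string]interface{}{
			// The resource whose stored objects should be rewritten.
			"resource": map[string]interface{}{
				"group":    "",
				"version":  "v1",
				"resource": "secrets",
			},
		},
	}}
	_, err := client.Resource(gvr).Create(ctx, svm, metav1.CreateOptions{})
	return err
}
```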
### Rollout, Upgrade and Rollback Planning

###### How can a rollout or rollback fail? Can it impact already running workloads?

This may impact workloads, as issuing too many Storage Version Migration requests can increase the latency on the API server.
Also explain the OOM risk when paginating lists whose items are large, due to memory multiplication during encoding.
I updated the answer to add this context. PTAL. Thanks.
PRR looks good, but I think we agree that "on by default" (beta) requires resolving the "this may cause kube-apiserver OOMs and cascading failure even for pagination values due to memory explosion" issue. The 6G allocation caused by 400 one-meg secrets was subpar and not the worst case.
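(Rough arithmetic from those numbers: 400 secrets at roughly 1 MB each is about 400 MB of raw object data, yet around 6 GB was allocated, i.e. roughly a 15x amplification from the repeated decode/convert/encode copies made while serving paginated LISTs.)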
/approve

Leaving lgtm with @deads2k

/lgtm

[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: deads2k, jpbetz, nilekhc.
- Feature is enabled by default
- All of the above documented tests are complete

### Upgrade / Downgrade Strategy
Can you please fill in the upgrade/downgrade strategy and version skew strategy?
###### Will enabling / using this feature result in any new API calls?

Yes. Creation of the Migration Request, which creates the `StorageVersionMigration` resource.
That's actually the least important part.
The most important ones are the LIST requests it sends for the resource types to be migrated, and an UPDATE request for each resource of that type to actually migrate it...
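To make the traffic pattern concrete, a minimal sketch of the per-resource-type calls (this illustrates the general technique, not the exact in-tree implementation): a paginated LIST of every object, plus one unmodified UPDATE per item so the API server re-encodes it at the current storage version.

```go
package main

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
)

// migrate lists every object of the given resource type in chunks and writes
// each one back unchanged, forcing re-serialization at the current storage version.
func migrate(ctx context.Context, client dynamic.Interface, gvr schema.GroupVersionResource) error {
	opts := metav1.ListOptions{Limit: 500}
	for {
		list, err := client.Resource(gvr).List(ctx, opts)
		if err != nil {
			return err
		}
		for i := range list.Items {
			item := &list.Items[i]
			// A no-op update: the object is written back unchanged so the API
			// server stores it at its current storage version.
			// (Real code would need conflict/retry handling.)
			if _, err := client.Resource(gvr).Namespace(item.GetNamespace()).
				Update(ctx, item, metav1.UpdateOptions{}); err != nil {
				return err
			}
		}
		if list.GetContinue() == "" {
			return nil
		}
		opts.Continue = list.GetContinue()
	}
}
```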
Added it in #4276.