Topology beta #238
Conversation
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: msau42. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
/assign @jsafrane @xing-yang
Force-pushed from f95c396 to d8335fc.
pkg/features/features.go (Outdated)
@@ -31,5 +32,5 @@ func init() {
 // defaultKubernetesFeatureGates consists of all known feature keys specific to external-provisioner.
 // To add a new feature, define a key for it above and add it here.
 var defaultKubernetesFeatureGates = map[utilfeature.Feature]utilfeature.FeatureSpec{
-	Topology: {Default: false, PreRelease: utilfeature.Alpha},
+	Topology: {Default: true, PreRelease: utilfeature.Beta},
Right now the sidecar errors if the CSINode object doesn't exist and the topology feature is enabled.
This may break backwards compatibility if you use a newer sidecar version on 1.13, where CSI topology is disabled.
We could either:
- Say that this sidecar version is not supported on versions < 1.14
- Fallback to topology disabled behavior if CSINode doesn't exist
I'm leaning towards the latter, but that also means if we're on 1.14 and there's actually a problem with CSINode/driver registration, then we will continue to successfully provision a volume in a potentially bad zone.
wdyt? @jsafrane @verult @davidz627
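To make the two options above concrete, here is a minimal sketch of the "fallback" option: if the CSINode object is missing (as on a 1.13 cluster), behave as if topology were disabled instead of failing the provision call. All names (`getCSINode`, `provisionTopology`) are illustrative, not the actual external-provisioner API.

```go
package main

import (
	"errors"
	"fmt"
)

// errNotFound stands in for the API "not found" error.
var errNotFound = errors.New("csinode not found")

// getCSINode is a hypothetical lookup; here it pretends the object does
// not exist, as it would on a 1.13 cluster without the CSINode feature.
func getCSINode(nodeName string) (map[string][]string, error) {
	return nil, errNotFound
}

// provisionTopology shows the fallback option being discussed: a missing
// CSINode object degrades to topology-disabled behavior; any other API
// error still fails the provision call.
func provisionTopology(nodeName string) (string, error) {
	keys, err := getCSINode(nodeName)
	if errors.Is(err, errNotFound) {
		// Fallback: provision without topology constraints.
		return "no-topology", nil
	}
	if err != nil {
		return "", fmt.Errorf("error getting CSINode: %v", err)
	}
	return fmt.Sprintf("topology keys: %v", keys), nil
}

func main() {
	result, err := provisionTopology("node-1")
	if err != nil {
		panic(err)
	}
	fmt.Println(result) // falls back to "no-topology" when CSINode is absent
}
```

The downside called out below is exactly this fallback path: a transient API-server error that looks like "not found" would silently provision in a potentially bad zone.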
Actually, now that I think about it more, is that what we want? If the API server is temporarily having problems, then we end up provisioning in the wrong zone.
Another option is that we leave topology off by default (and we offer no compatibility guarantees from alpha->beta, including k/k alpha->beta).
The new sidecar would work on versions <1.14 with Topology enabled though, right? So this is not a compatibility-breaking thing; it's more of a "did you enable the feature on one component but not on another" type of problem. Maybe this is OK if we give a good error message making it very clear what is wrong in this case and how to solve it.
If the user on 1.13 has delayed binding enabled on the storageclass in a single zone cluster, then with the old provisioner, provisioning still succeeds. If they upgrade the sidecar, then provisioning will fail with an error.
Now it's questionable what benefit they would get by enabling delayed binding in a single zone cluster, but there's currently no harm in doing so.
I'm going to make all the API calls return an error if they fail (no fallback), but set the topology feature gate to false. Drivers will need to explicitly enable the topology feature gate after upgrading to 1.14.
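A simplified model of that decision, for illustration only (this is not the real k8s.io/apiserver/pkg/util/feature implementation): the Topology gate defaults to false, so a driver deployment must opt in explicitly, e.g. via a `--feature-gates=Topology=true` style override.

```go
package main

import "fmt"

// featureSpec mirrors the shape of the FeatureSpec values in
// pkg/features/features.go: a default and a pre-release stage.
type featureSpec struct {
	Default    bool
	PreRelease string
}

// gates models the sidecar's known feature gates; Topology is beta but
// still off by default, matching the change described above.
var gates = map[string]featureSpec{
	"Topology": {Default: false, PreRelease: "Beta"},
}

// enabled returns the effective state of a gate: an explicit override
// (e.g. from a --feature-gates flag) wins over the built-in default.
func enabled(overrides map[string]bool, name string) bool {
	if v, ok := overrides[name]; ok {
		return v
	}
	return gates[name].Default
}

func main() {
	fmt.Println(enabled(nil, "Topology"))                               // false: off by default
	fmt.Println(enabled(map[string]bool{"Topology": true}, "Topology")) // true: explicitly opted in
}
```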
For drivers with topology disabled, upgrading the sidecar on 1.13 should not break.
For drivers with topology enabled, upgrading the sidecar on 1.13 will break. I think this is ok given it was an alpha feature.
Discussed with @jsafrane some more. To simplify our documentation and testing matrix, we should just require 1.14 for this new version of the sidecar. Then I can have topology enabled by default.
Makes sense. We then need to maintain the older release branch for those people who use CSI with 1.13.
The Prow-based CI is ready for testing against two different Kubernetes releases. My proposal in https://github.com/kubernetes-csi/csi-driver-host-path/pull/16/commits is to maintain one or more deployments in the master branch, so users (and the CI) will always find what they are looking for there.
mostly lgtm
first (partial) pass
I made substantial changes in order to fail appropriately while still preserving backwards compatibility with topology-disabled behavior. A summary of the changes:
Most of the test cases are under
LGTM. I kind of skimmed the test cases though, so it would be good if someone else can look at those more carefully. The comments are amazing! They really help with understanding.
// CSINode beta feature on K8s apiserver must also be enabled.
// We don't want to fallback and provision in the wrong topology if there's some temporary
// error with the API server.
return nil, fmt.Errorf("error listing CSINodes: %v", err)
If we need to support running newer external provisioner on 1.13, then I think we need to disable topology by default. Otherwise, provisioning on 1.13 will start to hit this error (if the driver also supports csi topology capability).
I've addressed comments and also wrapped the logic to look up the CSINode object to reduce duplication of error handling. I've also turned topology off by default again.
pkg/controller/topology.go (Outdated)
// In either case, provisioning needs to be allowed to proceed.
return nil, nil
// Even though selectedNode is set, we still need to aggregate topology values across
// all nodes in order to find secondary topologies.
"in order to compute a priority ordering of topologies for preferredTopologies"
Strictly speaking, it's only limited to a "secondary" in the case of regional PDs.
There's some desire to be able to limit the number of topologies returned, especially for most plugins that only have a single primary topology: #221
So I wanted to add in some more explicit comments here explaining the reason why we need to return all possible topologies even though a plugin may only need one.
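The point being made above can be sketched in a few lines: even when selectedNode pins the primary topology, the full set of cluster topologies is still needed so a priority ordering can be computed for preferredTopologies. The function name and shapes here are illustrative, not the code in pkg/controller/topology.go.

```go
package main

import "fmt"

// preferredOrder puts the selected node's topology term first (the
// primary), followed by every other aggregated term as a secondary.
// This is why aggregation across all nodes is still required even when
// selectedNode is set.
func preferredOrder(selected string, allTerms []string) []string {
	ordered := []string{selected}
	for _, t := range allTerms {
		if t != selected {
			ordered = append(ordered, t)
		}
	}
	return ordered
}

func main() {
	all := []string{"zone1", "zone2", "zone3"}
	fmt.Println(preferredOrder("zone2", all)) // prints [zone2 zone1 zone3]
}
```

A plugin that only needs one topology can simply take the first term; plugins like regional PD can use the secondary terms as well.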
@@ -88,6 +91,13 @@ func GenerateAccessibilityRequirements(
	requisiteTerms = flatten(allowedTopologies)
}

// It might be possible to reach here if:
nit: At the beginning of this function, could you describe the high-level compatibility considerations we discussed offline? I found it hard to see the high-level picture reading through the different levels of conditional branches.
Would it also be helpful to label which pieces of logic can be removed after 1.13 is no longer a supported version?
I added a giant comment at the top. Let me know if this format works or you prefer something else
It's super clear. Thanks!
topologyKeys: nil,
expectedRequisite: nil,
},
// Driver on node has not been updated to report topology keys
"random node: missing keys": {
Should we have a test case where there's only a single CSINode with keys, in which there should be no errors?
Do the existing "same keys and values across cluster" and "different values across cluster" cases address this?
Not quite. I'm thinking about the case where k8s nodes are partially upgraded to 1.14, so some nodes have keys but others don't
Something like:
nodeLabels: []map[string]string{
{"com.example.csi/zone": "zone1"},
{"com.example.csi/zone": "zone2"},
{"com.example.csi/zone": "zone2"},
},
topologyKeys: []map[string][]string{
{testDriverName: []string{"com.example.csi/zone"}},
},
in which the requisite should only have "com.example.csi/zone": "zone1", right?
Added 2 more test cases.
So actually, in your case above, requisite has both zone1 and zone2, even though the other nodes don't have a CSINode. That's because CSINode is only used to determine which topology keys to use; we then iterate over all nodes that have the topology key.
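A minimal sketch of that behavior, using the shapes from the test case above (function name and structure are illustrative, not the actual implementation): CSINode objects only contribute the set of topology keys, and requisite values are then gathered from every node carrying those label keys, even nodes without a CSINode.

```go
package main

import "fmt"

// requisiteValues gathers topology values the way described above:
// topologyKeys (from CSINode objects) only determine which label keys
// to look at; the values come from all nodes that carry those labels.
func requisiteValues(topologyKeys []map[string][]string, nodeLabels []map[string]string, driver string) []string {
	// Union of keys reported by any CSINode for this driver.
	keys := map[string]bool{}
	for _, tk := range topologyKeys {
		for _, k := range tk[driver] {
			keys[k] = true
		}
	}
	// Collect distinct values from every node's labels, regardless of
	// whether that node has a CSINode of its own.
	seen := map[string]bool{}
	var values []string
	for _, labels := range nodeLabels {
		for k, v := range labels {
			if keys[k] && !seen[v] {
				seen[v] = true
				values = append(values, v)
			}
		}
	}
	return values
}

func main() {
	topologyKeys := []map[string][]string{
		{"test-driver": {"com.example.csi/zone"}},
	}
	nodeLabels := []map[string]string{
		{"com.example.csi/zone": "zone1"},
		{"com.example.csi/zone": "zone2"},
		{"com.example.csi/zone": "zone2"},
	}
	// Both zone1 and zone2 appear, even though only one node has a
	// CSINode reporting keys.
	fmt.Println(requisiteValues(topologyKeys, nodeLabels, "test-driver"))
}
```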
Addressed latest comments.
@@ -31,5 +32,5 @@ func init() {
 // defaultKubernetesFeatureGates consists of all known feature keys specific to external-provisioner.
 // To add a new feature, define a key for it above and add it here.
 var defaultKubernetesFeatureGates = map[utilfeature.Feature]utilfeature.FeatureSpec{
-	Topology: {Default: false, PreRelease: utilfeature.Alpha},
+	Topology: {Default: false, PreRelease: utilfeature.Beta},
Do we need an issue tracking when to enable this by default?
I'm not sure we actually need to. It's important that Kubernetes enables topology by default since it needs to be able to support all kinds of drivers, but since this flag only controls a specific driver implementation, I'm not sure it gives much benefit being enabled by default.
ACK
/hold
E2E tests are passing (I had to explicitly enable topology in the driver because it's now turned off by default). I'll keep the hold to squash commits after review.
/lgtm
Squashed commits
/lgtm
…19rc Rework to klog/v2
Move from alpha CRDs to beta in-tree CSINode object.
I am temporarily copying the new API files into vendor