Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Topology beta #238

Merged
merged 3 commits into from
Mar 27, 2019
Merged

Topology beta #238

merged 3 commits into from
Mar 27, 2019

Conversation

msau42
Copy link
Collaborator

@msau42 msau42 commented Mar 2, 2019

Move from alpha CRDs to beta in-tree CSINode object.

I am temporarily copying the new API files into vendor

@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Mar 2, 2019
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: msau42

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Mar 2, 2019
@msau42
Copy link
Collaborator Author

msau42 commented Mar 2, 2019

/assign @jsafrane @xing-yang

@k8s-ci-robot k8s-ci-robot added the size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. label Mar 2, 2019
@msau42 msau42 force-pushed the topology-beta branch 2 times, most recently from f95c396 to d8335fc Compare March 3, 2019 06:06
README.md Outdated Show resolved Hide resolved
@@ -31,5 +32,5 @@ func init() {
// defaultKubernetesFeatureGates consists of all known feature keys specific to external-provisioner.
// To add a new feature, define a key for it above and add it here.
var defaultKubernetesFeatureGates = map[utilfeature.Feature]utilfeature.FeatureSpec{
Topology: {Default: false, PreRelease: utilfeature.Alpha},
Topology: {Default: true, PreRelease: utilfeature.Beta},
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right now the sidecar errors if the CSINode object doesn't exist + the topology feature is enabled.

This may break backwards compatibility if you use a newer sidecar version on 1.13 where csi topology is disabled.

We could either:

  • Say that this sidecar version is not supported on versions < 1.14
  • Fallback to topology disabled behavior if CSINode doesn't exist

I'm leaning towards the latter, but that also means if we're on 1.14 and there's actually a problem with CSINode/driver registration, then we will continue to successfully provision a volume in a potentially bad zone.

wdyt? @jsafrane @verult @davidz627

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

actually now that I think about it more, is that what we want? If the api server is temporarily having problems, then we end up provisioning in the wrong zone.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

another option is that we leave topology off by default (and we offer no compatibility guarantees from alpha->beta, including k/k alpha->beta).

Copy link
Contributor

@davidz627 davidz627 Mar 14, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The new sidecar would work on version <1.14 with Topology enabled though right? So this is not a compatibility breaking thing. More of a "did you enable the feature on one component but not enable it on another" type problem. Maybe this is ok if we give a good error message making it very clear what is wrong in this case and how to solve it

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the user on 1.13 has delayed binding enabled on the storageclass in a single zone cluster, then with the old provisioner, provisioning still succeeds. If they upgrade the sidecar, then provisioning will fail with an error.

Now it's questionable what benefit they would get by enabling delayed binding in a single zone cluster, but there's currently no harm in doing so.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm going to make all the APIs call return an error if they fail (no fallback), but set topology feature gate to false. Drivers will need to explicitly enable the topology feature gate after upgrading to 1.14.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For drivers with topology disabled, upgrading the sidecar on 1.13 should not break.

For drivers with topology enabled, upgrading the sidecar on 1.13 will break. I think this is ok given it was an alpha feature.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Discussed with @jsafrane some more. To simplify our documentation and testing matrix, we should just require 1.14 for this new version of the sidecar. So then I can have topology enabled by default

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense. We then need to maintain the older release branch for those people who use CSI with 1.13.

The Prow-based CI is ready for testing against two different Kubernetes releases. My proposal in https://github.com/kubernetes-csi/csi-driver-host-path/pull/16/commits is to maintain one or more deployments in the master branch, so users (and the CI) will always find what they look for there.

@msau42 msau42 changed the title WIP Topology beta Topology beta Mar 15, 2019
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Mar 15, 2019
Copy link
Contributor

@davidz627 davidz627 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

mostly lgtm

deploy/kubernetes/rbac.yaml Outdated Show resolved Hide resolved
pkg/controller/topology.go Show resolved Hide resolved
pkg/controller/topology.go Outdated Show resolved Hide resolved
Copy link
Contributor

@verult verult left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

first (partial) pass

pkg/controller/topology.go Show resolved Hide resolved
pkg/controller/topology.go Outdated Show resolved Hide resolved
pkg/controller/topology_test.go Show resolved Hide resolved
pkg/controller/topology_test.go Show resolved Hide resolved
@msau42 msau42 changed the title Topology beta WIP: Topology beta Mar 18, 2019
@k8s-ci-robot k8s-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Mar 18, 2019
@msau42 msau42 changed the title WIP: Topology beta Topology beta Mar 19, 2019
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Mar 19, 2019
@msau42
Copy link
Collaborator Author

msau42 commented Mar 19, 2019

I made substantial changes in order to fail appropriately while still preserving backwards compatibility with topology disabled behavior. A summary of the changes:

  • Immediate binding:
    • If a beta CSINode object that registers topology keys is found, at least one node has been upgraded, fail appropriately if needed
    • Otherwise, fallback to "topology disabled" behavior
  • Delayed binding:
    • If selected node doesn't have beta CSINode object and the node is < 1.14, fallback to "topology disabled" behavior
    • Otherwise require that topology keys/labels are set.

Most of the test cases are under TestTopologyAggregation.

Copy link
Contributor

@davidz627 davidz627 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. kind of skimmed the test cases though so if someone else can look at those more carefully. The comments are amazing!! They really help with understanding

pkg/controller/topology.go Outdated Show resolved Hide resolved
pkg/controller/controller_test.go Outdated Show resolved Hide resolved
// CSINode beta feature on K8s apiserver must also be enabled.
// We don't want to fallback and provision in the wrong topology if there's some temporary
// error with the API server.
return nil, fmt.Errorf("error listing CSINodes: %v", err)
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we need to support running newer external provisioner on 1.13, then I think we need to disable topology by default. Otherwise, provisioning on 1.13 will start to hit this error (if the driver also supports csi topology capability).

@msau42
Copy link
Collaborator Author

msau42 commented Mar 25, 2019

I've addressed comments and also wrapped the logic to lookup the CSINode object to reduce duplication of error handling. I've also turned topology off by default again.

pkg/controller/topology.go Outdated Show resolved Hide resolved
pkg/controller/topology.go Outdated Show resolved Hide resolved
pkg/controller/topology.go Outdated Show resolved Hide resolved
// In either case, provisioning needs to be allowed to proceed.
return nil, nil
// Even though selectedNode is set, we still need to aggregate topology values across
// all nodes in order to find secondary topologies.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"in order to compute a priority ordering of topologies for preferredTopologies"

Strictly speaking, it's only limited to a "secondary" in the case of regional PDs

Copy link
Collaborator Author

@msau42 msau42 Mar 26, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's some desire to be able to limit the number of topologies returned, especially for most plugins that only have a single primary topology: #221

So I wanted to add in some more explicit comments here explaining the reason why we need to return all possible topologies even though a plugin may only need one.

@@ -88,6 +91,13 @@ func GenerateAccessibilityRequirements(
requisiteTerms = flatten(allowedTopologies)
}

// It might be possible to reach here if:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: At the beginning of this function, could you describe the high-level compatibility considerations we discussed offline? I found it hard to see the high-level picture reading through the different levels of conditional branches.

Would it also be helpful to label which pieces of logic can be removed after 1.13 is no longer a supported version?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added a giant comment at the top. Let me know if this format works or you prefer something else

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's super clear. Thanks!

pkg/controller/topology_test.go Outdated Show resolved Hide resolved
topologyKeys: nil,
expectedRequisite: nil,
},
// Driver on node has not been updated to report topology keys
"random node: missing keys": {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we have a test case where there's only a single CSINode with keys, in which there should be no errors?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do the existing "same keys and values across cluster" and "different values across cluster" cases address this?

Copy link
Contributor

@verult verult Mar 26, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not quite. I'm thinking about the case where k8s nodes are partially upgraded to 1.14, so some nodes have keys but others don't

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Something like:

nodeLabels: []map[string]string{
				{"com.example.csi/zone": "zone1"},
				{"com.example.csi/zone": "zone2"},
				{"com.example.csi/zone": "zone2"},
			},
			topologyKeys: []map[string][]string{
				{testDriverName: []string{"com.example.csi/zone"}},
			},

in which the requisite should only have "com.example.csi/zone": "zone1", right?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added 2 more test cases.

So actually, we in your case above, requisite has both zone1 and zone2. even though the other nodes don't have csiNode. It's because csiNode is only used to determine which topology keys to use, but we just iterate over all nodes that have the topology key.

pkg/controller/topology.go Outdated Show resolved Hide resolved
@msau42
Copy link
Collaborator Author

msau42 commented Mar 26, 2019

addressed latest comments

@@ -31,5 +32,5 @@ func init() {
// defaultKubernetesFeatureGates consists of all known feature keys specific to external-provisioner.
// To add a new feature, define a key for it above and add it here.
var defaultKubernetesFeatureGates = map[utilfeature.Feature]utilfeature.FeatureSpec{
Topology: {Default: false, PreRelease: utilfeature.Alpha},
Topology: {Default: false, PreRelease: utilfeature.Beta},
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need an issue tracking when to enable this by default?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm unsure if we need actually need to. It's important that kubernetes enables topology by default since it needs to be able to support all kinds of drivers, but since this flag is only controlling a specific driver implementation, I'm not sure if it actually gives much benefit being enabled by default.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ACK

@@ -88,6 +91,13 @@ func GenerateAccessibilityRequirements(
requisiteTerms = flatten(allowedTopologies)
}

// It might be possible to reach here if:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's super clear. Thanks!

topologyKeys: nil,
expectedRequisite: nil,
},
// Driver on node has not been updated to report topology keys
"random node: missing keys": {
Copy link
Contributor

@verult verult Mar 26, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not quite. I'm thinking about the case where k8s nodes are partially upgraded to 1.14, so some nodes have keys but others don't

@msau42
Copy link
Collaborator Author

msau42 commented Mar 26, 2019

/hold
debugging e2e test failures

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Mar 26, 2019
@msau42
Copy link
Collaborator Author

msau42 commented Mar 26, 2019

E2E tests are passing (I had to explicitly enable topology in the driver because it's now turned it off by default).

I'll keep the hold to squash commits after review.

@verult
Copy link
Contributor

verult commented Mar 27, 2019

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Mar 27, 2019
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Mar 27, 2019
@msau42
Copy link
Collaborator Author

msau42 commented Mar 27, 2019

Squashed commits
/hold cancel

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Mar 27, 2019
@verult
Copy link
Contributor

verult commented Mar 27, 2019

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Mar 27, 2019
@k8s-ci-robot k8s-ci-robot merged commit c8043f0 into kubernetes-csi:master Mar 27, 2019
kbsonlong pushed a commit to kbsonlong/external-provisioner that referenced this pull request Dec 29, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lgtm "Looks good to me", indicates that a PR is ready to be merged. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants