Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add migration upgrade/downgrade test #927

Merged
merged 2 commits into from
Jun 25, 2021

Conversation

wongma7
Copy link
Contributor

@wongma7 wongma7 commented Jun 8, 2021

Is this a bug fix or adding new feature?

What is this PR about? / Why do we need it? see #920. Need to make sure toggling migration on/off works correctly.

For kops, toggling is easy by editing the cluster spec, that's what this test does
For EKS, ATM it's impossible to get a cluster with migration enabled in the first place.

What testing is done?
I am testing it locally.

ginkgo -nodes=1 -v -- --state s3://aws-efs-csi-driver-e2e --name test-cluster-41400.k8s.local --delete-namespace=false --max-nodes-to-gather-from=0

having this test as part of CI is not a criteria for enabling CSIMigrationAWS on by default, but at least it has to be manually run. It's very slow (even with just 1 master + 1 node instance) so probably it should not be a PR blocking job, but a periodic.

TODO:

@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Jun 8, 2021
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: wongma7

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot requested review from ayberk and gnufied June 8, 2021 23:44
@k8s-ci-robot k8s-ci-robot added approved Indicates a PR has been approved by an approver from all required OWNERS files. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. labels Jun 8, 2021
@coveralls
Copy link

coveralls commented Jun 8, 2021

Coverage Status

Coverage increased (+0.008%) to 79.586% when pulling d3a95a4 on wongma7:upgradetest into 54f1649 on kubernetes-sigs:master.

@wongma7 wongma7 force-pushed the upgradetest branch 2 times, most recently from 0c7e98c to 33f9376 Compare June 9, 2021 18:25
@wongma7
Copy link
Contributor Author

wongma7 commented Jun 9, 2021

This is only running should call csi plugin for all operations after migration toggled on I will run should call in-tree plugin for all operations after migration toggled off next.

(2177s = 37 minutes so yeah it takes a while to toggle the gates since if you toggle kubelet, need to replace master+node, if you toggle KCM, need to replace master)

Ran 3 of 3 Specs in 2177.249 seconds
SUCCESS! -- 3 Passed | 0 Failed | 0 Pending | 0 Skipped

@wongma7 wongma7 changed the title WIP: Add migration upgrade/downgrade/skew test Add migration upgrade/downgrade test Jun 9, 2021
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jun 9, 2021
@wongma7
Copy link
Contributor Author

wongma7 commented Jun 9, 2021

should call in-tree plugin for all operations after migration toggled off is failing because of the different IAM policies relied on by kops in-tree KCM vs CSI controller.

kops configures the master instance with policy such that it (and kube-controller-manager running on it) is only allowed to AttachVolumes with tag KubernetesCluster": "test-cluster-31300.k8s.local" . This "legacy" tag is added by kube-controller-manager/cloud-provider for every created EBS volume: https://github.com/kubernetes/kubernetes/blob/82878d208ba0da50392949a045bc5e6647400d85/staging/src/k8s.io/legacy-cloud-providers/aws/tags.go#L303.
CSI doesn't add this tag by default since it doesn't know the cluster name unless provided via flag:
https://github.com/kubernetes-sigs/aws-ebs-csi-driver#tagging

  1. CSI provisions volume, it's missing tag KubernetesCluster": "test-cluster-31300.k8s.local"
  2. volume is successfully attached, detached
  3. migration toggled off
  4. KCM fails to re-attach volume because it's missing tag KubernetesCluster": "test-cluster-31300.k8s.local"

The reverse situation should call csi plugin for all operations after migration toggled on works fine provided CSI is created with an example policy with wildcard AttachVolume permissions like https://github.com/kubernetes-sigs/aws-ebs-csi-driver/blob/master/docs/example-iam-policy.json#L8

  1. KCM provisions volume, it has tag KubernetesCluster": "test-cluster-31300.k8s.local"
  2. volume is successfully attached, detached
  3. migration toggled on
  4. CSI successfully re-attaches volumes because it has wildcard permissions

HOWEVER....our example policy restricts DeleteVolume permissions to volumes created by itself...so if KCM creates a volume, then you toggle migration, then CSI won't have permission to delete it.

{
            "Action": [
                "ec2:AttachVolume",
                "ec2:AuthorizeSecurityGroupIngress",
                "ec2:CreateRoute",
                "ec2:DeleteRoute",
                "ec2:DeleteSecurityGroup",
                "ec2:DeleteVolume",
                "ec2:DetachVolume",
                "ec2:RevokeSecurityGroupIngress"
            ],
            "Condition": {
                "StringEquals": {
                    "ec2:ResourceTag/KubernetesCluster": "test-cluster-31300.k8s.local"
                }
            },
            "Effect": "Allow",
            "Resource": [
                "*"
            ]
        },

@wongma7
Copy link
Contributor Author

wongma7 commented Jun 9, 2021

Summary, assuming that user has configured default kops instance IAM policy and CSI example IAM policy:

AttachVolume

  • If CSI creates a volume then you toggle migration off, KCM will fail to attach it because it's only allowed to attach volumes with tag KubernetesCluster
  • If KCM creates a volume then you toggle migration on, CSI will be able to attach it because it's allowed to attach all volumes

DeleteVolume

  • If CSI creates a volume then you toggle migration off, KCM will fail to delete it because it's only allowed to delete volumes with tag KubernetesCluster (the test didn't even reach this point since the attach assertion failed)
  • If KCM creates a volume then you toggle migration on, CSI will fail to delete it because it's only allowed to delete volumes with tag CSIVolumeName or ebs.csi.aws.com/cluster (my test passed this because I am using a less-restrictive policy for the CSI driver than the example)

DetachVolume

  • we need not worry about this case because there is no situation in which CSI will perform Attach and KCM Detach (or vice-versa) assuming that the node draining procedure has been followed.
  • edit: we do need to worry in the sense that , of course CSI/KCM needs permission to both Attach and Detach a given volume whether it was provisioned by CSI or KCM. I meant to say that if we fix the AttachVolume case then the DetachVolume case gets fixed too.

IIRC this was a known issue already, hence we added support for supplying tags to the driver and such, but, either way we should try to solve it in such a way that everything "just works" without user having to supply tags.

@wongma7
Copy link
Contributor Author

wongma7 commented Jun 9, 2021

ref: #530 , this is where jsafrane already anticipated for migration purposes the installer of the driver must set clustername. HOwever, it only sets the tag kubernetes.io/cluster/" , not KubernetesCluster. KubernetesCluster is supposed to be legacy but kops relies on it still

@jsafrane
Copy link
Contributor

I think it's ok to add KubernetesCluster to volumes provisioned by the CSI driver, even as part of --k8s-tag-cluster-id cmdline option. As seen above, missing this tag breaks clusters deployment by a popular tool.

@wongma7 wongma7 changed the title Add migration upgrade/downgrade test WIP: Add migration upgrade/downgrade test Jun 11, 2021
@k8s-ci-robot k8s-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jun 11, 2021
@wongma7 wongma7 force-pushed the upgradetest branch 4 times, most recently from d3a95a4 to f1ac38b Compare June 18, 2021 19:01
@wongma7 wongma7 force-pushed the upgradetest branch 3 times, most recently from 64e48b7 to bf1ca9d Compare June 19, 2021 00:45
@wongma7
Copy link
Contributor Author

wongma7 commented Jun 21, 2021

• [SLOW TEST:1713.611 seconds]
Kops
/home/ANT.AMAZON.COM/mattwon/go/src/github.com/kubernetes-sigs/aws-ebs-csi-driver/tests/e2e-upgrade/e2e_test.go:56
  should call in-tree plugin for all operations after migration toggled off
  /home/ANT.AMAZON.COM/mattwon/go/src/github.com/kubernetes-sigs/aws-ebs-csi-driver/tests/e2e-upgrade/e2e_test.go:85
------------------------------

Ran 1 of 1 Specs in 1713.611 seconds
SUCCESS! -- 1 Passed | 0 Failed | 0 Pending | 0 Skipped
PASS

Ginkgo ran 1 suite in 28m38.932813632s
Test Suite Passed

@wongma7
Copy link
Contributor Author

wongma7 commented Jun 21, 2021

• [SLOW TEST:1813.686 seconds]
Kops
/home/ANT.AMAZON.COM/mattwon/go/src/github.com/kubernetes-sigs/aws-ebs-csi-driver/tests/e2e-upgrade/e2e_test.go:56
  should call csi plugin for all operations after migration toggled on
  /home/ANT.AMAZON.COM/mattwon/go/src/github.com/kubernetes-sigs/aws-ebs-csi-driver/tests/e2e-upgrade/e2e_test.go:70
------------------------------

Ran 1 of 1 Specs in 1813.687 seconds
SUCCESS! -- 1 Passed | 0 Failed | 0 Pending | 0 Skipped
PASS

Ginkgo ran 1 suite in 30m29.244804733s
Test Suite Passed

@k8s-ci-robot k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. labels Jun 21, 2021
@wongma7 wongma7 changed the title WIP: Add migration upgrade/downgrade test Add migration upgrade/downgrade test Jun 21, 2021
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jun 21, 2021
@ayberk
Copy link
Contributor

ayberk commented Jun 24, 2021

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jun 24, 2021
@k8s-ci-robot k8s-ci-robot merged commit e175fe6 into kubernetes-sigs:master Jun 25, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lgtm "Looks good to me", indicates that a PR is ready to be merged. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants