
Initial implementation to add external health monitor controller and agent #3

Merged

Conversation

NickrenREN
Contributor

@NickrenREN NickrenREN commented Mar 6, 2020

What type of PR is this?
/kind feature

What this PR does / why we need it:
This PR adds an external health monitor controller and an external health monitor agent.

Which issue(s) this PR fixes:
Fixes #1
Fixes #2

Does this PR introduce a user-facing change?:

Adds an external health monitor controller and an external health monitor agent.

@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Mar 6, 2020
@k8s-ci-robot k8s-ci-robot added approved Indicates a PR has been approved by an approver from all required OWNERS files. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. labels Mar 6, 2020
@NickrenREN
Contributor Author

NickrenREN commented Mar 6, 2020

Please ignore this PR for now.

It is far from complete.

It's just a WIP PR to show progress.

@NickrenREN NickrenREN force-pushed the initial-implementation branch 2 times, most recently from 86e741f to 4bcef6a on March 9, 2020 17:06
@NickrenREN NickrenREN force-pushed the initial-implementation branch 2 times, most recently from e92257d to e20cae9 on March 11, 2020 07:16
@NickrenREN NickrenREN force-pushed the initial-implementation branch 2 times, most recently from af812e9 to e353fb1 on March 11, 2020 16:21
@xing-yang
Contributor

/test continuous-integration/travis-ci/pr

@NickrenREN NickrenREN force-pushed the initial-implementation branch from e353fb1 to e437396 on April 17, 2020 04:32
@NickrenREN NickrenREN changed the title WIP: Initial implementation Initial implementation Apr 23, 2020
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Apr 23, 2020
os.Exit(1)
}

supportGetVolume, err := supportGetVolume(ctx, csiConn)
Contributor
We should check LIST_VOLUMES support first. If both LIST_VOLUMES and VOLUME_CONDITION are supported, we should use that to retrieve volume health; it performs much better than making individual GetVolume calls.

If LIST_VOLUMES is not supported, then we check the combination of GET_VOLUME and VOLUME_CONDITION.
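For illustration, a capability probe implementing this preference order might look like the sketch below (hypothetical helper name and wiring, not this PR's code; assumes a CSI spec version that defines VOLUME_CONDITION and GET_VOLUME):

```go
import (
	"context"

	"github.com/container-storage-interface/spec/lib/go/csi"
	"google.golang.org/grpc"
)

// probeControllerCapabilities decides which health-retrieval path to use:
// prefer LIST_VOLUMES + VOLUME_CONDITION (batch), otherwise fall back to
// GET_VOLUME + VOLUME_CONDITION (one RPC per volume).
func probeControllerCapabilities(ctx context.Context, conn *grpc.ClientConn) (useListVolumes, useGetVolume bool, err error) {
	rsp, err := csi.NewControllerClient(conn).ControllerGetCapabilities(ctx, &csi.ControllerGetCapabilitiesRequest{})
	if err != nil {
		return false, false, err
	}
	caps := map[csi.ControllerServiceCapability_RPC_Type]bool{}
	for _, c := range rsp.GetCapabilities() {
		if rpc := c.GetRpc(); rpc != nil {
			caps[rpc.GetType()] = true
		}
	}
	hasCondition := caps[csi.ControllerServiceCapability_RPC_VOLUME_CONDITION]
	useListVolumes = hasCondition && caps[csi.ControllerServiceCapability_RPC_LIST_VOLUMES]
	useGetVolume = !useListVolumes && hasCondition && caps[csi.ControllerServiceCapability_RPC_GET_VOLUME]
	return useListVolumes, useGetVolume, nil
}
```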

}

// TODO: move this to csi-lib-utils
// TODO: do we need to consider ControllerServiceCapability_RPC_LIST_VOLUMES_VOLUME_HEALTH ?
Contributor
Oh, I added a comment above that LIST_VOLUMES should be checked, and that it should be the preferred way to retrieve information in batch.

Also, you may be using an earlier version of the spec? LIST_VOLUMES_VOLUME_HEALTH has been removed; now we just have one VOLUME_CONDITION, which is used for both LIST_VOLUMES and GET_VOLUME.
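To illustrate what that unification means for callers, here is a hedged sketch (hypothetical helper, not this PR's code): the same csi.VolumeCondition message is read from either RPC.

```go
// conditionForVolume fetches the health condition of one volume, via either
// the batch RPC (ListVolumes) or the per-volume RPC (ControllerGetVolume).
func conditionForVolume(ctx context.Context, cc csi.ControllerClient, volumeID string, useList bool) (*csi.VolumeCondition, error) {
	if useList {
		rsp, err := cc.ListVolumes(ctx, &csi.ListVolumesRequest{})
		if err != nil {
			return nil, err
		}
		for _, e := range rsp.GetEntries() {
			if e.GetVolume().GetVolumeId() == volumeID {
				return e.GetStatus().GetVolumeCondition(), nil
			}
		}
		return nil, nil // not in this page; real code would paginate with StartingToken
	}
	rsp, err := cc.ControllerGetVolume(ctx, &csi.ControllerGetVolumeRequest{VolumeId: volumeID})
	if err != nil {
		return nil, err
	}
	return rsp.GetStatus().GetVolumeCondition(), nil
}
```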

showVersion = flag.Bool("version", false, "Show version.")
timeout = flag.Duration("timeout", 15*time.Second, "Timeout for waiting for attaching or detaching the volume.")
workerThreads = flag.Uint("worker-threads", 10, "Number of pv monitor worker threads")
kubeletRootPath = flag.String("kubelet-root-path", "/var/lib/kubelet", "The root path pf kubelet.")
Contributor
s/pf/of

monitorInterval = flag.Duration("monitor-interval", 1*time.Minute, "Interval for controller to check volumes health condition.")

kubeconfig = flag.String("kubeconfig", "", "Absolute path to the kubeconfig file. Required only when running out of cluster.")
resync = flag.Duration("resync", 10*time.Minute, "Resync interval of the controller.")
metricsManager.SetDriverName(storageDriver)
metricsManager.StartMetricsEndpoint(*metricsAddress, *metricsPath)

supportGetNodeVolumeHealth, err := supportGetNodeVolumeHealth(ctx, csiConn)
Contributor
It's renamed to VOLUME_CONDITION in the CSI spec, so change this to supportNodeGetVolumeCondition.

os.Exit(1)
}
if !supportGetNodeVolumeHealth {
klog.V(2).Infof("CSI driver does not support VolumeHealth Node Capability, exiting")
Contributor
VolumeCondition

continue
}
t := rpc.GetType()
if t == csi.NodeServiceCapability_RPC_GET_VOLUME_STATS {
Contributor
NodeServiceCapability_RPC_VOLUME_CONDITION
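A sketch of the node-side check with that rename applied (helper name as suggested above; assumes a CSI spec version that defines the node capability):

```go
// supportNodeGetVolumeCondition checks whether the node plugin advertises
// the VOLUME_CONDITION node-service capability.
func supportNodeGetVolumeCondition(ctx context.Context, conn *grpc.ClientConn) (bool, error) {
	rsp, err := csi.NewNodeClient(conn).NodeGetCapabilities(ctx, &csi.NodeGetCapabilitiesRequest{})
	if err != nil {
		return false, err
	}
	for _, c := range rsp.GetCapabilities() {
		if rpc := c.GetRpc(); rpc != nil && rpc.GetType() == csi.NodeServiceCapability_RPC_VOLUME_CONDITION {
			return true, nil
		}
	}
	return false, nil
}
```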


metricsAddress = flag.String("metrics-address", "", "The TCP network address where the prometheus metrics endpoint will listen (example: `:8080`). The default is empty string, which means metrics endpoint is disabled.")
metricsPath = flag.String("metrics-path", "/metrics", "The HTTP path where prometheus metrics will be exposed. Default is `/metrics`.")
)
Contributor
Can you also add retryIntervalStart and retryIntervalMax?
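For reference, a sketch of what those flags could look like (names per the request; the defaults here are illustrative assumptions modeled on other CSI sidecars), typically wired into an exponential-backoff rate limiter:

```go
var (
	retryIntervalStart = flag.Duration("retry-interval-start", time.Second, "Initial retry interval of a failed volume health check. It doubles with each failure, up to retry-interval-max.")
	retryIntervalMax   = flag.Duration("retry-interval-max", 5*time.Minute, "Maximum retry interval of a failed volume health check.")
)

// Typical use: build the work queue with an exponential failure rate limiter.
//   workqueue.NewItemExponentialFailureRateLimiter(*retryIntervalStart, *retryIntervalMax)
```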

}

// TODO: move this to csi-lib-utils
func supportGetVolumeCondition(ctx context.Context, csiConn *grpc.ClientConn) (supportGetVolumeHealth bool, err error) {
Contributor
supportControllerGetVolumeCondition

Add one for ListVolumes

os.Exit(1)
}

supportGetVolume, err := supportGetVolume(ctx, csiConn)
Contributor
supportControllerGetVolume

}

// TODO: move this to csi-lib-utils
func supportGetVolume(ctx context.Context, csiConn *grpc.ClientConn) (supportGetVolume bool, err error) {
Contributor
supportControllerGetVolume

- apiGroups: [""]
resources: ["pods"]
verbs: ["get", "list", "watch"]
#Secret permission is optional.
Contributor
I don't think we need secrets yet for get and list.

mountPath: /var/lib/csi/sockets/pluginproxy/

- name: mock-driver
image: quay.io/k8scsi/mock-driver:canary
Contributor
Are you going to submit a PR in csi-test to implement volume health in the mock-driver? Otherwise we can't test this.

Contributor Author
I will test this on my local env first, and can send a PR in csi-test if needed.

- apiGroups: [""]
resources: ["pods"]
verbs: ["get", "list", "watch"]
#Secret permission is optional.
Contributor
I don't think we need Secrets for this.

go.mod Outdated
go 1.12

require (
github.com/container-storage-interface/spec v1.2.0-rc1.0.20200415022618-e129a75169c1
Contributor
Does this have the latest changes we need for VolumeCondition? Should we use master for now?

go.mod Outdated
golang.org/x/tools v0.0.0-20190524140312-2c0ae7006135 // indirect
google.golang.org/grpc v1.28.0
honnef.co/go/tools v0.0.0-20190523083050-ea95bdfd59fc // indirect
k8s.io/api v0.17.0
Contributor
Please update everything to v0.18.0

// TODO: support inline csi volumes
for _, volume := range pod.Spec.Volumes {
if volume.PersistentVolumeClaim != nil {
pvc, err := agent.pvcLister.PersistentVolumeClaims(pod.Namespace).Get(volume.PersistentVolumeClaim.ClaimName)
Contributor
I think with v0.18.0, you'll need to have a context and options in all these API calls as well.
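For example (a sketch): direct clientset calls in client-go v0.18 take a context and an options value, while informer listers like the `pvcLister` above keep their old signatures.

```go
import (
	"context"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// getPVC shows the v0.18-style clientset call: Get now requires a
// context.Context and a metav1.GetOptions value.
func getPVC(ctx context.Context, client kubernetes.Interface, namespace, name string) (*v1.PersistentVolumeClaim, error) {
	return client.CoreV1().PersistentVolumeClaims(namespace).Get(ctx, name, metav1.GetOptions{})
}
```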

@@ -0,0 +1,65 @@
/*
Contributor
Do we need a sub-folder "pv-monitor-controller" under "controller"? Should the files just be directly under "controller"?

Contributor Author
Just in case we want to add new controllers.

Contributor
We'll change the directory structure when there's a need for new controllers.
For now, please remove the extra layer of directory.

Contributor Author
ok, done

@@ -0,0 +1,74 @@
/*
Contributor
Do we need a sub-folder "pv-monitor-agent" under "agent"? Should the files just be directly under "agent"?

volumeListAndAddInterval = flag.Duration("volume-list-add-interval", 5*time.Minute, "Time interval for listing volumes and add them to queue")
nodeListAndAddInterval = flag.Duration("node-list-add-interval", 5*time.Minute, "Time interval for listing nodess and add them to queue")
workerThreads = flag.Uint("worker-threads", 10, "Number of pv monitor worker threads")
enableNodeWatcher = flag.Bool("enable-node-watcher", false, "whether we want to enable node watcher, node watcher will only have effects on local PVs now, it may be useful for block storages too, will take this into account later.")
Contributor
> it may be useful for block storages too, will take this into account later

"Block storage"? Do you mean "remote storage"?
If this only works for local PVs now, how do you verify and stop the user from enabling this feature when the PV is not a local PV?
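One possible guard (a hedged sketch, not code from this PR) is to check the PV's volume source and only let the node watcher act on local PVs:

```go
import v1 "k8s.io/api/core/v1"

// isLocalPV reports whether the PV is backed by a local volume source;
// the node watcher could skip any PV for which this returns false.
func isLocalPV(pv *v1.PersistentVolume) bool {
	return pv.Spec.Local != nil
}
```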

showVersion = flag.Bool("version", false, "Show version.")
timeout = flag.Duration("timeout", 15*time.Second, "Timeout for waiting for attaching or detaching the volume.")
listVolumesInterval = flag.Duration("list-volumes-interval", 5*time.Minute, "Time interval for calling ListVolumes RPC to check volumes' health condition")
volumeListAndAddInterval = flag.Duration("volume-list-add-interval", 5*time.Minute, "Time interval for listing volumes and add them to queue")
Contributor
This option is used when ListVolumes is not supported? Should it be called GetVolumeAndAddInterval instead?

<-stopCh
}

func (agent *PVMonitorAgent) atLeastOneFieldIsEmpty(podWithPVItem *PodWithPVItem) bool {
func (agent *PVMonitorAgent) isPodWithPVItemInValid(podWithPVItem *PodWithPVItem) bool {
Contributor
s/InValid/Invalid

}

klog.V(4).Infof("Started PV processing %q", podWithPV.pvName)
klog.V(6).Infof("Started PV processing %q", podWithPV.pvName)
Contributor
We usually use V(5) for debugging. Can you change V(6) to V(5)?

@fengzixu
Contributor

fengzixu commented Jun 27, 2020

After we merged the mock-driver PR, I tried to deploy external-health-monitor in my personal cluster (v1.15.3) and found it could not work normally.

I found two bugs: the controller and agent always panicked a while after they started.

I will try to fix them in my own experimental env first for testing.

@NickrenREN @xing-yang

@fengzixu
Contributor

fengzixu commented Jun 27, 2020

> After we merged the mock-driver PR, I tried to deploy external-health-monitor in my personal cluster (v1.15.3) and found it could not work normally. I found two bugs: the controller and agent always panicked a while after they started. I will try to fix them in my own experimental env first for testing.

After my changes, I think the agent and controller work fine:

diff --git a/pkg/agent/pv_monitor_agent.go b/pkg/agent/pv_monitor_agent.go
index cf8f38c..d8e2252 100644
--- a/pkg/agent/pv_monitor_agent.go
+++ b/pkg/agent/pv_monitor_agent.go
@@ -100,7 +100,7 @@ func NewPVMonitorAgent(client kubernetes.Interface, driverName string, conn *grp

        // PVC informer
        agent.pvcLister = pvcInformer.Lister()
-       agent.pvListerSynced = pvcInformer.Informer().HasSynced
+       agent.pvcListerSynced = pvcInformer.Informer().HasSynced

        // Pod informer
        podInformer.Informer().AddEventHandler(cache.ResourceEventHandlerFuncs{
diff --git a/pkg/controller/pv_monitor_controller.go b/pkg/controller/pv_monitor_controller.go
index ec9ad28..db847fd 100644
--- a/pkg/controller/pv_monitor_controller.go
+++ b/pkg/controller/pv_monitor_controller.go
@@ -134,6 +134,7 @@ func NewPVMonitorController(client kubernetes.Interface, conn *grpc.ClientConn,

        // PVC informer
        ctrl.pvcLister = pvcInformer.Lister()
+       ctrl.pvcListerSynced = pvcInformer.Informer().HasSynced
        ctrl.pvListerSynced = pvcInformer.Informer().HasSynced

        // Pod informer
[root@experiment-lab001-xuran:~/workspace/csi-test/external-health-monitor]# kubectl get pods
NAME                                                      READY   STATUS    RESTARTS   AGE
csi-external-health-monitor-agent-mcxqn                   2/2     Running   0          4m8s
csi-external-health-monitor-agent-sjcgm                   2/2     Running   0          3m58s
csi-external-health-monitor-controller-6bddc8c88c-6ngcq   2/2     Running   0          111s
csi-external-health-monitor-controller-6bddc8c88c-bvm89   2/2     Running   0          104s
csi-external-health-monitor-controller-6bddc8c88c-ls26g   2/2     Running   0          2m1s

@xing-yang
Contributor

Thanks @fengzixu for testing and fixing the bugs!

Hi @NickrenREN, do you want to fix these bugs on your PR or do you want @fengzixu to submit a PR on top of yours?

@fengzixu
Contributor

fengzixu commented Jun 27, 2020

In addition to the bugs I mentioned above, I finished basic testing of external-health-monitor using csi-test. Let me summarize the testing report below.

Test Cases

  • Check whether the volume-health-monitor-controller can call the ControllerListVolume or ControllerGetVolume gRPC interface, and whether the responses of those interfaces satisfy the CSI specification
  • Test one simple negative case to check the response of ControllerListVolume

Cluster Info

  • Version: 1.16.6
  • Control Plane: 3
  • Worker: 2

What I did

  1. Clone https://github.com/kubernetes-csi/external-health-monitor and https://github.com/kubernetes-csi/csi-test
  2. Build the container images: csi-external-health-monitor-controller, csi-external-health-monitor-agent, and mock-driver
  3. Push the images to my personal image repo: https://quay.io/user/fengzixu
  4. Change the image tags of csi-external-health-monitor-controller, csi-external-health-monitor-agent, and mock-driver to my own images
  5. Run `kubectl create -f deploy/kubernetes/xxx` to deploy the external-health-monitor
  6. Run `kubectl logs csi-external-health-monitor-controller-5866488ddc-jxdb6 -c mock-driver -f` to watch the logs

Result

Case1

I ran `kubectl logs <controller_pod_name> -c mock-driver`:

// List
gRPCCall: {"Method":"/csi.v1.Controller/ListVolumes","Request":{},"Response":{"entries":[{"volume":{"capacity_bytes":107374182400,"volume_id":"1","volume_context":{"name":"Mock Volume 1"}},"status":{"published_node_ids":["io.kubernetes.storage.mock"],"volume_condition":{}}},{"volume":{"capacity_bytes":107374182400,"volume_id":"2","volume_context":{"name":"Mock Volume 2"}},"status":{"published_node_ids":["io.kubernetes.storage.mock"],"volume_condition":{}}},{"volume":{"capacity_bytes":107374182400,"volume_id":"3","volume_context":{"name":"Mock Volume 3"}},"status":{"published_node_ids":["io.kubernetes.storage.mock"],"volume_condition":{}}}]},"Error":"","FullError":null}

// Get
gRPCCall: {"Method":"/csi.v1.Controller/ControllerGetVolume","Request":{"volume_id":"1"},"Response":{"volume":{"capacity_bytes":107374182400,"volume_id":"1","volume_context":{"name":"Mock Volume 1"}},"status":{"published_node_ids":["io.kubernetes.storage.mock"],"volume_condition":{}}},"Error":"","FullError":null}

Case2

I added some temporary code to test a simple negative case:

```go
// CheckControllerListVolumeStatuses checks volumes' health condition via ListVolumes
func (checker *PVHealthConditionChecker) CheckControllerListVolumeStatuses() error {
	ctx, cancel := context.WithTimeout(context.Background(), checker.timeout)
	defer cancel()

	result, err := checker.csiPVHandler.ControllerListVolumeConditions(ctx)
	if err != nil {
		return err
	}

	// Positive case: volume "1" exists in the mock driver
	_, err = checker.csiPVHandler.ControllerGetVolumeCondition(ctx, "1")
	if err != nil {
		return err
	}

	// Negative case: "Mock Volume 10" does not exist, so the driver should return NotFound
	_, err = checker.csiPVHandler.ControllerGetVolumeCondition(ctx, "Mock Volume 10")
	if err != nil {
		return err
	}

	// remainder of the original function (which consumes `result`) is omitted here
	_ = result
	return nil
}
```

Result

// Positive
gRPCCall: {"Method":"/csi.v1.Controller/ControllerGetVolume","Request":{"volume_id":"1"},"Response":{"volume":{"capacity_bytes":107374182400,"volume_id":"1","volume_context":{"name":"Mock Volume 1"}},"status":{"published_node_ids":["io.kubernetes.storage.mock"],"volume_condition":{}}},"Error":"","FullError":null}

// Negative
gRPCCall: {"Method":"/csi.v1.Controller/ControllerGetVolume","Request":{"volume_id":"Mock Volume 10"},"Response":{"status":{"volume_condition":{"abnormal":true,"message":"volume not found"}}},"Error":"rpc error: code = NotFound desc = Mock Volume 10","FullError":{"code":5,"message":"Mock Volume 10"}}

Conclusion

Simply put, the basic function of external-health-monitor is fine. We can merge this initial PR first, after we resolve all existing comments.

As I said before, the csi-test PR kubernetes-csi/csi-test#268 is only for testing the basic function of external-health-monitor: just checking whether it can invoke the ControllerListVolume and ControllerGetVolume functions and get the desired responses that satisfy the CSI specification.

What we will do

After we merge this PR, I think we can do these tasks ASAP:

  1. Make the code base cleaner
  2. Add more positive and negative test cases in csi-test for external-health-monitor
  3. Discuss the plan to introduce external-health-monitor in 1.19 before code freeze

@fengzixu
Contributor

@NickrenREN @xing-yang If you are available, please check my comments.

@fengzixu
Contributor

I fixed the bugs that caused the panics by submitting a PR to @NickrenREN's repository:

NickrenREN#1

Maybe @NickrenREN can merge it and update this PR.

@NickrenREN
Contributor Author

@fengzixu Thanks a lot for doing this.
@xing-yang fengzixu has sent a PR for this; I have merged it.

serviceAccount: csi-external-health-monitor-controller
containers:
- name: csi-external-health-monitor-controller
image: quay.io/k8scsi/csi-external-health-monitor-controller:canary
Contributor
We are going to move the image repo to a different place, so images for this repo won't be hosted here anymore.
To avoid confusion, please change this to:
image:

We can update this after this PR is merged and all necessary release-tools changes are merged.

Contributor
Can we merge this PR now?

Contributor
I'm still reviewing this.

@fengzixu
Contributor

fengzixu commented Jul 3, 2020

I forgot one patch, about the ClusterRole for the Event resource. I think the controller needs it, so I fixed it.

NickrenREN#2

@NickrenREN @xing-yang

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. and removed do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. labels Jul 3, 2020
@xing-yang
Contributor

@NickrenREN Can you squash the commits?

Looks good to me other than that. I'll merge after you've squashed them.

@xing-yang xing-yang changed the title Initial implementation Initial implementation to add external health monitor controller and agent Jul 4, 2020
@xing-yang
Contributor

> I forgot one patch, about the ClusterRole for the Event resource. I think the controller needs it, so I fixed it.
> NickrenREN#2

@fengzixu, if Nick merges your PR into his branch, it will be merged together with this PR. Otherwise, you can submit a separate PR so it can be merged after this one.

NickrenREN and others added 4 commits July 6, 2020 11:12
1. remove the leader-election from argument list of agent
2. assign correct value to pvcListerSynced
@NickrenREN NickrenREN force-pushed the initial-implementation branch from 2961012 to 76fc3fd on July 6, 2020 03:15
@NickrenREN
Contributor Author

@xing-yang Done.
@fengzixu Sorry, I forgot to merge your second PR; could you send it to this repo directly after this one gets merged?

@fengzixu
Contributor

fengzixu commented Jul 6, 2020

> @xing-yang Done.
> @fengzixu Sorry, I forgot to merge your second PR; could you send it to this repo directly after this one gets merged?

Sure

@xing-yang
Contributor

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jul 6, 2020
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: NickrenREN, xing-yang

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • OWNERS [NickrenREN,xing-yang]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot merged commit 826049d into kubernetes-csi:master Jul 6, 2020
Successfully merging this pull request may close these issues:

  • Implement volume health monitoring agent
  • Implement volume health monitoring controller