Managed Kubernetes: Reliable way to monitor capacity and free bytes of persistent volumes #166

Closed
BernhardGruen opened this issue Nov 11, 2021 · 8 comments

Comments

@BernhardGruen

Currently there is no monitoring support for persistent volume claims (inside a managed Kubernetes cluster).
On most clusters this is done using an extension to the CSI driver that exports those metrics to the kubelet.
In Prometheus these metrics are then available as:

  • kubelet_volume_stats_available_bytes
  • kubelet_volume_stats_capacity_bytes
  • kubelet_volume_stats_used_bytes
  • kubelet_volume_stats_inodes
  • kubelet_volume_stats_inodes_free
  • kubelet_volume_stats_inodes_used

Without those metrics it is not possible to know, or to alert in advance, when a persistent volume is nearly full. This could lead to severe outages of the hosted services.
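
As a rough illustration of the kind of check these metrics would make possible, here is a minimal Python sketch that parses kubelet metrics in the Prometheus text format and flags volumes with little free space left. The sample input, the 10% threshold and the helper names `parse_volume_stats` and `nearly_full` are assumptions made for this example; they do not come from any OVH or Prometheus tooling.

```python
import re
from collections import defaultdict

# Matches sample lines such as:
#   kubelet_volume_stats_available_bytes{namespace="db",persistentvolumeclaim="data-0"} 5e+08
SAMPLE_RE = re.compile(r'^(kubelet_volume_stats_\w+)\{([^}]*)\}\s+(\S+)$')
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def parse_volume_stats(text):
    """Return {(namespace, pvc): {metric_name: value}} from exposition text."""
    stats = defaultdict(dict)
    for line in text.splitlines():
        match = SAMPLE_RE.match(line.strip())
        if not match:
            continue  # comments, blank lines and unrelated metrics
        name, raw_labels, value = match.groups()
        labels = dict(LABEL_RE.findall(raw_labels))
        key = (labels.get("namespace"), labels.get("persistentvolumeclaim"))
        stats[key][name] = float(value)
    return dict(stats)


def nearly_full(stats, min_free_ratio=0.10):
    """Return the (namespace, pvc) keys whose free share is below the threshold."""
    flagged = []
    for key, metrics in stats.items():
        available = metrics.get("kubelet_volume_stats_available_bytes")
        capacity = metrics.get("kubelet_volume_stats_capacity_bytes")
        if available is None or not capacity:
            continue  # incomplete data for this volume
        if available / capacity < min_free_ratio:
            flagged.append(key)
    return flagged


# Hypothetical sample data, not taken from a real cluster.
SAMPLE = """
# TYPE kubelet_volume_stats_available_bytes gauge
kubelet_volume_stats_available_bytes{namespace="db",persistentvolumeclaim="data-0"} 5e+08
kubelet_volume_stats_capacity_bytes{namespace="db",persistentvolumeclaim="data-0"} 1e+10
kubelet_volume_stats_available_bytes{namespace="web",persistentvolumeclaim="cache"} 6e+09
kubelet_volume_stats_capacity_bytes{namespace="web",persistentvolumeclaim="cache"} 8e+09
"""

print(nearly_full(parse_volume_stats(SAMPLE)))
```

In a real setup this comparison would more likely live in a Prometheus alerting rule; the sketch only shows the arithmetic that such a rule relies on.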

Unfortunately I did not find a reliable workaround either. The one workaround that halfway works, but often needs manual intervention, is to monitor using the node-exporter (node_filesystem_free_bytes). With this variant, however, one has to restart every node-exporter instance each time a new StatefulSet is created or moves from one node to another. This is just not feasible, and therefore it is currently not safe to host services with persistent volumes on OVH managed Kubernetes clusters.

@nsteinmetz

nsteinmetz commented Nov 11, 2021

The last time I checked (several months ago, maybe almost a year ago), the OpenStack Cinder CSI driver did not report metrics yet. Has that changed since?

Cf:

A few other issues related to metrics and cinder-csi seem to be fixed, however, so it is hard to get a real status on this.

@BernhardGruen
Author

I fear that not much has changed since.
Still, I think there must be a way to get those metrics even with cinder-csi, as I have access to a cluster (on a smaller local provider) that also uses cinder-csi, and there I do get those metrics.
I currently assume that they are provided by a DaemonSet called csi-cinder-nodeplugin (using the images docker.io/k8scloudprovider/cinder-csi-plugin:v1.19.0 and quay.io/k8scsi/csi-node-driver-registrar:v1.2.0).

@BernhardGruen
Author

This is the output of cinder-csi-plugin in verbose mode:

I1111 09:36:37.962971      92 server.go:108] Listening for connections on address: &net.UnixAddr{Name:"/csi/csi.sock", Net:"unix"}
I1111 09:37:03.691276      92 utils.go:80] GRPC call: /csi.v1.Node/NodeGetCapabilities
I1111 09:37:03.691325      92 utils.go:81] GRPC request: 
I1111 09:37:03.691891      92 nodeserver.go:454] NodeGetCapabilities called with req: &csi.NodeGetCapabilitiesRequest{XXX_NoUnkeyedLiteral:struct {}{}, XXX_unrecognized:[]uint8(nil), XXX_sizecache:0}
I1111 09:37:03.691946      92 utils.go:86] GRPC response: capabilities:<rpc:<type:STAGE_UNSTAGE_VOLUME > > capabilities:<rpc:<type:EXPAND_VOLUME > > capabilities:<rpc:<type:GET_VOLUME_STATS > > 
I1111 09:37:03.695056      92 utils.go:80] GRPC call: /csi.v1.Node/NodeGetVolumeStats
I1111 09:37:03.695087      92 utils.go:81] GRPC request: volume_id:"e1c08eb0-3094-43c0-94c3-76ba1f70a0d5" volume_path:"/var/lib/kubelet/pods/31ba4a06-84d9-4c86-a478-b0459744d08c/volumes/kubernetes.io~csi/pvc-3e13f04f-5b2e-476f-af2f-a18eb26694f2/mount" 
I1111 09:37:03.695410      92 nodeserver.go:462] NodeGetVolumeStats: called with args {VolumeId:e1c08eb0-3094-43c0-94c3-76ba1f70a0d5 VolumePath:/var/lib/kubelet/pods/31ba4a06-84d9-4c86-a478-b0459744d08c/volumes/kubernetes.io~csi/pvc-3e13f04f-5b2e-476f-af2f-a18eb26694f2/mount StagingTargetPath: XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}
I1111 09:37:03.695481      92 utils.go:86] GRPC response: usage:<available:7997464576 total:8388009984 used:373768192 unit:BYTES > usage:<available:514666 total:524288 used:9622 unit:INODES > 

The last line clearly shows that the volume metrics were provided by this pod. So in principle that should work for OVH managed Kubernetes clusters too.
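
To make the numbers in that last log line easier to read, here is a small Python sketch that extracts the usage figures from it and computes the fill ratio. The regular expression and the `parse_usage` helper are assumptions written for this one log format; they are not part of cinder-csi-plugin. As far as I understand, the byte figures are what the kubelet exposes as kubelet_volume_stats_available_bytes, kubelet_volume_stats_capacity_bytes and kubelet_volume_stats_used_bytes, and the inode figures feed the corresponding inode metrics.

```python
import re

# One match per "usage:<...>" group in the GRPC response line.
USAGE_RE = re.compile(
    r'usage:<available:(\d+) total:(\d+) used:(\d+) unit:(\w+) >'
)


def parse_usage(log_line):
    """Return {unit: {"available": int, "total": int, "used": int}}."""
    usage = {}
    for available, total, used, unit in USAGE_RE.findall(log_line):
        usage[unit] = {
            "available": int(available),
            "total": int(total),
            "used": int(used),
        }
    return usage


# The response part of the last log line above, copied verbatim.
LINE = (
    'GRPC response: usage:<available:7997464576 total:8388009984 '
    'used:373768192 unit:BYTES > usage:<available:514666 total:524288 '
    'used:9622 unit:INODES > '
)

usage = parse_usage(LINE)
bytes_used_ratio = usage["BYTES"]["used"] / usage["BYTES"]["total"]
inodes_used_ratio = usage["INODES"]["used"] / usage["INODES"]["total"]
print(f"bytes used: {bytes_used_ratio:.1%}, inodes used: {inodes_used_ratio:.1%}")
```

Note that available plus used is slightly less than total for the byte figures, which is expected when the filesystem reserves some blocks.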

@mhurtrel
Collaborator

I got confirmation from the team that this will be solved with the CSI update planned within a month! @BernhardGruen @nsteinmetz

@BernhardGruen
Author

Perfect - thank you for keeping us updated.

@mhurtrel
Collaborator

mhurtrel commented May 2, 2022

Hello to everyone following this issue! The CSI update that will make these metrics available is to be deployed to production within the next 10 days. Note that this will enable the feature for all OpenStack regions running Stein. This is the case for most regions, and will be the case for all regions before the end of the summer.

@mhurtrel
Collaborator

We just upgraded the cinder CSI.
This upgrade will benefit all customers on supported Kubernetes versions (1.20+). All new clusters will have it natively; for existing clusters, customers will need to run the patch upgrade (for example through the notification that appears on the cluster page).
Here are the details of all improvements and new use cases:

  • Hot snapshot (the ability to snapshot a volume in use): this is key, as it enables the use of all K8s-compliant backup and DRP tools such as Trilio and Kasten. It means that Kubernetes can call the snapshot feature from Cinder while a block is being used. NOTE THAT THIS REQUIRES BEING ON A STEIN REGION. On other regions, only cold snapshots are supported. See #77.
  • Hot resize: this makes it possible to grow a volume/block while it is in use. NOTE THAT THIS REQUIRES BEING ON A STEIN REGION. On other regions, only cold resizes are supported.
  • Volume metrics will now be accessible from the API server (#166), so that customers can anticipate, for example, when a volume will run out of space.
  • This also fixes the "SecurityContextFs" bug that prevented many Helm charts from being deployed without a workaround: https://docs.ovh.com/sg/en/kubernetes/persistentvolumes-permission-errors/
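
For readers who want to check the volume metrics by hand, one possible route is the kubelet metrics endpoint exposed through the API server's node proxy. The Python sketch below only builds that request path and filters the response text; it assumes that the cluster allows access to the node proxy endpoint, and the node name `node-1` and both helper names are hypothetical.

```python
from urllib.parse import quote

VOLUME_PREFIX = "kubelet_volume_stats_"


def kubelet_metrics_path(node_name):
    """Build the API server path that proxies to a node's kubelet /metrics."""
    return f"/api/v1/nodes/{quote(node_name, safe='')}/proxy/metrics"


def volume_stat_lines(metrics_text):
    """Keep only the volume statistics samples from the kubelet metrics text."""
    return [
        line
        for line in metrics_text.splitlines()
        if line.startswith(VOLUME_PREFIX)
    ]


# "node-1" is a placeholder; use a name reported by `kubectl get nodes`.
print(kubelet_metrics_path("node-1"))
```

The resulting path can be passed to `kubectl get --raw`, and the text that comes back can be fed to `volume_stat_lines` to see whether any kubelet_volume_stats_* samples are present.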

Documentation will be updated in the upcoming weeks to reflect those changes.

Here is the list of regions (where the Managed Kubernetes product is present) on OpenStack Stein:

GRA5 , GRA9, SBG5
BHS5
WAW1
DE1
UK1 (region will be made available in the upcoming days)

Regions still on OpenStack Newton (the upgrade will be finalized this summer; more info will be published at https://public-cloud.status-ovhcloud.com/ and sent by email to customers active in these regions):

GRA7 (Stein upgrade planned on 2022-05-31)
SYD1 (APAC)
SGP1 (APAC)
VA1 (US)
OR1 (US)

@brennerm

@mhurtrel Just confirmed that the volume metrics are now being collected, thanks!
