
Add dataSource label to pvcs created #128

Merged 1 commit on Jan 2, 2023

Conversation

@RaunakShah (Contributor) commented Sep 30, 2022

Follow-up to kubernetes-csi/external-provisioner#792.

This PR adds a label to the existing metrics that identifies the data source of the PVC being monitored.

Testing

1. Deploy the hostpath driver with an external-provisioner that includes these changes. Deploy an nginx pod to query the metrics:
% kubectl get pod -o wide       
NAME                   READY   STATUS    RESTARTS   AGE     IP            NODE              NOMINATED NODE   READINESS GATES
csi-hostpath-socat-0   1/1     Running   0          2d23h   10.244.1.6    csi-prow-worker   <none>           <none>
csi-hostpathplugin-0   8/8     Running   0          7m25s   10.244.1.31   csi-prow-worker   <none>           <none>
nginx                  1/1     Running   0          2d1h    10.244.1.18   csi-prow-worker   <none>           <none>
2. Create a PVC and a VolumeSnapshot of the PVC.
% kubectl get pvc
NAMESPACE   NAME            STATUS    VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS      AGE
default     csi-pvc         Bound     pvc-7351461a-8bf4-4042-9a2e-228cf5d06624   1Gi        RWO            csi-hostpath-sc   7m21s
% kubectl get volumesnapshot
NAME                READYTOUSE   SOURCEPVC   SOURCESNAPSHOTCONTENT   RESTORESIZE   SNAPSHOTCLASS            SNAPSHOTCONTENT                                    CREATIONTIME   AGE
new-snapshot-demo   true         csi-pvc                             1Gi           csi-hostpath-snapclass   snapcontent-9e8b8619-eafd-42c7-93ed-0dcf09443a4d   7m37s          7m37s
3. Create a PVC from the above VolumeSnapshot that will provision successfully.
% kubectl get pvc -A        
NAMESPACE   NAME            STATUS    VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS      AGE
default     csi-pvc         Bound     pvc-7351461a-8bf4-4042-9a2e-228cf5d06624   1Gi        RWO            csi-hostpath-sc   8m14s
default     hpvc-restore    Bound     pvc-a52a8b54-6aed-4b5c-b9a3-24ec0d848cdd   1Gi        RWO            csi-hostpath-sc   7m54s
4. Create a PVC from the above VolumeSnapshot that will fail to provision.
% kubectl get pvc -A        
NAMESPACE   NAME            STATUS    VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS      AGE
default     csi-pvc         Bound     pvc-7351461a-8bf4-4042-9a2e-228cf5d06624   1Gi        RWO            csi-hostpath-sc   8m14s
default     hpvc-restor-e   Pending                                                                        csi-hostpath-sc   6m50s
default     hpvc-restore    Bound     pvc-a52a8b54-6aed-4b5c-b9a3-24ec0d848cdd   1Gi        RWO            csi-hostpath-sc   7m54s
5. Exec into the nginx pod and query the metrics:
% kubectl exec -it nginx -- bash
root@nginx:/# curl 10.244.1.31:8080/metrics
# HELP controller_persistentvolumeclaim_provision_duration_seconds Latency in seconds to provision persistent volumes. Failed provisioning attempts are ignored. Broken down by storage class name and source of the claim.
# TYPE controller_persistentvolumeclaim_provision_duration_seconds histogram
controller_persistentvolumeclaim_provision_duration_seconds_bucket{class="csi-hostpath-sc",source="",le="0.005"} 0
controller_persistentvolumeclaim_provision_duration_seconds_bucket{class="csi-hostpath-sc",source="",le="0.01"} 0
controller_persistentvolumeclaim_provision_duration_seconds_bucket{class="csi-hostpath-sc",source="",le="0.025"} 0
controller_persistentvolumeclaim_provision_duration_seconds_bucket{class="csi-hostpath-sc",source="",le="0.05"} 1
controller_persistentvolumeclaim_provision_duration_seconds_bucket{class="csi-hostpath-sc",source="",le="0.1"} 1
controller_persistentvolumeclaim_provision_duration_seconds_bucket{class="csi-hostpath-sc",source="",le="0.25"} 1
controller_persistentvolumeclaim_provision_duration_seconds_bucket{class="csi-hostpath-sc",source="",le="0.5"} 1
controller_persistentvolumeclaim_provision_duration_seconds_bucket{class="csi-hostpath-sc",source="",le="1"} 1
controller_persistentvolumeclaim_provision_duration_seconds_bucket{class="csi-hostpath-sc",source="",le="2.5"} 1
controller_persistentvolumeclaim_provision_duration_seconds_bucket{class="csi-hostpath-sc",source="",le="5"} 1
controller_persistentvolumeclaim_provision_duration_seconds_bucket{class="csi-hostpath-sc",source="",le="10"} 1
controller_persistentvolumeclaim_provision_duration_seconds_bucket{class="csi-hostpath-sc",source="",le="+Inf"} 1
controller_persistentvolumeclaim_provision_duration_seconds_sum{class="csi-hostpath-sc",source=""} 0.033238542
controller_persistentvolumeclaim_provision_duration_seconds_count{class="csi-hostpath-sc",source=""} 1
controller_persistentvolumeclaim_provision_duration_seconds_bucket{class="csi-hostpath-sc",source="VolumeSnapshot",le="0.005"} 0
controller_persistentvolumeclaim_provision_duration_seconds_bucket{class="csi-hostpath-sc",source="VolumeSnapshot",le="0.01"} 0
controller_persistentvolumeclaim_provision_duration_seconds_bucket{class="csi-hostpath-sc",source="VolumeSnapshot",le="0.025"} 1
controller_persistentvolumeclaim_provision_duration_seconds_bucket{class="csi-hostpath-sc",source="VolumeSnapshot",le="0.05"} 1
controller_persistentvolumeclaim_provision_duration_seconds_bucket{class="csi-hostpath-sc",source="VolumeSnapshot",le="0.1"} 1
controller_persistentvolumeclaim_provision_duration_seconds_bucket{class="csi-hostpath-sc",source="VolumeSnapshot",le="0.25"} 1
controller_persistentvolumeclaim_provision_duration_seconds_bucket{class="csi-hostpath-sc",source="VolumeSnapshot",le="0.5"} 1
controller_persistentvolumeclaim_provision_duration_seconds_bucket{class="csi-hostpath-sc",source="VolumeSnapshot",le="1"} 1
controller_persistentvolumeclaim_provision_duration_seconds_bucket{class="csi-hostpath-sc",source="VolumeSnapshot",le="2.5"} 1
controller_persistentvolumeclaim_provision_duration_seconds_bucket{class="csi-hostpath-sc",source="VolumeSnapshot",le="5"} 1
controller_persistentvolumeclaim_provision_duration_seconds_bucket{class="csi-hostpath-sc",source="VolumeSnapshot",le="10"} 1
controller_persistentvolumeclaim_provision_duration_seconds_bucket{class="csi-hostpath-sc",source="VolumeSnapshot",le="+Inf"} 1
controller_persistentvolumeclaim_provision_duration_seconds_sum{class="csi-hostpath-sc",source="VolumeSnapshot"} 0.022875916
controller_persistentvolumeclaim_provision_duration_seconds_count{class="csi-hostpath-sc",source="VolumeSnapshot"} 1
# HELP controller_persistentvolumeclaim_provision_failed_total Total number of persistent volume provision failed attempts. Broken down by storage class name and source of the claim.
# TYPE controller_persistentvolumeclaim_provision_failed_total counter
controller_persistentvolumeclaim_provision_failed_total{class="csi-hostpath-sc",source="VolumeSnapshot"} 4
# HELP controller_persistentvolumeclaim_provision_total Total number of persistent volumes provisioned succesfully. Broken down by storage class name and source of the claim.
# TYPE controller_persistentvolumeclaim_provision_total counter
controller_persistentvolumeclaim_provision_total{class="csi-hostpath-sc",source=""} 1
controller_persistentvolumeclaim_provision_total{class="csi-hostpath-sc",source="VolumeSnapshot"} 1
...
This adds a new `source` label to the existing provisioning metrics, for example `controller_persistentvolumeclaim_provision_total`.
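
As an illustration of how such a label value could be derived, here is a minimal Go sketch; the helper names are hypothetical and it assumes the counter was created with the labels `class` and `source` (this is not the PR's actual code):

```go
package metrics

import (
	"github.com/prometheus/client_golang/prometheus"
	v1 "k8s.io/api/core/v1"
)

// claimSource derives the "source" label value from the claim's data source.
// Plain PVCs yield an empty string; snapshot restores yield "VolumeSnapshot";
// clones yield "PersistentVolumeClaim".
func claimSource(claim *v1.PersistentVolumeClaim) string {
	if claim.Spec.DataSource != nil {
		return claim.Spec.DataSource.Kind
	}
	return ""
}

// recordProvision shows where the new label would be applied, assuming
// provisionTotal was registered with []string{"class", "source"}.
func recordProvision(provisionTotal *prometheus.CounterVec, class string, claim *v1.PersistentVolumeClaim) {
	provisionTotal.WithLabelValues(class, claimSource(claim)).Inc()
}
```

Because the data source kind typically takes only a handful of values, the extra label should add very little cardinality to the metric.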

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Sep 30, 2022
@k8s-ci-robot k8s-ci-robot added the size/S Denotes a PR that changes 10-29 lines, ignoring generated files. label Sep 30, 2022
@jsafrane (Contributor):

/assign @gnufied
Is it OK to add a new label to an existing metric?

@RaunakShah (Contributor, Author):

> Is it OK to add a new label to an existing metric?

I was unsure too, but it seems that until kubernetes-csi/external-provisioner#792 there were no users of the metrics exposed in this package.

@xing-yang:

Let's check with sig-instrumentation whether this will break the existing metrics.

@jsafrane (Contributor) commented Dec 1, 2022

A Kubernetes blog post says:

> Stable metrics can be guaranteed to not change, [...] no labels can be added or removed from this metric

It uses registration from github.com/prometheus, which does not allow setting a stability level, so I think it's a stable metric:

```go
PersistentVolumeClaimProvisionFailedTotal: prometheus.NewCounterVec(
    prometheus.CounterOpts{
        Subsystem: subsystem,
        Name:      "persistentvolumeclaim_provision_failed_total",
        Help:      "Total number of persistent volume provision failed attempts. Broken down by storage class name.",
    },
    []string{"class"},
),
```
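
For comparison, extending that registration with the new label would presumably look roughly like the following self-contained sketch (illustrative only, not the PR's exact diff; the variable name is hypothetical):

```go
package metrics

import "github.com/prometheus/client_golang/prometheus"

// Hypothetical standalone version of the registration above, extended with the
// "source" label under discussion. The help text mirrors the metric output
// shown earlier in this PR.
var persistentVolumeClaimProvisionFailedTotal = prometheus.NewCounterVec(
    prometheus.CounterOpts{
        Subsystem: "controller",
        Name:      "persistentvolumeclaim_provision_failed_total",
        Help:      "Total number of persistent volume provision failed attempts. Broken down by storage class name and source of the claim.",
    },
    []string{"class", "source"}, // "source" is the new label
)
```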

@RaunakShah (Contributor, Author) commented Dec 5, 2022

@jsafrane, from the blog:

> When a StabilityLevel is not explicitly set, metrics default to “Alpha” stability

Having said that, there's also a note on deprecating metrics in case a label needs to be added:

> In order to add or remove a label from an existing stable metric, one would have to introduce a new metric and deprecate the stable one

Do we want to consider deprecating the existing metrics and introducing new ones? The bonus is that we can potentially move to the kubeMetrics package, which allows us to set the StabilityLevel and follow the Kubernetes metrics stability framework.
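
If the package did adopt that framework, a registration through component-base/metrics might look roughly like this (a sketch under that assumption; the variable name and the ALPHA level are illustrative, not part of this PR):

```go
package metrics

import (
	k8smetrics "k8s.io/component-base/metrics"
	"k8s.io/component-base/metrics/legacyregistry"
)

// persistentVolumeClaimProvisionTotal is a hypothetical re-registration of the
// counter through the Kubernetes metrics stability framework.
var persistentVolumeClaimProvisionTotal = k8smetrics.NewCounterVec(
	&k8smetrics.CounterOpts{
		Subsystem:      "controller",
		Name:           "persistentvolumeclaim_provision_total",
		Help:           "Total number of persistent volumes provisioned successfully. Broken down by storage class name and source of the claim.",
		StabilityLevel: k8smetrics.ALPHA, // could later be promoted to STABLE
	},
	[]string{"class", "source"},
)

func init() {
	// legacyregistry exposes the metric on the default Prometheus handler.
	legacyregistry.MustRegister(persistentVolumeClaimProvisionTotal)
}
```

Promoting such a metric to STABLE would then make its label set subject to the deprecation rules quoted above.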

@xing-yang:

Hi @logicalhan, can you please comment on this? Are these PVC/PV-related metrics alpha or stable?

@jsafrane (Contributor) commented Dec 6, 2022

Yes, deprecating the old metric + introducing a new one is IMO the right way forward.

@dgrisonnet (Member):

The stability guarantees that we have for Kubernetes only apply to k/k since the static analysis tools that make sure metrics don't change are only run there.

You could reuse that same framework here, but that would mean updating all your metrics to be created by the component-base/metrics package and running the stability guarantee checks.

Going back to this PR: since you are not reusing the stability framework here, I would expect all the metrics to be considered Alpha, and thus without any guarantees. Plus, adding a label to a metric isn't a breaking change, so as long as you are mindful of the additional cardinality it brings, it should be pretty safe to do and shouldn't require going through a deprecation cycle.

@RaunakShah (Contributor, Author):

@dgrisonnet thank you for the info!
@jsafrane, are we okay moving ahead with this PR then? We can also move to the component-base/metrics package here to introduce the stability guarantees.

@jsafrane (Contributor) commented Jan 2, 2023

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jan 2, 2023
@k8s-ci-robot (Contributor):

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jsafrane, RaunakShah

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jan 2, 2023
@k8s-ci-robot k8s-ci-robot merged commit ae1e8fb into kubernetes-sigs:master Jan 2, 2023