Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

STOR-2040: oc adm top persistentvolumeclaim #1704

Merged
merged 1 commit into from
Nov 19, 2024

Conversation

tsmetana
Copy link
Member

This is an enhancement for the new oc CLI option to display the PVC percentual usage.

The feature is being implemented in openshift/oc#1854

@openshift-ci-robot
Copy link

openshift-ci-robot commented Oct 23, 2024

@tsmetana: This pull request references STOR-2040 which is a valid jira issue.

In response to this:

This is an enhancement for the new oc CLI option to display the PVC percentual usage.

The feature is being implemented in openshift/oc#1854

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Oct 23, 2024
@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Oct 23, 2024
@openshift-ci openshift-ci bot requested review from cybertron and tjungblu October 23, 2024 11:31
@tsmetana
Copy link
Member Author

@gmeghnag, @dobsonj, @ardaguclu PTAL

Copy link
Member

@ardaguclu ardaguclu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This command sounds very useful to me and agree that it is a natural extension. However, getting data directly from metrics server by discarding API server has natural implications. So I dropped some questions.


### Implementation Details/Notes/Constraints

There are the `kubelet_volume_stats_used_bytes` and
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

inspect-alerts command followed the same way by fetching the stats from metrics endpoint (in my opinion discarding the API server is not a good idea but I understand the motivation). However, inspect-alerts appeals to less people comparing to this command. Users (without checking metrics is enabled or not) will trigger this command and if metrics server is not enabled, what is the expected behavior?. I think we need to catch this earlier and should print an explanatory warning about how this command can be used.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In a case where Prometheus pods are both down (which means that the ClusterOperator/monitoring is Degraded) the command would fail with the following error message:

$ oc adm top persistentvolumeclaims 
error: failed to get persistentvolumeclaims from Prometheus: unable to get /api/v1/query from URI in the openshift-monitoring/prometheus-k8s Route: prometheus-k8s-openshift-monitoring.apps.sharedocp415.lab.local->GET status code=503

In the same case the oc adm top pod and oc adm top node would fail as well:

$ oc adm top pod
error: Metrics not available for pod openshift-monitoring/alertmanager-main-0, age: 81h52m43.855409s
$ oc adm top nodes
error: metrics not available yet

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think, we should raise a better error message like metrics not available yet for this case.

FWIW;
If I was the driver of this KEP, I'd first get the feedback from upstream about adding this first in there by extending https://github.com/kubernetes/kubernetes/blob/18f3941c24b9ed40eecd384f78778d2ff185c31d/staging/src/k8s.io/metrics/pkg/apis/metrics/v1beta1/types.go#L50 . In my opinion, this is a useful feature and I believe upstream would consider this. Also if this is accepted by upstream, we'll have a natural client instead of sending a direct request to metrics server.

Copy link

@gmeghnag gmeghnag Oct 24, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • "I think, we should raise a better error message like metrics not available yet for this case."
    Ok, I can work on this.

### Implementation Details/Notes/Constraints

There are the `kubelet_volume_stats_used_bytes` and
`kubelet_volume_stats_capacity_bytes` Prometheus metrics which can be used to
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assume that kubelet_volume_stats_capacity_bytes is a standard even in vanilla Kubernetes and we are certain that it will be there if Prometheus is enabled?.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, kubelet_volume_stats_capacity_bytes is a metric that Kubernetes exports by default, even in a vanilla cluster.


#### Story 1

As an OCP project user, I want to see a list of all PVC's in my namespace and
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The metrics fetching request will be sent to Prometheus with a bearerToken. What is the expected permission set for this bearerToken that can successfully fetch the required metrics in here?. Can a standard user use this command smoothly? or limited only to cluster admins?. If it is limited to cluster admins, wouldn't be less useful?, since we are providing a command to the users but only few of them can use it actually?.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • "What is the expected permission set for this bearerToken that can successfully fetch the required metrics in here?"
    At least the ClusterRole/cluster-monitoring-view and the get/list verbs on Route objects in openshift-monitoring namespace are needed to execute the query.
  • "Can a standard user use this command smoothly?"
    A user without the above permissions can't use this command.
  • "or limited only to cluster admins?"
    Being created under oc adm where most of the commands used for administrative tasks on the cluster reside, this command is firstly intended for cluster admins.
  • "If it is limited to cluster admins, wouldn't be less useful?, since we are providing a command to the users but only few of them can use it actually?"
    I don't think so:
    • Being the cluster admins the people in charge of expanding volumes this command benefits them.
    • If needed, in a few steps, the following, anyone can execute such a command:
      $ oc create clusterrole routes-view --verb=get,list --resource=routes -n openshift-monitoring
      $ oc adm policy add-cluster-role-to-user routes-view <USER>
      $ oc adm policy add-cluster-role-to-user cluster-monitoring-view <USER>
      


1. Provide a simple CLI option to display filesystem usage of
PersistentVolumeClaims.
2. Display only the percentual usage for a given PersistentVolumeClaim or all
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If user invokes this command with --all-namespaces, can we get all the required data in 1 prometheus query? or are we going to send multiple queries per each namespace?. If user has 100 namespaces, sending 100 queries to metrics server would be impractical.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • "If user invokes this command with --all-namespaces, can we get all the required data in 1 prometheus query?"
    Yes, and this is how it works when you execute the command with --all-namespaces, a single query will be executed:

    100*kubelet_volume_stats_used_bytes{persistentvolumeclaim=~".*"}/kubelet_volume_stats_capacity_bytes{persistentvolumeclaim=~".*"}
    
  • "or are we going to send multiple queries per each namespace?"
    No, this way would overload the metric server, and as you said would be impractical.

value. This could be implemented by a single call to Prometheus API using PromQL
and would not need additional API calls.

The additional columns (e.g. volume capacity, absolute value of the free space)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I saw in the PR in oc about the queries by getting Route, Ingress objects to construct metrics server endpoint. Are we certain that these objects are always created if metrics server is enabled?. Is there any possibility that metrics server is enabled but route and ingress are not defined by cluster admin. So user can easily think that metrics server is enabled but command still does not work?. If we are not certain, users can only use iff metrics server is enabled, route and ingress are created and required role permissions are correctly given.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Additionally are we certain that this implementation method would also work on disconnected?.

Copy link

@gmeghnag gmeghnag Oct 24, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • "Are we certain that these objects are always created if metrics server is enabled?"
    Yes, the Route/prometheus-k8s is created by default during cluster installation.

  • "Is there any possibility that metrics server is enabled but route and ingress are not defined by cluster admin."
    If the admin for any reason deletes the Route/prometheus-k8s object the command will fail with the following error, which is self-explanatory:

    $ oc adm top persistentvolumeclaims -A
    error: failed to get persistentvolumeclaims from Prometheus: routes.route.openshift.io "prometheus-k8s" not found
    
  • "are we certain that this implementation method would also work on disconnected?"
    Yes, I've tested it.

@ardaguclu
Copy link
Member

Thank you for answering my questions. I still strongly recommend to ask feedback in upstream adding this feature in there first. But this is not a blocker for me.
LGTM
but markdownlint failure seems to be real.

@tsmetana
Copy link
Member Author

Thanks for the review. I'll fix the linter issues.

@tsmetana tsmetana force-pushed the oc-adm-top-pvc branch 2 times, most recently from 24ee6e1 to 176bef1 Compare October 29, 2024 13:57
@tsmetana
Copy link
Member Author

The linter looks happy now. I have also tried to incorporate the comments from the review itself into the document.

@tsmetana tsmetana changed the title [WIP] STOR-2040: oc adm top persistentvolumeclaim STOR-2040: oc adm top persistentvolumeclaim Oct 29, 2024
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Oct 29, 2024
@dobsonj
Copy link
Member

dobsonj commented Oct 29, 2024

/lgtm
Thanks for putting this together and adding some details from the questions in this PR.

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Oct 29, 2024
@tsmetana
Copy link
Member Author

/assign trozet

@openshift-ci openshift-ci bot removed the lgtm Indicates that a PR is ready to be merged. label Nov 18, 2024
@dobsonj
Copy link
Member

dobsonj commented Nov 18, 2024

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Nov 18, 2024
@ardaguclu
Copy link
Member

/lgtm

@ardaguclu
Copy link
Member

I don't think I have privileges to approve this PR but since I'm in the approver list;
/approve

@jsafrane
Copy link
Contributor

/approve

Copy link
Contributor

openshift-ci bot commented Nov 19, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ardaguclu, gmeghnag, jsafrane

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Nov 19, 2024
Copy link
Contributor

openshift-ci bot commented Nov 19, 2024

@tsmetana: all tests passed!

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@openshift-merge-bot openshift-merge-bot bot merged commit 8d48fb5 into openshift:master Nov 19, 2024
2 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. lgtm Indicates that a PR is ready to be merged.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants