
Fix Graphs & Configure Infrastructure #1023

Conversation

@andrewballantyne (Member) commented Mar 17, 2023:

Work towards: #1022

Description

Added the Runtime Metrics page & reworked the Inference Metrics page. Added a stacked line chart.

(Note: the queries do not represent real data right now.)

Global - Inference Metrics (screenshots)

Project - Metrics kebab item added to Runtime (screenshot)

Project - Runtime Metrics (screenshots)

Project - Inference Metrics (screenshots)

Scale options (screenshot)

Stacked Line Chart (random data) (screenshot)

Things that still need to be done (post this PR):

  • Improve the x-axis to show a static, predictable number of tick items (disable Victory's automatic x-axis)
  • Improve tooltips -- stacked charts should share a single tooltip, styled to include the x-axis value (see the sketch after this list)
  • Get queries from @VedantMahabaleshwarkar and add them to the utilities
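
As a rough sketch of those first two items, assuming plain Victory components rather than whatever chart wrappers this repo ends up with: Victory's VictoryVoronoiContainer with voronoiDimension="x" gives one shared tooltip per x value across stacked series, and a fixed tickValues array disables the automatic x-axis:

import * as React from 'react';
import {
  VictoryArea,
  VictoryAxis,
  VictoryChart,
  VictoryStack,
  VictoryVoronoiContainer,
} from 'victory';

// Hypothetical data shape; real values would come from the Prometheus utilities.
const series = [
  [{ x: 1, y: 2 }, { x: 2, y: 3 }, { x: 3, y: 1 }],
  [{ x: 1, y: 1 }, { x: 2, y: 4 }, { x: 3, y: 2 }],
];

const StackedMetricsChart: React.FC = () => (
  <VictoryChart
    containerComponent={
      // voronoiDimension="x" collapses all series into one shared tooltip per
      // x value; the labels callback puts the x value into the tooltip text.
      <VictoryVoronoiContainer
        voronoiDimension="x"
        labels={({ datum }) => `x: ${datum.x}, y: ${datum.y}`}
      />
    }
  >
    {/* A static tickValues list disables Victory's automatic x-axis ticks. */}
    <VictoryAxis tickValues={[1, 2, 3]} />
    <VictoryAxis dependentAxis />
    <VictoryStack>
      {series.map((data, i) => (
        <VictoryArea key={i} data={data} />
      ))}
    </VictoryStack>
  </VictoryChart>
);

export default StackedMetricsChart;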

How Has This Been Tested?

KFDef

apiVersion: kfdef.apps.kubeflow.org/v1
kind: KfDef
metadata:
  name: odh-with-modelmesh
  namespace: opendatahub
spec:
  applications:
    # Base -- NB Images, Manifest, NB Controller
    - kustomizeConfig:
        repoRef:
          name: manifests
          path: odh-common
      name: odh-common
    - kustomizeConfig:
        overlays:
          - additional
        repoRef:
          name: manifests
          path: notebook-images
      name: notebook-images
    - kustomizeConfig:
        repoRef:
          name: manifests
          path: odh-notebook-controller
      name: odh-notebook-controller
    # Model Mesh
    - kustomizeConfig:
        overlays:
          - odh-model-controller
        parameters:
          - name: monitoring-namespace
            value: opendatahub
        repoRef:
          name: manifests-modelmesh
          path: manifests/opendatahub
      name: model-mesh
    - kustomizeConfig:
        parameters:
          - name: deployment-namespace
            value: opendatahub
        repoRef:
          name: manifests
          path: modelmesh-monitoring
      name: modelmesh-monitoring
    - kustomizeConfig:
        repoRef:
          name: manifests
          path: prometheus/cluster
      name: prometheus-cluster
    - kustomizeConfig:
        repoRef:
          name: manifests
          path: prometheus/operator
      name: prometheus-operator
  repos:
    - name: manifests
      uri: https://github.com/opendatahub-io/odh-manifests/tarball/master
    - name: manifests-modelmesh
      uri: https://github.com/opendatahub-io/modelmesh-serving/tarball/main
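
(To reproduce: with the Open Data Hub operator already installed, save the above as kfdef.yaml and apply it, e.g. oc apply -f kfdef.yaml, then follow the steps below.)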

Dashboard steps

  • Create a project
  • Create a runtime
  • Create a model (DM me if you need one; MinIO used to work, might be able to try that -- Slack link for setup steps)
  • Fire requests off to the model's route (see the sketch below)
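
For that last step, a minimal sketch of firing traffic at the model (the route host, model name, and payload are placeholders; the path assumes ModelMesh's KServe v2 REST endpoint):

// Hypothetical route and model name -- substitute the values from your cluster.
const routeHost = 'https://<model-route-host>';
const modelName = 'sample';

// ModelMesh's REST proxy exposes the KServe v2 API; /infer is its inference path.
const fireRequest = async (): Promise<void> => {
  const response = await fetch(`${routeHost}/v2/models/${modelName}/infer`, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({
      inputs: [{ name: 'input', shape: [1, 1], datatype: 'FP32', data: [1.0] }],
    }),
  });
  console.log(response.status);
};

// Fire a handful of requests so the request-count graphs have data points.
for (let i = 0; i < 10; i += 1) {
  void fireRequest();
}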

Test Impact

At this time, no tests are included. We'll need to look at what we can realistically test here -- it's all dependent on graph values, and testing the graphs themselves is more like testing Victory. Not sure we have a Storybook test. We might add unit tests; I'll look to see whether any utilities could benefit.
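
For the utilities angle, a hypothetical Jest test at roughly the scale I have in mind (the per100 helper doesn't exist in the repo; it's a stand-in for whatever data-massaging utilities fall out of this work):

// Stand-in utility: scales raw Prometheus counts down by 100 so y-axis labels
// stay readable (see the "(x100)" discussion below).
const per100 = (values: number[]): number[] => values.map((v) => v / 100);

describe('per100', () => {
  it('divides each raw count by 100', () => {
    expect(per100([100, 250, 0])).toEqual([1, 2.5, 0]);
  });
});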

Request review criteria:

  • The commits have meaningful messages (squashes happen on merge by the bot).
  • Included any necessary screenshots or gifs if it was a UI change.
  • Included tags to the UX team if it was a UI/UX change.
  • Testing instructions have been added in the PR body (for PRs involving changes that are not immediately obvious).
  • The developer has manually tested the changes and verified that the changes work.
  • The developer has added tests or explained why testing cannot be added (unit tests & Storybook for related changes).


openshift-ci bot commented Mar 17, 2023

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from andrewballantyne. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@andrewballantyne (Member, Author) commented:

@vconzola Don't focus on the graph data -- this PR is not for that. See the breadcrumbs, name, dropdown options for the time range... that kind of stuff. If those are good, we can consider the UX on-track.

Will look to put more effort into the x-axis and the stacked line chart in a future PR.

@andrewballantyne changed the title from "Fix metrics" to "Fix Graphs & Configure Infrastructure" on Mar 17, 2023
@vconzola commented:

A couple of small things...
(1) Is the number of requests really (x100)? In other words, are the y-axis values the number of requests you're seeing divided by 100? Jeff thought these numbers could get really large and make the y-axis labels look odd, so he suggested dividing the number by 100 and adding "(x100)" to the chart title. But if it's easier to just show the actual number of requests and let Victory handle the y-axis labels, then let's do that.
(2) I see now my mockups aren't consistent (or correct) with respect to breadcrumbs and page titles. Sorry. I think what you've got for the model metric charts looks good, assuming "Sample" is the name of the model. But for the model server metrics the breadcrumb should be "Data science projects > Andrew's test > ovms metrics", where "ovms" is the model server type. For now it will always be "ovms", but later, when we support additional server types like Watson Core Serving, it'll be whatever the type is. (Eventually we might end up naming servers, and we'll use the name instead of the type, but that won't be until we support KServe.) Similarly, the page title should just be "ovms metrics", or whatever the server type is.

@andrewballantyne (Member, Author) commented:

> A couple of small things...
> (1) Is the number of requests really (x100)? In other words, are the y-axis values the number of requests you're seeing divided by 100? Jeff thought these numbers could get really large and make the y-axis labels look odd, so he suggested dividing the number by 100 and adding "(x100)" to the chart title. But if it's easier to just show the actual number of requests and let Victory handle the y-axis labels, then let's do that.

Not yet -- but they will be -- I have a TODO in the code to divide them. But without real data for these things, it's hard to test.
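
For reference, a minimal sketch of that TODO (type and function names are stand-ins, not the PR's actual utilities):

// A Prometheus range-query value is a [timestamp, stringValue] pair.
type PrometheusQueryRangeResultValue = [number, string];

// Divide raw request counts by 100 before charting, per the "(x100)" convention.
const toChartPoints = (values: PrometheusQueryRangeResultValue[]) =>
  values.map(([timestamp, value]) => ({
    x: timestamp,
    y: Number(value) / 100,
  }));

const chartTitle = 'HTTP requests (x100)'; // hypothetical title string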

> (2) I see now my mockups aren't consistent (or correct) with respect to breadcrumbs and page titles. Sorry. I think what you've got for the model metric charts looks good, assuming "Sample" is the name of the model. But for the model server metrics the breadcrumb should be "Data science projects > Andrew's test > ovms metrics", where "ovms" is the model server type. For now it will always be "ovms", but later, when we support additional server types like Watson Core Serving, it'll be whatever the type is. (Eventually we might end up naming servers, and we'll use the name instead of the type, but that won't be until we support KServe.) Similarly, the page title should just be "ovms metrics", or whatever the server type is.

Understood, I'll look to update.

@alexcreasy (Contributor) left a comment:


I've been through the code and nothing really caught my eye as being obviously wrong -- if that's worth anything 😆

However, it has been very useful for me to see which areas of the codebase you're changing for metrics and which components you're using.

So far I haven't been able to get your KFDef working in OSD; I'll have another crack at it tomorrow.

@andrewballantyne (Member, Author) left a comment:

Noticed a couple of refactor typos to correct 😓 Whoops.

  timeframe: TimeframeTitle,
  lastUpdateTime: number,
  setLastUpdateTime: (time: number) => void,
): {
  data: Record<ModelServingMetricType, ContextResourceData<PrometheusQueryRangeResultValue>>;
  data: Record<RuntimeMetricType, ContextResourceData<PrometheusQueryRangeResultValue>>;
@andrewballantyne (Member, Author):
Typo -- this needs to have both types; sad that TypeScript didn't catch it during my refactoring.
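
A sketch of the corrected shape, assuming "both types" means the data map is keyed by the union of the two metric enums (the type aliases below are trimmed stand-ins for the real definitions):

type ModelServingMetricType = 'requestCount' | 'averageResponseTime'; // stand-in values
type RuntimeMetricType = 'cpuUtilization' | 'memoryUtilization'; // stand-in values
type PrometheusQueryRangeResultValue = [number, string];
type ContextResourceData<T> = { data: T[]; loaded: boolean };

// One map keyed by both metric-type unions, instead of two clashing `data` fields.
type MetricsData = Record<
  ModelServingMetricType | RuntimeMetricType,
  ContextResourceData<PrometheusQueryRangeResultValue>
>;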

import RuntimeGraphs from '~/pages/modelServing/screens/metrics/RuntimeGraphs';
import { MetricType } from '~/pages/modelServing/screens/types';

const ProjectInferenceMetricsWrapper: React.FC = () => {
@andrewballantyne (Member, Author):
Typo

return (
  <ModelServingMetricsProvider queries={queries} type={MetricType.RUNTIME}>
    <MetricsPage
      title={`ovm metrics`}
A contributor commented:
This is a fixed label, but we currently support custom runtimes; it might be weird to have a different runtime and still display "ovms". Also, I think it is "ovms" (OpenVINO Model Server), not "ovm".

@andrewballantyne (Member, Author):
I understand it's fixed; this will need to be adjusted when we get the proper runtime changes...

cc @vconzola -- please address the comment about the name; I got this "ovm" concept from you. I can definitely expand it -- I just didn't know what OVM was 😛


@andrewballantyne As I mentioned in my comment above, "ovms" is the model server type - OpenVINO Model Server - which comes from the Type column of the model server table. Currently that's all we support. But as we add support for Watson Core Serving and other runtimes, this value will change.

@andrewballantyne (Member, Author) commented Mar 22, 2023:

> But for the model server metrics the breadcrumb should be "Data science projects > Andrew's test > ovms metrics", where "ovms" is the model server type

@vconzola you explicitly said "ovm"... do we want to expand it to the full name? That's fine, I just need clarification. Do we say "OpenVINO Model Serving" anywhere? 🤔

And yes, I am aware it is fixed text... but unless we have all of this text floating around in objects (I don't think we do), it's hardcoded until we support custom runtimes.


"ovms" is what shows up in the model server table Type column. I'm not sure where that text comes from because the user currently doesn't select a runtime. It must come the backend someplace. I want what's in the chart title to match what's in the table so the user can make a 1-1 connection. Just FYI, once we support multiple servers (which should be next sprint, I think) the "Type" is going to be replace by a "Name", so all this will change. I'll show what I mean in the UX meeting tomorrow.

A contributor commented:

The ServingRuntime object has a name attribute; the default one is "ovms", for example. I think we can fetch that name dynamically, since it can change depending on the custom runtime installed.
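
A minimal sketch of that idea (not the dashboard's actual API -- the resource shape is trimmed and the fallback is an assumption):

// Trimmed stand-in for the k8s ServingRuntime resource.
type ServingRuntimeKind = {
  metadata: { name: string };
};

const getRuntimeMetricsLabel = (runtime?: ServingRuntimeKind): string => {
  // Fall back to "ovms", the only supported type today, if the resource
  // hasn't loaded yet.
  const name = runtime?.metadata.name ?? 'ovms';
  return `${name} metrics`;
};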

    link: `/projects/${currentProject.metadata.name}`,
  },
  {
    label: `ovm metrics`,
A contributor commented:
Same with this label.

@andrewballantyne (Member, Author) commented:

Will adjust the PR tomorrow -- might even get some queries as well 🎉

@andrewballantyne (Member, Author) commented:

Going to merge this into the feature branch -- we can swing back to the pending comments and the right queries. I've made notes on the ticket so the conversations are not lost.

Adding labels manually so @alexcreasy can work off this refactor work.



openshift-ci bot commented Mar 27, 2023

[APPROVALNOTIFIER] This PR is APPROVED

Approval requirements bypassed by manually added approval.

This pull-request has been approved by:

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-merge-robot merged commit d9e950c into opendatahub-io:f/mserving-metrics on Mar 27, 2023
alexcreasy pushed a commit to alexcreasy/odh-dashboard that referenced this pull request on Mar 29, 2023

* Rework Inference Metrics & Add Runtime Metrics (invalid queries)

* Add stacked line chart functionality
openshift-merge-robot pushed a commit that referenced this pull request on Jun 30, 2023
* Re-enable Metrics

* Fix Graphs & Configure Infrastructure (#1023)

* Rework Inference Metrics & Add Runtime Metrics (invalid queries)

* Add stacked line chart functionality

* Trustyai demo phase0 (#1093)

* Explainability: Fairness and Bias Metrics (Phase 0) (#1001) (#1006) (#1007) (#1008)
  - Initial feature set for TrustyAI related UI functionality
  - Adds tab based navigation to modelServing screen
  - Adds a bias metrics tab with charts for visualising SPD and DIR metrics
  - Enhances Prometheus query features for accessing TrustyAI data
  - Enhancements to the MetricsChart component, making it more configurable

* Update key of request name to match trusty backend

* Remove unnecessary div and inline style from tooltip

* Remove 15 minutes refresh option

* Prefer optional prop to type union with undefined

* Move function definitions inline

* Prefer narrowing over type conversion

* Inline tab change handler

* Remove toolbar option from ApplicationsPage

* Inline domain calculator functions

* Move defaultDomainCalculator to utils

* Return null instead of undefined

* Use threshold label instead of index for key

* Add enum for tab keys

* Remove magic numbers from domain calculations

* Make ResponsePredicate mandatory and add predicate to useQueryRangeResourceData

* TrustyAI Client (#1318)

* Add support for insecure http requests in development mode

* Adds low level API client for TrustyAI service

* Adds TrustyAI high level API and contexts

* Get scheme of TrustyAI route from k8s data

* Add model bias configuration table (#1290)

* Add model bias configuration table

* rebase and remove mock data

* Update Trusty AI client to handle API changes (#1336) (#1337)

* Add bias metrics configuration modal (#1343)

* Add configuration modal

* address comments

* get rid of some TODOs and refine the route

* Multi-metric display on model bias screen (#1273) (#1349)

* Enhancements to model bias screen

  * Display of multiple bias charts simultaneously

  * Multi-select component, allowing free-text or select-from-list selection of charts to display

  * Ability to collapse / expand individual charts

  * User selectable refresh rates of chart data

  * Chart selection and open/closed status is persisted to the session cache for the life of the user's browser session

* Display user-defined threshold values on charts (#1163)

  * Clean up of bias chart logic

  * Displays thresholds chosen by user, or defaults if none.

  * Improves domain and threshold calculation based on user values or defaults

* Fix metrics submission issue and handle errors (#1378)

* Fix metrics submission issue and handle errors

* fix lint issue

* use error handler on GET functions

* Default and restrict threshold, add tooltips, default duplicate name and set feature flag (#1390)

* Default and restrict threshold, add tooltips, default duplicate name and set feature flag

* fix lint

* add tooltips and dropdown descriptions

* clear data when closing configuration modal

* really solve deleting issue, make empty table view a common component and apply it everywhere

* address comments

* Minor enhancements to bias chart (#1386) (#1399)

* Adds refresh interval options that match openshift observability dashboard

* Show first chart from list, if none selected when user first navigates to bias tab

* Use search icon instead of plus for nothing selected empty state

* Fix error with calculation of 30 days constant

* Deleted charts are removed from session storage

* Fixes issue with bias charts auto-refreshing with stale data (#1403) (#1404)

* Refactor prometheus queries to remove duplication

* Fix graph not refreshing issue

* Add code review suggestions

* Add performance metrics feature flag and refactor runtime server route (#1413)

* Add performance metrics feature flag, refactor runtime server route and solve layout issues

* revert some style changes

* Model serving metrics renaming (#1421)

* Adds support for TrustyAI Operator (#1443)

* Adds support for TrustyAI Operator (#1276)

* Changes from feedback

---------

Co-authored-by: Andrew Ballantyne <[email protected]>
Co-authored-by: Andrew Ballantyne <[email protected]>
Co-authored-by: Alex Creasy <[email protected]>
Co-authored-by: Alex Creasy <[email protected]>
@andrewballantyne deleted the fix-metrics branch on June 26, 2024, 18:28
Development

Successfully merging this pull request may close these issues.

[Feature Request]: Readdress our needs for Model Serving Metrics