[dashboard] Rework dashboard (MIG support, Grafana deprecations, Hostname) #355

Open
wants to merge 1 commit into
base: main

@frittentheke frittentheke commented Jul 8, 2024

After running into various issues with the dashboard (see #353), I started reworking the existing board.
This PR combines all my cleanups and fixes. It also includes the changes from PR #240 by @Levi080513.

Fixes: #353, #236

…name)

* Use PromQL aggregations to take MIG subdevices into account (see NVIDIA#353)
* Update all panels to use Timeseries panels (instead of deprecated Graph)
* Switch from instance to Hostname for selecting individual systems, to avoid
  duplicated time series from Kubernetes DaemonSets and their Pod names
* Use DCGM_FI_DEV_FB_FREE instead of DCGM_FI_DEV_GPU_TEMP to also cover vGPU (~ PR NVIDIA#240)
* Use DCGM_FI_PROF_GR_ENGINE_ACTIVE to determine utilization to cover MIG (and vGPU)

Fixes: NVIDIA#353, NVIDIA#236

Signed-off-by: Christian Rohmann <[email protected]>
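
For readers unfamiliar with why aggregation matters for MIG: with MIG enabled, the exporter reports one series per GPU instance, so a panel that plots the raw metric shows several lines per physical GPU. The sketch below mimics what a PromQL expression such as `sum by (Hostname, gpu) (DCGM_FI_DEV_FB_FREE)` would do. It is an illustration only. The sample values, the host name, the `GPU_I_ID` label, and the choice of `sum` are assumptions, and the actual queries in this PR's dashboard may differ.

```python
from collections import defaultdict

# Hypothetical DCGM_FI_DEV_FB_FREE samples in MiB (values and labels are
# assumptions for illustration). GPU 0 is split into two MIG instances,
# GPU 1 is not partitioned.
samples = [
    {"Hostname": "node-a", "gpu": "0", "GPU_I_ID": "1", "value": 256},
    {"Hostname": "node-a", "gpu": "0", "GPU_I_ID": "2", "value": 512},
    {"Hostname": "node-a", "gpu": "1", "GPU_I_ID": "", "value": 4096},
]


def sum_by(series, labels):
    """Sum sample values per unique combination of the given labels.

    Rough equivalent of PromQL `sum by (<labels>) (<metric>)`: every label
    not listed (here GPU_I_ID) is dropped, so MIG instances of the same
    physical GPU collapse into one series.
    """
    totals = defaultdict(float)
    for sample in series:
        key = tuple(sample[name] for name in labels)
        totals[key] += sample["value"]
    return dict(totals)


# One value per physical GPU and host, regardless of MIG partitioning.
per_gpu = sum_by(samples, ("Hostname", "gpu"))

for (hostname, gpu), free_mib in sorted(per_gpu.items()):
    print(f"{hostname} gpu={gpu} free={free_mib} MiB")
```

Grouping by `Hostname` rather than `instance` follows the same idea: labels that change with the Pod are dropped from the grouping key, so a restarted DaemonSet Pod does not produce a second series for the same machine.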
Successfully merging this pull request may close these issues.

* Duplicated, missing or wrong metrics if using MIG
* Grafana dashboard showing wrong duplicated / false values