NVIDIA DCGM Exporter Dashboard does not work in vGPU cluster #236
Comments
Can you try to use other metrics available on your vGPU?
Can I submit a PR to fix it?
@Levi080513, sure, you can submit PRs; we appreciate community contributions.
frittentheke added a commit to frittentheke/dcgm-exporter that referenced this issue on Jul 8, 2024:
…name)
* Change PromQL queries to take MIG subdevices into account (see NVIDIA#353)
* Update all panels to use Timeseries panels (instead of deprecated Graph)
* Switch from instance to Hostname to select individual systems to avoid duplicated timeseries for Kubernetes daemonsets and their Pod names
* Use DCGM_FI_DEV_GPU_UTIL instead of DCGM_FI_DEV_GPU_TEMP to also cover vGPU (PR NVIDIA#240)
Fixes: NVIDIA#353, NVIDIA#236
Signed-off-by: Christian Rohmann <[email protected]>
frittentheke added a commit to frittentheke/dcgm-exporter that referenced this issue on Jul 8, 2024:
…name)
* Use PromQL aggregations to take MIG subdevices into account (see NVIDIA#353)
* Update all panels to use Timeseries panels (instead of deprecated Graph)
* Switch from instance to Hostname to select individual systems to avoid duplicated timeseries for Kubernetes daemonsets and their Pod names
* Use DCGM_FI_DEV_GPU_UTIL instead of DCGM_FI_DEV_GPU_TEMP to also cover vGPU (PR NVIDIA#240)
* Use DCGM_FI_PROF_GR_ENGINE_ACTIVE to determine utilization to cover MIG (and vGPU)
Fixes: NVIDIA#353, NVIDIA#236
Signed-off-by: Christian Rohmann <[email protected]>
frittentheke added a commit to frittentheke/dcgm-exporter that referenced this issue on Jul 8, 2024:
…name)
* Use PromQL aggregations to take MIG subdevices into account (see NVIDIA#353)
* Update all panels to use Timeseries panels (instead of deprecated Graph)
* Switch from instance to Hostname to select individual systems to avoid duplicated timeseries for Kubernetes daemonsets and their Pod names
* Use DCGM_FI_DEV_FB_FREE instead of DCGM_FI_DEV_GPU_TEMP to also cover vGPU (~ PR NVIDIA#240)
* Use DCGM_FI_PROF_GR_ENGINE_ACTIVE to determine utilization to cover MIG (and vGPU)
Fixes: NVIDIA#353, NVIDIA#236
Signed-off-by: Christian Rohmann <[email protected]>
Currently we use DCGM_FI_DEV_GPU_TEMP to obtain the instance/GPU list, but this metric is not collected in vGPU clusters. This prevents the dashboard from displaying properly.
Affected lines:
dcgm-exporter/grafana/dcgm-exporter-dashboard.json, line 784 in 30d4ddc
dcgm-exporter/grafana/dcgm-exporter-dashboard.json, line 761 in 30d4ddc
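For anyone who needs a local workaround before a fix lands, the following is a minimal sketch of swapping the metric used by the dashboard's template variables. It rests on assumptions not confirmed in this issue: that the variables live under `templating.list` in the dashboard JSON (the usual Grafana layout), that each variable's `query` is either a string or an object with an inner `query` string, and that the queries look like `label_values(DCGM_FI_DEV_GPU_TEMP, instance)`. The function name `swap_variable_metric` is hypothetical. The replacement metric `DCGM_FI_DEV_FB_FREE` is taken from the last commit message referenced in this thread; `DCGM_FI_DEV_GPU_UTIL` is the other candidate mentioned there. Check which one your vGPU setup actually exports first.

```python
import json

OLD_METRIC = "DCGM_FI_DEV_GPU_TEMP"
# Assumption: this metric is exported in your vGPU cluster; verify before use.
NEW_METRIC = "DCGM_FI_DEV_FB_FREE"


def swap_variable_metric(dashboard: dict, old: str = OLD_METRIC, new: str = NEW_METRIC) -> int:
    """Replace `old` with `new` in template-variable queries.

    Returns the number of variables whose query was changed.
    Only the variable definitions are touched; panel queries are left alone.
    """
    changed = 0
    for var in dashboard.get("templating", {}).get("list", []):
        query = var.get("query")
        if isinstance(query, str) and old in query:
            # Older exports store the query as a plain string.
            var["query"] = query.replace(old, new)
            changed += 1
        elif isinstance(query, dict):
            # Newer exports wrap the query string in an object.
            inner = query.get("query")
            if isinstance(inner, str) and old in inner:
                query["query"] = inner.replace(old, new)
                changed += 1
        # Some exports mirror the query in a "definition" field.
        definition = var.get("definition")
        if isinstance(definition, str) and old in definition:
            var["definition"] = definition.replace(old, new)
    return changed


if __name__ == "__main__":
    # Path as referenced in this issue, relative to the directory
    # containing the repository checkout.
    path = "dcgm-exporter/grafana/dcgm-exporter-dashboard.json"
    with open(path, encoding="utf-8") as fh:
        dashboard = json.load(fh)
    count = swap_variable_metric(dashboard)
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(dashboard, fh, indent=2)
    print(f"updated {count} template variable(s)")
```

The sketch rewrites only the variable definitions, since those are what the issue identifies as the cause; panels that plot temperature would still show no data on vGPU.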