NVIDIA DCGM Exporter Dashboard does not work in vGPU cluster #236

Levi080513 · 2024-01-22T05:53:52Z

Currently we use DCGM_FI_DEV_GPU_TEMP to obtain the instance/GPU list, but this metric is not collected in vGPU clusters. This prevents the dashboard from displaying properly.

dcgm-exporter/grafana/dcgm-exporter-dashboard.json

Line 784 in 30d4ddc

"query": "label_values(DCGM_FI_DEV_GPU_TEMP, gpu)",

dcgm-exporter/grafana/dcgm-exporter-dashboard.json

Line 761 in 30d4ddc

"query": "label_values(DCGM_FI_DEV_GPU_TEMP, instance)",

nvvfedorov · 2024-01-25T19:58:53Z

Can you try to use other metrics available on your vGPU?

Levi080513 · 2024-01-26T02:39:59Z

The DCGM_FI_DEV_GPU_UTIL metric works well.

Levi080513 · 2024-01-26T02:49:09Z

Can I submit a PR to fix it?

nvvfedorov · 2024-01-26T14:46:49Z

@Levi080513, sure, you can submit PRs; we appreciate community contributions.
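For readers who want to try the metric swap locally before a fix lands, here is a minimal sketch in Python. It is not the change from any PR. It assumes the dashboard JSON keeps its template variables under `templating.list` with a string `query` field, as in the two lines quoted above; the `definition` field handling is an assumption about the Grafana export format, and the function name `swap_variable_metric` is hypothetical.

```python
import json

OLD_METRIC = "DCGM_FI_DEV_GPU_TEMP"  # not collected on vGPU, per this issue
NEW_METRIC = "DCGM_FI_DEV_GPU_UTIL"  # reported as working on vGPU in this thread


def swap_variable_metric(dashboard: dict, old: str = OLD_METRIC, new: str = NEW_METRIC) -> int:
    """Rewrite label_values() template-variable queries in place.

    Returns the number of template variables that were changed.
    Assumption: variables live under dashboard["templating"]["list"].
    """
    changed = 0
    for var in dashboard.get("templating", {}).get("list", []):
        touched = False
        # "definition" is only rewritten if the export happens to contain it.
        for key in ("query", "definition"):
            value = var.get(key)
            if isinstance(value, str) and value.startswith("label_values(") and old in value:
                var[key] = value.replace(old, new)
                touched = True
        if touched:
            changed += 1
    return changed


# Small in-memory example mirroring the two queries quoted in the issue.
sample = {
    "templating": {
        "list": [
            {"name": "instance", "query": "label_values(DCGM_FI_DEV_GPU_TEMP, instance)"},
            {"name": "gpu", "query": "label_values(DCGM_FI_DEV_GPU_TEMP, gpu)"},
        ]
    }
}
count = swap_variable_metric(sample)
print(count)
print(json.dumps(sample["templating"]["list"], indent=2))
```

To apply it to a real file, load `grafana/dcgm-exporter-dashboard.json` with `json.load`, call the function, and write the result back with `json.dump`. Panel queries are left untouched; only the variable queries that build the instance/GPU lists are rewritten.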

…name)
* Change PromQL queries to take MIG subdevices into account (see NVIDIA#353)
* Update all panels to use Timeseries panels (instead of deprecated Graph)
* Switch from instance to Hostname to select individual systems to avoid duplicated timeseries for Kubernetes daemonsets and their Pod names
* Use DCGM_FI_DEV_GPU_UTIL instead of DCGM_FI_DEV_GPU_TEMP to also cover vGPU (PR NVIDIA#240)
Fixes: NVIDIA#353, NVIDIA#236
Signed-off-by: Christian Rohmann <[email protected]>

…name)
* Use PromQL aggregations to take MIG subdevices into account (see NVIDIA#353)
* Update all panels to use Timeseries panels (instead of deprecated Graph)
* Switch from instance to Hostname to select individual systems to avoid duplicated timeseries for Kubernetes daemonsets and their Pod names
* Use DCGM_FI_DEV_GPU_UTIL instead of DCGM_FI_DEV_GPU_TEMP to also cover vGPU (PR NVIDIA#240)
* Use DCGM_FI_PROF_GR_ENGINE_ACTIVE to determine utilization to cover MIG (and vGPU)
Fixes: NVIDIA#353, NVIDIA#236
Signed-off-by: Christian Rohmann <[email protected]>

…name)
* Use PromQL aggregations to take MIG subdevices into account (see NVIDIA#353)
* Update all panels to use Timeseries panels (instead of deprecated Graph)
* Switch from instance to Hostname to select individual systems to avoid duplicated timeseries for Kubernetes daemonsets and their Pod names
* Use DCGM_FI_DEV_FB_FREE instead of DCGM_FI_DEV_GPU_TEMP to also cover vGPU (~ PR NVIDIA#240)
* Use DCGM_FI_PROF_GR_ENGINE_ACTIVE to determine utilization to cover MIG (and vGPU)
Fixes: NVIDIA#353, NVIDIA#236
Signed-off-by: Christian Rohmann <[email protected]>

Levi080513 linked a pull request Jan 29, 2024 that will close this issue

Fix grafana dashboard cannot display properly in vGPU cluster #240

Open

frittentheke mentioned this issue Jul 8, 2024

[dashboard] Rework dashboard (MIG support, Grafana deprecations, Hostname) #355

Open
