NVIDIA DCGM Exporter Dashboard does not work in vGPU cluster #236

Open
Levi080513 opened this issue Jan 22, 2024 · 4 comments · May be fixed by #240

Comments

@Levi080513

Currently we use DCGM_FI_DEV_GPU_TEMP to obtain the instance/GPU list, but this metric is not collected in vGPU clusters, which prevents the dashboard from displaying properly.

"query": "label_values(DCGM_FI_DEV_GPU_TEMP, gpu)",

"query": "label_values(DCGM_FI_DEV_GPU_TEMP, instance)",

@nvvfedorov
Collaborator

Can you try to use other metrics available on your vGPU?
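
One way to check which metrics are available is to list the DCGM_ names in the exporter's /metrics output on a vGPU node. The following is a minimal sketch, assuming the Prometheus text exposition output has already been fetched into a string. The function name `dcgm_metric_names` and the sample text are illustrative only and are not part of dcgm-exporter.

```python
import re

# Matches the metric name at the start of a sample line in the
# Prometheus text exposition format.
_SAMPLE_NAME = re.compile(r"^([a-zA-Z_:][a-zA-Z0-9_:]*)")


def dcgm_metric_names(exposition_text):
    """Return the sorted DCGM_* metric names that have at least one sample."""
    names = set()
    for line in exposition_text.splitlines():
        line = line.strip()
        # Skip blank lines and HELP/TYPE comment lines.
        if not line or line.startswith("#"):
            continue
        match = _SAMPLE_NAME.match(line)
        if match and match.group(1).startswith("DCGM_"):
            names.add(match.group(1))
    return sorted(names)


# Hypothetical sample, shaped like exporter output; values are made up.
SAMPLE = """\
# HELP DCGM_FI_DEV_GPU_UTIL GPU utilization (in %).
# TYPE DCGM_FI_DEV_GPU_UTIL gauge
DCGM_FI_DEV_GPU_UTIL{gpu="0",Hostname="node-a"} 12
DCGM_FI_DEV_FB_FREE{gpu="0",Hostname="node-a"} 1024
"""

print(dcgm_metric_names(SAMPLE))
```

Any metric that appears in this list on every node type (bare metal, MIG, vGPU) is a candidate for the dashboard's variable queries.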

@Levi080513
Author

The DCGM_FI_DEV_GPU_UTIL metric works well.
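
To illustrate what the change to the dashboard could look like, here is a rough sketch that swaps the metric used in the variable queries. It assumes the usual Grafana dashboard layout, where variables live under `templating.list` and each has a `query` that is either a string or an object with its own `query` field. The helper `swap_variable_metric` and the trimmed-down dashboard dict are hypothetical and are not taken from the repository.

```python
import copy


def swap_variable_metric(dashboard, old, new):
    """Return a copy of a dashboard dict with `old` replaced by `new`
    in every template variable query. The input is left unmodified."""
    patched = copy.deepcopy(dashboard)
    for variable in patched.get("templating", {}).get("list", []):
        query = variable.get("query")
        if isinstance(query, str):
            # Older dashboards store the query as a plain string.
            variable["query"] = query.replace(old, new)
        elif isinstance(query, dict) and isinstance(query.get("query"), str):
            # Newer dashboards wrap it in an object with a "query" field.
            query["query"] = query["query"].replace(old, new)
    return patched


# Hypothetical, heavily trimmed dashboard holding only the two variable queries.
dashboard = {
    "templating": {
        "list": [
            {"name": "instance", "query": "label_values(DCGM_FI_DEV_GPU_TEMP, instance)"},
            {"name": "gpu", "query": "label_values(DCGM_FI_DEV_GPU_TEMP, gpu)"},
        ]
    }
}

patched = swap_variable_metric(dashboard, "DCGM_FI_DEV_GPU_TEMP", "DCGM_FI_DEV_GPU_UTIL")
for variable in patched["templating"]["list"]:
    print(variable["name"], "->", variable["query"])
```

Only the variable queries are touched here; panels that plot temperature would still reference DCGM_FI_DEV_GPU_TEMP and simply show no data on vGPU nodes.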

@Levi080513
Author

Can I submit a PR to fix it?

@nvvfedorov
Collaborator

@Levi080513, sure, you can submit PRs; we appreciate community contributions.

frittentheke added a commit to frittentheke/dcgm-exporter that referenced this issue Jul 8, 2024
…name)

* Change PromQL queries to take MIG subdevices into account (see NVIDIA#353)
* Update all panels to use Timeseries panels (instead of deprecated Graph)
* Switch from instance to Hostname to select individual systems to avoid
  duplicated timeseries for Kubernetes daemonsets and their Pod names
* Use DCGM_FI_DEV_GPU_UTIL instead of DCGM_FI_DEV_GPU_TEMP to also cover vGPU (PR NVIDIA#240)

Fixes: NVIDIA#353, NVIDIA#236

Signed-off-by: Christian Rohmann <[email protected]>
frittentheke added a commit to frittentheke/dcgm-exporter that referenced this issue Jul 8, 2024
…name)

* Use PromQL aggregations to take MIG subdevices into account (see NVIDIA#353)
* Update all panels to use Timeseries panels (instead of deprecated Graph)
* Switch from instance to Hostname to select individual systems to avoid
  duplicated timeseries for Kubernetes daemonsets and their Pod names
* Use DCGM_FI_DEV_GPU_UTIL instead of DCGM_FI_DEV_GPU_TEMP to also cover vGPU (PR NVIDIA#240)
* Use DCGM_FI_PROF_GR_ENGINE_ACTIVE to determine utilization to cover MIG (and vGPU)

Fixes: NVIDIA#353, NVIDIA#236

Signed-off-by: Christian Rohmann <[email protected]>
frittentheke added a commit to frittentheke/dcgm-exporter that referenced this issue Jul 8, 2024
…name)

* Use PromQL aggregations to take MIG subdevices into account (see NVIDIA#353)
* Update all panels to use Timeseries panels (instead of deprecated Graph)
* Switch from instance to Hostname to select individual systems to avoid
  duplicated timeseries for Kubernetes daemonsets and their Pod names
* Use DCGM_FI_DEV_FB_FREE instead of DCGM_FI_DEV_GPU_TEMP to also cover vGPU (~ PR NVIDIA#240)
* Use DCGM_FI_PROF_GR_ENGINE_ACTIVE to determine utilization to cover MIG (and vGPU)

Fixes: NVIDIA#353, NVIDIA#236

Signed-off-by: Christian Rohmann <[email protected]>