DCGM Exporter in EKS p4d.24xlarge instance type controller error #387

camilopaezrios · 2024-09-05T19:41:35Z

What is the version?

3.4.2.

What happened?

I have an EKS cluster that runs heavy GPU tasks, and I want to integrate monitoring with Datadog. I am stuck deploying the DCGM exporter in my prod environment (multiple p4d.24xlarge nodes), although the same deployment worked in my dev environment (a p3.2xlarge, to keep costs down) with the same AMI: AL2_X86_64_GPU - amazon-eks-gpu-node-1.29-v20240729.
The error I am getting is:
level=error msg="Encountered a failure." stacktrace="goroutine 1 [running]:\nruntime/debug.Stack()\n\t/usr/local/go/src/runtime/debug/stack.go:24 +0x5e\ngithub.com/NVIDIA/dcgm-exporter/pkg/cmd.action.func1.1()\n\t/go/src/github.com/NVIDIA/dcgm-exporter/pkg/cmd/app.go:276 +0x3d\npanic({0x18058a0?, 0x2945390?})\n\t/usr/local/go/src/runtime/panic.go:914 +0x21f\ngithub.com/NVIDIA/dcgm-exporter/pkg/cmd.initDCGM(0xc0005831e0)\n\t/go/src/github.com/NVIDIA/dcgm-exporter/pkg/cmd/app.go:516 +0x9b\ngithub.com/NVIDIA/dcgm-exporter/pkg/cmd.startDCGMExporter(0x47c312?, 0xc000321360)\n\t/go/src/github.com/NVIDIA/dcgm-exporter/pkg/cmd/app.go:296 +0xb2\ngithub.com/NVIDIA/dcgm-exporter/pkg/cmd.action.func1()\n\t/go/src/github.com/NVIDIA/dcgm-exporter/pkg/cmd/app.go:280 +0x5b\ngithub.com/NVIDIA/dcgm-exporter/pkg/stdout.Capture({0x1cf3418?, 0xc00002e5a0}, 0xc00044db70)\n\t/go/src/github.com/NVIDIA/dcgm-exporter/pkg/stdout/capture.go:77 +0x1f5\ngithub.com/NVIDIA/dcgm-exporter/pkg/cmd.action(0xc0002a0380)\n\t/go/src/github.com/NVIDIA/dcgm-exporter/pkg/cmd/app.go:271 +0x67\ngithub.com/NVIDIA/dcgm-exporter/pkg/cmd.NewApp.func1(0xc0000274a0?)\n\t/go/src/github.com/NVIDIA/dcgm-exporter/pkg/cmd/app.go:256 +0x13\ngithub.com/urfave/cli/v2.(*Command).Run(0xc0000274a0, 0xc0002a0380, {0xc000040150, 0x3, 0x3})\n\t/go/pkg/mod/github.com/urfave/cli/[email protected]/command.go:279 +0x9dd\ngithub.com/urfave/cli/v2.(*App).RunContext(0xc000057400, {0x1cf3300?, 0x2a0c420}, {0xc000040150, 0x3, 0x3})\n\t/go/pkg/mod/github.com/urfave/cli/[email protected]/app.go:337 +0x5db\ngithub.com/urfave/cli/v2.(*App).Run(0xc00044df20?, {0xc000040150?, 0x1?, 0x163cbb0?})\n\t/go/pkg/mod/github.com/urfave/cli/[email protected]/app.go:311 +0x2f\nmain.main()\n\t/go/src/github.com/NVIDIA/dcgm-exporter/cmd/dcgm-exporter/main.go:35 +0x5f\n"

The installation was done via Helm, following this document: https://docs.datadoghq.com/integrations/dcgm/?tab=kubernetes.
I am using version 3.4.2 rather than the latest because the latest triggers the error in #318.

The variables DCGM_FI_DEV_COUNT, DCGM_FI_PROCESS_NAME, and DCGM_FI_CUDA_DRIVER_VERSION were commented out so that they are not reported, because reporting them triggers the error in #318.

What did you expect to happen?

Agent running properly

What is the GPU model?

p4d.24xlarge

What is the environment?

AWS EKS

How did you deploy the dcgm-exporter and what is the configuration?

No response

How to reproduce the issue?

No response

Anything else we need to know?

No response

camilopaezrios added the bug label on Sep 5, 2024
