What is the version?
3.4.2.
What happened?
I have an EKS cluster that runs heavy GPU tasks, and I want to integrate monitoring with Datadog. I am stuck deploying the DCGM exporter in my prod environment (multiple p4d.24xlarge instances). The same deployment works in my dev environment (a single p3.2xlarge, to keep costs down) with the same AMI: AL2_X86_64_GPU - amazon-eks-gpu-node-1.29-v20240729.
The error I am getting is:
level=error msg="Encountered a failure." stacktrace="goroutine 1 [running]:\nruntime/debug.Stack()\n\t/usr/local/go/src/runtime/debug/stack.go:24 +0x5e\ngithub.com/NVIDIA/dcgm-exporter/pkg/cmd.action.func1.1()\n\t/go/src/github.com/NVIDIA/dcgm-exporter/pkg/cmd/app.go:276 +0x3d\npanic({0x18058a0?, 0x2945390?})\n\t/usr/local/go/src/runtime/panic.go:914 +0x21f\ngithub.com/NVIDIA/dcgm-exporter/pkg/cmd.initDCGM(0xc0005831e0)\n\t/go/src/github.com/NVIDIA/dcgm-exporter/pkg/cmd/app.go:516 +0x9b\ngithub.com/NVIDIA/dcgm-exporter/pkg/cmd.startDCGMExporter(0x47c312?, 0xc000321360)\n\t/go/src/github.com/NVIDIA/dcgm-exporter/pkg/cmd/app.go:296 +0xb2\ngithub.com/NVIDIA/dcgm-exporter/pkg/cmd.action.func1()\n\t/go/src/github.com/NVIDIA/dcgm-exporter/pkg/cmd/app.go:280 +0x5b\ngithub.com/NVIDIA/dcgm-exporter/pkg/stdout.Capture({0x1cf3418?, 0xc00002e5a0}, 0xc00044db70)\n\t/go/src/github.com/NVIDIA/dcgm-exporter/pkg/stdout/capture.go:77 +0x1f5\ngithub.com/NVIDIA/dcgm-exporter/pkg/cmd.action(0xc0002a0380)\n\t/go/src/github.com/NVIDIA/dcgm-exporter/pkg/cmd/app.go:271 +0x67\ngithub.com/NVIDIA/dcgm-exporter/pkg/cmd.NewApp.func1(0xc0000274a0?)\n\t/go/src/github.com/NVIDIA/dcgm-exporter/pkg/cmd/app.go:256 +0x13\ngithub.com/urfave/cli/v2.(*Command).Run(0xc0000274a0, 0xc0002a0380, {0xc000040150, 0x3, 0x3})\n\t/go/pkg/mod/github.com/urfave/cli/[email protected]/command.go:279 +0x9dd\ngithub.com/urfave/cli/v2.(*App).RunContext(0xc000057400, {0x1cf3300?, 0x2a0c420}, {0xc000040150, 0x3, 0x3})\n\t/go/pkg/mod/github.com/urfave/cli/[email protected]/app.go:337 +0x5db\ngithub.com/urfave/cli/v2.(*App).Run(0xc00044df20?, {0xc000040150?, 0x1?, 0x163cbb0?})\n\t/go/pkg/mod/github.com/urfave/cli/[email protected]/app.go:311 +0x2f\nmain.main()\n\t/go/src/github.com/NVIDIA/dcgm-exporter/cmd/dcgm-exporter/main.go:35 +0x5f\n"
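The stack trace above is logged on one line with literal `\n` and `\t` escapes, which makes it hard to read. A minimal Python sketch for expanding it, assuming the `stacktrace="..."` field format shown in the log line (the helper name `expand_stacktrace` is hypothetical):

```python
import re


def expand_stacktrace(log_line: str) -> str:
    """Return the stacktrace field of a log line with literal \\n and \\t expanded."""
    # Greedy match up to the last double quote on the line.
    match = re.search(r'stacktrace="(.*)"', log_line, flags=re.DOTALL)
    if match is None:
        return log_line  # no stacktrace field; return the line unchanged
    raw = match.group(1)
    # Replace the two-character escape sequences with real whitespace.
    return raw.replace("\\n", "\n").replace("\\t", "\t")


if __name__ == "__main__":
    # Shortened, illustrative log line; paste the real one in its place.
    sample = (
        'level=error msg="Encountered a failure." '
        'stacktrace="goroutine 1 [running]:\\nmain.main()\\n\\t/go/src/main.go:35 +0x5f\\n"'
    )
    print(expand_stacktrace(sample))
```

Expanded this way, the trace shows the panic being raised inside `initDCGM` (`pkg/cmd/app.go:516`), that is, during DCGM initialization.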
The installation was done via Helm, following this document: https://docs.datadoghq.com/integrations/dcgm/?tab=kubernetes
I am using version 3.4.2 rather than the latest because the latest triggers the error described in #318.
The fields DCGM_FI_DEV_COUNT, DCGM_FI_PROCESS_NAME, and DCGM_FI_CUDA_DRIVER_VERSION are commented out because reporting them also triggers the error in #318.
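For anyone reproducing this setup, commenting out those fields can be scripted. The sketch below assumes the metrics are listed in the exporter's CSV collectors file, where each row starts with the DCGM field name and a leading `#` marks a comment. The function name and the sample row descriptions are illustrative and not copied from the actual deployment.

```python
FIELDS_TO_DISABLE = {
    "DCGM_FI_DEV_COUNT",
    "DCGM_FI_PROCESS_NAME",
    "DCGM_FI_CUDA_DRIVER_VERSION",
}


def comment_out_fields(csv_text: str, fields=FIELDS_TO_DISABLE) -> str:
    """Prefix '# ' to every row whose first column is one of the given fields."""
    out = []
    for line in csv_text.splitlines():
        # The first comma-separated column is the DCGM field name.
        name = line.split(",", 1)[0].strip()
        if name in fields:
            line = "# " + line
        out.append(line)
    return "\n".join(out)


if __name__ == "__main__":
    # Illustrative rows in "field, type, help text" form.
    sample = (
        "DCGM_FI_DEV_GPU_TEMP, gauge, GPU temperature.\n"
        "DCGM_FI_DEV_COUNT, gauge, Number of devices on the node.\n"
        "DCGM_FI_PROCESS_NAME, label, Process name."
    )
    print(comment_out_fields(sample))
```

Rows that are already commented out are left untouched, because their first column then starts with `#` and no longer matches a field name.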
What did you expect to happen?
The agent should start and run properly.
What is the GPU model?
p4d.24xlarge (AWS instance type with 8 NVIDIA A100 GPUs)
What is the environment?
AWS EKS
How did you deploy the dcgm-exporter and what is the configuration?
No response
How to reproduce the issue?
No response
Anything else we need to know?
No response