Implement an initial set of uniform DCGM GPU metrics in dcgmreceiver. #219

Merged
merged 39 commits into from
Sep 13, 2024
Changes from 21 commits
8bac1d2
Better assertions in tests.
igorpeshansky Jun 17, 2024
00976dd
Remove support for K80 (EOL).
igorpeshansky Jun 19, 2024
afd9512
Fix supported field filtering to only consider profiling fields.
igorpeshansky Jun 20, 2024
e78b829
Enabled option reorder.
igorpeshansky Jun 21, 2024
c33595c
Turn the GPU device metric attributes into resource attributes.
igorpeshansky May 24, 2024
89c3919
Get rid of dcgmNameToMetricName and use DCGM field names directly.
igorpeshansky May 29, 2024
70d1caf
More precise value errors.
igorpeshansky Jun 20, 2024
6c3748c
Add most metrics from the doc.
igorpeshansky May 29, 2024
3560661
Turn gpu.dcgm.sm.occupancy off by default.
igorpeshansky Jun 21, 2024
24a21cb
Ingest new metrics instead.
igorpeshansky May 30, 2024
5c28868
Add test data for H100.
igorpeshansky Jun 19, 2024
80430df
Remove old metrics.
igorpeshansky May 30, 2024
ef78211
Check for supported non-profiling fields by validating polled values.
igorpeshansky Jun 20, 2024
c7e015f
Pull in deltatorate processor.
igorpeshansky Jun 13, 2024
af0f4d9
Update attribute names; rename {pcie|nvlink}.traffic to {pcie|nvlink}…
igorpeshansky Jun 24, 2024
20f3b8d
Implement fallbacks.
igorpeshansky Jun 25, 2024
2ab8868
Skip test gracefully when pausing profiling not supported.
igorpeshansky Jun 27, 2024
4cc8a8b
Don't fail client tests on blank values.
igorpeshansky Jun 28, 2024
d0ad1ad
Fix supported field error handling.
igorpeshansky Jul 17, 2024
d19b09e
Fix lint errors.
igorpeshansky Jul 18, 2024
59d2a4a
Fix description of the network.io.direction label.
igorpeshansky Jul 23, 2024
97d2689
Avoid panics when tests run with no GPU.
igorpeshansky Jul 25, 2024
d47313e
Review feedback.
igorpeshansky Jul 25, 2024
2ad3400
Need to aggregate per device.
igorpeshansky Jul 28, 2024
e9f3300
Make gpu.dcgm.pcie.io and gpu.dcgm.nvlink.io cumulative.
igorpeshansky Jul 25, 2024
adb4c2e
Pull in cumulativetodelta processor.
igorpeshansky Jul 26, 2024
85bd141
Decouple the client from the receiver config.
igorpeshansky Jul 26, 2024
b82fa9b
Store the typed value directly in dcgmMetric; avoid unsafe.
igorpeshansky Jul 26, 2024
ccfd16e
Oops, temperature and clock frequency and energy consumption are int64.
igorpeshansky Jul 26, 2024
e7ffb3e
Scale energy consumption properly.
igorpeshansky Jul 26, 2024
e4a0026
More debug logging.
igorpeshansky Jul 26, 2024
e673697
Implement a rateIntegrator struct.
igorpeshansky Jul 26, 2024
b779893
Implement a cumulativeTracker struct.
igorpeshansky Jul 27, 2024
d8362b3
Really pull in the cumulativetodelta processor.
igorpeshansky Aug 14, 2024
dd39c32
Collect metrics asynchronously (#223)
quentinmit Sep 11, 2024
5de8b1c
Fix lint errors.
igorpeshansky Sep 11, 2024
ea0eba5
Fix data race in scraper_test.
igorpeshansky Sep 12, 2024
ac52c0f
Cleanups.
igorpeshansky Sep 13, 2024
5c0f8c4
Merge branch 'master' into igorpeshansky-dcgm-new-metrics
igorpeshansky Sep 13, 2024
1 change: 1 addition & 0 deletions go.mod
@@ -9,6 +9,7 @@ require (
github.com/open-telemetry/opentelemetry-collector-contrib/exporter/googlecloudexporter v0.102.0
github.com/open-telemetry/opentelemetry-collector-contrib/exporter/googlemanagedprometheusexporter v0.102.0
github.com/open-telemetry/opentelemetry-collector-contrib/pkg/pdatautil v0.102.0
github.com/open-telemetry/opentelemetry-collector-contrib/processor/deltatorateprocessor v0.102.0
github.com/open-telemetry/opentelemetry-collector-contrib/processor/filterprocessor v0.102.0
github.com/open-telemetry/opentelemetry-collector-contrib/processor/groupbyattrsprocessor v0.102.0
github.com/open-telemetry/opentelemetry-collector-contrib/processor/metricstransformprocessor v0.102.0
2 changes: 2 additions & 0 deletions go.sum
@@ -747,6 +747,8 @@ github.com/open-telemetry/opentelemetry-collector-contrib/pkg/translator/prometh
github.com/open-telemetry/opentelemetry-collector-contrib/pkg/translator/prometheusremotewrite v0.102.0/go.mod h1:+Vlutd4t2XluxHYbIAfZiz3z5uWbsbiIUpipV5AnLtk=
github.com/open-telemetry/opentelemetry-collector-contrib/pkg/winperfcounters v0.102.0 h1:adfJy3Sev2MaD6+plcmsSecpzy8h4MJT7eXEuif/2Ew=
github.com/open-telemetry/opentelemetry-collector-contrib/pkg/winperfcounters v0.102.0/go.mod h1:FJmA939yem9GSEbqjCK6CXVbPfNPFKhvKnn+nWNpWio=
github.com/open-telemetry/opentelemetry-collector-contrib/processor/deltatorateprocessor v0.102.0 h1:mj3t9/FAQZjcZJA2kjgbpz2fSK9yD/pYpmqKEWpHJ1A=
github.com/open-telemetry/opentelemetry-collector-contrib/processor/deltatorateprocessor v0.102.0/go.mod h1:IIIjEblgrNISbDY7GPMMto9kEVIf0n9IeJoVru89kfY=
github.com/open-telemetry/opentelemetry-collector-contrib/processor/filterprocessor v0.102.0 h1:DaEYlVCn58GtkyYVK0IT/ZMjRFJ+BfmR0p9I0Eq42aQ=
github.com/open-telemetry/opentelemetry-collector-contrib/processor/filterprocessor v0.102.0/go.mod h1:u9x08rUCWdgI8Nle5XOMTCmxd0K26KTZvMMA5H8Xjyg=
github.com/open-telemetry/opentelemetry-collector-contrib/processor/groupbyattrsprocessor v0.102.0 h1:huh7V8uqMakQGdnbOqTSZihfoDeOIbNHfFt62HMsk5k=
283 changes: 214 additions & 69 deletions receiver/dcgmreceiver/client.go

Large diffs are not rendered by default.

182 changes: 105 additions & 77 deletions receiver/dcgmreceiver/client_gpu_test.go
@@ -28,7 +28,6 @@ import (
"testing"
"time"

"github.com/NVIDIA/go-dcgm/pkg/dcgm"
"github.com/stretchr/testify/assert"
"github.com/stretchr/testify/require"
"go.uber.org/zap/zaptest"
@@ -48,8 +47,9 @@ type modelSupportedFields struct {
UnsupportedFields []string `yaml:"unsupported_fields"`
}

// TestSupportedFieldsWithGolden test getAllSupportedFields() against the golden
// files for the current GPU model
// TestSupportedFieldsWithGolden tests getSupportedRegularFields() and
// getSupportedProfilingFields() against the golden files for the current GPU
// model
func TestSupportedFieldsWithGolden(t *testing.T) {
config := createDefaultConfig().(*Config)
client, err := newClient(config, zaptest.NewLogger(t))
@@ -58,21 +58,19 @@ func TestSupportedFieldsWithGolden(t *testing.T) {
assert.NotEmpty(t, client.devicesModelName)
gpuModel := client.getDeviceModelName(0)
allFields := discoverRequestedFieldIDs(config)
supportedFields, err := getAllSupportedFields()
supportedRegularFields, err := getSupportedRegularFields(allFields, zaptest.NewLogger(t))
require.Nil(t, err)
enabledFields, unavailableFields := filterSupportedFields(allFields, supportedFields)
supportedProfilingFields, err := getSupportedProfilingFields()
require.Nil(t, err)
enabledFields, unavailableFields := filterSupportedFields(allFields, supportedRegularFields, supportedProfilingFields)

dcgmIDToNameMap := make(map[dcgm.Short]string, len(dcgm.DCGM_FI))
for fieldName, fieldID := range dcgm.DCGM_FI {
dcgmIDToNameMap[fieldID] = fieldName
}
var enabledFieldsString []string
var unavailableFieldsString []string
for _, f := range enabledFields {
enabledFieldsString = append(enabledFieldsString, dcgmIDToNameMap[f])
enabledFieldsString = append(enabledFieldsString, dcgmIDToName[f])
}
for _, f := range unavailableFields {
unavailableFieldsString = append(unavailableFieldsString, dcgmIDToNameMap[f])
unavailableFieldsString = append(unavailableFieldsString, dcgmIDToName[f])
}
m := modelSupportedFields{
Model: gpuModel,
@@ -83,7 +81,7 @@ func TestSupportedFieldsWithGolden(t *testing.T) {
if err != nil {
t.Fatal(err)
}
assert.Equal(t, len(dcgmNameToMetricName), len(client.enabledFieldIDs)+len(unavailableFieldsString))
assert.Equal(t, len(allFields), len(client.enabledFieldIDs)+len(unavailableFieldsString))
goldenPath := getModelGoldenFilePath(t, gpuModel)
golden.Assert(t, string(actual), goldenPath)
client.cleanup()
@@ -93,22 +91,6 @@ func TestSupportedFieldsWithGolden(t *testing.T) {
// file, given a GPU model string
func LoadExpectedMetrics(t *testing.T, model string) []string {
t.Helper()
dcgmNameToMetricNameMap := map[string]string{
"DCGM_FI_DEV_GPU_UTIL": "dcgm.gpu.utilization",
"DCGM_FI_DEV_FB_USED": "dcgm.gpu.memory.bytes_used",
"DCGM_FI_DEV_FB_FREE": "dcgm.gpu.memory.bytes_free",
"DCGM_FI_PROF_SM_ACTIVE": "dcgm.gpu.profiling.sm_utilization",
"DCGM_FI_PROF_SM_OCCUPANCY": "dcgm.gpu.profiling.sm_occupancy",
"DCGM_FI_PROF_PIPE_TENSOR_ACTIVE": "dcgm.gpu.profiling.tensor_utilization",
"DCGM_FI_PROF_DRAM_ACTIVE": "dcgm.gpu.profiling.dram_utilization",
"DCGM_FI_PROF_PIPE_FP64_ACTIVE": "dcgm.gpu.profiling.fp64_utilization",
"DCGM_FI_PROF_PIPE_FP32_ACTIVE": "dcgm.gpu.profiling.fp32_utilization",
"DCGM_FI_PROF_PIPE_FP16_ACTIVE": "dcgm.gpu.profiling.fp16_utilization",
"DCGM_FI_PROF_PCIE_TX_BYTES": "dcgm.gpu.profiling.pcie_sent_bytes",
"DCGM_FI_PROF_PCIE_RX_BYTES": "dcgm.gpu.profiling.pcie_received_bytes",
"DCGM_FI_PROF_NVLINK_TX_BYTES": "dcgm.gpu.profiling.nvlink_sent_bytes",
"DCGM_FI_PROF_NVLINK_RX_BYTES": "dcgm.gpu.profiling.nvlink_received_bytes",
}
goldenPath := getModelGoldenFilePath(t, model)
goldenFile, err := ioutil.ReadFile(goldenPath)
if err != nil {
@@ -121,7 +103,7 @@ func LoadExpectedMetrics(t *testing.T, model string) []string {
}
var expectedMetrics []string
for _, supported := range m.SupportedFields {
expectedMetrics = append(expectedMetrics, dcgmNameToMetricNameMap[supported])
expectedMetrics = append(expectedMetrics, supported)
}
return expectedMetrics
}
@@ -156,63 +138,109 @@ func TestCollectGpuProfilingMetrics(t *testing.T) {
expectedMetrics := LoadExpectedMetrics(t, client.devicesModelName[0])
var maxCollectionInterval = 60 * time.Second
before := time.Now().UnixMicro() - maxCollectionInterval.Microseconds()
metrics, err := client.collectDeviceMetrics()
deviceMetrics, err := client.collectDeviceMetrics()
after := time.Now().UnixMicro()
assert.Nil(t, err)

seenMetric := make(map[string]bool)
for _, metric := range metrics {
assert.GreaterOrEqual(t, metric.gpuIndex, uint(0))
assert.LessOrEqual(t, metric.gpuIndex, uint(32))

switch metric.name {
case "dcgm.gpu.profiling.tensor_utilization":
fallthrough
case "dcgm.gpu.profiling.dram_utilization":
fallthrough
case "dcgm.gpu.profiling.fp64_utilization":
fallthrough
case "dcgm.gpu.profiling.fp32_utilization":
fallthrough
case "dcgm.gpu.profiling.fp16_utilization":
fallthrough
case "dcgm.gpu.profiling.sm_occupancy":
fallthrough
case "dcgm.gpu.profiling.sm_utilization":
assert.GreaterOrEqual(t, metric.asFloat64(), float64(0.0))
assert.LessOrEqual(t, metric.asFloat64(), float64(1.0))
case "dcgm.gpu.utilization":
assert.GreaterOrEqual(t, metric.asInt64(), int64(0))
assert.LessOrEqual(t, metric.asInt64(), int64(100))
case "dcgm.gpu.memory.bytes_free":
fallthrough
case "dcgm.gpu.memory.bytes_used":
// arbitrary max of 10 TiB
assert.GreaterOrEqual(t, metric.asInt64(), int64(0))
assert.LessOrEqual(t, metric.asInt64(), int64(10485760))
case "dcgm.gpu.profiling.pcie_sent_bytes":
fallthrough
case "dcgm.gpu.profiling.pcie_received_bytes":
fallthrough
case "dcgm.gpu.profiling.nvlink_sent_bytes":
fallthrough
case "dcgm.gpu.profiling.nvlink_received_bytes":
// arbitrary max of 10 TiB/sec
assert.GreaterOrEqual(t, metric.asInt64(), int64(0))
assert.LessOrEqual(t, metric.asInt64(), int64(10995116277760))
default:
t.Errorf("Unexpected metric '%s'", metric.name)
assert.GreaterOrEqual(t, len(deviceMetrics), 0)
assert.LessOrEqual(t, len(deviceMetrics), 32)
for gpuIndex, metrics := range deviceMetrics {
for _, metric := range metrics {
switch metric.name {
case "DCGM_FI_PROF_GR_ENGINE_ACTIVE":
fallthrough
case "DCGM_FI_PROF_SM_ACTIVE":
fallthrough
case "DCGM_FI_PROF_SM_OCCUPANCY":
fallthrough
case "DCGM_FI_PROF_PIPE_TENSOR_ACTIVE":
fallthrough
case "DCGM_FI_PROF_PIPE_FP64_ACTIVE":
fallthrough
case "DCGM_FI_PROF_PIPE_FP32_ACTIVE":
fallthrough
case "DCGM_FI_PROF_PIPE_FP16_ACTIVE":
fallthrough
case "DCGM_FI_PROF_DRAM_ACTIVE":
assert.GreaterOrEqual(t, metric.asFloat64(), float64(0.0))
assert.LessOrEqual(t, metric.asFloat64(), float64(1.0))
case "DCGM_FI_DEV_GPU_UTIL":
fallthrough
case "DCGM_FI_DEV_MEM_COPY_UTIL":
fallthrough
case "DCGM_FI_DEV_ENC_UTIL":
fallthrough
case "DCGM_FI_DEV_DEC_UTIL":
assert.GreaterOrEqual(t, metric.asInt64(), int64(0))
assert.LessOrEqual(t, metric.asInt64(), int64(100))
case "DCGM_FI_DEV_FB_FREE":
fallthrough
case "DCGM_FI_DEV_FB_USED":
fallthrough
case "DCGM_FI_DEV_FB_RESERVED":
// arbitrary max of 10 TiB, expressed in MiB (the unit DCGM uses for framebuffer fields)
assert.GreaterOrEqual(t, metric.asInt64(), int64(0))
assert.LessOrEqual(t, metric.asInt64(), int64(10485760))
case "DCGM_FI_PROF_PCIE_TX_BYTES":
fallthrough
case "DCGM_FI_PROF_PCIE_RX_BYTES":
fallthrough
case "DCGM_FI_PROF_NVLINK_TX_BYTES":
fallthrough
case "DCGM_FI_PROF_NVLINK_RX_BYTES":
// arbitrary max of 10 TiB/sec
assert.GreaterOrEqual(t, metric.asInt64(), int64(0))
assert.LessOrEqual(t, metric.asInt64(), int64(10995116277760))
case "DCGM_FI_DEV_BOARD_LIMIT_VIOLATION":
fallthrough
case "DCGM_FI_DEV_LOW_UTIL_VIOLATION":
fallthrough
case "DCGM_FI_DEV_POWER_VIOLATION":
fallthrough
case "DCGM_FI_DEV_RELIABILITY_VIOLATION":
fallthrough
case "DCGM_FI_DEV_SYNC_BOOST_VIOLATION":
fallthrough
case "DCGM_FI_DEV_THERMAL_VIOLATION":
fallthrough
case "DCGM_FI_DEV_TOTAL_APP_CLOCKS_VIOLATION":
fallthrough
case "DCGM_FI_DEV_TOTAL_BASE_CLOCKS_VIOLATION":
assert.GreaterOrEqual(t, metric.asInt64(), int64(0))
assert.LessOrEqual(t, metric.asInt64(), time.Now().UnixMicro())
case "DCGM_FI_DEV_ECC_DBE_VOL_TOTAL":
fallthrough
case "DCGM_FI_DEV_ECC_SBE_VOL_TOTAL":
// arbitrary max of 100000000 errors
assert.GreaterOrEqual(t, metric.asInt64(), int64(0))
assert.LessOrEqual(t, metric.asInt64(), int64(100000000))
case "DCGM_FI_DEV_GPU_TEMP":
// arbitrary max of 100000 °C
assert.GreaterOrEqual(t, metric.asFloat64(), float64(0.0))
assert.LessOrEqual(t, metric.asFloat64(), float64(100000.0))
case "DCGM_FI_DEV_SM_CLOCK":
// arbitrary max of 100000 MHz
assert.GreaterOrEqual(t, metric.asFloat64(), float64(0.0))
assert.LessOrEqual(t, metric.asFloat64(), float64(100000.0))
case "DCGM_FI_DEV_TOTAL_ENERGY_CONSUMPTION":
// TODO
case "DCGM_FI_DEV_POWER_USAGE":
// TODO
default:
t.Errorf("Unexpected metric '%s'", metric.name)
}

assert.GreaterOrEqual(t, metric.timestamp, before)
assert.LessOrEqual(t, metric.timestamp, after)

seenMetric[fmt.Sprintf("gpu{%d}.metric{%s}", gpuIndex, metric.name)] = true
}

assert.GreaterOrEqual(t, metric.timestamp, before)
assert.LessOrEqual(t, metric.timestamp, after)

seenMetric[fmt.Sprintf("gpu{%d}.metric{%s}", metric.gpuIndex, metric.name)] = true
}

for _, gpuIndex := range client.deviceIndices {
for _, metric := range expectedMetrics {
assert.Equal(t, seenMetric[fmt.Sprintf("gpu{%d}.metric{%s}", gpuIndex, metric)], true)
assert.True(t, seenMetric[fmt.Sprintf("gpu{%d}.metric{%s}", gpuIndex, metric)], fmt.Sprintf("%s on gpu %d", metric, gpuIndex))
}
}
client.cleanup()
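
The long fallthrough chains in the switch above group DCGM field names that share one range check. The same grouping can also be written with Go's comma-separated case lists. The sketch below is illustrative only: checkRange is a hypothetical helper that is not part of this PR, and it covers just a handful of the fields and bounds used in the test.

```go
package main

import "fmt"

// checkRange is a hypothetical helper, not part of the PR. It reports
// whether value lies within the range the test expects for the named
// DCGM field (inRange), and whether the field name is recognized (known).
func checkRange(name string, value float64) (inRange, known bool) {
	switch name {
	// Ratio-valued profiling fields: expected in [0, 1].
	case "DCGM_FI_PROF_SM_ACTIVE",
		"DCGM_FI_PROF_SM_OCCUPANCY",
		"DCGM_FI_PROF_DRAM_ACTIVE":
		return value >= 0 && value <= 1, true
	// Percentage-valued device fields: expected in [0, 100].
	case "DCGM_FI_DEV_GPU_UTIL",
		"DCGM_FI_DEV_MEM_COPY_UTIL":
		return value >= 0 && value <= 100, true
	default:
		return false, false
	}
}

func main() {
	samples := []struct {
		name  string
		value float64
	}{
		{"DCGM_FI_PROF_SM_ACTIVE", 0.5},
		{"DCGM_FI_DEV_GPU_UTIL", 150},
		{"DCGM_FI_NOT_A_FIELD", 0},
	}
	for _, s := range samples {
		inRange, known := checkRange(s.name, s.value)
		fmt.Printf("%s=%v inRange=%t known=%t\n", s.name, s.value, inRange, known)
	}
}
```

Behavior is identical to a fallthrough chain; the case-list form only shortens the switch and removes the risk of a missing fallthrough silently skipping an assertion.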
2 changes: 1 addition & 1 deletion receiver/dcgmreceiver/client_test.go
@@ -32,7 +32,7 @@ import (
func TestNewDcgmClientOnInitializationError(t *testing.T) {
realDcgmInit := dcgmInit
defer func() { dcgmInit = realDcgmInit }()
dcgmInit = func(args ...string) (func(), error) {
dcgmInit = func(...string) (func(), error) {
return nil, fmt.Errorf("No DCGM client library *OR* No DCGM connection")
}

2 changes: 2 additions & 0 deletions receiver/dcgmreceiver/config.go
@@ -30,4 +30,6 @@ type Config struct {
scraperhelper.ControllerConfig `mapstructure:",squash"`
confignet.TCPAddrConfig `mapstructure:",squash"`
Metrics metadata.MetricsConfig `mapstructure:"metrics"`
retryBlankValues bool
maxRetries int
}