Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Collect metrics asynchronously #223

Merged
merged 31 commits into from
Sep 11, 2024
Merged
Changes from 1 commit
Commits
Show all changes
31 commits
Select commit Hold shift + click to select a range
e4e0d4b
Checkpoint (refactor scraping to a separate goroutine)
quentinmit Aug 5, 2024
795f60a
scraper.go now compiles
quentinmit Aug 5, 2024
ede6247
goimports
quentinmit Aug 5, 2024
69982a6
client_gpu_test compiles
quentinmit Aug 5, 2024
1ffd000
Update scraper_gpu_test
quentinmit Aug 6, 2024
7cdb057
Only use points after the start of the client
quentinmit Aug 6, 2024
039d832
Wait for GPU metrics to be present
quentinmit Aug 6, 2024
03dc917
Make shutdown behavior reliable
quentinmit Aug 7, 2024
3a803c4
Adapt RateIntegrator tests for metricStats
quentinmit Aug 7, 2024
81a6cc6
Remove unused rateIntegrator type
quentinmit Aug 7, 2024
0d9fb28
Update cumulativeTracker test
quentinmit Aug 7, 2024
3a6e6af
Remove cumulativeTracker implementation
quentinmit Aug 7, 2024
36fbe39
Ensure a client connection is open when trying to pause/resume profiling
quentinmit Aug 7, 2024
131d921
Remove more unused code
quentinmit Aug 7, 2024
43ac034
Address review comments
quentinmit Aug 7, 2024
0243a5e
goimports
quentinmit Aug 7, 2024
6d7147c
Review comment
quentinmit Aug 7, 2024
0cfaee1
Add comment on how integration works
quentinmit Aug 7, 2024
e84a661
Calculate buffer duration since we no longer get a full buffer, and e…
quentinmit Aug 8, 2024
5513b42
goimports
quentinmit Aug 8, 2024
48fc30b
Explicitly test scrape collection interval
quentinmit Aug 9, 2024
82acc2d
Initialize the dcgm library in testprofilepause
quentinmit Aug 10, 2024
1a0c33c
Reuse client constructor's device group
quentinmit Aug 10, 2024
6cb898d
Merge branch 'igorpeshansky-dcgm-new-metrics' into quentin-dcgm-new-m…
igorpeshansky Aug 14, 2024
ff45911
getSupportedRegularFields is unnecessary now that we expect blank values
quentinmit Aug 27, 2024
f249d19
Violation timers are in ns, not µs
quentinmit Aug 27, 2024
26a7362
Suppress blank value warnings
quentinmit Aug 28, 2024
395c2f7
Sort field lists before checking goldens
quentinmit Sep 3, 2024
5c3bdd5
Generate golden metric lists from actual scraping, to detect metrics …
quentinmit Sep 4, 2024
76ef57f
Use a different error message and rate limit specifically for 'field …
quentinmit Sep 9, 2024
79e19d6
Add missing continue
quentinmit Sep 10, 2024
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
20 changes: 11 additions & 9 deletions receiver/dcgmreceiver/client.go
Original file line number Diff line number Diff line change
Expand Up @@ -57,7 +57,7 @@ type dcgmClient struct {
devices map[uint]deviceMetrics
lastSuccessfulPoll time.Time

deviceMetricToFailedQueryCount map[string]uint64
deviceMetricToFailedQueryCount map[string]int
pollingInterval time.Duration
retryBlankValues bool
maxRetries int
Expand Down Expand Up @@ -113,7 +113,7 @@ func newClient(settings *dcgmClientSettings, logger *zap.Logger) (*dcgmClient, e
deviceGroup: deviceGroup,
devices: map[uint]deviceMetrics{},
lastSuccessfulPoll: time.Now(),
deviceMetricToFailedQueryCount: make(map[string]uint64),
deviceMetricToFailedQueryCount: make(map[string]int),
quentinmit marked this conversation as resolved.
Show resolved Hide resolved
pollingInterval: settings.pollingInterval,
retryBlankValues: settings.retryBlankValues,
maxRetries: settings.maxRetries,
Expand Down Expand Up @@ -303,8 +303,8 @@ func (client *dcgmClient) collect() (time.Duration, error) {
}
fieldValues, pollTime, err := dcgmGetValuesSince(client.deviceGroup, client.enabledFieldGroup, client.lastSuccessfulPoll)
if err != nil {
msg := fmt.Sprintf("Unable to poll DCGM daemon for on %s", err)
client.issueWarningForFailedQueryUptoThreshold("all-profiling-metrics", msg)
msg := fmt.Sprintf("Unable to poll DCGM daemon for metrics: %s", err)
client.issueWarningForFailedQueryUptoThreshold("all-profiling-metrics", maxWarningsForFailedDeviceMetricQuery, msg)
return 0, err
}
client.logger.Debugf("Got %d field values over %s", len(fieldValues), pollTime.Sub(client.lastSuccessfulPoll))
Expand All @@ -328,9 +328,11 @@ func (client *dcgmClient) collect() (time.Duration, error) {
if err := isValidValue(fieldValue); err == errBlankValue {
// Blank values are expected at startup.
continue
} else if err == errNotSupported {
client.issueWarningForFailedQueryUptoThreshold(dcgmName, 1, fmt.Sprintf("Field '%s' is not supported. Metric '%s' will not be collected", dcgmName, dcgmName))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Metric '%s' will not be collected part is redundant (in the profiling metrics as well). But I can fix that after we merge this PR…

} else if err != nil {
msg := fmt.Sprintf("Received invalid value (ts %d gpu %d) %s: %v", fieldValue.Ts, gpuIndex, dcgmName, err)
client.issueWarningForFailedQueryUptoThreshold(fmt.Sprintf("device%d.%s", gpuIndex, dcgmName), msg)
client.issueWarningForFailedQueryUptoThreshold(fmt.Sprintf("device%d.%s", gpuIndex, dcgmName), maxWarningsForFailedDeviceMetricQuery, msg)
continue
}
if fieldValue.Ts < oldestTs {
Expand Down Expand Up @@ -367,13 +369,13 @@ func (client *dcgmClient) getDeviceMetrics() map[uint]deviceMetrics {
return out
}

func (client *dcgmClient) issueWarningForFailedQueryUptoThreshold(dcgmName string, reason string) {
func (client *dcgmClient) issueWarningForFailedQueryUptoThreshold(dcgmName string, limit int, reason string) {
client.deviceMetricToFailedQueryCount[dcgmName]++

failedCount := client.deviceMetricToFailedQueryCount[dcgmName]
if failedCount <= maxWarningsForFailedDeviceMetricQuery {
client.logger.Warnf("Unable to query '%s' on '%s'", dcgmName, reason)
if failedCount == maxWarningsForFailedDeviceMetricQuery {
if failedCount <= limit {
client.logger.Warnf("%s", reason)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

logger.Warnf("%s", reason) is the same as logger.Warn(reason) (https://pkg.go.dev/go.uber.org/zap#SugaredLogger.Warn). But I can fix that after we merge this PR…

if limit > 1 && failedCount == limit {
client.logger.Warnf("Surpressing further device query warnings for '%s'", dcgmName)
}
}
Expand Down