-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
backupccl: add telemetry #46716
backupccl: add telemetry #46716
Conversation
Release note: none. Release justification: extremely minor change.
50634ed
to
674c190
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 2 of 2 files at r1, 4 of 4 files at r2.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @dt and @mwang1026)
pkg/ccl/backupccl/backup_job.go, line 580 at r2 (raw file):
telemetry.CountBucketed("backup.size-mb.inc", sizeMb) telemetry.CountBucketed("backup.speed-mbps.inc.total", mbps) telemetry.CountBucketed("backup.speed-mbps.full.per-node", mbps/int64(numClusterNodes))
Should this be backup.speed-mbps.inc.per-node
? (s/full/inc/)
pkg/ccl/backupccl/backup_job.go, line 608 at r2 (raw file):
func (b *backupResumer) OnFailOrCancel(ctx context.Context, phs interface{}) error { telemetry.Count("backup.total.failed") telemetry.CountBucketed("backup.duration-sec.failed",
Similarly to my previous comment, is it interesting to consider which types of backups are failing/being cancelled. It would also be nice if there was a way to differentiate between failing and cancelling...
pkg/ccl/backupccl/restore_job.go, line 1125 at r2 (raw file):
// change stuff to delete the keys in the background. func (r *restoreResumer) OnFailOrCancel(ctx context.Context, phs interface{}) error { telemetry.Count("restore.total.failed")
Would it be useful to track how many of the failed restores were full-cluster vs not? (Since w'ere tracking the percentage of restores that are full cluster restores.)
This adds counters to usage of various features of BACKUP and RESTORE: - if encryption is used - if full-cluster is used - if incremental is used, either explicitly or automatically - total, succeeded and failed counts as well as runtime, size and throughput These should help us measure how BACKUP and RESTORE are used and are behaving in the wild and where to direct our future efforts. Release note (enterprise change): BACKUP and RESTORE now collect some anonymous telemetry on throughput and feature usage. Release justification: low-risk, high impact.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @mwang1026 and @pbardea)
pkg/ccl/backupccl/backup_job.go, line 580 at r2 (raw file):
Previously, pbardea (Paul Bardea) wrote…
Should this be
backup.speed-mbps.inc.per-node
? (s/full/inc/)
Done.
pkg/ccl/backupccl/backup_job.go, line 608 at r2 (raw file):
Previously, pbardea (Paul Bardea) wrote…
Similarly to my previous comment, is it interesting to consider which types of backups are failing/being cancelled. It would also be nice if there was a way to differentiate between failing and cancelling...
Agreed, though I think those would be in addition to the top level count, so I'm inclined to proceed with this for now (in the interest of making 20.1 branch cut) and add that later.
pkg/ccl/backupccl/restore_job.go, line 1125 at r2 (raw file):
Previously, pbardea (Paul Bardea) wrote…
Would it be useful to track how many of the failed restores were full-cluster vs not? (Since w'ere tracking the percentage of restores that are full cluster restores.)
ditto above.
Also, to act on that info, we'd probably need more info about the actual failure too, so we might need to rely on bug reports here for now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
TFTR! bors r+ |
Build failed (retrying...) |
Build succeeded |
This adds counters to usage of various features of BACKUP and RESTORE:
These should help us measure how BACKUP and RESTORE are used and are behaving in the wild and where to direct our future efforts.
Fixes #46518
Fixes #46520
Fixes #46521
Release note (enterprise change): BACKUP and RESTORE now collect some anonymous telemetry on throughput and feature usage.
Release justification: low-risk, high impact.