vtbackup, mysqlctl: detailed backup and restore metrics #11979
go/cmd/vtbackup/vtbackup.go
Outdated
@@ -120,6 +121,11 @@ var (
	detachedMode bool
	keepAliveTimeout = 0 * time.Second
	disableRedoLog = false
	durationByPhase = stats.NewGaugesWithSingleLabel(
		"duration_seconds",
		"How long it took vtbackup to perform a each phase of operation (in seconds).",
Because vtbackup also picks up stats from mysqlctl, there will be some overlap between this metric and those metrics. I think that's OK, personally, but if we want I can take more care to avoid any overlap.
@@ -92,9 +92,6 @@ var (
	// backupCompressBlocks is the number of blocks that are processed
	// once before the writer blocks
	backupCompressBlocks = 2

	backupDuration = stats.NewGauge("backup_duration_seconds", "How long it took to complete the last backup operation (in seconds)")
Not deleted, just moved to the backupstats package.
// Take the backup, and either AbortBackup or EndBackup.
usable, err := be.ExecuteBackup(ctx, params, bh)
beParams := params.Copy()
This is a bit awkward, and copying structs makes me uncomfortable. Is there a way to lint for exhaustive field copying?
Just thinking out loud here: could we have two stats, enginestats and storagestats, in BackupParams, so you don't end up copying?
I think if we took that approach we might end up needing three stats: one for the engine, one for the storage, and one for the controlling function (mysqlctl.Backup).
Overall I think I am OK with the copy, and one thing I didn't know about Golang when I wrote the comment above is that this kind of struct creation is exhaustive:
a1 := A{
"hello",
true,
3,
}
It won't let you omit any struct fields. So I at least feel good about that. The only risk now is if someone swaps the order of two struct fields that have the same type 😬
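The point about unkeyed struct literals can be illustrated with a small sketch. The `Params` type and its field names below are hypothetical stand-ins, not the actual `BackupParams` definition; the sketch only shows how an unkeyed literal forces every field to be listed, while a keyed literal lets one slip through.

```go
package main

import "fmt"

// Params is a hypothetical stand-in for a parameter struct such as
// BackupParams; the field names are illustrative only.
type Params struct {
	Keyspace    string
	Shard       string
	Concurrency int
}

// Copy uses an unkeyed composite literal. If a field is later added to
// Params, this function stops compiling ("too few values in struct
// literal") until the new field is copied as well.
func (p Params) Copy() Params {
	return Params{
		p.Keyspace,
		p.Shard,
		p.Concurrency,
	}
}

// keyedCopy still compiles when a field is forgotten; the omitted field
// silently takes its zero value.
func (p Params) keyedCopy() Params {
	return Params{Keyspace: p.Keyspace, Shard: p.Shard}
}

func main() {
	p := Params{"commerce", "0", 4}
	fmt.Println(p.Copy() == p)      // true
	fmt.Println(p.keyedCopy() == p) // false: Concurrency was dropped
}
```

The remaining risk mentioned above still applies: `Keyspace` and `Shard` are both strings, so swapping their declaration order would compile and quietly swap the copied values.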
@@ -100,7 +100,7 @@ func TestExecuteBackup(t *testing.T) {
	oldDeadline := setBuiltinBackupMysqldDeadline(time.Second)
	defer setBuiltinBackupMysqldDeadline(oldDeadline)

	bh := filebackupstorage.FileBackupHandle{}
	bh := filebackupstorage.NewBackupHandle(nil, "", "", false)
Not strictly necessary, but makes it a bit easier to avoid nil dereferences.
@@ -191,6 +192,7 @@ func (tm *TabletManager) restoreDataLocked(ctx context.Context, logger logutil.L
	Shard: tablet.Shard,
	StartTime: logutil.ProtoToTime(request.BackupTime),
	DryRun: request.DryRun,
	Stats: backupstats.RestoreStats(),
If Stats is not set then mysqlctl will use backupstats.NopStats. This way any out-of-tree code can opt in to the new stats, or not (and not have to worry about nil dereference errors).
Signed-off-by: Max Englander <[email protected]>
go/ioutil/meter.go
Outdated
	duration time.Duration
}

// Bytes reports the total bytes read in calls to f so far.
f ?
Hm, this refers to the argument f in func measure, but it's not very helpful here. Let me reword this.
Improve code comments in ioutil/meter
@rsajwani thanks for all the helpful suggestions. I think I addressed all of your feedback, ready for another look!
LGTM. Thanks Max. This is awesome work.
Description
Addresses #11977.
I would like to have better instrumentation on backups, in particular backups generated by vtbackup. While backup stats are exposed via servenv since #11388 (which is great!), ideally I would like more fine-grained stats on:
Design
There were a couple of design goals that shaped the way the code is written.
No breaking interface changes
After discussion with Deepthi we decided to introduce a breaking change after all.
A quick Google/GitHub search didn't reveal any out-of-tree backupstorage plugins, but the way the backup/restore APIs are laid out makes it seem like they were designed to support out-of-tree plugins.
I tried to keep that in mind when writing this PR, in particular by not making any changes that would require anyone using out-of-tree plugins to make code changes when they upgrade to a Vitess version with these changes.
Separate policy and mechanism
It wouldn't be great if every out-of-tree backupstorage generated metrics in ways that conflicted with each other or varied widely from the way Vitess users are used to consuming in-tree metrics.
In this PR, I tried to create a minimal stats mechanism that can be used by in-tree and out-of-tree code, but where the policies for stats (metric names and labels, stats sink, etc.) are kept in-tree and under the control of the Vitess user.
This approach seems similar in spirit to what is already being done with BackupParams.Logger and RestoreParams.Logger.
Changes
This PR adds several new metrics:
vtbackup_duration_by_phase_seconds with phase label
{vtbackup,vttablet}_backup_bytes with component, implementation, and operation labels
{vtbackup,vttablet}_backup_count with component, implementation, and operation labels
{vtbackup,vttablet}_backup_duration_nanoseconds with component, implementation, and operation labels
{vtbackup,vttablet}_restore_bytes with component, implementation, and operation labels
{vtbackup,vttablet}_restore_count with component, implementation, and operation labels
{vtbackup,vttablet}_restore_duration_nanoseconds with component, implementation, and operation labels
It also deprecates these older backup/restore metrics:
{vtbackup,vttablet}_backup_duration_seconds
{vtbackup,vttablet}_restore_duration_seconds
Notes
Changes to vtbackup:
duration_seconds metric which reports durations of additional phases not covered by mysqlctl.
initmysqld, initialbackup, restorelastbackup, catchupreplication, etc.
Changes to mysqlctl:
backup_bytes, backup_count, backup_duration_nanoseconds, restore_bytes, restore_count, restore_duration_nanoseconds metrics.
component, implementation, and operation.
- = unscoped, top-level, backupstorage, backupengine, etc.) across different implementations (s3, file, etc.), and across different operations (backup, restore, read, compress, encrypt).
Other notes:
Why nanoseconds on those two new metrics? Because we're reporting on read/write times for individual files, if all the files are small and take less than a second to process then they end up reporting as zero.
Why backupengine.(Parameterizable)? I wasn't sure how safe it would be to break any of the APIs like BackupEngine and BackupStorage. Figured it was better to introduce changes this way until I get some guidance.
Samples
Sample metrics generated by running vtbackup from this branch against the local example cluster. Processed with jq and sorted for readability.
Performance
At @deepthi's suggestion I compared performance of backups on main versus this branch.
I set up the commerce example cluster, and created a table with ~20 GiB of data, then ran:
Multiple times with this branch and main, comparing the values of vttablet_backup_duration_seconds.
main branch (7fc1b48)
Assuming the differences aren't due to vagaries of CPU and disk usage on my Mac M1, this branch adds a roughly 3.8% performance overhead.
Out-of-scope
This PR doesn't add new metrics to all backup engines or storage engines. Would like to get buy-in on this approach (or a different one) first, and then expand whatever approach we adopt in follow-on PRs to cover additional backup engines & storages.