Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

vtbackup, mysqlctl: detailed backup and restore metrics #11979

Merged
merged 23 commits into from
Feb 14, 2023

Conversation

maxenglander
Copy link
Collaborator

@maxenglander maxenglander commented Dec 16, 2022

Description

Addresses #11977.

I would like to have better instrumentation on backups, in particular backups generated by vtbackup. While backup stats are exposed via servenv since #11388 (which is great!) ideally I would like more fine-grained stats on:

  • Download last backup.
  • Decompress last backup.
  • Catchup to primary with replication.
  • Compress new backup.
  • Upload new backup.

Design

There were a couple design goals that shaped the way the code is written.

No breaking interface changes

After discussion with Deepthi we decided to introduce a breaking change after all.

A quick Google/GitHub search didn't reveal any out-of-tree backupstorage plugins, but the way the backup/restore APIs are laid out makes it seem like they were designed to support out-of-tree plugins.

I tried to keep that in mind when writing this PR, in particular by not making any changes that would require anyone using out-of-tree plugins to make code changes when they upgrade to a Vitess version with these changes.

Separate policy and mechanism

It wouldn't be great if every out-of-tree backupstorage generated metrics in ways that conflicted with each other or varied widely from way Vitess users are used to consuming in-tree metrics.

In this PR, I tried to create a minimal stats mechanism that can be used by in-tree and out-of-tree code, but where the policies for stats (metric names and labels, stats sink, etc.) are kept in-tree and under the control of the Vitess user.

This approach seems similar in spirit to what is already being done with BackupParams.Logger and RestoreParams.Logger.

Changes

This PR adds several new metrics:

  • vtbackup_duration_by_phase_seconds with phase label
  • {vtbackup,vttablet}_backup_bytes with component, implementation, and operation labels
  • {vtbackup,vttablet}_backup_count with component, implementation, and operation labels
  • {vtbackup,vttablet}_backup_duration_nanoseconds with component, implementation, and operation labels
  • {vtbackup,vttablet}_restore_bytes with component, implementation, and operation labels
  • {vtbackup,vttablet}_restore_count with component, implementation, and operation labels
  • {vtbackup,vttablet}_restore_duration_nanoseconds with component, implementation, and operation labels

It also deprecates these older backup/restore metrics:

  • {vtbackup,vttablet}_backup_duration_seconds
  • {vtbackup,vttablet}_restore_duration_seconds

Notes

Changes to vtbackup:

  • Add new stat duration_seconds metric which reports durations of additional phases not covered by mysqlctl.
  • Reports phases like initmysqld, initialbackup, restorelastbackup, catchupreplication, etc.

Changes to mysqlctl:

  • Add backup_bytes, backup_count, backup_duration_nanoseconds, restore_bytes, restore_count, restore_duration_nanoseconds metrics.
  • These metrics each have three labels: component, implementation, and operation.
  • The idea is that we can re-use these metric across different components (- = unscoped, top-level, backupstorage, backupengine, etc.) across different implementations (s3, file, etc.), and across different operations (backup, restore, read, compress, encrypt).

Other notes:

  • An alternative design would be to just have each engine/plugin declare its own stats, or have a set of global stats that each engine/plugin/package can import.
  • Why nanoseconds on those two new metrics? Because as we're reporting on read/write times for individual files, if all the files are small and take less than a second to process then they end up reporting as zero.
  • Why backupengine.(Parameterizable)? I wasn't sure how safe it would be to break any of the APIs like BackupEngine and BackupStorage. Figured it was better to introduce changes this way until I get some guidance.

Samples

Sample metrics generated by running vtbackup from this branch against the local example cluster. Processed with jq and sorted for readability.

{
  "BackupBytes": {
    "BackupEngine.Builtin.Source:Read": 4777,
    "BackupEngine.Builtin.Compressor:Write": 4616,
    "BackupEngine.Builtin.Destination:Write": 162,
    "BackupStorage.File.File:Write": 163
  },
  "BackupCount": {
    "-.-.Backup": 1,
    "BackupEngine.Builtin.Source:Open": 161,
    "BackupEngine.Builtin.Source:Close": 322,
    "BackupEngine.Builtin.Compressor:Close": 161,
    "BackupEngine.Builtin.Destination:Open": 161,
    "BackupEngine.Builtin.Destination:Close": 322
  },
  "BackupDurationNanoseconds": {
    "-.-.Backup": 4188508542,
    "BackupEngine.Builtin.Source:Open": 10649832,
    "BackupEngine.Builtin.Source:Read": 55901067,
    "BackupEngine.Builtin.Source:Close": 960826,
    "BackupEngine.Builtin.Compressor:Write": 278358826,
    "BackupEngine.Builtin.Compressor:Close": 79358372,
    "BackupEngine.Builtin.Destination:Open": 16456627,
    "BackupEngine.Builtin.Destination:Write": 11021043,
    "BackupEngine.Builtin.Destination:Close": 17144630,
    "BackupStorage.File.File:Write": 10743169
  },
  "DurationByPhaseSeconds": {
    "InitMySQLd": 2,
    "RestoreLastBackup": 6,
    "CatchUpReplication": 1,
    "TakeNewBackup": 4
  },
  "RestoreBytes": {
    "BackupEngine.Builtin.Source:Read": 1095,
    "BackupEngine.Builtin.Decompressor:Read": 950,
    "BackupEngine.Builtin.Destination:Write": 209,
    "BackupStorage.File.File:Read": 1113
  },
  "RestoreCount": {
    "-.-.Restore": 1,
    "BackupEngine.Builtin.Source:Open": 161,
    "BackupEngine.Builtin.Source:Close": 322,
    "BackupEngine.Builtin.Decompressor:Close": 161,
    "BackupEngine.Builtin.Destination:Open": 161,
    "BackupEngine.Builtin.Destination:Close": 322
  },
  "RestoreDurationNanoseconds": {
    "-.-.Restore": 6204765541,
    "BackupEngine.Builtin.Source:Open": 10542539,
    "BackupEngine.Builtin.Source:Read": 104658370,
    "BackupEngine.Builtin.Source:Close": 773038,
    "BackupEngine.Builtin.Decompressor:Read": 165692120,
    "BackupEngine.Builtin.Decompressor:Close": 51040,
    "BackupEngine.Builtin.Destination:Open": 22715122,
    "BackupEngine.Builtin.Destination:Write": 41679581,
    "BackupEngine.Builtin.Destination:Close": 26954624,
    "BackupStorage.File.File:Read": 102416075
  },
  "backup_duration_seconds": 4,
  "restore_duration_seconds": 6
}

Performance

At @deepthi suggestion I compared performance of backups on main versus this branch.

I set up the commerce example cluster, and created a table with ~20 GiB of data, then ran:

$ vtctlclient BackupShard -- --allow_primary --concurrency=1 commerce/0

Multiple times with this branch and main, comparing the values of vttablet_backup_duration_seconds.

# main branch (7fc1b48) this branch (d5364e6)
1 21 21
2 20 23
3 22 21
4 21 22
5 21 22

Assuming the differences aren't due to vagaries of CPU and disk usage on my Mac M1, this branch adds a roughly 3.8% performance overhead.

Out-of-scope

This PR doesn't add new metrics to all backup engines or storage engines. Would like to get buy-in on this approach (or a different one) first, and then expand whatever approach we adopt in follow-on PRs to cover additional backup engines & storages.

Checklist

  • Implement.
  • Write tests.
  • Docs

@vitess-bot
Copy link
Contributor

vitess-bot bot commented Dec 16, 2022

Review Checklist

Hello reviewers! 👋 Please follow this checklist when reviewing this Pull Request.

General

  • Ensure that the Pull Request has a descriptive title.
  • If this is a change that users need to know about, please apply the release notes (needs details) label so that merging is blocked unless the summary release notes document is included.
  • If a test is added or modified, there should be a documentation on top of the test to explain what the expected behavior is what the test does.

If a new flag is being introduced:

  • Is it really necessary to add this flag?
  • Flag names should be clear and intuitive (as far as possible)
  • Help text should be descriptive.
  • Flag names should use dashes (-) as word separators rather than underscores (_).

If a workflow is added or modified:

  • Each item in Jobs should be named in order to mark it as required.
  • If the workflow should be required, the maintainer team should be notified.

Bug fixes

  • There should be at least one unit or end-to-end test.
  • The Pull Request description should include a link to an issue that describes the bug.

Non-trivial changes

  • There should be some code comments as to why things are implemented the way they are.

New/Existing features

  • Should be documented, either by modifying the existing documentation or creating new documentation.
  • New features should have a link to a feature request issue or an RFC that documents the use cases, corner cases and test cases.

Backward compatibility

  • Protobuf changes should be wire-compatible.
  • Changes to _vt tables and RPCs need to be backward compatible.
  • vtctl command output order should be stable and awk-able.
  • RPC changes should be compatible with vitess-operator
  • If a flag is removed, then it should also be removed from VTop, if used there.

@maxenglander maxenglander added Type: Enhancement Logical improvement (somewhere between a bug and feature) Component: Backup and Restore labels Dec 16, 2022
@maxenglander maxenglander force-pushed the maxeng-details-backup-stats branch 4 times, most recently from e79b3f5 to 651a278 Compare December 17, 2022 02:14
@@ -120,6 +121,11 @@ var (
detachedMode bool
keepAliveTimeout = 0 * time.Second
disableRedoLog = false
durationByPhase = stats.NewGaugesWithSingleLabel(
"duration_seconds",
"How long it took vtbackup to perform a each phase of operation (in seconds).",
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because vtbackup also picks up stats from mysqlctl, there will be some overlap between this metric and those metrics. I think that's OK, personally, but if we want I can take more care to avoid any overlap.

go/stats/timings.go Outdated Show resolved Hide resolved
@@ -92,9 +92,6 @@ var (
// backupCompressBlocks is the number of blocks that are processed
// once before the writer blocks
backupCompressBlocks = 2

backupDuration = stats.NewGauge("backup_duration_seconds", "How long it took to complete the last backup operation (in seconds)")
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not deleted, just moved to backupstats package.

// Take the backup, and either AbortBackup or EndBackup.
usable, err := be.ExecuteBackup(ctx, params, bh)
beParams := params.Copy()
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a bit awkward, and copying structs makes me uncomfortable. Is there a way to lint for exhaustive field copying?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just thinking out loud here.. can we have tow stats , enginestats and storagestats in backupparam.. so you won't end up copy

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think if we did that approach we might end up needing three stats, one for the engine, one for the storage, and one for the controlling function (mysqlctl.Backup).

Overall I think I am OK with the copy, and one thing I didn't know about Golang when I wrote the comment above is that this kind of struct creation is exhaustive:

a1 := A{
	"hello",
	true,
	3,
}

It won't let you omit any struct fields. So I at least feel good about that. The only risk now is if someone swaps the order of two struct fields that have the same type 😬

@@ -100,7 +100,7 @@ func TestExecuteBackup(t *testing.T) {
oldDeadline := setBuiltinBackupMysqldDeadline(time.Second)
defer setBuiltinBackupMysqldDeadline(oldDeadline)

bh := filebackupstorage.FileBackupHandle{}
bh := filebackupstorage.NewBackupHandle(nil, "", "", false)
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not strictly necessary, but makes a bit easier to avoid nil dereferences.

@@ -191,6 +192,7 @@ func (tm *TabletManager) restoreDataLocked(ctx context.Context, logger logutil.L
Shard: tablet.Shard,
StartTime: logutil.ProtoToTime(request.BackupTime),
DryRun: request.DryRun,
Stats: backupstats.RestoreStats(),
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If Stats is not set then mysqlctl will use backupstats.NopStats. This way any out-of-tree code can opt in to the new stats, or not (and not have to worry about nil deference errors).

Signed-off-by: Max Englander <[email protected]>
duration time.Duration
}

// Bytes reports the total bytes read in calls to f so far.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

f ?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm this refers to the argument f in func measure, but it's not very helpful here. Let me reword this.

Copy link
Collaborator Author

@maxenglander maxenglander left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Improve code comments in ioutil/meter

go/ioutil/meter.go Outdated Show resolved Hide resolved
go/ioutil/meter.go Outdated Show resolved Hide resolved
go/ioutil/meter.go Outdated Show resolved Hide resolved
Signed-off-by: Max Englander <[email protected]>
Signed-off-by: Max Englander <[email protected]>
Signed-off-by: Max Englander <[email protected]>
Signed-off-by: Max Englander <[email protected]>
Copy link
Collaborator Author

@maxenglander maxenglander left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rsajwani thanks for all the helpful suggestions. I think I addressed all of your feedback, ready for another look!

doc/releasenotes/17_0_0_summary.md Show resolved Hide resolved
Copy link
Contributor

@rsajwani rsajwani left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thanks Max. This is awesome work.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Component: Backup and Restore Type: Enhancement Logical improvement (somewhere between a bug and feature)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants