Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Atomic Snapshotting / Sticky Volume Migration #3563

Merged
merged 5 commits into from
Dec 5, 2017
Merged

Conversation

schmichael
Copy link
Member

@schmichael schmichael commented Nov 17, 2017

Includes #3539

Makes sticky migrations atomic: either all data is copied or none is. Previously partial data could be copied.

@schmichael schmichael changed the title [WIP] Atomic Snapshotting Atomic Snapshotting / Sticky Volume Migration Nov 17, 2017
Typeflag: tar.TypeReg,
}
err = tw.WriteHeader(&fooHdr)
if err != nil {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use assert library.

// Now write to its shared dir
allocDirI, err := s.client.GetAllocFS(alloc.ID)
if err != nil {
t.Fatalf("unable to find alloc dir: %v", err)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use assert library.

if err := allocDir.Snapshot(ioutil.Discard); err == nil {
t.Errorf("expected Snapshot() to fail but it did not")
} else {
s.logger.Printf("[DEBUG] agent.test: snapshot returned error: %v", err)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't err nil in the else case?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I flipped the normal if/else order and handled the nil case first. Will fip to the usual order.

t.Parallel()
dest, err := ioutil.TempDir("", "nomadtest-")
if err != nil {
t.Fatalf("err: %v", err)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use assert test package- same for all other tests.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With that said, the tests look great!

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I actually prefer this style as assert uses t.Error not t.Fatal so your test continues and usually panics due to the non-error value being nil.

...but maybe I'm the only one who gets confused by this? 😕

@@ -841,6 +841,14 @@ func (r *AllocRunner) Run() {

// Soft-fail on migration errors
r.logger.Printf("[WARN] client: alloc %q error while migrating data from previous alloc: %v", r.allocID, err)

// Recreate alloc dir to ensure a clean slate
r.allocDir.Destroy()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can allocDir ever be nil?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, allocDir is initialized in NewAllocRunner and RestoreState includes (hopefully sufficient!) checks to prevent restoring a nil alloc dir. (Good question as I believe this did cause issues in the past!)

@@ -463,6 +463,9 @@ func (p *remotePrevAlloc) streamAllocDir(ctx context.Context, resp io.ReadCloser
}
}

// if we see this file, there was an error on the remote side
errorFilename := allocdir.SnapshotErrorFilename(p.prevAllocID)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I might be missing something obvious, but I can't find where allocdir is initialized. Can it ever be nil?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

allocdir is sneakily referring to a package here... kind of unfortunate I use allocDir a lot as a variable and allocdir a lot as a package... 😬

Typeflag: tar.TypeReg,
}

if err := tw.WriteHeader(&hdr); err != nil {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can tw ever be nil?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, it's always initialized in the single caller to writeError.

// writeError writes a special file to a tar archive with the error encountered
// during snapshotting. See Snapshot().
func writeError(tw *tar.Writer, allocID string, err error) error {
contents := []byte(fmt.Sprintf("Error snapshotting: %v", err))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this need to be at the beginning of the function? Maybe put it closer to where it is actually used.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Need to put its length into the Header, so it has to stay here.

@@ -242,6 +265,7 @@ func (d *AllocDir) Destroy() error {
mErr.Errors = append(mErr.Errors, fmt.Errorf("failed to remove alloc dir %q: %v", d.AllocDir, err))
}

d.built = false
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe an explanatory comment for why this flag is changed?

If an alloc dir is being GC'd (removed) during snapshotting the walk
func will be passed an error. Previously we didn't check for an error so
a panic would occur when we'd try to use a nil `fileInfo`.
Test that snapshot errors don't return a valid tar currently fails.
@@ -41,7 +41,9 @@ job "docs" {
- `migrate` `(bool: false)` - When `sticky` is true, this specifies that the
Nomad client should make a best-effort attempt to migrate the data from a
remote machine if placement cannot be made on the original node. During data
migration, the task will block starting until the data migration has completed.
migration, the task will block starting until the data migration has
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Document the API as well

@@ -24,6 +24,10 @@ const (
)

var (
// SnapshotErrorTime is the sentinel time that will be used on the
// error file written by Snapshot when it encounters as error.
SnapshotErrorTime = time.Date(2000, 0, 0, 0, 0, 0, 0, time.UTC)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably should have a IsSnapshotErrorFile method that checks this as well as the file name. Right now this is unused on the validation path

@@ -841,6 +841,14 @@ func (r *AllocRunner) Run() {

// Soft-fail on migration errors
r.logger.Printf("[WARN] client: alloc %q error while migrating data from previous alloc: %v", r.allocID, err)

// Recreate alloc dir to ensure a clean slate
r.allocDir.Destroy()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Check for error here. Probably can't continue if the destroy fails as well!

@schmichael schmichael merged commit b53f24b into master Dec 5, 2017
@schmichael schmichael deleted the b-snapshot-atomic branch December 5, 2017 17:16
@github-actions
Copy link

I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions.
If you have found a problem that seems related to this change, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Mar 17, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants