-
Notifications
You must be signed in to change notification settings - Fork 2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Atomic Snapshotting / Sticky Volume Migration #3563
Conversation
e2972ce
to
e7191e2
Compare
Typeflag: tar.TypeReg, | ||
} | ||
err = tw.WriteHeader(&fooHdr) | ||
if err != nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use assert library.
// Now write to its shared dir | ||
allocDirI, err := s.client.GetAllocFS(alloc.ID) | ||
if err != nil { | ||
t.Fatalf("unable to find alloc dir: %v", err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use assert library.
command/agent/alloc_endpoint_test.go
Outdated
if err := allocDir.Snapshot(ioutil.Discard); err == nil { | ||
t.Errorf("expected Snapshot() to fail but it did not") | ||
} else { | ||
s.logger.Printf("[DEBUG] agent.test: snapshot returned error: %v", err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Isn't err nil in the else case?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I flipped the normal if/else order and handled the nil case first. Will fip to the usual order.
t.Parallel() | ||
dest, err := ioutil.TempDir("", "nomadtest-") | ||
if err != nil { | ||
t.Fatalf("err: %v", err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use assert test package- same for all other tests.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With that said, the tests look great!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I actually prefer this style as assert
uses t.Error
not t.Fatal
so your test continues and usually panics due to the non-error value being nil.
...but maybe I'm the only one who gets confused by this? 😕
@@ -841,6 +841,14 @@ func (r *AllocRunner) Run() { | |||
|
|||
// Soft-fail on migration errors | |||
r.logger.Printf("[WARN] client: alloc %q error while migrating data from previous alloc: %v", r.allocID, err) | |||
|
|||
// Recreate alloc dir to ensure a clean slate | |||
r.allocDir.Destroy() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can allocDir ever be nil?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, allocDir is initialized in NewAllocRunner and RestoreState includes (hopefully sufficient!) checks to prevent restoring a nil alloc dir. (Good question as I believe this did cause issues in the past!)
@@ -463,6 +463,9 @@ func (p *remotePrevAlloc) streamAllocDir(ctx context.Context, resp io.ReadCloser | |||
} | |||
} | |||
|
|||
// if we see this file, there was an error on the remote side | |||
errorFilename := allocdir.SnapshotErrorFilename(p.prevAllocID) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I might be missing something obvious, but I can't find where allocdir is initialized. Can it ever be nil?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
allocdir
is sneakily referring to a package here... kind of unfortunate I use allocDir
a lot as a variable and allocdir
a lot as a package... 😬
Typeflag: tar.TypeReg, | ||
} | ||
|
||
if err := tw.WriteHeader(&hdr); err != nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can tw ever be nil?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, it's always initialized in the single caller to writeError.
// writeError writes a special file to a tar archive with the error encountered | ||
// during snapshotting. See Snapshot(). | ||
func writeError(tw *tar.Writer, allocID string, err error) error { | ||
contents := []byte(fmt.Sprintf("Error snapshotting: %v", err)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this need to be at the beginning of the function? Maybe put it closer to where it is actually used.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Need to put its length into the Header, so it has to stay here.
client/allocdir/alloc_dir.go
Outdated
@@ -242,6 +265,7 @@ func (d *AllocDir) Destroy() error { | |||
mErr.Errors = append(mErr.Errors, fmt.Errorf("failed to remove alloc dir %q: %v", d.AllocDir, err)) | |||
} | |||
|
|||
d.built = false |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe an explanatory comment for why this flag is changed?
If an alloc dir is being GC'd (removed) during snapshotting the walk func will be passed an error. Previously we didn't check for an error so a panic would occur when we'd try to use a nil `fileInfo`.
Test that snapshot errors don't return a valid tar currently fails.
as per PR comments
a0ebf56
to
1ffb189
Compare
@@ -41,7 +41,9 @@ job "docs" { | |||
- `migrate` `(bool: false)` - When `sticky` is true, this specifies that the | |||
Nomad client should make a best-effort attempt to migrate the data from a | |||
remote machine if placement cannot be made on the original node. During data | |||
migration, the task will block starting until the data migration has completed. | |||
migration, the task will block starting until the data migration has |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Document the API as well
@@ -24,6 +24,10 @@ const ( | |||
) | |||
|
|||
var ( | |||
// SnapshotErrorTime is the sentinel time that will be used on the | |||
// error file written by Snapshot when it encounters as error. | |||
SnapshotErrorTime = time.Date(2000, 0, 0, 0, 0, 0, 0, time.UTC) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably should have a IsSnapshotErrorFile
method that checks this as well as the file name. Right now this is unused on the validation path
@@ -841,6 +841,14 @@ func (r *AllocRunner) Run() { | |||
|
|||
// Soft-fail on migration errors | |||
r.logger.Printf("[WARN] client: alloc %q error while migrating data from previous alloc: %v", r.allocID, err) | |||
|
|||
// Recreate alloc dir to ensure a clean slate | |||
r.allocDir.Destroy() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Check for error here. Probably can't continue if the destroy fails as well!
I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions. |
Includes #3539
Makes sticky migrations atomic: either all data is copied or none is. Previously partial data could be copied.