-
Notifications
You must be signed in to change notification settings - Fork 2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Client persist state using bolt-db and more efficient write patterns #2610
Conversation
This PR removes deepcopying of the job attached to the allocation in the alloc runner. This operation is called very often so removing reflect from the code path and the potentially large number of mallocs need to create a job reduced memory and cpu pressure.
This reverts commit 4d6a012.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall looks really good. Super excited to have boltdb's API/semantics instead of our tree of flat files. I wish it was all wrapped up in a tidy interface, but as I say in a comment I understand how that moves the goalpost considerably and is out of scope for what this PR is looking to accomplish (more efficient persistence).
// doesn't change over the life-cycle of the alloc_runner. | ||
type allocRunnerImmutableState struct { | ||
Version string | ||
Alloc *structs.Allocation |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
r.alloc is mutable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have pulled the mutable parts of it into the allocRunnerMutableState
. You can think of this one as the allocation received by the client initially
client/alloc_runner.go
Outdated
var snap allocRunnerState | ||
if err := restoreState(r.stateFilePath(), &snap); err != nil { | ||
if err := pre060RestoreState(oldPath, &snap); err == nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should check for the new state location first and only fallback if it doesn't exist to avoid races where we delete the old state files and crash before writing the new version.
client/alloc_runner.go
Outdated
if err := restoreState(r.stateFilePath(), &snap); err != nil { | ||
if err := pre060RestoreState(oldPath, &snap); err == nil { | ||
// Restore fields | ||
r.logger.Printf("[DEBUG] client: restoring pre v0.6.0 alloc runner state for alloc %q", r.alloc.ID) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[nitpick] I wonder if we should elevate this to [INFO]
since users should only see it once-per-alloc when upgrading and won't have an opportunity to bump their log level from INFO to DEBUG if they need this log line to debug something after-the-fact. (Although whether or not this code was hit should be apparent from other factors as well - so w/e)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
client/alloc_runner.go
Outdated
} | ||
|
||
// #2132 Upgrade path: if snap.AllocDir is nil, try to convert old | ||
// Context struct to new AllocDir struct |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oops, this migration was added in the 0.5.x release cycle and therefore can be removed in 0.7. If you're in this code again mind adding // TODO Remove in 0.7
for me? Sorry!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure
} | ||
|
||
// Write the immutable data | ||
if !r.immutablePersisted { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will skip saving updated allocs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See earlier comment. This is how we get a lot of the efficiency gain.
oldPath := r.pre060StateFilePath() | ||
if err := pre060RestoreState(oldPath, &snap); err == nil { | ||
// Delete the old state | ||
os.RemoveAll(oldPath) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Persist new before removing old?
client/task_runner.go
Outdated
if err := codec.NewEncoder(&buf, structs.MsgpackHandle).Encode(&snap); err != nil { | ||
return fmt.Errorf("failed to serialize snapshot: %v", err) | ||
} | ||
r.persistLock.Unlock() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Put back into defer as there's a race (albeit tiny) between this line and calling stateDB.Batch()
that could reorder SaveState calls.
|
||
// Store the hash that was persisted | ||
tx.OnCommit(func() { | ||
r.persistedHash = h |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Need to acquire persistLock
client/util.go
Outdated
@@ -78,15 +80,17 @@ func shuffleStrings(list []string) { | |||
|
|||
// persistState is used to help with saving state | |||
func persistState(path string, data interface{}) error { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is unused and can be removed now right?
If it's still called from tests perhaps move to a _test.go
file?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
@@ -131,7 +141,7 @@ func TestConsul_Integration(t *testing.T) { | |||
serviceClient.Run() | |||
close(consulRan) | |||
}() | |||
tr := client.NewTaskRunner(logger, conf, logUpdate, taskDir, alloc, task, vclient, serviceClient) | |||
tr := client.NewTaskRunner(logger, conf, db, logUpdate, taskDir, alloc, task, vclient, serviceClient) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This makes me wish our use of boltdb was wrapped in a typed state manager we could put behind an interface and mock for tests. There's no reason for these tests to actually exercise any of the state code.
It's fine it's that out of scope as I understand it'd require refactoring basically every change in this PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It isn't actually that bad. The state is just mapped to a temp file. Honestly until a solidified API is formed it isn't worth the effort.
I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions. |
This PR introduces bolt-db to the client and handles the upgrade path from the old state storage mechanism to the new one.
Further, write amplification has been drastically reduced.
The following is a comparison of this PR versus 0.5.6:
Very large job running 600 sleep tasks all as their own group:
Old:
Never actually finishes!
New:
Job with a sleep group with count 2000:
Old:
~7 minute runtime
New:
~3.5 minute runtime!