Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Client persist state using bolt-db and more efficient write patterns #2610

Merged
merged 14 commits into from
May 9, 2017

Conversation

dadgar
Copy link
Contributor

@dadgar dadgar commented May 3, 2017

This PR introduces bolt-db to the client and handles the upgrade path from the old state storage mechanism to the new one.

Further, write amplification has been drastically reduced.

The following is a comparison of this PR versus 0.5.6:
Very large job running 600 sleep tasks all as their own group:
Old:
screen shot 2017-05-03 at 11 22 46 am
Never actually finishes!

New:
screen shot 2017-05-03 at 11 22 54 am

Job with a sleep group with count 2000:
Old:
screen shot 2017-05-03 at 11 23 02 am
~7 minute runtime

New:
screen shot 2017-05-03 at 11 23 09 am
~3.5 minute runtime!

dadgar added 10 commits May 1, 2017 14:50
This PR removes deepcopying of the job attached to the allocation in the
alloc runner. This operation is called very often so removing reflect
from the code path and the potentially large number of mallocs need to
create a job reduced memory and cpu pressure.
This reverts commit 4d6a012.
@dadgar dadgar requested a review from schmichael May 8, 2017 20:35
Copy link
Member

@schmichael schmichael left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall looks really good. Super excited to have boltdb's API/semantics instead of our tree of flat files. I wish it was all wrapped up in a tidy interface, but as I say in a comment I understand how that moves the goalpost considerably and is out of scope for what this PR is looking to accomplish (more efficient persistence).

// doesn't change over the life-cycle of the alloc_runner.
type allocRunnerImmutableState struct {
Version string
Alloc *structs.Allocation
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

r.alloc is mutable.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have pulled the mutable parts of it into the allocRunnerMutableState. You can think of this one as the allocation received by the client initially

var snap allocRunnerState
if err := restoreState(r.stateFilePath(), &snap); err != nil {
if err := pre060RestoreState(oldPath, &snap); err == nil {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should check for the new state location first and only fallback if it doesn't exist to avoid races where we delete the old state files and crash before writing the new version.

if err := restoreState(r.stateFilePath(), &snap); err != nil {
if err := pre060RestoreState(oldPath, &snap); err == nil {
// Restore fields
r.logger.Printf("[DEBUG] client: restoring pre v0.6.0 alloc runner state for alloc %q", r.alloc.ID)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[nitpick] I wonder if we should elevate this to [INFO] since users should only see it once-per-alloc when upgrading and won't have an opportunity to bump their log level from INFO to DEBUG if they need this log line to debug something after-the-fact. (Although whether or not this code was hit should be apparent from other factors as well - so w/e)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

}

// #2132 Upgrade path: if snap.AllocDir is nil, try to convert old
// Context struct to new AllocDir struct
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops, this migration was added in the 0.5.x release cycle and therefore can be removed in 0.7. If you're in this code again mind adding // TODO Remove in 0.7 for me? Sorry!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure

}

// Write the immutable data
if !r.immutablePersisted {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will skip saving updated allocs.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See earlier comment. This is how we get a lot of the efficiency gain.

oldPath := r.pre060StateFilePath()
if err := pre060RestoreState(oldPath, &snap); err == nil {
// Delete the old state
os.RemoveAll(oldPath)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Persist new before removing old?

if err := codec.NewEncoder(&buf, structs.MsgpackHandle).Encode(&snap); err != nil {
return fmt.Errorf("failed to serialize snapshot: %v", err)
}
r.persistLock.Unlock()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Put back into defer as there's a race (albeit tiny) between this line and calling stateDB.Batch() that could reorder SaveState calls.


// Store the hash that was persisted
tx.OnCommit(func() {
r.persistedHash = h
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Need to acquire persistLock

client/util.go Outdated
@@ -78,15 +80,17 @@ func shuffleStrings(list []string) {

// persistState is used to help with saving state
func persistState(path string, data interface{}) error {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is unused and can be removed now right?

If it's still called from tests perhaps move to a _test.go file?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@@ -131,7 +141,7 @@ func TestConsul_Integration(t *testing.T) {
serviceClient.Run()
close(consulRan)
}()
tr := client.NewTaskRunner(logger, conf, logUpdate, taskDir, alloc, task, vclient, serviceClient)
tr := client.NewTaskRunner(logger, conf, db, logUpdate, taskDir, alloc, task, vclient, serviceClient)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This makes me wish our use of boltdb was wrapped in a typed state manager we could put behind an interface and mock for tests. There's no reason for these tests to actually exercise any of the state code.

It's fine it's that out of scope as I understand it'd require refactoring basically every change in this PR.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It isn't actually that bad. The state is just mapped to a temp file. Honestly until a solidified API is formed it isn't worth the effort.

@dadgar dadgar merged commit d11ca57 into master May 9, 2017
@dadgar dadgar deleted the f-bolt-db branch May 9, 2017 20:01
@github-actions
Copy link

github-actions bot commented Apr 1, 2023

I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions.
If you have found a problem that seems related to this change, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Apr 1, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants