Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Serialize all models into cluster metadata #1499
Serialize all models into cluster metadata #1499
Changes from 14 commits
64a5e98
d0d82f0
ec35d3d
47e8647
a0cb2be
aacc030
a205ad2
c58ef98
775dfd2
c8dcfb3
9eb9c45
df9f3d8
59c1456
0299567
3508c79
231452e
File filter
Filter by extension
Conversations
Jump to
There are no files selected for viewing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Want to discuss a little bit here some thoughts:
So before, the logic was that the model index is the only source of truth until the model is actually created. Then the model-metadata is a source of truth as well.
Now, we are saying that the model metadata will always lag behind the model system index (i.e. on any call to change model state, we will first persist in system index and then update in cluster state). This is going to open up to the possibility that model index and cluster state fall out of sync on failure. We may need to think about how we handle different scenarios.
I think a model's state can be updated in one of the following ways:
Are there other cases you can think of that might be of concern?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jmazanec15 The goal of this was to treat the cluster metadata as the main source of truth in case of node drops I believe. This would decrease the chance of nodes being out of sync with the cluster since they could always access the model metadata from the cluster metadata.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wrote a temporary integration test trying to create a race condition between a model finishing training and entering the created state in ModelDao and then deleting the model immediately. All behavior was as expected and the model was deleted from the cluster metadata as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ryanbogan can you push to another branch and link to the test? I want to take a quick look.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jmazanec15 I already deleted it off the local, do you want me to recreate it?