Serialize all models into cluster metadata #1499
@@ -109,10 +109,9 @@ protected void updateModelsNewCluster() throws IOException, InterruptedException
         if (modelDao.isCreated()) {
             List<String> modelIds = searchModelIds();
             for (String modelId : modelIds) {
-                Model model = modelDao.get(modelId);
-                ModelMetadata modelMetadata = model.getModelMetadata();
+                ModelMetadata modelMetadata = modelDao.getMetadata(modelId);

Is this going to be backwards compatible?

I should probably add a version check, since they won't be in the metadata on older clusters if it's in created state.

Actually, I'm not sure if there is a way to make it backwards compatible, since the old models wouldn't be in the cluster metadata. I think I have to revert to the get call.

                 if (modelMetadata.getState().equals(ModelState.TRAINING)) {
-                    updateModelStateAsFailed(model, "Training failed to complete as cluster crashed");
+                    updateModelStateAsFailed(modelId, modelMetadata, "Training failed to complete as cluster crashed");
                 }
             }
         }
|
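The backwards-compatibility concern raised on `modelDao.getMetadata(modelId)` could, as one option, be handled by reading from cluster metadata first and falling back to the index lookup when the entry is missing. The following is only a minimal sketch of that idea: `MetadataLookupSketch`, its `State` enum, and its methods are hypothetical names, and plain in-memory maps stand in for the cluster metadata and the model index. It does not use or describe the plugin's actual `ModelDao` API.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical sketch: in-memory maps stand in for cluster metadata and the
// model index. None of these names come from the plugin.
final class MetadataLookupSketch {
    enum State { TRAINING, CREATED, FAILED }

    private final Map<String, State> clusterMetadata = new HashMap<>();
    private final Map<String, State> modelIndex = new HashMap<>();

    // Simulates a model document stored in the model index.
    void indexModel(String modelId, State state) {
        modelIndex.put(modelId, state);
    }

    // Simulates a model entry serialized into cluster metadata.
    void publishMetadata(String modelId, State state) {
        clusterMetadata.put(modelId, state);
    }

    // Prefer cluster metadata; fall back to the index for models that were
    // never written to cluster metadata (for example, on an older cluster).
    Optional<State> lookupState(String modelId) {
        State fromCluster = clusterMetadata.get(modelId);
        if (fromCluster != null) {
            return Optional.of(fromCluster);
        }
        return Optional.ofNullable(modelIndex.get(modelId));
    }
}
```

The fallback keeps the cheaper metadata read on the common path, at the cost of a second lookup for models that predate the change.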
@@ -123,11 +122,10 @@ protected void updateModelsNodesRemoved(List<DiscoveryNode> removedNodes) throws
         List<String> modelIds = searchModelIds();
         for (DiscoveryNode removedNode : removedNodes) {
             for (String modelId : modelIds) {
-                Model model = modelDao.get(modelId);
-                ModelMetadata modelMetadata = model.getModelMetadata();
+                ModelMetadata modelMetadata = modelDao.getMetadata(modelId);
                 if (modelMetadata.getNodeAssignment().equals(removedNode.getEphemeralId())
                     && modelMetadata.getState().equals(ModelState.TRAINING)) {
-                    updateModelStateAsFailed(model, "Training failed to complete as node dropped");
+                    updateModelStateAsFailed(modelId, modelMetadata, "Training failed to complete as node dropped");
                 }
             }
         }
|
@@ -158,9 +156,11 @@ public void onFailure(Exception e) {
         return modelIds;
     }

-    private void updateModelStateAsFailed(Model model, String msg) throws IOException {
-        model.getModelMetadata().setState(ModelState.FAILED);
-        model.getModelMetadata().setError(msg);
+    private void updateModelStateAsFailed(String modelId, ModelMetadata modelMetadata, String msg) throws IOException, ExecutionException,
+        InterruptedException {
+        modelMetadata.setState(ModelState.FAILED);
+        modelMetadata.setError(msg);
+        Model model = new Model(modelMetadata, null, modelId);
         modelDao.update(model, new ActionListener<IndexResponse>() {
             @Override
             public void onResponse(IndexResponse indexResponse) {

I want to discuss a few thoughts here:
Before, the logic was that the model index was the only source of truth until the model was actually created; after that, the model metadata was a source of truth as well.
Now, we are saying that the model metadata will always lag behind the model system index (i.e., on any call to change model state, we first persist to the system index and then update the cluster state). This opens up the possibility that the model index and the cluster state fall out of sync on failure. We may need to think about how we handle the different scenarios.
I think a model's state can be updated in one of the following ways:
Are there other cases you can think of that might be of concern?
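To make the out-of-sync scenario concrete, here is a minimal sketch of the two-step write described above: persist to the index first, then mirror the change into cluster state, with a simulated failure between the two steps. Everything in it is hypothetical: `TwoStepUpdateSketch`, its methods, and the `failSecondStep` flag are illustrative names, and in-memory maps stand in for the model system index and the cluster state. The `reconcile` step assumes the index is authoritative, which is an assumption of this sketch rather than a decision made in this PR.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch: two in-memory maps stand in for the model system index
// and the cluster-state metadata. None of these names come from the plugin.
final class TwoStepUpdateSketch {
    private final Map<String, String> modelIndex = new HashMap<>();
    private final Map<String, String> clusterState = new HashMap<>();

    // Step 1 persists to the index; step 2 mirrors the state into cluster state.
    // failSecondStep simulates a failure between the two writes.
    void updateState(String modelId, String state, boolean failSecondStep) {
        modelIndex.put(modelId, state);
        if (failSecondStep) {
            throw new IllegalStateException("cluster state update failed for " + modelId);
        }
        clusterState.put(modelId, state);
    }

    // Returns the ids whose index state differs from their cluster-state entry.
    Set<String> findOutOfSync() {
        Set<String> diverged = new TreeSet<>();
        for (Map.Entry<String, String> entry : modelIndex.entrySet()) {
            if (!entry.getValue().equals(clusterState.get(entry.getKey()))) {
                diverged.add(entry.getKey());
            }
        }
        return diverged;
    }

    // Repairs divergence by copying the index state into cluster state
    // (assumes the index is authoritative; that is an assumption of this sketch).
    void reconcile() {
        for (String modelId : findOutOfSync()) {
            clusterState.put(modelId, modelIndex.get(modelId));
        }
    }
}
```

A failed second step leaves the model listed by `findOutOfSync()` until something like `reconcile()` runs, which is the window the comment above is asking about.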
@jmazanec15 I believe the goal of this was to treat the cluster metadata as the main source of truth when nodes drop. This would reduce the chance of nodes being out of sync with the cluster, since they could always read the model metadata from the cluster metadata.
I wrote a temporary integration test that tries to create a race condition: a model finishes training and enters the created state in ModelDao, and the model is then deleted immediately. All behavior was as expected, and the model was deleted from the cluster metadata as well.
@ryanbogan can you push to another branch and link to the test? I want to take a quick look.
@jmazanec15 I already deleted it locally. Do you want me to recreate it?