-
Notifications
You must be signed in to change notification settings - Fork 126
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Serialize all models into cluster metadata #1499
Serialize all models into cluster metadata #1499
Conversation
…teListener Signed-off-by: Ryan Bogan <[email protected]>
Signed-off-by: Ryan Bogan <[email protected]>
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## main #1499 +/- ##
============================================
+ Coverage 84.92% 85.06% +0.13%
+ Complexity 1375 1280 -95
============================================
Files 172 168 -4
Lines 5605 5229 -376
Branches 553 494 -59
============================================
- Hits 4760 4448 -312
+ Misses 612 573 -39
+ Partials 233 208 -25 ☔ View full report in Codecov by Sentry. |
Signed-off-by: Ryan Bogan <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@@ -109,10 +109,9 @@ protected void updateModelsNewCluster() throws IOException, InterruptedException | |||
if (modelDao.isCreated()) { | |||
List<String> modelIds = searchModelIds(); | |||
for (String modelId : modelIds) { | |||
Model model = modelDao.get(modelId); | |||
ModelMetadata modelMetadata = model.getModelMetadata(); | |||
ModelMetadata modelMetadata = modelDao.getMetadata(modelId); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this going to be backwards compatible?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I should probably add a version check, since they won't be in the metadata on older clusters if it's in created state
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, I'm not sure if there is a way to make it backwards compatible, since the old models wouldn't be in the cluster metadata. I think I have to revert to the get call
} else { | ||
onIndexListener = onMetaListener; | ||
} | ||
ActionListener<IndexResponse> onIndexListener = getUpdateModelMetadataListener(model.getModelMetadata(), onMetaListener); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Want to discuss a little bit here some thoughts:
So before, the logic was that the model index is the only source of truth until the model is actually created. Then the model-metadata is a source of truth as well.
Now, we are saying that the model metadata will always lag behind the model system index (i.e. on any call to change model state, we will first persist in system index and then update in cluster state). This is going to open up to the possibility that model index and cluster state fall out of sync on failure. We may need to think about how we handle different scenarios.
I think a model's state can be updated in one of the following ways:
- Training process (this will be in sync - single thread, synchronous process)
- Model deleted (we block the model delete if the model is not in the created or error state - we should confirm that we check this from cluster metadata before blocking)
- Node drop (if a node drops, the offending node will know not to change anything in the state correct?)
Are there other cases you can think of that might be of concern?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jmazanec15 The goal of this was to treat the cluster metadata as the main source of truth in case of node drops I believe. This would decrease the chance of nodes being out of sync with the cluster since they could always access the model metadata from the cluster metadata.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wrote a temporary integration test trying to create a race condition between a model finishing training and entering the created state in ModelDao and then deleting the model immediately. All behavior was as expected and the model was deleted from the cluster metadata as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ryanbogan can you push to another branch and link to the test? I want to take a quick look.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jmazanec15 I already deleted it off the local, do you want me to recreate it?
@ryanbogan and check from here: k-NN/src/main/java/org/opensearch/knn/index/query/KNNWeight.java Lines 213 to 221 in 7a144b8
|
and here:
|
I'll add a check to get the state |
Signed-off-by: Ryan Bogan <[email protected]>
Signed-off-by: Ryan Bogan <[email protected]>
Signed-off-by: Ryan Bogan <[email protected]>
Signed-off-by: Ryan Bogan <[email protected]>
Signed-off-by: Ryan Bogan <[email protected]>
Signed-off-by: Ryan Bogan <[email protected]>
Windows failures are unrelated to this PR |
Signed-off-by: Ryan Bogan <[email protected]>
Signed-off-by: Ryan Bogan <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
a few minor comments. overall looks good
src/main/java/org/opensearch/knn/index/mapper/ModelFieldMapper.java
Outdated
Show resolved
Hide resolved
Signed-off-by: Ryan Bogan <[email protected]>
BWC failure unrelated to this change: #1622 |
src/main/java/org/opensearch/knn/training/TrainingJobClusterStateListener.java
Outdated
Show resolved
Hide resolved
Signed-off-by: Ryan Bogan <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM thanks!
if (modelMetadata == null) { | ||
throw new RuntimeException("Model \"" + modelId + "\" does not exist."); | ||
if (!ModelUtil.isModelCreated(modelMetadata)) { | ||
throw new RuntimeException("Model \"" + modelId + "\" is not created."); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: please use String.format for string concatenation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@martin-gaievski this comes up a couple times because we do not do it for a lot of existing code. Do you know if its possible to add something to spotless to automatically fix this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it should be possible, we did some custom template for spotless in neural-search, please check this PR opensearch-project/neural-search#515
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@martin-gaievski is there a comment for it or file? Im not seeing it in formatting
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah I can't see anything for string.format, maybe it's not possible. @vibrantvarun do you know by any chance if we can automate fix for String.format(locale, "")
?
* Remove transport calls in TrainingJobRunner and TrainingJobClusterStateListener Signed-off-by: Ryan Bogan <[email protected]> * Fix tests Signed-off-by: Ryan Bogan <[email protected]> * Add changelog Signed-off-by: Ryan Bogan <[email protected]> * Fix CMake Faiss bug Signed-off-by: Ryan Bogan <[email protected]> * Add state checks for existing cluster metadata calls Signed-off-by: Ryan Bogan <[email protected]> * Remove CMake bug fix Signed-off-by: Ryan Bogan <[email protected]> * Fix changelog Signed-off-by: Ryan Bogan <[email protected]> * Fix failing tests Signed-off-by: Ryan Bogan <[email protected]> * Refactor and add two more created state checks Signed-off-by: Ryan Bogan <[email protected]> * Rebase and fix new tests Signed-off-by: Ryan Bogan <[email protected]> * Refactor created checks and modify error messages Signed-off-by: Ryan Bogan <[email protected]> * Refactor cluster state listener transport calls Signed-off-by: Ryan Bogan <[email protected]> --------- Signed-off-by: Ryan Bogan <[email protected]> (cherry picked from commit dc8eb6b)
* Remove transport calls in TrainingJobRunner and TrainingJobClusterStateListener Signed-off-by: Ryan Bogan <[email protected]> * Fix tests Signed-off-by: Ryan Bogan <[email protected]> * Add changelog Signed-off-by: Ryan Bogan <[email protected]> * Fix CMake Faiss bug Signed-off-by: Ryan Bogan <[email protected]> * Add state checks for existing cluster metadata calls Signed-off-by: Ryan Bogan <[email protected]> * Remove CMake bug fix Signed-off-by: Ryan Bogan <[email protected]> * Fix changelog Signed-off-by: Ryan Bogan <[email protected]> * Fix failing tests Signed-off-by: Ryan Bogan <[email protected]> * Refactor and add two more created state checks Signed-off-by: Ryan Bogan <[email protected]> * Rebase and fix new tests Signed-off-by: Ryan Bogan <[email protected]> * Refactor created checks and modify error messages Signed-off-by: Ryan Bogan <[email protected]> * Refactor cluster state listener transport calls Signed-off-by: Ryan Bogan <[email protected]> --------- Signed-off-by: Ryan Bogan <[email protected]> (cherry picked from commit dc8eb6b) Co-authored-by: Ryan Bogan <[email protected]>
Description
This PR removes a check that only serialized models into cluster metadata when they were in the "created" state. This means that any model in "training" or "failed" states could only be accessed through blocking transport requests. This PR also minimizes the transport calls in
TrainingJobClusterStateListener
, instead accessing model metadata from the cluster metadata.Check List
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.