Serialize all models into cluster metadata #1499

Merged
merged 16 commits into from
Apr 23, 2024

Conversation

ryanbogan
Member

Description

This PR removes a check that only serialized models into cluster metadata when they were in the "created" state. Because of that check, any model in the "training" or "failed" state could previously only be accessed through blocking transport requests. This PR also reduces the transport calls in TrainingJobClusterStateListener by reading model metadata from the cluster metadata instead.
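
As a rough illustration of the change described above, the following minimal sketch contrasts which model states would be visible in cluster metadata with and without the removed check. All names here (`SerializationSketch`, `ModelState`, `visibleInClusterMetadata`) are hypothetical and are not taken from the plugin's code.

```java
import java.util.EnumSet;
import java.util.Set;

// Hypothetical illustration; these names are not taken from the k-NN plugin.
final class SerializationSketch {
    enum ModelState { TRAINING, CREATED, FAILED }

    // Returns the states that would be visible in cluster metadata.
    // onlyCreated = true models the removed check; false models the new behavior.
    static Set<ModelState> visibleInClusterMetadata(boolean onlyCreated) {
        Set<ModelState> visible = EnumSet.noneOf(ModelState.class);
        for (ModelState state : ModelState.values()) {
            if (!onlyCreated || state == ModelState.CREATED) {
                visible.add(state);
            }
        }
        return visible;
    }

    public static void main(String[] args) {
        System.out.println("with the check:    " + visibleInClusterMetadata(true));
        System.out.println("without the check: " + visibleInClusterMetadata(false));
    }
}
```

With the check removed, readers of cluster metadata can see models in every state, which is why the later review comments ask for explicit state checks at the call sites.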

Check List

  • New functionality includes testing.
    • All tests pass
  • New functionality has been documented.
    • New functionality has javadoc added
  • Commits are signed as per the DCO using --signoff

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

codecov bot commented Feb 27, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 85.06%. Comparing base (771c4b5) to head (a0cb2be).
Report is 9 commits behind head on main.

❗ Current head a0cb2be differs from pull request most recent head 231452e. Consider uploading reports for the commit 231452e to get more accurate results

Additional details and impacted files
@@             Coverage Diff              @@
##               main    #1499      +/-   ##
============================================
+ Coverage     84.92%   85.06%   +0.13%     
+ Complexity     1375     1280      -95     
============================================
  Files           172      168       -4     
  Lines          5605     5229     -376     
  Branches        553      494      -59     
============================================
- Hits           4760     4448     -312     
+ Misses          612      573      -39     
+ Partials        233      208      -25     

Member

@jmazanec15 jmazanec15 left a comment

@@ -109,10 +109,9 @@ protected void updateModelsNewCluster() throws IOException, InterruptedException
         if (modelDao.isCreated()) {
             List<String> modelIds = searchModelIds();
             for (String modelId : modelIds) {
-                Model model = modelDao.get(modelId);
-                ModelMetadata modelMetadata = model.getModelMetadata();
+                ModelMetadata modelMetadata = modelDao.getMetadata(modelId);
Member

Is this going to be backwards compatible?

Member Author

I should probably add a version check, since they won't be in the metadata on older clusters if it's in created state

Member Author

@ryanbogan ryanbogan Apr 16, 2024

Actually, I'm not sure if there is a way to make it backwards compatible, since the old models wouldn't be in the cluster metadata. I think I have to revert to the get call

-        } else {
-            onIndexListener = onMetaListener;
-        }
+        ActionListener<IndexResponse> onIndexListener = getUpdateModelMetadataListener(model.getModelMetadata(), onMetaListener);
Member

I want to discuss a few thoughts here:

Before, the model index was the only source of truth until the model was actually created. After that, the model metadata was a source of truth as well.

Now, we are saying that the model metadata will always lag behind the model system index (i.e., on any call to change model state, we first persist to the system index and then update the cluster state). This opens up the possibility that the model index and cluster state fall out of sync on failure. We may need to think about how we handle different scenarios.

I think a model's state can be updated in one of the following ways:

  1. Training process (this will be in sync - single thread, synchronous process)
  2. Model deleted (we block the model delete if the model is not in the created or error state - we should confirm that we check this from cluster metadata before blocking)
  3. Node drop (if a node drops, the offending node will know not to change anything in the state, correct?)

Are there other cases you can think of that might be of concern?
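
The out-of-sync concern above can be pictured with a minimal sketch. It assumes two hypothetical in-memory stand-ins (`systemIndex` and `clusterMetadata`) and a simulated failure flag; none of this mirrors the plugin's real classes or failure handling.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

// Hypothetical in-memory stand-ins; not the k-NN plugin's classes.
final class TwoStoreSketch {
    final Map<String, String> systemIndex = new HashMap<>();     // model id -> state
    final Map<String, String> clusterMetadata = new HashMap<>(); // model id -> state

    // Persist to the system index first, then mirror into cluster metadata.
    // If the second step fails, the two stores disagree until something repairs them.
    void updateState(String modelId, String state, boolean failMetadataUpdate) {
        systemIndex.put(modelId, state);
        if (failMetadataUpdate) {
            throw new IllegalStateException("simulated failure before cluster metadata update");
        }
        clusterMetadata.put(modelId, state);
    }

    // True when both stores report the same state for the model.
    boolean inSync(String modelId) {
        return Objects.equals(systemIndex.get(modelId), clusterMetadata.get(modelId));
    }

    public static void main(String[] args) {
        TwoStoreSketch stores = new TwoStoreSketch();
        stores.updateState("m1", "training", false);
        try {
            stores.updateState("m1", "created", true);
        } catch (IllegalStateException e) {
            System.out.println("in sync after failure? " + stores.inSync("m1"));
        }
    }
}
```

In this sketch a failure between the two writes leaves the system index ahead of the cluster metadata, which is the lag scenario the comment asks about.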

Member Author

@ryanbogan ryanbogan Mar 26, 2024

@jmazanec15 I believe the goal of this was to treat the cluster metadata as the main source of truth in case of node drops. This would decrease the chance of nodes being out of sync with the cluster, since they could always access the model metadata from the cluster metadata.

Member Author

I wrote a temporary integration test that tries to create a race condition: a model finishes training and enters the created state in ModelDao, and the model is then deleted immediately. All behavior was as expected, and the model was deleted from the cluster metadata as well.

Member

@ryanbogan can you push to another branch and link to the test? I want to take a quick look.

Member Author

@jmazanec15 I already deleted it from my local checkout. Do you want me to recreate it?

@jmazanec15
Member

@ryanbogan and check from here:

String modelId = fieldInfo.getAttribute(MODEL_ID);
if (modelId != null) {
    ModelMetadata modelMetadata = modelDao.getMetadata(modelId);
    if (modelMetadata == null) {
        throw new RuntimeException("Model \"" + modelId + "\" does not exist.");
    }
    knnEngine = modelMetadata.getKnnEngine();
    spaceType = modelMetadata.getSpaceType();

@jmazanec15
Member

and here:

import org.opensearch.knn.indices.ModelMetadata;

@ryanbogan
Member Author

@ryanbogan and check from here:

String modelId = fieldInfo.getAttribute(MODEL_ID);
if (modelId != null) {
    ModelMetadata modelMetadata = modelDao.getMetadata(modelId);
    if (modelMetadata == null) {
        throw new RuntimeException("Model \"" + modelId + "\" does not exist.");
    }
    knnEngine = modelMetadata.getKnnEngine();
    spaceType = modelMetadata.getSpaceType();

I'll add a check to get the state
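
A state check of this kind might look roughly like the sketch below. It is only an illustration: `CreatedCheckSketch`, `ModelState`, and `requireCreated` are hypothetical names, and the map stands in for model metadata read from cluster metadata. The error messages follow the wording quoted in this conversation.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical illustration; these names are not taken from the k-NN plugin.
final class CreatedCheckSketch {
    enum ModelState { TRAINING, CREATED, FAILED }

    // Once models in every state live in cluster metadata, a non-null lookup
    // no longer implies the model is usable, so the state is checked as well.
    static void requireCreated(Map<String, ModelState> clusterMetadata, String modelId) {
        ModelState state = clusterMetadata.get(modelId);
        if (state == null) {
            throw new RuntimeException(String.format(Locale.ROOT, "Model \"%s\" does not exist.", modelId));
        }
        if (state != ModelState.CREATED) {
            throw new RuntimeException(String.format(Locale.ROOT, "Model \"%s\" is not created.", modelId));
        }
    }

    public static void main(String[] args) {
        Map<String, ModelState> metadata = new HashMap<>();
        metadata.put("ready", ModelState.CREATED);
        metadata.put("busy", ModelState.TRAINING);
        requireCreated(metadata, "ready"); // returns normally
        try {
            requireCreated(metadata, "busy");
        } catch (RuntimeException e) {
            System.out.println(e.getMessage());
        }
    }
}
```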

jni/CMakeLists.txt (outdated, resolved)
@ryanbogan
Member Author

Windows failures are unrelated to this PR

Member

@jmazanec15 jmazanec15 left a comment

A few minor comments. Overall looks good.

@ryanbogan
Member Author

BWC failure unrelated to this change: #1622

Member

@jmazanec15 jmazanec15 left a comment

LGTM thanks!

-            if (modelMetadata == null) {
-                throw new RuntimeException("Model \"" + modelId + "\" does not exist.");
+            if (!ModelUtil.isModelCreated(modelMetadata)) {
+                throw new RuntimeException("Model \"" + modelId + "\" is not created.");
Member

nit: please use String.format for string concatenation
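
For reference, the suggestion amounts to something like the following sketch, which builds the same message both ways. The class and method names are hypothetical, and the use of `Locale.ROOT` is an assumption about which locale the project would pass.

```java
import java.util.Locale;

// Hypothetical illustration of the nit; not code from the PR.
final class MessageFormatSketch {
    // Message built with string concatenation, as in the reviewed diff.
    static String concatenated(String modelId) {
        return "Model \"" + modelId + "\" is not created.";
    }

    // Same message built with String.format and an explicit locale.
    static String formatted(String modelId) {
        return String.format(Locale.ROOT, "Model \"%s\" is not created.", modelId);
    }

    public static void main(String[] args) {
        System.out.println(concatenated("my-model"));
        System.out.println(formatted("my-model"));
    }
}
```

Both methods produce the same text; the format-string version keeps the message template in one piece and makes the locale explicit.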

Member

@martin-gaievski this comes up a couple of times because we do not do it in a lot of existing code. Do you know if it's possible to add something to Spotless to automatically fix this?

Member

It should be possible. We added a custom template for Spotless in neural-search; please check this PR: opensearch-project/neural-search#515

Member

@martin-gaievski is there a comment or file for it? I'm not seeing it in formatting.

Member

Yeah, I can't see anything for String.format, so maybe it's not possible. @vibrantvarun do you know by any chance if we can automate a fix for String.format(locale, "")?

@ryanbogan ryanbogan merged commit dc8eb6b into opensearch-project:main Apr 23, 2024
49 of 51 checks passed
@ryanbogan ryanbogan deleted the serialize_all_models branch April 23, 2024 17:27
opensearch-trigger-bot bot pushed a commit that referenced this pull request Apr 23, 2024
* Remove transport calls in TrainingJobRunner and TrainingJobClusterStateListener

Signed-off-by: Ryan Bogan <[email protected]>

* Fix tests

Signed-off-by: Ryan Bogan <[email protected]>

* Add changelog

Signed-off-by: Ryan Bogan <[email protected]>

* Fix CMake Faiss bug

Signed-off-by: Ryan Bogan <[email protected]>

* Add state checks for existing cluster metadata calls

Signed-off-by: Ryan Bogan <[email protected]>

* Remove CMake bug fix

Signed-off-by: Ryan Bogan <[email protected]>

* Fix changelog

Signed-off-by: Ryan Bogan <[email protected]>

* Fix failing tests

Signed-off-by: Ryan Bogan <[email protected]>

* Refactor and add two more created state checks

Signed-off-by: Ryan Bogan <[email protected]>

* Rebase and fix new tests

Signed-off-by: Ryan Bogan <[email protected]>

* Refactor created checks and modify error messages

Signed-off-by: Ryan Bogan <[email protected]>

* Refactor cluster state listener transport calls

Signed-off-by: Ryan Bogan <[email protected]>

---------

Signed-off-by: Ryan Bogan <[email protected]>
(cherry picked from commit dc8eb6b)
ryanbogan added a commit that referenced this pull request Apr 23, 2024
Co-authored-by: Ryan Bogan <[email protected]>
3 participants