Add ML fault tolerance #3803

Merged
merged 16 commits into from
May 1, 2023

Naarcha-AWS
Collaborator

Fixes #2654

Checklist

  • By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license and subject to the Developers Certificate of Origin.
    For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Signed-off-by: Naarcha-AWS <[email protected]>
@Naarcha-AWS Naarcha-AWS added 3 - Tech review PR: Tech review in progress v2.7.0 labels Apr 18, 2023
@Naarcha-AWS Naarcha-AWS self-assigned this Apr 18, 2023
Signed-off-by: Naarcha-AWS <[email protected]>
Signed-off-by: Naarcha-AWS <[email protected]>
Signed-off-by: Naarcha-AWS <[email protected]>
@Naarcha-AWS Naarcha-AWS added 4 - Doc review PR: Doc review in progress 3 - Tech review PR: Tech review in progress and removed 3 - Tech review PR: Tech review in progress 4 - Doc review PR: Doc review in progress labels Apr 24, 2023
@Naarcha-AWS
Collaborator Author

Need to update the PR with the following information:

Signed-off-by: Naarcha-AWS <[email protected]>
Contributor

@ylwu-amzn ylwu-amzn left a comment

LGTM, thanks a lot!

Signed-off-by: Naarcha-AWS <[email protected]>
Co-authored-by: Nathan Bower <[email protected]>
Signed-off-by: Naarcha-AWS <[email protected]>

- If you want to reserve the memory of other ML nodes within your cluster, you can load your model into a specific node(s) by specifying the `node_ids` in the request body:
+ If you want to reserve the memory of other ML nodes within your cluster, you can deploy your model into a specific node(s) by specifying the `node_ids` in the request body:
Collaborator Author

Suggested change
- If you want to reserve the memory of other ML nodes within your cluster, you can deploy your model into a specific node(s) by specifying the `node_ids` in the request body:
+ If you want to reserve the memory of other ML nodes within your cluster, you can deploy your model to a specific node(s) by specifying the `node_ids` in the request body:
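
For reference, a minimal sketch of the request that sentence describes. It assumes the ML Commons deploy endpoint has the form `POST /_plugins/_ml/models/<model_id>/_deploy`, which is not shown in this thread; the helper name `build_deploy_request` and the node IDs are hypothetical, and the model ID is reused from the example later in this diff.

```python
import json


def build_deploy_request(model_id, node_ids=None):
    """Return (method, path, body) for a deploy call.

    The endpoint path is an assumption based on the ML Commons API naming;
    this helper is illustrative and not part of any client library.
    """
    path = f"/_plugins/_ml/models/{model_id}/_deploy"
    # Only include node_ids when the caller wants to pin the model to nodes.
    body = {"node_ids": list(node_ids)} if node_ids else None
    return "POST", path, body


# Hypothetical node IDs; replace with real IDs from your cluster.
method, path, body = build_deploy_request(
    "KDo2ZYQB-v9VEDwdjkZ4", ["node-1", "node-2"]
)
print(method, path)
print(json.dumps(body))
```

Leaving `node_ids` out returns no body, which corresponds to letting the cluster choose the nodes.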

}
}
}
```

- ### Response: Unload all models from specific nodes
+ ### Response: Undeploy all models from specific nodes
Collaborator Author

Suggested change
- ### Response: Undeploy all models from specific nodes
+ ### Response: Undeploying all models from specific nodes


```json
{
"model_ids": ["KDo2ZYQB-v9VEDwdjkZ4"]
}
```
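
As a hedged illustration of how the `model_ids` body above might be assembled alongside `node_ids`, here is a small sketch. It assumes the undeploy request accepts either or both fields in one JSON body; the helper name `build_undeploy_body` and the node ID are hypothetical.

```python
import json


def build_undeploy_body(model_ids=None, node_ids=None):
    """Build an undeploy request body from optional model and node filters.

    Assumption: omitting a field means "all" for that dimension, so an
    empty body would target every model on every node.
    """
    body = {}
    if model_ids:
        body["model_ids"] = list(model_ids)
    if node_ids:
        body["node_ids"] = list(node_ids)
    return body


# Specific models from all nodes (matches the JSON body shown above).
models_only = build_undeploy_body(model_ids=["KDo2ZYQB-v9VEDwdjkZ4"])
# All models from specific nodes (node ID is a placeholder).
nodes_only = build_undeploy_body(node_ids=["node-1"])
print(json.dumps(models_only))
print(json.dumps(nodes_only))
```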

- ### Response: Unload specific models from all nodes
+ ### Response: Undeploy specific models from all nodes
Collaborator Author

Suggested change
- ### Response: Undeploy specific models from all nodes
+ ### Response: Undeploying specific models from all nodes

Naarcha-AWS and others added 4 commits April 26, 2023 11:44
Co-authored-by: Nathan Bower <[email protected]>
Signed-off-by: Naarcha-AWS <[email protected]>
@Naarcha-AWS Naarcha-AWS added the release-notes PR: Include this PR in the automated release notes label Apr 26, 2023
Signed-off-by: Naarcha-AWS <[email protected]>
@Naarcha-AWS Naarcha-AWS merged commit bce2e33 into main May 1, 2023
@Naarcha-AWS Naarcha-AWS deleted the ml-fault-tolerance branch May 1, 2023 13:16
vagimeli pushed a commit that referenced this pull request May 4, 2023
* Add ML fault tolerance

Signed-off-by: Naarcha-AWS <[email protected]>

* Rework Profile API sentence

Signed-off-by: Naarcha-AWS <[email protected]>

* Fix link

Signed-off-by: Naarcha-AWS <[email protected]>

* Add review feedback

Signed-off-by: Naarcha-AWS <[email protected]>

* Add technical feedback for ML. Change API names

Signed-off-by: Naarcha-AWS <[email protected]>

* Add final ML node setting

Signed-off-by: Naarcha-AWS <[email protected]>

* Add more technical feedback

Signed-off-by: Naarcha-AWS <[email protected]>

* Apply suggestions from code review

Co-authored-by: Chris Moore <[email protected]>
Signed-off-by: Naarcha-AWS <[email protected]>

* Update cluster-settings.md

Signed-off-by: Naarcha-AWS <[email protected]>

* Update _ml-commons-plugin/api.md

Co-authored-by: Nathan Bower <[email protected]>
Signed-off-by: Naarcha-AWS <[email protected]>

* Update _ml-commons-plugin/api.md

Signed-off-by: Naarcha-AWS <[email protected]>

* Apply suggestions from code review

Co-authored-by: Nathan Bower <[email protected]>
Signed-off-by: Naarcha-AWS <[email protected]>

* Update _ml-commons-plugin/api.md

Signed-off-by: Naarcha-AWS <[email protected]>

* Update _ml-commons-plugin/api.md

Signed-off-by: Naarcha-AWS <[email protected]>

* Update _ml-commons-plugin/api.md

Signed-off-by: Naarcha-AWS <[email protected]>

* Update api.md

Signed-off-by: Naarcha-AWS <[email protected]>

---------

Signed-off-by: Naarcha-AWS <[email protected]>
Signed-off-by: Naarcha-AWS <[email protected]>
Co-authored-by: Chris Moore <[email protected]>
Co-authored-by: Nathan Bower <[email protected]>
vagimeli added a commit that referenced this pull request May 4, 2023
harshavamsi pushed a commit to harshavamsi/documentation-website that referenced this pull request Oct 31, 2023
Labels
4 - Doc review PR: Doc review in progress release-notes PR: Include this PR in the automated release notes v2.7.0
Development

Successfully merging this pull request may close these issues.

[DOC] [ml-commons] Support model auto-reload (fault tolerance)
5 participants