Add ML fault tolerance #3803

Merged (16 commits) on May 1, 2023
72 changes: 36 additions & 36 deletions _ml-commons-plugin/api.md

## Register a model

Use the register operation to register a custom model to a model index. ML Commons splits the model into smaller chunks and saves those chunks in the model's index.

```json
POST /_plugins/_ml/models/_register
```

### Request fields
Field | Data type | Description

### Example

The following example request registers version `1.0.0` of an NLP sentence transformation model named `all-MiniLM-L6-v2`.

```json
POST /_plugins/_ml/models/_register
{
"name": "all-MiniLM-L6-v2",
"version": "1.0.0",
  ...
}
```

OpenSearch responds with the `task_id` and task `status`.

To see the status of your model registration, enter the `task_id` into the [task API]({{site.url}}{{site.baseurl}}/ml-commons-plugin/api#get-task-information). Use the `model_id` from the task response once the registration is complete. For example:

```json
{
"model_id" : "WWQI44MBbzI2oUKAvNUt",
"task_type" : "UPLOAD_MODEL",
"function_name" : "TEXT_EMBEDDING",
  "state" : "REGISTERED",
"worker_node" : "KzONM8c8T4Od-NoUANQNGg",
"create_time" : 1665961344003,
"last_update_time" : 1665961373047,
"is_async" : true
}
```
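Registration is asynchronous, so a client typically polls the task API until the `model_id` is available. The following sketch shows one way to do that; the host, helper names, and poll interval are illustrative assumptions, while the endpoint and response fields follow the examples above.

```python
import json
import time
import urllib.request

def finished_model_id(task: dict):
    """Return the model_id once a task response reports a finished registration.

    Returns None while the task is still pending or running.
    """
    if task.get("state") in ("COMPLETED", "REGISTERED"):
        return task.get("model_id")
    if task.get("state") == "FAILED":
        raise RuntimeError(f"registration failed: {task}")
    return None

def wait_for_registration(host: str, task_id: str, poll_interval: float = 2.0) -> str:
    """Poll GET /_plugins/_ml/tasks/<task_id> until the model is registered."""
    while True:
        with urllib.request.urlopen(f"{host}/_plugins/_ml/tasks/{task_id}") as resp:
            task = json.load(resp)
        model_id = finished_model_id(task)
        if model_id:
            return model_id
        time.sleep(poll_interval)
```

For example, `wait_for_registration("http://localhost:9200", "WWQI44MBbzI2oUKAvNUt")` would block until the task above completes and then return the new `model_id`.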

## Deploy model

The deploy model operation reads the model's chunks from the model index and then creates an instance of the model to cache in memory. This operation requires the `model_id`.

```json
POST /_plugins/_ml/models/<model_id>/_deploy
```

### Example: Deploy to all available ML nodes

In this example request, OpenSearch deploys the model to any available OpenSearch ML node:

```json
POST /_plugins/_ml/models/WWQI44MBbzI2oUKAvNUt/_deploy
```

### Example: Deploy to a specific node

If you want to reserve the memory of other ML nodes within your cluster, you can deploy your model to a specific node or nodes by specifying the `node_ids` in the request body:

```json
POST /_plugins/_ml/models/WWQI44MBbzI2oUKAvNUt/_deploy
{
"node_ids": ["4PLK7KJWReyX0oWKnBA8nA"]
}
```

```json
{
"task_id" : "hA8P44MBhyWuIwnfvTKP",
  "status" : "DEPLOYING"
}
```
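The two deploy variants above differ only in whether a request body restricts the target nodes. A minimal sketch of building either request, where the helper name is an assumption and the path and `node_ids` field come from the examples in this section:

```python
def deploy_request(model_id: str, node_ids=None):
    """Build the path and body for POST /_plugins/_ml/models/<model_id>/_deploy.

    With node_ids=None the model deploys to all available ML nodes;
    passing a list of node IDs restricts deployment to those nodes.
    """
    path = f"/_plugins/_ml/models/{model_id}/_deploy"
    body = {"node_ids": list(node_ids)} if node_ids else None
    return path, body
```

For example, `deploy_request("WWQI44MBbzI2oUKAvNUt", ["4PLK7KJWReyX0oWKnBA8nA"])` reproduces the specific-node request shown above.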

## Undeploy a model

To undeploy a model from memory, use the undeploy operation.

```json
POST /_plugins/_ml/models/<model_id>/_undeploy
```

### Example: Undeploy a model from all ML nodes

```json
POST /_plugins/_ml/models/MGqJhYMBbbh0ushjm8p_/_undeploy
```

### Response: Undeploy a model from all ML nodes

```json
{
"s5JwjZRqTY6nOT0EvFwVdA": {
"stats": {
      "MGqJhYMBbbh0ushjm8p_": "UNDEPLOYED"
}
}
}
```

### Example: Undeploy specific models from specific nodes

```json
POST /_plugins/_ml/models/_undeploy
{
"node_ids": ["sv7-3CbwQW-4PiIsDOfLxQ"],
"model_ids": ["KDo2ZYQB-v9VEDwdjkZ4"]
}
```


### Response: Undeploy specific models from specific nodes

```json
{
"sv7-3CbwQW-4PiIsDOfLxQ" : {
"stats" : {
      "KDo2ZYQB-v9VEDwdjkZ4" : "UNDEPLOYED"
}
}
}
```

### Response: Undeploy all models from specific nodes


```json
{
"sv7-3CbwQW-4PiIsDOfLxQ" : {
"stats" : {
      "KDo2ZYQB-v9VEDwdjkZ4" : "UNDEPLOYED",
      "-8o8ZYQBvrLMaN0vtwzN" : "UNDEPLOYED"
}
}
}
```

### Example: Undeploy specific models from all nodes

```json
POST /_plugins/_ml/models/_undeploy
{
"model_ids": ["KDo2ZYQB-v9VEDwdjkZ4"]
}
```

### Response: Undeploy specific models from all nodes


```json
{
"sv7-3CbwQW-4PiIsDOfLxQ" : {
"stats" : {
      "KDo2ZYQB-v9VEDwdjkZ4" : "UNDEPLOYED"
}
}
}
```
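All of the undeploy responses above share one shape: a per-node `stats` object mapping model IDs to a state string. A small sketch of flattening that into "which nodes undeployed which model"; the helper name is an assumption, while the response shape and `UNDEPLOYED` value come from the examples in this section:

```python
def undeployed_nodes(response: dict) -> dict:
    """Map each model_id to the node IDs that report it as UNDEPLOYED.

    `response` has the shape shown in the undeploy examples:
    {node_id: {"stats": {model_id: "UNDEPLOYED", ...}}, ...}
    """
    result = {}
    for node_id, node in response.items():
        for model_id, state in node.get("stats", {}).items():
            if state == "UNDEPLOYED":
                result.setdefault(model_id, []).append(node_id)
    return result
```

Applied to the last response above, this returns `{"KDo2ZYQB-v9VEDwdjkZ4": ["sv7-3CbwQW-4PiIsDOfLxQ"]}`.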

## Delete model

Deletes a model based on the `model_id`.

```json
DELETE /_plugins/_ml/models/<model_id>
```

## Profile

The profile API (`GET /_plugins/_ml/profile`) returns runtime information about models. The following truncated response shows a deployed model's state:

```json
{
  "KzONM8c8T4Od-NoUANQNGg" : { # node id
    "models" : {
      "WWQI44MBbzI2oUKAvNUt" : { # model id
        "model_state" : "DEPLOYED", # model status
        "predictor" : "org.opensearch.ml.engine.algorithms.text_embedding.TextEmbeddingModel@592814c9",
        "worker_nodes" : [ # routing table
          "KzONM8c8T4Od-NoUANQNGg"
        ]
      }
    }
  }
}
```
64 changes: 62 additions & 2 deletions _ml-commons-plugin/cluster-settings.md

## Set number of ML models per node

Sets the number of ML models that can be deployed to each ML node. When set to `0`, no ML models can be deployed to any node.

### Setting

```
plugins.ml_commons.max_model_on_node: 10
```

## Set sync job intervals

When returning runtime information with the [Profile API]({{site.url}}{{site.baseurl}}/ml-commons-plugin/api#profile), ML Commons runs a regular job to sync newly deployed or undeployed models on each node. When set to `0`, ML Commons immediately stops sync-up jobs.



### Setting

```
plugins.ml_commons.sync_up_job_interval_in_seconds: 10
```

## Set native memory threshold

### Setting

```
plugins.ml_commons.native_memory_threshold: 90
```

### Values

- Default value: 90
- Value range: [0, 100]

## Allow custom deployment plans

When enabled, this setting grants users the ability to deploy models to specific ML nodes according to that user's permissions.

### Setting

```
plugins.ml_commons.allow_custom_deployment_plan: false
```

### Values

- Default value: false
- Value range: [false, true]
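Settings like this one are dynamic cluster settings, so they can be changed at runtime through the cluster settings API rather than in `opensearch.yml`. A minimal sketch of doing so from a client; the host and helper names are assumptions, the setting name comes from this section, and `persistent` (vs. `transient`) is chosen so the change survives a restart:

```python
import json
import urllib.request

def settings_body(name: str, value) -> dict:
    """Build the PUT /_cluster/settings body for one persistent setting."""
    return {"persistent": {name: value}}

def update_cluster_setting(host: str, name: str, value) -> dict:
    """PUT the setting to the cluster settings API and return the response."""
    req = urllib.request.Request(
        f"{host}/_cluster/settings",
        data=json.dumps(settings_body(name, value)).encode(),
        method="PUT",
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

# Example (not executed here):
# update_cluster_setting("http://localhost:9200",
#                        "plugins.ml_commons.allow_custom_deployment_plan", True)
```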

## Enable auto redeploy

When enabled, this setting automatically redeploys deployed or partially deployed models upon cluster failure. If all ML nodes inside a cluster crash, the model switches to the `DEPLOYED_FAILED` state, and the model must be deployed manually.

### Setting

```
plugins.ml_commons.model_auto_redeploy.enable: false
```

### Values

- Default value: false
- Value range: [false, true]

## Set retries for auto redeploy

Sets the limit on the number of times a deployed or partially deployed model will try to redeploy when ML nodes in a cluster fail or new ML nodes join the cluster.

### Setting

```
plugins.ml_commons.model_auto_redeploy.lifetime_retry_times: 3
```

### Values

- Default value: 3
- Value range: [0, 100]

## Set auto redeploy success ratio

Sets the success ratio for the automatic redeployment of a model based on the available ML nodes in the cluster. For example, if ML nodes crash inside a cluster, the auto redeploy protocol adds another node or retires the crashed node. If the ratio is `0.7` and the model successfully redeploys on at least 70% of all ML nodes, the redeployment is a success. If the model redeploys on less than 70% of available ML nodes, auto redeploy retries until the redeployment succeeds or OpenSearch reaches [the maximum number of retries](#set-retries-for-auto-redeploy).

### Setting

```
plugins.ml_commons.model_auto_redeploy_success_ratio: 0.8
```

### Values

- Default value: 0.8
- Value range: [0, 1]
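The ratio check described above amounts to simple arithmetic. A sketch of the success test, where the function name is an assumption and the exact comparison OpenSearch uses (inclusive vs. strict) is assumed to be inclusive, matching the "at least 70%" reading:

```python
def redeploy_succeeded(successful_nodes: int, eligible_ml_nodes: int,
                       ratio: float = 0.8) -> bool:
    """A redeploy round counts as successful when the fraction of eligible
    ML nodes that deployed the model meets the configured ratio."""
    if eligible_ml_nodes == 0:
        return False  # no eligible nodes means nothing could deploy
    return successful_nodes / eligible_ml_nodes >= ratio
```

With a ratio of `0.7` and 10 eligible ML nodes, 7 successful deployments pass the check while 6 fall short and trigger another retry.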