Revert "Add ML fault tolerance (#3803)"
This reverts commit d2d2267.
vagimeli committed May 4, 2023
1 parent 9d07b52 commit c9ffb23
Showing 2 changed files with 51 additions and 111 deletions.
98 changes: 49 additions & 49 deletions _ml-commons-plugin/api.md
@@ -24,7 +24,7 @@ In order to train tasks through the API, three inputs are required.
- Model hyperparameters: Adjust these parameters to improve model training.
- Input data: The data that trains the ML model or to which the model is applied for predictions. You can input data in two ways: by querying against your index or by using a data frame.

## Training a model
## Train model

Training can occur both synchronously and asynchronously.

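For illustration, a sketch using the k-means algorithm that appears later on this page (the `async` query parameter is an assumption of this sketch, not confirmed by the visible diff):

```json
POST /_plugins/_ml/_train/kmeans
POST /_plugins/_ml/_train/kmeans?async=true
```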
@@ -96,7 +96,7 @@ For asynchronous responses, the API returns the task_id, which can be used to ge
}
```

## Getting model information
## Get model information

You can retrieve information on your model using the model_id.

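The request itself is collapsed out of this diff; it presumably takes the form (a sketch following the `model_id` conventions used elsewhere on this page):

```json
GET /_plugins/_ml/models/<model_id>
```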
@@ -115,12 +115,12 @@ The API returns information on the model, the algorithm used, and the content fo
}
```

## Registering a model
## Upload a model

Use the register operation to register a custom model to a model index. ML Commons splits the model into smaller chunks and saves those chunks in the model's index.
Use the upload operation to upload a custom model to a model index. ML Commons splits the model into smaller chunks and saves those chunks in the model's index.

```json
POST /_plugins/_ml/models/_register
POST /_plugins/_ml/models/_upload
```

### Request fields
@@ -137,10 +137,10 @@ Field | Data type | Description

### Example

The following example request registers a version `1.0.0` of an NLP sentence transformation model named `all-MiniLM-L6-v2`.
The following example request uploads version `1.0.0` of an NLP sentence transformation model named `all-MiniLM-L6-v2`.

```json
POST /_plugins/_ml/models/_register
POST /_plugins/_ml/models/_upload
{
"name": "all-MiniLM-L6-v2",
"version": "1.0.0",
@@ -166,43 +166,43 @@ OpenSearch responds with the `task_id` and task `status`.
}
```

To see the status of your model registration, enter the `task_id` in the [task API] ...
To see the status of your model upload, enter the `task_id` into the [task API]({{site.url}}{{site.baseurl}}/ml-commons-plugin/api#get-task-information). Use the `model_id` from the task response once the upload is complete. For example:

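The status request itself is a simple GET on the task endpoint (a sketch; the path follows the task API section later on this page):

```json
GET /_plugins/_ml/tasks/<task_id>
```

The response below then includes the `model_id` once the state reaches `COMPLETED`.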
```json
{
"model_id" : "WWQI44MBbzI2oUKAvNUt",
"task_type" : "UPLOAD_MODEL",
"function_name" : "TEXT_EMBEDDING",
"state" : "REGISTERED",
"state" : "COMPLETED",
"worker_node" : "KzONM8c8T4Od-NoUANQNGg",
"create_time" : 1665961344003,
"last_update_time" : 1665961373047,
"is_async" : true
}
```

## Deploying a model
## Load model

The deploy model operation reads the model's chunks from the model index and then creates an instance of the model to cache into memory. This operation requires the `model_id`.
The load model operation reads the model's chunks from the model index, then creates an instance of the model to cache into memory. This operation requires the `model_id`.

```json
POST /_plugins/_ml/models/<model_id>/_deploy
POST /_plugins/_ml/models/<model_id>/_load
```

### Example: Deploying to all available ML nodes
### Example: Load into all available ML nodes

In this example request, OpenSearch deploys the model to any available OpenSearch ML node:
In this example request, OpenSearch loads the model into any available OpenSearch ML node:

```json
POST /_plugins/_ml/models/WWQI44MBbzI2oUKAvNUt/_deploy
POST /_plugins/_ml/models/WWQI44MBbzI2oUKAvNUt/_load
```

### Example: Deploying to a specific node
### Example: Load into a specific node

If you want to reserve the memory of other ML nodes within your cluster, you can deploy your model to one or more specific nodes by specifying the `node_ids` in the request body:
If you want to reserve the memory of other ML nodes within your cluster, you can load your model into one or more specific nodes by specifying the `node_ids` in the request body:

```json
POST /_plugins/_ml/models/WWQI44MBbzI2oUKAvNUt/_deploy
POST /_plugins/_ml/models/WWQI44MBbzI2oUKAvNUt/_load
{
"node_ids": ["4PLK7KJWReyX0oWKnBA8nA"]
}
@@ -213,93 +213,93 @@ POST /_plugins/_ml/models/WWQI44MBbzI2oUKAvNUt/_deploy
```json
{
"task_id" : "hA8P44MBhyWuIwnfvTKP",
"status" : "DEPLOYING"
"status" : "CREATED"
}
```

## Undeploying a model
## Unload a model

To undeploy a model from memory, use the undeploy operation:
To unload a model from memory, use the unload operation:

```json
POST /_plugins/_ml/models/<model_id>/_undeploy
POST /_plugins/_ml/models/<model_id>/_unload
```

### Example: Undeploying model from all ML nodes
### Example: Unload model from all ML nodes

```json
POST /_plugins/_ml/models/MGqJhYMBbbh0ushjm8p_/_undeploy
POST /_plugins/_ml/models/MGqJhYMBbbh0ushjm8p_/_unload
```

### Response: Undeploying a model from all ML nodes
### Response: Unload model from all ML nodes

```json
{
"s5JwjZRqTY6nOT0EvFwVdA": {
"stats": {
"MGqJhYMBbbh0ushjm8p_": "UNDEPLOYED"
"MGqJhYMBbbh0ushjm8p_": "unloaded"
}
}
}
```

### Example: Undeploying specific models from specific nodes
### Example: Unload specific models from specific nodes

```json
POST /_plugins/_ml/models/_undeploy
POST /_plugins/_ml/models/_unload
{
"node_ids": ["sv7-3CbwQW-4PiIsDOfLxQ"],
"model_ids": ["KDo2ZYQB-v9VEDwdjkZ4"]
}
```


### Response: Undeploying specific models from specific nodes
### Response: Unload specific models from specific nodes

```json
{
"sv7-3CbwQW-4PiIsDOfLxQ" : {
"stats" : {
"KDo2ZYQB-v9VEDwdjkZ4" : "UNDEPLOYED"
"KDo2ZYQB-v9VEDwdjkZ4" : "unloaded"
}
}
}
```

### Response: Undeploying all models from specific nodes
### Response: Unload all models from specific nodes

```json
{
"sv7-3CbwQW-4PiIsDOfLxQ" : {
"stats" : {
"KDo2ZYQB-v9VEDwdjkZ4" : "UNDEPLOYED",
"-8o8ZYQBvrLMaN0vtwzN" : "UNDEPLOYED"
"KDo2ZYQB-v9VEDwdjkZ4" : "unloaded",
"-8o8ZYQBvrLMaN0vtwzN" : "unloaded"
}
}
}
```

### Example: Undeploying specific models from all nodes
### Example: Unload specific models from all nodes

```json
{
"model_ids": ["KDo2ZYQB-v9VEDwdjkZ4"]
}
```

### Response: Undeploying specific models from all nodes
### Response: Unload specific models from all nodes

```json
{
"sv7-3CbwQW-4PiIsDOfLxQ" : {
"stats" : {
"KDo2ZYQB-v9VEDwdjkZ4" : "UNDEPLOYED"
"KDo2ZYQB-v9VEDwdjkZ4" : "unloaded"
}
}
}
```

## Searching for a model
## Search model

Use this command to search models you've already created.

Expand All @@ -309,7 +309,7 @@ POST /_plugins/_ml/models/_search
{query}
```

### Example: Querying all models
### Example: Query all models

```json
POST /_plugins/_ml/models/_search
@@ -321,7 +321,7 @@ POST /_plugins/_ml/models/_search
}
```

### Example: Querying models with algorithm "FIT_RCF"
### Example: Query models with algorithm "FIT_RCF"

```json
POST /_plugins/_ml/models/_search
@@ -388,9 +388,9 @@ POST /_plugins/_ml/models/_search
}
```

## Deleting a model
## Delete model

Deletes a model based on the `model_id`.
Deletes a model based on the `model_id`.

```json
DELETE /_plugins/_ml/models/<model_id>
@@ -414,9 +414,9 @@ The API returns the following:
}
```

## Returning model profile information
## Profile

The profile operation returns runtime information on ML tasks and models. The profile operation can help debug issues with models at runtime.
Returns runtime information on ML tasks and models. This operation can help debug issues with models at runtime.


```json
@@ -444,7 +444,7 @@ task_ids | string | Returns runtime data for a specific task. You can string tog
return_all_tasks | boolean | Determines whether or not a request returns all tasks. When set to `false`, task profiles are left out of the response.
return_all_models | boolean | Determines whether or not a profile request returns all models. When set to `false`, model profiles are left out of the response.

### Example: Returning all tasks and models on a specific node
### Example: Return all tasks and models on a specific node

```json
GET /_plugins/_ml/profile
@@ -455,7 +455,7 @@ GET /_plugins/_ml/profile
}
```

### Response: Returning all tasks and models on a specific node
### Response: Return all tasks and models on a specific node

```json
{
@@ -473,7 +473,7 @@
"KzONM8c8T4Od-NoUANQNGg" : { # node id
"models" : {
"WWQI44MBbzI2oUKAvNUt" : { # model id
"model_state" : "DEPLOYED", # model status
"model_state" : "LOADED", # model status
"predictor" : "org.opensearch.ml.engine.algorithms.text_embedding.TextEmbeddingModel@592814c9",
"worker_nodes" : [ # routing table
"KzONM8c8T4Od-NoUANQNGg"
@@ -790,7 +790,7 @@ POST /_plugins/_ml/_train_predict/kmeans
}
```

## Getting task information
## Get task information

You can retrieve information about a task using the task_id.

@@ -814,7 +814,7 @@ The response includes information about the task.
}
```

## Searching for a task
## Search task

Search tasks based on parameters indicated in the request body.

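A minimal search request would look like the following (a sketch; the `match_all` query is an assumption of this sketch, and any query DSL should work in its place):

```json
GET /_plugins/_ml/tasks/_search
{
  "query": {
    "match_all": {}
  }
}
```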
@@ -905,7 +905,7 @@ GET /_plugins/_ml/tasks/_search
}
```

## Deleting a task
## Delete task

Delete a task based on the task_id.

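By analogy with the model delete call earlier on this page, the request presumably takes the form (a sketch):

```json
DELETE /_plugins/_ml/tasks/<task_id>
```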
64 changes: 2 additions & 62 deletions _ml-commons-plugin/cluster-settings.md
@@ -59,7 +59,7 @@ plugins.ml_commons.max_ml_task_per_node: 10

## Set number of ML models per node

Sets the number of ML models that can be deployed to each ML node. When set to `0`, no ML models can deploy on any node.
Sets the number of ML models that can be loaded onto each ML node. When set to `0`, no ML models can load on any node.

### Setting

@@ -74,7 +74,7 @@ plugins.ml_commons.max_model_on_node: 10

## Set sync job intervals

When returning runtime information with the [Profile API]({{site.url}}{{site.baseurl}}/ml-commons-plugin/api#profile), ML Commons will run a regular job to sync newly deployed or undeployed models on each node. When set to `0`, ML Commons immediately stops sync-up jobs.
When returning runtime information with the [profile API]({{site.url}}{{site.baseurl}}/ml-commons-plugin/api#profile), ML Commons will run a regular job to sync newly loaded or unloaded models on each node. When set to `0`, ML Commons immediately stops sync-up jobs.


### Setting
@@ -186,63 +186,3 @@ plugins.ml_commons.native_memory_threshold: 90

- Default value: 90
- Value range: [0, 100]

## Allow custom deployment plans

When enabled, this setting allows users to deploy models to specific ML nodes according to their permissions.

### Setting

```
plugins.ml_commons.allow_custom_deployment_plan: false
```

### Values

- Default value: false
- Value range: [false, true]

## Enable auto redeploy

This setting automatically redeploys deployed or partially deployed models upon cluster failure. If all ML nodes inside a cluster crash, the model switches to the `DEPLOYED_FAILED` state, and the model must be redeployed manually.

### Setting

```
plugins.ml_commons.model_auto_redeploy.enable: false
```

### Values

- Default value: false
- Value range: [false, true]

## Set retries for auto redeploy

This setting sets the limit on the number of times a deployed or partially deployed model will try to redeploy when ML nodes in a cluster fail or new ML nodes join the cluster.

### Setting

```
plugins.ml_commons.model_auto_redeploy.lifetime_retry_times: 3
```

### Values

- Default value: 3
- Value range: [0, 100]

## Set auto redeploy success ratio

This setting sets the success ratio for the auto-redeployment of a model based on the available ML nodes in a cluster. For example, if ML nodes crash inside a cluster, the auto-redeploy protocol adds another node or retires a crashed node. If the ratio is `0.7` and 70% of all ML nodes successfully redeploy the model on auto-redeploy activation, the redeployment is a success. If the model redeploys on fewer than 70% of available ML nodes, auto-redeploy retries until the redeployment succeeds or OpenSearch reaches [the maximum number of retries](#set-retries-for-auto-redeploy).

### Setting

```
plugins.ml_commons.model_auto_redeploy_success_ratio: 0.8
```

### Values

- Default value: 0.8
- Value range: [0, 1]
