Add ML fault tolerance #3803

Merged
merged 16 commits into from
May 1, 2023
37 changes: 37 additions & 0 deletions _ml-commons-plugin/api.md
@@ -494,6 +494,43 @@ GET /_plugins/_ml/profile
}
```

### Example: Return auto redeploy and node information

When the [auto redeploy]({{site.url}}{{site.baseurl}}/ml-commons-plugin/cluster-settings#enable-auto-redeploy) cluster setting is set to `true`, the profile API returns additional deployment information, including deployment time, retry count, and worker node IDs where the model is deployed.

```json
{
  "name": "all-mpnet-base-v2",
  "algorithm": "TEXT_EMBEDDING",
  "model_version": "1",
  "model_format": "TORCH_SCRIPT",
  "model_state": "DEPLOYED",
  "model_content_size_in_bytes": 404999545,
  "model_content_hash_value": "fe72818b76a91e154776e4737b1fb0db255c091e8123117ad8758d9f7be6e594",
  "model_config": {
    "model_type": "bert",
    "embedding_dimension": 768,
    "framework_type": "SENTENCE_TRANSFORMERS"
  },
  "created_time": 1681642820665,
  "last_updated_time": 1681646576370,
  "last_registered_time": 1681642837416,
  "last_deployed_time": 1681646576370,
  "auto_redeploy_retry_times": 0,
  "total_chunks": 41,
  "planning_worker_node_count": 6,
  "current_worker_node_count": 6,
  "planning_worker_nodes": [
    "Liz28BgFTo--u0ZXVtmOaQ",
    "jPPe_s9vQq-cKgZrn6hN-w",
    "gN8IFfxdT4mnPdc6WW9ung",
    "lCUgCEiASWKfRNTYoKo9Ng",
    "ZCBneXv6SG2VdHQvLAnEUg",
    "F483VPEbQlaQoEW_F2H-gQ"
  ]
}
```
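
The deployment fields in this response can be checked programmatically. The following Python sketch is illustrative only: `summarize_deployment` is a hypothetical helper, not part of ML Commons, and it assumes that the `*_time` fields are epoch timestamps in milliseconds and that a model counts as fully deployed when `current_worker_node_count` reaches `planning_worker_node_count`.

```python
from datetime import datetime, timezone


def summarize_deployment(profile: dict) -> dict:
    """Summarize the deployment fields of one model entry from the profile API.

    Hypothetical helper: the field names come from the example response,
    but the interpretation of them is an assumption.
    """
    planned = profile["planning_worker_node_count"]
    current = profile["current_worker_node_count"]
    # Assumption: last_deployed_time is expressed in epoch milliseconds.
    deployed_at = datetime.fromtimestamp(
        profile["last_deployed_time"] / 1000, tz=timezone.utc
    )
    return {
        "fully_deployed": current >= planned,
        "missing_nodes": max(planned - current, 0),
        "retries_used": profile["auto_redeploy_retry_times"],
        "last_deployed_utc": deployed_at.isoformat(),
    }


# Subset of the example response shown above.
example = {
    "last_deployed_time": 1681646576370,
    "auto_redeploy_retry_times": 0,
    "planning_worker_node_count": 6,
    "current_worker_node_count": 6,
}

summary = summarize_deployment(example)
print(summary)
```

A summary like this may help spot models that are only partially deployed after a node failure.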


## Predict

32 changes: 32 additions & 0 deletions _ml-commons-plugin/cluster-settings.md
@@ -186,3 +186,35 @@ plugins.ml_commons.native_memory_threshold: 90

- Default value: 90
- Value range: [0, 100]

## Enable auto redeploy

Automatically redeploys deployed or partially deployed models upon cluster failure. If all ML nodes inside a cluster crash, auto model redeployment fails, and the model must be deployed manually.

### Setting

```
plugins.ml_commons.model_auto_redeploy.enable: false
```

### Values

- Default value: false
- Value range: [false, true]
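
As a minimal sketch, the following Python snippet builds a request body that could be sent to the cluster settings API (`PUT _cluster/settings`) to turn the setting on. It assumes that the setting can be updated dynamically through that API, which this page does not confirm; if it cannot, set the value in the node configuration file instead. The helper name `auto_redeploy_settings_body` is hypothetical.

```python
import json

# Setting name taken from the section above.
AUTO_REDEPLOY_SETTING = "plugins.ml_commons.model_auto_redeploy.enable"


def auto_redeploy_settings_body(enabled: bool, persistent: bool = True) -> str:
    """Build a JSON body for the cluster settings API (hypothetical helper).

    Assumption: the setting is dynamic and accepts a persistent or
    transient update.
    """
    scope = "persistent" if persistent else "transient"
    return json.dumps({scope: {AUTO_REDEPLOY_SETTING: enabled}})


body = auto_redeploy_settings_body(True)
print(body)
# prints {"persistent": {"plugins.ml_commons.model_auto_redeploy.enable": true}}
```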

## Set retries for auto redeploy

Sets the maximum number of times a previously deployed model attempts to redeploy after a cluster failure.

### Setting

```
plugins.ml_commons.model_auto_redeploy.lifetime_retry_times: 3
```

### Values

- Default value: 3
- Value range: [0, 100]
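
To illustrate how the limit might be used, the Python sketch below validates a candidate value against the documented range and estimates how many redeploy attempts remain. It assumes that the `auto_redeploy_retry_times` value reported by the profile API counts against this limit; that relationship is an assumption, not something stated here. Both function names are hypothetical.

```python
# Documented value range for lifetime_retry_times.
RETRY_LIMIT_RANGE = (0, 100)
DEFAULT_RETRY_LIMIT = 3


def validate_retry_limit(limit: int) -> int:
    """Return the limit unchanged if it is inside the documented range."""
    low, high = RETRY_LIMIT_RANGE
    if not low <= limit <= high:
        raise ValueError(f"lifetime_retry_times must be in [{low}, {high}], got {limit}")
    return limit


def retries_remaining(retries_used: int, limit: int = DEFAULT_RETRY_LIMIT) -> int:
    """Estimate the remaining redeploy attempts (hypothetical helper).

    Assumption: retries_used corresponds to auto_redeploy_retry_times
    from the profile API and is counted against the configured limit.
    """
    return max(validate_retry_limit(limit) - retries_used, 0)


print(retries_remaining(0))     # prints 3
print(retries_remaining(5, 3))  # prints 0
```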