
Release Pre-Trained ML models #2676

Closed
ylwu-amzn opened this issue Sep 30, 2022 · 12 comments

ylwu-amzn commented Sep 30, 2022

Is your feature request related to a problem? Please describe

We are going to support uploading deep learning models in ml-commons in 2.4; see details in opensearch-project/ml-commons#302. That feature requires the user to train a model outside the OpenSearch cluster and then upload it to the cluster via the ml-commons API. To make it easier for users to get started with this feature, we are going to publish some pre-trained models, so users don't need to train a model first: they just provide a model name and ml-commons loads it from the OpenSearch artifact repo.
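
For illustration, loading a pre-trained model by name alone might look something like the sketch below. The endpoint and request body are assumptions based on the upload API proposed in ml-commons#302, not a finalized interface:

```python
# Hypothetical sketch, not the finalized ml-commons API: register a
# pre-trained model by name and let ml-commons fetch the artifact from
# the OpenSearch artifact repo.
import requests

resp = requests.post(
    "http://localhost:9200/_plugins/_ml/models/_upload",
    json={
        "name": "huggingface/sentence-transformers/msmarco-distilbert-base-tas-b",
        "version": "1.0.1",
        "model_format": "TORCH_SCRIPT",
    },
)
print(resp.json())  # expected to return a task ID to poll for upload status
```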

Describe the solution you'd like

Publish pre-trained ML models to the OpenSearch artifact repo, just like the OpenSearch tarball and Data Prepper:

  • OpenSearch: https://artifacts.opensearch.org/releases/bundle/opensearch/2.3.0/opensearch-2.3.0-linux-x64.tar.gz
  • DataPrepper: https://artifacts.opensearch.org/data-prepper/1.5.1/opensearch-data-prepper-jdk-1.5.1-linux-x64.tar.gz

We can upload the model artifacts and use similar URLs. Currently we need to upload two files for each pre-trained model: config.json and a model binary zip file.

  • config.json contains the model configuration. It is small, just a few dozen lines.
  • The model binary zip file can range from tens of MB to hundreds of MB, generally less than 500 MB.
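
For illustration only, a config.json might look roughly like this; the field names below are assumptions, not a finalized schema:

```python
import json

# Illustrative sketch of a config.json; field names are assumptions,
# not a finalized schema from this issue.
config = {
    "name": "sentence-transformers/msmarco-distilbert-base-tas-b",
    "version": "1.0.1",
    "model_format": "ONNX",
    "model_config": {
        "model_type": "distilbert",
        "embedding_dimension": 768,
        "framework_type": "sentence_transformers",
    },
}

with open("config.json", "w") as f:
    json.dump(config, f, indent=2)
```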

```
https://artifacts.opensearch.org/models/ml-models/<model_source>/<model_name>/<version>/<model_format>/<model_artifact_zip>

# model_source: "huggingface" or "amazon"
# model_name: e.g. "sentence-transformers/msmarco-distilbert-base-tas-b"
# version: follows https://semver.org/, e.g. 1.0.0
# model_format: e.g. "sentence_transformers", "huggingface_transformers", "onnx"
# model_artifact_zip: model_name.replace("/", "_") + "-" + version + "-" + model_format + ".zip"
#   Version and model format are part of the file name so that multiple
#   versions/formats of the same model are easy to tell apart after download.
```

For example

https://artifacts.opensearch.org/models/ml-models/huggingface/sentence-transformers/msmarco-distilbert-base-tas-b/1.0.1/onnx/sentence-transformers_msmarco-distilbert-base-tas-b-1.0.1-onnx.zip

https://artifacts.opensearch.org/models/ml-models/huggingface/sentence-transformers/msmarco-distilbert-base-tas-b/1.0.1/onnx/config.json
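
A small sketch of the naming convention above, showing how both artifact URLs can be derived from the model coordinates:

```python
# Sketch of the proposed naming convention; derives both artifact URLs
# from the model coordinates used in this issue.
BASE = "https://artifacts.opensearch.org/models/ml-models"

def artifact_urls(model_source: str, model_name: str, version: str, model_format: str):
    zip_name = f'{model_name.replace("/", "_")}-{version}-{model_format}.zip'
    prefix = f"{BASE}/{model_source}/{model_name}/{version}/{model_format}"
    return f"{prefix}/{zip_name}", f"{prefix}/config.json"

model_url, config_url = artifact_urls(
    "huggingface",
    "sentence-transformers/msmarco-distilbert-base-tas-b",
    "1.0.1",
    "onnx",
)
# model_url reproduces the example above:
# .../huggingface/sentence-transformers/msmarco-distilbert-base-tas-b/1.0.1/onnx/
#     sentence-transformers_msmarco-distilbert-base-tas-b-1.0.1-onnx.zip
```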

Describe alternatives you've considered

Build a new public S3 repo for the models.

Additional context

No response

@ylwu-amzn added the enhancement and untriaged labels Sep 30, 2022
@peterzhuamazon self-assigned this Sep 30, 2022
@peterzhuamazon added the release, clients, and plugins labels and removed the untriaged, enhancement, and clients labels Sep 30, 2022
peterzhuamazon commented

I have had some initial discussions with @ylwu-amzn, and we concluded that manual publishing with 2PR would be the best solution to start with.

Considering the frequency at which they need to update the models, we can turn this into a Jenkinsfile as part of the universal build.

Thanks.


dblock commented Sep 30, 2022

Are those pre-trained models "built"? Meaning that a developer checks in model source, then some training happens, then it gets released? If that's the case, you definitely want an automated, reproducible process in GitHub CI that builds and releases the model. Don't build/train models on a developer machine! You can then open PRs automatically via GHA if 2PR is how you want to release.

@peterzhuamazon changed the title from "[RELEASE] Pre-Trained ML models" to "Release Pre-Trained ML models" Sep 30, 2022

ylwu-amzn commented Sep 30, 2022

Hi @dblock, thanks, good question.
Deep learning models generally need a lot of compute capacity to train. For example, a model may take hours (even days for very big models) to train on powerful GPU instances. The models are written in Python, but the model artifact we are uploading is not Python code: we need to export the model to a portable format, then add some config info to config.json. For the first phase, we are going to export several Hugging Face models into a portable format and upload them to the OpenSearch repo. Having talked with @peterzhuamazon, we can build a workflow that automatically pulls a model from the ml-commons S3 repo (we don't have that repo now; we will build one) when a new tag is created, then publishes it to the OpenSearch artifact repo.
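
As a rough sketch of the "export to a portable format" step, tracing a Hugging Face model to TorchScript could look like this. This is illustrative only; the actual export pipeline (and extra steps such as packaging the tokenizer) lives outside this issue:

```python
# Hedged sketch: trace a Hugging Face model to TorchScript as one
# possible portable format. Not the project's actual export pipeline.
import torch
from transformers import AutoModel, AutoTokenizer

model_id = "sentence-transformers/msmarco-distilbert-base-tas-b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id, torchscript=True)
model.eval()

# Trace with a representative input so the graph is recorded.
inputs = tokenizer("example query", return_tensors="pt")
traced = torch.jit.trace(model, (inputs["input_ids"], inputs["attention_mask"]))
torch.jit.save(traced, "msmarco-distilbert-base-tas-b.pt")
```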

Pre-trained models are trained on big, general datasets, not for a specific business domain. For example, an NLP model may be trained on public Wikipedia content, so its performance may not be the best for a user's business domain, like retail, but it should be good enough for users to try our new ML feature. Users may need to train their own model, or find a better one for their own business; that's out of our scope.


dblock commented Oct 3, 2022

Where are you training these big models? Where are these GPU instances coming from?
I think I am saying that training should happen via automation.

ylwu-amzn commented

> Where are you training these big models? Where are these GPU instances coming from? I think I am saying that training should happen via automation.

Some models are public Hugging Face pre-trained models. Our scientist team will also train some models. The GPU instances are in our internal AWS account.

> training should happen via automation.

Sorry, I didn't quite get your point. Yes, we can make model training automated. Actually, we are building a Python ML library to make the model training process easier. Users can use that Python code to train models outside of the OpenSearch cluster, and users should own their own model training infrastructure/pipeline. So training should happen via automation, but we just provide the tools (Python code); we don't own the whole automation workflow.

gaiksaya commented

> Are those pre-trained models "built"? Meaning that a developer checks in model source, then some training happens, then it gets released? If that's the case, you definitely want an automated, reproducible process in GitHub CI that builds and releases the model. Don't build/train models on a developer machine! You can then open PRs automatically via GHA if 2PR is how you want to release.

From offline conversations, this process is not automated today. @dblock @bbarani, can you take a look at @ylwu-amzn's response in the comment above before we move forward?
Thanks!


ylwu-amzn commented Jan 10, 2023

We have released the Python ML library, https://pypi.org/project/opensearch-py-ml/. Having discussed with @peterzhuamazon, it seems possible to automate tracing public Hugging Face models by leveraging the current Jenkins setup; we will put the model tracing logic in opensearch-py-ml. For Amazon pre-trained models, we will discuss with the scientist team whether automation is possible.
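
For context, tracing a public Hugging Face model through opensearch-py-ml could look roughly like this. The class and method names are taken from the opensearch-py-ml docs and may differ between versions; treat this as a sketch, not the finalized tracing logic:

```python
# Rough sketch using opensearch-py-ml; names and signatures are
# approximate, since the library API may change between releases.
from opensearch_py_ml.ml_models import SentenceTransformerModel

model = SentenceTransformerModel(
    model_id="sentence-transformers/msmarco-distilbert-base-tas-b",
    folder_path="./traced-model",
    overwrite=True,
)
# Trace to TorchScript and generate the matching model config.
zip_path = model.save_as_pt(sentences=["example passage for tracing"])
config_path = model.make_model_config_json()
```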

The Hugging Face models and our opensearch-py-ml repo are public and use the Apache-2.0 license. If we only support Hugging Face models in the first phase, and tracing is automated in opensearch-py-ml, should we still get security approval for every Hugging Face model we are going to publish? @bbarani


bbarani commented Jan 10, 2023

@ylwu-amzn we would like to make sure that the models are trained in a secure environment, as we will be publishing them as production artifacts to the community. We need to make sure that these models are not generated manually; rather, we need to automate both the training and the production upload process. A security review will help close out the gaps, not just with the models themselves but with how they are generated as well.

CEHENKLE commented

@ylwu-amzn I think we need a more fleshed-out proposal from your team about what you're trying to accomplish. This feels less like a repository for artifacts and more like a workflow/build-system addition. It also seems like there have been some offline conversations that I'd like to capture.

Thanks!

ylwu-amzn commented

For the Hugging Face models: these are not trained by our team. They are trained by other companies and published on Hugging Face under the Apache-2.0 license. You can think of it as similar to our OpenSearch GitHub repo.
I will ask the security team to review first.

@CEHENKLE, we will schedule a meeting once we have security approval.

gaiksaya commented

We have created a semi-automated workflow that takes the model path and version as input, then downloads, signs, and uploads the artifacts to artifacts.opensearch.org.
See the PR linked above.

@bbarani removed the plugins label Feb 10, 2023
gaiksaya commented

The models have been released. Closing this issue.
