
Release Pre-Trained ML models #2676

Closed
ylwu-amzn opened this issue Sep 30, 2022 · 12 comments

ylwu-amzn commented Sep 30, 2022

Is your feature request related to a problem? Please describe

We are going to support uploading deep learning models in ml-commons in 2.4; see details in opensearch-project/ml-commons#302. That feature requires the user to train a model outside the OpenSearch cluster and then upload it to the cluster via the ml-commons API. To make it easier for users to get started with this feature, we are going to publish some pre-trained models, so users don't need to train a model first: they just provide a model name and ml-commons loads it from the OpenSearch artifact repo.
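
For illustration, loading a pre-trained model by name alone might look something like the sketch below. The endpoint and request body are assumptions based on the upload API proposed in ml-commons#302, not a finalized interface:

```python
# Hypothetical sketch, not the finalized ml-commons API: register a
# pre-trained model by name and let ml-commons fetch the artifact from
# the OpenSearch artifact repo.
import requests

resp = requests.post(
    "http://localhost:9200/_plugins/_ml/models/_upload",
    json={
        "name": "huggingface/sentence-transformers/msmarco-distilbert-base-tas-b",
        "version": "1.0.1",
        "model_format": "TORCH_SCRIPT",
    },
)
print(resp.json())  # expected to return a task ID to poll for upload status
```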

Describe the solution you'd like

Publish pre-trained ML models to the OpenSearch artifact repo, just like the OpenSearch tarball and Data Prepper:

  • OpenSearch: https://artifacts.opensearch.org/releases/bundle/opensearch/2.3.0/opensearch-2.3.0-linux-x64.tar.gz
  • DataPrepper: https://artifacts.opensearch.org/data-prepper/1.5.1/opensearch-data-prepper-jdk-1.5.1-linux-x64.tar.gz

We can upload the model artifacts and use similar URLs. Currently we need to upload two files for each pre-trained model: config.json and a model binary zip file.

  • config.json contains the model configuration. It is small, just a few dozen lines.
  • The model binary zip file can range from tens of MB to hundreds of MB, generally less than 500 MB.
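
For illustration only, a config.json might look roughly like this; the field names below are assumptions, not a finalized schema:

```python
import json

# Illustrative sketch of a config.json; field names are assumptions,
# not a finalized schema from this issue.
config = {
    "name": "sentence-transformers/msmarco-distilbert-base-tas-b",
    "version": "1.0.1",
    "model_format": "ONNX",
    "model_config": {
        "model_type": "distilbert",
        "embedding_dimension": 768,
        "framework_type": "sentence_transformers",
    },
}

with open("config.json", "w") as f:
    json.dump(config, f, indent=2)
```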

```
https://artifacts.opensearch.org/models/ml-models/<model_source>/<model_name>/<version>/<model_format>/<model_artifact_zip>

# model_source: "huggingface" or "amazon"
# model_name: e.g. "sentence-transformers/msmarco-distilbert-base-tas-b"
# version: follows https://semver.org/, e.g. 1.0.0
# model_format: e.g. "sentence_transformers", "huggingface_transformers", "onnx"
# model_artifact_zip: model_name.replace("/", "_") + "-" + version + "-" + model_format + ".zip"
#   Version and model format are part of the file name so that multiple
#   versions/formats of the same model are easy to tell apart after download.
```

For example

https://artifacts.opensearch.org/models/ml-models/huggingface/sentence-transformers/msmarco-distilbert-base-tas-b/1.0.1/onnx/sentence-transformers_msmarco-distilbert-base-tas-b-1.0.1-onnx.zip

https://artifacts.opensearch.org/models/ml-models/huggingface/sentence-transformers/msmarco-distilbert-base-tas-b/1.0.1/onnx/config.json
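
A small sketch of the naming convention above, showing how both artifact URLs can be derived from the model coordinates:

```python
# Sketch of the proposed naming convention; derives both artifact URLs
# from the model coordinates used in this issue.
BASE = "https://artifacts.opensearch.org/models/ml-models"

def artifact_urls(model_source: str, model_name: str, version: str, model_format: str):
    zip_name = f'{model_name.replace("/", "_")}-{version}-{model_format}.zip'
    prefix = f"{BASE}/{model_source}/{model_name}/{version}/{model_format}"
    return f"{prefix}/{zip_name}", f"{prefix}/config.json"

model_url, config_url = artifact_urls(
    "huggingface",
    "sentence-transformers/msmarco-distilbert-base-tas-b",
    "1.0.1",
    "onnx",
)
# model_url reproduces the example above:
# .../huggingface/sentence-transformers/msmarco-distilbert-base-tas-b/1.0.1/onnx/
#     sentence-transformers_msmarco-distilbert-base-tas-b-1.0.1-onnx.zip
```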

Describe alternatives you've considered

Build a new public S3 repo for the models.

Additional context

No response

@ylwu-amzn added the enhancement and untriaged labels Sep 30, 2022
@peterzhuamazon self-assigned this Sep 30, 2022
@peterzhuamazon added the release, clients, and plugins labels and removed the untriaged, enhancement, and clients labels Sep 30, 2022
peterzhuamazon commented

I have had some initial discussions with @ylwu-amzn, and we concluded that manual publishing with 2PR would be the best solution to start with.

Considering the frequency at which they need to update the models, we can turn this into a Jenkinsfile as part of the universal build.

Thanks.


dblock commented Sep 30, 2022

Are those pre-trained models "built"? Meaning that a developer checks in model source, then some training happens, then it gets released? If that's the case, you definitely want an automated, reproducible process in GitHub CI that builds and releases the model. Don't build/train models on a developer machine! You can then open PRs automatically via GHA if 2PR is how you want to release.

@peterzhuamazon changed the title from "[RELEASE] Pre-Trained ML models" to "Release Pre-Trained ML models" Sep 30, 2022

ylwu-amzn commented Sep 30, 2022

Hi @dblock, thanks, good question.
Deep learning models generally need a lot of compute capacity to train. For example, a model may take hours (even days for very big models) to train on powerful GPU instances. The models are written in Python, but the model artifact we are uploading is not Python code: we need to export the model to a portable format, then add some config info to config.json. For the first phase, we are going to export several Hugging Face models into a portable format and upload them to the OpenSearch repo. Having talked with @peterzhuamazon, we can build a workflow that automatically pulls a model from the ml-commons S3 repo (we don't have that repo now; we will build one) when a new tag is created, then publishes it to the OpenSearch artifact repo.
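
As a rough sketch of the "export to a portable format" step, tracing a Hugging Face model to TorchScript could look like this. This is illustrative only; the actual export pipeline (and extra steps such as packaging the tokenizer) lives outside this issue:

```python
# Hedged sketch: trace a Hugging Face model to TorchScript as one
# possible portable format. Not the project's actual export pipeline.
import torch
from transformers import AutoModel, AutoTokenizer

model_id = "sentence-transformers/msmarco-distilbert-base-tas-b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id, torchscript=True)
model.eval()

# Trace with a representative input so the graph is recorded.
inputs = tokenizer("example query", return_tensors="pt")
traced = torch.jit.trace(model, (inputs["input_ids"], inputs["attention_mask"]))
torch.jit.save(traced, "msmarco-distilbert-base-tas-b.pt")
```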

Pre-trained models are trained on big, general datasets, not for a specific business domain. For example, an NLP model may be trained on public Wikipedia content, so its performance may not be the best for a user's business domain, like retail, but it should be good enough for users to try our new ML feature. Users may need to train their own model, or find a better one for their own business; that's out of our scope.


dblock commented Oct 3, 2022

Where are you training these big models? Where are these GPU instances coming from?
I think I am saying that training should happen via automation.

ylwu-amzn commented

> Where are you training these big models? Where are these GPU instances coming from? I think I am saying that training should happen via automation.

Some models are public Hugging Face pre-trained models. Our scientist team will also train some models. The GPU instances are in our internal AWS account.

> training should happen via automation.

Sorry, I didn't quite get your point. Yes, we can make model training automated. Actually, we are building a Python ML library to make the model training process easier. Users can use that Python code to train models outside of the OpenSearch cluster, and users should own their own model training infrastructure/pipeline. So training should happen via automation, but we just provide the tools (Python code); we don't own the whole automation workflow.

gaiksaya commented

> Are those pre-trained models "built"? Meaning that a developer checks in model source, then some training happens, then it gets released? If that's the case, you definitely want an automated, reproducible process in GitHub CI that builds and releases the model. Don't build/train models on a developer machine! You can then open PRs automatically via GHA if 2PR is how you want to release.

From offline conversations, this process is not automated today. @dblock @bbarani, can you take a look at @ylwu-amzn's response in the comment above before we move forward?
Thanks!


ylwu-amzn commented Jan 10, 2023

We have released the Python ML library, https://pypi.org/project/opensearch-py-ml/. Having discussed with @peterzhuamazon, it seems possible to automate tracing public Hugging Face models by leveraging the current Jenkins setup; we will put the model tracing logic in opensearch-py-ml. For Amazon pre-trained models, we will discuss with the scientist team whether automation is possible.
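
For context, tracing a public Hugging Face model through opensearch-py-ml could look roughly like this. The class and method names are taken from the opensearch-py-ml docs and may differ between versions; treat this as a sketch, not the finalized tracing logic:

```python
# Rough sketch using opensearch-py-ml; names and signatures are
# approximate, since the library API may change between releases.
from opensearch_py_ml.ml_models import SentenceTransformerModel

model = SentenceTransformerModel(
    model_id="sentence-transformers/msmarco-distilbert-base-tas-b",
    folder_path="./traced-model",
    overwrite=True,
)
# Trace to TorchScript and generate the matching model config.
zip_path = model.save_as_pt(sentences=["example passage for tracing"])
config_path = model.make_model_config_json()
```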

The Hugging Face models and our opensearch-py-ml repo are public and use the Apache-2.0 license. If we only support Hugging Face models in the first phase, and tracing is automated in opensearch-py-ml, should we still get security approval for every Hugging Face model we are going to publish? @bbarani


bbarani commented Jan 10, 2023

@ylwu-amzn we would like to make sure that the models are trained in a secure environment, as we will be publishing them as production artifacts to the community. We need to make sure that these models are not generated manually; rather, we need to automate both the training and the production upload process. A security review will help close out the gaps, not just with the models themselves but with how they are generated as well.

CEHENKLE commented

@ylwu-amzn I think we need a more fleshed-out proposal from your team about what you're trying to accomplish. This feels less like a repository for artifacts and more like a workflow/build-system addition. It also seems like there have been some offline conversations that I'd like to capture.

Thanks!

ylwu-amzn commented

For the Hugging Face models: these are not trained by our team. They are trained by other companies and published on Hugging Face under the Apache-2.0 license. You can think of it as similar to our OpenSearch GitHub repo.
I will ask the security team to review first.

@CEHENKLE, we will schedule a meeting once we have security approval.

gaiksaya commented

We have created a semi-automated workflow that takes the model path and version as input, then downloads, signs, and uploads the artifacts to artifacts.opensearch.org.
See the PR linked above.

@bbarani removed the plugins label Feb 10, 2023
gaiksaya commented

The models have been released. Closing this issue.
