Release Pre-Trained ML models #2676
Comments
I had some initial discussions with @ylwu-amzn and we concluded that manual publishing with 2PR would be the best solution to start with. Given the frequency at which they need to update the models, we can later turn this into a Jenkinsfile as part of the universal build. Thanks.
Are those pre-trained models "built"? Meaning that a developer checks in model source, then some training happens, then it gets released? If that's the case, you definitely want an automated, reproducible process in GitHub CI that builds and releases the model - don't build/train models on the developer machine! You can then open PRs automatically via GHA if 2PR is how you want to release.
Hi @dblock, thanks, good question. Pre-trained models are trained on large general data sets, not for a specific business domain. For example, an NLP model may be trained on public Wikipedia content, so its performance may not be the best for a user's business domain such as retail, but it should be good enough for users to try our new ML feature. Users may need to train their own model or find a better model for their own business; that's out of our scope.
Where are you training these big models? Where are these GPU instances coming from?
Some models are public Huggingface pre-trained models. Our scientist team will also train some models. The GPU instances are in our internal AWS account.
Sorry, I didn't quite get your point. Yes, we can automate the model training. Actually, we are building a Python ML library to make the model training process easier. Users can use that Python code to train models outside the OpenSearch cluster, and they should own their own model training infrastructure/pipeline. So training should happen via automation, but we just provide the tools (Python code); we don't own the whole automation workflow.
From an offline conversation, this process is not automated today. @dblock @bbarani, can you take a look at @ylwu-amzn's response in the comment above before we move forward?
We have released the Python ML library https://pypi.org/project/opensearch-py-ml/. I discussed with @peterzhuamazon, and it seems possible to automate tracing public Huggingface models by leveraging the current Jenkins setup; we will put the model tracing logic in "opensearch-py-ml". For Amazon pre-trained models, we will discuss with the scientist team whether automation is possible. The Huggingface models and our "opensearch-py-ml" repo are public and use the Apache-2.0 license. If we only support Huggingface models in the first phase and tracing is automated in "opensearch-py-ml", do we still need security approval for every Huggingface model we are going to publish? @bbarani
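For illustration, here is a minimal sketch of what tracing a public Huggingface sentence-transformer with opensearch-py-ml could look like. The `SentenceTransformerModel` helper and its method names are assumptions based on the library's public interface, not something specified in this thread; verify them against the installed version.

```python
# Hedged sketch: trace a public Huggingface sentence-transformer into a
# deployable zip plus a companion config.json. Assumes opensearch-py-ml
# exposes a SentenceTransformerModel helper with save_as_onnx and
# make_model_config_json; check the library docs before relying on this.
from opensearch_py_ml.ml_models import SentenceTransformerModel

model = SentenceTransformerModel(
    model_id="sentence-transformers/msmarco-distilbert-base-tas-b",
    folder_path="/tmp/models/msmarco-distilbert-base-tas-b",  # placeholder working dir
)

# Export the traced model as a zip artifact that ml-commons can consume.
model.save_as_onnx(model_id="sentence-transformers/msmarco-distilbert-base-tas-b")

# Generate the companion config.json describing the model.
model.make_model_config_json()
```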
@ylwu-amzn we would like to make sure that the models are trained in a secure environment, as we will be publishing them as production artifacts to the community. We need to make sure that these models are not generated manually; rather, we need to automate both the training and the production upload process. A security review will help close out the gaps not just with the models themselves but with the generation of these models as well.
@ylwu-amzn I think we need a more fleshed-out proposal from your team about what you're trying to accomplish. This feels less like a repository for artifacts and more like a workflow/build-system addition. It also seems like there have been some offline conversations that I'd like to capture. Thanks!
The Huggingface models are not trained by our team. They are trained by other companies and published on Huggingface under the Apache-2.0 license. You can consider it similar to our OpenSearch GitHub repo. @CEHENKLE will schedule a meeting once we have security approval.
We have created a semi-automated workflow that takes the model path and version as input, then downloads, signs, and uploads the model to artifacts.opensearch.org.
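A rough sketch of the shape such a workflow could take is below. The `sign_artifact` command and the destination bucket layout are hypothetical placeholders; the actual release tooling lives in the Jenkins jobs, not in this snippet.

```python
# Hedged sketch of a download -> sign -> upload flow. The signing command
# and destination prefix are hypothetical stand-ins for the real tooling.
import subprocess
import urllib.request
from pathlib import Path

def release_model(source_url: str, version: str, dest_prefix: str) -> None:
    artifact = Path(source_url.rsplit("/", 1)[-1])
    urllib.request.urlretrieve(source_url, artifact)  # download the model artifact
    subprocess.run(["sign_artifact", str(artifact)], check=True)  # hypothetical signing step
    subprocess.run(
        ["aws", "s3", "cp", str(artifact), f"{dest_prefix}/{version}/{artifact.name}"],
        check=True,
    )  # upload to the bucket backing artifacts.opensearch.org (assumed layout)
```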
The models have been released. Closing this issue. |
Is your feature request related to a problem? Please describe
We are going to support uploading deep learning models in ml-commons in 2.4; see details in opensearch-project/ml-commons#302. That feature requires users to train a model outside the OpenSearch cluster and then upload it to the cluster via the ml-commons API. To make it easy for users to start using this feature, we are going to publish some pre-trained models, so users don't need to train a model first; they just provide a model name and ml-commons will load it from the OpenSearch artifact repo.
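For context, uploading and then loading a model through the ml-commons REST API looks roughly like the sketch below. The endpoint paths reflect the 2.4-era `_upload`/`_load` APIs described in ml-commons#302; the host, credentials, and model metadata here are placeholders.

```python
# Hedged sketch of the 2.4-era ml-commons flow: register a model from a URL,
# then load it onto ML nodes. Host and auth are placeholders; the upload runs
# as an async task whose result yields the model_id used for loading.
import requests

HOST = "https://localhost:9200"  # placeholder cluster endpoint
AUTH = ("admin", "admin")        # placeholder credentials

upload = requests.post(
    f"{HOST}/_plugins/_ml/models/_upload",
    json={
        "name": "sentence-transformers/msmarco-distilbert-base-tas-b",
        "version": "1.0.1",
        "model_format": "ONNX",
        "model_config": {
            "model_type": "distilbert",
            "embedding_dimension": 768,
            "framework_type": "sentence_transformers",
        },
        "url": "https://artifacts.opensearch.org/models/ml-models/huggingface/"
               "sentence-transformers/msmarco-distilbert-base-tas-b/1.0.1/onnx/"
               "sentence-transformers_msmarco-distilbert-base-tas-b-1.0.1-onnx.zip",
    },
    auth=AUTH,
    verify=False,
)
task_id = upload.json()["task_id"]  # poll this task to obtain the model_id

# Once the upload task completes and yields a model_id, load the model:
# requests.post(f"{HOST}/_plugins/_ml/models/{model_id}/_load", auth=AUTH, verify=False)
```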
Describe the solution you'd like
Publish pre-trained ML models to the OpenSearch artifact repo, just like the OpenSearch tarball and Data Prepper:
https://artifacts.opensearch.org/releases/bundle/opensearch/2.3.0/opensearch-2.3.0-linux-x64.tar.gz
https://artifacts.opensearch.org/data-prepper/1.5.1/opensearch-data-prepper-jdk-1.5.1-linux-x64.tar.gz
We can upload model artifacts and use similar URLs. Currently we need to upload two files for each pre-trained model:

- `config.json`, which contains the model configuration; it is small, just a few dozen lines.
- The model binary zip file, which could be tens to hundreds of MB, generally less than 500 MB.

For example:
https://artifacts.opensearch.org/models/ml-models/huggingface/sentence-transformers/msmarco-distilbert-base-tas-b/1.0.1/onnx/sentence-transformers_msmarco-distilbert-base-tas-b-1.0.1-onnx.zip
https://artifacts.opensearch.org/models/ml-models/huggingface/sentence-transformers/msmarco-distilbert-base-tas-b/1.0.1/onnx/config.json
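As a sanity check, both published files can be fetched straight from those URLs; a minimal sketch follows. The config fields printed here are assumptions about what the published `config.json` contains, not guarantees.

```python
# Hedged sketch: fetch the published config.json and model zip for one
# pre-trained model and report their sizes. Uses only the public URLs above;
# the "name" and "model_format" keys are assumed fields, not verified ones.
import json
import os
import urllib.request

BASE = ("https://artifacts.opensearch.org/models/ml-models/huggingface/"
        "sentence-transformers/msmarco-distilbert-base-tas-b/1.0.1/onnx/")

with urllib.request.urlopen(BASE + "config.json") as resp:
    config = json.load(resp)  # small file: the model configuration
print(config.get("name"), config.get("model_format"))

zip_name = "sentence-transformers_msmarco-distilbert-base-tas-b-1.0.1-onnx.zip"
path, _ = urllib.request.urlretrieve(BASE + zip_name)
print(f"model zip: {os.path.getsize(path) / 2**20:.1f} MiB")  # typically well under 500 MB
```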
Describe alternatives you've considered
Build a new public S3 repo for models.
Additional context
No response