Init a sparse model auto tracing workflow. (#394)
* Init a sparse model auto tracing workflow.
Signed-off-by: conggguan <[email protected]>

* Change the minimum-approvals of the sparse model uploader to 2. Add some test cases. Remove some redundant lines.

Signed-off-by: conggguan <[email protected]>

* Fix some test cases.

Signed-off-by: conggguan <[email protected]>

* Remove the temp test jupyter notebook.

Signed-off-by: conggguan <[email protected]>

* Change the variable name of inner model, and optimize the license verification.

Signed-off-by: conggguan <[email protected]>

* Address some comments, and nox format.

Signed-off-by: conggguan <[email protected]>

* Fix a bug in NeuralSparseModel's init and remove a redundant save_pretrained.

Signed-off-by: conggguan <[email protected]>

* [Fix] Deleting some redundant code caused a test case failure; fixed it.

Signed-off-by: conggguan <[email protected]>

* [Style] Run nox -s format to make the formatting consistent.

Signed-off-by: conggguan <[email protected]>

* [Fix] Simplify the SparseEncodingModel and fix a bug when embedding multiple texts.

Signed-off-by: conggguan <[email protected]>

* [Fix] Make register_and_deploy_sparse_encoding_model return a proper list rather than a single map.

Signed-off-by: conggguan <[email protected]>

* [Fix] Fix a bug in register_and_deploy_sparse_encoding_model; it now generates the correct list of embeddings for the input texts.

Signed-off-by: conggguan <[email protected]>

* [Fix] Fix the sparse encoding model's test_check_required_fields test case.

Signed-off-by: conggguan <[email protected]>

* [Fix] Renamed an improper variable name.

Signed-off-by: conggguan <[email protected]>

* [Refactor] Add some comments and extract some constants to a new file.

Signed-off-by: conggguan <[email protected]>

* [Refactor] Simplify and reuse some code from model auto tracing.

Signed-off-by: conggguan <[email protected]>

* [Refactor] Simplify and reuse some code from model auto tracing.

Signed-off-by: conggguan <[email protected]>

* [Refactor] Add function comments and merge the sparse and dense model trace workflows.

Signed-off-by: conggguan <[email protected]>

* [Refactor] Merge the sparse and dense model's ci branch.

Signed-off-by: conggguan <[email protected]>

* [Refactor] Change for more common API, add a line of comments.

Signed-off-by: conggguan <[email protected]>

---------

Signed-off-by: conggguan <[email protected]>
conggguan authored Aug 2, 2024
1 parent ec7e023 commit 3b18ac8
Showing 16 changed files with 1,503 additions and 253 deletions.
20 changes: 17 additions & 3 deletions .ci/run-repository.sh
@@ -65,7 +65,7 @@ elif [[ "$TASK_TYPE" == "doc" ]]; then

docker cp opensearch-py-ml-doc-runner:/code/opensearch-py-ml/docs/build/ ./docs/
docker rm opensearch-py-ml-doc-runner
elif [[ "$TASK_TYPE" == "trace" ]]; then
elif [[ "$TASK_TYPE" == "SentenceTransformerTrace" || "$TASK_TYPE" == "SparseTrace" ]]; then
# Set up OpenSearch cluster & Run model autotracing (Invoked by model_uploader.yml workflow)
echo -e "\033[34;1mINFO:\033[0m MODEL_ID: ${MODEL_ID}\033[0m"
echo -e "\033[34;1mINFO:\033[0m MODEL_VERSION: ${MODEL_VERSION}\033[0m"
@@ -74,6 +74,17 @@ elif [[ "$TASK_TYPE" == "trace" ]]; then
echo -e "\033[34;1mINFO:\033[0m POOLING_MODE: ${POOLING_MODE:-N/A}\033[0m"
echo -e "\033[34;1mINFO:\033[0m MODEL_DESCRIPTION: ${MODEL_DESCRIPTION:-N/A}\033[0m"

if [[ "$TASK_TYPE" == "SentenceTransformerTrace" ]]; then
NOX_TRACE_TYPE="trace"
EXTRA_ARGS="-ed ${EMBEDDING_DIMENSION} -pm ${POOLING_MODE}"
elif [[ "$TASK_TYPE" == "SparseTrace" ]]; then
NOX_TRACE_TYPE="sparsetrace"
EXTRA_ARGS=""
else
echo "Unknown TASK_TYPE: $TASK_TYPE"
exit 1
fi

docker run \
--network=${network_name} \
--env "STACK_VERSION=${STACK_VERSION}" \
@@ -84,9 +95,12 @@ elif [[ "$TASK_TYPE" == "trace" ]]; then
--env "TEST_TYPE=server" \
--name opensearch-py-ml-trace-runner \
opensearch-project/opensearch-py-ml \
nox -s "trace-${PYTHON_VERSION}" -- ${MODEL_ID} ${MODEL_VERSION} ${TRACING_FORMAT} -ed ${EMBEDDING_DIMENSION} -pm ${POOLING_MODE} -md ${MODEL_DESCRIPTION:+"$MODEL_DESCRIPTION"}

nox -s "${NOX_TRACE_TYPE}-${PYTHON_VERSION}" -- ${MODEL_ID} ${MODEL_VERSION} ${TRACING_FORMAT} ${EXTRA_ARGS} -md ${MODEL_DESCRIPTION:+"$MODEL_DESCRIPTION"}

# To upload a model, we need the model artifact, description, license files into local path
# trace_output should include description and license file.
docker cp opensearch-py-ml-trace-runner:/code/opensearch-py-ml/upload/ ./upload/
docker cp opensearch-py-ml-trace-runner:/code/opensearch-py-ml/trace_output/ ./trace_output/
# Delete the docker container
docker rm opensearch-py-ml-trace-runner
fi
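
The `TASK_TYPE` branch above selects a nox session and the extra arguments passed to it. A minimal Python sketch of that dispatch follows, for illustration only: the function name `build_nox_command` and the sample model ID are hypothetical, and the handling of empty values is an assumption that imitates the shell's unquoted word splitting (empty variables vanish from the argument list).

```python
def build_nox_command(
    task_type,
    python_version,
    model_id,
    model_version,
    tracing_format,
    embedding_dimension="",
    pooling_mode="",
    model_description="",
):
    """Return the nox argv that the shell branch would assemble (hypothetical helper)."""
    if task_type == "SentenceTransformerTrace":
        session = "trace"
        extra_args = ["-ed", embedding_dimension, "-pm", pooling_mode]
    elif task_type == "SparseTrace":
        session = "sparsetrace"
        extra_args = []  # the sparse tracer takes no dimension or pooling flags
    else:
        raise ValueError(f"Unknown TASK_TYPE: {task_type}")

    # Unquoted shell expansion drops empty words, so drop empty strings here too.
    extra_args = [arg for arg in extra_args if arg]

    # ${MODEL_DESCRIPTION:+"$MODEL_DESCRIPTION"} expands to nothing when unset.
    description_args = ["-md"] + ([model_description] if model_description else [])

    return [
        "nox",
        "-s",
        f"{session}-{python_version}",
        "--",
        model_id,
        model_version,
        tracing_format,
        *extra_args,
        *description_args,
    ]


sparse_cmd = build_nox_command(
    "SparseTrace", "3.9", "example-org/example-sparse-model", "1.0.0", "TORCH_SCRIPT"
)
dense_cmd = build_nox_command(
    "SentenceTransformerTrace",
    "3.9",
    "example-org/example-dense-model",
    "1.0.0",
    "BOTH",
    embedding_dimension="384",
    pooling_mode="MEAN",
)
```

Keeping the session name and the extra flags in one branch means a new model type needs only one more `elif`, which matches how the shell script is structured.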
2 changes: 1 addition & 1 deletion .github/CODEOWNERS
@@ -1 +1 @@
* @dhrubo-os @greaa-aws @ylwu-amzn @b4sjoo @jngz-es @rbhavna
* @dhrubo-os @greaa-aws @ylwu-amzn @b4sjoo @jngz-es @rbhavna
29 changes: 19 additions & 10 deletions .github/workflows/model_uploader.yml
@@ -17,13 +17,21 @@ on:
required: true
type: string
tracing_format:
description: "Model format for auto-tracing (torch_script/onnx)"
description: "Model format for auto-tracing (torch_script/onnx), now the sparse model only support torchscript model."
required: true
type: choice
options:
- "BOTH"
- "TORCH_SCRIPT"
- "ONNX"
model_type:
description: "Model type for auto-tracing (SentenceTransformer/Sparse)"
required: true
type: choice
options:
- "SentenceTransformer"
- "Sparse"
default: "SentenceTransformer"
embedding_dimension:
description: "(Optional) Embedding Dimension (Specify here if it does not exist in original config.json file, or you want to overwrite it.)"
required: false
@@ -66,14 +74,14 @@ jobs:
run: |
model_id=${{ github.event.inputs.model_id }}
echo "model_folder=ml-models/${{github.event.inputs.model_source}}/${model_id}" >> $GITHUB_OUTPUT
echo "sentence_transformer_folder=ml-models/${{github.event.inputs.model_source}}/${model_id%%/*}/" >> $GITHUB_OUTPUT
echo "model_prefix_folder=ml-models/${{github.event.inputs.model_source}}/${model_id%%/*}/" >> $GITHUB_OUTPUT
- name: Initiate workflow_info
id: init_workflow_info
run: |
embedding_dimension=${{ github.event.inputs.embedding_dimension }}
pooling_mode=${{ github.event.inputs.pooling_mode }}
model_description="${{ github.event.inputs.model_description }}"
model_type=${{ github.event.inputs.model_type }}
workflow_info="
============= Workflow Details ==============
- Workflow Name: ${{ github.workflow }}
@@ -84,6 +92,7 @@ jobs:
========= Workflow Input Information =========
- Model ID: ${{ github.event.inputs.model_id }}
- Model Version: ${{ github.event.inputs.model_version }}
- Model Type: ${{ github.event.inputs.model_type }}
- Tracing Format: ${{ github.event.inputs.tracing_format }}
- Embedding Dimension: ${embedding_dimension:-N/A}
- Pooling Mode: ${pooling_mode:-N/A}
@@ -103,7 +112,7 @@ jobs:
echo "unverified=- [ ] :warning: The license cannot be verified. Please confirm by yourself that the model is licensed under Apache 2.0 :warning:" >> $GITHUB_OUTPUT
outputs:
model_folder: ${{ steps.init_folders.outputs.model_folder }}
sentence_transformer_folder: ${{ steps.init_folders.outputs.sentence_transformer_folder }}
model_prefix_folder: ${{ steps.init_folders.outputs.model_prefix_folder }}
workflow_info: ${{ steps.init_workflow_info.outputs.workflow_info }}
verified_license_line: ${{ steps.init_license_line.outputs.verified }}
unverified_license_line: ${{ steps.init_license_line.outputs.unverified }}
@@ -133,7 +142,7 @@ jobs:
if: github.event.inputs.allow_overwrite == 'NO' && (github.event.inputs.tracing_format == 'TORCH_SCRIPT' || github.event.inputs.tracing_format == 'BOTH')
run: |
TORCH_FILE_PATH=$(python utils/model_uploader/save_model_file_path_to_env.py \
${{ needs.init-workflow-var.outputs.sentence_transformer_folder }} ${{ github.event.inputs.model_id }} \
${{ needs.init-workflow-var.outputs.model_prefix_folder }} ${{ github.event.inputs.model_id }} \
${{ github.event.inputs.model_version }} TORCH_SCRIPT)
aws s3api head-object --bucket ${{ secrets.MODEL_BUCKET }} --key $TORCH_FILE_PATH > /dev/null 2>&1 || TORCH_MODEL_NOT_EXIST=true
if [[ -z $TORCH_MODEL_NOT_EXIST ]]
@@ -145,7 +154,7 @@ jobs:
if: github.event.inputs.allow_overwrite == 'NO' && (github.event.inputs.tracing_format == 'ONNX' || github.event.inputs.tracing_format == 'BOTH')
run: |
ONNX_FILE_PATH=$(python utils/model_uploader/save_model_file_path_to_env.py \
${{ needs.init-workflow-var.outputs.sentence_transformer_folder }} ${{ github.event.inputs.model_id }} \
${{ needs.init-workflow-var.outputs.model_prefix_folder }} ${{ github.event.inputs.model_id }} \
${{ github.event.inputs.model_version }} ONNX)
aws s3api head-object --bucket ${{ secrets.MODEL_BUCKET }} --key $ONNX_FILE_PATH > /dev/null 2>&1 || ONNX_MODEL_NOT_EXIST=true
if [[ -z $ONNX_MODEL_NOT_EXIST ]]
@@ -168,7 +177,7 @@ jobs:
cluster: ["opensearch"]
secured: ["true"]
entry:
- { opensearch_version: 2.7.0 }
- { opensearch_version: 2.11.0 }
steps:
- name: Checkout
uses: actions/checkout@v3
@@ -181,7 +190,7 @@ jobs:
echo "POOLING_MODE=${{ github.event.inputs.pooling_mode }}" >> $GITHUB_ENV
echo "MODEL_DESCRIPTION=${{ github.event.inputs.model_description }}" >> $GITHUB_ENV
- name: Autotracing ${{ matrix.cluster }} secured=${{ matrix.secured }} version=${{matrix.entry.opensearch_version}}
run: "./.ci/run-tests ${{ matrix.cluster }} ${{ matrix.secured }} ${{ matrix.entry.opensearch_version }} trace"
run: "./.ci/run-tests ${{ matrix.cluster }} ${{ matrix.secured }} ${{ matrix.entry.opensearch_version }} ${{github.event.inputs.model_type}}Trace"
- name: Limit Model Size to 2GB
run: |
upload_size_in_binary_bytes=$(ls -lR ./upload/ | awk '{ SUM += $5} END {print SUM}')
@@ -226,7 +235,7 @@ jobs:
- name: Dryrun model uploading
id: dryrun_model_uploading
run: |
dryrun_output=$(aws s3 sync ./upload/ s3://${{ secrets.MODEL_BUCKET }}/${{ needs.init-workflow-var.outputs.sentence_transformer_folder }} --dryrun \
dryrun_output=$(aws s3 sync ./upload/ s3://${{ secrets.MODEL_BUCKET }}/${{ needs.init-workflow-var.outputs.model_prefix_folder }} --dryrun \
| sed 's|s3://${{ secrets.MODEL_BUCKET }}/|s3://(MODEL_BUCKET)/|'
)
echo "dryrun_output<<EOF" >> $GITHUB_OUTPUT
@@ -301,7 +310,7 @@ jobs:
- name: Copy Files to the Bucket
id: copying_to_bucket
run: |
aws s3 sync ./upload/ s3://${{ secrets.MODEL_BUCKET }}/${{ needs.init-workflow-var.outputs.sentence_transformer_folder }}
aws s3 sync ./upload/ s3://${{ secrets.MODEL_BUCKET }}/${{ needs.init-workflow-var.outputs.model_prefix_folder }}
echo "upload_time=$(TZ='America/Los_Angeles' date "+%Y-%m-%d %T")" >> $GITHUB_OUTPUT
outputs:
upload_time: ${{ steps.copying_to_bucket.outputs.upload_time }}
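
The workflow derives two S3 folder names from `model_id` and builds the task type by appending `Trace` to the `model_type` input. The sketch below restates that logic in Python under stated assumptions: the function names, the `huggingface` model source, and the model ID are hypothetical, and `split("/", 1)[0]` is used as the equivalent of the shell expansion `${model_id%%/*}`.

```python
def init_folders(model_source, model_id):
    """Rebuild the two folder outputs of the init_folders step (hypothetical helper)."""
    # ${model_id%%/*} strips everything from the first "/" onward.
    prefix = model_id.split("/", 1)[0]
    return {
        "model_folder": f"ml-models/{model_source}/{model_id}",
        "model_prefix_folder": f"ml-models/{model_source}/{prefix}/",
    }


def task_type_for(model_type):
    """Mirror `${{github.event.inputs.model_type}}Trace` from the run-tests call."""
    return f"{model_type}Trace"


folders = init_folders("huggingface", "example-org/example-sparse-model")
sparse_task = task_type_for("Sparse")
dense_task = task_type_for("SentenceTransformer")
```

The rename from `sentence_transformer_folder` to `model_prefix_folder` does not change the computed value; it only stops the name from implying a specific model family.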
2 changes: 1 addition & 1 deletion CHANGELOG.md
@@ -14,7 +14,7 @@ Inspired from [Keep a Changelog](https://keepachangelog.com/en/1.0.0/)
- Add support for model profiles by @rawwar in ([#358](https://github.com/opensearch-project/opensearch-py-ml/pull/358))
- Support for security default admin credential changes in 2.12.0 in ([#365](https://github.com/opensearch-project/opensearch-py-ml/pull/365))
- adding cross encoder models in the pre-trained traced list ([#378](https://github.com/opensearch-project/opensearch-py-ml/pull/378))

- Add workflows and scripts for sparse encoding model tracing and uploading process by @conggguan in ([#394](https://github.com/opensearch-project/opensearch-py-ml/pull/394))

### Changed
- Modify ml-models.JenkinsFile so that it takes model format into account and can be triggered with generic webhook by @thanawan-atc in ([#211](https://github.com/opensearch-project/opensearch-py-ml/pull/211))
17 changes: 17 additions & 0 deletions noxfile.py
@@ -166,3 +166,20 @@ def trace(session):
"utils/model_uploader/model_autotracing.py",
*(session.posargs),
)


@nox.session(python=["3.9"])
def sparsetrace(session):
session.install(
"-r",
"requirements-dev.txt",
"--timeout",
"1500",
)
session.install(".")

session.run(
"python",
"utils/model_uploader/sparse_model_autotracing.py",
*(session.posargs),
)
8 changes: 7 additions & 1 deletion opensearch_py_ml/ml_commons/ml_common_utils.py
@@ -11,7 +11,7 @@
MODEL_CHUNK_MAX_SIZE = 10_000_000
MODEL_MAX_SIZE = 4_000_000_000
BUF_SIZE = 65536 # lets read stuff in 64kb chunks!
TIMEOUT = 120 # timeout for synchronous method calls in seconds
TIMEOUT = 240 # timeout for synchronous method calls in seconds
META_API_ENDPOINT = "models/meta"
MODEL_NAME_FIELD = "name"
MODEL_VERSION_FIELD = "version"
@@ -24,6 +24,12 @@
FRAMEWORK_TYPE = "framework_type"
MODEL_CONTENT_HASH_VALUE = "model_content_hash_value"
MODEL_GROUP_ID = "model_group_id"
MODEL_FUNCTION_NAME = "function_name"
MODEL_TASK_TYPE = "model_task_type"
# URL of the license file for the OpenSearch project
LICENSE_URL = "https://github.com/opensearch-project/opensearch-py-ml/raw/main/LICENSE"
# Name of the function used for sparse encoding
SPARSE_ENCODING_FUNCTION_NAME = "SPARSE_ENCODING"


def _generate_model_content_hash_value(model_file_path: str) -> str:
18 changes: 18 additions & 0 deletions opensearch_py_ml/ml_commons/ml_commons_client.py
@@ -498,6 +498,24 @@ def get_model_info(self, model_id: str) -> object:
url=API_URL,
)

def generate_model_inference(self, model_id: str, request_body: dict) -> object:
"""
Generates inference result for the given input using the specified request body.
:param model_id: Unique ID of the model.
:type model_id: string
:param request_body: Request body to send to the API.
:type request_body: dict
:return: Returns a JSON object `inference_results` containing the results for the given input.
:rtype: object
"""
API_URL = f"{ML_BASE_URI}/models/{model_id}/_predict/"
return self._client.transport.perform_request(
method="POST",
url=API_URL,
body=request_body,
)

def generate_embedding(self, model_id: str, sentences: List[str]) -> object:
"""
This method return embedding for given sentences (using ml commons _predict api)
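
The new `generate_model_inference` method forwards an arbitrary request body to the model's `_predict` endpoint. The sketch below shows the shape of that call without a running cluster. It is not the library's client: `MiniMLCommonClient`, `FakeOpenSearch`, and `RecordingTransport` are hypothetical stand-ins, the value of `ML_BASE_URI` is assumed, and the `text_docs` request body is an assumed payload for a sparse encoding model.

```python
ML_BASE_URI = "/_plugins/_ml"  # assumed value of the library constant


class RecordingTransport:
    """Hypothetical stand-in for the OpenSearch transport; records each request."""

    def __init__(self):
        self.calls = []

    def perform_request(self, method, url, body=None):
        self.calls.append((method, url, body))
        # A real cluster would return model output here; this is a placeholder.
        return {"inference_results": []}


class FakeOpenSearch:
    """Hypothetical client object exposing only the `transport` attribute."""

    def __init__(self):
        self.transport = RecordingTransport()


class MiniMLCommonClient:
    """Trimmed-down illustration containing only the method added in this diff."""

    def __init__(self, os_client):
        self._client = os_client

    def generate_model_inference(self, model_id, request_body):
        api_url = f"{ML_BASE_URI}/models/{model_id}/_predict/"
        return self._client.transport.perform_request(
            method="POST",
            url=api_url,
            body=request_body,
        )


client = MiniMLCommonClient(FakeOpenSearch())
response = client.generate_model_inference(
    "hypothetical-model-id",
    {"text_docs": ["hello world"]},  # assumed request shape
)
recorded = client._client.transport.calls
```

Because the method takes the request body as-is, the same call can serve model types whose payloads differ, which is presumably why it sits alongside the more specific `generate_embedding`.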
9 changes: 8 additions & 1 deletion opensearch_py_ml/ml_commons/model_uploader.py
@@ -22,9 +22,11 @@
MODEL_CONTENT_HASH_VALUE,
MODEL_CONTENT_SIZE_IN_BYTES_FIELD,
MODEL_FORMAT_FIELD,
MODEL_FUNCTION_NAME,
MODEL_GROUP_ID,
MODEL_MAX_SIZE,
MODEL_NAME_FIELD,
MODEL_TASK_TYPE,
MODEL_TYPE,
MODEL_VERSION_FIELD,
TOTAL_CHUNKS_FIELD,
@@ -167,6 +169,7 @@ def _check_mandatory_field(self, model_meta: dict) -> bool:
"""

if model_meta:

if not model_meta.get(MODEL_NAME_FIELD):
raise ValueError(f"{MODEL_NAME_FIELD} can not be empty")
if not model_meta.get(MODEL_VERSION_FIELD):
@@ -178,7 +181,11 @@ def _check_mandatory_field(self, model_meta: dict) -> bool:
if not model_meta.get(TOTAL_CHUNKS_FIELD):
raise ValueError(f"{TOTAL_CHUNKS_FIELD} can not be empty")
if not model_meta.get(MODEL_CONFIG_FIELD):
raise ValueError(f"{MODEL_CONFIG_FIELD} can not be empty")
if (
model_meta.get(MODEL_FUNCTION_NAME) != "SPARSE_ENCODING"
and model_meta.get(MODEL_TASK_TYPE) != "SPARSE_ENCODING"
):
raise ValueError(f"{MODEL_CONFIG_FIELD} can not be empty")
else:
if not isinstance(model_meta.get(MODEL_CONFIG_FIELD), dict):
raise TypeError(
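
The change to `_check_mandatory_field` lets sparse encoding models omit the model config while still rejecting a missing config for other models. The standalone sketch below isolates that rule. Indentation is lost in the diff as shown here, so the nesting is an assumption; `check_model_config` is a hypothetical name, and the value `"model_config"` for `MODEL_CONFIG_FIELD` is assumed since the diff does not show it.

```python
MODEL_CONFIG_FIELD = "model_config"  # assumed; not shown in the diff
MODEL_FUNCTION_NAME = "function_name"
MODEL_TASK_TYPE = "model_task_type"
SPARSE_ENCODING = "SPARSE_ENCODING"


def check_model_config(model_meta):
    """Validate only the model-config part of the metadata (hypothetical helper)."""
    config = model_meta.get(MODEL_CONFIG_FIELD)
    if not config:
        # A missing config is tolerated only when either field marks the
        # model as sparse encoding.
        is_sparse = (
            model_meta.get(MODEL_FUNCTION_NAME) == SPARSE_ENCODING
            or model_meta.get(MODEL_TASK_TYPE) == SPARSE_ENCODING
        )
        if not is_sparse:
            raise ValueError(f"{MODEL_CONFIG_FIELD} can not be empty")
    elif not isinstance(config, dict):
        raise TypeError(f"{MODEL_CONFIG_FIELD} is expecting to be an object")
    return True


sparse_ok = check_model_config({"function_name": "SPARSE_ENCODING"})
dense_ok = check_model_config({"model_config": {"embedding_dimension": 384}})
```

The `or` in `is_sparse` is the De Morgan equivalent of the diff's `!= ... and != ...` condition: the error is raised only when neither field names sparse encoding.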
3 changes: 2 additions & 1 deletion opensearch_py_ml/ml_models/__init__.py
@@ -7,5 +7,6 @@

from .metrics_correlation.mcorr import MCorr
from .sentencetransformermodel import SentenceTransformerModel
from .sparse_encoding_model import SparseEncodingModel

__all__ = ["SentenceTransformerModel", "MCorr"]
__all__ = ["SentenceTransformerModel", "MCorr", "SparseEncodingModel"]
117 changes: 117 additions & 0 deletions opensearch_py_ml/ml_models/base_models.py
@@ -0,0 +1,117 @@
# SPDX-License-Identifier: Apache-2.0
# The OpenSearch Contributors require contributions made to
# this file be licensed under the Apache-2.0 license or a
# compatible open source license.
# Any modifications Copyright OpenSearch Contributors. See
# GitHub history for details.
import json
import os
from abc import ABC, abstractmethod
from zipfile import ZipFile

import requests

from opensearch_py_ml.ml_commons.ml_common_utils import (
LICENSE_URL,
SPARSE_ENCODING_FUNCTION_NAME,
)


class BaseUploadModel(ABC):
"""
A base class for uploading models to OpenSearch pretrained model hub.
"""

def __init__(
self, model_id: str, folder_path: str = None, overwrite: bool = False
) -> None:
self.model_id = model_id
self.folder_path = folder_path
self.overwrite = overwrite

@abstractmethod
def save_as_pt(self, *args, **kwargs):
pass

@abstractmethod
def save_as_onnx(self, *args, **kwargs):
pass

@abstractmethod
def make_model_config_json(
self,
version_number: str,
model_format: str,
description: str,
) -> str:
pass

def _fill_null_truncation_field(
self,
save_json_folder_path: str,
max_length: int,
) -> None:
"""
Fill truncation field in tokenizer.json when it is null
:param save_json_folder_path:
path to save model json file, e.g., "home/save_pre_trained_model_json/"
:type save_json_folder_path: string
:param max_length:
maximum sequence length for model
:type max_length: int
:return: no return value expected
:rtype: None
"""
tokenizer_file_path = os.path.join(save_json_folder_path, "tokenizer.json")
with open(tokenizer_file_path) as user_file:
parsed_json = json.load(user_file)
if "truncation" not in parsed_json or parsed_json["truncation"] is None:
parsed_json["truncation"] = {
"direction": "Right",
"max_length": max_length,
"strategy": "LongestFirst",
"stride": 0,
}
with open(tokenizer_file_path, "w") as file:
json.dump(parsed_json, file, indent=2)

def _add_apache_license_to_model_zip_file(self, model_zip_file_path: str):
"""
Add Apache-2.0 license file to the model zip file at model_zip_file_path
:param model_zip_file_path:
Path to the model zip file
:type model_zip_file_path: string
:return: no return value expected
:rtype: None
"""
r = requests.get(LICENSE_URL)
assert r.status_code == 200, "Failed to add license file to the model zip file"

with ZipFile(str(model_zip_file_path), "a") as zipObj:
zipObj.writestr("LICENSE", r.content)


class SparseModel(BaseUploadModel, ABC):
"""
Class for autotracing the Sparse Encoding model.
"""

def __init__(
self,
model_id: str,
folder_path: str = "./model_files/",
overwrite: bool = False,
):
super().__init__(model_id, folder_path, overwrite)
self.model_id = model_id
self.folder_path = folder_path
self.overwrite = overwrite
self.function_name = SPARSE_ENCODING_FUNCTION_NAME

def pre_process(self):
pass

def post_process(self):
pass

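
`_fill_null_truncation_field` patches `tokenizer.json` so that a missing or null `truncation` entry gets a default. The sketch below reproduces that behavior as a free function against a temporary directory. It assumes the file is rewritten only when the field was filled (the diff's indentation does not make this clear), and the function name and the sample tokenizer content are hypothetical.

```python
import json
import os
import tempfile


def fill_null_truncation_field(save_json_folder_path, max_length):
    """Fill `truncation` in tokenizer.json when absent or null; return the parsed JSON."""
    tokenizer_file_path = os.path.join(save_json_folder_path, "tokenizer.json")
    with open(tokenizer_file_path) as user_file:
        parsed_json = json.load(user_file)
    if parsed_json.get("truncation") is None:
        parsed_json["truncation"] = {
            "direction": "Right",
            "max_length": max_length,
            "strategy": "LongestFirst",
            "stride": 0,
        }
        # Assumption: write back only when something changed.
        with open(tokenizer_file_path, "w") as file:
            json.dump(parsed_json, file, indent=2)
    return parsed_json


with tempfile.TemporaryDirectory() as folder:
    # Hypothetical minimal tokenizer file with a null truncation field.
    with open(os.path.join(folder, "tokenizer.json"), "w") as f:
        json.dump({"version": "1.0", "truncation": None}, f)

    patched = fill_null_truncation_field(folder, max_length=512)

    # Re-read from disk to confirm the change was persisted.
    with open(os.path.join(folder, "tokenizer.json")) as f:
        on_disk = json.load(f)
```

Hosting this helper on `BaseUploadModel` lets both the dense and sparse model classes share it instead of each carrying its own copy.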