Init a sparse model auto tracing workflow. (#394)

* Init a sparse model auto tracing workflow.
  Signed-off-by: conggguan <[email protected]>
* Change the minimum-approvals of sparse model uploader to 2. Add some test cases. Remove some redundant lines.
  Signed-off-by: conggguan <[email protected]>
* Fix some test cases.
  Signed-off-by: conggguan <[email protected]>
* Remove the temp test jupyter notebook.
  Signed-off-by: conggguan <[email protected]>
* Change the variable name of inner model, and optimize the license verification.
  Signed-off-by: conggguan <[email protected]>
* Address some comments, and nox format.
  Signed-off-by: conggguan <[email protected]>
* Fix a bug for NeuralSparseModel's init. And remove a redundant save_pretrained.
  Signed-off-by: conggguan <[email protected]>
* [Fix] Fix a test case failure caused by deleting some redundant code.
  Signed-off-by: conggguan <[email protected]>
* [Style] Run nox -s format to make formatting consistent.
  Signed-off-by: conggguan <[email protected]>
* [Fix] Simplify the SparseEncodingModel and fix a bug for multiple texts embeddings.
  Signed-off-by: conggguan <[email protected]>
* [Fix] Make register_and_deploy_sparse_encoding_model return a proper list rather than a single map.
  Signed-off-by: conggguan <[email protected]>
* [Fix] Fix a bug in register_and_deploy_sparse_encoding_model; it now generates the correct list of embeddings for the input texts.
  Signed-off-by: conggguan <[email protected]>
* [Fix] Fix the sparse encoding model's test_check_required_fields test case.
  Signed-off-by: conggguan <[email protected]>
* [Fix] Renamed an improper variable name.
  Signed-off-by: conggguan <[email protected]>
* [Refactor] Add some comments and extract some constants to a new file.
  Signed-off-by: conggguan <[email protected]>
* [Refactor] Simplify and reuse some code from model auto tracing.
  Signed-off-by: conggguan <[email protected]>
* [Refactor] Simplify and reuse some code from model auto tracing.
  Signed-off-by: conggguan <[email protected]>
* [Refactor] Add function comments and merge the sparse model trace workflow with the dense one.
  Signed-off-by: conggguan <[email protected]>
* [Refactor] Merge the sparse and dense model's ci branch.
  Signed-off-by: conggguan <[email protected]>
* [Refactor] Change for more common API, add a line of comments.
  Signed-off-by: conggguan <[email protected]>

---------

Signed-off-by: conggguan <[email protected]>
opensearch-project · Aug 2, 2024 · 3b18ac8
1 parent ec7e023
commit 3b18ac8
Showing 16 changed files with 1,503 additions and 253 deletions.
diff --git a/.ci/run-repository.sh b/.ci/run-repository.sh
@@ -65,7 +65,7 @@ elif [[ "$TASK_TYPE" == "doc" ]]; then
 
   docker cp opensearch-py-ml-doc-runner:/code/opensearch-py-ml/docs/build/ ./docs/
   docker rm opensearch-py-ml-doc-runner
-elif [[ "$TASK_TYPE" == "trace" ]]; then
+elif [[ "$TASK_TYPE" == "SentenceTransformerTrace" || "$TASK_TYPE" == "SparseTrace" ]]; then
   # Set up OpenSearch cluster & Run model autotracing (Invoked by model_uploader.yml workflow)
   echo -e "\033[34;1mINFO:\033[0m MODEL_ID: ${MODEL_ID}\033[0m"
   echo -e "\033[34;1mINFO:\033[0m MODEL_VERSION: ${MODEL_VERSION}\033[0m"
@@ -74,6 +74,17 @@ elif [[ "$TASK_TYPE" == "trace" ]]; then
   echo -e "\033[34;1mINFO:\033[0m POOLING_MODE: ${POOLING_MODE:-N/A}\033[0m"
   echo -e "\033[34;1mINFO:\033[0m MODEL_DESCRIPTION: ${MODEL_DESCRIPTION:-N/A}\033[0m"
 
+  if [[ "$TASK_TYPE" == "SentenceTransformerTrace" ]]; then
+      NOX_TRACE_TYPE="trace"
+      EXTRA_ARGS="-ed ${EMBEDDING_DIMENSION} -pm ${POOLING_MODE}"
+  elif [[ "$TASK_TYPE" == "SparseTrace" ]]; then
+      NOX_TRACE_TYPE="sparsetrace"
+      EXTRA_ARGS=""
+  else
+      echo "Unknown TASK_TYPE: $TASK_TYPE"
+      exit 1
+  fi
+
   docker run \
   --network=${network_name} \
   --env "STACK_VERSION=${STACK_VERSION}" \
@@ -84,9 +95,12 @@ elif [[ "$TASK_TYPE" == "trace" ]]; then
   --env "TEST_TYPE=server" \
   --name opensearch-py-ml-trace-runner \
   opensearch-project/opensearch-py-ml \
-  nox -s "trace-${PYTHON_VERSION}" -- ${MODEL_ID} ${MODEL_VERSION} ${TRACING_FORMAT} -ed ${EMBEDDING_DIMENSION} -pm ${POOLING_MODE} -md ${MODEL_DESCRIPTION:+"$MODEL_DESCRIPTION"}
-
+  nox -s "${NOX_TRACE_TYPE}-${PYTHON_VERSION}" -- ${MODEL_ID} ${MODEL_VERSION} ${TRACING_FORMAT} ${EXTRA_ARGS} -md ${MODEL_DESCRIPTION:+"$MODEL_DESCRIPTION"}
+
+  # To upload a model, copy the model artifact, description, and license files to a local path.
+  # trace_output should include the description and license files.
   docker cp opensearch-py-ml-trace-runner:/code/opensearch-py-ml/upload/ ./upload/
   docker cp opensearch-py-ml-trace-runner:/code/opensearch-py-ml/trace_output/ ./trace_output/
+  # Remove the trace-runner container
   docker rm opensearch-py-ml-trace-runner
 fi
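
The `TASK_TYPE` branch above can be read as a small lookup from task type to nox session name plus extra arguments. The following Python sketch restates that shell logic for illustration only; `resolve_trace_command` is a hypothetical helper that does not exist in the repository, and the `.split()` call is an approximation of the shell's unquoted word splitting.

```python
def resolve_trace_command(
    task_type: str,
    python_version: str,
    embedding_dimension: str = "",
    pooling_mode: str = "",
):
    """Hypothetical restatement of the TASK_TYPE branch in .ci/run-repository.sh."""
    if task_type == "SentenceTransformerTrace":
        session = "trace"
        # Unquoted shell expansion drops empty values, which .split() approximates.
        extra_args = f"-ed {embedding_dimension} -pm {pooling_mode}".split()
    elif task_type == "SparseTrace":
        session = "sparsetrace"
        extra_args = []  # the sparse session takes no dimension or pooling flags
    else:
        raise ValueError(f"Unknown TASK_TYPE: {task_type}")
    # The nox session name is "<session>-<python version>", e.g. "sparsetrace-3.9".
    return f"{session}-{python_version}", extra_args
```

The workflow input `model_type` (`SentenceTransformer` or `Sparse`) is suffixed with `Trace` in model_uploader.yml, which is how these two task type strings arise.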
diff --git a/.github/CODEOWNERS b/.github/CODEOWNERS
@@ -1 +1 @@
-*   @dhrubo-os  @greaa-aws @ylwu-amzn @b4sjoo @jngz-es @rbhavna
+* @dhrubo-os @greaa-aws @ylwu-amzn @b4sjoo @jngz-es @rbhavna
diff --git a/.github/workflows/model_uploader.yml b/.github/workflows/model_uploader.yml
@@ -17,13 +17,21 @@ on:
         required: true
         type: string
       tracing_format:
-        description: "Model format for auto-tracing (torch_script/onnx)"
+        description: "Model format for auto-tracing (torch_script/onnx). Sparse models currently support only torch_script."
         required: true
         type: choice
         options:
         - "BOTH"
         - "TORCH_SCRIPT"
         - "ONNX"
+      model_type:
+        description: "Model type for auto-tracing (SentenceTransformer/Sparse)"
+        required: true
+        type: choice
+        options:
+          - "SentenceTransformer"
+          - "Sparse"
+        default: "SentenceTransformer"
       embedding_dimension:
         description: "(Optional) Embedding Dimension (Specify here if it does not exist in original config.json file, or you want to overwrite it.)"
         required: false
@@ -66,14 +74,14 @@ jobs:
       run: |
         model_id=${{ github.event.inputs.model_id }}
         echo "model_folder=ml-models/${{github.event.inputs.model_source}}/${model_id}" >> $GITHUB_OUTPUT
-        echo "sentence_transformer_folder=ml-models/${{github.event.inputs.model_source}}/${model_id%%/*}/" >> $GITHUB_OUTPUT
+        echo "model_prefix_folder=ml-models/${{github.event.inputs.model_source}}/${model_id%%/*}/" >> $GITHUB_OUTPUT
     - name: Initiate workflow_info
       id: init_workflow_info
       run: |
         embedding_dimension=${{ github.event.inputs.embedding_dimension }}
         pooling_mode=${{ github.event.inputs.pooling_mode }}
         model_description="${{ github.event.inputs.model_description }}"
-        
+        model_type=${{ github.event.inputs.model_type }}
         workflow_info="
         ============= Workflow Details ==============
         - Workflow Name: ${{ github.workflow }}
@@ -84,6 +92,7 @@ jobs:
         ========= Workflow Input Information =========
         - Model ID: ${{ github.event.inputs.model_id }}
         - Model Version: ${{ github.event.inputs.model_version }}
+        - Model Type: ${{ github.event.inputs.model_type }}
         - Tracing Format: ${{ github.event.inputs.tracing_format }}
         - Embedding Dimension: ${embedding_dimension:-N/A}
         - Pooling Mode: ${pooling_mode:-N/A}
@@ -103,7 +112,7 @@ jobs:
         echo "unverified=- [ ]  :warning: The license cannot be verified. Please confirm by yourself that the model is licensed under Apache 2.0  :warning:" >> $GITHUB_OUTPUT
     outputs:
       model_folder: ${{ steps.init_folders.outputs.model_folder }}
-      sentence_transformer_folder: ${{ steps.init_folders.outputs.sentence_transformer_folder }}
+      model_prefix_folder: ${{ steps.init_folders.outputs.model_prefix_folder }}
       workflow_info: ${{ steps.init_workflow_info.outputs.workflow_info }}
       verified_license_line: ${{ steps.init_license_line.outputs.verified }}
       unverified_license_line: ${{ steps.init_license_line.outputs.unverified }}
@@ -133,7 +142,7 @@ jobs:
         if: github.event.inputs.allow_overwrite == 'NO' && (github.event.inputs.tracing_format == 'TORCH_SCRIPT' || github.event.inputs.tracing_format == 'BOTH')
         run: |
           TORCH_FILE_PATH=$(python utils/model_uploader/save_model_file_path_to_env.py \
-              ${{ needs.init-workflow-var.outputs.sentence_transformer_folder }} ${{ github.event.inputs.model_id }} \
+              ${{ needs.init-workflow-var.outputs.model_prefix_folder }} ${{ github.event.inputs.model_id }} \
               ${{ github.event.inputs.model_version }} TORCH_SCRIPT)
           aws s3api head-object --bucket ${{ secrets.MODEL_BUCKET }} --key $TORCH_FILE_PATH > /dev/null 2>&1 || TORCH_MODEL_NOT_EXIST=true
           if [[ -z $TORCH_MODEL_NOT_EXIST ]]
@@ -145,7 +154,7 @@ jobs:
         if: github.event.inputs.allow_overwrite == 'NO' && (github.event.inputs.tracing_format == 'ONNX' || github.event.inputs.tracing_format == 'BOTH')
         run: |
           ONNX_FILE_PATH=$(python utils/model_uploader/save_model_file_path_to_env.py \
-            ${{ needs.init-workflow-var.outputs.sentence_transformer_folder }} ${{ github.event.inputs.model_id }} \
+            ${{ needs.init-workflow-var.outputs.model_prefix_folder }} ${{ github.event.inputs.model_id }} \
             ${{ github.event.inputs.model_version }} ONNX)
           aws s3api head-object --bucket ${{ secrets.MODEL_BUCKET }} --key $ONNX_FILE_PATH > /dev/null 2>&1 || ONNX_MODEL_NOT_EXIST=true
           if [[ -z $ONNX_MODEL_NOT_EXIST ]]
@@ -168,7 +177,7 @@ jobs:
         cluster: ["opensearch"]
         secured: ["true"]
         entry:
-          - { opensearch_version: 2.7.0 }
+          - { opensearch_version: 2.11.0 }
     steps:
       - name: Checkout
         uses: actions/checkout@v3
@@ -181,7 +190,7 @@ jobs:
           echo "POOLING_MODE=${{ github.event.inputs.pooling_mode }}" >> $GITHUB_ENV     
           echo "MODEL_DESCRIPTION=${{ github.event.inputs.model_description }}" >> $GITHUB_ENV     
       - name: Autotracing ${{ matrix.cluster }} secured=${{ matrix.secured }} version=${{matrix.entry.opensearch_version}}
-        run: "./.ci/run-tests ${{ matrix.cluster }} ${{ matrix.secured }} ${{ matrix.entry.opensearch_version }} trace"
+        run: "./.ci/run-tests ${{ matrix.cluster }} ${{ matrix.secured }} ${{ matrix.entry.opensearch_version }} ${{github.event.inputs.model_type}}Trace"
       - name: Limit Model Size to 2GB
         run: |
           upload_size_in_binary_bytes=$(ls -lR ./upload/ | awk '{ SUM += $5} END {print SUM}')
@@ -226,7 +235,7 @@ jobs:
       - name: Dryrun model uploading
         id: dryrun_model_uploading
         run: |
-          dryrun_output=$(aws s3 sync ./upload/ s3://${{ secrets.MODEL_BUCKET }}/${{ needs.init-workflow-var.outputs.sentence_transformer_folder }} --dryrun \
+          dryrun_output=$(aws s3 sync ./upload/ s3://${{ secrets.MODEL_BUCKET }}/${{ needs.init-workflow-var.outputs.model_prefix_folder }} --dryrun \
             | sed 's|s3://${{ secrets.MODEL_BUCKET }}/|s3://(MODEL_BUCKET)/|' 
           )
           echo "dryrun_output<<EOF" >> $GITHUB_OUTPUT
@@ -301,7 +310,7 @@ jobs:
       - name: Copy Files to the Bucket
         id: copying_to_bucket
         run: |
-          aws s3 sync ./upload/ s3://${{ secrets.MODEL_BUCKET }}/${{ needs.init-workflow-var.outputs.sentence_transformer_folder }}
+          aws s3 sync ./upload/ s3://${{ secrets.MODEL_BUCKET }}/${{ needs.init-workflow-var.outputs.model_prefix_folder }}
           echo "upload_time=$(TZ='America/Los_Angeles' date "+%Y-%m-%d %T")" >> $GITHUB_OUTPUT
     outputs:
       upload_time: ${{ steps.copying_to_bucket.outputs.upload_time }}

diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -14,7 +14,7 @@ Inspired from [Keep a Changelog](https://keepachangelog.com/en/1.0.0/)
 - Add support for model profiles by @rawwar in ([#358](https://github.com/opensearch-project/opensearch-py-ml/pull/358))
 - Support for security default admin credential changes in 2.12.0 in ([#365](https://github.com/opensearch-project/opensearch-py-ml/pull/365))
 - adding cross encoder models in the pre-trained traced list ([#378](https://github.com/opensearch-project/opensearch-py-ml/pull/378))
-
+- Add workflows and scripts for sparse encoding model tracing and uploading process by @conggguan in ([#394](https://github.com/opensearch-project/opensearch-py-ml/pull/394))
 
 ### Changed
 - Modify ml-models.JenkinsFile so that it takes model format into account and can be triggered with generic webhook by @thanawan-atc in ([#211](https://github.com/opensearch-project/opensearch-py-ml/pull/211))

diff --git a/noxfile.py b/noxfile.py
@@ -166,3 +166,20 @@ def trace(session):
         "utils/model_uploader/model_autotracing.py",
         *(session.posargs),
     )
+
+
+@nox.session(python=["3.9"])
+def sparsetrace(session):
+    session.install(
+        "-r",
+        "requirements-dev.txt",
+        "--timeout",
+        "1500",
+    )
+    session.install(".")
+
+    session.run(
+        "python",
+        "utils/model_uploader/sparse_model_autotracing.py",
+        *(session.posargs),
+    )
diff --git a/opensearch_py_ml/ml_commons/ml_common_utils.py b/opensearch_py_ml/ml_commons/ml_common_utils.py
@@ -11,7 +11,7 @@
 MODEL_CHUNK_MAX_SIZE = 10_000_000
 MODEL_MAX_SIZE = 4_000_000_000
 BUF_SIZE = 65536  # lets read stuff in 64kb chunks!
-TIMEOUT = 120  # timeout for synchronous method calls in seconds
+TIMEOUT = 240  # timeout for synchronous method calls in seconds
 META_API_ENDPOINT = "models/meta"
 MODEL_NAME_FIELD = "name"
 MODEL_VERSION_FIELD = "version"
@@ -24,6 +24,12 @@
 FRAMEWORK_TYPE = "framework_type"
 MODEL_CONTENT_HASH_VALUE = "model_content_hash_value"
 MODEL_GROUP_ID = "model_group_id"
+MODEL_FUNCTION_NAME = "function_name"
+MODEL_TASK_TYPE = "model_task_type"
+# URL of the license file for the OpenSearch project
+LICENSE_URL = "https://github.com/opensearch-project/opensearch-py-ml/raw/main/LICENSE"
+# Name of the function used for sparse encoding
+SPARSE_ENCODING_FUNCTION_NAME = "SPARSE_ENCODING"
 
 
 def _generate_model_content_hash_value(model_file_path: str) -> str:

diff --git a/opensearch_py_ml/ml_commons/ml_commons_client.py b/opensearch_py_ml/ml_commons/ml_commons_client.py
@@ -498,6 +498,24 @@ def get_model_info(self, model_id: str) -> object:
             url=API_URL,
         )
 
+    def generate_model_inference(self, model_id: str, request_body: dict) -> object:
+        """
+        Generates inference result for the given input using the specified request body.
+
+        :param model_id: Unique ID of the model.
+        :type model_id: string
+        :param request_body: Request body to send to the API.
+        :type request_body: dict
+        :return: Returns a JSON object `inference_results` containing the results for the given input.
+        :rtype: object
+        """
+        API_URL = f"{ML_BASE_URI}/models/{model_id}/_predict/"
+        return self._client.transport.perform_request(
+            method="POST",
+            url=API_URL,
+            body=request_body,
+        )
+
     def generate_embedding(self, model_id: str, sentences: List[str]) -> object:
         """
         This method return embedding for given sentences (using ml commons _predict api)

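
For orientation, here is a minimal sketch of the request that `generate_model_inference` assembles. It makes two assumptions that this diff does not confirm: that `ML_BASE_URI` equals `"/_plugins/_ml"`, and that a sparse encoding model accepts a `text_docs` list in the request body. `build_predict_request` is a hypothetical helper, not part of the client.

```python
ML_BASE_URI = "/_plugins/_ml"  # assumed value of the client's base URI constant


def build_predict_request(model_id: str, texts: list) -> dict:
    """Hypothetical helper: describe the HTTP call without sending it."""
    return {
        "method": "POST",
        # Same URL pattern as generate_model_inference in the diff above.
        "url": f"{ML_BASE_URI}/models/{model_id}/_predict/",
        # Assumed body shape for a sparse encoding model.
        "body": {"text_docs": list(texts)},
    }


request = build_predict_request("my-model-id", ["hello world", "today is sunny"])
```

With a connected client, the same body would be passed as `ml_client.generate_model_inference("my-model-id", request["body"])`, where `ml_client` stands for an already configured `MLCommonClient` instance.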
diff --git a/opensearch_py_ml/ml_commons/model_uploader.py b/opensearch_py_ml/ml_commons/model_uploader.py
@@ -22,9 +22,11 @@
     MODEL_CONTENT_HASH_VALUE,
     MODEL_CONTENT_SIZE_IN_BYTES_FIELD,
     MODEL_FORMAT_FIELD,
+    MODEL_FUNCTION_NAME,
     MODEL_GROUP_ID,
     MODEL_MAX_SIZE,
     MODEL_NAME_FIELD,
+    MODEL_TASK_TYPE,
     MODEL_TYPE,
     MODEL_VERSION_FIELD,
     TOTAL_CHUNKS_FIELD,
@@ -167,6 +169,7 @@ def _check_mandatory_field(self, model_meta: dict) -> bool:
         """
 
         if model_meta:
+
             if not model_meta.get(MODEL_NAME_FIELD):
                 raise ValueError(f"{MODEL_NAME_FIELD} can not be empty")
             if not model_meta.get(MODEL_VERSION_FIELD):
@@ -178,7 +181,11 @@ def _check_mandatory_field(self, model_meta: dict) -> bool:
             if not model_meta.get(TOTAL_CHUNKS_FIELD):
                 raise ValueError(f"{TOTAL_CHUNKS_FIELD} can not be empty")
             if not model_meta.get(MODEL_CONFIG_FIELD):
-                raise ValueError(f"{MODEL_CONFIG_FIELD} can not be empty")
+                if (
+                    model_meta.get(MODEL_FUNCTION_NAME) != "SPARSE_ENCODING"
+                    and model_meta.get(MODEL_TASK_TYPE) != "SPARSE_ENCODING"
+                ):
+                    raise ValueError(f"{MODEL_CONFIG_FIELD} can not be empty")
             else:
                 if not isinstance(model_meta.get(MODEL_CONFIG_FIELD), dict):
                     raise TypeError(

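
The relaxed check above means an empty model config is rejected only when neither `function_name` nor `model_task_type` marks the model as sparse encoding. A standalone sketch of that predicate follows; `model_config_required` is a hypothetical name, and the field names are copied from the constants added in ml_common_utils.py.

```python
SPARSE_ENCODING = "SPARSE_ENCODING"


def model_config_required(model_meta: dict) -> bool:
    """Return True when a missing model config should raise ValueError."""
    return (
        model_meta.get("function_name") != SPARSE_ENCODING
        and model_meta.get("model_task_type") != SPARSE_ENCODING
    )


dense_meta = {"name": "dense-model", "version": "1.0.0"}
sparse_meta = {"name": "sparse-model", "function_name": "SPARSE_ENCODING"}
```

Either field alone is enough to waive the requirement, because the two inequality tests are joined with `and`.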
diff --git a/opensearch_py_ml/ml_models/__init__.py b/opensearch_py_ml/ml_models/__init__.py
@@ -7,5 +7,6 @@
 
 from .metrics_correlation.mcorr import MCorr
 from .sentencetransformermodel import SentenceTransformerModel
+from .sparse_encoding_model import SparseEncodingModel
 
-__all__ = ["SentenceTransformerModel", "MCorr"]
+__all__ = ["SentenceTransformerModel", "MCorr", "SparseEncodingModel"]
diff --git a/opensearch_py_ml/ml_models/base_models.py b/opensearch_py_ml/ml_models/base_models.py
@@ -0,0 +1,117 @@
+# SPDX-License-Identifier: Apache-2.0
+# The OpenSearch Contributors require contributions made to
+# this file be licensed under the Apache-2.0 license or a
+# compatible open source license.
+# Any modifications Copyright OpenSearch Contributors. See
+# GitHub history for details.
+import json
+import os
+from abc import ABC, abstractmethod
+from zipfile import ZipFile
+
+import requests
+
+from opensearch_py_ml.ml_commons.ml_common_utils import (
+    LICENSE_URL,
+    SPARSE_ENCODING_FUNCTION_NAME,
+)
+
+
+class BaseUploadModel(ABC):
+    """
+    A base class for uploading models to the OpenSearch pretrained model hub.
+    """
+
+    def __init__(
+        self, model_id: str, folder_path: str = None, overwrite: bool = False
+    ) -> None:
+        self.model_id = model_id
+        self.folder_path = folder_path
+        self.overwrite = overwrite
+
+    @abstractmethod
+    def save_as_pt(self, *args, **kwargs):
+        pass
+
+    @abstractmethod
+    def save_as_onnx(self, *args, **kwargs):
+        pass
+
+    @abstractmethod
+    def make_model_config_json(
+        self,
+        version_number: str,
+        model_format: str,
+        description: str,
+    ) -> str:
+        pass
+
+    def _fill_null_truncation_field(
+        self,
+        save_json_folder_path: str,
+        max_length: int,
+    ) -> None:
+        """
+        Fill truncation field in tokenizer.json when it is null
+
+        :param save_json_folder_path:
+             path to save model json file, e.g., "home/save_pre_trained_model_json/"
+        :type save_json_folder_path: string
+        :param max_length:
+             maximum sequence length for model
+        :type max_length: int
+        :return: no return value expected
+        :rtype: None
+        """
+        tokenizer_file_path = os.path.join(save_json_folder_path, "tokenizer.json")
+        with open(tokenizer_file_path) as user_file:
+            parsed_json = json.load(user_file)
+        if "truncation" not in parsed_json or parsed_json["truncation"] is None:
+            parsed_json["truncation"] = {
+                "direction": "Right",
+                "max_length": max_length,
+                "strategy": "LongestFirst",
+                "stride": 0,
+            }
+            with open(tokenizer_file_path, "w") as file:
+                json.dump(parsed_json, file, indent=2)
+
+    def _add_apache_license_to_model_zip_file(self, model_zip_file_path: str):
+        """
+        Add Apache-2.0 license file to the model zip file at model_zip_file_path
+
+        :param model_zip_file_path:
+            Path to the model zip file
+        :type model_zip_file_path: string
+        :return: no return value expected
+        :rtype: None
+        """
+        r = requests.get(LICENSE_URL)
+        assert r.status_code == 200, "Failed to add license file to the model zip file"
+
+        with ZipFile(str(model_zip_file_path), "a") as zipObj:
+            zipObj.writestr("LICENSE", r.content)
+
+
+class SparseModel(BaseUploadModel, ABC):
+    """
+    Class for autotracing the Sparse Encoding model.
+    """
+
+    def __init__(
+        self,
+        model_id: str,
+        folder_path: str = "./model_files/",
+        overwrite: bool = False,
+    ):
+        super().__init__(model_id, folder_path, overwrite)
+        self.model_id = model_id
+        self.folder_path = folder_path
+        self.overwrite = overwrite
+        self.function_name = SPARSE_ENCODING_FUNCTION_NAME
+
+    def pre_process(self):
+        pass
+
+    def post_process(self):
+        pass
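
To show what `_fill_null_truncation_field` does to a tokenizer file, the sketch below reproduces its logic as a standalone function and runs it against a temporary `tokenizer.json`. The function name `fill_null_truncation_field`, the `demo` helper, and the sample file contents are illustrative assumptions; only the truncation dictionary is taken from the diff.

```python
import json
import os
import tempfile


def fill_null_truncation_field(save_json_folder_path: str, max_length: int) -> None:
    """Standalone copy of the method's logic, for illustration only."""
    tokenizer_file_path = os.path.join(save_json_folder_path, "tokenizer.json")
    with open(tokenizer_file_path) as user_file:
        parsed_json = json.load(user_file)
    # Rewrite the file only when the truncation field is missing or null.
    if parsed_json.get("truncation") is None:
        parsed_json["truncation"] = {
            "direction": "Right",
            "max_length": max_length,
            "strategy": "LongestFirst",
            "stride": 0,
        }
        with open(tokenizer_file_path, "w") as file:
            json.dump(parsed_json, file, indent=2)


def demo() -> dict:
    """Write a sample tokenizer.json with a null truncation field and patch it."""
    with tempfile.TemporaryDirectory() as folder:
        path = os.path.join(folder, "tokenizer.json")
        with open(path, "w") as file:
            json.dump({"version": "1.0", "truncation": None}, file)
        fill_null_truncation_field(folder, max_length=512)
        with open(path) as file:
            return json.load(file)


patched = demo()
```

Other keys in the file are left untouched, and a tokenizer that already defines `truncation` is not modified.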