Add cross encoder support #1615

Merged: 18 commits, Dec 7, 2023

Conversation

HenryL27
Collaborator

@HenryL27 HenryL27 commented Nov 10, 2023

Description

Adds support for (Hugging Face) cross encoders to ml-commons. Uses a new function name (TEXT_SIMILARITY), which takes as input a query text and a list of text docs and returns one 1-dimensional tensor per (query, doc) pair representing the similarity of the items in that pair. E.g.

{
  "query_text": "today is sunny",
  "text_docs": [
    "today is sunny",
    "today is july fifth",
    "it is winter"
  ]
}

yields

{
  "inference_results": [
    {
      "output": [
        {
          "name": "logits",
          "data_type": "FLOAT32",
          "shape": [
            1
          ],
          "data": [
            10.939743
          ],
          "byte_buffer": {
            "array": "MAkvQQ==",
            "order": "LITTLE_ENDIAN"
          }
        }
      ]
    },
    {
      "output": [
        {
          "name": "logits",
          "data_type": "FLOAT32",
          "shape": [
            1
          ],
          "data": [
            -6.067284
          ],
          "byte_buffer": {
            "array": "MSfCwA==",
            "order": "LITTLE_ENDIAN"
          }
        }
      ]
    },
    {
      "output": [
        {
          "name": "logits",
          "data_type": "FLOAT32",
          "shape": [
            1
          ],
          "data": [
            -11.261627
          ],
          "byte_buffer": {
            "array": "oC80wQ==",
            "order": "LITTLE_ENDIAN"
          }
        }
      ]
    }
  ]
}
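
To make the response shape concrete, here is a minimal Python sketch that reads the scores back out. It assumes each `byte_buffer.array` is the base64 encoding of the same float32 values listed in `data`, in the stated byte order; decoding the three buffers above does reproduce the three `data` values. The helper names `decode_float32_buffer` and `rank_documents` are hypothetical and not part of ml-commons.

```python
import base64
import struct


def decode_float32_buffer(array_b64, order="LITTLE_ENDIAN"):
    """Decode a base64 byte buffer into a list of float32 values."""
    raw = base64.b64decode(array_b64)
    prefix = "<" if order == "LITTLE_ENDIAN" else ">"
    count = len(raw) // 4  # four bytes per float32
    return list(struct.unpack(f"{prefix}{count}f", raw))


def rank_documents(docs, response):
    """Pair each doc with its score and sort from most to least similar.

    Assumes one inference result per doc, in the same order as text_docs,
    with a single output tensor holding a single score.
    """
    scores = [
        result["output"][0]["data"][0]
        for result in response["inference_results"]
    ]
    return sorted(zip(docs, scores), key=lambda pair: pair[1], reverse=True)
```

With the example above, `rank_documents` would put "today is sunny" first and "it is winter" last, since a higher logit means a closer match to the query.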

This used the model cross-encoder/ms-marco-TinyBERT-L-2-v2. The config I used to upload it looked like this (model_name, hash_value, and cfg are variables in the upload script):

{
  "name": model_name,
  "version": "1.0.0",
  "description": "Cross Encoder text similarity model",
  "model_format": "TORCH_SCRIPT",
  "function_name": "TEXT_SIMILARITY",
  "model_content_hash_value": hash_value,
  "model_config": {
    "model_type": "bert",
    "embedding_dimension": 1,
    "framework_type": "huggingface_transformers",
    "all_config": cfg.to_json_string(),
  }
}
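
The config above leaves `model_name`, `hash_value`, and `cfg` undefined. The following is a rough Python sketch of how they might be filled in, under the assumption that `model_content_hash_value` is the SHA-256 hex digest of the model zip file (the 64-character hash in the request further down this thread is consistent with that, but the PR does not state it). The helpers `sha256_of_file` and `build_upload_config` are hypothetical names, and `all_config_json` stands in for whatever `cfg.to_json_string()` returned.

```python
import hashlib


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_upload_config(model_name, zip_path, all_config_json):
    """Assemble the upload config shown above for a cross encoder model."""
    return {
        "name": model_name,
        "version": "1.0.0",
        "description": "Cross Encoder text similarity model",
        "model_format": "TORCH_SCRIPT",
        "function_name": "TEXT_SIMILARITY",
        # Assumption: the hash is SHA-256 over the zipped model file.
        "model_content_hash_value": sha256_of_file(zip_path),
        "model_config": {
            "model_type": "bert",
            "embedding_dimension": 1,
            "framework_type": "huggingface_transformers",
            "all_config": all_config_json,
        },
    }
```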

Issues Resolved

Check List

  • [x] New functionality includes testing.
    • [x] All tests pass
  • [ ] New functionality has been documented.
    • [ ] New functionality has javadoc added
  • [x] Commits are signed per the DCO using --signoff

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

codecov bot commented Nov 10, 2023

Codecov Report

Attention: 12 lines in your changes are missing coverage. Please review.

Comparison is base (df644ff) 80.83% compared to head (be113ed) 80.98%.
Report is 7 commits behind head on main.

| Files | Patch % | Lines |
|---|---|---|
| ...rch/ml/common/input/nlp/TextSimilarityMLInput.java | 86.95% | 2 Missing and 4 partials ⚠️ |
| ...n/java/org/opensearch/ml/common/input/MLInput.java | 66.66% | 3 Missing and 2 partials ⚠️ |
| ...in/java/org/opensearch/ml/common/FunctionName.java | 66.66% | 1 Missing ⚠️ |
Additional details and impacted files
@@             Coverage Diff              @@
##               main    #1615      +/-   ##
============================================
+ Coverage     80.83%   80.98%   +0.15%     
- Complexity     4215     4246      +31     
============================================
  Files           404      408       +4     
  Lines         16977    17122     +145     
  Branches       1818     1835      +17     
============================================
+ Hits          13723    13867     +144     
+ Misses         2539     2534       -5     
- Partials        715      721       +6     
Flag Coverage Δ
ml-commons 80.98% <91.83%> (+0.15%) ⬆️


Collaborator

@austintlee austintlee left a comment

One minor question, but overall looks great!

austintlee
austintlee previously approved these changes Nov 16, 2023
Signed-off-by: HenryL27 <[email protected]>
@HenryL27 HenryL27 had a problem deploying to ml-commons-cicd-env December 6, 2023 23:52 — with GitHub Actions Failure
@HenryL27 HenryL27 temporarily deployed to ml-commons-cicd-env December 6, 2023 23:52 — with GitHub Actions Inactive
@HenryL27 HenryL27 temporarily deployed to ml-commons-cicd-env December 6, 2023 23:52 — with GitHub Actions Inactive
@HenryL27 HenryL27 had a problem deploying to ml-commons-cicd-env December 6, 2023 23:52 — with GitHub Actions Failure
@HenryL27 HenryL27 requested a review from dhrubo-os December 6, 2023 23:52
@dhrubo-os
Collaborator

Thanks for working on this. Approved.

Collaborator

@austintlee austintlee left a comment

LGTM (with one minor question. You can answer and resolve.)

@HenryL27 HenryL27 merged commit 2761d7d into opensearch-project:main Dec 7, 2023
8 of 12 checks passed
opensearch-trigger-bot bot pushed a commit that referenced this pull request Dec 7, 2023
* add text similarity inputs and function name

Signed-off-by: HenryL27 <[email protected]>

* add text similarity cross encoder model

Signed-off-by: HenryL27 <[email protected]>

* add text similarity unit tests

Signed-off-by: HenryL27 <[email protected]>

* add text similarity input unittests

Signed-off-by: HenryL27 <[email protected]>

* add text similarity dataset unittests

Signed-off-by: HenryL27 <[email protected]>

* add function name annotation

Signed-off-by: HenryL27 <[email protected]>

* refactor API to use single query

Signed-off-by: HenryL27 <[email protected]>

* omit private from class vars

Co-authored-by: Navneet Verma <[email protected]>
Signed-off-by: HenryL27 <[email protected]>

* change output name from logits to similarity

Signed-off-by: HenryL27 <[email protected]>

* hashify isDLModel

Signed-off-by: HenryL27 <[email protected]>

* add error message for non-torchscript cross encoders

Signed-off-by: HenryL27 <[email protected]>

* allow onnx, actually.

Signed-off-by: HenryL27 <[email protected]>

* apply spotless after rebase

Signed-off-by: HenryL27 <[email protected]>

* add unittest for new mlinput toXcontent clause

Signed-off-by: HenryL27 <[email protected]>

* static DLModels

Signed-off-by: HenryL27 <[email protected]>

* add tests and error message tweaks

Signed-off-by: HenryL27 <[email protected]>

* name test models w framework

Signed-off-by: HenryL27 <[email protected]>

* change pt->torch_script

Signed-off-by: HenryL27 <[email protected]>

---------

Signed-off-by: HenryL27 <[email protected]>
Co-authored-by: Navneet Verma <[email protected]>
(cherry picked from commit 2761d7d)
dhrubo-os pushed a commit that referenced this pull request Dec 7, 2023

Co-authored-by: HenryL27 <[email protected]>
@martin-gaievski
Member

@HenryL27 can you please share the details of the meta config for the ms-marco-TinyBERT-L-2-v2 model?
I'm using the following request but I'm getting errors; probably some param is missing:

POST /_plugins/_ml/models/meta
{
    "name": "ms-marco-TinyBERT-L-2-v2",
    "version": "1.0.0",
    "function_name": "TEXT_SIMILARITY",
    "description": "test model",
    "model_format": "TORCH_SCRIPT",
    "model_group_id": "<MODEL_GROUP_ID>",
    "model_content_hash_value": "90e39a926101d1a4e542aade0794319404689b12acfd5d7e65c03d91c668b5cf",
    "model_config": {
        "model_type": "bert",
        "embedding_dimension": 1,
        "framework_type": "huggingface_transformers",
        "all_config": "{\"total_chunks\":2,\"is_hidden\":false}"
    },
    "url": "https://github.com/opensearch-project/ml-commons/blob/main/ml-algorithms/src/test/resources/org/opensearch/ml/engine/algorithms/text_similarity/TinyBERT-CE-torch_script.zip?raw=true"
}

error response:

        "type": "illegal_argument_exception",
        "reason": "total chunks field is null"

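One thing visible in the request itself: `total_chunks` appears only inside the `all_config` string, not as a top-level field of the body. Whether the meta endpoint expects it at the top level is an assumption here and is not confirmed anywhere in this thread, but the error text would be consistent with that. A small Python sketch of a pre-flight check for this situation follows; `nested_only_fields` is a hypothetical helper, not an ml-commons API.

```python
import json


def nested_only_fields(body, required=("total_chunks",)):
    """List required fields absent at top level but present in all_config.

    Assumption: the endpoint reads these fields from the top level of the
    request body rather than from the serialized all_config string.
    """
    all_config = body.get("model_config", {}).get("all_config", "{}")
    nested = json.loads(all_config)
    return [
        name for name in required
        if name not in body and name in nested
    ]
```

For the request above, this check would return `["total_chunks"]`, flagging that the value exists but only inside `all_config`.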
austintlee pushed a commit to austintlee/ml-commons that referenced this pull request Feb 29, 2024