[Backport to 2.x]Add tokenizer and sparse encoding (#1301) (#1393) #1398

zane-neo
Collaborator

  • add tokenizer and sparse encoding

Signed-off-by: xinyual [email protected]

  • add tokenizer and sparse encoding

Signed-off-by: xinyual [email protected]

  • add tokenizer and sparse encoding

Signed-off-by: xinyual [email protected]

  • add tokenizer and sparse encoding

Signed-off-by: xinyual [email protected]

  • add tokenizer and sparse encoding

Signed-off-by: xinyual [email protected]

  • remove special token

Signed-off-by: xinyual [email protected]

  • add filter

Signed-off-by: xinyual [email protected]

  • try empty model

Signed-off-by: xinyual [email protected]

  • remove warm up

Signed-off-by: xinyual [email protected]

  • try empty model

Signed-off-by: xinyual [email protected]

  • add block

Signed-off-by: xinyual [email protected]

  • add log

Signed-off-by: xinyual [email protected]

  • add log

Signed-off-by: xinyual [email protected]

  • add log

Signed-off-by: xinyual [email protected]

  • remove log

Signed-off-by: xinyual [email protected]

  • remove pt file detect

Signed-off-by: xinyual [email protected]

  • add log

Signed-off-by: xinyual [email protected]

  • add functionName pipeline

Signed-off-by: xinyual [email protected]

  • remove verify log

Signed-off-by: xinyual [email protected]

  • skip special token in sparse encoding

Signed-off-by: xinyual [email protected]

  • skip omit tokenize config

Signed-off-by: xinyual [email protected]

  • skip omit tokenize config-change warm up logic

Signed-off-by: xinyual [email protected]

  • reArch

Signed-off-by: xinyual [email protected]

  • deduplicate

Signed-off-by: xinyual [email protected]

  • omit ml config in sparse encoding

Signed-off-by: xinyual [email protected]

  • add null config in warm up

Signed-off-by: xinyual [email protected]

  • fix original test

Signed-off-by: xinyual [email protected]

  • add tokenize ut half

Signed-off-by: xinyual [email protected]

  • fix sparse encoding bug

Signed-off-by: xinyual [email protected]

  • add UT for sparse encoding and tokenize

Signed-off-by: xinyual [email protected]

  • remove useless framework type

Signed-off-by: xinyual [email protected]

  • common/src/test/java/org/opensearch/ml/common/input/MLInputTest.java

Signed-off-by: xinyual [email protected]

  • change key for tokenize

Signed-off-by: xinyual [email protected]

  • reArch DLModel

Signed-off-by: xinyual [email protected]

  • reArch DLModel again

Signed-off-by: xinyual [email protected]

  • response format

Signed-off-by: xinyual [email protected]

  • tokenize only one output

Signed-off-by: xinyual [email protected]

  • clean sparse output

Signed-off-by: xinyual [email protected]

  • clean sparse output

Signed-off-by: xinyual [email protected]

  • change UT number

Signed-off-by: xinyual [email protected]

  • remove useless predict code

Signed-off-by: xinyual [email protected]

  • remove useless part

Signed-off-by: xinyual [email protected]

  • change tokenize way

Signed-off-by: xinyual [email protected]

  • reArch add textEmbedding model

Signed-off-by: xinyual [email protected]

  • add tokenize logic

Signed-off-by: xinyual [email protected]

  • add abstract

Signed-off-by: xinyual [email protected]

  • clear code

Signed-off-by: xinyual [email protected]

  • fix it class

Signed-off-by: xinyual [email protected]

  • fix it class

Signed-off-by: xinyual [email protected]

  • add IT file

Signed-off-by: xinyual [email protected]

  • reformulate

Signed-off-by: xinyual [email protected]

  • reformulate remote inference

Signed-off-by: xinyual [email protected]

  • reformulate remote inference

Signed-off-by: xinyual [email protected]

  • reformulate remote inference json and array

Signed-off-by: xinyual [email protected]

  • verify

Signed-off-by: xinyual [email protected]

  • undo string utils

Signed-off-by: xinyual [email protected]

  • skip dummy model

Signed-off-by: xinyual [email protected]

  • skip dummy model

Signed-off-by: xinyual [email protected]

  • skip dummy model

Signed-off-by: xinyual [email protected]

  • skip dummy model

Signed-off-by: xinyual [email protected]

  • skip dummy model

Signed-off-by: xinyual [email protected]

  • skip dummy model

Signed-off-by: xinyual [email protected]

  • add inner load Model

Signed-off-by: xinyual [email protected]

  • rename variable

Signed-off-by: xinyual [email protected]

  • add default for idf

Signed-off-by: xinyual [email protected]

  • add ut for sparse encoding and tokenizer

Signed-off-by: xinyual [email protected]

  • add close model

Signed-off-by: xinyual [email protected]

  • change mock class

Signed-off-by: xinyual [email protected]

  • remove buffer for sparse encoding output

Signed-off-by: xinyual [email protected]

  • change tokenize model ready logic

Signed-off-by: xinyual [email protected]

  • rewrite input functionName

Signed-off-by: xinyual [email protected]

  • deduplicate

Signed-off-by: xinyual [email protected]

  • change UT usage

Signed-off-by: xinyual [email protected]

  • fix downloadAndSplit test

Signed-off-by: xinyual [email protected]

  • fix Helper test

Signed-off-by: xinyual [email protected]

  • remove meaningless change

Signed-off-by: xinyual [email protected]

  • remove compile change

Signed-off-by: xinyual [email protected]

  • rename

Signed-off-by: xinyual [email protected]

  • fix typo error and simplify wrap code

Signed-off-by: xinyual [email protected]

  • add comment

Signed-off-by: xinyual [email protected]

  • using gson and remove useless close logic

Signed-off-by: xinyual [email protected]

  • update comment and import problem

Signed-off-by: xinyual [email protected]

  • add static idf name

Signed-off-by: xinyual [email protected]

  • fix format problem

Signed-off-by: xinyual [email protected]

  • extract an abstract model for sparse and dense sentence transformer translator

Signed-off-by: xinyual [email protected]

  • fix typo error

Signed-off-by: xinyual [email protected]

  • remove duplicate tokenizer file, fix import problem and add comment for tokenizer model

Signed-off-by: xinyual [email protected]


Signed-off-by: xinyual [email protected]
(cherry picked from commit 31a4e25)

Co-authored-by: xinyual [email protected]
(cherry picked from commit 44946da)

Description

[Describe what this change achieves]

Issues Resolved

[List any issues this PR will resolve]

Check List

  • New functionality includes testing.
    • All tests pass
  • New functionality has been documented.
    • New functionality has javadoc added
  • Commits are signed per the DCO using --signoff

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@zane-neo zane-neo temporarily deployed to ml-commons-cicd-env September 27, 2023 06:58 — with GitHub Actions Inactive

codecov bot commented Sep 27, 2023

Codecov Report

Merging #1398 (bd0382f) into 2.x (547ef21) will decrease coverage by 0.04%.
The diff coverage is 91.76%.

@@             Coverage Diff              @@
##                2.x    #1398      +/-   ##
============================================
- Coverage     78.28%   78.24%   -0.04%     
- Complexity     2272     2302      +30     
============================================
  Files           190      195       +5     
  Lines          9283     9370      +87     
  Branches        909      917       +8     
============================================
+ Hits           7267     7332      +65     
- Misses         1608     1624      +16     
- Partials        408      414       +6     
Flag Coverage Δ
ml-commons 78.24% <91.76%> (-0.04%) ⬇️

Files Coverage Δ
...earch/ml/engine/algorithms/TextEmbeddingModel.java 100.00% <100.00%> (ø)
...thms/sparse_encoding/SparseEncodingTranslator.java 100.00% <100.00%> (ø)
...rse_encoding/TextEmbeddingSparseEncodingModel.java 100.00% <100.00%> (ø)
...ing/HuggingfaceTextEmbeddingServingTranslator.java 100.00% <ø> (ø)
...NNXSentenceTransformerTextEmbeddingTranslator.java 67.04% <ø> (ø)
...ng/SentenceTransformerTextEmbeddingTranslator.java 100.00% <100.00%> (+2.38%) ⬆️
...rithms/text_embedding/TextEmbeddingDenseModel.java 91.30% <100.00%> (ø)
...tion/prediction/TransportPredictionTaskAction.java 78.26% <100.00%> (+0.48%) ⬆️
...n/java/org/opensearch/ml/model/MLModelManager.java 52.38% <ø> (-2.25%) ⬇️
...ain/java/org/opensearch/ml/engine/ModelHelper.java 89.92% <50.00%> (ø)
... and 3 more

@zane-neo
Collaborator Author

Backport will be done in this PR: #1399

@zane-neo zane-neo closed this Sep 27, 2023
2 participants