[BUG] Pre-trained tas-b model won't auto-truncate the doc #132
This happens because the values of […]. To solve this bug, we need to update this method, specifically here. When we save the […]
We inject this JSON structure in the […]
This PR has been merged to update the documentation.
No, for now we updated the documentation, but we want to solve the bug on our end as well. This issue will be closed once the bug is fixed.
@dhrubo-os Thank you for assigning. Working on it!
Hey, I think this issue is no longer reproducible. Text longer than the threshold is processed as expected. Steps to reproduce:
Hi, thanks for looking at it. In your example, the whole string is being treated as one token because there's no separator (space) between words. I would suggest trying with more than 500 words and letting me know what you see.
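To illustrate the separator point, here is a minimal sketch. Assumption: a whitespace pre-tokenization step, as in most fast tokenizers; the real tas-b tokenizer also applies subword splitting, which is omitted here.

```python
# Why separators matter for the token count: without spaces, word-level
# pre-tokenization sees a single token, so the 512-token threshold is
# never reached.
no_spaces = "word" * 600      # one long run of characters, no separators
with_spaces = "word " * 600   # 600 whitespace-separated words

print(len(no_spaces.split()))    # 1
print(len(with_spaces.split()))  # 600
```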
Hi, @dhrubo-os.
What problem were you facing? Can you please share the error log with me? |
Do you also have the same file path? Are you saying that when you run this code block, your kernel starts showing this message?
Exactly. I also created the following folders. Is there another way of uploading the model to the server?
So we need to check why. Nobody has ever faced a kernel restart issue while running this function, so I'm wondering if your computer has enough memory. Did you check whether you can run other functions without any issue?
Hi, the problem was indeed memory. Thank you @dhrubo-os! When I upload the saved TorchScript model to OpenSearch, I see the following error. What could be the reason?
Can you please enable these settings? From the Dashboards Dev Tools you can run this:
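The settings block itself did not survive in this thread. For reference, registering and running models on non-ML nodes is typically enabled with ML Commons cluster settings along these lines (the exact keys and values here are an assumption, not the original snippet):

```json
PUT _cluster/settings
{
  "persistent": {
    "plugins.ml_commons.only_run_on_ml_node": false,
    "plugins.ml_commons.native_memory_threshold": 99
  }
}
```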
@Yerzhaisang, were you able to solve this issue? |
Hi @dhrubo-os, I was able to trace the model with TorchScript (see screenshot) and test this model using […]. Question: to check my fix in […]
Awesome. You can run the notebook in the root folder of opensearch-py-ml. What do you mean by removing the opensearch-py and opensearch-py-ml Python modules? You need to import those modules.
Hey @dhrubo-os, I noticed one thing: when I run the Jupyter notebook outside the root folder, the model is traced successfully. However, when I run it inside the root, I see the following error. I registered a model group and added its id to the config.json. Is this error familiar to you?
Please check which OpenSearch version you are using. Let's use 2.7 to make this implementation easier. In 2.8, we introduced model groups for access control, which can be a bit confusing. So please make sure your OpenSearch version is 2.7 so that you don't face any model-group-related issues.
Hey @dhrubo-os, where can I download a docker-compose.yml for version 2.7?
In your current docker-compose.yml, you can replace […]. Try a fresh installation after removing all the Docker containers.
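Concretely, pinning the version comes down to the image tags in docker-compose.yml. The fragment below is a sketch (service names vary between compose files; adjust to match yours):

```yaml
services:
  opensearch-node1:
    image: opensearchproject/opensearch:2.7.0        # pin to 2.7
  opensearch-dashboards:
    image: opensearchproject/opensearch-dashboards:2.7.0
```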
@dhrubo-os, we made sure that this issue is fixed. However, can we discuss how to make […]
When we have the […] Please let me know if you have any questions.
OK, let's say we made […]
Keep those variables at static values: "direction": "Right", "strategy": "LongestFirst", "stride": 0. Remember, if truncation already exists in the tokenizer.json file, then we won't update the value. We will update the value only if truncation is null.
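The rule above can be sketched as follows. This is a sketch, not the final implementation in opensearch-py-ml; the function name and the 512 default are assumptions, while the static "direction"/"strategy"/"stride" values come from the discussion.

```python
import json

def patch_truncation(tokenizer_json_path, max_length=512):
    """Write a default truncation block into tokenizer.json, but only when
    truncation is currently null; an existing block is left untouched."""
    with open(tokenizer_json_path) as f:
        config = json.load(f)

    if config.get("truncation") is None:
        config["truncation"] = {
            "direction": "Right",       # static value per the discussion
            "strategy": "LongestFirst", # static value per the discussion
            "stride": 0,                # static value per the discussion
            "max_length": max_length,   # assumption: model's max input length
        }
        with open(tokenizer_json_path, "w") as f:
            json.dump(config, f, indent=2)
```

Guarding on `truncation is None` is what prevents the patch from overwriting a tokenizer that already ships its own truncation settings.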
Got it, I will check my fix for all models to make sure that […]
Hey @dhrubo-os, can you look at https://docs.google.com/spreadsheets/d/1pCK0GJLvQfcmQ31475_IQTJzxueVUDJ0W3kkMxLbGkM/edit?usp=sharing? Many models have the same truncation parameters. Is that OK? Can you look at Yerzhaisang@03b281f? I checked […]
Cool, looks good. Please raise a PR with unit test. Thanks. |
@dhrubo-os, I added a unit test in Yerzhaisang@8a55e7b. Can you take a look? Do you know why one test failed?
Why not raise a PR to our repo? codecov/patch is showing reduced test coverage for your change. Looking at the test code, I think we should assert the max-length value of the tokenization to make sure we are updating it accordingly. Currently you are just asserting that the truncation value is null, which doesn't quite reflect the effect of your code change.
What is the bug?
Our pre-trained tas-b model won't accept a doc with a token length exceeding 512.
How can one reproduce the bug?
Simply uploading our tas-b model (in ONNX or TorchScript form) into an OpenSearch cluster and using it to embed a long doc will return […]
What is the expected behavior?
Ideally the model is supposed to auto-truncate the doc exceeding a certain length.
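The expected behavior can be sketched in a few lines: token sequences longer than the model's limit should be cut down rather than rejected. The function name below is illustrative, not OpenSearch's actual API.

```python
def auto_truncate(token_ids, max_length=512):
    """Keep at most max_length tokens, dropping the tail (truncate right)."""
    return token_ids[:max_length]

# A 600-token document is reduced to the 512-token limit instead of
# causing the embedding call to fail.
print(len(auto_truncate(list(range(600)))))  # 512
```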