chore: stop passing language code from tesseract mapping to paddle #226
also seems like the ingest tests failing is unrelated to this PR (already failed with same trace on main) |
Right, the |
Could you provide some reproduction instructions? I'm glad it worked for the user but want to understand myself how the information flows from unstructured/unstructured-api into this inference change. |
thanks! just added more test description |
lgtm other than some spelling nits, test worked for me! I also tested with `ocr_languages="eng"` and was able to reproduce the correct `INFO Loading paddle with CPU on language=en...` message, indicating that it was using the default paddle value. thanks for the fix!
Co-authored-by: shreyanid <[email protected]>
@shreyanid do you have access to force merge this PR (since ingest tests are doomed to fail lol)? |
I guess I do :) |
### Summary
A user is flagging the assertion error for the paddle language code:
```
AssertionError: param lang must in dict_keys(['ch', 'en', 'korean', 'japan', 'chinese_cht', 'ta', 'te', 'ka', 'latin', 'arabic', 'cyrillic', 'devanagari']), but got eng
```
and tried setting the `ocr_languages` param to 'en' (the correct lang code for English in paddle), but that also didn't work. The reason is that `ocr_languages` goes through the mapping for tesseract codes, which converts `en` to `eng` since that's the correct lang code for English in tesseract.

The quick workaround here is to stop passing the lang code to paddle and let it use the default `en`. This will be addressed properly once we have a lang code mapping for paddle.

### Test
Looks like the user used this branch and got the lang parameter working, per the [linked comments](Unstructured-IO/unstructured-api#247 (comment)) :)

on api repo:
```
pip install paddlepaddle
pip install "unstructured.PaddleOCR"
export ENTIRE_PAGE_OCR=paddle
make run-web-app
```
* check error before this change:
```
curl -X 'POST' 'http://localhost:8000/general/v0/general' -H 'accept: application/json' -F 'files=@sample-docs/english-and-korean.png' -F 'ocr_languages=en' | jq -C . | less -R
```
will see the error:
```
{ "detail": "param lang must in dict_keys(['ch', 'en', 'korean', 'japan', 'chinese_cht', 'ta', 'te', 'ka', 'latin', 'arabic', 'cyrillic', 'devanagari']), but got eng" }
```
also in the logger you will see `INFO Loading paddle with CPU on language=eng...` since the tesseract mapping converts `en` to `eng`.
* check after this change:
  1. Check out this branch and install the inference repo into your env (the same env that's running the api) with `pip install -e .`
  2. Rerun `make run-web-app`
  3. Run the curl command again. You won't get a result on an M1 chip since paddle doesn't work on it, but in the logger info you can see `2023-09-27 12:48:48,120 unstructured_inference INFO Loading paddle with CPU on language=en...`, which means the lang parameter is using the default `en` (the logger info comes from [this line](https://github.com/Unstructured-IO/unstructured-inference/blob/main/unstructured_inference/models/paddle_ocr.py#L22)).

---------

Co-authored-by: shreyanid <[email protected]>
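For anyone trying to follow how the code flows here, below is a rough, self-contained sketch of the failure and the workaround. It is not the actual unstructured or unstructured-inference code: every name in it (`TESSERACT_LANG_MAP`, `to_tesseract_code`, `load_paddle`, `ocr_before_fix`, `ocr_after_fix`) is hypothetical. The only details taken from this PR are the list of paddle lang codes from the assertion message, the `en` to `eng` conversion, and the log line format.

```python
# Illustrative sketch only. All names below are hypothetical and do not
# come from unstructured or unstructured-inference.

# Lang codes paddle accepts, copied from the assertion message in this PR.
PADDLE_LANGS = (
    "ch", "en", "korean", "japan", "chinese_cht", "ta", "te", "ka",
    "latin", "arabic", "cyrillic", "devanagari",
)

# Assumed fragment of a tesseract mapping: the PR only states that
# `en` is converted to `eng`.
TESSERACT_LANG_MAP = {"en": "eng"}


def to_tesseract_code(lang: str) -> str:
    """Convert a lang code to its tesseract form, passing unknown codes through."""
    return TESSERACT_LANG_MAP.get(lang, lang)


def load_paddle(lang: str = "en") -> str:
    """Stand-in for loading paddle: validate the code and return the log line."""
    if lang not in PADDLE_LANGS:
        raise AssertionError(f"param lang must in {list(PADDLE_LANGS)}, but got {lang}")
    return f"Loading paddle with CPU on language={lang}..."


def ocr_before_fix(ocr_languages: str) -> str:
    # Before: the tesseract-mapped code is forwarded to paddle, so `en`
    # arrives as `eng` and fails validation.
    return load_paddle(to_tesseract_code(ocr_languages))


def ocr_after_fix(ocr_languages: str) -> str:
    # After: the lang code is no longer passed, so paddle uses its default `en`.
    return load_paddle()
```

With this sketch, `ocr_before_fix("en")` raises the assertion error ending in `but got eng`, while `ocr_after_fix("en")` and `ocr_after_fix("eng")` both load paddle with `language=en`. The trade-off is that a non-English request is silently ignored for paddle until a proper paddle lang mapping exists.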