chore: stop passing language code from tesseract mapping to paddle #226

Merged
merged 7 commits into from
Sep 27, 2023

Conversation

yuming-long
Contributor

@yuming-long yuming-long commented Sep 22, 2023

Summary

A user reported an assertion error for the Paddle language code:

AssertionError: param lang must in dict_keys(['ch', 'en', 'korean', 'japan', 'chinese_cht', 'ta', 'te', 'ka', 'latin', 'arabic', 'cyrillic', 'devanagari']), but got eng

The user then tried setting the ocr_languages param to 'en' (the correct language code for English in Paddle), but that did not work either.
The reason is that ocr_languages goes through the Tesseract language-code mapping, which converts en to eng, since that is the correct code for English in Tesseract.

The quick workaround here is to stop passing the language code to Paddle and let it use its default, en. This will be addressed properly once we have a language-code mapping for Paddle.
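To make the failure and the workaround concrete, here is a minimal, self-contained sketch. It does not use the real unstructured or PaddleOCR code: `to_tesseract_code`, `load_paddle`, `ocr_before_fix`, and `ocr_after_fix` are hypothetical stand-ins, and only the list of accepted language codes is taken from the assertion message above.

```python
# Illustrative stand-ins only; none of these function names come from the real code base.

# Language codes Paddle accepts, copied from the assertion message above.
PADDLE_LANGS = {
    "ch", "en", "korean", "japan", "chinese_cht", "ta", "te", "ka",
    "latin", "arabic", "cyrillic", "devanagari",
}

# Hypothetical fragment of a Tesseract-oriented mapping.
TESSERACT_CODES = {"en": "eng"}


def to_tesseract_code(lang: str) -> str:
    """Mimic the conversion applied to ocr_languages: 'en' becomes 'eng'."""
    return TESSERACT_CODES.get(lang, lang)


def load_paddle(lang: str = "en") -> str:
    """Stand-in for the Paddle loader: rejects codes Paddle does not know."""
    assert lang in PADDLE_LANGS, (
        f"param lang must in {sorted(PADDLE_LANGS)}, but got {lang}"
    )
    return f"Loading paddle with CPU on language={lang}..."


def ocr_before_fix(ocr_languages: str) -> str:
    # The Tesseract code is forwarded to Paddle, so 'en' arrives as 'eng'.
    return load_paddle(lang=to_tesseract_code(ocr_languages))


def ocr_after_fix(ocr_languages: str) -> str:
    # Workaround: do not forward the converted code; rely on Paddle's default.
    return load_paddle()
```

In this sketch, `ocr_before_fix("en")` raises an `AssertionError` ending in "but got eng", while `ocr_after_fix` loads with the default `en` regardless of what was requested. That is also the limitation of the workaround: a non-English request is silently ignored until a Paddle-specific mapping exists.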

Test

It looks like the user tried this branch and got the lang parameter working, per the linked comments (Unstructured-IO/unstructured-api#247 (comment)) :)
On the API repo:

pip install paddlepaddle
pip install "unstructured.PaddleOCR"
export ENTIRE_PAGE_OCR=paddle
make run-web-app
  • check error before this change:
curl  -X 'POST'  'http://localhost:8000/general/v0/general'   -H 'accept: application/json'  -F 'files=@sample-docs/english-and-korean.png'   -F 'ocr_languages=en'  | jq -C . | less -R

You will see the error:

{
  "detail": "param lang must in dict_keys(['ch', 'en', 'korean', 'japan', 'chinese_cht', 'ta', 'te', 'ka', 'latin', 'arabic', 'cyrillic', 'devanagari']), but got eng"
}

In the logs you will also see INFO Loading paddle with CPU on language=eng... since the Tesseract mapping converts en to eng.

  • check after this change:

Check out this branch and install the inference repo into your environment (the same one that is running the API) with pip install -e .

Rerun make run-web-app

Run the curl command again. You won't get a result on an M1 chip, since Paddle doesn't work on it, but in the logs you can see 2023-09-27 12:48:48,120 unstructured_inference INFO Loading paddle with CPU on language=en..., which means the lang parameter is using the default en (the log message comes from this line: https://github.com/Unstructured-IO/unstructured-inference/blob/main/unstructured_inference/models/paddle_ocr.py#L22).
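Since the check on an M1 machine relies on reading the log rather than the API response, a small helper can pull the language out of the log line. This is only a sketch: `paddle_language_from_log` is a hypothetical name, and the pattern assumes the message format shown in the two log lines quoted in this description.

```python
import re
from typing import Optional

# Matches e.g. "Loading paddle with CPU on language=en..." and captures "en".
LANG_PATTERN = re.compile(r"Loading paddle with \w+ on language=([A-Za-z_]+)")


def paddle_language_from_log(line: str) -> Optional[str]:
    """Return the language code Paddle was loaded with, or None if absent."""
    match = LANG_PATTERN.search(line)
    return match.group(1) if match else None


before = "INFO Loading paddle with CPU on language=eng..."
after = (
    "2023-09-27 12:48:48,120 unstructured_inference INFO "
    "Loading paddle with CPU on language=en..."
)

print(paddle_language_from_log(before))  # eng
print(paddle_language_from_log(after))   # en
```

Seeing `en` after the change, whatever value was sent in ocr_languages, indicates that the default is in use.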

@yuming-long yuming-long marked this pull request as ready for review September 22, 2023 20:01
@yuming-long
Contributor Author

Also, the ingest test failures seem unrelated to this PR (they already fail with the same trace on main).

@shreyanid
Contributor

Right, the _partition_pdf_or_image_with_ocr function for example assumes the OCR agent is Tesseract, but that's because it's also specified in the docstring and in dependencies and usage. What flow did the user follow to use paddle and get this error?

@shreyanid
Contributor

Could you provide some reproduction instructions? I'm glad it worked for the user but want to understand myself how the information flows from unstructured/unstructured-api into this inference change.

@yuming-long
Contributor Author

thanks! just added more test description

Contributor

@shreyanid shreyanid left a comment

lgtm other than some spelling nits, test worked for me! I also tested with ocr_languages="eng" and was able to reproduce the correct INFO Loading paddle with CPU on language=en.. message indicating that it was using the default paddle value. thanks for the fix!

unstructured_inference/inference/layout.py Outdated Show resolved Hide resolved
unstructured_inference/inference/layout.py Outdated Show resolved Hide resolved
@yuming-long
Contributor Author

@shreyanid do you have access to force merge this PR (since ingest tests are doomed to fail lol)?

@shreyanid shreyanid merged commit cf15726 into main Sep 27, 2023
@shreyanid shreyanid deleted the yuming/remove_lang_in_paddle branch September 27, 2023 23:46
@shreyanid
Contributor

I guess I do :)

benjats07 pushed a commit that referenced this pull request Sep 30, 2023

Co-authored-by: shreyanid <[email protected]>