Refactor: Remove OCR related code for entire page OCR #231

Merged
merged 27 commits into from
Oct 5, 2023

Conversation

@yuming-long yuming-long commented Sep 29, 2023

Summary

First part of the OCR refactor that moves OCR from the inference repo to the unstructured repo. This PR removes all OCR-related code for entire-page OCR. Table-related OCR stays the same for now; it will be moved after the table code is refactored to accept preprocessed OCR data.

Test

Please see the test description in Unstructured-IO/unstructured#1579, since the two PRs need to work together.

Note

The ingest test won't pass until the unstructured refactor PR is merged.

@yuming-long yuming-long force-pushed the yuming/remove_ocr_code branch from 041c465 to 12c30c3 Compare September 29, 2023 17:08
@yuming-long yuming-long changed the title Yuming/remove ocr code Chore: Remove OCR related code for entire page OCR Oct 3, 2023
@yuming-long yuming-long marked this pull request as ready for review October 3, 2023 23:47
@christinestraub christinestraub changed the title Chore: Remove OCR related code for entire page OCR Refactor: Remove OCR related code for entire page OCR Oct 4, 2023
@christinestraub christinestraub left a comment

LGTM!

@cragwolfe cragwolfe merged commit ffb1f0b into main Oct 5, 2023
5 of 8 checks passed
@cragwolfe cragwolfe deleted the yuming/remove_ocr_code branch October 5, 2023 18:23
github-merge-queue bot pushed a commit to Unstructured-IO/unstructured that referenced this pull request Oct 6, 2023
…#1579)

## Summary
Second part of the OCR refactor that moves OCR from the inference repo
to the unstructured repo; the first part was done in
Unstructured-IO/unstructured-inference#231. This
PR adds the OCR processing logic for entire-page OCR and supports two
OCR modes, `"entire_page"` and `"individual_blocks"`.

The updated workflow for the `hi_res` partition strategy:
* Pass the document as data/filename to the inference repo to get the
`inferred_layout` (DocumentLayout).
* Pass the document as data/filename to the OCR module, which first opens
the document (creating a temp file/dir as needed) and splits it into
pages (for PDF files, PDF pages are converted to image pages).
* If the OCR mode is `"entire_page"`:
    * OCR the entire image.
    * Merge the OCR layout with the inferred page layout.
* If the OCR mode is `"individual_blocks"`:
    * From the inferred page layout, find the elements with no extracted
    text and crop the entire image by the bboxes of those elements.
    * Replace each empty text element with the text obtained by running
    OCR on the cropped image.
* Return all merged PageLayouts and form a DocumentLayout for later
processing.
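
The per-page branching described above can be sketched roughly as follows. This is a simplified illustration, not the code from this PR: `Element`, `ocr_and_merge`, and the injected `ocr_page` / `ocr_region` callables are hypothetical names, and the "center inside bbox" merge rule is an assumption standing in for the real merge logic.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Optional, Tuple

BBox = Tuple[int, int, int, int]  # (x1, y1, x2, y2)


@dataclass
class Element:
    """Hypothetical stand-in for a layout element: a box plus optional text."""
    bbox: BBox
    text: Optional[str] = None


def _center_inside(inner: BBox, outer: BBox) -> bool:
    # True when the center point of `inner` falls within `outer`.
    cx = (inner[0] + inner[2]) / 2
    cy = (inner[1] + inner[3]) / 2
    return outer[0] <= cx <= outer[2] and outer[1] <= cy <= outer[3]


def ocr_and_merge(
    image: Any,
    inferred: List[Element],
    ocr_mode: str,
    ocr_page: Callable[[Any], List[Element]],
    ocr_region: Callable[[Any, BBox], str],
) -> List[Element]:
    """Combine one page's inferred layout with OCR text (assumed merge rule)."""
    if ocr_mode == "entire_page":
        # OCR the whole image once, then attach OCR text to inferred elements.
        ocr_elements = ocr_page(image)
        used = set()
        merged = []
        for el in inferred:
            hits = [i for i, o in enumerate(ocr_elements)
                    if _center_inside(o.bbox, el.bbox)]
            used.update(hits)
            # Keep already-extracted text; otherwise join the matching OCR text.
            text = el.text or " ".join(ocr_elements[i].text or "" for i in hits)
            merged.append(Element(el.bbox, text))
        # OCR regions that matched no inferred element are kept as-is.
        merged.extend(o for i, o in enumerate(ocr_elements) if i not in used)
        return merged
    if ocr_mode == "individual_blocks":
        # OCR only the crops of elements that have no extracted text.
        return [el if el.text else Element(el.bbox, ocr_region(image, el.bbox))
                for el in inferred]
    raise ValueError(f"unsupported ocr_mode: {ocr_mode!r}")
```

Injecting the OCR functions keeps the sketch independent of any particular OCR engine; in practice they would wrap whatever OCR backend the OCR module uses.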

This PR also bumps the dependency to `unstructured-inference==0.7.2`,
since the branch relies on the OCR refactor from unstructured-inference.
  
## Test
```
from unstructured.partition.auto import partition

# Partition the same image once per OCR mode so the outputs can be compared.
entire_page_ocr_mode_elements = partition(filename="example-docs/english-and-korean.png", ocr_mode="entire_page", ocr_languages="eng+kor", strategy="hi_res")
individual_blocks_ocr_mode_elements = partition(filename="example-docs/english-and-korean.png", ocr_mode="individual_blocks", ocr_languages="eng+kor", strategy="hi_res")
print([el.text for el in entire_page_ocr_mode_elements])
print([el.text for el in individual_blocks_ocr_mode_elements])
```
Latest output:
```
# entire_page
['RULES AND INSTRUCTIONS 1. Template for day 1 (korean) , for day 2 (English) for day 3 both English and korean. 2. Use all your accounts. use different emails to send. Its better to have many email', 'accounts.', 'Note: Remember to write your own "OPENING MESSAGE" before you copy and paste the template. please always include [TREASURE HARUTO] for example:', '안녕하세요, 저 희 는 YGEAS 그룹 TREASUREWH HARUTOM|2] 팬 입니다. 팬 으 로서, HARUTO 씨 받 는 대 우 에 대해 의 구 심 과 불 공 평 함 을 LRU, 이 일 을 통해 저 희 의 의 혹 을 전 달 하여 귀 사 의 진지한 민 과 적극적인 답 변 을 받을 수 있 기 를 바랍니다.', '3. CC [email protected] so we can keep track of how many emails were', 'successfully sent', '4. Use the hashtag of Haruto on your tweet to show that vou have sent vour email]', '메 고']
# individual_blocks
['RULES AND INSTRUCTIONS 1. Template for day 1 (korean) , for day 2 (English) for day 3 both English and korean. 2. Use all your accounts. use different emails to send. Its better to have many email', 'Note: Remember to write your own "OPENING MESSAGE" before you copy and paste the template. please always include [TREASURE HARUTO] for example:', '안녕하세요, 저 희 는 YGEAS 그룹 TREASURES HARUTOM| 2] 팬 입니다. 팬 으로서, HARUTO 씨 받 는 대 우 에 대해 의 구 심 과 habe ERO, 이 머 일 을 적극 저 희 의 ASS 전 달 하여 귀 사 의 진지한 고 2 있 기 를 바랍니다.', '3. CC [email protected] so we can keep track of how many emails were ciiccecefisliy cant', 'VULLESSIULY Set 4. Use the hashtag of Haruto on your tweet to show that you have sent your email']
```

---------

Co-authored-by: ryannikolaidis <[email protected]>
Co-authored-by: yuming-long <[email protected]>
Co-authored-by: christinestraub <[email protected]>
Co-authored-by: christinestraub <[email protected]>