Refactor: support entire page OCR with ocr_mode and ocr_languages #1579

Merged
merged 105 commits into from
Oct 6, 2023

yuming-long (Contributor) commented Sep 29, 2023

Summary

Second part of the OCR refactor, which moves OCR from the inference repo to the unstructured repo; the first part was done in Unstructured-IO/unstructured-inference#231. This PR adds the OCR processing logic for entire-page OCR and supports two OCR modes, "entire_page" and "individual_blocks".

The updated workflow for hi_res partitioning:

  • pass the document as data/filename to the inference repo to get inferred_layout (DocumentLayout)
  • pass the document as data/filename to the OCR module, which first opens the document (creating a temp file/dir as needed) and splits it into pages (for a PDF, converting each page to an image)
  • if the OCR mode is "entire_page"
    • OCR the entire image
    • merge the OCR layout with the inferred page layout
  • if the OCR mode is "individual_blocks"
    • from the inferred page layout, find the elements with no extracted text and crop the page image to each element's bbox
    • replace each empty element's text with the text obtained by running OCR on the cropped image
  • return all merged PageLayouts as a DocumentLayout for later processing
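The per-page branching described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code in this PR: `Region`, `merge_layouts`, `supplement_page_layout`, and the `ocr_page` / `ocr_crop` callables are hypothetical names, and the center-point merge rule is a stand-in for whatever merge logic the PR actually uses.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Tuple

BBox = Tuple[int, int, int, int]  # (x1, y1, x2, y2) in pixels


@dataclass
class Region:
    """Hypothetical stand-in for a layout element: a bounding box plus text."""
    bbox: BBox
    text: str = ""


def _center_inside(inner: BBox, outer: BBox) -> bool:
    # True when the center point of `inner` falls within `outer`.
    cx = (inner[0] + inner[2]) / 2
    cy = (inner[1] + inner[3]) / 2
    return outer[0] <= cx <= outer[2] and outer[1] <= cy <= outer[3]


def merge_layouts(inferred: List[Region], ocr_regions: List[Region]) -> List[Region]:
    # Assumed merge rule (illustrative only): an inferred region without text
    # takes the text of OCR regions centered inside it; OCR regions that match
    # no inferred region are appended at the end.
    merged: List[Region] = []
    used = set()
    for region in inferred:
        inside = [i for i, o in enumerate(ocr_regions)
                  if _center_inside(o.bbox, region.bbox)]
        used.update(inside)
        text = region.text or " ".join(ocr_regions[i].text for i in inside)
        merged.append(Region(region.bbox, text))
    merged.extend(o for i, o in enumerate(ocr_regions) if i not in used)
    return merged


def supplement_page_layout(
    image: Any,
    inferred: List[Region],
    ocr_mode: str,
    ocr_page: Callable[[Any], List[Region]],  # hypothetical: OCR the full page
    ocr_crop: Callable[[Any, BBox], str],     # hypothetical: OCR one cropped bbox
) -> List[Region]:
    if ocr_mode == "entire_page":
        return merge_layouts(inferred, ocr_page(image))
    if ocr_mode == "individual_blocks":
        # Only regions with no extracted text are sent to OCR.
        return [r if r.text else Region(r.bbox, ocr_crop(image, r.bbox))
                for r in inferred]
    raise ValueError(f"unsupported ocr_mode: {ocr_mode!r}")
```

Passing the OCR steps in as callables keeps the sketch runnable without an OCR engine installed; in practice they would wrap the real OCR call.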

This PR also bumps unstructured-inference==0.7.2, since this branch relies on the OCR refactor in unstructured-inference.
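Before trying the branch locally, it may help to confirm that the environment matches this pin. A minimal sketch, assuming the distribution is installed under the name `unstructured-inference`; the helper names `installed_version` and `pin_status` are made up for illustration:

```python
from importlib import metadata
from typing import Optional

REQUIRED = "0.7.2"  # the pin named in this PR


def installed_version(dist_name: str) -> Optional[str]:
    # Returns the installed version string, or None if the package is absent.
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


def pin_status(installed: Optional[str], required: str = REQUIRED) -> str:
    # Exact comparison, because the PR pins with "==" rather than ">=".
    if installed is None:
        return "missing"
    return "ok" if installed == required else "mismatch"


if __name__ == "__main__":
    print(pin_status(installed_version("unstructured-inference")))
```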

Test

from unstructured.partition.auto import partition

# Run the same image through both OCR modes with English + Korean enabled.
entire_page_ocr_mode_elements = partition(filename="example-docs/english-and-korean.png", ocr_mode="entire_page", ocr_languages="eng+kor", strategy="hi_res")
individual_blocks_ocr_mode_elements = partition(filename="example-docs/english-and-korean.png", ocr_mode="individual_blocks", ocr_languages="eng+kor", strategy="hi_res")
print([el.text for el in entire_page_ocr_mode_elements])
print([el.text for el in individual_blocks_ocr_mode_elements])

latest output:

# entire_page
['RULES AND INSTRUCTIONS 1. Template for day 1 (korean) , for day 2 (English) for day 3 both English and korean. 2. Use all your accounts. use different emails to send. Its better to have many email', 'accounts.', 'Note: Remember to write your own "OPENING MESSAGE" before you copy and paste the template. please always include [TREASURE HARUTO] for example:', '안녕하세요, 저 희 는 YGEAS 그룹 TREASUREWH HARUTOM|2] 팬 입니다. 팬 으 로서, HARUTO 씨 받 는 대 우 에 대해 의 구 심 과 불 공 평 함 을 LRU, 이 일 을 통해 저 희 의 의 혹 을 전 달 하여 귀 사 의 진지한 민 과 적극적인 답 변 을 받을 수 있 기 를 바랍니다.', '3. CC [email protected] so we can keep track of how many emails were', 'successfully sent', '4. Use the hashtag of Haruto on your tweet to show that vou have sent vour email]', '메 고']
# individual_blocks
['RULES AND INSTRUCTIONS 1. Template for day 1 (korean) , for day 2 (English) for day 3 both English and korean. 2. Use all your accounts. use different emails to send. Its better to have many email', 'Note: Remember to write your own "OPENING MESSAGE" before you copy and paste the template. please always include [TREASURE HARUTO] for example:', '안녕하세요, 저 희 는 YGEAS 그룹 TREASURES HARUTOM| 2] 팬 입니다. 팬 으로서, HARUTO 씨 받 는 대 우 에 대해 의 구 심 과 habe ERO, 이 머 일 을 적극 저 희 의 ASS 전 달 하여 귀 사 의 진지한 고 2 있 기 를 바랍니다.', '3. CC [email protected] so we can keep track of how many emails were ciiccecefisliy cant', 'VULLESSIULY Set 4. Use the hashtag of Haruto on your tweet to show that you have sent your email']

yuming-long and others added 20 commits October 5, 2023 14:35
… <- Ingest test fixtures update (#1658)

This pull request includes updated ingest test fixtures.
Please review and merge if appropriate.

Co-authored-by: yuming-long <[email protected]>
# Conflicts:
#	CHANGELOG.md
#	test_unstructured_ingest/expected-structured-output/biomed-api/65/11/main.PMC6312790.pdf.json
#	test_unstructured_ingest/expected-structured-output/biomed-api/75/29/main.PMC6312793.pdf.json
#	test_unstructured_ingest/expected-structured-output/s3/small-pdf-set/2023-Jan-economic-outlook.pdf.json
#	test_unstructured_ingest/expected-structured-output/s3/small-pdf-set/Silent-Giant-(1).pdf.json
#	test_unstructured_ingest/expected-structured-output/s3/small-pdf-set/recalibrating-risk-report.pdf.json
… <- Ingest test fixtures update (#1661)

This pull request includes updated ingest test fixtures.
Please review and merge if appropriate.

Co-authored-by: christinestraub <[email protected]>
# Conflicts:
#	CHANGELOG.md
#	requirements/constraints.in
#	requirements/extra-pdf-image.txt
#	requirements/huggingface.txt
#	requirements/ingest-openai.txt
#	requirements/ingest-salesforce.txt
#	requirements/test.txt
… <- Ingest test fixtures update (#1677)

This pull request includes updated ingest test fixtures.
Please review and merge if appropriate.

Co-authored-by: christinestraub <[email protected]>
cragwolfe dismissed stale reviews from christinestraub and qued October 6, 2023 22:54

stale request

cragwolfe added this pull request to the merge queue Oct 6, 2023
Merged via the queue into main with commit dcd6d0f Oct 6, 2023
39 checks passed
cragwolfe deleted the yuming/refactor_ocr branch October 6, 2023 23:32
6 participants