Refactor: Remove OCR related code for entire page OCR #231

Merged on Oct 5, 2023 (27 commits)
Changes from 20 commits
Commits
34af41f
move func merge_inferred_layout_with_ocr_layout
yuming-long Sep 27, 2023
fc98bd6
aggregate_ocr_text_by_block
yuming-long Sep 27, 2023
57fb359
supplement_layout_with_ocr_elements
yuming-long Sep 27, 2023
08abf44
get_elements_from_ocr_regions
yuming-long Sep 27, 2023
0f38f5c
merge_text_regions
yuming-long Sep 27, 2023
b8d08f7
remove ocr_layout
yuming-long Sep 27, 2023
d7025df
remove text region ocr
yuming-long Sep 28, 2023
2f56748
more clean up
yuming-long Sep 28, 2023
97c6e02
add back
yuming-long Sep 28, 2023
365f3f8
remove tests clean up
yuming-long Sep 29, 2023
b64e546
disable tesseract
yuming-long Sep 29, 2023
bc89ffa
move test fixture
yuming-long Sep 29, 2023
a5fc90d
remove paddle install in docker
yuming-long Sep 29, 2023
abd95d1
empty
yuming-long Sep 29, 2023
bfaf2bb
Merge branch 'main' into yuming/remove_ocr_code
yuming-long Sep 29, 2023
5726d2f
ocr param in new test
yuming-long Sep 29, 2023
1a968c4
changelog version
yuming-long Sep 29, 2023
28e1bc7
remove ocr constant
yuming-long Oct 3, 2023
3f43f06
remove all comment ocr
yuming-long Oct 3, 2023
4768a8e
add deduplicate_detected_elements back
yuming-long Oct 4, 2023
5f0bbff
refactor: remove ocr layout visualization
christinestraub Oct 4, 2023
2b7a2fc
remove tesseract module since table won't use it
yuming-long Oct 4, 2023
d319c9d
remove using image in extract_text
yuming-long Oct 4, 2023
e6bb6a3
Update CHANGELOG.md
yuming-long Oct 4, 2023
4de3ff3
version bump
yuming-long Oct 4, 2023
2f3c0db
fix: remove value error in extract text
yuming-long Oct 4, 2023
6a4677e
remove test since won't raise error
yuming-long Oct 4, 2023
4 changes: 4 additions & 0 deletions CHANGELOG.md
@@ -1,3 +1,7 @@
+## 0.6.7
+
+* Remove all OCR related code except the table OCR code
+
 ## 0.6.6

 * Stop passing ocr_languages parameter into paddle to avoid invalid paddle language code error, this will be fixed until

1 change: 0 additions & 1 deletion Dockerfile
@@ -20,7 +20,6 @@ RUN python3.8 -m pip install pip==${PIP_VERSION} && \
   pip install --no-cache -r requirements/base.txt && \
   pip install --no-cache -r requirements/test.txt && \
   pip install --no-cache -r requirements/dev.txt && \
-  pip install "unstructured.PaddleOCR" && \
   dnf -y groupremove "Development Tools" && \
   dnf clean all

25 changes: 0 additions & 25 deletions test_unstructured_inference/conftest.py
@@ -107,15 +107,6 @@ def mock_embedded_text_regions():
     ]


-@pytest.fixture()
-def mock_ocr_regions():
-    return [
-        EmbeddedTextRegion(10, 10, 90, 90, text="0", source=None),
-        EmbeddedTextRegion(200, 200, 300, 300, text="1", source=None),
-        EmbeddedTextRegion(500, 320, 600, 350, text="3", source=None),
-    ]
-
-
 # TODO(alan): Make a better test layout
 @pytest.fixture()
 def mock_layout(mock_embedded_text_regions):
@@ -130,19 +121,3 @@ def mock_layout(mock_embedded_text_regions):
         )
         for r in mock_embedded_text_regions
     ]
-
-
-@pytest.fixture()
-def mock_inferred_layout(mock_embedded_text_regions):
-    return [
-        LayoutElement(
-            r.x1,
-            r.y1,
-            r.x2,
-            r.y2,
-            text=None,
-            source=None,
-            type="Text",
-        )
-        for r in mock_embedded_text_regions
-    ]
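
The removed `mock_inferred_layout` fixture built one `LayoutElement` per embedded text region, copying the coordinates and leaving `text` and `source` unset. A test outside this repository that still needs that shape could rebuild it locally. The following is only a minimal sketch: the two dataclasses are hypothetical stand-ins for the library's `EmbeddedTextRegion` and `LayoutElement` (the real classes are not imported here), the region coordinates are made up, and plain functions replace the `@pytest.fixture()` decorators so the block runs without pytest.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class EmbeddedTextRegion:
    """Hypothetical stand-in, not the library's EmbeddedTextRegion."""

    x1: float
    y1: float
    x2: float
    y2: float
    text: Optional[str] = None
    source: Optional[str] = None


@dataclass
class LayoutElement(EmbeddedTextRegion):
    """Hypothetical stand-in, not the library's LayoutElement."""

    type: Optional[str] = None


def make_embedded_text_regions() -> List[EmbeddedTextRegion]:
    # Illustrative coordinates only; these are not the values in conftest.py.
    return [
        EmbeddedTextRegion(0, 0, 100, 50, text="first"),
        EmbeddedTextRegion(0, 60, 100, 110, text="second"),
    ]


def make_inferred_layout(regions: List[EmbeddedTextRegion]) -> List[LayoutElement]:
    # Same comprehension shape as the deleted fixture: one element per region,
    # coordinates copied, text and source left as None, type fixed to "Text".
    return [
        LayoutElement(r.x1, r.y1, r.x2, r.y2, text=None, source=None, type="Text")
        for r in regions
    ]


regions = make_embedded_text_regions()
layout = make_inferred_layout(regions)
```

Under pytest, each helper would typically become a fixture, with `make_inferred_layout` requesting the regions fixture by parameter name, as the deleted code did.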