rfctr(docx): improve typing etc. in prep for docx image extraction #3015

scanny · 2024-05-14T17:38:15Z

Summary
Noisy but trivial changes to partition_docx() environs and tests in preparation for DOCX image extraction. These changes are extracted here so they don't distract from the changes of substance to follow in the next PR.

- Modernize typing of list and dict.
- Use `example_doc_path()` instead of more brittle relative paths.
- Reimplement `test_ids_are_unique_and_deterministic()` to avoid depending on explicit id values.

Modernize typing for list, dict, tuple, and union types.

- Improve docstring.
- Remove definitions for unused parameters captured by `**kwargs`.

sentry-io · 2024-05-14T17:38:21Z

🔍 Existing Issues For Review

Your pull request is modifying functions with the following pre-existing issues:

📄 File: unstructured/partition/docx.py

| Function | Unhandled Issue |
| --- | --- |
| `convert_and_partition_docx` | RuntimeError: Pandoc died with exitcode "64" during conversion: Could not unzip ODT: not enough bytes ... `Event Count:` 3 |


scanny · 2024-05-14T17:41:17Z

test_unstructured/partition/docx/test_docx.py

-def example_doc_path(filename: str) -> str:
-    """String path to a file in the example-docs/ directory."""
-    return str(pathlib.Path(__file__).parent.parent.parent.parent / "example-docs" / filename)


This now has a better implementation in test_unstructured.unit_utils, use that one instead.
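The removed helper counts four `.parent` hops, so it breaks if the test module moves. The implementation in `test_unstructured.unit_utils` is not shown in this PR; the sketch below is only one possible way to make such a helper less position-dependent, by searching upward for the `example-docs/` directory. The `find_example_docs_dir()` name and the `start` parameter are assumptions made for this illustration.

```python
import pathlib


def find_example_docs_dir(start: pathlib.Path) -> pathlib.Path:
    """Walk upward from `start` until a directory containing `example-docs/` is found."""
    for candidate in [start, *start.parents]:
        docs_dir = candidate / "example-docs"
        if docs_dir.is_dir():
            return docs_dir
    raise FileNotFoundError(f"no example-docs/ directory found at or above {start}")


def example_doc_path(filename: str, start: pathlib.Path) -> str:
    """String path to `filename` in the nearest enclosing example-docs/ directory.

    `start` is a hypothetical parameter added for this sketch; a test module
    would typically pass `pathlib.Path(__file__).parent`.
    """
    return str(find_example_docs_dir(start.resolve()) / filename)
```

A test would then call something like `example_doc_path("duplicate-paragraphs.docx", pathlib.Path(__file__).parent)` regardless of how deeply the test module is nested.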

scanny · 2024-05-14T17:41:54Z

test_unstructured/partition/docx/test_docx.py

-def test_ids_are_unique_and_deterministic():
-    elements = partition_docx("example-docs/duplicate-paragraphs.docx")
-
-    ids = [e.id for e in elements]
-    assert ids == [
-        "2f22d82eea1faf5f40dac60cef52700e",
-        "ca9e1f448e531a5152d960e14eefc360",
-        "9ddeacb172ac17fb45e6f3f15f3c703d",
-        "a4fd85d3f4141acae38c8f9c936ed2f3",
-        "44ebaaf66640719c918246d4ccba1c45",
-        "f36e8ebcb3b6a051940a168fe73cbc44",
-        "532b395177652c7d61e1e4d855f1dc1d",
-    ], "IDs are not deterministic"


Reimplement this to avoid relatively brittle comparison against hard-coded ids.
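The reimplemented test is not reproduced in this conversation. As a rough sketch of the idea, the two properties can be asserted directly instead of comparing against literal hashes. To keep the example self-contained, `fake_partition()` below is a hypothetical stand-in for `partition_docx()`, and its hashing scheme is an assumption, not the library's id algorithm.

```python
import hashlib


def fake_partition(paragraphs: list[str]) -> list[str]:
    """Hypothetical stand-in for a partitioner: returns one id per paragraph.

    Each id hashes the text together with its position, so paragraphs with
    identical text still receive distinct ids.
    """
    return [
        hashlib.sha256(f"{idx}:{text}".encode()).hexdigest()[:32]
        for idx, text in enumerate(paragraphs)
    ]


def test_ids_are_unique_and_deterministic():
    paragraphs = ["Intro", "Duplicate text", "Duplicate text", "Outro"]

    ids = fake_partition(paragraphs)
    ids_again = fake_partition(paragraphs)

    # Unique: no id repeats, even for paragraphs with identical text.
    assert len(ids) == len(set(ids))
    # Deterministic: a second run over the same input yields the same ids.
    assert ids == ids_again
```

Written this way, the test keeps passing if the id scheme changes, as long as ids stay unique within a document and stable across runs.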

Coniferish · 2024-05-14T19:24:46Z

test_unstructured/partition/docx/test_docx.py

 import pathlib
 import re
-from tempfile import SpooledTemporaryFile
-from typing import Dict, List
+import tempfile


Why change this import?

It's a finer point, I suppose. I generally import stdlib modules rather than selected objects from them, unless they are used so frequently (like typing.*) that using the prefix becomes onerous. Using an explicit prefix tells folks this is a stdlib thing rather than a custom class/object. This is sometimes important when the names are more likely to collide with something defined locally.

Just a general inclination to keep the global namespace less populated I guess.
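A small illustration of the module-prefix style described above. This is a generic sketch, not code from the PR:

```python
import tempfile

# The `tempfile.` prefix shows at the call site that this is the stdlib class,
# and only the name `tempfile` is added to the module's global namespace.
with tempfile.SpooledTemporaryFile(max_size=1024) as f:
    f.write(b"hello")
    f.seek(0)
    data = f.read()
```

With `from tempfile import SpooledTemporaryFile`, the call site would read `SpooledTemporaryFile(...)`, which gives no hint about where the class comes from.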

Coniferish

lgtm

scanny added 3 commits May 13, 2024 21:24

rfctr(docx): tidy up docx tests

2b198e0

- Modernize typing of list and dict.
- Use `example_doc_path()` instead of more brittle relative paths.
- Reimplement `test_ids_are_unique_and_deterministic()` to avoid depending on explicit id values.

rfctr(docx): improve typing in docx.py

d721cac

Modernize typing for list, dict, tuple, and union types.

spike: tidy up partition_docx()

f823203

- Improve docstring.
- Remove definitions for unused parameters captured by `**kwargs`.

scanny requested a review from Coniferish May 14, 2024 17:38

chore: bump CHANGELOG + __version__

b00cf03

scanny commented May 14, 2024


scanny temporarily deployed to ci May 14, 2024 17:53 — with GitHub Actions Inactive

scanny had a problem deploying to ci May 14, 2024 17:53 — with GitHub Actions Failure

scanny temporarily deployed to ci May 14, 2024 18:58 — with GitHub Actions Inactive

Coniferish reviewed May 14, 2024


Coniferish approved these changes May 14, 2024


scanny added this pull request to the merge queue May 14, 2024

Merged via the queue into main with commit b4a6009 May 14, 2024
42 checks passed

scanny deleted the scanny/prep-for-docx-image-extraction branch May 14, 2024 20:05
