rfctr(docx): improve typing etc. in prep for docx image extraction #3015
- Modernize typing of list and dict.
- Use `example_doc_path()` instead of more brittle relative paths.
- Reimplement `test_ids_are_unique_and_deterministic()` to avoid depending on explicit id values.
Modernize typing for list, dict, tuple, and union types.
- Improve docstring.
- Remove definitions for unused parameters captured by `**kwargs`.
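As a rough illustration of what "modernizing" the typing of list, dict, and union types usually means, the sketch below contrasts the older `typing.List`/`typing.Dict`/`typing.Optional` spelling with the built-in generic and `X | None` spelling. The function names `count_words_old()` and `count_words()` are hypothetical and do not come from this PR; the `from __future__ import annotations` line is an assumption that keeps the newer syntax valid as annotations on older Python 3 versions.

```python
from __future__ import annotations

from typing import Dict, List, Optional


# Older spelling: container and optional types imported from `typing`.
def count_words_old(lines: List[str], sep: Optional[str] = None) -> Dict[str, int]:
    return count_words(lines, sep)


# Modern spelling: built-in generics (PEP 585) and `|` unions (PEP 604).
def count_words(lines: list[str], sep: str | None = None) -> dict[str, int]:
    counts: dict[str, int] = {}
    for line in lines:
        # `sep=None` splits on any run of whitespace.
        for word in line.split(sep):
            counts[word] = counts.get(word, 0) + 1
    return counts
```

Both functions behave identically at runtime; only the annotations differ, which is why a change of this kind tends to be noisy in the diff but low-risk.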
🔍 Existing Issues For Review
Your pull request is modifying functions with the following pre-existing issues:
📄 File: unstructured/partition/docx.py
def example_doc_path(filename: str) -> str:
    """String path to a file in the example-docs/ directory."""
    return str(pathlib.Path(__file__).parent.parent.parent.parent / "example-docs" / filename)
This now has a better implementation in `test_unstructured.unit_utils`; use that one instead.
def test_ids_are_unique_and_deterministic():
    elements = partition_docx("example-docs/duplicate-paragraphs.docx")

    ids = [e.id for e in elements]
    assert ids == [
        "2f22d82eea1faf5f40dac60cef52700e",
        "ca9e1f448e531a5152d960e14eefc360",
        "9ddeacb172ac17fb45e6f3f15f3c703d",
        "a4fd85d3f4141acae38c8f9c936ed2f3",
        "44ebaaf66640719c918246d4ccba1c45",
        "f36e8ebcb3b6a051940a168fe73cbc44",
        "532b395177652c7d61e1e4d855f1dc1d",
    ], "IDs are not deterministic"
Reimplement this to avoid relatively brittle comparison against hard-coded ids.
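One way to check the same two properties without hard-coded ids is to partition twice and compare the runs against each other. The sketch below is not the test from this PR: `FakeElement`, `fake_partition()`, and `check_ids_unique_and_deterministic()` are hypothetical stand-ins so the example runs without the `unstructured` package, and the id scheme (hash of position plus text) is an assumption made only for the demonstration.

```python
from __future__ import annotations

import hashlib
from dataclasses import dataclass
from typing import Callable


@dataclass
class FakeElement:
    """Stand-in for a partitioned element; only `.id` matters here."""

    text: str
    id: str


def fake_partition(paragraphs: list[str]) -> list[FakeElement]:
    """Hypothetical partitioner that derives each id from position and text."""
    return [
        FakeElement(text, hashlib.md5(f"{i}:{text}".encode()).hexdigest())
        for i, text in enumerate(paragraphs)
    ]


def check_ids_unique_and_deterministic(partition: Callable[[], list[FakeElement]]) -> None:
    """Assert ids are unique within a run and identical across two runs."""
    first = [e.id for e in partition()]
    second = [e.id for e in partition()]
    assert len(set(first)) == len(first), "IDs are not unique"
    assert first == second, "IDs are not deterministic"


# Duplicate paragraph text must still yield distinct ids.
check_ids_unique_and_deterministic(lambda: fake_partition(["dup", "dup", "other"]))
```

In a real test, the callable would presumably wrap `partition_docx()` on the duplicate-paragraphs document; the assertions then keep passing even if the id-hashing scheme changes.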
import pathlib
import re
from tempfile import SpooledTemporaryFile
from typing import Dict, List
import tempfile
Why change this import?
It's a finer point, I suppose. I generally import stdlib modules rather than selected objects from them, unless they are used so frequently (like `typing.*`) that using the prefix becomes onerous. Using an explicit prefix tells folks this is a stdlib thing rather than a custom class/object. This is sometimes important when the names are more likely to collide with something defined locally.
Just a general inclination to keep the global namespace less populated I guess.
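A small sketch of the module-prefix style described above, using the same stdlib class from the diff. The helper `make_buffer()` is hypothetical and only serves to show how the call site reads.

```python
import tempfile


def make_buffer(data: bytes, max_size: int = 1024) -> tempfile.SpooledTemporaryFile:
    """Write `data` to a spooled temp file and rewind it for reading."""
    # The `tempfile.` prefix makes the stdlib origin obvious at the call site
    # and adds only one name (`tempfile`) to the module's global namespace.
    buf = tempfile.SpooledTemporaryFile(max_size=max_size)
    buf.write(data)
    buf.seek(0)
    return buf
```

With `from tempfile import SpooledTemporaryFile` the call would be slightly shorter, at the cost of a bare class name that a reader could mistake for a locally defined one.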
lgtm
Summary
Noisy but trivial changes to `partition_docx()` environs and tests in preparation for DOCX image extraction. These changes are extracted here so they don't distract from the changes of substance to follow in the next PR.