rfctr(docx): improve typing etc. in prep for docx image extraction #3015

Merged
merged 4 commits into from
May 14, 2024

Conversation

scanny
Collaborator

@scanny scanny commented May 14, 2024

Summary
Noisy but trivial changes to partition_docx() environs and tests in preparation for DOCX image extraction. These changes are extracted here so they don't distract from the changes of substance to follow in the next PR.

scanny added 3 commits May 13, 2024 21:24
- Modernize typing of list and dict.
- Use `example_doc_path()` instead of more brittle relative paths.
- Reimplement `test_ids_are_unique_and_deterministic()` to avoid
  depending on explicit id values.
Modernize typing for list, dict, tuple, and union types.
- Improve docstring.
- Remove definitions for unused parameters captured by `**kwargs`.

sentry-io bot commented May 14, 2024

🔍 Existing Issues For Review

Your pull request is modifying functions with the following pre-existing issues:

📄 File: unstructured/partition/docx.py

| Function | Unhandled Issue |
| --- | --- |
| convert_and_partition_docx | RuntimeError: Pandoc died with exitcode "64" during conversion: Could not unzip ODT: not enough bytes ... (Event Count: 3) |


@scanny scanny requested a review from Coniferish May 14, 2024 17:38
Comment on lines -604 to -606
def example_doc_path(filename: str) -> str:
    """String path to a file in the example-docs/ directory."""
    return str(pathlib.Path(__file__).parent.parent.parent.parent / "example-docs" / filename)
Collaborator Author

This now has a better implementation in test_unstructured.unit_utils; use that one instead.

Comment on lines -677 to -689
def test_ids_are_unique_and_deterministic():
    elements = partition_docx("example-docs/duplicate-paragraphs.docx")

    ids = [e.id for e in elements]
    assert ids == [
        "2f22d82eea1faf5f40dac60cef52700e",
        "ca9e1f448e531a5152d960e14eefc360",
        "9ddeacb172ac17fb45e6f3f15f3c703d",
        "a4fd85d3f4141acae38c8f9c936ed2f3",
        "44ebaaf66640719c918246d4ccba1c45",
        "f36e8ebcb3b6a051940a168fe73cbc44",
        "532b395177652c7d61e1e4d855f1dc1d",
    ], "IDs are not deterministic"
Collaborator Author

Reimplement this to avoid relatively brittle comparison against hard-coded ids.
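
One way such a test could be written without hard-coded ids is sketched below. This is not the code merged in this PR: `Element` and `fake_partition` are hypothetical stand-ins so the example runs on its own, and the SHA-256 id scheme is an assumption made purely for illustration. In the real suite the stand-in call would presumably be replaced by `partition_docx(example_doc_path("duplicate-paragraphs.docx"))`.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class Element:
    """Hypothetical stand-in for a partitioned document element."""

    id: str
    text: str


def fake_partition(paragraphs: list[str]) -> list[Element]:
    """Hypothetical stand-in for partition_docx().

    Derives each id from the paragraph's position and text, so duplicate
    paragraphs still receive distinct ids.
    """
    return [
        Element(id=hashlib.sha256(f"{i}:{text}".encode()).hexdigest()[:32], text=text)
        for i, text in enumerate(paragraphs)
    ]


def test_ids_are_unique_and_deterministic():
    paragraphs = ["Duplicate.", "Duplicate.", "Unique."]

    ids = [e.id for e in fake_partition(paragraphs)]
    ids_again = [e.id for e in fake_partition(paragraphs)]

    # Unique: repeated text must not produce repeated ids.
    assert len(ids) == len(set(ids))
    # Deterministic: partitioning the same input twice gives the same ids.
    assert ids == ids_again
```

The test asserts the two properties named in its title, so it keeps passing if the id algorithm changes, as long as the ids stay unique and repeatable.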

 import pathlib
 import re
-from tempfile import SpooledTemporaryFile
-from typing import Dict, List
+import tempfile
Collaborator

Why change this import?

Collaborator Author

It's a finer point I suppose. I generally import stdlib modules rather than selected objects from them unless they are used so frequently (like typing.*) that using the prefix becomes onerous. Using an explicit prefix tells folks this is a stdlib thing rather than a custom class/object. This is sometimes important when the names are more likely to collide with something defined locally.

Just a general inclination to keep the global namespace less populated I guess.
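
A small sketch of the naming-collision point, assuming a project that happens to define its own class called `SpooledTemporaryFile`. That local class and the helper `make_upload_buffer` are hypothetical and exist only to show the effect of the import style.

```python
import tempfile


class SpooledTemporaryFile:
    """Hypothetical project-local class that shares a name with the stdlib one."""


def make_upload_buffer(data: bytes) -> tempfile.SpooledTemporaryFile:
    """Return a stdlib spooled temp file holding `data`, rewound to the start."""
    # The `tempfile.` prefix makes it unambiguous that this is the stdlib class,
    # even though a local class with the same name exists in this module.
    buf = tempfile.SpooledTemporaryFile(max_size=1024)
    buf.write(data)
    buf.seek(0)
    return buf
```

With `from tempfile import SpooledTemporaryFile`, the local class definition would silently rebind the imported name; the module-prefixed form avoids that.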

Collaborator

@Coniferish Coniferish left a comment

lgtm

@scanny scanny added this pull request to the merge queue May 14, 2024
Merged via the queue into main with commit b4a6009 May 14, 2024
42 checks passed
@scanny scanny deleted the scanny/prep-for-docx-image-extraction branch May 14, 2024 20:05