-
Notifications
You must be signed in to change notification settings - Fork 810
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
docx refactor #1422
docx refactor #1422
Conversation
f9f0d34
to
c8d5a0e
Compare
c8d5a0e
to
6d08426
Compare
6d08426
to
36dcb4e
Compare
All looks good to me from the walk through and local tests. I still do see some test failures for the formatting, but that could be easily fix. |
f840e6f
to
3b8f4d8
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, just one suggestion related to consistency with how we've been notifying users when they are missing a dependency.
unstructured/partition/docx.py
Outdated
@@ -103,35 +103,44 @@ def convert_and_partition_docx( | |||
Determines whether or not metadata is included in the metadata attribute on the elements in | |||
the output. | |||
""" | |||
if filename is None: | |||
filename = "" | |||
if "pypandoc" not in globals(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could this be handled by the requires_dependencies
decorator? Feel free to suggest improvements there, but that would be consistent with the rest of our codebase.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, good to know, thanks @qued I'll check that out and replace this :)
3769abc
to
d7962cc
Compare
Copy `partition_docx()` as-is as a method of a new class `_DocxPartitioner`. In the next step the original implementation will be replaced with a call to `_DocxPartitioner` that will be the basis for further refactoring. Doing it this way makes the commit-diff a lot easier to follow.
... with a call to the main classmethod of _DocxPartitioner.
Next step is to make parameters available to all methods so we can extract methods without having to know exactly what their implementation needs.
Morph the "pasted-as-is" `._partion_docx()` method of _DocxPartitioner into `._iter_document_elements()`, primarily by changing its signature to -> Iterator[Element] rather than List[Element].
Opening the provided file with `python-docx` to produce a `Document object` is a separate concern from processing the document contents. Also, other methods should be free to access the document for other purposes like accessing the document settings, etc. Extract opening the document to a lazyproperty so when and who actually triggers the document to be opened is of no concern and all callers get the same instance of the document object.
No need for this to be a module-level function. Bring it into the class as a method. This associates it clearly with `_DocxPartitioner` which is its only caller.
This only needs to be computed once and will be referenced from multiple methods. Extract it to a lazyproperty.
Computing the last-modified date for use in element metadata is a distinct concern and we are likely to improve it going forward to query the docx document-properties. Extract it to a lazyproperty.
Detecting whether a paragraph is a list-item is a distinct concern and will need to become substantially more sophisticated than it is so far. Extract it to it's own method to encapsulate the concern.
Incorporate module-level function into _DocxPartitioner as it is its only caller.
This will be used shortly to replace part of `_paragraph_to_element()` which we will be decomposing in the next couple commits.
This will shortly replace `_text_to_element()`.
Separate concern, separate method.
This completes the recomposition of paragraph-partitioning. Remove now unused `_paragraph_to_element()` and `_text_to_element()`
Also remove now-unused module-level emphasis functions.
Handling a table extracts nicely now we have the supporting methods decoupled from document traversal.
This belongs upstream but will live here for now to allow us to stream the document blocks (paragraphs and tables) section-by-section in document order.
Update document traversal to go section-by-section, then block-by-block within each section. This lays the groundwork for the rest of the extractions.
Add detection and emission of PageBreak elements produced by document sections. A docx section can and perhaps often does start on a new page and can even give rise to two page breaks, for example to move from an odd page to the next odd page.
This partially replaces `_get_headers_and_footers()` but we can't remove that until we add `._iter_section_footers()` in the next commit. Meanwhile, this improved implementation accounts for "linked-to-previous" headers, which should not emit a Header element (it hasn't changed since the previous one was introduced into the element-stream).
This completes the extraction of header/footers and improving them to account for subtleties of linked-to-previous and whether first-page and/or different-odd-even settings are activated. Retire now unused `._get_headers_and_footers()` and `._join_paragraphs()`.
This approach needs some refinement, but this extraction localizes any changes required to this new method.
* Handle situation where `pypandoc` is not installed with a specific error message rather than something I expect is obscure. * Clarify logic for getting `filename_no_path` and resolve "filename_no_path is possibly unbound" lint error.
I didn't touch any of these Python files in my PR which makes me think either the linting changed recently or PRs that fail CI have been getting merged. Anyway, happy to fix them up for the greater good :)
01df5e3
to
6f3f017
Compare
Reviewers: I recommend reviewing commit-by-commit or just looking at the final version of
partition/docx.py
as View File.This refactor solves a few problems but mostly lays the groundwork to allow us to refine further aspects such as page-break detection, list-item detection, and moving python-docx internals upstream to that library so our work doesn't depend on that domain-knowledge.