Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

docx refactor #1422

Merged
merged 34 commits into from
Sep 19, 2023
Merged

docx refactor #1422

merged 34 commits into from
Sep 19, 2023

Conversation

scanny
Copy link
Collaborator

@scanny scanny commented Sep 14, 2023

Reviewers: I recommend reviewing commit-by-commit or just looking at the final version of partition/docx.py as View File.

This refactor solves a few problems but mostly lays the groundwork to allow us to refine further aspects such as page-break detection, list-item detection, and moving python-docx internals upstream to that library so our work doesn't depend on that domain-knowledge.

@scanny scanny marked this pull request as ready for review September 14, 2023 20:34
@Klaijan Klaijan requested a review from qued September 15, 2023 19:36
@Klaijan
Copy link
Contributor

Klaijan commented Sep 15, 2023

All looks good to me from the walk through and local tests. I still do see some test failures for the formatting, but that could be easily fix.

@scanny scanny force-pushed the scanny/docx-rfctr branch 16 times, most recently from f840e6f to 3b8f4d8 Compare September 17, 2023 23:17
Copy link
Contributor

@Klaijan Klaijan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Contributor

@qued qued left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, just one suggestion related to consistency with how we've been notifying users when they are missing a dependency.

@@ -103,35 +103,44 @@ def convert_and_partition_docx(
Determines whether or not metadata is included in the metadata attribute on the elements in
the output.
"""
if filename is None:
filename = ""
if "pypandoc" not in globals():
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could this be handled by the requires_dependencies decorator? Feel free to suggest improvements there, but that would be consistent with the rest of our codebase.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, good to know, thanks @qued I'll check that out and replace this :)

@scanny scanny force-pushed the scanny/docx-rfctr branch 3 times, most recently from 3769abc to d7962cc Compare September 19, 2023 21:25
Copy `partition_docx()` as-is as a method of a new class
`_DocxPartitioner`. In the next step the original implementation will be
replaced with a call to `_DocxPartitioner` that will be the basis for
further refactoring.

Doing it this way makes the commit-diff a lot easier to follow.
... with a call to the main classmethod of _DocxPartitioner.
Next step is to make parameters available to all methods so we can
extract methods without having to know exactly what their implementation
needs.
Morph the "pasted-as-is" `._partion_docx()` method of _DocxPartitioner
into `._iter_document_elements()`, primarily by changing its signature
to -> Iterator[Element] rather than List[Element].
Opening the provided file with `python-docx` to produce a `Document
object` is a separate concern from processing the document contents.
Also, other methods should be free to access the document for other
purposes like accessing the document settings, etc.

Extract opening the document to a lazyproperty so when and who actually
triggers the document to be opened is of no concern and all callers get
the same instance of the document object.
No need for this to be a module-level function. Bring it into the class
as a method. This associates it clearly with `_DocxPartitioner` which is
its only caller.
This only needs to be computed once and will be referenced from multiple
methods. Extract it to a lazyproperty.
Computing the last-modified date for use in element metadata is a
distinct concern and we are likely to improve it going forward to query
the docx document-properties. Extract it to a lazyproperty.
Detecting whether a paragraph is a list-item is a distinct concern and
will need to become substantially more sophisticated than it is so far.
Extract it to it's own method to encapsulate the concern.
Incorporate module-level function into _DocxPartitioner as it is its
only caller.
This will be used shortly to replace part of `_paragraph_to_element()`
which we will be decomposing in the next couple commits.
This will shortly replace `_text_to_element()`.
Separate concern, separate method.
This completes the recomposition of paragraph-partitioning. Remove now
unused `_paragraph_to_element()` and `_text_to_element()`
Also remove now-unused module-level emphasis functions.
Handling a table extracts nicely now we have the supporting methods
decoupled from document traversal.
This belongs upstream but will live here for now to allow us to stream
the document blocks (paragraphs and tables) section-by-section in
document order.
Update document traversal to go section-by-section, then block-by-block
within each section. This lays the groundwork for the rest of the
extractions.
Add detection and emission of PageBreak elements produced by document
sections. A docx section can and perhaps often does start on a new page
and can even give rise to two page breaks, for example to move from an
odd page to the next odd page.
This partially replaces `_get_headers_and_footers()` but we can't remove
that until we add `._iter_section_footers()` in the next commit.
Meanwhile, this improved implementation accounts for
"linked-to-previous" headers, which should not emit a Header element (it
hasn't changed since the previous one was introduced into the
element-stream).
This completes the extraction of header/footers and improving them to
account for subtleties of linked-to-previous and whether first-page
and/or different-odd-even settings are activated.

Retire now unused `._get_headers_and_footers()` and
`._join_paragraphs()`.
This approach needs some refinement, but this extraction localizes any
changes required to this new method.
* Handle situation where `pypandoc` is not installed with a specific
  error message rather than something I expect is obscure.
* Clarify logic for getting `filename_no_path` and resolve
  "filename_no_path is possibly unbound" lint error.
I didn't touch any of these Python files in my PR which makes me think
either the linting changed recently or PRs that fail CI have been
getting merged. Anyway, happy to fix them up for the greater good :)
@ryannikolaidis ryannikolaidis merged commit b54994a into main Sep 19, 2023
@ryannikolaidis ryannikolaidis deleted the scanny/docx-rfctr branch September 19, 2023 22:32
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants