docx refactor #1422

scanny · 2023-09-14T18:17:48Z

Reviewers: I recommend reviewing commit-by-commit or just looking at the final version of partition/docx.py as View File.

This refactor solves a few problems, but mostly it lays the groundwork for refining further aspects such as page-break detection and list-item detection, and for moving python-docx internals upstream to that library so our work doesn't depend on that domain knowledge.

unstructured/partition/docx.py

Klaijan · 2023-09-15T19:40:03Z

All looks good to me from the walkthrough and local tests. I still see some test failures for the formatting, but those could be easily fixed.

unstructured/partition/docx.py

Klaijan

LGTM

qued

LGTM, just one suggestion related to consistency with how we've been notifying users when they are missing a dependency.

qued · 2023-09-18T19:58:31Z

unstructured/partition/docx.py

@@ -103,35 +103,44 @@ def convert_and_partition_docx(
        Determines whether or not metadata is included in the metadata attribute on the elements in
        the output.
    """
-    if filename is None:
-        filename = ""
+    if "pypandoc" not in globals():


Could this be handled by the requires_dependencies decorator? Feel free to suggest improvements there, but that would be consistent with the rest of our codebase.

Ah, good to know, thanks @qued. I'll check that out and replace this :)
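For readers unfamiliar with the pattern, a dependency-guard decorator of the kind suggested above could look roughly like the sketch below. This is not the codebase's actual `requires_dependencies` implementation; the name `requires_dependencies_sketch`, its signature, and the error message are assumptions made for illustration.

```python
import functools
import importlib.util
from typing import Callable, List


def requires_dependencies_sketch(dependencies: List[str]) -> Callable:
    """Hypothetical decorator: fail with a clear message when a package is missing."""

    def decorator(func: Callable) -> Callable:
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            # find_spec() returns None when a top-level module cannot be located.
            missing = [d for d in dependencies if importlib.util.find_spec(d) is None]
            if missing:
                raise ImportError(
                    f"Missing required dependencies: {', '.join(missing)}. "
                    "Please install them and try again."
                )
            return func(*args, **kwargs)

        return wrapper

    return decorator


@requires_dependencies_sketch(["json"])  # stdlib module, so the check passes
def convert_stub() -> str:
    return "converted"
```

Checking at call time rather than import time keeps the module importable when the optional package is absent, which is the main appeal over a `globals()` lookup.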

Copy `partition_docx()` as-is as a method of a new class `_DocxPartitioner`. In the next step the original implementation will be replaced with a call to `_DocxPartitioner` that will be the basis for further refactoring. Doing it this way makes the commit-diff a lot easier to follow.

Replace the `partition_docx()` implementation with a call to the main classmethod of `_DocxPartitioner`.

Next step is to make parameters available to all methods so we can extract methods without having to know exactly what their implementation needs.

Morph the "pasted-as-is" `._partition_docx()` method of `_DocxPartitioner` into `._iter_document_elements()`, primarily by changing its return type to `Iterator[Element]` rather than `List[Element]`.
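The three steps above (delegating to a classmethod, storing parameters on the instance, and yielding elements instead of returning a list) can be sketched as follows. All names here (`_PartitionerSketch`, `iter_elements`, `partition_sketch`) are hypothetical, and strings stand in for real element objects; this is not the actual `_DocxPartitioner` code.

```python
from typing import Iterator, List, Optional


class _PartitionerSketch:
    """Hypothetical stand-in for a partitioner class; not the real `_DocxPartitioner`."""

    def __init__(self, filename: Optional[str], include_page_breaks: bool) -> None:
        # Parameters are stored once so any extracted method can read them.
        self._filename = filename
        self._include_page_breaks = include_page_breaks

    @classmethod
    def iter_elements(
        cls, filename: Optional[str] = None, include_page_breaks: bool = True
    ) -> Iterator[str]:
        """Main entry point: build an instance, then delegate to it."""
        return cls(filename, include_page_breaks)._iter_elements()

    def _iter_elements(self) -> Iterator[str]:
        # Placeholder content; a real implementation would walk the document.
        yield f"Title from {self._filename}"
        if self._include_page_breaks:
            yield "PageBreak"


def partition_sketch(
    filename: Optional[str] = None, include_page_breaks: bool = True
) -> List[str]:
    # The public function keeps its list-returning contract.
    return list(_PartitionerSketch.iter_elements(filename, include_page_breaks))
```

Because helper methods read `self._filename` and friends directly, extracting a new method later does not require threading arguments through every call.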

Opening the provided file with `python-docx` to produce a `Document` object is a separate concern from processing the document contents. Also, other methods should be free to access the document for other purposes, like accessing the document settings. Extract opening the document to a lazyproperty so that when and who actually triggers the document to be opened is of no concern and all callers get the same instance of the document object.

No need for this to be a module-level function. Bring it into the class as a method. This associates it clearly with `_DocxPartitioner` which is its only caller.

This only needs to be computed once and will be referenced from multiple methods. Extract it to a lazyproperty.

Computing the last-modified date for use in element metadata is a distinct concern and we are likely to improve it going forward to query the docx document-properties. Extract it to a lazyproperty.

Detecting whether a paragraph is a list-item is a distinct concern and will need to become substantially more sophisticated than it is so far. Extract it to its own method to encapsulate the concern.

Incorporate module-level function into _DocxPartitioner as it is its only caller.

This will be used shortly to replace part of `_paragraph_to_element()` which we will be decomposing in the next couple commits.

This will shortly replace `_text_to_element()`.

Separate concern, separate method.

This completes the recomposition of paragraph-partitioning. Remove now-unused `_paragraph_to_element()` and `_text_to_element()`.

Also remove now-unused module-level emphasis functions.

Handling a table extracts nicely now that we have the supporting methods decoupled from document traversal.

This belongs upstream but will live here for now to allow us to stream the document blocks (paragraphs and tables) section-by-section in document order.

Update document traversal to go section-by-section, then block-by-block within each section. This lays the groundwork for the rest of the extractions.
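As a rough illustration of that traversal order, the sketch below walks sections first and then the blocks inside each one. `SectionSketch` and `iter_document_blocks` are hypothetical names with plain strings standing in for paragraphs and tables; the real `_SectBlockIterator` works on the underlying document XML and is not reproduced here.

```python
from dataclasses import dataclass, field
from typing import Iterator, List


@dataclass
class SectionSketch:
    """Hypothetical section holding its blocks in document order."""

    name: str
    blocks: List[str] = field(default_factory=list)  # e.g. "paragraph" or "table"


def iter_document_blocks(sections: List[SectionSketch]) -> Iterator[str]:
    for section in sections:
        # Section-level output (headers, page breaks) would be emitted here.
        yield f"section-start:{section.name}"
        for block in section.blocks:
            # Each block would be dispatched by kind (paragraph vs. table).
            yield f"{block}:{section.name}"
```

Having an explicit per-section step is what gives later commits a place to hook in section page breaks, headers, and footers.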

Add detection and emission of PageBreak elements produced by document sections. A docx section can and perhaps often does start on a new page and can even give rise to two page breaks, for example to move from an odd page to the next odd page.
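The parity rule described above can be sketched as follows. The `SectionStart` enum is a local stand-in rather than the python-docx enumeration, the function name is hypothetical, and the sketch assumes the section is not the first one in the document.

```python
from enum import Enum
from typing import Iterator


class SectionStart(Enum):
    """Local stand-in for a section start type; values are illustrative."""

    CONTINUOUS = "continuous"
    NEW_PAGE = "new_page"
    EVEN_PAGE = "even_page"
    ODD_PAGE = "odd_page"


def iter_section_page_breaks(current_page: int, start: SectionStart) -> Iterator[str]:
    """Yield one "PageBreak" marker per page boundary crossed to start the section."""
    if start is SectionStart.CONTINUOUS:
        return
    # Moving to the next page always costs one break.
    yield "PageBreak"
    next_page = current_page + 1
    wants_odd = start is SectionStart.ODD_PAGE
    wants_even = start is SectionStart.EVEN_PAGE
    # A second break is needed when the next page has the wrong parity.
    if (wants_odd and next_page % 2 == 0) or (wants_even and next_page % 2 == 1):
        yield "PageBreak"
```

For example, a section that must start on an odd page while the current page is 1 crosses page 2 and lands on page 3, so it yields two markers.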

This partially replaces `_get_headers_and_footers()` but we can't remove that until we add `._iter_section_footers()` in the next commit. Meanwhile, this improved implementation accounts for "linked-to-previous" headers, which should not emit a Header element (it hasn't changed since the previous one was introduced into the element-stream).
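A simplified sketch of the linked-to-previous rule: a header is emitted only when its section defines its own header. `HeaderSketch` is a hypothetical stand-in whose `is_linked_to_previous` field mirrors the attribute of the same name on python-docx header objects; the function below is illustrative and is not the PR's implementation.

```python
from dataclasses import dataclass
from typing import Iterator, List


@dataclass
class HeaderSketch:
    """Hypothetical stand-in for one section's header."""

    text: str
    is_linked_to_previous: bool


def iter_section_headers(headers: List[HeaderSketch]) -> Iterator[str]:
    for header in headers:
        # A linked header repeats the previous section's header, so skip it.
        if header.is_linked_to_previous:
            continue
        text = header.text.strip()
        if text:
            yield text
```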

This completes the extraction of headers and footers and improves them to account for the subtleties of linked-to-previous and whether first-page and/or different-odd-even settings are activated. Retire now-unused `._get_headers_and_footers()` and `._join_paragraphs()`.

This approach needs some refinement, but this extraction localizes any changes required to this new method.

* Handle situation where `pypandoc` is not installed with a specific error message rather than something I expect is obscure.
* Clarify logic for getting `filename_no_path` and resolve "filename_no_path is possibly unbound" lint error.

I didn't touch any of these Python files in my PR which makes me think either the linting changed recently or PRs that fail CI have been getting merged. Anyway, happy to fix them up for the greater good :)

scanny force-pushed the scanny/docx-rfctr branch from f9f0d34 to c8d5a0e Compare September 14, 2023 18:33

Klaijan reviewed Sep 14, 2023

scanny force-pushed the scanny/docx-rfctr branch from c8d5a0e to 6d08426 Compare September 14, 2023 20:20

scanny marked this pull request as ready for review September 14, 2023 20:34

Klaijan reviewed Sep 14, 2023

scanny force-pushed the scanny/docx-rfctr branch from 6d08426 to 36dcb4e Compare September 15, 2023 03:07

Klaijan requested a review from qued September 15, 2023 19:36

scanny force-pushed the scanny/docx-rfctr branch 16 times, most recently from f840e6f to 3b8f4d8 Compare September 17, 2023 23:17

Klaijan reviewed Sep 18, 2023

Klaijan approved these changes Sep 18, 2023

qued approved these changes Sep 18, 2023

scanny force-pushed the scanny/docx-rfctr branch 3 times, most recently from 3769abc to d7962cc Compare September 19, 2023 21:25

scanny added 27 commits September 19, 2023 14:28

rfctr: replace partition_docx implementation ...

b7263c1

... with a call to the main classmethod of _DocxPartitioner.

rfctr: move parameters to instance variables

eb86714

rfctr: add _DocxPartitioner._iter_document_elements()

df6e763

rfctr: incorporate ._element_contains_pagebreak()

28e5039

rfctr: extract ._document_contains_pagebreaks

fe520a0

rfctr: extract ._last_modified (date)

1d7e9f8

rfctr: extract ._page_number, ._increment_page_number

0ae2f6d

rfctr: extract ._is_list_item()

faca034

rfctr: incorporate ._iter_paragraph_emphasis()

21ffa9f

rfctr: extract ._paragraph_emphasis

61d3c8b

rfctr: incorporate ._style_based_element_type()

7b7c191

rfctr: incorporate ._parse_paragraph_text_for_element_type()

bf89994

rfctr: extract ._paragraph_metadata()

590f5f2

rfctr: extract ._iter_paragraph_elements()

a27fb64

rfctr: extract ._table_emphasis()

ee4c814

rfctr: extract ._iter_table_element()

8c0b6c6

docx: add _SectBlockIterator

0583d01

rfctr: install _SectBlockIterator

056c67e

docx: add ._iter_section_page_breaks()

3118e8c

rfctr: extract ._iter_maybe_paragraph_page_breaks()

230ead6

rfctr: improve convert_and_partition_docx()

c1eddb3

pr: add CHANGELOG entry for this PR

6f3f017

fix: CI lint complaints

4d3ac35

scanny force-pushed the scanny/docx-rfctr branch from 01df5e3 to 6f3f017 Compare September 19, 2023 21:29

ryannikolaidis merged commit b54994a into main Sep 19, 2023

ryannikolaidis deleted the scanny/docx-rfctr branch September 19, 2023 22:32
