rfctr: docx partitioning (#1422)
Reviewers: I recommend reviewing commit-by-commit, or just looking at the
final version of `partition/docx.py` via View File.

This refactor solves a few problems, but mostly it lays the groundwork for
refining further aspects such as page-break detection and list-item
detection, and for moving python-docx internals upstream to that library
so our work doesn't depend on that domain knowledge.
scanny authored Sep 19, 2023
1 parent 9a3e24f commit b54994a
Showing 61 changed files with 1,286 additions and 434 deletions.
5 changes: 4 additions & 1 deletion .gitignore
@@ -132,6 +132,9 @@ dmypy.json
# Pyre type checker
.pyre/

# pyright (Python LSP/type-checker in VSCode) config
/pyrightconfig.json

# ingest outputs
/structured-output

@@ -194,4 +197,4 @@ unstructured-inference/
example-docs/*_images
examples/**/output/

outputdiff.txt
outputdiff.txt
1 change: 1 addition & 0 deletions CHANGELOG.md
@@ -4,6 +4,7 @@
### Enhancements

* **Adds data source properties to Airtable, Confluence, Discord, Elasticsearch, Google Drive, and Wikipedia connectors** These properties (date_created, date_modified, version, source_url, record_locator) are written to element metadata during ingest, mapping elements to information about the document source from which they derive. This functionality enables downstream applications to reveal source document applications, e.g. a link to a GDrive doc, Salesforce record, etc.
* **DOCX partitioner refactored in preparation for enhancement.** Behavior should be unchanged except in multi-section documents containing different headers/footers for different sections. These will now emit all distinct headers and footers encountered instead of just those for the last section.

### Features

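The behavior change in the CHANGELOG entry above can be pictured with a small stand-alone sketch. It does not use python-docx or the partitioner's actual code: `Section`, `last_section_header`, and `distinct_headers` are hypothetical names, used only to contrast "last section wins" with "emit every distinct header in document order".

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Section:
    """Hypothetical stand-in for a document section; not a python-docx class."""

    header_text: Optional[str]  # None models a section with no header text


def last_section_header(sections: List[Section]) -> List[str]:
    """Old behavior as described: only the final section's header is reported."""
    if not sections or sections[-1].header_text is None:
        return []
    return [sections[-1].header_text]


def distinct_headers(sections: List[Section]) -> List[str]:
    """New behavior as described: every distinct header, in first-seen order."""
    seen: List[str] = []
    for section in sections:
        text = section.header_text
        if text is not None and text not in seen:
            seen.append(text)
    return seen


sections = [Section("Cover"), Section("Chapter 1"), Section("Chapter 1"), Section("Appendix")]
print(last_section_header(sections))  # ['Appendix']
print(distinct_headers(sections))  # ['Cover', 'Chapter 1', 'Appendix']
```

The same contrast would apply to footers; how the real partitioner decides that two headers are "distinct" is not shown in this diff, so plain text equality is an assumption here.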
2 changes: 1 addition & 1 deletion Makefile
@@ -324,7 +324,7 @@ check: check-src check-tests check-version
## check-src: runs linters (source only, no tests)
.PHONY: check-src
check-src:
ruff . --select I,UP015,UP032,UP034,UP018,COM,C4,PT,SIM,PLR0402 --ignore PT011,PT012,SIM117
ruff . --select I,UP015,UP032,UP034,UP018,COM,C4,PT,SIM,PLR0402 --ignore COM812,PT011,PT012,SIM117
black --line-length 100 ${PACKAGE_NAME} --check
flake8 ${PACKAGE_NAME}
mypy ${PACKAGE_NAME} --ignore-missing-imports --check-untyped-defs
2 changes: 1 addition & 1 deletion docs/source/introduction/getting_started.rst
@@ -34,7 +34,7 @@ After installation, confirm the setup by executing the below Python code:
.. code-block:: python
from unstructured.partition.auto import partition
elements = partition(filename="example-docs/fake-email.eml")
elements = partition(filename="example-docs/eml/fake-email.eml")
If you've opted for the "local-inference" installation, you should also be able to execute:

9 changes: 5 additions & 4 deletions docs/source/metadata.rst
@@ -26,12 +26,13 @@ Some document types support location data for the elements, usually in the form
If it exists, an element's location data is available with ``element.metadata.coordinates``.

The ``coordinates`` property of an ``ElementMetadata`` stores:

* points: These specify the corners of the bounding box starting from the top left corner and
proceeding counter-clockwise. The points represent pixels, the origin is in the top left and
the ``y`` coordinate increases in the downward direction.
proceeding counter-clockwise. The points represent pixels, the origin is in the top left and
the ``y`` coordinate increases in the downward direction.
* system: The points have an associated coordinate system. A typical example of a coordinate system is
``PixelSpace``, which is used for representing the coordinates of images. The coordinate system has a
name, orientation, layout width, and layout height.
``PixelSpace``, which is used for representing the coordinates of images. The coordinate system has a
name, orientation, layout width, and layout height.

Information about the element’s coordinates (including the coordinate system name, coordinate points,
the layout width, and the layout height) can be accessed with `element.to_dict()["metadata"]["coordinates"]`.
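
The corner ordering described in the `points` bullet can be sketched as follows. This is an illustration only: `bbox_to_points` is a hypothetical helper and not part of the library; it assumes a box given by its top-left `(x1, y1)` and bottom-right `(x2, y2)` corners in a pixel space where `y` grows downward.

```python
from typing import Tuple

Point = Tuple[float, float]


def bbox_to_points(x1: float, y1: float, x2: float, y2: float) -> Tuple[Point, Point, Point, Point]:
    """Return the four corners starting at the top left and going counter-clockwise.

    With the origin at the top left and y increasing downward, counter-clockwise
    on screen means: top-left, bottom-left, bottom-right, top-right.
    """
    return ((x1, y1), (x1, y2), (x2, y2), (x2, y1))


points = bbox_to_points(10, 20, 110, 70)
print(points)  # ((10, 20), (10, 70), (110, 70), (110, 20))
```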
25 changes: 25 additions & 0 deletions pyproject.toml
@@ -0,0 +1,25 @@
[tool.black]
line-length = 100

[tool.ruff]
line-length = 100
select = [
"C4", # -- flake8-comprehensions --
"COM", # -- flake8-commas --
"E", # -- pycodestyle errors --
"F", # -- pyflakes --
"I", # -- isort (imports) --
"PLR0402", # -- Name compared with itself like `foo == foo` --
"PT", # -- flake8-pytest-style --
"SIM", # -- flake8-simplify --
"UP015", # -- redundant `open()` mode parameter (like "r" is default) --
"UP018", # -- Unnecessary {literal_type} call like `str("abc")`. (rewrite as a literal) --
"UP032", # -- Use f-string instead of `.format()` call --
"UP034", # -- Avoid extraneous parentheses --
]
ignore = [
"COM812", # -- over aggressively insists on trailing commas where not desireable --
"PT011", # -- pytest.raises({exc}) too broad, use match param or more specific exception --
"PT012", # -- pytest.raises() block should contain a single simple statement --
"SIM117", # -- merge `with` statements for context managers that have same scope --
]
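
For readers unfamiliar with the rule codes, a few of the selected rules can be illustrated with before/after pairs. The rule descriptions come from the comments in the config above; the snippets themselves are illustrative and are not taken from this repository.

```python
import os
import tempfile

name = "docx"

# UP032: prefer an f-string over a `.format()` call.
before_up032 = "partition_{}".format(name)
after_up032 = f"partition_{name}"

# UP018: write the literal instead of wrapping it in its own type.
before_up018 = str("abc")
after_up018 = "abc"

# C4 (flake8-comprehensions): use a list comprehension instead of list(<generator>).
before_c4 = list(n * 2 for n in range(3))
after_c4 = [n * 2 for n in range(3)]

# UP015: "r" is already the default mode for open(), so it can be dropped.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("hello")
path = tmp.name
with open(path, "r") as fh:  # flagged form
    before_up015 = fh.read()
with open(path) as fh:  # preferred form
    after_up015 = fh.read()
os.remove(path)

# Each rewrite is behavior-preserving.
assert before_up032 == after_up032
assert before_up018 == after_up018
assert before_c4 == after_c4
assert before_up015 == after_up015
```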
19 changes: 5 additions & 14 deletions scripts/collect_env.py
@@ -40,7 +40,7 @@ def get_os_version():
return platform.platform()


def is_python_package_installed(package_name):
def is_python_package_installed(package_name: str):
"""
Check if a Python package is installed
@@ -57,14 +57,10 @@ def is_python_package_installed(package_name):
check=True,
)

for line in result.stdout.splitlines():
if line.lower().startswith(package_name.lower()):
return True

return False
return any(line.lower().startswith(package_name.lower()) for line in result.stdout.splitlines())


def is_brew_package_installed(package_name):
def is_brew_package_installed(package_name: str):
"""
Check if a Homebrew package is installed
@@ -95,11 +91,7 @@ def is_brew_package_installed(package_name):
check=True,
)

for line in result.stdout.splitlines():
if line.lower().startswith(package_name.lower()):
return True

return False
return any(line.lower().startswith(package_name.lower()) for line in result.stdout.splitlines())


def get_python_package_version(package_name):
@@ -221,8 +213,7 @@ def main():
):
print(
"PaddleOCR version: ",
get_python_package_version("paddlepaddle")
or get_python_package_version("paddleocr"),
get_python_package_version("paddlepaddle") or get_python_package_version("paddleocr"),
)
else:
print("PaddleOCR is not installed")
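The loop-to-`any()` rewrite in `scripts/collect_env.py` is behavior-preserving, which a small stand-alone check can demonstrate. The function names and the sample listing below are made up for illustration; only the loop body and the `any()` expression mirror the diff.

```python
from typing import Iterable


def startswith_loop(lines: Iterable[str], package_name: str) -> bool:
    """Original shape: explicit loop with an early return."""
    for line in lines:
        if line.lower().startswith(package_name.lower()):
            return True
    return False


def startswith_any(lines: Iterable[str], package_name: str) -> bool:
    """Refactored shape: any() over a generator, which also stops at the first match."""
    return any(line.lower().startswith(package_name.lower()) for line in lines)


# Made-up package listing, one "name version" pair per line.
listing = "Pillow 10.0.0\npython-docx 0.8.11\nlxml 4.9.3".splitlines()
for name in ("python-docx", "PILLOW", "paddleocr"):
    assert startswith_loop(listing, name) == startswith_any(listing, name)
    print(name, startswith_any(listing, name))
```

Both shapes are prefix matches, so a query such as `docx` would not match the `python-docx` line in either version; the refactor does not change that.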
6 changes: 1 addition & 5 deletions scripts/performance/run_partition.py
@@ -13,11 +13,7 @@

file_path = sys.argv[1]
strategy = sys.argv[2]
model_name = None
if len(sys.argv) > 3:
model_name = sys.argv[3]
else:
model_name = os.environ.get("PARTITION_MODEL_NAME")
model_name = sys.argv[3] if len(sys.argv) > 3 else os.environ.get("PARTITION_MODEL_NAME")
result = partition(file_path, strategy=strategy, model_name=model_name)
# access element in the return value to make sure we got something back, otherwise error
result[1]
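
The conditional expression that replaces the `if`/`else` block can be exercised in isolation. `resolve_model_name` is a hypothetical wrapper written for this sketch (the script itself works on `sys.argv` and `os.environ` directly), and the argument values are placeholders.

```python
import os
from typing import List, Mapping, Optional


def resolve_model_name(argv: List[str], env: Optional[Mapping[str, str]] = None) -> Optional[str]:
    """The fourth argv entry wins; otherwise fall back to PARTITION_MODEL_NAME."""
    env = os.environ if env is None else env
    return argv[3] if len(argv) > 3 else env.get("PARTITION_MODEL_NAME")


# Placeholder values, chosen only to show the three possible outcomes.
print(resolve_model_name(["run_partition.py", "doc.pdf", "hi_res", "model-a"], {}))  # model-a
print(resolve_model_name(["run_partition.py", "doc.pdf", "hi_res"], {"PARTITION_MODEL_NAME": "model-b"}))  # model-b
print(resolve_model_name(["run_partition.py", "doc.pdf", "hi_res"], {}))  # None
```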
2 changes: 2 additions & 0 deletions setup.cfg
@@ -7,3 +7,5 @@ max-line-length = 100
[tool:pytest]
filterwarnings =
ignore::DeprecationWarning
python_classes = Test Describe
python_functions = test_ it_ they_ but_ and_
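
With these two settings, pytest also collects classes whose names start with `Describe` and functions or methods whose names start with `it_`, `they_`, `but_`, or `and_`. A minimal sketch of a test module written in that style follows; `word_count` and the class are invented for illustration and do not appear in this commit.

```python
def word_count(text: str) -> int:
    """Tiny function under test, defined here only to keep the sketch self-contained."""
    return len(text.split())


class DescribeWordCount:
    """Collected because the class name starts with `Describe`."""

    def it_counts_whitespace_separated_words(self):
        assert word_count("I am a normal text.") == 5

    def but_it_returns_zero_for_an_empty_string(self):
        assert word_count("") == 0

    def and_it_ignores_repeated_spaces(self):
        assert word_count("bold   italic") == 2
```

Under pytest's default settings, neither the class nor its methods would be collected, since the defaults only match `Test`-prefixed classes and `test`-prefixed functions.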
60 changes: 26 additions & 34 deletions test_unstructured/partition/docx/test_docx.py
@@ -1,5 +1,8 @@
# pyright: reportPrivateUsage=false

import os
from tempfile import SpooledTemporaryFile
from typing import Dict, List

import docx
import pytest
@@ -16,12 +19,7 @@
Title,
)
from unstructured.partition.doc import partition_doc
from unstructured.partition.docx import (
_extract_contents_and_tags,
_get_emphasized_texts_from_paragraph,
_get_emphasized_texts_from_table,
partition_docx,
)
from unstructured.partition.docx import _DocxPartitioner, partition_docx
from unstructured.partition.json import partition_json
from unstructured.staging.base import elements_to_json

@@ -316,52 +314,46 @@ def test_partition_docx_from_file_without_metadata_date(
assert elements[0].metadata.last_modified is None


def test_get_emphasized_texts_from_paragraph(
expected_emphasized_texts,
filename="example-docs/fake-doc-emphasized-text.docx",
):
document = docx.Document(filename)
paragraph = document.paragraphs[1]
emphasized_texts = _get_emphasized_texts_from_paragraph(paragraph)
def test_get_emphasized_texts_from_paragraph(expected_emphasized_texts: List[Dict[str, str]]):
partitioner = _DocxPartitioner(
"example-docs/fake-doc-emphasized-text.docx", None, None, False, None
)
paragraph = partitioner._document.paragraphs[1]
emphasized_texts = list(partitioner._iter_paragraph_emphasis(paragraph))
assert paragraph.text == "I am a bold italic bold-italic text."
assert emphasized_texts == expected_emphasized_texts

paragraph = document.paragraphs[2]
emphasized_texts = _get_emphasized_texts_from_paragraph(paragraph)
paragraph = partitioner._document.paragraphs[2]
emphasized_texts = list(partitioner._iter_paragraph_emphasis(paragraph))
assert paragraph.text == ""
assert emphasized_texts == []

paragraph = document.paragraphs[3]
emphasized_texts = _get_emphasized_texts_from_paragraph(paragraph)
paragraph = partitioner._document.paragraphs[3]
emphasized_texts = list(partitioner._iter_paragraph_emphasis(paragraph))
assert paragraph.text == "I am a normal text."
assert emphasized_texts == []


def test_get_emphasized_texts_from_table(
expected_emphasized_texts,
filename="example-docs/fake-doc-emphasized-text.docx",
):
document = docx.Document(filename)
table = document.tables[0]
emphasized_texts = _get_emphasized_texts_from_table(table)
def test_iter_table_emphasis(expected_emphasized_texts: List[Dict[str, str]]):
partitioner = _DocxPartitioner(
"example-docs/fake-doc-emphasized-text.docx", None, None, False, None
)
table = partitioner._document.tables[0]
emphasized_texts = list(partitioner._iter_table_emphasis(table))
assert emphasized_texts == expected_emphasized_texts


def test_extract_contents_and_tags(
expected_emphasized_texts,
expected_emphasized_text_contents,
expected_emphasized_text_tags,
def test_table_emphasis(
expected_emphasized_text_contents: List[str], expected_emphasized_text_tags: List[str]
):
emphasized_text_contents, emphasized_text_tags = _extract_contents_and_tags(
expected_emphasized_texts,
partitioner = _DocxPartitioner(
"example-docs/fake-doc-emphasized-text.docx", None, None, False, None
)
table = partitioner._document.tables[0]
emphasized_text_contents, emphasized_text_tags = partitioner._table_emphasis(table)
assert emphasized_text_contents == expected_emphasized_text_contents
assert emphasized_text_tags == expected_emphasized_text_tags

emphasized_text_contents, emphasized_text_tags = _extract_contents_and_tags([])
assert emphasized_text_contents is None
assert emphasized_text_tags is None


@pytest.mark.parametrize(
("filename", "partition_func"),
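The updated tests consume emphasis through generators (`list(partitioner._iter_paragraph_emphasis(paragraph))`) and through a method that returns contents and tags as two parallel lists. The sketch below shows one way such a pair could fit together. It is not the `_DocxPartitioner` implementation: `Run`, `iter_emphasis`, and `split_emphasis` are hypothetical, and the `"text"`/`"tag"` keys with `"b"`/`"i"` values are assumptions, since the diff only shows the type `List[Dict[str, str]]`.

```python
from dataclasses import dataclass
from typing import Dict, Iterator, List, Tuple


@dataclass
class Run:
    """Hypothetical stand-in for a run of text with uniform formatting."""

    text: str
    bold: bool = False
    italic: bool = False


def iter_emphasis(runs: List[Run]) -> Iterator[Dict[str, str]]:
    """Yield one dict per emphasis applied to each non-blank run (assumed shape)."""
    for run in runs:
        text = run.text.strip()
        if not text:
            continue
        if run.bold:
            yield {"text": text, "tag": "b"}
        if run.italic:
            yield {"text": text, "tag": "i"}


def split_emphasis(runs: List[Run]) -> Tuple[List[str], List[str]]:
    """Split the emphasis dicts into parallel lists of contents and tags."""
    items = list(iter_emphasis(runs))
    return [item["text"] for item in items], [item["tag"] for item in items]


# Runs modeled on the paragraph text asserted in the test above.
runs = [
    Run("I am a "),
    Run("bold", bold=True),
    Run(" "),
    Run("italic", italic=True),
    Run(" "),
    Run("bold-italic", bold=True, italic=True),
    Run(" text."),
]
contents, tags = split_emphasis(runs)
print(contents)  # ['bold', 'italic', 'bold-italic', 'bold-italic']
print(tags)  # ['b', 'i', 'b', 'i']
```

A generator keeps the per-run logic in one place and lets callers decide whether they want the dicts or the two flattened lists.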
3 changes: 3 additions & 0 deletions typings/docx/__init__.pyi
@@ -0,0 +1,3 @@
from docx.api import Document

__all__ = ["Document"]
5 changes: 5 additions & 0 deletions typings/docx/api.pyi
@@ -0,0 +1,5 @@
from typing import BinaryIO, Optional, Union

import docx.document

def Document(docx: Optional[Union[str, BinaryIO]] = None) -> docx.document.Document: ...
12 changes: 12 additions & 0 deletions typings/docx/blkcntnr.pyi
@@ -0,0 +1,12 @@
from typing import Sequence

from docx.oxml.xmlchemy import BaseOxmlElement
from docx.table import Table
from docx.text.paragraph import Paragraph

class BlockItemContainer:
_element: BaseOxmlElement
@property
def paragraphs(self) -> Sequence[Paragraph]: ...
@property
def tables(self) -> Sequence[Table]: ...
22 changes: 22 additions & 0 deletions typings/docx/document.pyi
@@ -0,0 +1,22 @@
# pyright: reportPrivateUsage=false

from typing import BinaryIO, Optional, Union

from docx.blkcntnr import BlockItemContainer
from docx.oxml.document import CT_Document
from docx.section import Sections
from docx.settings import Settings
from docx.styles.style import _ParagraphStyle
from docx.text.paragraph import Paragraph

class Document(BlockItemContainer):
def add_paragraph(
self, text: str = "", style: Optional[Union[_ParagraphStyle, str]] = None
) -> Paragraph: ...
@property
def element(self) -> CT_Document: ...
def save(self, path_or_stream: Union[str, BinaryIO]) -> None: ...
@property
def sections(self) -> Sections: ...
@property
def settings(self) -> Settings: ...
11 changes: 11 additions & 0 deletions typings/docx/enum/section.pyi
@@ -0,0 +1,11 @@
import enum

class WD_SECTION_START(enum.Enum):
CONTINUOUS: enum.Enum
EVEN_PAGE: enum.Enum
NEW_COLUMN: enum.Enum
NEW_PAGE: enum.Enum
ODD_PAGE: enum.Enum

# -- alias --
WD_SECTION = WD_SECTION_START
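
The `# -- alias --` line in the stub declares a second module-level name for the same enum class. A stand-alone sketch of that pattern follows; `SectionStart` and `SectionAlias` are hypothetical names, and the integer values are illustrative rather than taken from python-docx.

```python
import enum


class SectionStart(enum.Enum):
    """Hypothetical enum with the member names from the stub; values are illustrative."""

    CONTINUOUS = 0
    NEW_COLUMN = 1
    NEW_PAGE = 2
    EVEN_PAGE = 3
    ODD_PAGE = 4


# -- alias: a second name bound to the same class object --
SectionAlias = SectionStart

assert SectionAlias is SectionStart
assert SectionAlias.NEW_PAGE is SectionStart.NEW_PAGE
print(SectionAlias.NEW_PAGE)  # SectionStart.NEW_PAGE
```

Because the alias is the same object, members reached through either name compare identical, and a type checker can treat both names interchangeably.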
7 changes: 7 additions & 0 deletions typings/docx/oxml/__init__.pyi
@@ -0,0 +1,7 @@
# pyright: reportPrivateUsage=false

from typing import Union

from lxml import etree

def parse_xml(xml: Union[str, bytes]) -> etree._Element: ...
10 changes: 10 additions & 0 deletions typings/docx/oxml/document.pyi
@@ -0,0 +1,10 @@
from typing import Iterator

from docx.oxml.xmlchemy import BaseOxmlElement

class CT_Body(BaseOxmlElement):
def __iter__(self) -> Iterator[BaseOxmlElement]: ...

class CT_Document(BaseOxmlElement):
@property
def body(self) -> CT_Body: ...
5 changes: 5 additions & 0 deletions typings/docx/oxml/ns.pyi
@@ -0,0 +1,5 @@
from typing import Dict

nsmap: Dict[str, str]

def qn(tag: str) -> str: ...
7 changes: 7 additions & 0 deletions typings/docx/oxml/section.pyi
@@ -0,0 +1,7 @@
from typing import Optional

from docx.oxml.xmlchemy import BaseOxmlElement

class CT_SectPr(BaseOxmlElement):
@property
def preceding_sectPr(self) -> Optional[CT_SectPr]: ...
3 changes: 3 additions & 0 deletions typings/docx/oxml/table.pyi
@@ -0,0 +1,3 @@
from docx.oxml.xmlchemy import BaseOxmlElement

class CT_Tbl(BaseOxmlElement): ...
3 changes: 3 additions & 0 deletions typings/docx/oxml/text/paragraph.pyi
@@ -0,0 +1,3 @@
from docx.oxml.xmlchemy import BaseOxmlElement

class CT_P(BaseOxmlElement): ...
3 changes: 3 additions & 0 deletions typings/docx/oxml/text/parfmt.pyi
@@ -0,0 +1,3 @@
from docx.oxml.xmlchemy import BaseOxmlElement

class CT_PPr(BaseOxmlElement): ...
9 changes: 9 additions & 0 deletions typings/docx/oxml/text/run.pyi
@@ -0,0 +1,9 @@
from typing import Optional

from docx.oxml.xmlchemy import BaseOxmlElement

class CT_Br(BaseOxmlElement):
type: Optional[str]
clear: Optional[str]

class CT_R(BaseOxmlElement): ...
17 changes: 17 additions & 0 deletions typings/docx/oxml/xmlchemy.pyi
@@ -0,0 +1,17 @@
from typing import Any, Iterator

from lxml import etree

class BaseOxmlElement(etree.ElementBase):
def __iter__(self) -> Iterator[BaseOxmlElement]: ...
@property
def xml(self) -> str: ...
def xpath(self, xpath_str: str) -> Any:
"""Return type is typically Sequence[ElementBase], but ...
lxml.etree.XPath has many possible return types including bool, (a "smart") str,
float. The return type can also be a list containing ElementBase, comments,
processing instructions, str, and tuple. So you need to cast the result based on
the XPath expression you use.
"""
...