fix: Better logic for setting category_depth metadata for Title elements #1517

Merged on Oct 5, 2023 (22 commits)
Commits
881d1fe
fix: better logic for setting Title category_depth
LaverdeS Sep 25, 2023
9214708
docs: update with last fix changes
LaverdeS Sep 25, 2023
aa948a8
chore: new dev version
LaverdeS Sep 25, 2023
aed91ca
fix: Title depth logic depend on Headline Subheadline
LaverdeS Sep 26, 2023
b82ead7
fix: catch exception for HTML objects with no type attr
LaverdeS Sep 26, 2023
aa38a94
chore: add test to corroborate the improvement for Titles
LaverdeS Sep 26, 2023
28d3f30
chore: update description of the logic to correct depth of Titles
LaverdeS Sep 26, 2023
eed145a
Merge branch 'main' into sebastian/fix-title-depth
LaverdeS Sep 26, 2023
aff98bc
chore: add logging for HTML elements without type attr
LaverdeS Sep 26, 2023
1a43fdb
fix: condition is True only if category_depth is None
LaverdeS Sep 26, 2023
be60650
Merge branch 'main' into sebastian/fix-title-depth
LaverdeS Sep 26, 2023
f9e4436
chore: update CHANGELOG.md
LaverdeS Sep 27, 2023
de387ad
chore: better description of fix in CHANGELOG
LaverdeS Sep 27, 2023
27ce065
Merge branch 'main' into sebastian/fix-title-depth
LaverdeS Sep 27, 2023
cdced8e
Merge branch 'main' into sebastian/fix-title-depth
qued Oct 4, 2023
40ea340
Merge branch 'main' into sebastian/fix-title-depth
LaverdeS Oct 4, 2023
12ac02f
chore: tidyig
LaverdeS Oct 5, 2023
6d86282
chore: update CHANGELOG.md
LaverdeS Oct 5, 2023
6c2f012
docs: add Importance to changes in CHANGELOG.md
LaverdeS Oct 5, 2023
18b5267
Merge branch 'main' into sebastian/fix-title-depth
LaverdeS Oct 5, 2023
a7d2959
Merge branch 'main' into sebastian/fix-title-depth
LaverdeS Oct 5, 2023
4286f04
chore: update changes and dev version
LaverdeS Oct 5, 2023
3 changes: 2 additions & 1 deletion CHANGELOG.md
@@ -1,4 +1,4 @@
## 0.10.20-dev1
## 0.10.20-dev2

### Enhancements

@@ -10,6 +10,7 @@

### Fixes

* **Fixes category_depth None value for Title elements** Problem: `Title` elements from `chipper` get `category_depth` = None even when `Headline` and/or `Subheadline` elements are present on the same page. Fix: every `Title` element with `category_depth` = None is set to a depth of 0 if and only if `Headline` and/or `Subheadline` element types are present. Importance: `Title` elements should be equivalent to HTML `H1` when nested headings are present; otherwise, the `category_depth` metadata can be ambiguous across the elements on a page.
* **Tweak `xy-cut` ordering output to be more column friendly** This results in the order of elements more closely reflecting natural reading order which benefits downstream applications. While element ordering from `xy-cut` is usually mostly correct when ordering multi-column documents, sometimes elements from a RHS column will appear before elements in a LHS column. Fix: add swapped `xy-cut` ordering by sorting by X coordinate first and then Y coordinate.

## 0.10.19
21 changes: 21 additions & 0 deletions test_unstructured/partition/test_common.py
@@ -35,6 +35,18 @@ def elements(self):
type="Headline",
text="Charlie Brown and the Great Pumpkin",
),
LocationlessLayoutElement(
type="Subheadline",
text="The Beginning",
),
LocationlessLayoutElement(
type="Text",
text="This time Charlie Brown had it really tricky...",
),
LocationlessLayoutElement(
type="Title",
text="Another book title in the same page",
),
]


@@ -405,3 +417,12 @@ def test_set_element_hierarchy_custom_rule_set():
assert (
elements[5].metadata.parent_id == elements[4].id
), "FigureCaption should be child of Title 2"


def test_document_to_element_list_sets_category_depth_titles():
layout_with_hierarchies = MockDocumentLayout()
elements = document_to_element_list(layout_with_hierarchies)
assert elements[0].metadata.category_depth == 1
assert elements[1].metadata.category_depth == 2
assert elements[2].metadata.category_depth is None
assert elements[3].metadata.category_depth == 0
2 changes: 1 addition & 1 deletion unstructured/__version__.py
@@ -1 +1 @@
__version__ = "0.10.20-dev1" # pragma: no cover
__version__ = "0.10.20-dev2" # pragma: no cover
10 changes: 9 additions & 1 deletion unstructured/partition/common.py
@@ -31,6 +31,7 @@
ListItem,
PageBreak,
Text,
Title,
)
from unstructured.logger import logger
from unstructured.nlp.patterns import ENUMERATED_BULLETS_RE, UNICODE_BULLETS_RE
@@ -561,7 +562,6 @@ def document_to_element_list(
infer_list_items=infer_list_items,
source_format=source_format if source_format else "html",
)

if isinstance(element, List):
for el in element:
if last_modification_date:
@@ -575,6 +575,14 @@ def document_to_element_list(
element.metadata.text_as_html = (
layout_element.text_as_html if hasattr(layout_element, "text_as_html") else None
)
try:
if (
isinstance(element, Title) and element.metadata.category_depth is None
) and any(el.type in ["Headline", "Subheadline"] for el in page.elements):
element.metadata.category_depth = 0
except AttributeError:
logger.info("HTML element instance has no attribute type")

page_elements.append(element)
coordinates = (
element.metadata.coordinates.points if element.metadata.coordinates else None
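
The rule added to `document_to_element_list` can be illustrated in isolation. The following is a minimal sketch, not the library's implementation: `SketchElement` and `fix_title_depths` are hypothetical names introduced here, and the sketch collapses the layout element's `type` and the converted element's class into a single `category` field. The sample depths for `Headline` (1) and `Subheadline` (2) are taken from the test in this PR.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class SketchElement:
    """Hypothetical stand-in for a partitioned element (not the library's class)."""

    category: str
    text: str
    category_depth: Optional[int] = None


def fix_title_depths(page_elements: List[SketchElement]) -> List[SketchElement]:
    """Give Titles with no depth a depth of 0, but only if the page has nested headings."""
    has_nested_headings = any(
        el.category in ("Headline", "Subheadline") for el in page_elements
    )
    if not has_nested_headings:
        # No Headline/Subheadline on the page: leave category_depth untouched.
        return page_elements
    for el in page_elements:
        # Only fill in a missing depth; never overwrite an existing one.
        if el.category == "Title" and el.category_depth is None:
            el.category_depth = 0
    return page_elements


# Mirrors the elements of the mock page used in the test above.
page = [
    SketchElement("Headline", "Charlie Brown and the Great Pumpkin", 1),
    SketchElement("Subheadline", "The Beginning", 2),
    SketchElement("Text", "This time Charlie Brown had it really tricky..."),
    SketchElement("Title", "Another book title in the same page"),
]
depths = [el.category_depth for el in fix_title_depths(page)]
```

With this input, only the trailing `Title` changes; a page containing a `Title` but no `Headline` or `Subheadline` would be returned unchanged.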