fix: Better logic for setting category_depth metadata for Title elements #1517

Merged
merged 22 commits into from
Oct 5, 2023
Changes from 11 commits
Commits
881d1fe
fix: better logic for setting Title category_depth
LaverdeS Sep 25, 2023
9214708
docs: update with last fix changes
LaverdeS Sep 25, 2023
aa948a8
chore: new dev version
LaverdeS Sep 25, 2023
aed91ca
fix: Title depth logic depend on Headline Subheadline
LaverdeS Sep 26, 2023
b82ead7
fix: catch exception for HTML objects with no type attr
LaverdeS Sep 26, 2023
aa38a94
chore: add test to corroborate the improvement for Titles
LaverdeS Sep 26, 2023
28d3f30
chore: update description of the logic to correct depth of Titles
LaverdeS Sep 26, 2023
eed145a
Merge branch 'main' into sebastian/fix-title-depth
LaverdeS Sep 26, 2023
aff98bc
chore: add logging for HTML elements without type attr
LaverdeS Sep 26, 2023
1a43fdb
fix: condition is True only if category_depth is None
LaverdeS Sep 26, 2023
be60650
Merge branch 'main' into sebastian/fix-title-depth
LaverdeS Sep 26, 2023
f9e4436
chore: update CHANGELOG.md
LaverdeS Sep 27, 2023
de387ad
chore: better description of fix in CHANGELOG
LaverdeS Sep 27, 2023
27ce065
Merge branch 'main' into sebastian/fix-title-depth
LaverdeS Sep 27, 2023
cdced8e
Merge branch 'main' into sebastian/fix-title-depth
qued Oct 4, 2023
40ea340
Merge branch 'main' into sebastian/fix-title-depth
LaverdeS Oct 4, 2023
12ac02f
chore: tidyig
LaverdeS Oct 5, 2023
6d86282
chore: update CHANGELOG.md
LaverdeS Oct 5, 2023
6c2f012
docs: add Importance to changes in CHANGELOG.md
LaverdeS Oct 5, 2023
18b5267
Merge branch 'main' into sebastian/fix-title-depth
LaverdeS Oct 5, 2023
a7d2959
Merge branch 'main' into sebastian/fix-title-depth
LaverdeS Oct 5, 2023
4286f04
chore: update changes and dev version
LaverdeS Oct 5, 2023
2 changes: 1 addition & 1 deletion CHANGELOG.md
@@ -15,9 +15,9 @@

### Fixes

* **Fixes category_depth None value for Title elements** Problem: The mapping of `Headline` and `Subheadline` element types from `chipper` to `Title` assigns them `category_depth` metadata of 1 and 2, respectively. During the same mapping, `Title` elements from `chipper` are left unchanged, with `category_depth` = None. Fix: Whenever `Headline` or `Subheadline` layout elements are present on a page, all `Title` elements with `category_depth` = None are set to a depth of 0 instead.
Contributor

Does this make sense semantically?
Remind me, what is the category_depth for an H1 element? If it is 0, does this imply that all other Titles from Chipper are like HTML H1s?

Please add "Why does it matter?" for the 2nd sentence.

Contributor Author

Correct, H1 is category_depth 0. For chipper, Title elements are like H1, Headline like H2, and Subheadline like H3. I am changing the description of the fix to clarify this.
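
The correspondence described in this reply can be written down as a small lookup. This is only an illustrative sketch: `CHIPPER_DEPTHS` and `depth_for` are hypothetical names introduced here, not part of `unstructured`.

```python
from typing import Optional

# Hypothetical lookup mirroring the reply above: chipper type -> category_depth.
CHIPPER_DEPTHS = {
    "Title": 0,        # treated like an HTML <h1>
    "Headline": 1,     # treated like an HTML <h2>
    "Subheadline": 2,  # treated like an HTML <h3>
}


def depth_for(chipper_type: str) -> Optional[int]:
    """Return the heading depth for a chipper type, or None for non-heading types."""
    return CHIPPER_DEPTHS.get(chipper_type)
```

Under this reading, a non-heading type such as `Text` has no depth at all, which is why the lookup returns `None` for it rather than a number.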

* **Fixes a metadata source serialization bug** Problem: In unstructured elements, when loading an elements json file from the disk, the data_source attribute is assumed to be an instance of DataSourceMetadata and the code acts based on that. However the loader did not satisfy the assumption, and loaded it as a dict instead, causing an error. Fix: Added necessary code block to initialize a DataSourceMetadata object, also refactored DataSourceMetadata.from_dict() method to remove redundant code. Importance: Crucial to be able to load elements (which have data_source fields) from json files.
* **Fixes issue where unstructured-inference was not getting updated** Problem: unstructured-inference was not getting upgraded to the version to match unstructured release when doing a pip install. Solution: using `pip install unstructured[all-docs]` it will now upgrade both unstructured and unstructured-inference. Importance: This will ensure that the inference library is always in sync with the unstructured library, otherwise users will be using outdated libraries which will likely lead to unintended behavior.
* **Fixes SharePoint connector failures if any document has an unsupported filetype** Problem: Currently the entire connector ingest run fails if a single IngestDoc has an unsupported filetype. This is because a ValueError is raised in the IngestDoc's `__post_init__`. Fix: Adds a try/catch when the IngestConnector runs get_ingest_docs such that the error is logged but all processable documents->IngestDocs are still instantiated and returned. Importance: Allows users to ingest SharePoint content even when some files with unsupported filetypes exist there.
LaverdeS marked this conversation as resolved.
* **Fixes Sharepoint connector server_path issue** Problem: Server path for the Sharepoint Ingest Doc was incorrectly formatted, causing issues while fetching pages from the remote source. Fix: changes formatting of remote file path before instantiating SharepointIngestDocs and appends a '/' while fetching pages from the remote source. Importance: Allows users to fetch pages from Sharepoint Sites.
* **Fixes badly initialized Formula** Problem: YoloX contains new types of elements. When loading a document that contains formulas, a new element of that class should be generated; however, the Formula class inherits from Element instead of Text. After this change, the element is created with the correct class.
21 changes: 21 additions & 0 deletions test_unstructured/partition/test_common.py
@@ -35,6 +35,18 @@ def elements(self):
                type="Headline",
                text="Charlie Brown and the Great Pumpkin",
            ),
            LocationlessLayoutElement(
                type="Subheadline",
                text="The Beginning",
            ),
            LocationlessLayoutElement(
                type="Text",
                text="This time Charlie Brown had it really tricky...",
            ),
            LocationlessLayoutElement(
                type="Title",
                text="Another book title in the same page",
            ),
        ]


@@ -405,3 +417,12 @@ def test_set_element_hierarchy_custom_rule_set():
    assert (
        elements[5].metadata.parent_id == elements[4].id
    ), "FigureCaption should be child of Title 2"


def test_document_to_element_list_sets_category_depth_titles():
    layout_with_hierarchies = MockDocumentLayout()
    elements = document_to_element_list(layout_with_hierarchies)
    assert elements[0].metadata.category_depth == 1
    assert elements[1].metadata.category_depth == 2
    assert elements[2].metadata.category_depth is None
    assert elements[3].metadata.category_depth == 0

newelh marked this conversation as resolved.
10 changes: 9 additions & 1 deletion unstructured/partition/common.py
@@ -31,6 +31,7 @@
    ListItem,
    PageBreak,
    Text,
    Title,
)
from unstructured.logger import logger
from unstructured.nlp.patterns import ENUMERATED_BULLETS_RE, UNICODE_BULLETS_RE
@@ -560,7 +561,6 @@ def document_to_element_list(
                infer_list_items=infer_list_items,
                source_format=source_format if source_format else "html",
            )

            if isinstance(element, List):
                for el in element:
                    if last_modification_date:
@@ -574,6 +574,14 @@ def document_to_element_list(
                element.metadata.text_as_html = (
                    layout_element.text_as_html if hasattr(layout_element, "text_as_html") else None
                )
                try:
                    if (isinstance(element, Title) and element.metadata.category_depth is None) and any(
                        el.type in ["Headline", "Subheadline"] for el in page.elements
                    ):
                        element.metadata.category_depth = 0
                except AttributeError:
                    logger.info("HTML element instance has no attribute type")

                page_elements.append(element)
            coordinates = (
                element.metadata.coordinates.points if element.metadata.coordinates else None
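
The rule added in this hunk can be tried in isolation with the sketch below. It is not the code from `unstructured/partition/common.py`: `Metadata`, `Element`, `LayoutElement`, and `set_title_depths` are hypothetical stand-ins defined here so the example runs on its own, and the input depths are assumed to be what the chipper-to-`Title` mapping produces before the fix.

```python
from dataclasses import dataclass, field
from typing import Iterable, List, Optional


@dataclass
class Metadata:  # hypothetical stand-in for element metadata
    category_depth: Optional[int] = None


@dataclass
class Element:  # hypothetical stand-in for a partitioned element
    category: str
    text: str
    metadata: Metadata = field(default_factory=Metadata)


@dataclass
class LayoutElement:  # hypothetical stand-in for a chipper layout element
    type: str
    text: str


def set_title_depths(elements: List[Element], layout_elements: Iterable[object]) -> List[Element]:
    """Give depth 0 to Titles without a depth when the page has (Sub)headlines."""
    # getattr tolerates layout objects that have no `type` attribute.
    has_headlines = any(
        getattr(le, "type", None) in ("Headline", "Subheadline") for le in layout_elements
    )
    if has_headlines:
        for el in elements:
            if el.category == "Title" and el.metadata.category_depth is None:
                el.metadata.category_depth = 0
    return elements


# Same page as the mock layout in the test: Headline, Subheadline, Text, Title.
layout = [
    LayoutElement("Headline", "Charlie Brown and the Great Pumpkin"),
    LayoutElement("Subheadline", "The Beginning"),
    LayoutElement("Text", "This time Charlie Brown had it really tricky..."),
    LayoutElement("Title", "Another book title in the same page"),
]
# Assumed state after the chipper-to-Title mapping, before the fix is applied.
elements = [
    Element("Title", layout[0].text, Metadata(category_depth=1)),
    Element("Title", layout[1].text, Metadata(category_depth=2)),
    Element("NarrativeText", layout[2].text),
    Element("Title", layout[3].text),
]
depths = [el.metadata.category_depth for el in set_title_depths(elements, layout)]
```

Two choices here differ from the diff and are assumptions rather than the PR's approach: the page-level check runs once instead of once per element, and `getattr` with a default takes the place of the `try`/`except AttributeError` block. With these inputs, `depths` matches the values asserted in `test_document_to_element_list_sets_category_depth_titles`.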