fix: Better logic for setting category_depth metadata for Title elements #1517

Merged
LaverdeS merged 22 commits into main from sebastian/fix-title-depth
Oct 5, 2023

Conversation

LaverdeS
Contributor

@LaverdeS LaverdeS commented Sep 25, 2023

This PR promotes the category_depth metadata for Title elements from None to 0 whenever Headline and/or Subheadline types (which are also mapped to Title elements, with depth 1 and 2) are present. An additional test has been added to test_common.py to check the improvement. More tests showing how this logic fixes the behaviour can be found in an adapted version of the Colab notebook here.

Problem: The mapping of Headline and Subheadline element types from chipper to Title assigns category_depth metadata of 1 and 2, respectively. During the same mapping, Title elements from chipper are left unchanged, with category_depth = None.

Fix: Whenever Headline or Subheadline layout elements are present on a page, all Title elements with category_depth = None are set to a depth of 0 instead.

Importance: Title elements should be equivalent to HTML H1 when nested headings are present; otherwise, the category_depth metadata can be ambiguous among the elements on a page.
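
For illustration, the rule described above can be sketched in a few lines of Python. This is a minimal, self-contained sketch, not the code from this PR: the `Element` dataclass, its `source_type` field, and the `promote_title_depth` function are hypothetical names invented for the example, and it assumes the original chipper type is still available on each element when the depth is assigned.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Element:
    """Hypothetical stand-in for a partitioned element (not the library's class)."""
    category: str                         # type after mapping, e.g. "Title"
    source_type: str                      # original chipper type, e.g. "Headline"
    text: str
    category_depth: Optional[int] = None  # None means "no depth assigned"


def promote_title_depth(page: List[Element]) -> List[Element]:
    """Set depth 0 on Titles without a depth if the page has nested headings."""
    # Nested headings exist if any element came from a Headline or Subheadline.
    has_nested = any(
        el.source_type in ("Headline", "Subheadline") for el in page
    )
    if has_nested:
        for el in page:
            # Only touch Titles whose depth was left unset by the mapping.
            if el.category == "Title" and el.category_depth is None:
                el.category_depth = 0
    return page


page = [
    Element("Title", "Title", "Annual Report"),
    Element("Title", "Headline", "Results", category_depth=1),
    Element("Title", "Subheadline", "Q3", category_depth=2),
]
print([el.category_depth for el in promote_title_depth(page)])  # [0, 1, 2]
```

A page that contains only plain Title elements is left untouched in this sketch, so their category_depth stays None.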

@LaverdeS LaverdeS marked this pull request as ready for review September 26, 2023 11:13
@LaverdeS LaverdeS self-assigned this Sep 26, 2023
test_unstructured/partition/test_common.py (resolved)
unstructured/partition/common.py (outdated, resolved)
unstructured/partition/common.py (outdated, resolved)
@newelh
Contributor

newelh commented Sep 26, 2023

LGTM!

@LaverdeS LaverdeS enabled auto-merge (squash) September 26, 2023 21:59
CHANGELOG.md Outdated
@@ -15,9 +15,9 @@

### Fixes

* **Fixes category_depth None value for Title elements** Problem: The mapping of `Headline` and `Subheadline` element types from `chipper` to `Title`, assigns metadata `category_depth` 1 and 2 respectively. During the same mapping, the `Title` elements from `chipper` are left unchanged, with metadata `category_depth` = None. Fix: Whenever `Headline` or `Subheadline` type layout elements are present in a page, then all `Title` elements with `category_depth` = None should be set to have a depth of 0 instead.
Contributor

Does this make sense semantically?
Remind me what the category_depth for an H1 element is again? If it is 0, does this imply all other Titles from Chipper are like HTML H1s?

Please add "Why does it matter?" for the 2nd sentence.

Contributor Author

Correct, H1 is category_depth 0. For chipper: Title elements are like H1, Headline like H2, and Subheadline like H3. I am changing the description of the fix for clarification.
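
As a rough illustration of that correspondence, the sketch below maps each chipper heading type to its category_depth and derives the matching HTML heading tag as depth + 1. The names `CHIPPER_HEADING_DEPTH` and `html_heading_tag` are hypothetical and exist only for this example; the depth of 0 for Title assumes nested headings are present on the page.

```python
# Hypothetical lookup: chipper heading type -> category_depth.
CHIPPER_HEADING_DEPTH = {
    "Title": 0,        # like HTML H1
    "Headline": 1,     # like HTML H2
    "Subheadline": 2,  # like HTML H3
}


def html_heading_tag(chipper_type: str) -> str:
    """Return the HTML heading tag that corresponds to a chipper heading type."""
    depth = CHIPPER_HEADING_DEPTH[chipper_type]
    # category_depth is zero-based, HTML heading levels are one-based.
    return f"h{depth + 1}"


for name in CHIPPER_HEADING_DEPTH:
    print(name, "->", html_heading_tag(name))
```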

@LaverdeS LaverdeS disabled auto-merge September 27, 2023 10:07
@LaverdeS LaverdeS enabled auto-merge (squash) October 5, 2023 17:27
@LaverdeS LaverdeS merged commit e90a979 into main Oct 5, 2023
39 checks passed
@LaverdeS LaverdeS deleted the sebastian/fix-title-depth branch October 5, 2023 17:51