Feat: Detect all text in HTML Heading tags as titles #1556

Merged
merged 16 commits into from
Oct 3, 2023

Conversation

newelh
Contributor

@newelh newelh commented Sep 27, 2023

Summary

This will increase the accuracy of hierarchies in HTML documents and provide more accurate element categorization. If text is in an HTML heading tag and is not a list item or an address, categorize it as a title.
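
To make the rule concrete, here is a minimal standalone sketch built on Python's standard-library html.parser. It is not the code in this PR and not how unstructured is implemented: the class name HeadingTitleSketch, the classify helper, and the category strings are hypothetical, and treating "list item" as "text nested inside an <li> tag" is an assumption made for illustration. The library's other checks are not attempted here.

```python
from html.parser import HTMLParser

HEADING_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}


class HeadingTitleSketch(HTMLParser):
    """Toy classifier (hypothetical): heading text is a Title unless it sits in an <li>."""

    def __init__(self):
        super().__init__()
        self.open_tags = []  # stack of currently open tag names
        self.elements = []   # (category, text) pairs in document order

    def handle_starttag(self, tag, attrs):
        self.open_tags.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed tags in between.
        if tag in self.open_tags:
            while self.open_tags.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        in_heading = any(t in HEADING_TAGS for t in self.open_tags)
        in_list_item = "li" in self.open_tags  # assumption: list item == inside <li>
        if in_list_item:
            category = "ListItem"
        elif in_heading:
            category = "Title"
        else:
            category = "Text"
        self.elements.append((category, text))


def classify(html):
    """Return (category, text) pairs for each non-empty text node in the HTML."""
    parser = HeadingTitleSketch()
    parser.feed(html)
    parser.close()
    return parser.elements
```

For example, classify("<h1>Grants 2015</h1><p>Body text</p>") returns [("Title", "Grants 2015"), ("Text", "Body text")], while heading text nested in a list item, as in "<ul><li><h3>Item</h3></li></ul>", stays a list item.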

Testing

from unstructured.partition.html import partition_html
elements = partition_html(url="https://www.eda.gov/grants/2015")

Before this change, the date headers at the given URL were not parsed as titles. After this change, they are correctly identified as titles.

A unit test has been added to verify the functionality: test_html_partition::test_html_heading_title_detection. It includes values that were previously detected as narrative text or uncategorized text.

@newelh newelh linked an issue Sep 27, 2023 that may be closed by this pull request
@newelh newelh marked this pull request as ready for review September 28, 2023 14:36
@newelh newelh requested a review from qued September 28, 2023 16:02
@newelh newelh requested a review from cragwolfe September 28, 2023 18:35
Collaborator

@badGarnet badGarnet left a comment


confirm the test would fail with <h1> content being id'ed as text with the version on main.
the test website from PR note returns 504 😢 so can't verify that case
code does make sense
grabbed a few random websites but do not see difference between this PR and main (could be missing those naked <h1> tags)

@newelh
Contributor Author

newelh commented Oct 2, 2023

confirm the test would fail with <h1> content being id'ed as text with the version on main. the test website from PR note returns 504 😢 so can't verify that case code does make sense grabbed a few random websites but do not see difference between this PR and main (could be missing those naked <h1> tags)

Yes, confirmed that the test fails on the main branch (e.g. something that main would categorize as NarrativeText or uncategorized text is now properly categorized as a title).

@newelh newelh merged commit bcd0eee into main Oct 3, 2023
@newelh newelh deleted the newelh/partition-html/all-heading-text-as-titles branch October 3, 2023 15:54

Successfully merging this pull request may close these issues.

feat/Detect all text in HTML Heading tags as titles