Feat: Detect all text in HTML Heading tags as titles #1556
…into newelh/partition-html/all-heading-text-as-titles
…ed as uncategorized text
…f they're in heading tags
Confirmed that the test would fail on main, where <h1> content is identified as text.
The test website from the PR note returns a 504 😢, so I can't verify that case.
The code does make sense.
I grabbed a few random websites but do not see a difference between this PR and main (they could be missing those naked <h1> tags).
Yes, confirmed that the test fails on the main branch (e.g. something that main would categorize as NarrativeText or uncategorized text is now properly categorized as a title).
Summary
This will increase the accuracy of hierarchies in HTML documents and provide more accurate element categorization. If text is in an HTML heading tag (<h1> through <h6>) and is not a list item or an address, it is categorized as a title.
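The rule above can be illustrated with a minimal standard-library sketch. This is not the code from this PR: the class `HeadingTitleSketch`, the helper `classify`, and the category strings "Title" and "Text" are hypothetical names chosen for illustration. It also assumes that "list item" and "address" can be approximated by checking for enclosing <li> and <address> tags, whereas the actual checks may be based on the text itself.

```python
from html.parser import HTMLParser

HEADING_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}


class HeadingTitleSketch(HTMLParser):
    """Hypothetical sketch: collect (category, text) pairs from HTML."""

    def __init__(self):
        super().__init__()
        self.open_tags = []  # stack of currently open tags
        self.elements = []   # (category, text) results

    def handle_starttag(self, tag, attrs):
        self.open_tags.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the innermost matching open tag, tolerating
        # unclosed tags such as <br> that were pushed in between.
        if tag in self.open_tags:
            while self.open_tags.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        in_heading = any(t in HEADING_TAGS for t in self.open_tags)
        # Assumption: list items and addresses are detected by enclosing tag.
        excluded = "li" in self.open_tags or "address" in self.open_tags
        category = "Title" if in_heading and not excluded else "Text"
        self.elements.append((category, text))


def classify(html):
    """Return the (category, text) pairs found in an HTML string."""
    parser = HeadingTitleSketch()
    parser.feed(html)
    parser.close()
    return parser.elements
```

With this sketch, `classify("<h1>2023-10-01</h1><p>Body text</p>")` labels the date header as "Title" and the paragraph as "Text", while a heading nested inside a list item stays "Text".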
Testing
Before this change, the date headers at the given URL were not correctly parsed as titles; with this change, they are correctly identified.
A unit test to verify the functionality has been added:
test_html_partition::test_html_heading_title_detection
It includes values that were previously detected as narrative text and uncategorized text.