-
Notifications
You must be signed in to change notification settings - Fork 810
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix: Better logic for setting category_depth
metadata for Title
elements
#1517
Conversation
LGTM! |
CHANGELOG.md
Outdated
@@ -15,9 +15,9 @@ | |||
|
|||
### Fixes | |||
|
|||
* **Fixes category_depth None value for Title elements** Problem: The mapping of `Headline` and `Subheadline` element types from `chipper` to `Title`, assigns metadata `category_depth` 1 and 2 respectively. During the same mapping, the `Title` elements from `chipper` are left unchanged, with metadata `category_depth` = None. Fix: Whenever `Headline` or `Subheadline` type layout elements are present in a page, then all `Title` elements with `category_depth` = None should be set to have a depth of 0 instead. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
does this make sense semantically?
remind me what the category_depth
for an H1
element is again? if it is 0, does this imply all other Titles from Chipper are like html H1
's?
Please add "Why does it matter? for the 2nd sentence.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
correct, H1
is category_depth
0. For, chipper
: Title
elements are like H1
, Headline
like H2
, and Subheadline
like H3
. I am changing the description of the fix for clarification.
add missing bullet in Fixes
This PR promotes the
category_depth
metadata forTitle
elements fromNone
to 0, wheneverHeadline
and/orSubheadline
types (that are also mapped toTitle
elements with depth 1 and 2) are present. An additional test totest_common.py
has been added to check on the improvement. More test of how this logic fixes the behaviour can be found in a adapted version on the colab here.Problem: The mapping of
Headline
andSubheadline
element types fromchipper
toTitle
, assigns metadatacategory_depth
1 and 2 respectively. During the same mapping, theTitle
elements fromchipper
are left unchanged, with metadatacategory_depth
= None.Fix: Whenever
Headline
orSubheadline
type layout elements are present in a page, then allTitle
elements withcategory_depth
= None should be set to have a depth of 0 instead.Importance: Title elements should be equivalent html H1 when nested headings are present; otherwise, category_depth metadata can result ambiguous within elements in a page.