feat: element type frequency #1688
Changes from 9 commits
85acfc7
2da1f1f
5507432
354f40d
371d4d9
652dcc8
005da71
60e9597
be6a913
a5cccb6
e6031fa
bf7e7e5
44e8cbc
1ac004c
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,24 @@ | ||
MIME-Version: 1.0 | ||
Date: Fri, 16 Dec 2022 17:04:16 -0500 | ||
Message-ID: <CADc-_xaLB2FeVQ7mNsoX+NJb_7hAJhBKa_zet-rtgPGenj0uVw@mail.gmail.com> | ||
Subject: Test Email | ||
From: Matthew Robinson <[email protected]> | ||
To: Matthew Robinson <[email protected]> | ||
Content-Type: multipart/alternative; boundary="00000000000095c9b205eff92630" | ||
|
||
--00000000000095c9b205eff92630 | ||
Content-Type: text/plain; charset="UTF-8" | ||
|
||
This is a test email to use for unit tests. | ||
|
||
Important points: | ||
|
||
- Roses are red | ||
- Violets are blue | ||
|
||
--00000000000095c9b205eff92630 | ||
Content-Type: text/html; charset="UTF-8" | ||
|
||
<div dir="ltr"><div>This is a test email to use for unit tests.</div><div><br></div><div>Important points:</div><div><ul><li>Roses are red</li><li>Violets are blue</li></ul></div></div> | ||
|
||
--00000000000095c9b205eff92630-- |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,87 @@ | ||
import pytest | ||
|
||
from unstructured.metrics.element_type import get_element_type_frequency | ||
from unstructured.partition.auto import partition | ||
|
||
|
||
@pytest.mark.parametrize( | ||
("filename", "frequency"), | ||
[ | ||
( | ||
"fake-email.txt", | ||
{ | ||
"UncategorizedText": [("None", 6)], | ||
"FigureCaption": [], | ||
"Figure": [], | ||
"Text": [], | ||
"NarrativeText": [("None", 2)], | ||
"ListItem": [("None", 12)], | ||
"BulletedText": [], | ||
"Title": [("None", 5)], | ||
"Address": [], | ||
"EmailAddress": [], | ||
"Image": [], | ||
"PageBreak": [], | ||
"Table": [], | ||
"Header": [], | ||
"Footer": [], | ||
"Caption": [], | ||
"Footnote": [], | ||
"Formula": [], | ||
"List-item": [], | ||
"Page-footer": [], | ||
"Page-header": [], | ||
"Picture": [], | ||
"Section-header": [], | ||
"Headline": [], | ||
"Subheadline": [], | ||
"Abstract": [], | ||
"Threading": [], | ||
"Form": [], | ||
"Field-Name": [], | ||
"Value": [], | ||
"Link": [], | ||
}, | ||
), | ||
( | ||
"sample-presentation.pptx", | ||
{ | ||
"UncategorizedText": [], | ||
"FigureCaption": [], | ||
"Figure": [], | ||
"Text": [], | ||
"NarrativeText": [("0", 3)], | ||
"ListItem": [("0", 6), ("1", 6), ("2", 3)], | ||
"BulletedText": [], | ||
"Title": [("0", 4), ("1", 1)], | ||
"Address": [], | ||
"EmailAddress": [], | ||
"Image": [], | ||
"PageBreak": [], | ||
"Table": [("None", 1)], | ||
"Header": [], | ||
"Footer": [], | ||
"Caption": [], | ||
"Footnote": [], | ||
"Formula": [], | ||
"List-item": [], | ||
"Page-footer": [], | ||
"Page-header": [], | ||
"Picture": [], | ||
"Section-header": [], | ||
"Headline": [], | ||
"Subheadline": [], | ||
"Abstract": [], | ||
"Threading": [], | ||
"Form": [], | ||
"Field-Name": [], | ||
"Value": [], | ||
"Link": [], | ||
}, | ||
), | ||
], | ||
) | ||
def test_get_element_type_frequency(filename, frequency): | ||
elements = partition(filename=f"example-docs/{filename}") | ||
elements_freq = get_element_type_frequency(elements) | ||
assert elements_freq == frequency |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,25 @@ | ||
from typing import Dict, List, Optional, Tuple, Union | ||
|
||
from unstructured.documents.elements import TYPE_TO_TEXT_ELEMENT_MAP | ||
|
||
|
||
def get_element_type_frequency( | ||
elements: List, | ||
) -> Union[Dict[str, Tuple[Optional[str], int]], Dict]: | ||
""" | ||
Calculate the frequency of Element Types from a list of elements. | ||
""" | ||
frequency: Dict = {key: {} for key in TYPE_TO_TEXT_ELEMENT_MAP} | ||
if len(elements) == 0: | ||
return frequency | ||
for element in elements: | ||
category = element.category | ||
category_depth = element.metadata.category_depth | ||
|
||
if str(category_depth) not in frequency[category]: | ||
Debating if a similar value initialization should be done for depth, and in general what the best representation of depth is for this use case. I like that the nested category depth info makes it easy to consider element types without depth if desired, but the alternative of the main key being a tuple (

For comparisons, I was thinking it would be easiest if both dictionaries being compared had the exact same keys, but in the case of depth we don't have a limit on it and could potentially nest infinitely, so we don't want to represent all those values with a default.

What do you think makes the most sense for a representation + default value decision given that the primary use case is to (easily) compare the resulting element frequencies between 2 input element lists? I'm leaning default for element categories (prior comment), no default for category depth, and let category depth be nested so it isn't clunky for categories that don't have a depth (which is most of them).

EDIT: I looked at the test output (see comment in

So you mean for the case that has |
||
frequency[category][str(category_depth)] = 1 | ||
else: | ||
frequency[category][str(category_depth)] += 1 | ||
for key in frequency: | ||
frequency[key] = list(frequency[key].items()) | ||
return frequency |
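
For readers who want to try the counting logic without installing `unstructured`, here is a minimal standalone sketch of the nested representation in the diff above. `KNOWN_CATEGORIES`, `FakeMetadata`, `FakeElement`, and `nested_frequency` are hypothetical stand-ins introduced for illustration, not names from the PR. The sketch also drops the early return for an empty list, so an empty input yields empty lists for every category rather than the empty dicts the diff would return in that case.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

# Hypothetical stand-in for the keys of TYPE_TO_TEXT_ELEMENT_MAP.
KNOWN_CATEGORIES = ("Title", "NarrativeText", "ListItem", "Table")


@dataclass
class FakeMetadata:
    # Mirrors element.metadata.category_depth; None when no depth applies.
    category_depth: Optional[int] = None


@dataclass
class FakeElement:
    # Mirrors the two attributes the function in the diff reads.
    category: str
    metadata: FakeMetadata = field(default_factory=FakeMetadata)


def nested_frequency(elements: List[FakeElement]) -> Dict[str, List[Tuple[str, int]]]:
    """Count elements per category, then per stringified category depth."""
    # Every known category gets an entry, even if it never occurs.
    counts: Dict[str, Counter] = {key: Counter() for key in KNOWN_CATEGORIES}
    for element in elements:
        counts[element.category][str(element.metadata.category_depth)] += 1
    # Convert each inner counter to a list of (depth, count) pairs.
    return {key: list(counter.items()) for key, counter in counts.items()}
```

Every known category appears in the result, which keeps the keys identical across documents but leaves most values empty, as in the parametrized test expectations.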

I meant a tuple as the key in the dictionary of frequencies, e.g.
("UncategorizedText", "None"): 6
or
("Title", 1): 3
Whether the depth is a string or a number (to cover both None and 0, 1, etc.) is up to you.

But in that case, how would the element types that don't exist in the document be initialized? Like this?
("UncategorizedText",)

In this way, each tuple key in the dictionary represents a unique element type. Overall, the type of the dictionary will be
Dict[Tuple[str, int], int]
(or possibly str for the second value in the tuple, to capture None).
Would it then no longer be possible to create all the keys for the dictionary up front? Probably. Let's prioritize having meaningful keys over initializing the dictionary with all element categories.

So we revert to not initializing every element type, and then assign a value to each tuple key.
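
The tuple-key representation discussed in this thread could look roughly like the sketch below. It is an illustration under stated assumptions, not the code that was merged: `FakeMetadata`, `FakeElement`, `tuple_key_frequency`, and `frequency_diff` are hypothetical names, and the depth is kept as `Optional[int]` rather than a string, which is one of the two options mentioned above. The second helper shows the comparison use case: keys missing from one side are treated as a count of zero, so no up-front initialization is needed.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

# (category, depth) identifies one kind of element; depth may be None.
FrequencyKey = Tuple[str, Optional[int]]


@dataclass
class FakeMetadata:
    # Mirrors element.metadata.category_depth.
    category_depth: Optional[int] = None


@dataclass
class FakeElement:
    # Hypothetical stand-in for an unstructured element.
    category: str
    metadata: FakeMetadata = field(default_factory=FakeMetadata)


def tuple_key_frequency(elements: List[FakeElement]) -> Dict[FrequencyKey, int]:
    """Count elements keyed by (category, category_depth); no default keys."""
    return dict(Counter((e.category, e.metadata.category_depth) for e in elements))


def frequency_diff(
    a: Dict[FrequencyKey, int], b: Dict[FrequencyKey, int]
) -> Dict[FrequencyKey, int]:
    """Return a[key] - b[key] for every key whose counts differ."""
    diff: Dict[FrequencyKey, int] = {}
    for key in set(a) | set(b):
        delta = a.get(key, 0) - b.get(key, 0)  # absent key counts as zero
        if delta != 0:
            diff[key] = delta
    return diff
```

An empty result from `frequency_diff` means the two element lists have the same type frequencies.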