feat: element type frequency #1688

Merged
merged 14 commits into from
Oct 11, 2023
1 change: 1 addition & 0 deletions CHANGELOG.md
@@ -18,6 +18,7 @@
* **Adds `edit_distance` calculation metrics** In order to benchmark the cleaned, extracted text with unstructured, `edit_distance` (`Levenshtein distance`) is included.
* **Adds detection_origin field to metadata** Problem: Currently there isn't an easy way to find out how an element was created. With this change, that information is added. Importance: With this information, developers and users can see how an element was created and decide how to use it. To use this feature, set UNSTRUCTURED_INCLUDE_DEBUG_METADATA=true.
* **Adds a function that calculates frequency of the element type and its depth** To capture the accuracy of element type extraction, this function counts the occurrences of each unique element type with its depth for use in element metrics.

### Fixes

24 changes: 24 additions & 0 deletions example-docs/fake-email.txt
@@ -0,0 +1,24 @@
MIME-Version: 1.0
Date: Fri, 16 Dec 2022 17:04:16 -0500
Message-ID: <CADc-_xaLB2FeVQ7mNsoX+NJb_7hAJhBKa_zet-rtgPGenj0uVw@mail.gmail.com>
Subject: Test Email
From: Matthew Robinson <[email protected]>
To: Matthew Robinson <[email protected]>
Content-Type: multipart/alternative; boundary="00000000000095c9b205eff92630"

--00000000000095c9b205eff92630
Content-Type: text/plain; charset="UTF-8"

This is a test email to use for unit tests.

Important points:

- Roses are red
- Violets are blue

--00000000000095c9b205eff92630
Content-Type: text/html; charset="UTF-8"

<div dir="ltr"><div>This is a test email to use for unit tests.</div><div><br></div><div>Important points:</div><div><ul><li>Roses are red</li><li>Violets are blue</li></ul></div></div>

--00000000000095c9b205eff92630--
87 changes: 87 additions & 0 deletions test_unstructured/metrics/test_element_type.py
@@ -0,0 +1,87 @@
import pytest

from unstructured.metrics.element_type import get_element_type_frequency
from unstructured.partition.auto import partition


@pytest.mark.parametrize(
    ("filename", "frequency"),
    [
        (
            "fake-email.txt",
            {
                "UncategorizedText": [("None", 6)],
Contributor

I meant using a tuple as the key in the dictionary of frequencies,
e.g. ("UncategorizedText", "None"): 6 or ("Title", 1): 3

Whether the depth is a string or a number (to cover both None and 0, 1, etc.) is up to you.

Contributor Author

But in that case, how would element types that don't exist in the document be initialized? Like ("UncategorizedText",)?

Contributor

This way, each tuple key in the dictionary represents a unique combination of element type and depth. Overall, the type of the dictionary will be Dict[Tuple[str, int], int] (or possibly str for the second value in the tuple, to capture None).

Would it then no longer be possible to create all the keys for the dictionary? Probably. Let's prioritize having meaningful keys over initializing the dictionary with all element categories.

Contributor Author

So we revert to not initializing every element type, and then assign a value to each tuple key.
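
For illustration, here is a minimal sketch of the tuple-keyed layout proposed in this thread. It is not the code in this diff. It assumes elements expose `category` and `metadata.category_depth` as the diff does, and it keeps the depth as `Optional[int]` rather than a string, which is only one of the two options mentioned above. `FakeMetadata`, `FakeElement`, and `tuple_key_frequency` are hypothetical names.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class FakeMetadata:
    # Hypothetical stand-in for an element's metadata object.
    category_depth: Optional[int] = None


@dataclass
class FakeElement:
    # Hypothetical stand-in for a partitioned element.
    category: str
    metadata: FakeMetadata = field(default_factory=FakeMetadata)


def tuple_key_frequency(
    elements: List[FakeElement],
) -> Dict[Tuple[str, Optional[int]], int]:
    """Count each (category, depth) pair; pairs that never occur get no key."""
    return dict(Counter((e.category, e.metadata.category_depth) for e in elements))


elements = [
    FakeElement("Title", FakeMetadata(0)),
    FakeElement("Title", FakeMetadata(0)),
    FakeElement("Title", FakeMetadata(1)),
    FakeElement("Table"),  # no depth, so the key is ("Table", None)
]
freq = tuple_key_frequency(elements)
print(freq)
```

With this layout no key exists for a category that is absent from the document, which is the trade-off accepted in the comment above.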

                "FigureCaption": [],
                "Figure": [],
                "Text": [],
                "NarrativeText": [("None", 2)],
                "ListItem": [("None", 12)],
                "BulletedText": [],
                "Title": [("None", 5)],
                "Address": [],
                "EmailAddress": [],
                "Image": [],
                "PageBreak": [],
                "Table": [],
                "Header": [],
                "Footer": [],
                "Caption": [],
                "Footnote": [],
                "Formula": [],
                "List-item": [],
                "Page-footer": [],
                "Page-header": [],
                "Picture": [],
                "Section-header": [],
                "Headline": [],
                "Subheadline": [],
                "Abstract": [],
                "Threading": [],
                "Form": [],
                "Field-Name": [],
                "Value": [],
                "Link": [],
            },
        ),
        (
            "sample-presentation.pptx",
            {
                "UncategorizedText": [],
                "FigureCaption": [],
                "Figure": [],
                "Text": [],
                "NarrativeText": [("0", 3)],
                "ListItem": [("0", 6), ("1", 6), ("2", 3)],
                "BulletedText": [],
                "Title": [("0", 4), ("1", 1)],
                "Address": [],
                "EmailAddress": [],
                "Image": [],
                "PageBreak": [],
                "Table": [("None", 1)],
                "Header": [],
                "Footer": [],
                "Caption": [],
                "Footnote": [],
                "Formula": [],
                "List-item": [],
                "Page-footer": [],
                "Page-header": [],
                "Picture": [],
                "Section-header": [],
                "Headline": [],
                "Subheadline": [],
                "Abstract": [],
                "Threading": [],
                "Form": [],
                "Field-Name": [],
                "Value": [],
                "Link": [],
            },
        ),
    ],
)
def test_get_element_type_frequency(filename, frequency):
    elements = partition(filename=f"example-docs/{filename}")
    elements_freq = get_element_type_frequency(elements)
    assert elements_freq == frequency
25 changes: 25 additions & 0 deletions unstructured/metrics/element_type.py
@@ -0,0 +1,25 @@
from typing import Dict, List, Optional, Tuple, Union

from unstructured.documents.elements import TYPE_TO_TEXT_ELEMENT_MAP


def get_element_type_frequency(
    elements: List,
) -> Union[Dict[str, Tuple[Optional[str], int]], Dict]:
    """
    Calculate the frequency of Element Types from a list of elements.
    """
    frequency: Dict = {key: {} for key in TYPE_TO_TEXT_ELEMENT_MAP}
    if len(elements) == 0:
        return frequency
    for element in elements:
        category = element.category
        category_depth = element.metadata.category_depth

        if str(category_depth) not in frequency[category]:
Contributor (@shreyanid, Oct 9, 2023)

Debating if a similar value initialization should be done for depth, and in general what the best representation of depth is for this use case.

I like that the nested category depth info makes it easy to consider element types without depth if desired, but the alternative of the main key being a tuple (element category, depth) would also possibly have the same benefit. Drawback of a tuple is that every category needs to hold some value in the depth field of the tuple whether or not it applies.

For comparisons, I was thinking it would be easiest if both dictionaries being compared had the exact same keys, but in the case of depth we don't have a limit on it and could potentially nest infinitely, so we don't want to represent all those values with a default.

What do you think makes the most sense for a representation + default value decision given that the primary use case is to (easily) compare the resulting element frequencies between 2 input element lists? I'm leaning default for element categories (prior comment), no default for category depth, and let category depth be nested so it isn't clunky for categories that don't have a depth (which is most of them).

EDIT: I looked at the test output (see comment in test_unstructured/metrics/test_element_type.py) and am now leaning default for element categories (prior comment), no defaults for category depth, and let category depth be TUPLE for the ability to ignore the depth field instead of having to index by None categories that don't have a depth (which is most of them). What do you think?
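
On the comparison use case: one possible approach, sketched below under stated assumptions, is to compare over the union of keys and treat a missing key as a count of zero, so the two dictionaries do not need identical keys and no depth defaults are required. `frequency_diff` is a hypothetical helper, not part of this PR, and the sample counts are made up for illustration.

```python
from typing import Dict, Hashable


def frequency_diff(
    a: Dict[Hashable, int], b: Dict[Hashable, int]
) -> Dict[Hashable, int]:
    """Return a[key] - b[key] for every key in either dict; missing keys count as 0."""
    return {key: a.get(key, 0) - b.get(key, 0) for key in a.keys() | b.keys()}


# Illustrative (category, depth) counts, not taken from a real document.
expected = {("Title", 0): 4, ("Title", 1): 1, ("ListItem", 0): 6}
actual = {("Title", 0): 4, ("ListItem", 0): 5, ("NarrativeText", 0): 3}

diff = frequency_diff(expected, actual)
print(diff)
```

A zero means the two inputs agree on that key, a positive value means the first input has more, and a negative value means the second has more.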

Contributor Author

So you mean that when category_depth=None, you'd want the value in the tuple to be just a plain number? What about when the same category appears both with and without a depth in the same document?

            frequency[category][str(category_depth)] = 1
        else:
            frequency[category][str(category_depth)] += 1
    for key in frequency:
        frequency[key] = list(frequency[key].items())
    return frequency
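
To make the output shape of the function above concrete, here is a self-contained sketch that mirrors its logic without importing `unstructured`. `CATEGORIES` is an assumed three-entry stand-in for the keys of `TYPE_TO_TEXT_ELEMENT_MAP`, and `nested_frequency` and `make` are hypothetical names. As written, the returned value appears to have the shape `Dict[str, List[Tuple[str, int]]]`, which is the annotation used in the sketch.

```python
from types import SimpleNamespace
from typing import Dict, List, Tuple

# Assumption: a small stand-in for the keys of TYPE_TO_TEXT_ELEMENT_MAP.
CATEGORIES = ["Title", "ListItem", "Table"]


def nested_frequency(elements: List) -> Dict[str, List[Tuple[str, int]]]:
    """Mirror the diff's logic: a list of (str(depth), count) pairs per category."""
    frequency: Dict = {key: {} for key in CATEGORIES}
    for element in elements:
        depth = str(element.metadata.category_depth)
        counts = frequency[element.category]
        counts[depth] = counts.get(depth, 0) + 1
    return {key: list(counts.items()) for key, counts in frequency.items()}


def make(category, depth=None):
    # Hypothetical helper that builds a minimal element-like object.
    return SimpleNamespace(
        category=category, metadata=SimpleNamespace(category_depth=depth)
    )


result = nested_frequency(
    [make("Title", 0), make("Title", 0), make("Title", 1), make("Table")]
)
print(result)
```

Every category is present in the result even when its list is empty, and a missing depth shows up as the string "None", matching the expected values in the test file.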