feat: element type frequency #1688

Merged
merged 14 commits into from
Oct 11, 2023
1 change: 1 addition & 0 deletions CHANGELOG.md
@@ -18,6 +18,7 @@
* **Adds `edit_distance` calculation metrics** In order to benchmark the cleaned, extracted text with unstructured, `edit_distance` (`Levenshtein distance`) is included.
* **Adds detection_origin field to metadata** Problem: Currently there isn't an easy way to find out how an element was created. With this change, that information is added. Importance: With this information, developers and users can tell how an element was created and decide how to use it. To use this feature, set UNSTRUCTURED_INCLUDE_DEBUG_METADATA=true.
* **Adds a function that calculates frequency of the element type and its depth** To capture the accuracy of element type extraction, this function counts the occurrences of each unique element type with its depth for use in element metrics.

### Fixes

24 changes: 24 additions & 0 deletions example-docs/fake-email.txt
@@ -0,0 +1,24 @@
MIME-Version: 1.0
Date: Fri, 16 Dec 2022 17:04:16 -0500
Message-ID: <CADc-_xaLB2FeVQ7mNsoX+NJb_7hAJhBKa_zet-rtgPGenj0uVw@mail.gmail.com>
Subject: Test Email
From: Matthew Robinson <[email protected]>
To: Matthew Robinson <[email protected]>
Content-Type: multipart/alternative; boundary="00000000000095c9b205eff92630"

--00000000000095c9b205eff92630
Content-Type: text/plain; charset="UTF-8"

This is a test email to use for unit tests.

Important points:

- Roses are red
- Violets are blue

--00000000000095c9b205eff92630
Content-Type: text/html; charset="UTF-8"

<div dir="ltr"><div>This is a test email to use for unit tests.</div><div><br></div><div>Important points:</div><div><ul><li>Roses are red</li><li>Violets are blue</li></ul></div></div>

--00000000000095c9b205eff92630--
37 changes: 37 additions & 0 deletions test_unstructured/metrics/test_element_type.py
@@ -0,0 +1,37 @@
import pytest

from unstructured.metrics.element_type import get_element_type_frequency
from unstructured.partition.auto import partition
from unstructured.staging.base import elements_to_json


@pytest.mark.parametrize(
    ("filename", "frequency"),
    [
        (
            "fake-email.txt",
            {
                ("UncategorizedText", None): 6,
                ("ListItem", None): 12,
                ("Title", None): 5,
                ("NarrativeText", None): 2,
            },
        ),
        (
            "sample-presentation.pptx",
            {
                ("Title", 0): 4,
                ("Title", 1): 1,
                ("NarrativeText", 0): 3,
                ("ListItem", 0): 6,
                ("ListItem", 1): 6,
                ("ListItem", 2): 3,
                ("Table", None): 1,
            },
        ),
    ],
)
def test_get_element_type_frequency(filename, frequency):
    elements = partition(filename=f"example-docs/{filename}")
    elements_freq = get_element_type_frequency(elements_to_json(elements))
    assert elements_freq == frequency
22 changes: 22 additions & 0 deletions unstructured/metrics/element_type.py
@@ -0,0 +1,22 @@
import json
from typing import Dict, Optional, Tuple, Union


def get_element_type_frequency(
    elements: str,
) -> Union[Dict[Tuple[str, Optional[int]], int], Dict]:
    """
    Calculate the frequency of element types from a JSON string of elements.

    Each key is a (type, category_depth) tuple; the value is its count.
    """
    frequency: Dict = {}
    # An empty string has nothing to parse, so return the empty tally.
    if len(elements) == 0:
        return frequency
    for element in json.loads(elements):
        # Named element_type to avoid shadowing the built-in `type`.
        element_type = element.get("type")
        category_depth = element["metadata"].get("category_depth")
        key = (element_type, category_depth)
        if key not in frequency:
            frequency[key] = 1
        else:
            frequency[key] += 1
    return frequency
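
For readers who want to try the tally without installing `unstructured`, the sketch below is a hypothetical `collections.Counter`-based variant of the function above; it is not part of this PR. The name `count_types_with_depth` and the hand-written sample input are assumptions for illustration, and the sample only mimics the `type` and `metadata.category_depth` fields that the function reads. One deliberate difference: the sketch tolerates a missing `metadata` key, whereas the function in the diff indexes `element["metadata"]` directly.

```python
import json
from collections import Counter
from typing import Dict, Optional, Tuple


def count_types_with_depth(elements_json: str) -> Dict[Tuple[str, Optional[int]], int]:
    """Hypothetical Counter-based variant of the tally (illustration only)."""
    if not elements_json:
        return {}
    counts: Counter = Counter(
        # Assumption: fall back to an empty dict when "metadata" is absent,
        # so the depth becomes None instead of raising KeyError.
        (element.get("type"), element.get("metadata", {}).get("category_depth"))
        for element in json.loads(elements_json)
    )
    return dict(counts)


# Hand-written sample shaped like serialized elements (not real partition output).
sample = json.dumps(
    [
        {"type": "Title", "metadata": {"category_depth": 0}},
        {"type": "ListItem", "metadata": {"category_depth": 1}},
        {"type": "ListItem", "metadata": {"category_depth": 1}},
        {"type": "NarrativeText", "metadata": {}},
    ]
)
result = count_types_with_depth(sample)
print(result)
```

Elements whose metadata carries no `category_depth` end up under a key whose second item is `None`, which is why the expected dictionaries in the test mix integer depths with `None`.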