feat: element type frequency #1688

Klaijan · 2023-10-09T18:00:23Z

Executive Summary

Add a function that returns the frequency of each element type, broken down by category depth.

CHANGELOG.md

Co-authored-by: shreyanid <[email protected]>

unstructured/metrics/element_type.py

shreyanid · 2023-10-09T23:58:04Z

unstructured/metrics/element_type.py

+        category_depth = element.metadata.category_depth
+        if category not in frequency:
+            frequency[category] = {}
+        if str(category_depth) not in frequency[category]:


Debating whether a similar value initialization should be done for depth, and more generally what the best representation of depth is for this use case.

I like that nesting the category depth makes it easy to consider element types without depth if desired, but making the main key a tuple (element category, depth) would possibly have the same benefit. The drawback of a tuple is that every category needs to hold some value in the depth field, whether or not depth applies.

For comparisons, I was thinking it would be easiest if both dictionaries being compared had exactly the same keys. Depth has no upper limit, though, and could potentially nest infinitely, so we don't want to represent all those values with a default.

What do you think makes the most sense for a representation + default value decision, given that the primary use case is to (easily) compare the resulting element frequencies between 2 input element lists? I'm leaning toward a default for element categories (prior comment), no default for category depth, and nested category depth so it isn't clunky for categories that don't have a depth (which is most of them).

EDIT: I looked at the test output (see comment in test_unstructured/metrics/test_element_type.py) and am now leaning toward a default for element categories (prior comment), no defaults for category depth, and category depth as part of a TUPLE key, so the depth field can be ignored instead of having to index by None for categories that don't have a depth (which is most of them). What do you think?

So you mean that when category_depth=None, you'd want the value in the tuple to be just a plain number? What about when the same category has both a depth and no depth in the same document?
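To make the two candidate representations concrete, here is a minimal sketch with made-up counts. The values and the helper name `ignore_depth` are illustrative assumptions, not code from this PR. It covers the case raised in the question, where the same category appears both with and without a depth.

```python
from collections import defaultdict
from typing import Dict, Optional, Tuple

# Hypothetical counts for one document in which "Title" appears both
# with a depth (0) and without one (None).

# Option A: nested dict, category -> depth (as a string) -> count.
nested: Dict[str, Dict[str, int]] = {
    "Title": {"0": 2, "None": 1},
    "UncategorizedText": {"None": 6},
}

# Option B: flat dict keyed by a (category, depth) tuple.
flat: Dict[Tuple[str, Optional[int]], int] = {
    ("Title", 0): 2,
    ("Title", None): 1,
    ("UncategorizedText", None): 6,
}


def ignore_depth(freq: Dict[Tuple[str, Optional[int]], int]) -> Dict[str, int]:
    """Collapse tuple keys into per-category totals (hypothetical helper)."""
    totals: Dict[str, int] = defaultdict(int)
    for (category, _depth), count in freq.items():
        totals[category] += count
    return dict(totals)


per_category = ignore_depth(flat)
```

In both forms, a category that has both a depth and no depth simply gets two separate entries. With tuple keys, dropping the depth is one pass over the dictionary.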

test_unstructured/metrics/test_element_type.py

shreyanid · 2023-10-10T22:53:07Z

test_unstructured/metrics/test_element_type.py

+        (
+            "fake-email.txt",
+            {
+                "UncategorizedText": [("None", 6)],


I meant a tuple as the key in the dictionary of frequencies.
ex. ("UncategorizedText", "None"): 6 or ex. ("Title", 1): 3

Whether the depth is a string or a number (to cover both None and 0, 1, etc.) is up to you.

But in that case, how would the element types that don't exist in the doc be initialized? As ("UncategorizedText",), like this?

In this way, each tuple key to the dictionary represents a unique element type. Overall, the type of the dictionary will be Dict[Tuple[str, int], int] (or possibly str for second value in tuple to capture None)
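A rough sketch of counting into that tuple-keyed dictionary follows. `FakeElement` and `count_element_types` are hypothetical stand-ins rather than the library's classes or the function added in this PR, and `Optional[int]` for the depth is one assumed way to capture None (a string would work as well).

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, Iterable, Optional, Tuple

# Assumed key type: (element category, depth), with None when no depth applies.
FrequencyKey = Tuple[str, Optional[int]]


@dataclass
class FakeElement:
    """Hypothetical stand-in for a partitioned element; not the library class."""

    category: str
    category_depth: Optional[int] = None


def count_element_types(elements: Iterable[FakeElement]) -> Dict[FrequencyKey, int]:
    """Count elements per (category, depth); only observed keys are created."""
    return dict(Counter((e.category, e.category_depth) for e in elements))


elements = [
    FakeElement("Title", 0),
    FakeElement("Title", 0),
    FakeElement("NarrativeText"),
    FakeElement("Title", 1),
]
frequency = count_element_types(elements)
```

Because `Counter` only creates keys it has seen, no category or depth needs a default entry.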

Would it then no longer be possible to create all the keys for the dictionary? Probably. Let's prioritize having meaningful keys over initializing the dictionary with all element categories.

So we revert to not initializing every element type, and then assign a value to each tuple key.
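Without pre-initialized keys, two frequency dictionaries can still be compared by treating a missing key as zero. The sketch below assumes that approach. The function name `frequency_diff` and the sample counts are hypothetical, not part of this PR.

```python
from typing import Dict, Optional, Tuple

FrequencyKey = Tuple[str, Optional[int]]


def frequency_diff(
    output: Dict[FrequencyKey, int],
    source: Dict[FrequencyKey, int],
) -> Dict[FrequencyKey, int]:
    """Return output count minus source count for every key where they differ.

    A key missing from either dictionary is treated as a count of 0.
    """
    diff: Dict[FrequencyKey, int] = {}
    for key in set(output) | set(source):
        delta = output.get(key, 0) - source.get(key, 0)
        if delta != 0:
            diff[key] = delta
    return diff


# Made-up counts for two element lists produced from the same document.
output_freq = {("Title", 0): 2, ("NarrativeText", None): 3}
source_freq = {("Title", 0): 3, ("NarrativeText", None): 3, ("ListItem", 1): 1}
diff = frequency_diff(output_freq, source_freq)
```

Taking the union of keys gives the same result as initializing both dictionaries with every possible key, without having to enumerate an unbounded range of depths.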

shreyanid

LGTM!

Klaijan added 2 commits October 9, 2023 13:59

feat: element type frequency

85acfc7

linting and changelog

2da1f1f

Klaijan changed the title ~~feat: element type frequency~~ Klaijan/feat: element type frequency Oct 9, 2023

add test doc

5507432

Klaijan requested a review from shreyanid October 9, 2023 19:02

Klaijan enabled auto-merge October 9, 2023 21:00

shreyanid reviewed Oct 9, 2023

CHANGELOG.md (outdated)

Klaijan and others added 2 commits October 9, 2023 19:19

Update CHANGELOG.md

354f40d

Co-authored-by: shreyanid <[email protected]>

Merge branch 'main' into klaijan/metric-element-type-freq

371d4d9

Klaijan requested a review from shreyanid October 9, 2023 23:19

shreyanid reviewed Oct 9, 2023

unstructured/metrics/element_type.py

shreyanid reviewed Oct 9, 2023

shreyanid reviewed Oct 10, 2023

test_unstructured/metrics/test_element_type.py (outdated)

Klaijan changed the title ~~Klaijan/feat: element type frequency~~ feat: element type frequency Oct 10, 2023

Klaijan added 3 commits October 10, 2023 16:39

edit element count to tuple

652dcc8

add docstrings

005da71

Merge branch 'main' into klaijan/metric-element-type-freq

60e9597

Klaijan requested review from shreyanid and mallorih October 10, 2023 20:45

Merge branch 'main' into klaijan/metric-element-type-freq

be6a913

shreyanid reviewed Oct 10, 2023

Klaijan added 2 commits October 10, 2023 19:15

fix

a5cccb6

Merge branch 'klaijan/metric-element-type-freq' of https://github.com…

e6031fa

…/Unstructured-IO/unstructured into klaijan/metric-element-type-freq

Klaijan requested a review from shreyanid October 10, 2023 23:15

Klaijan added 3 commits October 10, 2023 19:31

Merge branch 'main' into klaijan/metric-element-type-freq

bf7e7e5

takes json str params'

44e8cbc

Merge branch 'klaijan/metric-element-type-freq' of https://github.com…

1ac004c

…/Unstructured-IO/unstructured into klaijan/metric-element-type-freq

shreyanid approved these changes Oct 11, 2023

Klaijan added this pull request to the merge queue Oct 11, 2023

Merged via the queue into main with commit ee75ce2 Oct 11, 2023
39 checks passed

Klaijan deleted the klaijan/metric-element-type-freq branch October 11, 2023 01:05
