feat: element type frequency #1688
Co-authored-by: shreyanid <[email protected]>
unstructured/metrics/element_type.py
Outdated
category_depth = element.metadata.category_depth
if category not in frequency:
frequency[category] = {}
if str(category_depth) not in frequency[category]:
Debating if a similar value initialization should be done for depth, and in general what the best representation of depth is for this use case.
I like that the nested category depth info makes it easy to consider element types without depth if desired, but the alternative of the main key being a tuple (element category, depth) would also possibly have the same benefit. The drawback of a tuple is that every category needs to hold some value in the depth field of the tuple whether or not it applies.
For comparisons, I was thinking it would be easiest if both dictionaries being compared had the exact same keys, but in the case of depth we don't have a limit on it and could potentially nest infinitely, so we don't want to represent all those values with a default.
What do you think makes the most sense for a representation + default value decision given that the primary use case is to (easily) compare the resulting element frequencies between 2 input element lists? I'm leaning default for element categories (prior comment), no default for category depth, and let category depth be nested so it isn't clunky for categories that don't have a depth (which is most of them).
EDIT: I looked at the test output (see comment in test_unstructured/metrics/test_element_type.py) and am now leaning default for element categories (prior comment), no defaults for category depth, and let category depth be part of a TUPLE key, for the ability to ignore the depth field instead of having to index by None for categories that don't have a depth (which is most of them). What do you think?
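To make the two options in the comment above concrete, here is a minimal sketch of both representations. The sample (category, depth) pairs are made up for illustration and are not taken from the PR or its test fixtures.

```python
from collections import Counter
from typing import Dict, List, Optional, Tuple

# Hypothetical (category, category_depth) pairs standing in for parsed elements.
elements: List[Tuple[str, Optional[int]]] = [
    ("Title", 0),
    ("Title", 1),
    ("Title", 1),
    ("NarrativeText", None),
]

# Option A: nested dict, category -> depth (as a string) -> count.
nested: Dict[str, Dict[str, int]] = {}
for category, depth in elements:
    by_depth = nested.setdefault(category, {})
    by_depth[str(depth)] = by_depth.get(str(depth), 0) + 1

# Option B: flat mapping keyed by (category, depth) tuples.
flat = Counter(elements)

# Ignoring depth means summing per category under either representation.
nested_totals = {
    category: sum(by_depth.values()) for category, by_depth in nested.items()
}
flat_totals: Counter = Counter()
for (category, _depth), count in flat.items():
    flat_totals[category] += count
```

Both forms can drop the depth dimension with a single pass; the difference is mainly whether a category without a depth has to be looked up under a "None" sub-key (nested) or simply carries None in its key (tuple).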
So you mean that for the case with category_depth=None, you'd want the value in the tuple to be just a pure number? What about when the same category has elements both with and without a depth in the same document?
(
"fake-email.txt",
{
"UncategorizedText": [("None", 6)],
I meant a tuple as the key in the dictionary of frequencies.
ex. ("UncategorizedText", "None"): 6
or ex. ("Title", 1): 3
the type of the depth as string or numerical (to cover both None and 0, 1, etc) is up to you
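A minimal sketch of counting with tuple keys, as suggested above. The function name `count_element_types` and the `fake_element` stand-ins are assumptions for illustration; the only attribute taken from the diff is `element.metadata.category_depth`, and `element.category` is assumed to hold the category string.

```python
from types import SimpleNamespace
from typing import Dict, Iterable, Optional, Tuple

FrequencyKey = Tuple[str, Optional[int]]


def count_element_types(elements: Iterable) -> Dict[FrequencyKey, int]:
    """Count elements per (category, category_depth) pair (hypothetical helper)."""
    frequency: Dict[FrequencyKey, int] = {}
    for element in elements:
        # Depth stays None when the element has no depth, so no placeholder is needed.
        key = (element.category, element.metadata.category_depth)
        frequency[key] = frequency.get(key, 0) + 1
    return frequency


def fake_element(category: str, depth: Optional[int] = None) -> SimpleNamespace:
    """Build a stand-in object with the two attributes the counter reads."""
    return SimpleNamespace(
        category=category,
        metadata=SimpleNamespace(category_depth=depth),
    )


sample = [
    fake_element("UncategorizedText"),
    fake_element("UncategorizedText"),
    fake_element("Title", 1),
]
result = count_element_types(sample)
```

Keeping the depth as `None` rather than the string "None" is one of the two options left open in the comment; swapping in `str(...)` on the depth would give the string-keyed variant.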
But in that case, how would the element types that don't exist in the doc be initialized? Like this: ("UncategorizedText",)?
In this way, each tuple key to the dictionary represents a unique element type. Overall, the type of the dictionary will be Dict[Tuple[str, int], int] (or possibly str for the second value in the tuple, to capture None).
Would it then no longer be possible to create all the keys for the dictionary? Probably. Let's instead prioritize having meaningful keys over initializing the dictionary with all element categories.
So we revert to not initializing every element type, and just assign a value to each tuple key.
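Since the stated use case is comparing frequencies from two element lists, dropping the pre-initialized keys means the comparison has to tolerate missing keys. One possible sketch, with `frequency_diff` as a hypothetical helper that is not part of the PR:

```python
from typing import Dict, Optional, Tuple

FrequencyKey = Tuple[str, Optional[int]]


def frequency_diff(
    expected: Dict[FrequencyKey, int],
    actual: Dict[FrequencyKey, int],
) -> Dict[FrequencyKey, int]:
    """Return actual minus expected per key, treating a missing key as a count of 0.

    Keys whose counts match are left out of the result.
    """
    diff: Dict[FrequencyKey, int] = {}
    for key in set(expected) | set(actual):
        delta = actual.get(key, 0) - expected.get(key, 0)
        if delta != 0:
            diff[key] = delta
    return diff


# Made-up frequencies for two element lists.
expected = {("Title", 1): 3, ("UncategorizedText", None): 6}
actual = {
    ("Title", 1): 2,
    ("UncategorizedText", None): 6,
    ("NarrativeText", None): 1,
}
diff = frequency_diff(expected, actual)
```

Iterating over the union of keys gives the same result as if both dictionaries had been initialized with every key, without needing an upper bound on depth.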
…/Unstructured-IO/unstructured into klaijan/metric-element-type-freq
LGTM!
Executive Summary
Add a function that returns the frequency of element types, keyed by category and depth.