Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: element type frequency #1688

Merged
merged 14 commits into from
Oct 11, 2023
Merged

feat: element type frequency #1688

merged 14 commits into from
Oct 11, 2023

Conversation

Klaijan
Copy link
Contributor

@Klaijan Klaijan commented Oct 9, 2023

Executive Summary

Add function that returns frequency of given element types and depth.

@Klaijan Klaijan changed the title feat: element type frequency Klaijan/feat: element type frequency Oct 9, 2023
@Klaijan Klaijan requested a review from shreyanid October 9, 2023 19:02
@Klaijan Klaijan enabled auto-merge October 9, 2023 21:00
CHANGELOG.md Outdated Show resolved Hide resolved
@Klaijan Klaijan requested a review from shreyanid October 9, 2023 23:19
category_depth = element.metadata.category_depth
if category not in frequency:
frequency[category] = {}
if str(category_depth) not in frequency[category]:
Copy link
Contributor

@shreyanid shreyanid Oct 9, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Debating if a similar value initialization should be done for depth, and in general what the best representation of depth is for this use case.

I like that the nested category depth info makes it easy to consider element types without depth if desired, but the alternative of the main key being a tuple (element category, depth) would also possibly have the same benefit. Drawback of a tuple is that every category needs to hold some value in the depth field of the tuple whether or not it applies.

For comparisons, I was thinking it would be easiest if both dictionaries being compared had the exact same keys, but in the case of depth we don't have a limit on it and could potentially nest infinitely, so we don't want to represent all those values with a default.

What do you think makes the most sense for a representation + default value decision given that the primary use case is to (easily) compare the resulting element frequencies between 2 input element lists? I'm leaning default for element categories (prior comment), no default for category depth, and let category depth be nested so it isn't clunky for categories that don't have a depth (which is most of them).

EDIT: I looked at the test output (see comment in test_unstructured/metrics/test_element_type.py) and am now leaning default for element categories (prior comment), no defaults for category depth, and let category depth be TUPLE for the ability to ignore the depth field instead of having to index by None categories that don't have a depth (which is most of them). What do you think?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So you mean for the case that has category_depth=None, you'd want the value in the tuple to be just pure number? What about when same category have both depth and no depth in the same document?

@Klaijan Klaijan changed the title Klaijan/feat: element type frequency feat: element type frequency Oct 10, 2023
@Klaijan Klaijan requested review from shreyanid and mallorih October 10, 2023 20:45
(
"fake-email.txt",
{
"UncategorizedText": [("None", 6)],
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I meant a tuple as the key in the dictionary of frequencies.
ex. ("UncategorizedText", "None"): 6 or ex. ("Title", 1): 3

the type of the depth as string or numerical (to cover both None and 0, 1, etc) is up to you

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But in that case, how would the Element types that aren't exist in the doc be initialize? ("UncategorizedText",) like this?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this way, each tuple key to the dictionary represents a unique element type. Overall, the type of the dictionary will be Dict[Tuple[str, int], int] (or possibly str for second value in tuple to capture None)

Would it then no longer be possible to create all the keys for the dictionary? Probably. Let's instead prioritize having meaningful keys instead of initializing the dictionary with all elements categories.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So we revert back to not having all the Element Type initialized, then assign value to each tuple key.

@Klaijan Klaijan requested a review from shreyanid October 10, 2023 23:15
Copy link
Contributor

@shreyanid shreyanid left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@Klaijan Klaijan added this pull request to the merge queue Oct 11, 2023
Merged via the queue into main with commit ee75ce2 Oct 11, 2023
39 checks passed
@Klaijan Klaijan deleted the klaijan/metric-element-type-freq branch October 11, 2023 01:05
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants