Feat: Bag of words for testing metric #1650

Merged
merged 49 commits into from
Oct 10, 2023


mallorih
Contributor

@mallorih mallorih commented Oct 5, 2023

This PR adds the bag_of_words function to count the frequency of words for evaluation.

Testing

from unstructured.cleaners.core import bag_of_words

string = "The dog loved the cat, but the cat loved the cow."

# Call the function on the sample string and print the resulting word counts.
print(bag_of_words(string))

@mallorih mallorih requested a review from shreyanid October 5, 2023 18:20
@mallorih mallorih removed the request for review from shreyanid October 5, 2023 18:20
@shreyanid
Contributor

Could we add a test for text that contains incorrect spaces between words (like how the unstructured output sometimes looks)? For example, in "i n t r o d u c t i o n", each letter will be considered its own word, right?
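
To make the concern concrete, here is a minimal sketch assuming the counting is a plain whitespace split. The helper `naive_bag_of_words` is hypothetical and is not the function added in this PR; it only illustrates what happens to letter-spaced text under that assumption.

```python
from collections import Counter


def naive_bag_of_words(text: str) -> dict:
    """Hypothetical helper: lowercase, split on whitespace, count tokens."""
    return dict(Counter(text.lower().split()))


# Letter-spaced text, similar to what badly extracted output can look like.
spaced = "i n t r o d u c t i o n"
counts = naive_bag_of_words(spaced)

# Every letter becomes its own "word"; the intended word never appears.
print(counts)
print("introduction" in counts)
```

Under this assumption the word "introduction" is never counted, so a test for this input would pin down whatever behavior the real function chooses.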

@mallorih
Contributor Author

mallorih commented Oct 9, 2023

> Could we add a test for text that contains incorrect spaces between words (like how the unstructured output sometimes looks)? For example, in "i n t r o d u c t i o n", each letter will be considered its own word, right?

Are we sure we want to keep all punctuation?

>>> bag_of_words(string)
{'have': 1, 'dog': 1, 'and': 1, 'cat,': 1, 'love': 1, 'my': 1, 'dog.': 1}

@shreyanid
Contributor

No, we should not keep all punctuation, only punctuation within a word (hyphens and apostrophes). Commas and dashes between words, as well as end-of-sentence punctuation, should all be removed.

@shreyanid
Contributor

shreyanid commented Oct 9, 2023

Yes, dogs' should (ideally) fall into the same case as dog's, because in both the punctuation belongs to a single word rather than sitting between words. Could we add a test case for an end-of-word possessive like dogs'?
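
As a rough illustration of the rule discussed here, the sketch below keeps hyphens and apostrophes that belong to a single word (including a trailing possessive apostrophe) and drops punctuation between words. The name `bag_of_words_sketch` and the regular expression are assumptions made for this example, not the implementation merged in this PR.

```python
import re
from collections import Counter

# Hypothetical pattern: runs of ASCII letters/digits, optionally joined by a
# single hyphen or apostrophe, with an optional trailing apostrophe ("dogs'").
_WORD = re.compile(r"[a-z0-9]+(?:[-'][a-z0-9]+)*'?")


def bag_of_words_sketch(text: str) -> dict:
    """Count lowercase words, keeping only word-internal or possessive marks."""
    return dict(Counter(_WORD.findall(text.lower())))


# Commas and the final period are dropped; words are counted case-insensitively.
first = bag_of_words_sketch("The dog loved the cat, but the cat loved the cow.")

# dog's, dogs' and well-worn each survive as a single token.
second = bag_of_words_sketch("The dogs' bowl and the dog's bed are well-worn.")

print(first)
print(second)
```

One known limitation of this sketch: a word wrapped in single quotes keeps its closing quote, since it cannot be told apart from a trailing possessive, and non-ASCII letters are not matched.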

Contributor

@shreyanid shreyanid left a comment


looks great! thanks for taking care of all the punctuation cases :)

@mallorih mallorih enabled auto-merge October 10, 2023 18:38
@mallorih mallorih added this pull request to the merge queue Oct 10, 2023
Merged via the queue into main with commit a5d7ae4 Oct 10, 2023
39 checks passed
@mallorih mallorih deleted the feat/bag-of-words branch October 10, 2023 19:12
3 participants