Feat: Bag of words for testing metric #1650

Merged 49 commits on Oct 10, 2023
Changes from 1 commit
Commits
fe1767d
local embedding model from huggingface
Sep 28, 2023
a339690
Merge branch 'main' of https://github.com/Unstructured-IO/unstructured
Sep 28, 2023
c0808f8
Merge branch 'main' of https://github.com/Unstructured-IO/unstructure…
Oct 2, 2023
672bc8d
add arguments
Oct 2, 2023
a6f9fbb
begin coding bag of words
Oct 3, 2023
8511de1
bag of words function
Oct 5, 2023
2722e09
fix syntax
Oct 5, 2023
fc2754f
Merge branch 'main' into feat/bag-of-words
mallorih Oct 5, 2023
ed42bc1
format
Oct 5, 2023
0ae04ea
Merge branch 'feat/bag-of-words' of https://github.com/Unstructured-I…
Oct 5, 2023
332c70a
remove unwanted file
Oct 5, 2023
ced5db6
Merge branch 'main' into feat/bag-of-words
mallorih Oct 5, 2023
4f1c9ec
Merge branch 'main' of https://github.com/Unstructured-IO/unstructure…
Oct 5, 2023
394e3bb
Merge branch 'feat/bag-of-words' of https://github.com/Unstructured-I…
Oct 5, 2023
81ba875
update changelog and version
Oct 5, 2023
866e8e3
Merge branch 'main' into feat/bag-of-words
mallorih Oct 5, 2023
c4114f7
fix test
Oct 5, 2023
bdefeae
Merge branch 'feat/bag-of-words' of https://github.com/Unstructured-I…
Oct 5, 2023
71b5656
added test
Oct 5, 2023
2e04119
redo logic for bag of words
Oct 5, 2023
5d1769a
update tests
Oct 5, 2023
f8ecffa
remove funky words
Oct 6, 2023
010477a
update version
Oct 6, 2023
b851862
Merge branch 'main' into feat/bag-of-words
Klaijan Oct 6, 2023
34334b3
Merge branch 'main' of https://github.com/Unstructured-IO/unstructure…
Oct 9, 2023
b36a310
fix bag of words and move code to correct files
Oct 9, 2023
7da1314
conflict
Oct 9, 2023
7e06054
formatting
Oct 9, 2023
21bd5fd
Merge branch 'main' into feat/bag-of-words
mallorih Oct 9, 2023
c5128fc
fix typing
Oct 9, 2023
ca30d9d
Merge branch 'feat/bag-of-words' of https://github.com/Unstructured-I…
Oct 9, 2023
f1d32cb
restore core.py file
Oct 9, 2023
fbd1abb
correct typing
Oct 9, 2023
58a670a
fix syntax
Oct 9, 2023
dcd053f
add new condition
Oct 9, 2023
e86da52
remove additional code
Oct 9, 2023
88ba596
removes hypens at the beginning of sentence
Oct 10, 2023
bd46203
formatted
Oct 10, 2023
1838b95
adding test for dash and hyphen
shreyanid Oct 10, 2023
128ea22
add test
Oct 10, 2023
9ad8073
Merge branch 'feat/bag-of-words' of https://github.com/Unstructured-I…
Oct 10, 2023
8d8dcde
Merge branch 'main' into feat/bag-of-words
mallorih Oct 10, 2023
8dd9b06
removed test
Oct 10, 2023
9a1aaa0
Merge branch 'feat/bag-of-words' of https://github.com/Unstructured-I…
Oct 10, 2023
999cfc8
fix logic to remove punctuation with spaces around it.
Oct 10, 2023
adfec61
fix test
Oct 10, 2023
dc690c4
Merge branch 'main' into feat/bag-of-words
shreyanid Oct 10, 2023
0a47804
Merge branch 'main' into feat/bag-of-words
mallorih Oct 10, 2023
b699f9b
Merge branch 'main' into feat/bag-of-words
mallorih Oct 10, 2023
15 changes: 15 additions & 0 deletions test_unstructured/metrics/test_text_extraction.py
@@ -123,6 +123,21 @@ def test_calculate_edit_distance_with_filename(filename, expected_score, expecte
            "I have a dog and a cat, I love my dog.",
            {"i": 2, "have": 1, "a": 2, "dog": 2, "and": 1, "cat": 1, "love": 1, "my": 1},
        ),
        (
            "My dog's hair is red, but the dogs' houses are blue.",
            {
                "my": 1,
                "dog's": 1,
                "hair": 1,
                "is": 1,
                "red": 1,
                "but": 1,
                "the": 1,
                "dogs": 1,
                "houses": 1,
                "are": 1,
                "blue": 1,
            },
        ),
    ],
)
def test_bag_of_words(text, expected):
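
For orientation, here is a minimal sketch of the kind of word count these parametrized cases describe. `naive_bag_of_words` is a hypothetical helper written for this note, not the `bag_of_words` function under test. It assumes that apostrophes and hyphens inside a word are kept while leading and trailing punctuation is dropped.

```python
import re
from collections import Counter
from typing import Dict


def naive_bag_of_words(text: str) -> Dict[str, int]:
    """Hypothetical stand-in for illustration only, not the PR's implementation."""
    # Match runs of letters/digits, allowing an apostrophe or hyphen only when
    # it sits between two alphanumeric runs (so "dog's" survives, "dogs'" -> "dogs").
    tokens = re.findall(r"[a-z0-9]+(?:['-][a-z0-9]+)*", text.lower())
    return dict(Counter(tokens))


counts = naive_bag_of_words("My dog's hair is red, but the dogs' houses are blue.")
```

Under these assumptions every word in that sentence is counted once, with `dog's` kept whole and `dogs'` reduced to `dogs`.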
8 changes: 8 additions & 0 deletions unstructured/metrics/text_extraction.py
@@ -4,6 +4,8 @@

from rapidfuzz.distance import Levenshtein

from unstructured.nlp.patterns import ENDS_IN_PUNCT_RE


def calculate_edit_distance(
    output: str,
@@ -75,6 +77,12 @@ def bag_of_words(text: str) -> Dict[str, int]:
    incorrect_word: str = ""
    words = remove_punctuation(text.lower(), ["-", "'"]).split()

    # Remove remaining punctuation
    for idx in range(len(words)):
        punct = ENDS_IN_PUNCT_RE.findall(words[idx])
        if punct:
            words[idx] = words[idx].replace(punct[0], "")

    i = 0
    while i < len(words):
        if len(words[i]) > 1:
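
The "Remove remaining punctuation" step can be tried in isolation with the sketch below. `TRAILING_PUNCT_RE` is an assumed stand-in for `ENDS_IN_PUNCT_RE`, whose definition is not shown in this diff. Both helper names are hypothetical. `strip_by_replace` mirrors the loop body in the hunk, and `strip_trailing_only` is an alternative included only for comparison.

```python
import re

# Assumption: a pattern matching one non-word, non-space character at the very end.
TRAILING_PUNCT_RE = re.compile(r"[^\w\s]\Z")


def strip_by_replace(word: str) -> str:
    """Mirror of the hunk's logic: find the trailing character, then str.replace it."""
    punct = TRAILING_PUNCT_RE.findall(word)
    if punct:
        # str.replace removes every occurrence of that character, not only the last.
        return word.replace(punct[0], "")
    return word


def strip_trailing_only(word: str) -> str:
    """Alternative: delete only the matched trailing character."""
    return TRAILING_PUNCT_RE.sub("", word)


words = ["dogs'", "dog's", "blue"]
cleaned = [strip_by_replace(w) for w in words]
```

For the words in the test sentence the two helpers agree. They differ only when the trailing character also appears earlier in the same word, for example `'tis'`, where the replace-based version also drops the leading apostrophe.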