"RuntimeError: Either words or rawWords must be filled" using add_doc sometimes #161

Closed
batmanscode opened this issue Feb 26, 2022 · 6 comments
Labels
enhancement New feature or request

Comments

@batmanscode

I have text in a dataframe and was adding it in like this:

for text in df['text']:
    mdl.add_doc(text.strip().split())

This works fine.

However, when I remove stopwords before calling add_doc, I get the error in the title.

I'm doing the preprocessing using texthero like this:

import texthero as hero
from texthero import preprocessing

custom_pipeline = [preprocessing.remove_stopwords,
                   preprocessing.remove_digits,
                   preprocessing.remove_punctuation,
                   preprocessing.remove_whitespace]

df['clean_text'] = hero.clean(df['tweet'], custom_pipeline)

for text in df['clean_text']:
    mdl.add_doc(text.strip().split())
RuntimeError: Either `words` or `rawWords` must be filled.

Side note: maybe this kind of preprocessing could be built into tomotopy using texthero.

@bab2min
Owner

bab2min commented Feb 26, 2022

Hi @batmanscode,
It seems that there is an empty document in your df['clean_text']. Could you check the value of df['clean_text'] to make sure there are no blank documents?
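As a minimal sketch of how such an empty document can arise: if cleaning removes every token from a text, `text.strip().split()` returns an empty list, and that empty list is what gets passed to `add_doc`. The sample strings and variable names below are hypothetical, not taken from the reporter's data.

```python
# Hypothetical cleaned texts: the last two lost all their tokens during cleaning.
clean_texts = ["great movie tonight", "", "   "]

# Same tokenization as in the loop above: strip, then split on whitespace.
token_lists = [t.strip().split() for t in clean_texts]

# Positions whose token list is empty; these are the blank documents.
empty_positions = [i for i, words in enumerate(token_lists) if not words]
print(empty_positions)
```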

@batmanscode
Author

@bab2min df['clean_text'].isnull().value_counts() showed no empty values.

@bab2min
Owner

bab2min commented Mar 2, 2022

@batmanscode
df.isnull() only tests whether a value is NA. An empty str '' is not NA, so isnull() does not flag empty strings. Try the following:

df['clean_text'].apply(lambda x:bool(x.strip())).value_counts()

@batmanscode
Author

batmanscode commented Mar 8, 2022

Ah, this makes sense, thank you. There are indeed empty values here. Is there a way to get tomotopy to skip these? It's not really a problem to remove them; I'm just curious.

@bab2min
Owner

bab2min commented Mar 8, 2022

@batmanscode Currently, add_doc has no such feature. But I think it's a good idea to add the option to ignore empty docs.
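Until such an option exists, one possible workaround is to skip empty token lists on the caller's side. The helper below is only a sketch: `add_nonempty_docs` is a hypothetical name, not part of tomotopy, and a plain list stands in for the model so the example runs without tomotopy installed.

```python
from typing import Callable, Iterable, List


def add_nonempty_docs(add_doc: Callable[[List[str]], object],
                      texts: Iterable[str]) -> int:
    """Tokenize each text and pass it to add_doc, skipping empty ones.

    Returns the number of texts that were skipped.
    """
    skipped = 0
    for text in texts:
        words = text.strip().split()
        if not words:  # nothing left after cleaning, so do not add it
            skipped += 1
            continue
        add_doc(words)
    return skipped


# Stand-in for mdl.add_doc: collect the token lists in a plain list.
collected = []
skipped = add_nonempty_docs(collected.append,
                            ["good morning", "", "   ", "hello"])
print(skipped, collected)
```

With a real model, the call would presumably look like `add_nonempty_docs(mdl.add_doc, df['clean_text'])`, using the `mdl` and `df` names from the original post.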

@bab2min bab2min added the enhancement New feature or request label Mar 8, 2022
@batmanscode
Author

@bab2min Agreed. It would be a nice quality-of-life feature to have.
