"RuntimeError: Either `words` or `rawWords` must be filled" using `add_doc` sometimes #161

batmanscode · 2022-02-26T10:37:02Z

I have text in a dataframe and was adding it in like this:

for text in df['text']:
    mdl.add_doc(text.strip().split())

This works fine

However, when I tried to remove stopwords before using add_doc, I got the error in the title.

I'm doing the preprocessing using texthero like this:

import texthero as hero
from texthero import preprocessing

custom_pipeline = [preprocessing.remove_stopwords,
                   preprocessing.remove_digits,
                   preprocessing.remove_punctuation,
                   preprocessing.remove_whitespace]

df['clean_text'] = hero.clean(df['tweet'], custom_pipeline)

for text in df['clean_text']:
    mdl.add_doc(text.strip().split())

RuntimeError: Either `words` or `rawWords` must be filled.

Side note: maybe this could be built into tomotopy using texthero

bab2min · 2022-02-26T16:20:30Z

Hi @batmanscode ,
It seems that there is an empty document in your df['clean_text']. Could you check the value of df['clean_text'] to make sure there are no blank documents?

batmanscode · 2022-02-28T06:10:36Z

@bab2min df['clean_text'].isnull().value_counts() showed no empty values

bab2min · 2022-03-02T15:58:14Z

@batmanscode
df.isnull() only tests whether a value is NA. Because an empty str '' is not NA, it doesn't flag empty strings. Try the following:

df['clean_text'].apply(lambda x:bool(x.strip())).value_counts()
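To illustrate the difference, here is a minimal sketch on a made-up three-row Series (the sample values are assumptions, not the data from this issue). It counts NA values and blank strings separately, so you can see that the first check finds nothing while the second one does:

```python
import pandas as pd

# Hypothetical sample data: one normal document, one empty string,
# and one whitespace-only string (not the data from this issue).
clean_text = pd.Series(["hello world", "", "   "])

# isnull() only flags NA values (None/NaN), so it reports nothing here.
na_count = int(clean_text.isnull().sum())

# A document is blank when nothing is left after stripping whitespace.
is_blank = clean_text.str.strip().eq("")
blank_count = int(is_blank.sum())

print(na_count, blank_count)
```

The boolean mask `is_blank` can also be used to look at the offending rows directly, e.g. `clean_text[is_blank]`.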

batmanscode · 2022-03-08T12:34:34Z

Ah, this makes sense, thank you. There are indeed empty values here. Is there a way to get tomotopy to skip these? It's not really a problem to remove them, I'm just curious.

bab2min · 2022-03-08T17:08:45Z

@batmanscode Currently, add_doc has no such feature. But I think it's a good idea to add the option to ignore empty docs.
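In the meantime, one possible workaround is to tokenize first and skip any document whose token list comes out empty. This is only a sketch: `iter_nonempty_tokens` is a hypothetical helper name (not part of tomotopy), and the sample strings are made up.

```python
def iter_nonempty_tokens(texts):
    """Yield (index, tokens) for every text that still has tokens after splitting."""
    for i, text in enumerate(texts):
        tokens = text.strip().split()
        if tokens:  # an empty list means the document was blank after cleaning
            yield i, tokens

# Hypothetical cleaned texts; the second and third are blank.
texts = ["topic models rock", "", "   ", "another tweet"]
kept = list(iter_nonempty_tokens(texts))
print(kept)

# With a tomotopy model `mdl`, the loop from the original post would become:
# for _, tokens in iter_nonempty_tokens(df['clean_text']):
#     mdl.add_doc(tokens)
```

The helper keeps the original index because, once blank rows are skipped, the documents in the model no longer line up one-to-one with the dataframe rows.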

batmanscode · 2022-03-08T19:30:01Z

@bab2min Agreed. Would be a nice quality of life feature to have

batmanscode changed the title ~~RuntimeError: Either words or rawWords must be filled using add_doc sometimes~~ RuntimeError: Either words or rawWords must be filled using add_doc sometimes Feb 26, 2022

batmanscode changed the title ~~RuntimeError: Either words or rawWords must be filled using add_doc sometimes~~ "RuntimeError: Either words or rawWords must be filled" using add_doc sometimes Feb 26, 2022

bab2min added the enhancement New feature or request label Mar 8, 2022

bab2min added a commit that referenced this issue Jul 17, 2022

implemented ignore_empty_words argument (#161)

3329a92

bab2min mentioned this issue Jul 17, 2022

Dev 0.12.3 #176

Merged

bab2min closed this as completed Jan 22, 2023
