Custom component not executing when calling nlp.pipe #1959
Thanks, this is a bug. When component functions don't have a pipe method, nlp.pipe falls back to a simple wrapper. That wrapper should be yielding the result, but is instead yielding the original doc. Here's a minimal hack that should make your code work for now:

def clean_component(doc):
""" Clean up text. Tokenize, lowercase, and remove punctuation and stopwords """
print("Running cleaner")
# Remove punctuation, symbols (#) and stopwords
doc = [tok.text for tok in doc if (tok.text not in stopwords and tok.pos_ != "PUNCT" and tok.pos_ != "SYM")]
# Make all tokens lowercase
doc = [tok.lower() for tok in doc]
doc = ' '.join(doc)
return nlp.make_doc(doc)
def pipe_clean(docs, **kwargs):
for doc in docs:
yield clean_component(doc)
# Yes, adding attributes to functions works...It's just a bit dirty-looking. Arguably less confusing to
# make it a class. Shrug.
clean_component.pipe = pipe_clean |
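A usage sketch along these lines (assuming spaCy 2.x, where nlp.add_pipe accepts a callable, and an installed en_core_web_sm model; the stopwords and nlp names that clean_component uses as globals are defined here):

import spacy
from spacy.lang.en.stop_words import STOP_WORDS as stopwords

nlp = spacy.load("en_core_web_sm")
# Add the patched component at the end of the pipeline, so tok.pos_ is already
# set by the tagger when the cleaner runs.
nlp.add_pipe(clean_component, name="cleaner", last=True)

texts = ["The quick brown fox jumps over the lazy dog!",
         "Another #example sentence, with some stop words."]
for doc in nlp.pipe(texts):
    # Each doc is the re-tokenized, cleaned text returned by clean_component.
    print(doc.text)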
Btw, the stop words in spaCy are currently case-sensitive, so you might want to write your clean-up logic slightly differently. You should also take care that the processing you're doing will likely have a huge impact on the accuracy of the parser and NER. If you just want to get a bag of words that's lowercased and doesn't have stop words, you might be better off keeping the original Doc.
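For example, a sketch that keeps the pipeline untouched and just pulls a lowercased, stop-word-free bag of words out of a processed Doc afterwards (assuming spaCy's English stop word list; comparing the lowercased form sidesteps the case-sensitivity issue):

from spacy.lang.en.stop_words import STOP_WORDS

def bag_of_words(doc):
    # Compare the lowercased token text against the stop word list and
    # drop punctuation and symbol tokens.
    return [tok.lower_ for tok in doc
            if tok.lower_ not in STOP_WORDS
            and tok.pos_ not in ("PUNCT", "SYM")]

words = bag_of_words(nlp("The Quick brown fox jumps over the lazy dog!"))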
This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.
I created a custom component to filter out stop words and punctuation and added it to my pipeline like so:
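(Something like the following, a hypothetical sketch assuming spaCy 2.x and the clean_component function shown earlier:)

import spacy

nlp = spacy.load("en_core_web_sm")        # assumed model name
nlp.add_pipe(clean_component, last=True)  # run the cleaner after the tagger, parser and NER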
But when I run nlp.pipe on some text, "Running cleaner" is printed but the text isn't filtered.
The output is the same as the input. Am I using pipe wrong? Thanks.