Custom component not executing when calling nlp.pipe #1959

Closed
enerrio opened this issue Feb 9, 2018 · 3 comments
Labels
bug Bugs and behaviour differing from documentation

Comments

enerrio commented Feb 9, 2018

I created a custom component to filter out stop words and punctuation and added it to my pipeline like so:

import string

import spacy
from spacy.lang.en.stop_words import STOP_WORDS

nlp = spacy.load('en')

punctuations = string.punctuation
stopwords = STOP_WORDS

def clean_component(doc):
    """ Clean up text. Tokenize, lowercase, and remove punctuation and stopwords """
    print("Running cleaner")
    # Remove punctuation, symbols (#) and stopwords
    doc = [tok.text for tok in doc if (tok.text not in stopwords and tok.pos_ != "PUNCT" and tok.pos_ != "SYM")]
    # Make all tokens lowercase
    doc = [tok.lower() for tok in doc]
    doc = ' '.join(doc)
    return nlp.make_doc(doc)

nlp.add_pipe(clean_component, name='cleaner', after='tagger')
print(nlp.pipe_names) # ['tagger', 'cleaner','parser', 'ner']

But when I run nlp.pipe on some text, "Running cleaner" is printed but the text isn't filtered.

for doc in nlp.pipe(data['text'][:2]):
    print(doc)

The output is the same as the input. Am I using pipe wrong? Thanks.

Your Environment

  • Operating System: MacOS 10.13.3
  • Python Version Used: 3.6.4
  • spaCy Version Used: 2.0.7
  • Environment Information:

honnibal added the bug label (Bugs and behaviour differing from documentation) Feb 9, 2018

honnibal (Member) commented Feb 9, 2018

Thanks, this is a bug. When component functions don't have a .pipe() method, we call a helper function to pipe them, here: https://github.com/explosion/spaCy/blob/master/spacy/language.py#L721

This function should be yielding the result, but is instead yielding the original doc. Here's a minimal hack that should make your code work for now:

def clean_component(doc):
    """ Clean up text. Tokenize, lowercase, and remove punctuation and stopwords """
    print("Running cleaner")
    # Remove punctuation, symbols (#) and stopwords
    doc = [tok.text for tok in doc if (tok.text not in stopwords and tok.pos_ != "PUNCT" and tok.pos_ != "SYM")]
    # Make all tokens lowercase
    doc = [tok.lower() for tok in doc]
    doc = ' '.join(doc)
    return nlp.make_doc(doc)

def pipe_clean(docs, **kwargs):
    for doc in docs:
        yield clean_component(doc)

# Yes, adding attributes to functions works... It's just a bit dirty-looking. Arguably less confusing to
# make it a class. Shrug.
clean_component.pipe = pipe_clean
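
For readers who want to see the failure mode in isolation, here is a minimal sketch that does not need spaCy at all. The names `buggy_pipe`, `fixed_pipe` and `shout` are hypothetical stand-ins, not spaCy's actual code; they only mimic the behaviour described above, where the helper calls the component but yields the original object instead of the component's return value.

```python
def buggy_pipe(func, docs):
    """Hypothetical stand-in for the helper: ignores func's return value."""
    for doc in docs:
        func(doc)   # the component runs, but its result is thrown away
        yield doc   # the original object is yielded


def fixed_pipe(func, docs):
    """Yield whatever the component returns."""
    for doc in docs:
        yield func(doc)


def shout(text):
    # Toy "component" that returns a new object instead of mutating its input
    return text.upper()


texts = ["hello world", "custom component"]
buggy_result = list(buggy_pipe(shout, texts))
fixed_result = list(fixed_pipe(shout, texts))
print(buggy_result)
print(fixed_result)
```

This is presumably why the workaround helps: once the component has its own .pipe attribute, the helper is not used, and pipe_clean yields the returned doc directly.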

honnibal (Member) commented Feb 9, 2018

Btw, the stop words in spaCy are currently case-sensitive, so you might want to write your clean-up logic slightly differently. You should also take care, because the processing you're doing will likely have a huge impact on the accuracy of the parser and NER.

If you just want to get a bag of words that's lower-cased and doesn't have stop words, you might be better off keeping the original Doc object and using the token.is_stop, token.lower_, token.is_punct, etc. attributes.
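
A rough sketch of that attribute-based approach follows. To keep it runnable without a model, it uses a hypothetical `FakeToken` stand-in that carries only the three attributes named above; `bag_of_words` is also an illustrative name, not part of spaCy. With a real pipeline you would pass a Doc instead of the hand-built list.

```python
from collections import namedtuple

# Hypothetical stand-in for a spaCy Token, with only the attributes used here
FakeToken = namedtuple("FakeToken", ["lower_", "is_stop", "is_punct"])


def bag_of_words(doc):
    """Return lower-cased token texts, skipping stop words and punctuation."""
    return [tok.lower_ for tok in doc if not tok.is_stop and not tok.is_punct]


# Hand-built "doc"; with spaCy this would be the result of nlp(text)
doc = [
    FakeToken("the", True, False),
    FakeToken("parser", False, False),
    FakeToken("works", False, False),
    FakeToken("!", False, True),
]
words = bag_of_words(doc)
print(words)
```

Because the Doc itself is left untouched, the parser and NER still see the original text. Given the case-sensitivity noted above, capitalised stop words may not be flagged by is_stop in this version, so checking the lower-cased form against the stop list may be needed.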

enerrio pushed a commit to enerrio/spaCy that referenced this issue Feb 15, 2018
This was referenced Feb 15, 2018
honnibal added a commit that referenced this issue Feb 17, 2018
ines closed this as completed Feb 17, 2018
lock bot commented May 7, 2018

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

lock bot locked as resolved and limited conversation to collaborators May 7, 2018