stopwords #649
Comments
Thanks! I'm actually in the process of finally reorganising the language data, so there will be an update soon that fixes this problem, among other things.

We're not very happy with the current stopword lists (or most other standard stopword lists that are available tbh). They're outdated and full of pre-processing artifacts, custom hacks and other stuff that's not relevant for spaCy (like It's probably okay for information extraction, but not very useful for Machine Learning at the moment. So we want to use a slightly different and non-standard approach to determine what spaCy considers a stopword and how the language data is organised in the codebase. We're always happy about input and suggestions – although there obviously won't be a 100% perfect solution, because in the end, it's always sort of arbitrary.

In the meantime, here's how you can customise the stopword behaviour. You can set attributes in the vocabulary, and tokens will inherit these attributes:

    lex = nlp.vocab[u'call']
    lex.is_stop = False
    doc = nlp(u'Call me!')
    [(w.text, w.is_stop) for w in doc]
    # [(u'Call', False), (u'me', True), (u'!', False)]
It would be helpful if the docs eventually included an explanation of the decision-making that went into whichever words end up being considered stopwords. In my experience, it's better to err on the side of fewer rather than more stopwords, and to get a linguist's input (the NLTK list is actually a pretty decent starting place, notwithstanding some of its flaws). You've shown that it's easy to customise stopword behaviour, so stopword-ifying e.g. very frequent words should be straightforward.
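As a rough sketch of what "stopword-ifying very frequent words" could look like: count token frequencies in your own corpus, then flag the top few through the vocabulary using the `nlp.vocab[...]` / `is_stop` pattern shown earlier in this thread. The helper names `most_frequent_words` and `mark_as_stop`, the `top_n` parameter, and the toy corpus are all made up for illustration, and whitespace splitting stands in for real tokenization.

```python
from collections import Counter


def most_frequent_words(texts, top_n=10):
    """Return the top_n most frequent lowercased tokens across texts.

    Splitting on whitespace is a simplification for illustration; a real
    pipeline would count tokens produced by the spaCy tokenizer instead.
    """
    counts = Counter()
    for text in texts:
        counts.update(text.lower().split())
    return [word for word, _ in counts.most_common(top_n)]


def mark_as_stop(nlp, words):
    """Flag each word as a stopword on the vocabulary (hypothetical helper).

    Assumes `nlp` is an already loaded spaCy pipeline and relies only on
    the nlp.vocab[...] / is_stop pattern from the comment above.
    """
    for word in words:
        nlp.vocab[word].is_stop = True


# Toy corpus, purely illustrative.
corpus = ["the cat sat on the mat", "the dog sat on the log"]
candidates = most_frequent_words(corpus, top_n=3)
# mark_as_stop(nlp, candidates)  # run this once `nlp` has been loaded
```

Whether raw frequency is a good criterion depends on the task, so it is probably worth reviewing the candidate list by hand before flagging anything.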
+1 to fmailhot's comment. An explanation of the stopword decisions would be helpful. IMO (or at least for my case) it is probably better to err on the conservative side when labeling stopwords: for most applications it is easier for users to explicitly label what they consider stopwords (e.g. company names in a corpus of company documents) than to explicitly 'unlist' words from the stopwords.
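To make the "explicitly label" versus "explicitly unlist" distinction concrete, here is a minimal sketch that does both in one place. It assumes only the `nlp.vocab[...]` / `is_stop` pattern from the maintainer's comment; `set_stop_flags`, `content_words`, and the example word sets are hypothetical, not part of spaCy.

```python
def set_stop_flags(nlp, add=(), remove=()):
    """Explicitly list and unlist stopwords on the vocabulary.

    Hypothetical helper; `nlp` is assumed to be a loaded spaCy pipeline.
    """
    for word in add:
        nlp.vocab[word].is_stop = True
    for word in remove:
        nlp.vocab[word].is_stop = False


def content_words(doc):
    """Return the text of every token that is not flagged as a stopword."""
    return [w.text for w in doc if not w.is_stop]


# Illustrative values only: company-style terms to list, and two words
# that commenters in this thread did not expect to be stopwords.
DOMAIN_STOPWORDS = {"acme", "inc", "corp"}
UNLISTED = {"call", "well"}

# set_stop_flags(nlp, add=DOMAIN_STOPWORDS, remove=UNLISTED)
# content_words(nlp(u"Call me!"))
```

Keeping both sets in your own code, rather than editing the library's list, makes the choices easy to review and to change per project.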
So where is the explanation/justification for the stopword list? This got closed, so I assume the explanation was written somewhere. There are some words in there that don't make sense, like 'call' and 'well'. I think it could use some improvement.
I am also interested since the list seems to be multiple times the size of the
@nateGeorge Actually, after digging through the git history, it looks like the list may have come from Stone, Dennis, Kwantes (2010), as seen in this line from the repository.
This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.
I have observed that spaCy also treats many common verbs like 'call' as stopwords (as indicated by IS_STOP), which is a little out of the ordinary. Is there any information that describes how spaCy determines stopwords? Is there a way to change the stopword criteria?