This repository has been archived by the owner on Feb 19, 2022. It is now read-only.

Fix fr stopwords #132

Merged
merged 2 commits into from
Aug 9, 2013

Conversation

mbwolff
Contributor

@mbwolff mbwolff commented Aug 8, 2013

When using Serendipomatic with French texts, the common words "les" and "a" are not filtered as stop words. I made a quick fix for this.

mbwolff added 2 commits August 7, 2013 14:44
Added line breaks.
I think the problem is with NLTK:  it is missing "les" and "a" from its
French stopword list.
@mialondon
Contributor

Hello, and thanks for your contribution! Just in case you didn't see, there's a bunch of work to do for multilingual texts discussed at #114

@rlskoeser
Contributor

Wow, I'm surprised nltk stopwords don't include those. I wonder if we're supposed to be stemming terms first? Although I guess that wouldn't help at all for a...

The two changes are redundant, right? We should only need to add the words in one place or the other? Although technically I probably shouldn't have the nltk stopwords checked into the git repo-- I couldn't figure out how to get them deployed on heroku without doing that. @mialondon any thoughts?

@mbwolff when I have the time, I will look into merging this in and see what I can do to set it up to be more extensible (e.g., maybe we need to start our own lists of extra stopwords not handled by nltk, so we can add more terms as we discover them).

@mbwolff
Contributor Author

mbwolff commented Aug 8, 2013

Yes, I was thinking the same thing. We can use local lists of stopwords and edit as necessary.

@mialondon
Contributor

I'm wondering if there's a reason why the nltk library doesn't include them - presumably they've thought through these issues? It'd be good to check in with them in case they're not doing it for a hard-won reason, and perhaps either contribute to theirs, or as you've suggested, supplement it with a local version. A version editable as a plain text file would be the easiest way to have a range of contributors.
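For illustration only (this is not code from the pull request): a plain-text supplement could be as simple as one word per line. The file format, the `#` comment convention, and the function names below are all assumptions made for this sketch.

```python
import io


def parse_stopword_lines(lines):
    """Return a set of stop words from an iterable of text lines.

    Assumed format (hypothetical): one word per line, blank lines ignored,
    anything after '#' treated as a contributor comment.
    """
    words = set()
    for line in lines:
        # Keep only the text before any '#' so entries can be annotated.
        word = line.split("#", 1)[0].strip().lower()
        if word:
            words.add(word)
    return words


def load_stopword_file(path):
    """Read a UTF-8 stop word file, e.g. a hypothetical 'extra_french.txt'."""
    with io.open(path, encoding="utf-8") as handle:
        return parse_stopword_lines(handle)
```

A file like this can be edited by contributors who never touch the Python code, which is the main appeal of the plain-text approach.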

I've been wondering about stemming... I suppose it depends how ruthlessly it's applied. e.g. if 'policing' is stemmed to 'police' it brings in extra, unrelated concepts (to use an example from another historian I know).

I've only had a bit of a play with deploying to heroku (I set up an instance to play with the code) so I'm not sure about libraries.

@frankieroberto

@rlskoeser Hello, @mialondon pointed me here. Not too sure I can help, but the way I usually pull in external libraries onto a Heroku box is via a Gemfile. However, that applies to Ruby and I think you're using Python?

@frankieroberto

PS https://devcenter.heroku.com/articles/python suggests using 'Pip' for dependency management, if that's any help.

@rlskoeser
Contributor

@frankieroberto yep, we're using pip for all of the normal python dependencies. However, as far as I can discover the nltk corpora have to be downloaded via the nltk downloader tool (we should have the stopwords download command documented in the github project readme). I think I tried it on my dev heroku instance last week, but can't remember now if it didn't work or just went into an unexpected place. I'll try to revisit that again (and take notes) when I have a bit of time.

@mialondon
Contributor

Ah, sorry Frankie, I knew you'd used Heroku but I thought you used Python rather than Ruby...

@mbwolff
Contributor Author

mbwolff commented Aug 8, 2013

I just sent a message to Peter Ljunglöf http://www.cse.chalmers.se/~peb/ who handles parsing for NLTK. Will let you know if I hear anything.

@anarchivist
Contributor

Hi, you may want to take a look at this StackOverflow post as it may be useful. http://stackoverflow.com/questions/13965823/resource-corpora-wordnet-not-found-on-heroku

@rlskoeser
Contributor

Thanks, I suspected it might be something like that. I'll definitely make use of that when I revisit & document our heroku deploy.

rlskoeser added a commit that referenced this pull request Aug 9, 2013
@rlskoeser rlskoeser merged commit a679827 into chnm:master Aug 9, 2013
@mbwolff mbwolff deleted the FixFrStopwords branch August 9, 2013 02:16
@rlskoeser
Contributor

@mbwolff - I generalized your contribution and set up a simple way to handle extra stop words by language that aren't in the nltk corpus. For now, the only extras are the two you added, but I think it should make it very easy to add extra stop words for any of the languages that are currently supported. The updated stop words should go live in the next production update.
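
As a rough sketch of the idea (not the merged code; `EXTRA_STOPWORDS`, `stopwords_for`, and `filter_terms` are hypothetical names), per-language extras can be kept in a small table and merged with whatever base list is in use. With nltk installed, the base list would presumably come from `nltk.corpus.stopwords.words('french')`; here it is passed in as a plain list so the example runs without the corpus.

```python
# Hypothetical supplement table; the project's real names and layout may differ.
EXTRA_STOPWORDS = {
    "french": {"les", "a"},  # the two words reported missing in this PR
}


def stopwords_for(language, base_words):
    """Combine a base stop word list with the local extras for a language."""
    combined = {word.lower() for word in base_words}
    # Languages without extras simply get the base list back.
    combined |= EXTRA_STOPWORDS.get(language, set())
    return combined


def filter_terms(terms, language, base_words):
    """Drop any term whose lowercase form is a stop word for the language."""
    stop = stopwords_for(language, base_words)
    return [term for term in terms if term.lower() not in stop]
```

Adding a newly discovered word then means touching only the table, not the filtering logic.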

@anarchivist - thanks for the link; when I actually went to look at it in detail I discovered that it was pretty much exactly what I had done myself. I guess with heroku's read-only filesystem it must be the only solution.
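
A minimal sketch of the "check the corpora into the repo" workaround, under stated assumptions: the corpus files are committed under a `nltk_data` directory at the project root (a hypothetical location), and nltk is told to look there by extending its search path. `nltk.data.path` is an ordinary list of directories; the helper name `register_bundled_corpora` is made up for this example and takes the list as a parameter so it can be exercised without nltk installed.

```python
import os


def register_bundled_corpora(search_path, project_root):
    """Append <project_root>/nltk_data to a list of nltk data directories.

    `search_path` is expected to be nltk.data.path (a plain list).
    Returns the directory that was registered.
    """
    data_dir = os.path.join(project_root, "nltk_data")
    # Avoid duplicate entries if this runs more than once (e.g. on reload).
    if data_dir not in search_path:
        search_path.append(data_dir)
    return data_dir


# Typical use, assuming nltk is installed and the stopwords corpus was
# committed under <project_root>/nltk_data/corpora/stopwords:
#
#   import nltk
#   root = os.path.dirname(os.path.abspath(__file__))
#   register_bundled_corpora(nltk.data.path, root)
```

Setting the `NLTK_DATA` environment variable to the same directory is an alternative that needs no code change.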
