Fix fr stopwords #132

mbwolff · 2013-08-08T05:12:19Z

When using Serendipomatic with French texts, the common words "les" and "a" are not filtered as stop words. I made a quick fix for this.

Added line breaks.

I think the problem is with NLTK: it is missing "les" and "a" from its French stopword list.

mialondon · 2013-08-08T09:19:28Z

Hello, and thanks for your contribution! Just in case you didn't see, there's a bunch of work to do for multilingual texts discussed at #114

rlskoeser · 2013-08-08T15:01:46Z

Wow, I'm surprised nltk stopwords don't include those. I wonder if we're supposed to be stemming terms first? Although I guess that wouldn't help at all for a...

The two changes are redundant, right? We should only need to add the words in one place or the other? Although technically I probably shouldn't have the nltk stopwords checked into the git repo-- I couldn't figure out how to get them deployed on heroku without doing that. @mialondon any thoughts?

@mbwolff when I have the time, I will look into merging this in and see what I can do to set it up to be more extensible (e.g., maybe we need to start our own lists of extra stopwords not handled by nltk, so we can add more terms as we discover them).

mbwolff · 2013-08-08T15:06:03Z

Yes, I was thinking the same thing. We can use local lists of stopwords and edit as necessary.
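
A minimal sketch of the "local lists" idea discussed above: keep a small, locally maintained set of extra stopwords per language and merge it with whatever base list the library provides. The names `EXTRA_STOPWORDS`, `combined_stopwords`, and `remove_stopwords` are hypothetical, and the base list is passed in as a plain argument so the sketch does not depend on NLTK. This is an illustration only, not the code that was merged.

```python
# Hypothetical local supplement: stopwords missing from the upstream list,
# keyed by language name. Only the two words from this pull request are listed.
EXTRA_STOPWORDS = {
    "french": {"les", "a"},
}


def combined_stopwords(language, base_words):
    """Return the base stopwords for a language plus any local extras."""
    words = {w.lower() for w in base_words}
    words |= EXTRA_STOPWORDS.get(language, set())
    return words


def remove_stopwords(tokens, language, base_words):
    """Drop tokens found in the combined stopword set (case-insensitive)."""
    stop = combined_stopwords(language, base_words)
    return [t for t in tokens if t.lower() not in stop]


# Usage with a tiny stand-in base list (in practice this would come from NLTK).
base = ["le", "la", "de"]
tokens = ["Les", "chats", "de", "la", "ville", "a", "Paris"]
filtered = remove_stopwords(tokens, "french", base)
```

Keeping the extras in a separate mapping means new words can be added per language without touching the upstream corpus files.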

mialondon · 2013-08-08T15:53:06Z

I'm wondering if there's a reason why the nltk library doesn't include them - presumably they've thought through these issues? It'd be good to check in with them in case they're not doing it for a hard-won reason, and perhaps either contribute to theirs, or as you've suggested, supplement it with a local version. A version editable as a plain text file would be the easiest way to have a range of contributors.

I've been wondering about stemming... I suppose it depends how ruthlessly it's applied. e.g. if 'policing' is stemmed to 'police' it brings in extra, unrelated concepts (to use an example from another historian I know).

I've only had a bit of a play with deploying to heroku (I set up an instance to play with the code) so I'm not sure about libraries.
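
One way the plain-text-file suggestion above could look, assuming a hypothetical format of one word per line with `#` comment lines. The function name `load_stopword_file` and the file name `french.txt` are made up for this sketch.

```python
import os
import tempfile


def load_stopword_file(path):
    """Read one stopword per line; skip blank lines and '#' comments."""
    words = set()
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            word = line.strip().lower()
            if word and not word.startswith("#"):
                words.add(word)
    return words


# Usage: write a small example file, then load it.
with tempfile.TemporaryDirectory() as tmp:
    example_path = os.path.join(tmp, "french.txt")
    with open(example_path, "w", encoding="utf-8") as handle:
        handle.write("# extra French stopwords\nles\na\n\n")
    extra_french = load_stopword_file(example_path)
```

A file like this can be edited by contributors who do not write Python, which is the main appeal of the plain-text approach.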

frankieroberto · 2013-08-08T15:53:06Z

@rlskoeser Hello, @mialondon pointed me here. Not too sure I can help, but the way I usually pull in external libraries onto a Heroku box is via a Gemfile. However, that applies to Ruby and I think you're using Python?

frankieroberto · 2013-08-08T15:54:57Z

PS https://devcenter.heroku.com/articles/python suggests using 'Pip' for dependency management, if that's any help.

rlskoeser · 2013-08-08T16:06:59Z

@frankieroberto yep, we're using pip for all of the normal python dependencies. However, as far as I can discover the nltk corpora have to be downloaded via the nltk downloader tool (we should have the stopwords download command documented in the github project readme). I think I tried it on my dev heroku instance last week, but can't remember now if it didn't work or just went into an unexpected place. I'll try to revisit that again (and take notes) when I have a bit of time.
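
For reference, the stopwords corpus can be fetched from the command line with `python -m nltk.downloader stopwords`. The sketch below does the same from Python and points NLTK at a project-local directory. The `nltk_data` directory name and both function names are assumptions for illustration, not the project's actual layout, and the download step needs network access.

```python
import os


def nltk_data_dir(project_root):
    """Project-local directory for NLTK corpora (hypothetical layout)."""
    return os.path.join(project_root, "nltk_data")


def fetch_stopwords(project_root):
    """Download the stopwords corpus into the project and register the path."""
    # Imported lazily so this module loads even where nltk is not installed.
    import nltk

    target = nltk_data_dir(project_root)
    nltk.download("stopwords", download_dir=target)  # requires network access
    # Make nltk search the project-local directory as well as its defaults.
    if target not in nltk.data.path:
        nltk.data.path.append(target)
    return target
```

Downloading into a directory inside the project is what makes it possible to commit the corpus alongside the code when the deploy target cannot run the downloader itself.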

mialondon · 2013-08-08T16:12:27Z

Ah, sorry Frankie, I knew you'd used Heroku but I thought you used Python rather than Ruby...

mbwolff · 2013-08-08T16:21:18Z

I just sent a message to Peter Ljunglöf (http://www.cse.chalmers.se/~peb/), who handles parsing for NLTK. Will let you know if I hear anything.

anarchivist · 2013-08-08T16:49:14Z

Hi, you may want to take a look at this StackOverflow post as it may be useful. http://stackoverflow.com/questions/13965823/resource-corpora-wordnet-not-found-on-heroku

rlskoeser · 2013-08-08T16:53:18Z

Thanks, I suspected it might be something like that. I'll definitely make use of that when I revisit & document our heroku deploy.


rlskoeser · 2013-08-09T02:42:33Z

@mbwolff - I generalized your contribution and set up a simple way to handle extra stop words, by language, that aren't in the nltk corpus. For now, the only extras are the two you added, but it should now be very easy to add extra stop words for any of the currently supported languages. The updated stop words should go live in the next production update.

@anarchivist - thanks for the link; when I actually went to look at it in detail I discovered that it was pretty much exactly what I had done myself. I guess with heroku's read-only filesystem it must be the only solution.

mbwolff added 2 commits August 7, 2013 14:44

Made README.md readable

141fe16

Added line breaks.

Kluge for French stopwords.

6d48f43

I think the problem is with NLTK: it is missing "les" and "a" from its French stopword list.

rlskoeser added a commit that referenced this pull request Aug 9, 2013

Merge pull request #132 from mbwolff/FixFrStopwords

a679827

Fix fr stopwords

rlskoeser merged commit a679827 into chnm:master Aug 9, 2013

mbwolff deleted the FixFrStopwords branch August 9, 2013 02:16

rlskoeser added a commit that referenced this pull request Aug 9, 2013

more generic handling for extra language stopwords not in nlkt #132

8b30f57
