Stopwords Filter

This project is a very simple and naive implementation of a stopwords filter that removes a list of banned words (stopwords) from a sentence.

Quick guide

  • Install

Just type

gem install stopwords-filter

or

# Don't forget the 'require:'
gem 'stopwords-filter', require: 'stopwords'

in your Gemfile.

  • Use it

    1. Simple version
stopwords = ['by', 'written', 'from']
filter = Stopwords::Filter.new stopwords

filter.filter 'guide by douglas adams'.split
# ['guide', 'douglas', 'adams']

filter.stopword? 'by'
# true
    2. Snowball version
filter = Stopwords::Snowball::Filter.new "en"
filter.filter 'guide by douglas adams'.split
# ['guide', 'douglas', 'adams']

filter.stopword? 'by'
# true

2.1 Snowball version with Sieve class (thanks to @s2gatev)

sieve = Stopwords::Snowball::WordSieve.new

filtered = sieve.filter lang: :en, words: 'guide by douglas adams'.split
# filtered = ['guide', 'douglas', 'adams']

sieve.stopword? lang: :en, word: 'by'
# true

What is a Stopword?

According to Wikipedia

In computing, stop words are words which are filtered out prior to, or after, processing of natural language data (text).

And that's it. Words that are removed before you perform some task on the rest of them.

Why would I want to remove anything?

Imagine you have a database of products and you want your customers to be able to search them. You can't use a proper search engine (such as Solr, Sphinx or even Google), nor the full-text search features of popular database systems such as PostgreSQL. You are left alone with LIKEs and %.

You have your fake search engine working. Someone searches 'Guide Douglas Adams' and you find 'Douglas Adams - Hitchhiker's guide to the galaxy'. Everything is perfect.

But then someone searches 'guide by douglas adams' and you don't find anything. You don't have any 'by' in the description or title of the book! Most importantly, you don't need that 'by'!

You wish you could get rid of all those 'by' or 'written' or 'from', huh? That's why we are here!
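
To make the scenario concrete, here is a minimal sketch in plain Ruby of turning a search string into LIKE conditions once the stopwords are gone. The like_conditions helper, the title column and the hard-coded stopword list are assumptions made up for illustration; none of them is part of this gem.

```ruby
# Hypothetical helper, not part of the gem: builds a parameterized
# WHERE fragment with one LIKE condition per remaining search term.
STOPWORDS = %w[by written from].freeze

def like_conditions(query, column: 'title')
  # Lowercase, split on whitespace and drop the stopwords.
  terms = query.downcase.split - STOPWORDS
  # One "column LIKE ?" per term, joined with AND.
  sql = terms.map { "#{column} LIKE ?" }.join(' AND ')
  # Wrap each term in % so it matches anywhere in the column.
  # Note: % and _ typed by the user are not escaped in this sketch.
  [sql, terms.map { |term| "%#{term}%" }]
end

sql, binds = like_conditions('guide by douglas adams')
puts sql           # title LIKE ? AND title LIKE ? AND title LIKE ?
puts binds.inspect # ["%guide%", "%douglas%", "%adams%"]
```

Without the filtering step, the query would also contain a condition for '%by%', which is exactly the term that makes the search fail.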

How does this thing work?

The main class of this 'library' is Stopwords::Filter. You just create a new object with an array of stopwords:

stopwords = ['by', 'written', 'from']
filter = Stopwords::Filter.new stopwords

And that's it, now you can filter:

filter.filter 'guide by douglas adams'.split  #-> ['guide', 'douglas', 'adams']
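
For readers curious about what such a filter boils down to, the following is a rough stand-in written for this README, not the gem's actual source. The class name NaiveStopwordsFilter is hypothetical; it only mirrors the filter and stopword? methods shown in the quick guide.

```ruby
require 'set'

# Illustrative stand-in, NOT the gem's source: keeps the stopwords in a
# Set and rejects any word found in it.
class NaiveStopwordsFilter
  def initialize(stopwords)
    @stopwords = stopwords.to_set
  end

  # Returns the words that are not stopwords, in their original order.
  def filter(words)
    words.reject { |word| stopword?(word) }
  end

  # True when the given word is in the stopword list.
  def stopword?(word)
    @stopwords.include?(word)
  end
end

naive = NaiveStopwordsFilter.new(%w[by written from])
p naive.filter('guide by douglas adams'.split) # ["guide", "douglas", "adams"]
p naive.stopword?('by')                        # true
```

This sketch compares words exactly as given, so 'By' would not match 'by'; whether the real gem normalizes case is not something this example tries to answer.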

That's all?

I know what you're thinking: it takes one line of Ruby code to subtract one array from another. That's why we have added some extra functionality: Snowball stopword lists, already built for you and ready to use.

At least, in the beginning we were using only Snowball stopwords, but several collaborators have improved this humble gem by including new languages or adding new stopwords. So now the Snowball version is more of a "Snowball and friends" version.

How do I use that snowball thing?

You just create the filter with the locale you want to use

filter = Stopwords::Snowball::Filter.new "en"

And then you filter without worrying about the exact stopwords used

filter.filter 'guide by douglas adams'.split  #-> ['guide', 'douglas', 'adams']

Which languages are supported with snowball?

Currently we have support for:

  • Afrikaans (af)
  • Arabic (ar)
  • Bengali (bn)
  • Breton (br)
  • Catalan (ca)
  • Chinese (zh)
  • Czech (cs)
  • Danish (da)
  • German (de)
  • Greek (el)
  • English (en)
  • Spanish (es)
  • Finnish (fi): due to an error, it can also be referenced with the fn locale
  • French (fr)
  • Hebrew (he)
  • Hungarian (hu)
  • Indonesian (id)
  • Italian (it)
  • Korean (ko)
  • Dutch (nl)
  • Polish (pl)
  • Portuguese (pt)
  • Romanian (ro)
  • Russian (ru)
  • Swedish (sv)
  • Thai (th)
  • Turkish (tr)
  • Vietnamese (vi)

In the changelog you can see the collaborators for each language.

Anything else?

In a future version I would like to include a chaining filter, where you define a series of operations that are executed in linear order, just like the Pipes and Filters design pattern.
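
As a rough idea of what that could look like, here is a sketch in plain Ruby. Nothing in it exists in the gem today: the FilterChain class and its add and run methods are hypothetical names, and each step is simply a block that takes an array of words and returns a new one.

```ruby
# Hypothetical sketch of a chaining filter; nothing here exists in the gem.
# Each step is any object responding to #call that takes and returns an
# array of words.
class FilterChain
  def initialize
    @steps = []
  end

  # Appends a step and returns self so calls can be chained.
  def add(step = nil, &block)
    @steps << (step || block)
    self
  end

  # Runs the steps in the order they were added, feeding each step
  # the output of the previous one.
  def run(words)
    @steps.reduce(words) { |current, step| step.call(current) }
  end
end

stopwords = %w[by written from]
chain = FilterChain.new
  .add { |words| words.map(&:downcase) } # normalize case first
  .add { |words| words - stopwords }     # then drop the stopwords
  .add { |words| words.uniq }            # finally remove duplicates

p chain.run('Guide BY Douglas Adams adams'.split)
# ["guide", "douglas", "adams"]
```

The order of the steps matters here: if the stopwords were removed before downcasing, 'BY' would survive the chain.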

Acknowledgments

Thanks to @s2gatev, who added the stopword? method and the sieve class to this gem.

Thanks to @bettysteger, @fauno, @vrypan, @woto, @grzegorzblaszczyk, @nerde, @sbeckeriv and @zackxu1 for language support and other features.