This is implementation of Moses Charikar’s simhashes in Ruby.
When you have a string and want to calculate it’s simhash, you should
my_string.simhash
By default it will generate 64-bit integer - that is simhash for this string
It’s always better to tokenize string before simhashing. It’s as simple as
my_string.simhash(:split_by => / /)
This will generate 64-bit integer based, but will split string into words before. It’s handy when you need to calculate similarity of strings based on word usage. You can split string as you like: by letters/sentences/specific letter-combinations, etc.
my_string.simhash(:split_by => /./, :hashbits => 512)
Sometimes you might need longer simhash (finding similarity for very long strings is a good example). You can set length of result hash by passing hashbits parameter. This example will return 512-bit simhash for your string splitted by sentences.
It’s useful to clean your string before simhashing. But it’s useful not to clean, too.
Here are examples:
my_string.simhash(:stop_words => true) # here we clean
This will find stop-words in your string and remove them before simhashing. Stop-words are “the”, “not”, “about”, etc. Currently we remove only Russian and English stop-words.
my_string.simhash(:preserve_punctuation => true) # here we not
This will not remove punctuation before simhashing. Yes, we remove all dots, commas, etc. after splitting string to words by default. Because different punctiation does not mean difference in general. If you not agree you can turn this default off.
As usual:
gem install simhash
But if you have GNU MP library, simhash will work faster! To check out which version is used, type:
Simhash::DEFAULT_STRING_HASH_METHOD It should return symbol. If symbol ends with “rb”, your simhash is slow. If you want make it faster, install GNU MP.
To build C extensions:
cd ext/string_hashing ARCHFLAGS='-arch x84_64' ruby extconf.rb make
You might be able to omit ‘ARCHFLAGS` or use a different version.