
Approximate similarity search #51

Closed
piskvorky opened this issue Sep 2, 2011 · 13 comments
Labels
wishlist Feature request

Comments

@piskvorky
Owner

piskvorky commented Sep 2, 2011

Google published a whitepaper on approximate kNN queries: http://www.google.com/trends/correlate. See how it could apply to gensim & semantic similarity searches.

Background:

  • currently, gensim does a linear scan (compare query against every indexed vector, by means of a matrix multiplication)
  • I tried fancier indexing techniques, but they degenerate into linearly checking each datum anyway, for the high-dimensional vectors
  • even worse, they access objects out-of-order (thrashing caches & HW buffers), so in reality they are much slower than a plain linear scan (matrix multiplication is linear in index size, but its constant factors are super low)
  • Google claims this new technique works well for high-dim data as well => maybe finally something faster than a linear scan?
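
The linear-scan baseline described above can be sketched with NumPy (a minimal illustration of the idea; `brute_force_knn` is a made-up name, not gensim's actual code):

```python
import numpy as np

def brute_force_knn(index, query, k=3):
    """Exact kNN via a single matrix multiplication over the whole index.

    `index`: (n_docs, n_dims) matrix of unit-length vectors;
    `query`: unit-length (n_dims,) vector; similarity = cosine.
    """
    sims = index @ query           # one sequential pass over contiguous memory
    top = np.argsort(-sims)[:k]    # best k ids, highest similarity first
    return top, sims[top]

rng = np.random.default_rng(0)
index = rng.standard_normal((1000, 50))
index /= np.linalg.norm(index, axis=1, keepdims=True)
ids, sims = brute_force_knn(index, index[42], k=3)
print(ids[0])                      # the query vector finds itself: 42
```

The single `index @ query` product is exactly the cache-friendly sequential access pattern that makes the "dumb" scan hard to beat in practice.
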
@piskvorky
Owner Author

Another resource: FLANN http://www.cs.ubc.ca/~mariusm/index.php/FLANN/FLANN has Python bindings.

They don't mention scalability, but considering it's recent software specialized for approximate k-NN in high-dimensional spaces, this ought to be as good as it gets.

@jtmcmc
Contributor

jtmcmc commented Apr 4, 2015

Just curious: considering http://radimrehurek.com/2013/12/performance-shootout-of-nearest-neighbours-contestants/#survivors, do you feel there is a best library to integrate now, if this were to be done? (Thinking of tackling this issue.)

@piskvorky
Owner Author

Hmm, I wonder if it makes sense to integrate some algo fully (as in, implement Annoy directly in Python/C/Cython). Not super difficult, but not trivial either.

Another option is to rely on Annoy as a 3rd party lib with no deep integration, and just make it easier to use the Annoy API from gensim (or vice versa). I remember Annoy was picky about the type of input it accepts and its API was a bit unintuitive, so working around that would be a plus. Also, the reliance on C++ and Boost can make Annoy hard to install for many users.

Tackling this 4-year-old issue will be welcome :)

@jtmcmc
Contributor

jtmcmc commented Apr 4, 2015

Yes, I see how Annoy is a bit hard to implement. I've also found https://github.com/ryanrhymes/panns which could be a better fit. I'm going to get Annoy installed and try to do some comparisons. Alternatively, the Google Correlate algorithm doesn't seem that complicated to implement, so that could be promising as well.

@piskvorky
Owner Author

I don't think the Annoy algo is that hard to implement. It's pretty straightforward IIRC.

I mean, Erik's C++ implementation is involved, because it's heavily optimized, goes for memory-mapping etc etc. But the algo itself is clean.
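
The core idea — recursively splitting points with random hyperplanes, then scanning only the candidate bucket a query descends to — can be sketched in a few lines of NumPy (a toy sketch only, not Erik's implementation: no tree forest, no priority-queue search, no memory-mapping):

```python
import numpy as np

def build_tree(ids, vecs, leaf_size=10, rng=None):
    """Recursively split points with random hyperplanes (the core Annoy idea)."""
    if rng is None:
        rng = np.random.default_rng(0)
    if len(ids) <= leaf_size:
        return ids                            # leaf: a small candidate bucket
    normal = rng.standard_normal(vecs.shape[1])
    side = vecs[ids] @ normal > 0             # which side of the hyperplane?
    if side.all() or not side.any():          # degenerate split: stop early
        return ids
    return (normal,
            build_tree(ids[~side], vecs, leaf_size, rng),
            build_tree(ids[side], vecs, leaf_size, rng))

def query_tree(tree, vecs, q):
    """Descend to one leaf, then scan only that small bucket exactly."""
    while isinstance(tree, tuple):
        normal, left, right = tree
        tree = right if q @ normal > 0 else left
    return tree[np.argmax(vecs[tree] @ q)]    # best candidate in the leaf

rng = np.random.default_rng(1)
vecs = rng.standard_normal((500, 20))
tree = build_tree(np.arange(500), vecs, rng=rng)
print(query_tree(tree, vecs, vecs[7]))        # usually 7: it lands in its own leaf
```

Each query inspects only one small bucket instead of all 500 vectors, which is where the speedup over a full scan comes from; the real library builds many such trees and merges their candidates to recover accuracy.
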

Either way, let me know how you progress. Would be great to finally have something efficient in gensim :)

@jodaiber

jodaiber commented Jun 9, 2015

This would be incredibly useful! Is there any update on approximate sim. search in gensim (i.e. is anyone working on it)?

@piskvorky
Owner Author

The only update is, @erikbern (author of Annoy) left Spotify... but he still works on Annoy, somehow :)

On the other hand, Annoy has shed its dependency on Boost + got several cleanups, fixes and improvements recently, so it's become much more viable as a 3rd party lib.

I think I'd prefer to keep the brute force exact kNN in gensim (for small problems, <1M items) and integrate cleanly with Annoy's approximate kNN for larger datasets.

@jodaiber or do you have other ideas?
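
Such a split could be a thin dispatch layer along these lines (purely illustrative; `most_similar`, `approx_backend` and `EXACT_LIMIT` are hypothetical names, not gensim or Annoy API):

```python
import numpy as np

EXACT_LIMIT = 1_000_000  # below this, a brute-force scan is typically fast enough

def most_similar(index, query, k=5, approx_backend=None):
    """Route small indexes to an exact scan, large ones to an ANN backend.

    `approx_backend`, if given, is any callable (query, k) -> list of ids,
    e.g. glue code wrapping an Annoy index.
    """
    if approx_backend is None or len(index) < EXACT_LIMIT:
        sims = index @ query                      # exact cosine kNN
        return [int(i) for i in np.argsort(-sims)[:k]]
    return approx_backend(query, k)               # approximate kNN

rng = np.random.default_rng(0)
index = rng.standard_normal((100, 8))
index /= np.linalg.norm(index, axis=1, keepdims=True)
print(most_similar(index, index[3], k=1))         # exact path: [3]
```
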

@erikbern

erikbern commented Jun 9, 2015

I'm all for integrating Annoy. Obv I'm biased though :).

I'm currently running some benchmarks that could be relevant: https://github.com/erikbern/ann-benchmarks

@tmylk
Contributor

tmylk commented Oct 18, 2016

@tmylk tmylk closed this as completed Oct 18, 2016
@erikbern

nice!

@piskvorky
Owner Author

@tmylk can we change the tutorial to use a more meaningful dataset?

How about the GoogleNews word2vec model (3,000,000 x 300 matrix)? Lots of people use that.

@tmylk
Contributor

tmylk commented Oct 19, 2016

I agree that it's a more illustrative example to show benefits of Annoy. It would look great in a blog post. For the tutorial we chose something that easily runs on a laptop.
