
Approximate similarity search #51

Closed
piskvorky opened this issue Sep 2, 2011 · 13 comments
Labels
wishlist Feature request

Comments

@piskvorky
Owner

piskvorky commented Sep 2, 2011

Google published a whitepaper on approximate kNN queries: http://www.google.com/trends/correlate. See how it could apply to gensim & semantic similarity searches.

Background:

  • currently, gensim does a linear scan (compare query against every indexed vector, by means of a matrix multiplication)
  • I tried fancier indexing techniques, but they degenerate into linearly checking each datum anyway, for the high-dimensional vectors
  • even worse, they access objects out-of-order (thrashing caches & HW buffers), so in reality they are much slower than a plain linear scan (matrix multiplication is linear in index size, but its constant factors are super low)
  • Google claims this new technique works well for high-dim data as well => maybe finally something faster than a linear scan?
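
The linear-scan baseline described above can be sketched with NumPy (a minimal illustration of the idea; `brute_force_knn` is a made-up name, not gensim's actual code):

```python
import numpy as np

def brute_force_knn(index, query, k=3):
    """Exact kNN via a single matrix multiplication over the whole index.

    `index`: (n_docs, n_dims) matrix of unit-length vectors;
    `query`: unit-length (n_dims,) vector; similarity = cosine.
    """
    sims = index @ query           # one sequential pass over contiguous memory
    top = np.argsort(-sims)[:k]    # best k ids, highest similarity first
    return top, sims[top]

rng = np.random.default_rng(0)
index = rng.standard_normal((1000, 50))
index /= np.linalg.norm(index, axis=1, keepdims=True)
ids, sims = brute_force_knn(index, index[42], k=3)
print(ids[0])                      # the query vector finds itself: 42
```

The single `index @ query` product is exactly the cache-friendly sequential access pattern that makes the "dumb" scan hard to beat in practice.
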
@piskvorky
Owner Author

Another resource: FLANN http://www.cs.ubc.ca/~mariusm/index.php/FLANN/FLANN has Python bindings.

They don't mention scalability, but considering it's recent software specialized for approximate k-NN in high-dimensional spaces, this ought to be as good as it gets.

@jtmcmc
Contributor

jtmcmc commented Apr 4, 2015

Just curious: considering http://radimrehurek.com/2013/12/performance-shootout-of-nearest-neighbours-contestants/#survivors, do you feel there is a best library to integrate now, if this were to be done? (Thinking of tackling this issue.)

@piskvorky
Owner Author

Hmm, I wonder if it makes sense to integrate some algo fully (as in, implement Annoy directly in Python/C/Cython). Not super difficult, but not trivial either.

Another option is to rely on Annoy as a 3rd party lib with no deep integration, and just make it easier to use the Annoy API from gensim (or vice versa). I remember Annoy was picky about the type of input it accepts and its API was a bit unintuitive, so working around that would be a plus. Also, the reliance on C++ and Boost can make Annoy hard to install for many users.

Tackling this 4-year-old issue will be welcome :)

@jtmcmc
Contributor

jtmcmc commented Apr 4, 2015

Yes, I see how Annoy is a bit hard to implement. I've also found https://github.com/ryanrhymes/panns which could be a better fit. I'm going to get Annoy installed and try to do some comparisons. Alternatively, the Google Correlate algorithm doesn't seem that complicated to implement, so that could be promising as well.

@piskvorky
Owner Author

I don't think the Annoy algo is that hard to implement. It's pretty straightforward IIRC.

I mean, Erik's C++ implementation is involved, because it's heavily optimized, goes for memory-mapping etc etc. But the algo itself is clean.
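
The core idea — recursively splitting points with random hyperplanes, then scanning only the candidate bucket a query descends to — can be sketched in a few lines of NumPy (a toy sketch only, not Erik's implementation: no tree forest, no priority-queue search, no memory-mapping):

```python
import numpy as np

def build_tree(ids, vecs, leaf_size=10, rng=None):
    """Recursively split points with random hyperplanes (the core Annoy idea)."""
    if rng is None:
        rng = np.random.default_rng(0)
    if len(ids) <= leaf_size:
        return ids                            # leaf: a small candidate bucket
    normal = rng.standard_normal(vecs.shape[1])
    side = vecs[ids] @ normal > 0             # which side of the hyperplane?
    if side.all() or not side.any():          # degenerate split: stop early
        return ids
    return (normal,
            build_tree(ids[~side], vecs, leaf_size, rng),
            build_tree(ids[side], vecs, leaf_size, rng))

def query_tree(tree, vecs, q):
    """Descend to one leaf, then scan only that small bucket exactly."""
    while isinstance(tree, tuple):
        normal, left, right = tree
        tree = right if q @ normal > 0 else left
    return tree[np.argmax(vecs[tree] @ q)]    # best candidate in the leaf

rng = np.random.default_rng(1)
vecs = rng.standard_normal((500, 20))
tree = build_tree(np.arange(500), vecs, rng=rng)
print(query_tree(tree, vecs, vecs[7]))        # usually 7: it lands in its own leaf
```

Each query inspects only one small bucket instead of all 500 vectors, which is where the speedup over a full scan comes from; the real library builds many such trees and merges their candidates to recover accuracy.
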

Either way, let me know how you progress. Would be great to finally have something efficient in gensim :)

@jodaiber

jodaiber commented Jun 9, 2015

This would be incredibly useful! Is there any update on approximate sim. search in gensim (i.e. is anyone working on it)?

@piskvorky
Owner Author

The only update is, @erikbern (author of Annoy) left Spotify... but he still works on Annoy, somehow :)

On the other hand, Annoy has shed its dependency on Boost + got several cleanups, fixes and improvements recently, so it's become much more viable as a 3rd party lib.

I think I'd prefer to keep the brute force exact kNN in gensim (for small problems, <1M items) and integrate cleanly with Annoy's approximate kNN for larger datasets.

@jodaiber or do you have other ideas?
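
Such a split could be a thin dispatch layer along these lines (purely illustrative; `most_similar`, `approx_backend` and `EXACT_LIMIT` are hypothetical names, not gensim or Annoy API):

```python
import numpy as np

EXACT_LIMIT = 1_000_000  # below this, a brute-force scan is typically fast enough

def most_similar(index, query, k=5, approx_backend=None):
    """Route small indexes to an exact scan, large ones to an ANN backend.

    `approx_backend`, if given, is any callable (query, k) -> list of ids,
    e.g. glue code wrapping an Annoy index.
    """
    if approx_backend is None or len(index) < EXACT_LIMIT:
        sims = index @ query                      # exact cosine kNN
        return [int(i) for i in np.argsort(-sims)[:k]]
    return approx_backend(query, k)               # approximate kNN

rng = np.random.default_rng(0)
index = rng.standard_normal((100, 8))
index /= np.linalg.norm(index, axis=1, keepdims=True)
print(most_similar(index, index[3], k=1))         # exact path: [3]
```
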

@erikbern

erikbern commented Jun 9, 2015

I'm all for integrating Annoy. Obv I'm biased though :).

I'm currently running some benchmarks that could be relevant: https://github.com/erikbern/ann-benchmarks

@tmylk
Contributor

tmylk commented Oct 18, 2016

@tmylk tmylk closed this as completed Oct 18, 2016
@erikbern

nice!

@piskvorky
Owner Author

@tmylk can we change the tutorial to use a more meaningful dataset?

How about the GoogleNews word2vec model (3,000,000 x 300 matrix)? Lots of people use that.

@tmylk
Contributor

tmylk commented Oct 19, 2016

I agree that it's a more illustrative example to show benefits of Annoy. It would look great in a blog post. For the tutorial we chose something that easily runs on a laptop.
