Approximate similarity search #51
Comments
More resources re. approx sim search:
Another resource: FLANN http://www.cs.ubc.ca/~mariusm/index.php/FLANN/FLANN, which has Python bindings. The authors don't mention scalability, but since it's recent software specialized for approximate k-NN in high-dimensional spaces, it ought to be as good as it gets.
Just curious: considering http://radimrehurek.com/2013/12/performance-shootout-of-nearest-neighbours-contestants/#survivors, do you feel there is a best library to integrate now, if this were to be done? (I'm thinking of tackling this issue.)
Hmm, I wonder if it makes sense to integrate some algo fully (as in, implement Annoy directly in Python/C/Cython). Not super difficult, but not trivial either. Another option is to rely on Annoy as a 3rd party lib, with no deep integration, and just make it easier to use the Annoy API from gensim (or vice versa). I remember Annoy was picky about the type of input it accepts and its API was a bit unintuitive, so working around that would be a plus. Also, the reliance on C++ and Boost can make Annoy hard to install for many users. Tackling this 4-year-old issue will be welcome :)
Yes, I see how Annoy is a bit hard to implement. I've also found https://github.com/ryanrhymes/panns, which could maybe be a better fit. I'm going to get Annoy installed and try to do some comparisons. Alternatively, the Google Correlate algorithm doesn't seem that complicated to implement, so that could be promising as well.
I don't think the Annoy algo is that hard to implement. It's pretty straightforward IIRC. I mean, Erik's C++ implementation is involved, because it's heavily optimized, goes for memory-mapping etc. But the algo itself is clean. Either way, let me know how you progress. Would be great to finally have something efficient in gensim :)
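For anyone following along, here is a rough pure-Python sketch of the kind of random-projection tree that the "clean" algorithm boils down to, as I understand it: pick two random items, split the set by the hyperplane equidistant from them, recurse, and build several such trees. This is an illustration only, not Erik's implementation. The names `build_tree`, `candidates` and `approx_most_similar`, the `leaf_size` default and the choice of 10 trees are all made up for the example, and none of the optimizations mentioned above (memory-mapping etc.) are attempted.

```python
import math
import random


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def _cosine(a, b):
    norm = math.sqrt(_dot(a, a)) * math.sqrt(_dot(b, b))
    return _dot(a, b) / norm if norm else 0.0


def build_tree(vectors, ids, rng, leaf_size=8):
    """Recursively split `ids` by a hyperplane equidistant from two random items.

    A leaf is a list of item ids; an inner node is a tuple
    (normal, offset, left_subtree, right_subtree).
    """
    if len(ids) <= leaf_size:
        return ids
    p, q = (vectors[i] for i in rng.sample(ids, 2))
    normal = [x - y for x, y in zip(p, q)]
    # The hyperplane passes through the midpoint of p and q.
    offset = _dot(normal, [(x + y) / 2.0 for x, y in zip(p, q)])
    left = [i for i in ids if _dot(normal, vectors[i]) < offset]
    right = [i for i in ids if _dot(normal, vectors[i]) >= offset]
    if not left or not right:
        return ids  # degenerate split (e.g. duplicate points): keep as a leaf
    return (normal, offset,
            build_tree(vectors, left, rng, leaf_size),
            build_tree(vectors, right, rng, leaf_size))


def candidates(tree, query):
    """Walk down one tree to the leaf on the query's side of each split."""
    node = tree
    while isinstance(node, tuple):
        normal, offset, left, right = node
        node = left if _dot(normal, query) < offset else right
    return node


def approx_most_similar(vectors, forest, query, topn=10):
    """Union the candidate leaves of all trees, then rank them exactly by cosine."""
    pool = set()
    for tree in forest:
        pool.update(candidates(tree, query))
    ranked = sorted(pool, key=lambda i: _cosine(vectors[i], query), reverse=True)
    return ranked[:topn]


# Toy data: 500 random 16-dimensional vectors, 10 trees (arbitrary choices).
rng = random.Random(0)
vectors = [[rng.gauss(0, 1) for _ in range(16)] for _ in range(500)]
forest = [build_tree(vectors, list(range(len(vectors))), rng) for _ in range(10)]
print(approx_most_similar(vectors, forest, vectors[3], topn=5))
```

More trees mean a larger candidate pool per query, so better recall at the cost of memory and query time. Only the items in the pool are scored exactly, which is where the speed-up over brute force comes from.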
This would be incredibly useful! Is there any update on approximate sim. search in gensim (i.e. is anyone working on it)?
The only update is, @erikbern (author of Annoy) left Spotify... but he still works on Annoy, somehow :) On the other hand, Annoy has shed its dependency on Boost and got several cleanups, fixes and improvements recently, so it's become much more viable as a 3rd party lib. I think I'd prefer to keep the brute force exact kNN in gensim (for small problems, <1M items) and integrate cleanly with Annoy's approximate kNN for larger datasets. @jodaiber, or do you have other ideas?
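A minimal sketch of that split, assuming cosine similarity and a pluggable approximate backend. Everything here is hypothetical and not gensim's or Annoy's actual API: `exact_most_similar`, `most_similar`, the `approx_search` callable and the `exact_limit` parameter are names invented for illustration, with the 1M default taken from the figure above.

```python
import heapq
import math


def exact_most_similar(vectors, query, topn=10):
    """Brute-force exact kNN: score every item, O(n * dim) work per query."""
    qnorm = math.sqrt(sum(x * x for x in query))

    def cosine(v):
        norm = math.sqrt(sum(x * x for x in v)) * qnorm
        return sum(x * y for x, y in zip(v, query)) / norm if norm else 0.0

    scored = ((cosine(v), i) for i, v in enumerate(vectors))
    # nlargest keeps only `topn` items in memory while scanning all scores.
    return [(i, s) for s, i in heapq.nlargest(topn, scored)]


def most_similar(vectors, query, topn=10, approx_search=None,
                 exact_limit=1_000_000):
    """Use the exact scan for small collections, the approximate backend otherwise.

    `approx_search` is a hypothetical callable (query, topn) -> results,
    e.g. a thin wrapper around a third-party index.
    """
    if approx_search is not None and len(vectors) >= exact_limit:
        return approx_search(query, topn)
    return exact_most_similar(vectors, query, topn)


# Tiny usage example with four 2-d vectors.
toy = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
print(most_similar(toy, [1.0, 0.1], topn=2))
```

Keeping the approximate backend behind a plain callable means the exact path has no extra dependency, which matters given the installation concerns raised earlier in the thread.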
I'm all for integrating Annoy. Obv I'm biased though :). I'm currently running some benchmarks that could be relevant: https://github.com/erikbern/ann-benchmarks
Annoy has been integrated in https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/annoytutorial.ipynb
nice!
@tmylk can we change the tutorial to use a more meaningful dataset? How about the GoogleNews word2vec model (3,000,000 x 300 matrix)? Lots of people use that.
I agree that it's a more illustrative example to show benefits of Annoy. It would look great in a blog post. For the tutorial we chose something that easily runs on a laptop.
Google published a whitepaper http://www.google.com/trends/correlate on approximate kNN queries. See how it could apply to gensim & semantic similarity searches.
Background: