Improving performance for function most_similar #527
Yes, that's a good point @anhldbk ! Extending the similarity API to allow arbitrary kNN libs (such as Annoy) is actually one of our "wishlist" tasks. The "init vs. query" tradeoff is fine; the main question here is how to design an API so that the choice of the underlying kNN implementation (MatrixSimilarity, Annoy, K-graph...) is up to the user and flexible. Btw the particular benchmark times will depend on the BLAS implementation you're using. With a fast BLAS library (such as Intel's MKL on Intel processors), the tradeoff between approximate kNN and exact kNN is not so clear cut, especially on smaller datasets. But we still want to give users the choice, so they can plug in whatever NN lib they prefer.
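To make the "fast BLAS" point concrete: exact kNN over normalized vectors boils down to a single BLAS-backed matrix-vector product plus a sort, which is roughly what gensim's most_similar does internally. A minimal sketch (the function name and toy data are illustrative, not gensim's actual API):

```python
import numpy as np

def exact_most_similar(vectors, query, topn=10):
    """Brute-force cosine kNN over an (n, dim) array of unit-normalized
    vectors: one BLAS-backed matrix-vector product, then a partial sort."""
    query = query / np.linalg.norm(query)
    sims = vectors.dot(query)          # cosine similarities in one dot product
    best = np.argsort(-sims)[:topn]    # indices of the topn most similar rows
    return [(int(i), float(sims[i])) for i in best]

# toy demo with three unit vectors
vecs = np.array([[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]])
print(exact_most_similar(vecs, np.array([1.0, 0.0]), topn=2))  # → [(0, 1.0), (2, 0.6)]
```

With MKL or OpenBLAS, the dot product line is heavily vectorized, which is why exact search stays competitive on small corpora.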
@piskvorky So is it a good idea to integrate such libs into gensim, letting end users choose for themselves?
Yep. Sorry if I wasn't clear. Desired steps:
Not all steps must necessarily be implemented at once. But doing 3) without 1) seems suboptimal; I wouldn't like that. Without a common API, we'd have to repeat the same set of adjustments you made in this PR for every single algorithm, one by one, manually, introducing inconsistencies and code duplication. Apart from …
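The "common API first, backends second" argument above could be sketched as a small base class that each kNN backend implements. All class and method names here are hypothetical, purely to illustrate the shape of such an abstraction, not gensim's real interface:

```python
import numpy as np

class SimilarityIndex:
    """Hypothetical common interface for pluggable kNN backends."""
    def most_similar(self, query, topn=10):
        raise NotImplementedError

class BruteForceIndex(SimilarityIndex):
    """Exact backend: no index-build cost, full matrix product per query."""
    def __init__(self, vectors):
        norms = np.linalg.norm(vectors, axis=1, keepdims=True)
        self.vectors = vectors / norms            # normalize once up front
    def most_similar(self, query, topn=10):
        sims = self.vectors.dot(query / np.linalg.norm(query))
        return [int(i) for i in np.argsort(-sims)[:topn]]

# An Annoy- or rpforest-backed class would subclass SimilarityIndex the
# same way, paying an index-build cost in __init__ for faster queries.
index = BruteForceIndex(np.array([[1.0, 0.0], [0.0, 1.0]]))
print(index.most_similar(np.array([0.9, 0.1]), topn=1))  # → [0]
```

Each algorithm then targets the one interface, instead of being patched for each backend separately.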
Sounds like a great idea. I've been using https://github.com/lyst/rpforest to achieve the same thing, so an abstraction layer would be much preferable to a hardcoded one.
@piskvorky The 3 steps above are required to integrate such libs properly. I think I'll have my implementation ready soon.
Should users really want to drop in a different kNN? It seems to me that for this use case, some implementation is going to be simply better on the speed/accuracy trade-off. So is it really necessary to abstract this part?
Yes, we definitely want to abstract the kNN behind an API.
+1
The current implementation of most_similar is NOT efficient. I've found that Annoy can be used to get much better performance.
Here is my code:
I used a corpus of 5000 documents and recorded the total execution time of most_similar vs. dev_most_similar (with different tree sizes):
We trade a longer initialization time for faster queries. But it's worth it, right?