Use Eigen #135
glove euclidean 100 goes from 2.0s (before) to 2.2s (after), so a 10% slowdown. With search_k bumped to 10000 it goes from 15s to 20s, so a 25% slowdown |
@erikbern could you let me know how you ran the benchmark? I will try it |
I ran … The first time it will build an index etc., so it will be very slow. Subsequent runs will be much faster. I wonder if either (a) GCC is already using SIMD instructions or (b) Eigen didn't use SIMD instructions |
I just noticed this: http://eigen.tuxfamily.org/index.php?title=FAQ#How_can_I_enable_vectorization.3F Will try to re-run with different flags to see if it helps |
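As far as I understand the linked FAQ, Eigen only vectorizes when the compiler has been told which instruction sets it may use. A minimal sketch for checking what a given set of flags enables, using only the compiler's predefined macros (the function name `enabled_simd_sets` is made up for illustration and is not part of Eigen or Annoy):

```cpp
#include <string>

// Hypothetical helper: lists the SIMD instruction sets the compiler was
// allowed to use for this translation unit, based on predefined macros that
// GCC and Clang set in response to flags such as -msse2, -mavx or -march=native.
std::string enabled_simd_sets() {
    std::string sets;
#ifdef __SSE2__
    sets += "SSE2 ";
#endif
#ifdef __SSE4_2__
    sets += "SSE4.2 ";
#endif
#ifdef __AVX__
    sets += "AVX ";
#endif
#ifdef __AVX2__
    sets += "AVX2 ";
#endif
#ifdef __ARM_NEON
    sets += "NEON ";
#endif
    if (sets.empty()) sets = "none";
    return sets;
}
```

Printing the result from a build that uses the same flags as the benchmark would show whether, say, AVX was available to both the plain loops and Eigen. It does not prove that any particular loop was actually vectorized.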
The other thing I'm starting to think is that time might not be dominated by CPU but by RAM access time. Annoy does a lot of random access when searching the tree. Not sure how to benchmark this |
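One rough way to get a feel for this, as a sketch rather than a proper benchmark: read the same set of vectors once in sequential order and once in shuffled order, and compare wall-clock times. All names below (`AccessTiming`, `touch_vectors`, `sequential_order`, `shuffled_order`) are hypothetical, and the access pattern is only a stand-in for what Annoy's tree search really does.

```cpp
#include <algorithm>
#include <chrono>
#include <cstddef>
#include <numeric>
#include <random>
#include <vector>

// Result of one pass over the data: a checksum (so the compiler cannot drop
// the reads) and the elapsed wall-clock time in seconds.
struct AccessTiming {
    double checksum;
    double seconds;
};

// Reads every component of each dim-wide vector, visiting vectors in `order`.
AccessTiming touch_vectors(const std::vector<float>& data, std::size_t dim,
                           const std::vector<std::size_t>& order) {
    const auto start = std::chrono::steady_clock::now();
    double checksum = 0.0;
    for (std::size_t idx : order) {
        const float* v = data.data() + idx * dim;
        for (std::size_t j = 0; j < dim; ++j) checksum += v[j];
    }
    const auto stop = std::chrono::steady_clock::now();
    return {checksum, std::chrono::duration<double>(stop - start).count()};
}

// Visit order 0, 1, 2, ..., n-1.
std::vector<std::size_t> sequential_order(std::size_t n) {
    std::vector<std::size_t> order(n);
    std::iota(order.begin(), order.end(), std::size_t{0});
    return order;
}

// The same indices in a pseudo-random order (fixed seed for repeatability).
std::vector<std::size_t> shuffled_order(std::size_t n, unsigned seed) {
    std::vector<std::size_t> order = sequential_order(n);
    std::mt19937 rng(seed);
    std::shuffle(order.begin(), order.end(), rng);
    return order;
}
```

For the comparison to say anything about RAM latency, the data presumably has to be much larger than the last-level cache (millions of 100d vectors rather than thousands), and each variant should be run several times. If the shuffled pass is much slower than the sequential one while doing identical arithmetic, that points at memory access rather than CPU.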
I was expecting a 20~50% performance gain from Eigen (I haven't looked at the compiler options yet). I'll take a look at these in a bit more detail. |
thanks a lot for looking into this! |
Annoy relies on random memory accesses. I bet each access costs several dozen, if not hundreds, of CPU cycles on top of the actual computation. From my experience with optimizing memory layouts, for 100d data this typically means that L2s are twice as slow as when the data is accessed sequentially. So, making L2s 10% faster won't make a big difference. However, if you make your L2s (or cosines) 4 times slower, you will probably see the difference. This would happen if the compiler fails to vectorize the loops. Modern GCC compilers are good at vectorizing L2s and the cosine. However, some older versions and Clang aren't. |
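For reference, this is the kind of loop in question: a plain Euclidean distance written so that a compiler has a fair chance of vectorizing it. It is an illustrative sketch, not Annoy's actual implementation, and `euclidean_distance` is a name chosen here.

```cpp
#include <cmath>
#include <cstddef>

// Straightforward Euclidean (L2) distance between two n-dimensional vectors.
// The loop body has no branches and a single accumulator, which is the shape
// auto-vectorizers handle best.
float euclidean_distance(const float* x, const float* y, std::size_t n) {
    float sum = 0.0f;
    for (std::size_t i = 0; i < n; ++i) {
        const float d = x[i] - y[i];
        sum += d * d;
    }
    return std::sqrt(sum);
}
```

With GCC, compiling with `-O3 -fopt-info-vec` should report which loops were vectorized. Because this is a floating-point reduction, the compiler may only vectorize it when it is allowed to reorder the additions (for example with `-ffast-math`), so it is worth checking the report rather than assuming.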
I was confused by your message for a bit but I assume with L2 you mean vector norms not L2 memory? :) By now, I'm pretty sure GCC vectorizes the vector norms so I think if anything I should focus on memory access time (L2/L3 latency). |
Sorry, I meant the Euclidean distance. Yep, optimizing memory access would be important. |