
Use Eigen #135

Closed · wants to merge 1 commit into from

Conversation

@erikbern (Collaborator) commented Feb 2, 2016

No description provided.

@erikbern (Collaborator, Author) commented Feb 2, 2016

glove euclidean 100 goes from 2.0 s (before) to 2.2 s (after), a 10% slowdown.

With search_k bumped to 10000 it goes from 15 s to 20 s, roughly a 33% slowdown.

@ummae commented Feb 10, 2016

@erikbern would you let me know how you did the benchmark? I'll try it as well.

@erikbern (Collaborator, Author)
I ran `nosetests test/accuracy_test.py:AccuracyTest.test_euclidean_100` multiple times.

The first run builds an index and so is very slow; subsequent runs are much faster.

I wonder whether (a) GCC was already using SIMD instructions, or (b) Eigen didn't use SIMD instructions.

@erikbern (Collaborator, Author)

I just noticed this: http://eigen.tuxfamily.org/index.php?title=FAQ#How_can_I_enable_vectorization.3F

Will try to re-run with different flags to see if it helps

@erikbern (Collaborator, Author)

The other thing I'm starting to think is that the time might be dominated not by CPU but by RAM access time. Annoy does a lot of random access when searching the tree. Not sure how to benchmark this.

@ummae commented Feb 14, 2016

| branch | AccuracyTest.test_euclidean_100 (s) |
| --- | --- |
| master | 1.64 (±0.18) |
| erikbern/eigen | 1.67 (±0.07) |

(± is one stdev; times in seconds)

I expected a 20–50% performance gain from Eigen (I haven't looked at the compiler options yet). So perhaps yes, CPU operations might not be the main bottleneck.

I'll take a look at this in a bit more detail.

Environment:
  - Intel(R) Xeon(R) CPU E5-2620 v3 @ 2.40GHz
  - gcc 4.8.3

@erikbern (Collaborator, Author)

thanks a lot for looking into this!

@searchivarius
Annoy relies on random memory accesses. I bet each one costs several dozen if not hundreds of CPU cycles on top of the actual computation. From my experience with optimizing memory layouts, for 100d data this typically means that L2 (Euclidean) distance computations are twice as slow compared to when the data is accessed sequentially. So, making the L2 computation 10% faster won't make a big difference.

However, if you make your L2 (or cosine) computations 4 times slower, you will probably see a difference. This would happen if the compiler fails to vectorize the loops. Modern GCC versions are good at vectorizing L2 and cosine distances; some older versions, and Clang, aren't.

@erikbern (Collaborator, Author)

I was confused by your message for a bit, but I assume by L2 you mean the distance computation, not the L2 memory cache? :)

By now, I'm pretty sure GCC vectorizes the distance loops, so I think if anything I should focus on memory access time (L2/L3 latency).

@searchivarius
Sorry, I mean the Euclidean distance. Yep, optimizing memory access would be important.

@erikbern erikbern closed this Mar 9, 2017