
Use Eigen #135

Closed · wants to merge 1 commit into from

Conversation

@erikbern (Collaborator) commented Feb 2, 2016

No description provided.

@erikbern (Collaborator, Author) commented Feb 2, 2016

glove euclidean 100 goes from 2.0 s (before) to 2.2 s (after), a 10% slowdown.

With search_k bumped to 10000 it goes from 15 s to 20 s, roughly a 33% slowdown.

@ummae commented Feb 10, 2016

@erikbern would you let me know how you did the benchmark? I'll try it as well.

@erikbern (Collaborator, Author)
I ran `nosetests test/accuracy_test.py:AccuracyTest.test_euclidean_100` multiple times.

The first run builds an index and so is very slow; subsequent runs are much faster.

I wonder whether (a) GCC was already using SIMD instructions, or (b) Eigen didn't use SIMD instructions.

@erikbern (Collaborator, Author)

I just noticed this: http://eigen.tuxfamily.org/index.php?title=FAQ#How_can_I_enable_vectorization.3F

Will try to re-run with different flags to see if it helps

@erikbern (Collaborator, Author)

The other thing I'm starting to think is that the time might be dominated not by CPU but by RAM access time. Annoy does a lot of random access when searching the tree. Not sure how to benchmark this.

@ummae commented Feb 14, 2016

| branch | AccuracyTest.test_euclidean_100 (s) |
| --- | --- |
| master | 1.64 (±0.18) |
| erikbern/eigen | 1.67 (±0.07) |

(± is one stdev; times in seconds)

I expected a 20–50% performance gain from Eigen (I haven't looked at the compiler options yet). So perhaps yes, CPU operations might not be the main bottleneck.

I'll take a look at this in a bit more detail.

Environment:
  - Intel(R) Xeon(R) CPU E5-2620 v3 @ 2.40GHz
  - gcc 4.8.3

@erikbern (Collaborator, Author)

thanks a lot for looking into this!

@searchivarius
Annoy relies on random memory accesses. I bet each one costs several dozen if not hundreds of CPU cycles on top of the actual computation. From my experience with optimizing memory layouts, for 100d data this typically means that L2 (Euclidean) distance computations are twice as slow compared to when the data is accessed sequentially. So, making the L2 computation 10% faster won't make a big difference.

However, if you make your L2 (or cosine) computations 4 times slower, you will probably see a difference. This would happen if the compiler fails to vectorize the loops. Modern GCC versions are good at vectorizing L2 and cosine distances; some older versions, and Clang, aren't.

@erikbern (Collaborator, Author)

I was confused by your message for a bit, but I assume by L2 you mean the distance computation, not the L2 memory cache? :)

By now, I'm pretty sure GCC vectorizes the distance loops, so I think if anything I should focus on memory access time (L2/L3 latency).

@searchivarius
Sorry, I mean the Euclidean distance. Yep, optimizing memory access would be important.

@erikbern erikbern closed this Mar 9, 2017