
Are there any plans to employ SIMD? #128

Closed
cysin opened this issue Jan 20, 2016 · 15 comments

@cysin

cysin commented Jan 20, 2016

Looks like the current implementation doesn't use SIMD intrinsics, so is there a plan to do so? I guess this would improve speed significantly.

@erikbern
Collaborator

no plans, but would love it if you want to show me how to do it (or even better – send a pull request!)

@ummae

ummae commented Jan 20, 2016

@erikbern
http://eigen.tuxfamily.org
Eigen is a promising linear algebra library. It has no dependencies and is well tuned (SIMD, AVX, LAPACK, ...)

@erikbern
Collaborator

Doesn't seem like Eigen has built-in functions for Euclidean distance and cosine distance though?

I guess I could compute three dot products using Eigen at
https://github.com/spotify/annoy/blob/master/src/annoylib.h#L119 and
https://github.com/spotify/annoy/blob/master/src/annoylib.h#L180

It seems like a bit of overkill to use Eigen only for dot products though?
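For concreteness, the three-dot-product idea might look like the sketch below in plain C++ (this is not Annoy's actual code and the names are illustrative; `dot` stands in for whatever dot-product routine Eigen or hand-written code would provide):

```cpp
#include <cmath>

// Illustrative dot product over f-dimensional float vectors.
float dot(const float* a, const float* b, int f) {
    float s = 0.0f;
    for (int i = 0; i < f; ++i)
        s += a[i] * b[i];
    return s;
}

// Cosine distance from the three dot products a·a, b·b, and a·b:
// 2 - 2*cos(a, b), i.e. the squared Euclidean distance between the
// unit-normalized copies of a and b.
float cosine_distance(const float* a, const float* b, int f) {
    return 2.0f - 2.0f * dot(a, b, f) / std::sqrt(dot(a, a, f) * dot(b, b, f));
}
```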

@ummae

ummae commented Jan 20, 2016

Yes, we can get cosine similarity using Eigen's dot products and math library; it would be much faster.

I think the container for the feature vector could be changed to Eigen's DenseVector or something similar:
https://github.com/spotify/annoy/blob/master/src/annoylib.h#L111
but I agree it seems a little bit over-spec ;)

@cysin
Author

cysin commented Jan 25, 2016

Eigen or other SIMD libraries might be suitable for Spotify, but I have never used any of them; I only code with specific intrinsics. Here is a code snippet that calculates squared L2 distance with AVX2; hope this helps.

#include <immintrin.h>

// note that the input vectors are quantized (uint8) and have a fixed
// size of 128 bytes; returns the squared L2 distance
int l2_norm_avx(const unsigned char* f1, const unsigned char* f2) {
    __m256i t = _mm256_setzero_si256();
    int i;
    for (i = 0; i < 128; i += 32) {
        __m256i a = _mm256_loadu_si256((const __m256i*)(f1 + i));
        __m256i b = _mm256_loadu_si256((const __m256i*)(f2 + i));
        // widen the low bytes of each 128-bit lane from u8 to i16,
        // subtract, then square and pairwise-add into 32-bit sums
        __m256i sublo = _mm256_sub_epi16(
            _mm256_unpacklo_epi8(a, _mm256_setzero_si256()),
            _mm256_unpacklo_epi8(b, _mm256_setzero_si256())
        );
        t = _mm256_add_epi32(
            t,
            _mm256_madd_epi16(sublo, sublo)
        );
        // same for the high bytes of each lane
        __m256i subhi = _mm256_sub_epi16(
            _mm256_unpackhi_epi8(a, _mm256_setzero_si256()),
            _mm256_unpackhi_epi8(b, _mm256_setzero_si256())
        );
        t = _mm256_add_epi32(
            t,
            _mm256_madd_epi16(subhi, subhi)
        );
    }
    // horizontal sum of the eight 32-bit partial sums
    int temp[8];
    _mm256_storeu_si256((__m256i*)temp, t);

    return temp[0] + temp[1] + temp[2] + temp[3] + temp[4] + temp[5] + temp[6] + temp[7];
}
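A plain scalar equivalent (not from the thread, added for reference) is handy as a correctness check for the intrinsics version above, since both should return the same squared L2 distance for the same 128-byte quantized vectors:

```cpp
// Scalar reference: squared L2 distance between two 128-byte
// quantized (uint8) vectors, matching the AVX2 routine above.
int l2_norm_scalar(const unsigned char* f1, const unsigned char* f2) {
    int t = 0;
    for (int i = 0; i < 128; ++i) {
        int d = (int)f1[i] - (int)f2[i];
        t += d * d;
    }
    return t;
}
```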

@erikbern
Collaborator

cool – seems very hard to read though.

i'm using -march=native for gcc, which should enable it in the compiler, right? i'm not sure if i need to tell the compiler to align things though

@cysin
Author

cysin commented Jan 26, 2016

The compiler can auto-vectorize, but it is hard to tell whether it does the job right. You can try writing code like:

for(int i = 0; i < LEN; i += 4) {
    data[i] = ...
    data[i + 1] = ...
    data[i + 2] = ...
    data[i + 3] = ...
}

In my mind this might help the compiler vectorize the loop (processing 4 or 8 items per iteration), but I am not sure. To check whether it worked, this might help: http://superuser.com/questions/726395/how-to-check-if-a-binary-requires-sse4-or-avx-on-linux
Still, a SIMD library would be a better choice :)
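A compilable version of that unrolling idea might look like the sketch below (the function, LEN, and data are made up for illustration; LEN is assumed divisible by 4). The four updates per iteration are independent, which can make it easier for the compiler to emit SIMD code:

```cpp
const int LEN = 128;  // illustrative; assumed divisible by 4

// Scale every element of data by k, manually unrolled by 4.
void scale(float* data, float k) {
    for (int i = 0; i < LEN; i += 4) {
        data[i]     *= k;
        data[i + 1] *= k;
        data[i + 2] *= k;
        data[i + 3] *= k;
    }
}
```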

@erikbern
Collaborator

@ummae

ummae commented Jan 31, 2016

With Eigen, just passing "-lgomp -fopenmp -march=native -fPIC" will be enough for most cases. Actually, it's not that easy to optimize SIMD code and make it work across various CPU architectures and gcc versions (Eigen does this).

You don't need to dig into SIMD instructions; Eigen is very easy to use, and at the same time it's very fast.

If you want to minimize refactoring cost, consider Map. It can map raw array data into an Eigen container (Matrix, Array, Vector, ...), so you can plug Eigen code in where you want without changing the whole structure.

float raw_array[128];
Map<Matrix<float,16,8> > mm(raw_array);

@erikbern
Collaborator

Thanks a lot – i'll give it a shot!

What happens if the vector is not aligned?

@ummae

ummae commented Jan 31, 2016

I'm not sure, but it probably will not work. By the way, are there unaligned vectors in annoy? If so, you will run into problems.

This is the documentation for alignment issues

It seems you need to replace the allocator for aligned vectors... and there is a way to turn off the alignment check, though in that case you don't get the advantage of vectorization, which is a big deal for performance.

Anyway, I looked around annoylib.h a little bit, and now I'm sure Eigen would be helpful. Many functions are written as for-loops, which means there is plenty of room for improvement through vectorization 😄 For example, the distance function (https://github.com/spotify/annoy/blob/master/src/annoylib.h#L114) could be rewritten in one or two lines of Eigen code, like the comment you wrote: a·a/|a|² + b·b/|b|² - 2·a·b/(|a||b|)

@piskvorky
Contributor

Also somewhat related: this saga of improving the Annoy baseline benchmark using BLAS (vectorization + memory caches + parallelization).

Such optimizations had a greater effect there than here (since brute force has to do more low-level work, memory access patterns matter more), but it will be exciting to see this stuff in Annoy! Erik is pushing Annoy to ever more greatness :)

@erikbern
Collaborator

erikbern commented Feb 1, 2016

I also wonder if just changing the definition of the Node class so that it's aligned to 128 bits (16 bytes) would be enough.

http://locklessinc.com/articles/vectorize/

The definition of Node is currently pretty dumb – it's a 12-byte header and then the float* array
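One way that might look, sketched with C++11 alignas (this Node is a simplified stand-in for illustration, not Annoy's actual definition): keeping the 12-byte header but forcing the vector data to a 16-byte boundary costs 4 bytes of padding per node in exchange for aligned 128-bit loads.

```cpp
#include <cstddef>

// Hypothetical sketch, not the real annoylib.h Node: a 12-byte header
// followed by the vector. alignas(16) on the array pushes it to a
// 16-byte (128-bit) boundary so aligned SSE loads of v are possible.
struct Node {
    int n_descendants;
    int children[2];         // 12-byte header, as in the current layout
    alignas(16) float v[1];  // vector data now starts at offset 16
};

static_assert(offsetof(Node, v) == 16, "vector data is 16-byte aligned");
static_assert(alignof(Node) == 16, "nodes must sit at 16-byte boundaries");
```

Note this only helps if each node's base address in the mmapped file is itself a multiple of 16, i.e. the node size must also be padded to a multiple of 16.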

@erikbern
Collaborator

erikbern commented Feb 2, 2016

See #135

I ran some basic tests on my laptop and it actually seemed marginally slower (2.2s instead of 2.0s). That was only with f=100, though.

@erikbern
Collaborator

closing this as i'm pretty sure that gcc enables SIMD by default
