Adding benchmarks for some of the Non-Metric Space Library Methods #6
SW-graph is good! Memory bandwidth: I also think it's the bottleneck. I don't suggest using byte values for SIFT either. As long as everyone has the same disadvantage, the benchmark is still meaningful. I didn't observe much improvement from my manual SSE vectorization over GCC-generated code when I was comparing against FLANN.
Hi Wei, thank you! Regarding the difference in performance: GCC auto-vectorizes in simple cases and uses AVX instructions. So, it results in slightly faster code when you read from L1/L2. However, when I benchmark sequential searching, there is no difference on my PC between the faster and the slower distance function. So, I suspect that sequential search is memory bound. It also seems to me that a single core cannot use the full memory bandwidth.
PS: However, Clang doesn't auto-vectorize yet. So, there is a difference when you use a custom SSE implementation (I triple-checked this). The Intel compiler vectorizes as well. However, it isn't freely available any more, so I am not sure it's worth supporting.
make image ann-benchmarks not ann-benchmarks-base in install.py
Hi, please, consider the following pull request:
Should you decide to benchmark computationally intensive distances, we can add a couple of other methods.
The recent changes are in the ann-benchmark. We are going to propagate them to the master soon (and make a mini-release).
L2/Cosine implementations use SSE2, but not AVX (which is slightly faster).
One reason why bruteforce performance may suck is that Python doesn't store vectors contiguously. Accessing these vectors incurs a lot of cache misses. One cache miss is roughly 500 CPU cycles, or 4 computations of L2 distances. For L2, I suspect, memory bandwidth is becoming a bottleneck.
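To illustrate the contiguity point, here is a minimal sketch (not the benchmark's actual code) that packs separately allocated vectors into one C-contiguous NumPy array before a brute-force L2 scan. The function name `brute_force_l2`, the data sizes, and the random data are assumptions made for illustration only.

```python
import numpy as np

def brute_force_l2(data, query, k):
    """Return indices of the k rows of `data` closest to `query` under L2."""
    # Squared L2 distance is enough: the square root does not change the ranking.
    diff = data - query
    dist = np.einsum("ij,ij->i", diff, diff)
    # Select the k smallest distances without a full sort, then order those k.
    idx = np.argpartition(dist, k - 1)[:k]
    return idx[np.argsort(dist[idx])]

rng = np.random.default_rng(0)

# A Python list of separately allocated vectors: each one may live anywhere in memory.
vectors = [rng.random(128, dtype=np.float32) for _ in range(1000)]

# Pack them into a single contiguous float32 block so the scan reads memory sequentially.
data = np.ascontiguousarray(np.vstack(vectors))

query = data[42].copy()
nearest = brute_force_l2(data, query, k=5)
print(data.flags["C_CONTIGUOUS"], nearest[0])
```

Whether this removes the bottleneck depends on the machine; with contiguous storage the scan may simply become memory-bandwidth bound instead of cache-miss bound.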
For SIFT signatures, you can store vectors as byte vectors and use Wei Dong's efficient SIMD-based implementation of L2. Apparently, this can boost performance, at least in multithreaded mode (due to bandwidth savings, or maybe because it also uses fewer CPU cycles). However, this is not a generic solution. In fact, I recently learned that RootSIFT performs better than raw SIFT, and apparently you can't use the byte-storage trick with RootSIFT.
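The byte-storage trade-off can be sketched in NumPy as follows. This is not Wei Dong's SIMD code, only an illustration of the memory savings and of why RootSIFT no longer fits in bytes. The helper names `l2_sq_uint8` and `root_sift` are hypothetical, the data is random rather than real SIFT, and the RootSIFT definition assumed here is the usual one (L1-normalize, then take the element-wise square root).

```python
import numpy as np

def l2_sq_uint8(a, b):
    """Squared L2 distance between two uint8 vectors."""
    # Widen to int32 first: subtracting uint8 values directly would wrap around.
    d = a.astype(np.int32) - b.astype(np.int32)
    return int(np.dot(d, d))

def root_sift(desc):
    """Assumed RootSIFT transform: L1-normalize, then element-wise square root."""
    desc = desc.astype(np.float32)
    return np.sqrt(desc / max(desc.sum(), 1e-12))

rng = np.random.default_rng(0)

# Random stand-in for SIFT descriptors: 128 components in the range 0..255.
sift = rng.integers(0, 256, size=(1000, 128), dtype=np.uint8)
as_float = sift.astype(np.float32)

# Byte storage needs a quarter of the memory of float32 storage.
print(sift.nbytes, as_float.nbytes)

# RootSIFT components are fractions in [0, 1], so they cannot be stored as raw bytes
# without an extra quantization step.
rs = root_sift(sift[0])
print(rs.dtype, float(rs.max()) <= 1.0)
```

The fourfold reduction in bytes read per vector is where the bandwidth savings mentioned above would come from.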
I made a test run on c4.4xlarge for some methods (results are below). However, I didn't re-run FLANN and only imported Eric's results (FLANN takes a really long time to build an index):
[Plot: GloVe/angular]
[Plot: SIFT/l2]