use BLAS in brute force (via NumPy) #5
Conversation
@@ -0,0 +1,3 @@
sudo apt-get install -y python-pip python-dev
sudo apt-get install -y libatlas-dev libatlas3gf-base
sudo apt-get install -y python-numpy
@erikbern I'm not terribly sure how the Debian-packaged NumPy plays with BLAS... can you check that ATLAS is being picked up by NumPy (= dot calls are fast)?
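For reference, a quick check might look like this (a minimal sketch; the array shapes are arbitrary choices of mine):

```python
import time
import numpy as np

# Show which BLAS/LAPACK libraries NumPy was built against;
# look for ATLAS/OpenBLAS/MKL entries in the output.
np.show_config()

# Rough timing of one matrix-vector product: with a decent BLAS
# this should take milliseconds, not seconds.
data = np.random.rand(100000, 100).astype(np.float32)
query = np.random.rand(100).astype(np.float32)

start = time.time()
scores = data.dot(query)
print("np.dot over 100k x 100 took %.4fs" % (time.time() - start))
```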
it seems like the PyPI version of numpy gets pulled in from some other package, so it might not be needed anyway
Alright... can you check the timings for np.dot anyway, just to be sure? What version of numpy is that?
LGTM, except it needs support for Euclidean distance as well. Btw I wonder if Annoy could leverage BLAS for fast dot products etc... probably means I have to rewrite some of it in matrix form.
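For what it's worth, Euclidean distance can be routed through the same BLAS call by expanding the squared distance, ||x - q||^2 = ||x||^2 - 2 x.q + ||q||^2, so ranking still costs one matrix-vector product. A sketch (the function name is mine, not from the PR):

```python
import numpy as np

def euclidean_knn(data, query, k):
    # ||q||^2 is constant across all candidates, so it can be
    # dropped when only the ranking matters.
    sq_norms = (data ** 2).sum(axis=1)        # precomputable once per dataset
    dists = sq_norms - 2.0 * data.dot(query)  # data.dot(query) is the BLAS call
    return np.argsort(dists)[:k]
```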
Hmm, I could have sworn I read "tests are cossim only" in the README Principles, but I don't see it there now. Will try to add Euclidean when I have time again, but can't promise any ETA :( Btw, as a sanity check, what is the time for ...?
I changed the benchmarks a bit so that both cosine and Euclidean are represented
I ran some of the tests on our machines (~dedicated server, not AWS, using ATLAS). The PR seems to work, no typos :) These look a bit different from your results, not sure why. It's possible I messed something up; it's not clear to me what the launch process is (I just deleted all files under ...). Everything except FLANN, ANNOY and KGraph seems to perform worse than brute force on the GloVe dataset.
Btw the chosen colour palette of the plot is a benchmark on its own (for my eyesight) :)
What surprises me is that LSHF does so badly. Does the author know about this benchmark? It's such a recent addition to sklearn.
@maheshakya pinging you in case you'd like to do some parameter tweaking on this benchmark?
Thanks for running! I think it's possible I'm doing something weird with LSHF but need to look into it. On the benchmark discrepancy: it might be caused by different compiler settings. On vacation for another week so this might take a while. Btw do you think it makes sense to look at bigger data sets?
Have you tried using hierarchical kmeans / svd (using the eigenvectors as splitting planes)?
OK, I've added the Euclidean metric as well. SIFT results: (I only ran Annoy and brute force because they're fast to build; there were no changes to the other algos, so their relative performance should stay the same.) I also noticed that the bottleneck is now NOT the distance computations, but rather the sorting for "k nearest neighbours" at the very end :) So I optimized that too (assumes NumPy >= 1.8). GloVe results: The brute force algo is now practically on par with its implementation in gensim, at least for the single-query-vector version. So it should be a worthy baseline for all the ANNs :)
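The NumPy >= 1.8 requirement suggests the top-k step was switched from a full sort to a partial selection, presumably np.argpartition (new in 1.8); a sketch of that idea:

```python
import numpy as np

def top_k(dists, k):
    # O(n) selection of the k smallest distances (in arbitrary order)...
    candidates = np.argpartition(dists, k)[:k]
    # ...then an O(k log k) sort of just those k candidates.
    return candidates[np.argsort(dists[candidates])]
```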
@piskvorky does BLAS create its own copy of the data?
@searchivarius I think I actually asked a question about this, many years ago, but there was no good answer. In any case, this is mostly relevant for BLAS level 3 calls (GEMM = matrix * matrix multiplications etc). This benchmark uses only level 2 (matrix * vector), and I'm pretty sure recent NumPy versions always do the sane thing, no matter the underlying BLAS implementation.
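In NumPy terms the level-2 vs. level-3 distinction is just the shape of the right-hand operand (generic NumPy for illustration, not code from this PR):

```python
import numpy as np

data = np.random.rand(10000, 100)

query = np.random.rand(100)        # a single query vector
scores = data.dot(query)           # matrix * vector -> BLAS level 2 (GEMV)

queries = np.random.rand(50, 100)  # a batch of queries
score_mat = data.dot(queries.T)    # matrix * matrix -> BLAS level 3 (GEMM)
```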
I haven't even bothered about doubles. Is there any chance that BLAS uses multiple threads? And what is the server that you are using?
@searchivarius server -- same one as used in my previous benchmarks. And yes, most BLAS implementations use threads internally (where they deem fit).
@piskvorky this makes a lot of sense. If you have 8-16 cores, this can easily compensate for the random memory layout. However, it doesn't make for a fair comparison.
@searchivarius The machine has 4 cores. And all contestants run on the same machine -- isn't that the point of the benchmark?
I don't think so:
Well, some algorithms might not scale well with the number of threads (though that's not very likely). However, it is always possible to test this by running a multi-threaded benchmark which nevertheless executes a single-threaded implementation.
Oops, good catch @searchivarius! Must be one of the newer Principles I missed; I don't remember reading that earlier. Anyway, dumbing down BLAS would be too hard to do in general, so I'm not going there. I'll leave this PR as is -- up to Erik. It's possible the speed-up is only 10x then :-)
Does annoy use floats or doubles for this benchmark?
This seems to have died; I still think fast pure-Python similarities (BLAS via NumPy) are a worthy baseline: simple, easy to deploy and maintain, fast. I won't have time for this, but if @aaalgo wants to add the OpenBLAS threading restrictions, that would be cool! (I didn't realize the contenders must be single-threaded, sorry.)
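For reference (my addition, not from the thread): most BLAS implementations read a thread-count environment variable at load time, so the usual restriction is to set it before NumPy is imported. ATLAS is the exception; its thread count is fixed at compile time, so there only a single-threaded build helps.

```python
import os

# Must be set before NumPy (and the BLAS it links against) is loaded.
os.environ["OPENBLAS_NUM_THREADS"] = "1"  # OpenBLAS
os.environ["MKL_NUM_THREADS"] = "1"       # Intel MKL
os.environ["OMP_NUM_THREADS"] = "1"       # OpenMP-threaded builds

import numpy as np  # all np.dot calls below now run single-threaded
```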
I'm also happy to change all benchmarks to be multi-threaded (in that case ...)
I think that makes good sense. I love that these ANN benchmarks are practical -- practical datasets, practical lib installs, practical implementations. A practical, reproducible HW setup fits the theme nicely IMO.
I don't write the fastest Python code, but I'll add a parallel BLAS mode in my KGraph API just for the purpose of this benchmark. Multi-threading can be enabled in KGraph by passing the "threads=8" parameter in the Python search API. It should be a little bit faster than an external thread pool.
I have updated my KGraph repository. After rebuilding the source, BLAS mode can be enabled by passing "blas=True", as in "index.search(dataset, query, K=K, blas=True)". There is no need to call index.build if only brute force with BLAS is to be used. The speedup will only start to show when dimension > 100.
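Based purely on the calls quoted above (I haven't verified the pykgraph API beyond this thread), usage might look like:

```python
import numpy as np
import pykgraph

dataset = np.random.rand(10000, 128).astype(np.float32)
queries = np.random.rand(100, 128).astype(np.float32)

index = pykgraph.KGraph()
# Per the comment above, index.build() is not needed when only
# brute force with BLAS is used; blas=True enables the BLAS path.
result = index.search(dataset, queries, K=10, blas=True)
```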
i merged this as a separate algo that is used to compute the correct results. seems to be like 100x faster than before :)
It could be, b/c Python brute-force is 10x slower than a single-thread brute-force.
Hooray! \o/ Thanks @erikbern. Feel free to add a note that people can bug me regarding this code -- I'll be happy to maintain it, in case of any bugs / questions / extensions.
@erikbern have the graphs on the main README page been regenerated? The numbers seem dodgy (brute force slower than FLANN at 100% accuracy, and almost on par with KD), which doesn't match my results on GloVe above. What BLAS was this using?
brute force doesn't use BLAS in the benchmarks
Do you mean you're letting NumPy automatically link against whatever BLAS is already installed in your system, or you're specifically disabling external BLAS during NumPy installation?
see https://github.com/erikbern/ann-benchmarks/blob/master/ann_benchmarks/__init__.py#L331
the reason I don't want to use BruteForceBLAS for benchmarks is that it uses multiple CPU cores by default and I'm not sure how to disable that.
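One possible way around that (my suggestion, not something the PR does): cap the BLAS thread pool at runtime with the threadpoolctl package, which works for OpenBLAS and MKL without touching environment variables:

```python
import numpy as np
from threadpoolctl import threadpool_limits  # pip install threadpoolctl

data = np.random.rand(100000, 100)
query = np.random.rand(100)

# Restrict every BLAS backend NumPy uses to a single thread
# for the duration of this block.
with threadpool_limits(limits=1, user_api="blas"):
    scores = data.dot(query)
```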
Aah, sorry, I thought BruteForce == BruteForceBLAS, for some reason. Never mind then :)
i might rewrite the benchmarks so they run on multiple threads instead... that way i don't have to worry about it. more realistic too
Replace the slow brute force algo (sklearn) with a direct, fast BLAS call.
Make sure you have a fast BLAS installed -- a recent ATLAS, OpenBLAS, Intel's MKL, Apple's Accelerate etc. This can make a huge difference in performance.
PR not tested at all. @erikbern can you run it on the AWS machine? I just wrote the code, I didn't get a chance to run it (the install script assumes Debian). Sorry for any typos.
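As I read the PR, the core idea is roughly the following (a paraphrased sketch, not the PR's actual code): normalize the dataset once, then each cosine query is one BLAS matrix-vector product plus a top-k selection:

```python
import numpy as np

class BruteForceBLAS(object):
    """Sketch: brute-force cosine k-NN via a single BLAS dot call."""

    def fit(self, data):
        # Normalize rows once, so cosine similarity becomes a dot product.
        norms = np.linalg.norm(data, axis=1)
        self.index = data / norms[:, np.newaxis]

    def query(self, v, k):
        v = v / np.linalg.norm(v)
        sims = self.index.dot(v)             # the BLAS call
        top = np.argpartition(-sims, k)[:k]  # partial top-k, NumPy >= 1.8
        return top[np.argsort(-sims[top])]   # highest similarity first
```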