-
Notifications
You must be signed in to change notification settings - Fork 15
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
CML matrix multiplication is terribly inefficient #20
Comments
Sincere apologies for not responding to this. For some reason, I do not receive new issue notifications. Have you observed slowdown in a real application? If so, by how much and for what size matrices? In general, CML is targeted at small, often-changing matrices. Please correct me if I am wrong, but as I recall, cache-conscious matrix product won't make much difference unless the matrices are larger than the L1 and/or L2 cache. Consequently, I'd imagine that one of the cache-optimized BLAS implementations would be more appropriate outside of typical CML use cases. Similarly for SIMD, if you've observed a slowdown in a real application, I'd be very interested. I haven't verified recently, but for small, fixed-size vectors and matrices, loops should be compiler-unrolled automatically. Moreover, on Intel x64, SSE is required for floating-point arithmetic, and later Visual Studio releases should auto-parallelize the loops (this may require a special command-line argument, I don't recall at the moment). I can verify this and make sure I'm not doing anything to get in the compiler's way. Finally, I'm very much interested in these sorts of optimizations, so if you have performance data and/or code that supports your request, that would be fantastic and most welcome. (Of course, pull requests with supporting performance tests would also be welcome.) Thank you! Demian |
Hello Demian, I've found this performance issue during optimization of the graphics sample demonstrating parallel rendering, which is a part of Methane Kit project. Methane Kit is a modern graphics API abstraction library and it is using CML as a base vector math library. Asteroids sample demonstrates parallel rendering up to 50000 asteroids in realtime with animation updates on each frame. It was found that sample is CPU-bound because of animation updates are too slow even with multi-threaded execution. AsteroidsArray::Update function updates parameters of 50000 asteroids on each frame, including calculation of model and model-view-projection matrices, which are produced as the result of 4 multiplications of To fix this I've implemented SSE-optimized template specialization of the matrix product for I think that CML is great C++ vector math library with extensive functionality, but it lacks crtically needed SSE optimizations to be used in high-performance and realtime applications. Hope that you could add these optimizations in future! With Best Regards, |
Evgeny, thank you very much for the details. I will definitely look into it. Just out of curiosity, was the slowdown in the 32-bit or 64-bit build of the Asteroids sample (or both)? Thanks again! |
CML matrix multiplication is based on naive algorithm which does not take into account any cache memory optimizations. As the result matrix multiplications becomes bottleneck in any real-world applications using CML. It would be nice to implement at least some memory optimizations (https://gist.github.com/nadavrot/5b35d44e8ba3dd718e595e40184d03f0). Also it would be great to use SIMD intrinsics (https://codereview.stackexchange.com/questions/101144/simd-matrix-multiplication) if not too hard to implement in template code.
The text was updated successfully, but these errors were encountered: