
Slow matrix-vector multiplication #334

Open
robsmith11 opened this issue May 1, 2018 · 6 comments

Comments

@robsmith11

I was trying to speed up my neural network calculations (which involve many matrix-vector multiplications) and found that simply writing the multiplication with a reference to the matrix rather than the matrix itself produced a 36% reduction in run-time.

Is there any reason why m*v should be so much slower than &m*v? If not, would it be possible to make them both fast?

Example:

extern crate nalgebra;
use nalgebra::*;

fn main() {
    let mut s = 0f64;

    let m: MatrixMN<f64, U15, U20> = zero();
    let v: VectorN<f64, U20> = zero();
    let mut r: VectorN<f64, U15> = zero();

    for _ in 0..10000000 {
        r = &m * v;   // fast
        // r = m * v; // slow
        s += r.iter().sum::<f64>();
    }

    println!("{}", s);
}
@jswrenn
Contributor

jswrenn commented May 3, 2018

Judging by a diff between the outputs of cargo asm on the fast and slow versions of your example, it looks like the primary difference between them is that the slow one involves an extra memcpy.

The fact that rather significant memcpys can sneak into code using nalgebra is an unfortunate side-effect of Matrix implementing Copy. This issue is sorta tracked by #282, which mostly focuses on the benefits of being able to support non-Copy internal types. I would really like to see Copy removed from nalgebra's container types because of subtle performance pitfalls like the one you're describing.
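To make the mechanism concrete, here is a minimal, hypothetical sketch (not nalgebra's actual code): for a large `Copy` type, a by-value operator must receive a bitwise copy of the operand, while the by-reference impl only passes a pointer.

```rust
use std::ops::Mul;

// A large Copy type, roughly a 15x20 f64 matrix's worth of data (2400 bytes).
#[derive(Clone, Copy)]
struct Big([f64; 300]);

// By-value impl: `self` arrives as a full 2400-byte copy (the memcpy in question).
impl Mul<f64> for Big {
    type Output = Big;
    fn mul(mut self, rhs: f64) -> Big {
        for x in self.0.iter_mut() {
            *x *= rhs;
        }
        self
    }
}

// By-reference impl: `self` is just a pointer; no operand copy is made.
impl Mul<f64> for &Big {
    type Output = Big;
    fn mul(self, rhs: f64) -> Big {
        let mut out = *self; // the only copy is for the result
        for x in out.0.iter_mut() {
            *x *= rhs;
        }
        out
    }
}

fn main() {
    let m = Big([1.0; 300]);
    let a = m * 2.0;  // by value: operand copied into `mul`
    let b = &m * 2.0; // by reference: operand borrowed
    assert_eq!(a.0[0], 2.0);
    assert_eq!(b.0[0], 2.0);
}
```

Because `Big` is `Copy`, `m * 2.0` silently copies all 2400 bytes into the call; whether the optimizer elides that copy is exactly what's at issue here.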

@robsmith11
Author

Thanks for tracking down the difference. I'm surprised to see Matrix implementing Copy. Isn't Copy intended only for types of trivial size/complexity?

@sebcrozet
Member

sebcrozet commented May 4, 2018

That's an interesting observation @jswrenn. I'm actually surprised that the optimizer does not remove the memcpy, especially since the corresponding implementation of Mul does use references internally. So I expected the compiler not to generate any memcpy after inlining the mul method.

I agree that removing Copy would avoid this kind of pitfall. Though we still need it (for convenience) for small vectors and matrices. So a solution would be to implement Copy only for matrices/vectors up to a small dimension, e.g., 6, which seems to be the threshold where the extra memcpy impacts performance negatively on benchmarks.

@robsmith11 I'm not sure there is a consensus regarding which types should implement Copy. The documentation mentions:

Generally speaking, if your type can implement Copy, it should.

@jswrenn
Contributor

jswrenn commented May 4, 2018

I also started a branch yesterday that completely removes Copy. I shared @robsmith11's intuition, but after reading the documentation, I'm having doubts:

  1. From the Copy documentation:

    When should my type implement Copy?
    Generally speaking, if your type can implement Copy, it should.

  2. From the array documentation

    Arrays of any size are Copy if the element type is Copy. This works because the Copy trait is specially known to the compiler.

The Rust documentation is clear: Matrix should implement Copy, regardless of its dimensions. Removing Copy just to get Rust to produce a use-after-move error message is an abuse of the language (indeed, it won't prevent the first needless memcpy, since that is all a move is). I'm now leaning against removing the Copy impl.

However, the fact remains that we want to guide users towards writing fast code, and we want their code to be fast by default. One option would be to only implement binops over references (which would present users with a "did you mean to pass a reference" error), but nalgebra already does some really clever things with re-using memory when operands are passed by move, and it would be a shame to get rid of that. On the other hand, nalgebra's users are themselves inadvertently getting rid of that benefit by not realizing that Matrix implements Copy and performing binops on values instead of references.
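The trade-off between the two impl strategies can be sketched on a toy heap-backed vector type (names are illustrative, not nalgebra's actual internals): the reference impl must always allocate a fresh result, while the by-value impl can reuse the moved-in operand's storage.

```rust
use std::ops::Add;

// Toy heap-backed vector; not Copy, so moves are visible in the type system.
#[derive(Clone)]
struct Vec3(Box<[f64; 3]>);

// Reference impl: both operands are borrowed, so a fresh allocation
// is required for the result.
impl Add<&Vec3> for &Vec3 {
    type Output = Vec3;
    fn add(self, rhs: &Vec3) -> Vec3 {
        Vec3(Box::new([
            self.0[0] + rhs.0[0],
            self.0[1] + rhs.0[1],
            self.0[2] + rhs.0[2],
        ]))
    }
}

// By-value impl: the left operand is moved in, so its storage can be
// reused for the result -- the benefit a references-only API would lose.
impl Add<&Vec3> for Vec3 {
    type Output = Vec3;
    fn add(mut self, rhs: &Vec3) -> Vec3 {
        for i in 0..3 {
            self.0[i] += rhs.0[i];
        }
        self
    }
}

fn main() {
    let x = Vec3(Box::new([1.0, 2.0, 3.0]));
    let y = Vec3(Box::new([4.0, 5.0, 6.0]));
    let by_ref = &x + &y; // allocates a new result
    let by_val = x + &y;  // x is moved; its buffer is reused
    assert_eq!(by_ref.0[2], 9.0);
    assert_eq!(by_val.0[2], 9.0);
}
```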

@jswrenn
Contributor

jswrenn commented May 4, 2018

So I expected the compiler not to generate any memcpy after inlining the mul method.

The multiplication isn't actually being inlined. In the asm from @robsmith11's example:

call    nalgebra::core::ops::<impl core::ops::arith::Mul<&'b nalgebra::core::matrix::Matrix<N, R2, C2, SB>> for &'a nalgebra::core::matrix::Matrix<N, R1, C1, SA>>::mul

I just tried replacing all instances of #[inline] with #[inline(always)] in ops.rs; this causes the mul to be inlined, but unfortunately it doesn't result in the extra memcpy being optimized away.

The un-elided memcpy issue seems related to rust-lang/rust#42763. Broadly speaking, we should probably be paying more attention to places in nalgebra where impls that consume self should really consume &self instead.
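For illustration, a hypothetical type (not from nalgebra) makes the `self` vs `&self` distinction concrete: a method that consumes `self` forces the caller to move the whole value into the call (for a `Copy` type, a bitwise copy), while a `&self` method only passes a pointer.

```rust
// Hypothetical 2 KiB Copy type.
#[derive(Clone, Copy)]
struct Grid([f64; 256]);

impl Grid {
    // Consuming version: the entire 2 KiB value crosses the call boundary.
    fn sum_by_value(self) -> f64 {
        self.0.iter().sum()
    }

    // Borrowing version: only a reference crosses the call boundary.
    fn sum_by_ref(&self) -> f64 {
        self.0.iter().sum()
    }
}

fn main() {
    let g = Grid([0.5; 256]);
    assert_eq!(g.sum_by_value(), 128.0); // `g` is copied into the call
    assert_eq!(g.sum_by_ref(), 128.0);   // `g` is only borrowed
}
```

Whether the copy in the consuming version actually survives to the generated asm depends on the optimizer, which is exactly the uncertainty rust-lang/rust#42763 is about.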

@RReverser

I only just ran into this too. I didn't realise that nalgebra implements e.g. Mul for both values and references, which results in the compiler choosing the most obvious matching trait.

E.g. in the following minimal example, I'm pretty sure most users would write functions like foo rather than bar, without realising that their performance suffers due to the extra copies involved:

pub struct S {
    m: nalgebra::Matrix4<f64>
}

// version where asm shows data being copied onto the stack
pub fn foo(a: &S, b: &S) -> S {
    S {
        m: a.m * b.m
    }
}

// version with direct multiplication
pub fn bar(a: &S, b: &S) -> S {
    S {
        m: &a.m * &b.m
    }
}
