
Slow matrix-vector multiplication #334

Open
robsmith11 opened this issue May 1, 2018 · 6 comments

Comments

@robsmith11

I was trying to speed up my neural network calculations (which involve many matrix-vector multiplications) and found that simply writing the multiplication with a reference to the matrix rather than the matrix itself produced a 36% reduction in run-time.

Is there any reason why m*v should be so much slower than &m*v? If not, would it be possible to make them both fast?

Example:

extern crate nalgebra;
use nalgebra::*;

fn main() {
    let mut s = 0f64;

    let m: MatrixMN<f64, U15, U20> = zero();
    let v: VectorN<f64, U20> = zero();
    let mut r: VectorN<f64, U15> = zero();

    for _ in 0..10000000 {
        r = &m * v;   // fast
        // r = m * v; // slow
        s += r.iter().sum::<f64>();
    }

    println!("{}", s);
}
@jswrenn
Contributor

jswrenn commented May 3, 2018

Judging by a diff between the outputs of cargo asm on the fast and slow versions of your example, it looks like the primary difference between them is that the slow one involves an extra memcpy.

The fact that rather significant memcpys can sneak into code using nalgebra is an unfortunate side-effect of Matrix implementing Copy. This issue is sorta tracked by #282, which mostly focuses on the benefits of being able to support non-Copy internal types. I would really like to see Copy removed from nalgebra's container types because of subtle performance pitfalls like the one you're describing.
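To make the mechanism concrete, here is a minimal, hypothetical sketch (not nalgebra's actual code): for a large `Copy` type, a by-value operator must receive a bitwise copy of the operand, while the by-reference impl only passes a pointer.

```rust
use std::ops::Mul;

// A large Copy type, roughly a 15x20 f64 matrix's worth of data (2400 bytes).
#[derive(Clone, Copy)]
struct Big([f64; 300]);

// By-value impl: `self` arrives as a full 2400-byte copy (the memcpy in question).
impl Mul<f64> for Big {
    type Output = Big;
    fn mul(mut self, rhs: f64) -> Big {
        for x in self.0.iter_mut() {
            *x *= rhs;
        }
        self
    }
}

// By-reference impl: `self` is just a pointer; no operand copy is made.
impl Mul<f64> for &Big {
    type Output = Big;
    fn mul(self, rhs: f64) -> Big {
        let mut out = *self; // the only copy is for the result
        for x in out.0.iter_mut() {
            *x *= rhs;
        }
        out
    }
}

fn main() {
    let m = Big([1.0; 300]);
    let a = m * 2.0;  // by value: operand copied into `mul`
    let b = &m * 2.0; // by reference: operand borrowed
    assert_eq!(a.0[0], 2.0);
    assert_eq!(b.0[0], 2.0);
}
```

Because `Big` is `Copy`, `m * 2.0` silently copies all 2400 bytes into the call; whether the optimizer elides that copy is exactly what's at issue here.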

@robsmith11
Author

Thanks for tracking down the difference. I'm surprised to see Matrix implementing Copy. Isn't Copy intended only for types of trivial size/complexity?

@sebcrozet
Member

sebcrozet commented May 4, 2018

That's an interesting observation @jswrenn. I'm actually surprised that the optimizer does not remove the memcpy, especially since the corresponding implementation of Mul does use references internally. So I expected the compiler not to generate any memcpy after inlining the mul method.

I agree that removing Copy would avoid this kind of pitfall. Though we still need it (for convenience) for small vectors and matrices. So a solution would be to implement Copy only for matrices/vectors up to a small dimension, e.g., 6, which seems to be the threshold where the extra memcpy impacts performance negatively on benchmarks.

@robsmith11 I'm not sure there is a consensus regarding which types should implement Copy. The documentation mentions:

Generally speaking, if your type can implement Copy, it should.

@jswrenn
Contributor

jswrenn commented May 4, 2018

I also started a branch yesterday that completely removes Copy. I shared @robsmith11's intuition, but after reading the documentation, I'm having doubts:

  1. From the Copy documentation:

    When should my type implement Copy?
    Generally speaking, if your type can implement Copy, it should.

  2. From the array documentation

    Arrays of any size are Copy if the element type is Copy. This works because the Copy trait is specially known to the compiler.

The Rust documentation is clear: Matrix should implement Copy, regardless of its dimensions. Removing Copy just to get Rust to produce a use-after-move error message is an abuse of the language (indeed, it won't prevent the first needless memcpy, since that is all a move is). I'm now leaning against removing the Copy impl.

However, the fact remains that we want to guide users towards writing fast code, and we want their code to be fast by default. One option would be to only implement binops over references (which would present users with a "did you mean to pass a reference" error), but nalgebra already does some really clever things with re-using memory when operands are passed by move, and it would be a shame to get rid of that. On the other hand, nalgebra's users are themselves inadvertently getting rid of that benefit by not realizing that Matrix implements Copy and performing binops on values instead of references.
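The trade-off between the two impl strategies can be sketched on a toy heap-backed vector type (names are illustrative, not nalgebra's actual internals): the reference impl must always allocate a fresh result, while the by-value impl can reuse the moved-in operand's storage.

```rust
use std::ops::Add;

// Toy heap-backed vector; not Copy, so moves are visible in the type system.
#[derive(Clone)]
struct Vec3(Box<[f64; 3]>);

// Reference impl: both operands are borrowed, so a fresh allocation
// is required for the result.
impl Add<&Vec3> for &Vec3 {
    type Output = Vec3;
    fn add(self, rhs: &Vec3) -> Vec3 {
        Vec3(Box::new([
            self.0[0] + rhs.0[0],
            self.0[1] + rhs.0[1],
            self.0[2] + rhs.0[2],
        ]))
    }
}

// By-value impl: the left operand is moved in, so its storage can be
// reused for the result -- the benefit a references-only API would lose.
impl Add<&Vec3> for Vec3 {
    type Output = Vec3;
    fn add(mut self, rhs: &Vec3) -> Vec3 {
        for i in 0..3 {
            self.0[i] += rhs.0[i];
        }
        self
    }
}

fn main() {
    let x = Vec3(Box::new([1.0, 2.0, 3.0]));
    let y = Vec3(Box::new([4.0, 5.0, 6.0]));
    let by_ref = &x + &y; // allocates a new result
    let by_val = x + &y;  // x is moved; its buffer is reused
    assert_eq!(by_ref.0[2], 9.0);
    assert_eq!(by_val.0[2], 9.0);
}
```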

@jswrenn
Contributor

jswrenn commented May 4, 2018

So I expected the compiler not to generate any memcpy after inlining the mul method.

The multiplication isn't actually being inlined. In the asm from @robsmith11's example:

call    nalgebra::core::ops::<impl core::ops::arith::Mul<&'b nalgebra::core::matrix::Matrix<N, R2, C2, SB>> for &'a nalgebra::core::matrix::Matrix<N, R1, C1, SA>>::mul

I just tried replacing all instances of #[inline] with #[inline(always)] in ops.rs; this causes the mul to be inlined, but unfortunately it doesn't result in the extra memcpy being optimized away.

The un-elided memcpy issue seems related to rust-lang/rust#42763. Broadly speaking, we should probably be paying more attention to places in nalgebra where impls that consume self should really consume &self instead.
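For illustration, a hypothetical type (not from nalgebra) makes the `self` vs `&self` distinction concrete: a method that consumes `self` forces the caller to move the whole value into the call (for a `Copy` type, a bitwise copy), while a `&self` method only passes a pointer.

```rust
// Hypothetical 2 KiB Copy type.
#[derive(Clone, Copy)]
struct Grid([f64; 256]);

impl Grid {
    // Consuming version: the entire 2 KiB value crosses the call boundary.
    fn sum_by_value(self) -> f64 {
        self.0.iter().sum()
    }

    // Borrowing version: only a reference crosses the call boundary.
    fn sum_by_ref(&self) -> f64 {
        self.0.iter().sum()
    }
}

fn main() {
    let g = Grid([0.5; 256]);
    assert_eq!(g.sum_by_value(), 128.0); // `g` is copied into the call
    assert_eq!(g.sum_by_ref(), 128.0);   // `g` is only borrowed
}
```

Whether the copy in the consuming version actually survives to the generated asm depends on the optimizer, which is exactly the uncertainty rust-lang/rust#42763 is about.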

@RReverser

I only just ran into this too. I didn't realise that nalgebra implements e.g. Mul for both values and references, which results in the compiler choosing the most obvious matching trait.

E.g. in the following minimal example, I'm pretty sure most users would write functions like foo rather than bar, without realising that their performance suffers due to the extra copies involved:

pub struct S {
    m: nalgebra::Matrix4<f64>
}

// version where asm shows data being copied onto the stack
pub fn foo(a: &S, b: &S) -> S {
    S {
        m: a.m * b.m
    }
}

// version with direct multiplication
pub fn bar(a: &S, b: &S) -> S {
    S {
        m: &a.m * &b.m
    }
}
