Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Chasing an UB-smelling bug related to swap_rows #396

Closed
HadrienG2 opened this issue Sep 21, 2018 · 12 comments
Closed

Chasing an UB-smelling bug related to swap_rows #396

HadrienG2 opened this issue Sep 21, 2018 · 12 comments
Labels
bug P-high High priority

Comments

@HadrienG2
Copy link

HadrienG2 commented Sep 21, 2018

To the best of my knowledge, the following two code snippets (where mat is an owned mutable nalgebra Matrix) should produce the same final result:

// Clean version
mat.swap_rows(r1, r2);

// Ugly version
let row1 = mat.fixed_rows::<U1>(r1).into_owned();
let row2 = mat.fixed_rows::<U1>(r2).into_owned();
mat.fixed_rows_mut::<U1>(r1).copy_from(&row2);
mat.fixed_rows_mut::<U1>(r2).copy_from(&row1);

Unfortunately, I have a program where the former produces different (and wrong) results, with suspiciously UB-ish symptoms: if I add a print statement near the code that does the swap_rows, the results become correct.

I do not use unsafe myself, so I can only assume that the issue is in nalgebra. But looking at the source code of swap_rows did not reveal anything suspicious, and I was not successful in producing a simpler reproducer that I can share with you so far.

If you have more experience in debugging this kind of issues than I do, any tips on how to narrow down the problem further would be much appreciated.

@HadrienG2 HadrienG2 changed the title Chasing an UB bug related to swap_rows Chasing an UB-smelling bug related to swap_rows Sep 21, 2018
@HadrienG2
Copy link
Author

HadrienG2 commented Sep 21, 2018

I did not manage to produce a simple reproducer yet, but the full application is on Github if you feel like playing around with it:

Outgoing momenta (before sorting): 
  ┌                                                             ┐
  │ -3.0908192158204018   5.645130697066673  -2.889258435688299 │
  │  21.908710606285513   33.50386542925947 -3.6863019466404503 │
  │ -18.817891390465114  -39.14899612632614   6.575560382328753 │
  └                                                             ┘


Outgoing momenta (after sorting): 
  ┌                                                             ┐
  │ -18.817891390465114  -39.14899612632614   6.575560382328753 │
  │  21.908710606285513   33.50386542925947 -3.6863019466404503 │
  │ -3.0908192158204018   5.645130697066673  -2.889258435688299 │
  └                                                             ┘
  • Now, comment out the print statement and run in release mode again. On my computer (Haswell CPU + Rust 1.29), this produces a different result.
Outgoing momenta (after sorting): 
  ┌                                                             ┐
  │ -18.817891390465114  -39.14899612632614   6.575560382328753 │
  │  21.908710606285513   33.50386542925947 -3.6863019466404503 │
  │  21.908710606285513   33.50386542925947 -3.6863019466404503 │
  └                                                             ┘
  • The event-matrix branch of the same repo applies the workaround above and runs correctly (you can rebase annotate_sort on top of it if you want to compare).

There is no question that sorting the rows of a column-major matrix is highly questionable from a performance point of view (it's a long story...). But miscompiling it is going one step too far! :)

@sebcrozet sebcrozet added bug P-high High priority labels Sep 22, 2018
@sebcrozet
Copy link
Member

sebcrozet commented Sep 22, 2018

This definitely looks like an UB.
I've not found a good way to debug this yet. It appears the result is correct again after replacing the nalgebra call to mem::swap there by a ptr::swap or a ptr::swap_nonoverlapping, or by:

        if mem::size_of::<N>() < 32 {
            let z = ptr::read(a);
            ptr::copy_nonoverlapping(b, a, 1);
            ptr::write(b, z);
        } else {
            ptr::swap_nonoverlapping(a, b, 1);
        }

which is basically the stdlib implementation of mem::swap. So I am not sure how to track down this bug. Maybe looking at the generated assembly would help…

@sebcrozet
Copy link
Member

Also note that preventing inlining (#[inline(never)]) of the matrix swap_rows method also prevents the bug from happening.

@HadrienG2
Copy link
Author

HadrienG2 commented Sep 22, 2018

One hypothesis which I have concerning to this bug, which your observations tend to fuel, is that it might be related to the use of two &mut that point within the same memory region.

Material on unsafe Rust says that &mut must not alias, but does not go much further than that. This leaves some dangerous margin of interpretation for the compiler. For example, in our case, can a load/store of a certain size be extended into a masked load/store of a larger size for optimization, even if it potentially creates some aliasing which did not exist before?

To be more precise, Rust references are currently defined by the language reference to do what LLVM's noalias does. But LLVM's noalias is described in compiler-speak that is not very accessible to mere mortals and its implementation historically had many bugs which stopped rustc from using it until very recently. So there is a lot of room for mis-interpretation and mis-implementation here.

Hence instead of optimized assembly, which is highly sensitive to environmental conditions as we have both observed, maybe the right approach would be to take a simple swap_rows() example, generate LLVM IR from it, and go see the compiler team / unsafe code guidelines team to ask them if the IR really does what we think it does.

@HadrienG2
Copy link
Author

HadrienG2 commented Sep 22, 2018

Finally, a minimal reproducer!

extern crate nalgebra;

use nalgebra::Matrix3x4;

fn swappy() -> Matrix3x4<f32> {
    let mut mat = Matrix3x4::new(1., 2.,  3.,  4.,
                                 5., 6.,  7.,  8.,
                                 9., 10., 11., 12.);

    // NOTE: This printf makes the bug go away, suggesting UB or a codegen issue
    // println!("Result: {}", mat);

    for i in 0..2 {
        for j in i+1..3 {
            if mat[(j, 3)] > mat[(i, 3)] { mat.swap_rows(i, j); }
        }
    }

    mat
}

fn main() {
    let mat = swappy();
    println!("Result: {}", mat);
}

You must build in release mode and with codegen-units=1 to make it fail.

I did not manage to make enough sense of LLVM IR to extract a minimal subset of it, though, so I opened a ticket on the unsafe code guidelines repo and leave that part to them if they need it.

@HadrienG2
Copy link
Author

According to the discussion that occured on rust-lang/rust-memory-model#50, the code follows Rust's aliasing rules, and this is therefore a compiler bug. So, to rust-lang/rust I go next...

@sebcrozet
Copy link
Member

Thanks for all your investigations @HadrienG2 !

@HadrienG2
Copy link
Author

HadrienG2 commented Sep 27, 2018

Latest development at rust-lang/rust: the bug was reproduced with safe code, so this is definitely a compiler bug.

@HadrienG2
Copy link
Author

HadrienG2 commented Sep 30, 2018

A workaround which disables the compiler's assumption of no mutable aliasing ("noalias") for &mut until the root cause is found has landed in beta, should be part of the next stable release.

@pietroalbini
Copy link

This will also be fixed in Rust 1.29.2, which should be released tomorrow.

@HadrienG2
Copy link
Author

1.29.2 has now been officially released, and I can confirm that it fixes this bug.

@sebcrozet
Copy link
Member

Great ! We can close this then. Thanks for reporting and following this so closely.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug P-high High priority
Projects
None yet
Development

No branches or pull requests

3 participants