-
-
Notifications
You must be signed in to change notification settings - Fork 485
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Chasing an UB-smelling bug related to swap_rows #396
Comments
I did not manage to produce a simple reproducer yet, but the full application is on Github if you feel like playing around with it:
There is no question that sorting the rows of a column-major matrix is highly questionable from a performance point of view (it's a long story...). But miscompiling it is going one step too far! :) |
This definitely looks like an UB. if mem::size_of::<N>() < 32 {
let z = ptr::read(a);
ptr::copy_nonoverlapping(b, a, 1);
ptr::write(b, z);
} else {
ptr::swap_nonoverlapping(a, b, 1);
} which is basically the stdlib implementation of |
Also note that preventing inlining ( |
One hypothesis which I have concerning to this bug, which your observations tend to fuel, is that it might be related to the use of two &mut that point within the same memory region. Material on unsafe Rust says that &mut must not alias, but does not go much further than that. This leaves some dangerous margin of interpretation for the compiler. For example, in our case, can a load/store of a certain size be extended into a masked load/store of a larger size for optimization, even if it potentially creates some aliasing which did not exist before? To be more precise, Rust references are currently defined by the language reference to do what LLVM's noalias does. But LLVM's noalias is described in compiler-speak that is not very accessible to mere mortals and its implementation historically had many bugs which stopped rustc from using it until very recently. So there is a lot of room for mis-interpretation and mis-implementation here. Hence instead of optimized assembly, which is highly sensitive to environmental conditions as we have both observed, maybe the right approach would be to take a simple swap_rows() example, generate LLVM IR from it, and go see the compiler team / unsafe code guidelines team to ask them if the IR really does what we think it does. |
Finally, a minimal reproducer! extern crate nalgebra;
use nalgebra::Matrix3x4;
fn swappy() -> Matrix3x4<f32> {
let mut mat = Matrix3x4::new(1., 2., 3., 4.,
5., 6., 7., 8.,
9., 10., 11., 12.);
// NOTE: This printf makes the bug go away, suggesting UB or a codegen issue
// println!("Result: {}", mat);
for i in 0..2 {
for j in i+1..3 {
if mat[(j, 3)] > mat[(i, 3)] { mat.swap_rows(i, j); }
}
}
mat
}
fn main() {
let mat = swappy();
println!("Result: {}", mat);
} You must build in release mode and with codegen-units=1 to make it fail. I did not manage to make enough sense of LLVM IR to extract a minimal subset of it, though, so I opened a ticket on the unsafe code guidelines repo and leave that part to them if they need it. |
According to the discussion that occured on rust-lang/rust-memory-model#50, the code follows Rust's aliasing rules, and this is therefore a compiler bug. So, to rust-lang/rust I go next... |
Thanks for all your investigations @HadrienG2 ! |
Latest development at rust-lang/rust: the bug was reproduced with safe code, so this is definitely a compiler bug. |
A workaround which disables the compiler's assumption of no mutable aliasing ("noalias") for &mut until the root cause is found has landed in beta, should be part of the next stable release. |
This will also be fixed in Rust 1.29.2, which should be released tomorrow. |
1.29.2 has now been officially released, and I can confirm that it fixes this bug. |
Great ! We can close this then. Thanks for reporting and following this so closely. |
To the best of my knowledge, the following two code snippets (where
mat
is an owned mutable nalgebra Matrix) should produce the same final result:Unfortunately, I have a program where the former produces different (and wrong) results, with suspiciously UB-ish symptoms: if I add a print statement near the code that does the swap_rows, the results become correct.
I do not use unsafe myself, so I can only assume that the issue is in nalgebra. But looking at the source code of swap_rows did not reveal anything suspicious, and I was not successful in producing a simpler reproducer that I can share with you so far.
If you have more experience in debugging this kind of issues than I do, any tips on how to narrow down the problem further would be much appreciated.
The text was updated successfully, but these errors were encountered: