
Inconsistent FFT Results Across Different Optimization Levels on ESP32 Platform #180

Closed

ramtej opened this issue Jun 12, 2023 · 21 comments

@ramtej

ramtej commented Jun 12, 2023

Description

I am comparing various Rust FFT implementations on both native (x86) and ESP32 (xtensa/esp32s3) platforms. The goal is to observe the performance and accuracy of the FFT libraries. I utilize 'std' Rust code for the embedded ESP platform via Rust bindings for ESP-IDF (Espressif's IoT Development Framework).

The following FFT libraries are currently used for testing:

  • FFTW
  • RustFFT / RealFFT
  • Microfft

Input and Expected Result

I generate a simple sinusoidal time series that's small enough to fit into the RAM of the ESP32 platform. This is then processed with a forward FFT operation, and a normalized amplitude of the spectrum is calculated. For x86 and ESP32, and for the different FFT libraries, the expected results should be as follows:

arch[x86_64]:ffw3.amplitudes[3] = 8

Actual Result

However, when the compiler optimization level is changed, I notice some inconsistencies in the results for the ESP32 platform (xtensa). This problem does not occur on the x86 platform.

For example, with opt-level = 0, MicroFFT produces an unexpected value at microfft.amplitudes[5] = 1, while with opt-level = 1, unexpected values show up at various frequencies and RealFFT's peak is shifted in frequency. The same issues appear with opt-level = 3.

arch[xtensa]:microfft.amplitudes[0] = 0
arch[xtensa]:microfft.amplitudes[1] = 2 <- ?
arch[xtensa]:microfft.amplitudes[2] = 0
arch[xtensa]:microfft.amplitudes[3] = 3 <- ?
arch[xtensa]:microfft.amplitudes[4] = 0
arch[xtensa]:microfft.amplitudes[5] = 5 <- ?
arch[xtensa]:microfft.amplitudes[6] = 0
arch[xtensa]:microfft.amplitudes[7] = 4 <- ?
..
..
arch[xtensa]:realfft.amplitudes[0] = 0
arch[xtensa]:realfft.amplitudes[1] = 0
arch[xtensa]:realfft.amplitudes[2] = 0
arch[xtensa]:realfft.amplitudes[3] = 0
arch[xtensa]:realfft.amplitudes[4] = 0
arch[xtensa]:realfft.amplitudes[5] = 8 <- Valid value, but shifted in frequency!?
arch[xtensa]:realfft.amplitudes[6] = 0
arch[xtensa]:realfft.amplitudes[7] = 0
arch[xtensa]:realfft.amplitudes[8] = 0

This issue needs to be addressed to ensure the FFT results are consistent across all platforms and compiler optimization levels.

Steps to Reproduce

  • Generate the sinusoidal time series.
  • Perform a forward FFT operation.
  • Calculate a normalized amplitude of the spectrum.
  • Change the compiler optimization level.
  • Repeat steps 2 and 3 and compare results.
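
In code, these steps boil down to something like the following sketch (assuming N = 16 samples and microfft's rfft_16; names and normalization are illustrative, not the repository's exact code):

use std::f32::consts::PI;

fn main() {
    const N: usize = 16;
    // Step 1: a unit-amplitude sine that hits bin 3 exactly.
    let mut samples = [0.0f32; N];
    for (n, s) in samples.iter_mut().enumerate() {
        *s = (2.0 * PI * 3.0 * n as f32 / N as f32).sin();
    }
    // Step 2: forward real FFT, in place (microfft packs the Nyquist bin
    // into the imaginary part of bin 0).
    let spectrum = microfft::real::rfft_16(&mut samples);
    // Step 3: bin amplitudes; a unit sine at bin 3 gives |X[3]| = N/2 = 8.
    for (k, c) in spectrum.iter().enumerate() {
        let amplitude = (c.re * c.re + c.im * c.im).sqrt();
        println!("microfft.amplitudes[{k}] = {amplitude:.0}");
    }
}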

I have already implemented the above steps in the following repository:

https://github.com/ramtej/esp32-compiler-marvels-std-rs

Thank you,
Jiri

@MabezDev
Member

I can reproduce this; it looks like a misoptimization with hardware floats. Switching to soft floats (disabling floating-point hardware acceleration) seems to produce the correct results regardless of the opt-level: rustflags = ["-C", "target-feature=-fp", ...].
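
For reference, that flag would typically live in the project's .cargo/config.toml, along these lines (the target triple here is assumed from the esp32s3 context):

[target.xtensa-esp32s3-espidf]
rustflags = ["-C", "target-feature=-fp"]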

Do you think you could try to reduce the test case and pinpoint which float operations are going wrong? I can then take a closer look with that knowledge in hand.

@MabezDev MabezDev transferred this issue from esp-rs/esp-idf-hal Jun 12, 2023
@ramtej
Author

ramtej commented Jun 12, 2023

Thank you! Alright, I'll try to identify the float operation or the combination of operations.

@ramtej
Author

ramtej commented Jun 13, 2023

OK, I was able to isolate the case a bit more; however, I guess more questions than answers remain.

use num_complex::Complex32;

pub fn test_compute_butterflies_narrow() {
    // Some arbitrary complex f32 numbers
    let x: Vec<Complex32> = vec![Complex32::new(-0.000000023849761, -0.9238794)];
    // Same effect as 'println!("arch[{}]:x = {:?}", ARCH, x);'
    std::hint::black_box(&x); // <- without this, the computation below is valid?!
    let y1 = x[0] * Complex32::new(0., -1.); // <- ISSUE HERE!
    // Note(jj): y1 has the wrong value; the sign is "flipped" in the imaginary part:
    // arch[x86_64]:y = -0.9238794+0.000000023849761i versus
    // arch[xtensa]:y = -0.9238794-0.000000023849761i
    // .. but only when the println! or compiler hint (black_box) above is present!
    println!("arch[{}]:y.1 = {}", ARCH, y1);

    let y2 = Complex32::new(-0.000000023849761, -0.9238794) * Complex32::new(0., -1.);
    println!("arch[{}]:y.2 = {}", ARCH, y2);

    // Note(jj): Fails on esp32 but works on x86
    // assert_eq!(y1, y2);
}

Output:

arch[x86_64]:y.1 = -0.9238794+0.000000023849761i
arch[x86_64]:y.2 = -0.9238794+0.000000023849761i
..
arch[xtensa]:y.1 = -0.9238794-0.000000023849761i  <- WRONG
arch[xtensa]:y.2 = -0.9238794+0.000000023849761i

See also here: https://github.com/ramtej/esp32-compiler-marvels-std-rs/blob/1032788fbcbf3733c832a0e3af039c00ba79f802/src/main.rs#L73

The problem occurs when two complex f32 numbers are multiplied, but only if one of them is stored in an array (or vector?), as in the example above.

Another condition is that this vector must be referenced beforehand, e.g. with a println! or a compiler hint. Without this, the correct value is calculated. The delta is in the imaginary part of the complex number and differs only in sign.

Thank you,
Jiri

@ramtej
Author

ramtej commented Jun 13, 2023

I was able to narrow down the topic further:

#[derive(Clone, Debug)]
struct MyComplex<T> {
    re: T,
    im: T,
}

impl MyComplex<f32> {
    fn new(re: f32, im: f32) -> Self {
        Self { re, im }
    }
}

pub fn test_compute_butterflies_narrow_v2() {
    let c1 = MyComplex::new(-1., -1.);
    let x: Vec<MyComplex<f32>> = vec![c1.clone()];

    std::hint::black_box(&x);
    //println!("arch[{}]:x = {:?}", ARCH, x);

    let c2 = MyComplex::new(0., -1.);
    let re = x[0].re * c2.re - x[0].im * c2.im;
    let im = x[0].re * c2.im + x[0].im * c2.re;
    println!("arch[{}]:re = {}", ARCH, re);
    println!("arch[{}]:im = {}", ARCH, im);
    assert_eq!(re, -1.0);
    assert_eq!(im, 1.0); // <- fails on esp32
}

Output

arch[x86_64]:re = -1
arch[x86_64]:im = 1
..
arch[xtensa]:re = -1
arch[xtensa]:im = -1 <- WRONG

On the other hand, if I perform the same calculation without a struct, the correct values are produced.

pub fn test_compute_butterflies_narrow_v3() {
    let re1 = -1.;
    let im1 = -1.;
    let x = vec![re1, im1];

    std::hint::black_box(&x);

    let re2 = 0.0;
    let im2 = -1.0;

    let re = x[0] * re2 - x[1] * im2;
    let im = x[0] * im2 + x[1] * re2;
    println!("arch[{}]:re = {}", ARCH, re);
    println!("arch[{}]:im = {}", ARCH, im);

    assert_eq!(re, -1.0);
    assert_eq!(im, 1.0); // <- works on esp32
}

The combination of a struct inside a Vec together with the 'println!' or the compiler hint triggers the issue.

Summary

im = re1*im2 + im1*re2 = (-1)*(-1) + (-1)*0 = 1, but on the xtensa arch it comes out as -1.

@ramtej
Author

ramtej commented Jun 21, 2023

Is there anything else I can do? From my point of view, the issue is quite fundamental, and I'm honestly surprised that it hasn't been discovered before. A regression test, a fuzzer, or a quickcheck-style property (see the sketch below) should have found it. What can I do?
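
For instance, something along these lines might have caught it. This is only a sketch: it relies on IEEE-754 subtraction being sign-symmetric (z - x*y == -(x*y - z) exactly), and on #[inline(never)] keeping each multiply single-use so the fusing can trigger; the inputs are illustrative.

#[inline(never)]
fn sub_a(x: f32, y: f32, z: f32) -> f32 {
    z - x * y // one candidate shape for msub.s fusing
}

#[inline(never)]
fn sub_b(x: f32, y: f32, z: f32) -> f32 {
    x * y - z // the mirrored shape; must be the exact negation of sub_a
}

fn main() {
    // A handful of fixed inputs; a fuzzer or quickcheck would generate these.
    for &(x, y, z) in &[(0.99, 1.0, 0.0), (-1.0, -1.0, 0.0), (0.5, 3.0, 2.0)] {
        let (x, y, z) = (
            std::hint::black_box(x),
            std::hint::black_box(y),
            std::hint::black_box(z),
        );
        assert_eq!(sub_a(x, y, z), -sub_b(x, y, z)); // fails under the miscompilation
    }
}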

Thank you.

@Kurielu

Kurielu commented Jun 21, 2023

I was going to submit an issue, but this seems to be the same problem as the one I've got. So here is my example to reproduce the error:

use std::f32::consts::PI; // assuming the f32 constant, to match the printed precision

fn main() {
    let a = 4.215796;
    println!("{a}"); // print_1
    let pi_3 = PI / 3.0;
    println!("{pi_3}"); // print_2
    let c = 5.0 * pi_3;
    println!("{c}"); // print_3
    println!("should be the same and positive {} and {}", c - a, 5.0 * pi_3 - a);
}

Output:
"should be the same and positive 1.0201917 and -1.0201919"
Let's say this is +/- (there is only a problem with the sign) where it should be +/+
print_1 or print_2 commented out: result: +/+
print_3 commented out: -/-

rustc 1.70.0
esp-idf v5.0 24b9d38
using esp32s3

@ramtej
Author

ramtej commented Jun 21, 2023

I can reproduce it. That narrows it down even further, thanks!

https://rust.godbolt.org/ -> could we get further there, e.g. by looking at what the compiler actually generates? Any other ideas?

@Kurielu

Kurielu commented Jun 21, 2023

It is not my specialization, so I won't be able to help further.
Here is a minimal example that I've found:

fn main() -> ! {
    let a = 0.0;
    let p: f32 = 1.0;
    println!("{p}");
    println!("{a}");
    let v = 0.99 * p - a;
    println!("{}", v); // "-0.99"!
    loop {}
}

The neat thing about it is that we can "force" the correct sign by changing

let v = 0.99 * p - a;
// into
let v = 0.99 * p + a;

which gives a very small assembly difference:

2933c2933
< 42000751:	4aa980        	madd.s	f10, f9, f8
---
> 42000751:	5aa980        	msub.s	f10, f9, f8

@ramtej
Author

ramtej commented Jun 22, 2023

Hmm, are we about to invent new arithmetic? The problem is transitive, meaning it can occur in dependent crates as well.

@zRedShift

I can confirm I have this issue, and have had it multiple times in the past. Soft FP always fixes it, but I would like the performance, heh.

In my (current) case, it occurs when doing the final twiddle step in an RDFT:

fn twiddle<const N: usize>(twiddles: &[C; N], left: &mut [C; N], right: &mut [C; N]) {
    for ((&twiddle, left), right) in twiddles
        .iter()
        .zip(left.iter_mut())
        .zip(right.iter_mut().rev())
    {
        let sum = *left + *right;
        let diff = *left - *right;
        let twiddled_re_sum = sum * twiddle.re;
        let twiddled_im_sum = sum * twiddle.im;
        let twiddled_re_diff = diff * twiddle.re;
        let twiddled_im_diff = diff * twiddle.im;
        let half_sum_re = 0.5 * sum.re;
        let half_diff_im = 0.5 * diff.im;
        let output_twiddled_real = twiddled_re_sum.im + twiddled_im_diff.re;
        let output_twiddled_im = twiddled_im_sum.im - twiddled_re_diff.re;
        *left = C {
            re: half_sum_re + output_twiddled_real,
            im: half_diff_im + output_twiddled_im,
        };
        *right = C {
            re: half_sum_re - output_twiddled_real,
            im: output_twiddled_im - half_diff_im,
        };
    }
}

If I don't put logging in the middle, it will sometimes send the values to the wrong places (swapping the im values of left and right, for instance). This makes it apparent that the problem is in basic arithmetic (addition vs. subtraction), since the calculations here are anti-symmetric.

@zRedShift

@Kurielu @ramtej can you try to reproduce on https://github.com/esp-rs/rust-build/releases/tag/v1.71.0.0 ? I don't have any reason to suspect it got fixed by itself, but who knows.

@zRedShift

I can't check on a real esp32s3 right now, but from reading the compiled assembly for

#[inline(never)]
fn math(x: f32, y: f32) -> f32 {
    0.99 * y - x
}

From what I can see it translates to

42007488 <_ZN17fp_miscompilation4math17h29f7850c19794fe1E.llvm.13748823926208495174>:
42007488:       004136          entry   a1, 32
4200748b:       e49281          l32r    a8, 420006d4 <__default_double_exception+0x1c875e0>
4200748e:       fa8850          wfr     f8, a8
42007491:       fa9350          wfr     f9, a3
42007494:       faa250          wfr     f10, a2
42007497:       5aa980          msub.s  f10, f9, f8
4200749a:       fa2a40          rfr     a2, f10
4200749d:       f01d            retw.n

Read 0.99 into a8
Load a8 into f8, an FP register
Load a3 (y) into f9
Load a2 (x) into f10
Write (f10 - (f9 * f8)) to f10
Write f10 to a2 and return

So it returns x - y * 0.99 instead of 0.99 * y - x.

I'll check that the madd.s/msub.s instruction encoding of the registers isn't flipped in XtensaInstrInfo.td

@Kurielu

Kurielu commented Jul 12, 2023

Still the same.
Maybe this will be a clue for you.

#[inline(never)]
fn math(x: f32, y: f32) -> f32 {
    0.99 * y - x
}

#[entry]
fn main() -> ! {
    let a = 0.0;
    let p = 0.99f32;
    let v = math(a, p);
    println!("{}", v);
    loop {}
}

#[inline(never)] breaks math right away without the need to print any variable.
With #[inline(always)] we get a positive number.

@zRedShift

In my case it breaks regardless.
The instruction definition seems correct, but something else bothers me.

msub.s works like fr - (fs * ft); there is no variant computing (fs * ft) - fr, so it makes sense that the instruction is used this way, but a negation step seems to be missing:

neg.s f10, f10
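
Restated in Rust terms (an illustration of the analysis above, not actual source code or codegen):

// What the backend effectively emits for `0.99 * y - x` today:
fn emitted(x: f32, y: f32) -> f32 {
    x - 0.99 * y // msub.s alone: fr - (fs * ft)
}

// What a correct lowering corresponds to:
fn intended(x: f32, y: f32) -> f32 {
    -(x - 0.99 * y) // msub.s followed by neg.s
}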

@zRedShift

#[inline(never)]
fn math(x: f32, y: f32) -> f32 {
    unsafe { core::intrinsics::fmaf32(0.99, y, -x) }
}

Generates

42007488 <_ZN17fp_miscompilation4math17h29f7850c19794fe1E.llvm.6984577647175520593>:
42007488:       004136          entry   a1, 32
4200748b:       e33b81          l32r    a8, 42000178 <__default_double_exception+0x1c87084>
4200748e:       308280          xor     a8, a2, a8
42007491:       fa8850          wfr     f8, a8
42007494:       e49081          l32r    a8, 420006d4 <__default_double_exception+0x1c875e0>
42007497:       fa9850          wfr     f9, a8
4200749a:       faa350          wfr     f10, a3
4200749d:       4a8a90          madd.s  f8, f10, f9
420074a0:       fa2840          rfr     a2, f8
420074a3:       f01d            retw.n

Xor top bit of x to negate it, and write it to f8
calculate f8 + (y * 0.99)
Which is correct.
Checking LLVM DAG lowering code, specifically performMADD_MSUBCombine

@zRedShift
Copy link

@ramtej @MabezDev @Kurielu
Just an update: I've determined that the problem is indeed in performMADD_MSUBCombine. It treats x * y ± z and z ± x * y symmetrically, basically as equivalent, which is obviously a problem when the operator is minus.
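
In other words (illustrative identities):

x*y + z == z + x*y       addition commutes, so fusing either form to madd.s is fine
x*y - z == -(z - x*y)    subtraction anti-commutes, so lowering both forms to the
                         same msub.s pattern flips the sign of one of them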

This code seems to have been copied with minor modifications from the MIPS backend, which doesn't have a fused multiply-subtract instruction, so it handles only the FMA case.

Working on a fix that I will submit as a PR to espressif/llvm-project shortly.

@ramtej
Author

ramtej commented Jul 13, 2023

Thanks a lot - you are a hero :-)

@MabezDev
Member

Sorry folks, I've been on vacation. Thanks for looking into this, and thank you @zRedShift for figuring out a solution! I'll make sure we get that LLVM PR reviewed ASAP and cut a patch release of the Rust compiler.

@zRedShift

I tried my FFT twiddling code from earlier with a newly built version of 1.71.0.1 with my branch of llvm (rust-esp), and it gives the correct results, so all good on that point.

And I finally understand why the floating-point values sometimes became correct when printing them (a heisenbug that eluded me for a while, since this miscompilation isn't new). The function performMADD_MSUBCombine checks whether the multiplication result has exactly one use (that is, nothing else cares about the product by itself), and only then does the fusing. I guess that when printing the intermediate values, the print became a second use of the product, and thus the fusing was skipped.
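
A sketch of that use-count condition in Rust (illustrative, assuming the behavior described above):

fn one_use(x: f32, y: f32) -> f32 {
    0.99 * y - x // the product has a single use -> eligible for the (buggy) msub fusing
}

fn two_uses(x: f32, y: f32) -> f32 {
    let p = 0.99 * y;
    println!("{p}"); // printing is a second use of the product...
    p - x // ...so performMADD_MSUBCombine bails out and the result stays correct
}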

@MabezDev
Member

The esp 1.72 release is now available as a pre-release, and includes @zRedShift's LLVM patches: https://github.com/esp-rs/rust-build/releases/tag/v1.72.0.0

@ramtej
Author

ramtej commented Aug 24, 2023

Hello, I can confirm this: my application now also works with hardware FP, and I measure a performance increase of ~3.6x. Thanks again for the support. I will close this issue now. See you all at the next ticket/issue, lol.

Regards,
Jiri

@ramtej ramtej closed this as completed Aug 24, 2023