
Inconsistent FFT Results Across Different Optimization Levels on ESP32 Platform #180

Closed

ramtej opened this issue Jun 12, 2023 · 21 comments

@ramtej

ramtej commented Jun 12, 2023

Description

I am comparing various Rust FFT implementations on both native (x86) and ESP32 (xtensa/esp32s3) platforms. The goal is to observe the performance and accuracy of the FFT libraries. I utilize 'std' Rust code for the embedded ESP platform via Rust bindings for ESP-IDF (Espressif's IoT Development Framework).

The following FFT libraries are currently used for testing:

  • FFTW
  • RustFFT / RealFFT
  • Microfft

Input and Expected Result

I generate a simple sinusoidal time series that's small enough to fit into the RAM of the ESP32 platform. This is then processed with a forward FFT operation, and a normalized amplitude of the spectrum is calculated. For x86 and ESP32, and for the different FFT libraries, the expected results should be as follows:

arch[x86_64]:ffw3.amplitudes[3] = 8

Actual Result

However, when the compiler optimization level is changed, I notice some inconsistencies in the results for the ESP32 platform (xtensa). This problem does not occur on the x86 platform.

For example, with opt-level = 0, MicroFFT produces an unexpected value at microfft.amplitudes[5] = 1, while with opt-level = 1, unexpected values show up at various frequencies and RealFFT's peak is shifted in frequency. The same issues appear with opt-level = 3.

arch[xtensa]:microfft.amplitudes[0] = 0
arch[xtensa]:microfft.amplitudes[1] = 2 <- ?
arch[xtensa]:microfft.amplitudes[2] = 0
arch[xtensa]:microfft.amplitudes[3] = 3 <- ?
arch[xtensa]:microfft.amplitudes[4] = 0
arch[xtensa]:microfft.amplitudes[5] = 5 <- ?
arch[xtensa]:microfft.amplitudes[6] = 0
arch[xtensa]:microfft.amplitudes[7] = 4 <- ?
..
..
arch[xtensa]:realfft.amplitudes[0] = 0
arch[xtensa]:realfft.amplitudes[1] = 0
arch[xtensa]:realfft.amplitudes[2] = 0
arch[xtensa]:realfft.amplitudes[3] = 0
arch[xtensa]:realfft.amplitudes[4] = 0
arch[xtensa]:realfft.amplitudes[5] = 8 <- Valid value, but shifted in frequency!?
arch[xtensa]:realfft.amplitudes[6] = 0
arch[xtensa]:realfft.amplitudes[7] = 0
arch[xtensa]:realfft.amplitudes[8] = 0

This issue needs to be addressed to ensure the FFT results are consistent across all platforms and compiler optimization levels.

Steps to Reproduce

  • Generate the sinusoidal time series.
  • Perform a forward FFT operation.
  • Calculate a normalized amplitude of the spectrum.
  • Change the compiler optimization level.
  • Repeat steps 2 and 3 and compare results.
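
In code, these steps boil down to something like the following sketch (assuming N = 16 samples and microfft's rfft_16; names and normalization are illustrative, not the repository's exact code):

use std::f32::consts::PI;

fn main() {
    const N: usize = 16;
    // Step 1: a unit-amplitude sine that hits bin 3 exactly.
    let mut samples = [0.0f32; N];
    for (n, s) in samples.iter_mut().enumerate() {
        *s = (2.0 * PI * 3.0 * n as f32 / N as f32).sin();
    }
    // Step 2: forward real FFT, in place (microfft packs the Nyquist bin
    // into the imaginary part of bin 0).
    let spectrum = microfft::real::rfft_16(&mut samples);
    // Step 3: bin amplitudes; a unit sine at bin 3 gives |X[3]| = N/2 = 8.
    for (k, c) in spectrum.iter().enumerate() {
        let amplitude = (c.re * c.re + c.im * c.im).sqrt();
        println!("microfft.amplitudes[{k}] = {amplitude:.0}");
    }
}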

I have already implemented the above steps in the following repository:

https://github.com/ramtej/esp32-compiler-marvels-std-rs

Thank you,
Jiri

@MabezDev
Member

I can reproduce this; it looks like a misoptimization with hardware floats. Switching to soft floats (disabling floating-point hardware acceleration) seems to produce the correct results regardless of the opt-level: rustflags = ["-C", "target-feature=-fp", ...].
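
For reference, that flag would typically live in the project's .cargo/config.toml, along these lines (the target triple here is assumed from the esp32s3 context):

[target.xtensa-esp32s3-espidf]
rustflags = ["-C", "target-feature=-fp"]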

Do you think you could try to reduce the test case and pinpoint which float operations are going wrong? I can then take a closer look with that knowledge in hand.

@MabezDev MabezDev transferred this issue from esp-rs/esp-idf-hal Jun 12, 2023
@ramtej
Author

ramtej commented Jun 12, 2023

Thank you! Alright, I'll try to identify the float operation or the combination of operations.

@ramtej
Author

ramtej commented Jun 13, 2023

OK, I was able to isolate the case a bit more; however, I guess more questions than answers remain.

use num_complex::Complex32;

pub fn test_compute_butterflies_narrow() {
    // Some arbitrary complex f32 numbers
    let x: Vec<Complex32> = vec![Complex32::new(-0.000000023849761, -0.9238794)];
    // Same effect as 'println!("arch[{}]:x = {:?}", ARCH, x);'
    std::hint::black_box(&x); // <- without this, the computation below is valid?!
    let y1 = x[0] * Complex32::new(0., -1.); // <- ISSUE HERE!
    // Note(jj): y1 has the wrong value; the sign is "flipped" in the imaginary part:
    // arch[x86_64]:y = -0.9238794+0.000000023849761i versus
    // arch[xtensa]:y = -0.9238794-0.000000023849761i
    // .. but only when the println! or compiler hint (black_box) above is present!
    println!("arch[{}]:y.1 = {}", ARCH, y1);

    let y2 = Complex32::new(-0.000000023849761, -0.9238794) * Complex32::new(0., -1.);
    println!("arch[{}]:y.2 = {}", ARCH, y2);

    // Note(jj): Fails on esp32 but works on x86
    // assert_eq!(y1, y2);
}

Output:

arch[x86_64]:y.1 = -0.9238794+0.000000023849761i
arch[x86_64]:y.2 = -0.9238794+0.000000023849761i
..
arch[xtensa]:y.1 = -0.9238794-0.000000023849761i  <- WRONG
arch[xtensa]:y.2 = -0.9238794+0.000000023849761i

See also here: https://github.com/ramtej/esp32-compiler-marvels-std-rs/blob/1032788fbcbf3733c832a0e3af039c00ba79f802/src/main.rs#L73

The problem occurs when two complex f32 numbers are multiplied, but only if one of them is stored in an array (or vector?), as in the example above.

Another condition is that this vector must be referenced beforehand, e.g. with a println! or a compiler hint. Without this, the correct value is calculated. The delta is in the imaginary part of the complex number and differs only in sign.

Thank you,
Jiri

@ramtej
Author

ramtej commented Jun 13, 2023

I was able to narrow down the topic further:

#[derive(Clone, Debug)]
struct MyComplex<T> {
    re: T,
    im: T,
}

impl MyComplex<f32> {
    fn new(re: f32, im: f32) -> Self {
        Self { re, im }
    }
}

pub fn test_compute_butterflies_narrow_v2() {
    let c1 = MyComplex::new(-1., -1.);
    let x: Vec<MyComplex<f32>> = vec![c1.clone()];

    std::hint::black_box(&x);
    //println!("arch[{}]:x = {:?}", ARCH, x);

    let c2 = MyComplex::new(0., -1.);
    let re = x[0].re * c2.re - x[0].im * c2.im;
    let im = x[0].re * c2.im + x[0].im * c2.re;
    println!("arch[{}]:re = {}", ARCH, re);
    println!("arch[{}]:im = {}", ARCH, im);
    assert_eq!(re, -1.0);
    assert_eq!(im, 1.0); // <- fails on esp32
}

Output

arch[x86_64]:re = -1
arch[x86_64]:im = 1
..
arch[xtensa]:re = -1
arch[xtensa]:im = -1 <- WRONG

On the other hand, if I perform the same calculation without a struct, the correct values are produced.

pub fn test_compute_butterflies_narrow_v3() {
    let re1 = -1.;
    let im1 = -1.;
    let x = vec![re1, im1];

    std::hint::black_box(&x);

    let re2 = 0.0;
    let im2 = -1.0;

    let re = x[0] * re2 - x[1] * im2;
    let im = x[0] * im2 + x[1] * re2;
    println!("arch[{}]:re = {}", ARCH, re);
    println!("arch[{}]:im = {}", ARCH, im);

    assert_eq!(re, -1.0);
    assert_eq!(im, 1.0); // <- works on esp32
}

The combination of a struct inside a Vec together with the 'println!' or the compiler hint triggers the issue.

Summary

im = re1*im2 + im1*re2 = (-1)*(-1) + (-1)*0 = 1, but on the xtensa arch it comes out as -1.

@ramtej
Author

ramtej commented Jun 21, 2023

Is there anything else I can do? From my point of view, the issue is quite fundamental, and I'm honestly surprised that it hasn't been discovered before. A regression test, a fuzzer, or a quickcheck-style property (see the sketch below) should have found it. What can I do?
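
For instance, something along these lines might have caught it. This is only a sketch: it relies on IEEE-754 subtraction being sign-symmetric (z - x*y == -(x*y - z) exactly), and on #[inline(never)] keeping each multiply single-use so the fusing can trigger; the inputs are illustrative.

#[inline(never)]
fn sub_a(x: f32, y: f32, z: f32) -> f32 {
    z - x * y // one candidate shape for msub.s fusing
}

#[inline(never)]
fn sub_b(x: f32, y: f32, z: f32) -> f32 {
    x * y - z // the mirrored shape; must be the exact negation of sub_a
}

fn main() {
    // A handful of fixed inputs; a fuzzer or quickcheck would generate these.
    for &(x, y, z) in &[(0.99, 1.0, 0.0), (-1.0, -1.0, 0.0), (0.5, 3.0, 2.0)] {
        let (x, y, z) = (
            std::hint::black_box(x),
            std::hint::black_box(y),
            std::hint::black_box(z),
        );
        assert_eq!(sub_a(x, y, z), -sub_b(x, y, z)); // fails under the miscompilation
    }
}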

Thank you.

@Kurielu

Kurielu commented Jun 21, 2023

I was going to submit an issue, but this seems to be the same problem as the one I've got. So here is my example to reproduce the error:

use std::f32::consts::PI; // assuming the f32 constant, to match the printed precision

fn main() {
    let a = 4.215796;
    println!("{a}"); // print_1
    let pi_3 = PI / 3.0;
    println!("{pi_3}"); // print_2
    let c = 5.0 * pi_3;
    println!("{c}"); // print_3
    println!("should be the same and positive {} and {}", c - a, 5.0 * pi_3 - a);
}

Output:
"should be the same and positive 1.0201917 and -1.0201919"
Let's say this is +/- (there is only a problem with the sign) where it should be +/+
print_1 or print_2 commented out: result: +/+
print_3 commented out: -/-

rustc 1.70.0
esp-idf v5.0 24b9d38
using esp32s3

@ramtej
Author

ramtej commented Jun 21, 2023

I can reproduce it. That narrows it down even further, thanks!

https://rust.godbolt.org/ -> could we get further there, e.g. by looking at what the compiler actually generates? Any other ideas?

@Kurielu

Kurielu commented Jun 21, 2023

It is not my specialization, so I won't be able to help further.
Here is a minimal example that I've found:

fn main() -> ! {
    let a = 0.0;
    let p: f32 = 1.0;
    println!("{p}");
    println!("{a}");
    let v = 0.99 * p - a;
    println!("{}", v); // "-0.99"!
    loop {}
}

The neat thing about it is that we can "force" the correct sign by changing

let v = 0.99 * p - a;
// into
let v = 0.99 * p + a;

which gives a very small assembly difference:

2933c2933
< 42000751:	4aa980        	madd.s	f10, f9, f8
---
> 42000751:	5aa980        	msub.s	f10, f9, f8

@ramtej
Author

ramtej commented Jun 22, 2023

Hmm, are we about to invent new arithmetic? The problem is transitive, meaning it can occur in dependent crates as well.

@zRedShift

I can confirm I have this issue, and have had it multiple times in the past. Soft FP always fixes it, but I would like the performance, heh.

In my (current) case, it occurs when doing the final twiddle step in an RDFT:

fn twiddle<const N: usize>(twiddles: &[C; N], left: &mut [C; N], right: &mut [C; N]) {
    for ((&twiddle, left), right) in twiddles
        .iter()
        .zip(left.iter_mut())
        .zip(right.iter_mut().rev())
    {
        let sum = *left + *right;
        let diff = *left - *right;
        let twiddled_re_sum = sum * twiddle.re;
        let twiddled_im_sum = sum * twiddle.im;
        let twiddled_re_diff = diff * twiddle.re;
        let twiddled_im_diff = diff * twiddle.im;
        let half_sum_re = 0.5 * sum.re;
        let half_diff_im = 0.5 * diff.im;
        let output_twiddled_real = twiddled_re_sum.im + twiddled_im_diff.re;
        let output_twiddled_im = twiddled_im_sum.im - twiddled_re_diff.re;
        *left = C {
            re: half_sum_re + output_twiddled_real,
            im: half_diff_im + output_twiddled_im,
        };
        *right = C {
            re: half_sum_re - output_twiddled_real,
            im: output_twiddled_im - half_diff_im,
        };
    }
}

If I don't put logging in the middle, it will sometimes send the values to the wrong places (swapping the im values of left and right, for instance). This makes it apparent that the problem is in basic arithmetic (addition vs. subtraction), since the calculations here are anti-symmetric.

@zRedShift

@Kurielu @ramtej can you try to reproduce on https://github.com/esp-rs/rust-build/releases/tag/v1.71.0.0 ? I don't have any reason to suspect it got fixed by itself, but who knows.

@zRedShift

I can't check on a real esp32s3 right now, but from reading the compiled assembly for

#[inline(never)]
fn math(x: f32, y: f32) -> f32 {
    0.99 * y - x
}

From what I can see it translates to

42007488 <_ZN17fp_miscompilation4math17h29f7850c19794fe1E.llvm.13748823926208495174>:
42007488:       004136          entry   a1, 32
4200748b:       e49281          l32r    a8, 420006d4 <__default_double_exception+0x1c875e0>
4200748e:       fa8850          wfr     f8, a8
42007491:       fa9350          wfr     f9, a3
42007494:       faa250          wfr     f10, a2
42007497:       5aa980          msub.s  f10, f9, f8
4200749a:       fa2a40          rfr     a2, f10
4200749d:       f01d            retw.n

Read 0.99 into a8
Load a8 into f8, an FP register
Load a3 (y) into f9
Load a2 (x) into f10
Write (f10 - (f9 * f8)) to f10
Write f10 to a2 and return

So it returns x - y * 0.99 instead of 0.99 * y - x.

I'll check that the madd.s/msub.s instruction encoding of the registers isn't flipped in XtensaInstrInfo.td

@Kurielu

Kurielu commented Jul 12, 2023

Still the same.
Maybe this will be a clue for you.

#[inline(never)]
fn math(x: f32, y: f32) -> f32 {
    0.99 * y - x
}

#[entry]
fn main() -> ! {
    let a = 0.0;
    let p = 0.99f32;
    let v = math(a, p);
    println!("{}", v);
    loop {}
}

#[inline(never)] breaks math right away without the need to print any variable.
With #[inline(always)] we get a positive number.

@zRedShift

In my case it breaks regardless.
The instruction definition seems correct, but something else bothers me.

msub.s works like fr - (fs * ft); there is no variant computing (fs * ft) - fr, so it makes sense that the instruction is used this way, but a negation step seems to be missing:

neg.s f10, f10
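
Restated in Rust terms (an illustration of the analysis above, not actual source code or codegen):

// What the backend effectively emits for `0.99 * y - x` today:
fn emitted(x: f32, y: f32) -> f32 {
    x - 0.99 * y // msub.s alone: fr - (fs * ft)
}

// What a correct lowering corresponds to:
fn intended(x: f32, y: f32) -> f32 {
    -(x - 0.99 * y) // msub.s followed by neg.s
}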

@zRedShift

#[inline(never)]
fn math(x: f32, y: f32) -> f32 {
    unsafe { core::intrinsics::fmaf32(0.99, y, -x) }
}

Generates

42007488 <_ZN17fp_miscompilation4math17h29f7850c19794fe1E.llvm.6984577647175520593>:
42007488:       004136          entry   a1, 32
4200748b:       e33b81          l32r    a8, 42000178 <__default_double_exception+0x1c87084>
4200748e:       308280          xor     a8, a2, a8
42007491:       fa8850          wfr     f8, a8
42007494:       e49081          l32r    a8, 420006d4 <__default_double_exception+0x1c875e0>
42007497:       fa9850          wfr     f9, a8
4200749a:       faa350          wfr     f10, a3
4200749d:       4a8a90          madd.s  f8, f10, f9
420074a0:       fa2840          rfr     a2, f8
420074a3:       f01d            retw.n

Xor top bit of x to negate it, and write it to f8
calculate f8 + (y * 0.99)
Which is correct.
Checking LLVM DAG lowering code, specifically performMADD_MSUBCombine

@zRedShift
Copy link

@ramtej @MabezDev @Kurielu
Just an update: I've determined that the problem is indeed in performMADD_MSUBCombine. It treats x * y ± z and z ± x * y symmetrically, basically as equivalent, which is obviously a problem when the operator is minus.
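
In other words (illustrative identities):

x*y + z == z + x*y       addition commutes, so fusing either form to madd.s is fine
x*y - z == -(z - x*y)    subtraction anti-commutes, so lowering both forms to the
                         same msub.s pattern flips the sign of one of them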

This code seems to have been copied with minor modifications from the MIPS backend, which doesn't have a fused multiply-subtract instruction, so it handles only the FMA case.

Working on a fix that I will submit as a PR to espressif/llvm-project shortly.

@ramtej
Author

ramtej commented Jul 13, 2023

Thanks a lot - you are a hero :-)

@MabezDev
Member

Sorry folks, I've been on vacation. Thanks for looking into this, and thank you @zRedShift for figuring out a solution! I'll make sure we get that LLVM PR reviewed ASAP and cut a patch release of the Rust compiler.

@zRedShift

I tried my FFT twiddling code from earlier with a newly built version of 1.71.0.1 with my branch of llvm (rust-esp), and it gives the correct results, so all good on that point.

And I finally understand why the floating-point values sometimes became correct when printing them (a heisenbug that eluded me for a while, since this miscompilation isn't new). The function performMADD_MSUBCombine checks whether the multiplication result has exactly one use (that is, nothing else cares about the product by itself), and only then does the fusing. I guess that when printing the intermediate values, the print became a second use of the product, and thus the fusing was skipped.
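
A sketch of that use-count condition in Rust (illustrative, assuming the behavior described above):

fn one_use(x: f32, y: f32) -> f32 {
    0.99 * y - x // the product has a single use -> eligible for the (buggy) msub fusing
}

fn two_uses(x: f32, y: f32) -> f32 {
    let p = 0.99 * y;
    println!("{p}"); // printing is a second use of the product...
    p - x // ...so performMADD_MSUBCombine bails out and the result stays correct
}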

@MabezDev
Member

The esp 1.72 release is now available as a pre-release, and includes @zRedShift's LLVM patches: https://github.com/esp-rs/rust-build/releases/tag/v1.72.0.0

@ramtej
Author

ramtej commented Aug 24, 2023

Hello, I can confirm this: my application now also works with hardware FP, and I measure a performance increase of ~3.6x. Thanks again for the support. I will close this issue now. See you all at the next ticket/issue, lol.

Regards,
Jiri

@ramtej ramtej closed this as completed Aug 24, 2023