Performance regression in pattern matching #36283

pmarcelll · 2016-09-05T14:04:44Z

I discovered the cause of a fairly big performance regression (from ~22s to ~29s in a specific benchmark) in my small BF interpreter and I succesfully minified the code:

#[derive(Clone, Copy)]
enum Inner {
    A,
    B,
    C,
}

enum Outer {
    X(u32),
    Y(Inner)
}

#[inline(never)]
fn print(x: &Outer) {
    match *x {
        Outer::X(num) => println!("{:?}", num),
        // Replace this...
        Outer::Y(i) => {
            match i {
                Inner::A => println!("A"),
                Inner::B => println!("B"),
                Inner::C => println!("C"),
            }
        }
        // ...with this
        // Outer::Y(ref i) => {
        //     match *i {
        //         Inner::A => println!("A"),
        //         Inner::B => println!("B"),
        //         Inner::C => println!("C"),
        //     }
        // }
    }
}

fn main(){
    let x = Outer::Y(Inner::B);

    print(&x);
}

On stable, both version of this code compiles to the exact same binary (and also on 1.9.0 and 1.10.0, both old-trans and MIR on each version respectively). On beta and nightly, the commented out code becomes slower. My understanding is that this should compile to the same binary, so this is incorrect behavior.

The last correct version is rustc 1.12.0-nightly (7333c4ac2 2016-07-31), right before the update to LLVM 3.9.

EDIT: I noticed that the minified example does compile to the same binary (I somehow got confused) so I added the #[inline(never)].

The text was updated successfully, but these errors were encountered:

Aatch · 2016-09-05T21:49:24Z

The timing suggests this might be an LLVM issue. We're getting more and more of these lately, which is concerning.

brson · 2016-09-13T16:44:06Z

Needs a P-tag.

nikomatsakis · 2016-09-13T16:47:35Z

@nagisa also theorized that this was an LLVM problem (aliasing?). Do we have an easy way to test for that?

nikomatsakis · 2016-09-15T16:18:08Z

Hmm. So I popped these examples into play and built the resulting assembly (on nightly, with optimizations enabled). They came out pretty similar:

lunch-box. diff ~/tmp/print-{ok,ref}
1c1
< .text
---
>     .text
12a13
>     andb    $3, %sil

nikomatsakis · 2016-09-15T16:24:17Z

Same results for beta:

--- /home/nmatsakis/tmp/print-ok-beta   2016-09-15 12:22:36.604531321 -0400
+++ /home/nmatsakis/tmp/print-ref-beta  2016-09-15 12:23:04.984530867 -0400
@@ -10,6 +10,7 @@
     .cfi_def_cfa_offset 80
     cmpb    $1, %dil
     jne    .LBB0_1
+    andb    $3, %sil
     cmpb    $1, %sil
     je    .LBB0_7
     cmpb    $2, %sil

nikomatsakis · 2016-09-15T16:24:48Z

Is it possible we fixed this? I know we did some LLVM fixing.

cc @eddyb, I think you managed most of that.

nikomatsakis · 2016-09-15T16:26:39Z

That said, stable builds are identical. Though I find it hard to imagine that the andb is really responsible for a massive performance drop. (But perhaps that causes problems in some larger scale scenario?)

eddyb · 2016-09-15T16:29:30Z

It could easily be something else, identical binaries are no guarantee, so the reduction could be unrelated.
Using perf record on the original benchmark could pinpoint that an andb instruction is responsible.

EDIT: as for LLVM, nothing I've seen would suggest any sort of improvements with switch.

nikomatsakis · 2016-09-15T16:29:51Z

Also, stable vs nightly look quite different, even if changing between the two versions of the arms makes no little difference.

nikomatsakis · 2016-09-15T16:33:14Z

@pmarcelll how confident are you that this reduction is accurate? can you make a version of the original benchmark available?

pmarcelll · 2016-09-15T16:51:41Z

Here's the code: link.
This BF code is used for benchmarking.
EDIT: the slow version is in a comment

pmarcelll · 2016-09-15T16:58:11Z

The problematic part is in the hottest code path, so the slowdown is really noticable.

pmarcelll · 2016-09-15T17:08:14Z

Also, if the enum is not Copy or Clone, the ref version must be used, that compiles to the slightly slower version.

nrc · 2016-09-15T20:13:51Z

Discussed at compiler team meeting, but we didn't have a good answer for priority since investigation seems to be ongoing. P-high for now, feel free to downgrade if we get new info.

pnkfelix · 2016-09-20T16:21:01Z

Results of some investigation: I can reproduce a regression for this benchmark at the same point in the history, though its not as severe a regression.

Gist with the (slightly modified, self-contained) bf interpreter benchmark, and the transcript + summary of my runs: https://gist.github.com/pnkfelix/9e727884890b1ec867d58bd280cbbf82

Interestingly, other small perturbations also remove the regression. In particular, even just pulling the body of that match arm into a separate method and marking that method #[inline(always)] also removes the regression.

eddyb · 2016-09-20T16:39:11Z

If anyone has access to LLVM 3.8 that they can build Rust with, it would help to confirm that this is yet another LLVM 3.9 perf regression.
AFAICT, what #[inline(always)] ends up doing at the LLVM level is change the order and number of passes ran on the IR of the callee.

MagaTailor · 2016-09-20T17:48:00Z

@eddyb You're in luck today! I finished building the same aarch64 nightly a little earlier against 3.9 and 3.8, and can confirm the difference: 2m41s vs 2m15s respectively.

BTW, make install doesn't work any longer, probably due to --disable-docs and this line.

eddyb · 2016-09-20T17:56:00Z

@petevine Really useful, thanks! And so LLVM 3.9 delivers, yet another perf regression.

pnkfelix · 2016-09-21T14:35:47Z

As a followup note regarding the gist I posted: I have done a little digging into the generated assembly for the Tape::run code, comparing the manually inlined to the LLVM-inlined codes (since those seem to end up being the most similar in output while retaining the bulk of the regression), and the main difference between the code for the two versions of Tape::run seems quite similar to the one that @nikomatsakis noted above.

manually inlined (slow)              |LLVM-inlined (fast)
---------------------------------------------------------------------------
.LBB7_2:                             |.LBB7_2:
        cmpb    $1, (%r14)           |        cmpb    $1, (%r14)
        je      .LBB7_5              |        je      .LBB7_31
        movb    4(%r14), %al         |        movl    4(%r14), %eax
        andb    $31, %al             |        addl    $-45, %eax
        addb    $-13, %al            |        cmpl    $17, %eax
        movzbl  %al, %eax            |        ja      .LBB7_5
        cmpb    $17, %al             |        movslq  (%r15,%rax,4), %rax
        ja      .LBB7_8              |        addq    %r15, %rax
        movslq  (%r15,%rax,4), %rax  |        jmpq    *%rax
        addq    %r15, %rax           |.LBB7_7:
        jmpq    *%rax                | 
.LBB7_12:                            |

In particular: the slower version on the left has an extra andb instruction. (It also has that movzbl and is using the b suffixed instructions rather than l suffixed ones.)

Anyway, at this point I'm inclined to agree with @eddyb that this is probably an LLVM issue, not a Rust compiler one. We could investigate the order and number of passes, or we could try to file an LLVM bug, but either way, I think we should downgrade the priority of this ticket.

nikomatsakis · 2016-09-21T16:56:13Z

Inclined to agree.

triage: P-medium

pnkfelix · 2016-09-22T16:32:13Z

this is probably an LLVM issue, not a Rust compiler one. We could investigate the order and number of passes, or we could try to file an LLVM bug

I think the path forward is to file an LLVM bug.

Next steps:

Create isolated test case suitable for reporting to LLVM
File bug there, ask LLVM devs for assistance

@eddyb notes that majnemer is often willing to do bisect against LLVM code base on small self-contained test case.

nagisa · 2016-09-23T16:24:02Z

Filled https://llvm.org/bugs/show_bug.cgi?id=30505

Boscop · 2017-08-30T20:42:57Z

Any update on this? :)

pmarcelll · 2017-09-01T22:34:09Z

It wasn't fixed in LLVM 4.0, but it might be fixed in 5.0, I'll check when the support is merged. The ergonomics improvements in pattern matching might also improve on this.

Boscop · 2018-04-22T17:26:45Z

Is it fixed now? :)

pmarcelll · 2018-04-24T15:20:26Z

I tried it a couple of days ago with the BF interpreter I linked earlier and it was slow no matter what compiler flag I used. As I wrote in the first comment, the benchmark completed just under 22 seconds, now it ran for 30 seconds with only a --release flag, and I couldn't push it under 25 seconds with other compiler flags. I didn't have time to further investigate, but I reran the tests today and I noticed that the nightly compiler produced identical binaries for both the slow and the fast version, so I put the code into the Godbolt assembly analyzer. According to the analysis, rustc 1.23.0 was the last compiler that produced different binaries, after that, the fast version regressed, and now it's identical to the slow version.

Boscop · 2018-09-30T17:33:58Z

Any update on this?

pmarcelll · 2018-10-01T23:55:34Z

I tried it when rustc was updated to LLVM 8 (so fairly recently), and now with the latest nightly. A plain release build produces a binary that completes the benchmark just under 27 second on my machine, with LTO it's close to 26, but reducing the number of codegen units or aborting on panic negatively impacts performance, it confuses the optimizer in LLVM somehow (it's not really a bug, fine-tuning an optimizer is hard), and because this program is basically a loop that performs a small set of operations over and over, every missed optimization is magnified.

All in all, rustc became slightly better recently, so the plain release version is fast, but this bug still slows down the test program.

pmarcelll · 2018-12-28T16:41:15Z

I just checked it on nightly and it seems to be fixed (even on the latest beta). I don't know what changed, but I looked at the assembly on Godbolt and the previously slow part is now equivalent to the old version.

I even managed to get the lowest runtime so far, only 19.543 seconds, with
cargo rustc --release -- -Clto=fat -Ccodegen-units=1 -Ctarget-cpu=sandybridge -Cpanic=abort (for some reason, usually Rust binaries optimized for Sandy Bridge by LLVM run really fast on Haswell CPUs).

If this regression is unlikely to return and it was indeed fixed in LLVM then it can be closed.

pmarcelll · 2018-12-29T18:47:33Z

I tried to bisect it, my program became fast again right after #55835 was merged, so it looks like it was indeed fixed by an LLVM upgrade. @nagisa can you confirm it by checking the LLVM bug report? Thanks.

nikic · 2018-12-29T23:08:10Z

The LLVM issue has been fixed by llvm-mirror/llvm@238b886, which has tests and comments to prevent this from regressing without corresponding backend fixes. As such, I think we can consider this resolved...

pmarcelll · 2018-12-30T01:27:48Z

@nikic I was thinking about the bug report in #36283 (comment), it looks like that one can also be closed.

pmarcelll changed the title ~~Regression in pattern maching~~ Regression in pattern matching Sep 5, 2016

pnkfelix changed the title ~~Regression in pattern matching~~ Performance regression in pattern matching Sep 5, 2016

sfackler added regression-from-stable-to-beta Performance or correctness regression from stable to beta. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. labels Sep 5, 2016

brson added I-slow Issue: Problems and improvements with respect to performance of generated code. A-LLVM Area: Code generation parts specific to LLVM. Both correctness bugs and optimization-related issues. I-nominated labels Sep 13, 2016

nrc added P-high High priority and removed I-nominated labels Sep 15, 2016

nrc assigned nikomatsakis Sep 15, 2016

rust-highfive added P-medium Medium priority and removed P-high High priority labels Sep 21, 2016

pnkfelix added regression-from-stable-to-stable Performance or correctness regression from one stable version to another. and removed regression-from-stable-to-beta Performance or correctness regression from stable to beta. labels Sep 29, 2016

nikomatsakis removed their assignment Oct 4, 2016

Mark-Simulacrum added C-bug Category: This is a bug. C-enhancement Category: An issue proposing an enhancement or a PR with one. and removed C-bug Category: This is a bug. labels Jul 26, 2017

nagisa mentioned this issue Nov 6, 2017

Rust generates guard instructions around match statements #45792

Closed

nikic closed this as completed Dec 29, 2018

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Performance regression in pattern matching #36283

Performance regression in pattern matching #36283

pmarcelll commented Sep 5, 2016 •

edited

Loading

Aatch commented Sep 5, 2016

brson commented Sep 13, 2016

nikomatsakis commented Sep 13, 2016

nikomatsakis commented Sep 15, 2016 •

edited

Loading

nikomatsakis commented Sep 15, 2016

nikomatsakis commented Sep 15, 2016

nikomatsakis commented Sep 15, 2016 •

edited

Loading

eddyb commented Sep 15, 2016 •

edited

Loading

nikomatsakis commented Sep 15, 2016 •

edited

Loading

nikomatsakis commented Sep 15, 2016

pmarcelll commented Sep 15, 2016 •

edited

Loading

pmarcelll commented Sep 15, 2016

pmarcelll commented Sep 15, 2016

nrc commented Sep 15, 2016

pnkfelix commented Sep 20, 2016

eddyb commented Sep 20, 2016

MagaTailor commented Sep 20, 2016 •

edited

Loading

eddyb commented Sep 20, 2016

pnkfelix commented Sep 21, 2016 •

edited

Loading

nikomatsakis commented Sep 21, 2016

pnkfelix commented Sep 22, 2016

nagisa commented Sep 23, 2016

Boscop commented Aug 30, 2017

pmarcelll commented Sep 1, 2017

Boscop commented Apr 22, 2018

pmarcelll commented Apr 24, 2018

Boscop commented Sep 30, 2018

pmarcelll commented Oct 1, 2018

pmarcelll commented Dec 28, 2018

pmarcelll commented Dec 29, 2018

nikic commented Dec 29, 2018 •

edited

Loading

pmarcelll commented Dec 30, 2018

Performance regression in pattern matching #36283

Performance regression in pattern matching #36283

Comments

pmarcelll commented Sep 5, 2016 • edited Loading

Aatch commented Sep 5, 2016

brson commented Sep 13, 2016

nikomatsakis commented Sep 13, 2016

nikomatsakis commented Sep 15, 2016 • edited Loading

nikomatsakis commented Sep 15, 2016

nikomatsakis commented Sep 15, 2016

nikomatsakis commented Sep 15, 2016 • edited Loading

eddyb commented Sep 15, 2016 • edited Loading

nikomatsakis commented Sep 15, 2016 • edited Loading

nikomatsakis commented Sep 15, 2016

pmarcelll commented Sep 15, 2016 • edited Loading

pmarcelll commented Sep 15, 2016

pmarcelll commented Sep 15, 2016

nrc commented Sep 15, 2016

pnkfelix commented Sep 20, 2016

eddyb commented Sep 20, 2016

MagaTailor commented Sep 20, 2016 • edited Loading

eddyb commented Sep 20, 2016

pnkfelix commented Sep 21, 2016 • edited Loading

nikomatsakis commented Sep 21, 2016

pnkfelix commented Sep 22, 2016

nagisa commented Sep 23, 2016

Boscop commented Aug 30, 2017

pmarcelll commented Sep 1, 2017

Boscop commented Apr 22, 2018

pmarcelll commented Apr 24, 2018

Boscop commented Sep 30, 2018

pmarcelll commented Oct 1, 2018

pmarcelll commented Dec 28, 2018

pmarcelll commented Dec 29, 2018

nikic commented Dec 29, 2018 • edited Loading

pmarcelll commented Dec 30, 2018

pmarcelll commented Sep 5, 2016 •

edited

Loading

nikomatsakis commented Sep 15, 2016 •

edited

Loading

nikomatsakis commented Sep 15, 2016 •

edited

Loading

eddyb commented Sep 15, 2016 •

edited

Loading

nikomatsakis commented Sep 15, 2016 •

edited

Loading

pmarcelll commented Sep 15, 2016 •

edited

Loading

MagaTailor commented Sep 20, 2016 •

edited

Loading

pnkfelix commented Sep 21, 2016 •

edited

Loading

nikic commented Dec 29, 2018 •

edited

Loading