Skip to content

Commit

Permalink
Optimise Entity with repr align & manual PartialOrd/Ord (bevyen…
Browse files Browse the repository at this point in the history
…gine#10558)

# Objective

- Follow up on bevyengine#10519, diving
deeper into optimising `Entity` due to the `derive`d `PartialOrd`
`partial_cmp` not being optimal with codegen:
rust-lang/rust#106107
- Fixes bevyengine#2346.

## Solution

Given the previous PR's solution and the other existing LLVM codegen
bug, there seemed to be a potential further optimisation possible with
`Entity`. In exploring providing manual `PartialOrd` impl, it turned out
initially that the resulting codegen was not immediately better than the
derived version. However, once `Entity` was given `#[repr(align(8)]`,
the codegen improved remarkably, even more once the fields in `Entity`
were rearranged to correspond to a `u64` layout (Rust doesn't
automatically reorder fields correctly it seems). The field order and
`align(8)` additions also improved `to_bits` codegen to be a single
`mov` op. In turn, this led me to replace the previous
"non-shortcircuiting" impl of `PartialEq::eq` to use direct `to_bits`
comparison.

The result was remarkably better codegen across the board, even for
hastable lookups.

The current baseline codegen is as follows:
https://godbolt.org/z/zTW1h8PnY

Assuming the following example struct that mirrors with the existing
`Entity` definition:

```rust
#[derive(Clone, Copy, Eq, PartialEq, PartialOrd, Ord)]
pub struct FakeU64 {
    high: u32,
    low: u32,
}
```

the output for `to_bits` is as follows:

```
example::FakeU64::to_bits:
        shl     rdi, 32
        mov     eax, esi
        or      rax, rdi
        ret
```

Changing the struct to:
```rust
#[derive(Clone, Copy, Eq)]
#[repr(align(8))]
pub struct FakeU64 {
    low: u32,
    high: u32,
}
```
and providing manual implementations for `PartialEq`/`PartialOrd`/`Ord`,
`to_bits` now optimises to:
```
example::FakeU64::to_bits:
        mov     rax, rdi
        ret
```
The full codegen example for this PR is here for reference:
https://godbolt.org/z/n4Mjx165a

To highlight, `gt` comparison goes from
```
example::greater_than:
        cmp     edi, edx
        jae     .LBB3_2
        xor     eax, eax
        ret
.LBB3_2:
        setne   dl
        cmp     esi, ecx
        seta    al
        or      al, dl
        ret
```
to
```
example::greater_than:
        cmp     rdi, rsi
        seta    al
        ret
```

As explained on Discord by @scottmcm :

>The root issue here, as far as I understand it, is that LLVM's
middle-end is inexplicably unwilling to merge loads if that would make
them under-aligned. It leaves that entirely up to its target-specific
back-end, and thus a bunch of the things that you'd expect it to do that
would fix this just don't happen.

## Benchmarks

Before discussing benchmarks, everything was tested on the following
specs:

AMD Ryzen 7950X 16C/32T CPU
64GB 5200 RAM
AMD RX7900XT 20GB Gfx card
Manjaro KDE on Wayland

I made use of the new entity hashing benchmarks to see how this PR would
improve things there. With the changes in place, I first did an
implementation keeping the existing "non shortcircuit" `PartialEq`
implementation in place, but with the alignment and field ordering
changes, which in the benchmark is the `ord_shortcircuit` column. The
`to_bits` `PartialEq` implementation is the `ord_to_bits` column. The
main_ord column is the current existing baseline from `main` branch.


![Screenshot_20231114_132908](https://github.com/bevyengine/bevy/assets/3116268/cb9090c9-ff74-4cc5-abae-8e4561332261)

My machine is not super set-up for benchmarking, so some results are
within noise, but there's not just a clear improvement between the
non-shortcircuiting implementation, but even further optimisation taking
place with the `to_bits` implementation.

On my machine, a fair number of the stress tests were not showing any
difference (indicating other bottlenecks), but I was able to get a clear
difference with `many_foxes` with a fox count of 10,000:

Test with `cargo run --example many_foxes --features
bevy/trace_tracy,wayland --release -- --count 10000`:


![Screenshot_20231114_144217](https://github.com/bevyengine/bevy/assets/3116268/89bdc21c-7209-43c8-85ae-efbf908bfed3)

On avg, a framerate of about 28-29FPS was improved to 30-32FPS. "This
trace" represents the current PR's perf, while "External trace"
represents the `main` branch baseline.

## Changelog

Changed: micro-optimized Entity align and field ordering as well as
providing manual `PartialOrd`/`Ord` impls to help LLVM optimise further.

## Migration Guide

Any `unsafe` code relying on field ordering of `Entity` or sufficiently
cursed shenanigans should change to reflect the different internal
representation and alignment requirements of `Entity`.

Co-authored-by: james7132 <[email protected]>
Co-authored-by: NathanW <[email protected]>
  • Loading branch information
3 people authored and Ray Redondo committed Jan 9, 2024
1 parent 7e6d30a commit 367ec8a
Showing 1 changed file with 48 additions and 2 deletions.
50 changes: 48 additions & 2 deletions crates/bevy_ecs/src/entity/mod.rs
Original file line number Diff line number Diff line change
Expand Up @@ -115,9 +115,17 @@ type IdCursor = isize;
/// [`EntityCommands`]: crate::system::EntityCommands
/// [`Query::get`]: crate::system::Query::get
/// [`World`]: crate::world::World
#[derive(Clone, Copy, Eq, Ord, PartialOrd)]
#[derive(Clone, Copy)]
// Alignment repr necessary to allow LLVM to better output
// optimised codegen for `to_bits`, `PartialEq` and `Ord`.
#[repr(C, align(8))]
pub struct Entity {
// Do not reorder the fields here. The ordering is explicitly used by repr(C)
// to make this struct equivalent to a u64.
#[cfg(target_endian = "little")]
index: u32,
generation: u32,
#[cfg(target_endian = "big")]
index: u32,
}

Expand All @@ -126,7 +134,42 @@ pub struct Entity {
impl PartialEq for Entity {
#[inline]
fn eq(&self, other: &Entity) -> bool {
(self.generation == other.generation) & (self.index == other.index)
// By using `to_bits`, the codegen can be optimised out even
// further potentially. Relies on the correct alignment/field
// order of `Entity`.
self.to_bits() == other.to_bits()
}
}

impl Eq for Entity {}

// The derive macro codegen output is not optimal and can't be optimised as well
// by the compiler. This impl resolves the issue of non-optimal codegen by relying
// on comparing against the bit representation of `Entity` instead of comparing
// the fields. The result is then LLVM is able to optimise the codegen for Entity
// far beyond what the derive macro can.
// See <https://github.com/rust-lang/rust/issues/106107>
impl PartialOrd for Entity {
#[inline]
fn partial_cmp(&self, other: &Self) -> Option<std::cmp::Ordering> {
// Make use of our `Ord` impl to ensure optimal codegen output
Some(self.cmp(other))
}
}

// The derive macro codegen output is not optimal and can't be optimised as well
// by the compiler. This impl resolves the issue of non-optimal codegen by relying
// on comparing against the bit representation of `Entity` instead of comparing
// the fields. The result is then LLVM is able to optimise the codegen for Entity
// far beyond what the derive macro can.
// See <https://github.com/rust-lang/rust/issues/106107>
impl Ord for Entity {
#[inline]
fn cmp(&self, other: &Self) -> std::cmp::Ordering {
// This will result in better codegen for ordering comparisons, plus
// avoids pitfalls with regards to macro codegen relying on property
// position when we want to compare against the bit representation.
self.to_bits().cmp(&other.to_bits())
}
}

Expand Down Expand Up @@ -197,6 +240,7 @@ impl Entity {
/// In general, one should not try to synchronize the ECS by attempting to ensure that
/// `Entity` lines up between instances, but instead insert a secondary identifier as
/// a component.
#[inline]
pub const fn from_raw(index: u32) -> Entity {
Entity {
index,
Expand All @@ -210,13 +254,15 @@ impl Entity {
/// for serialization between runs.
///
/// No particular structure is guaranteed for the returned bits.
#[inline(always)]
pub const fn to_bits(self) -> u64 {
(self.generation as u64) << 32 | self.index as u64
}

/// Reconstruct an `Entity` previously destructured with [`Entity::to_bits`].
///
/// Only useful when applied to results from `to_bits` in the same instance of an application.
#[inline(always)]
pub const fn from_bits(bits: u64) -> Self {
Self {
generation: (bits >> 32) as u32,
Expand Down

0 comments on commit 367ec8a

Please sign in to comment.