Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimize ReorderGlobals ordering #6625

Merged
merged 84 commits into from
May 31, 2024
Merged

Optimize ReorderGlobals ordering #6625

merged 84 commits into from
May 31, 2024

Conversation

kripken
Copy link
Member

@kripken kripken commented May 22, 2024

The old ordering in that pass did a topological sort while sorting by uses
both within topological groups and between them. That could be unoptimal
in some cases, however, and actually on J2CL output this pass made the
binary larger, which is how we noticed this.

The problem is that such a toplogical sort keeps topological groups in
place, but it can be useful to interleave them sometimes. Imagine this:

     $c - $a
    /
  $e
    \
     $d - $b

Here $e depends on $c, etc. The optimal order may interleave the two
arms here, e.g. $a, $b, $d, $c, $e. That is because the dependencies define
a partial order, and so the arms here are actually independent.

Sorting by toplogical depth first might help in some cases, but also is not
optimal in general, as we may want to mix toplogical depths:
$a, $c, $b, $d, $e does so, and it may be the best ordering.

This PR implements a natural greedy algorithm that picks the global with
the highest use count at each step, out of the set of possible globals, which
is the set of globals that have no unresolved dependencies. So we start by
picking the first global with no dependencies and add at at the front; then
that unlocks anything that depended on it and we pick from that set, and
so forth.

This may also not be optimal, but it is easy to make it more flexible by
customizing the counts, and we consider 4 sorts here:

  1. Set all counts to 0. This means we only take into account dependencies,
    and we break ties by the original order, so this is as close to the original
    order as we can be.
  2. Use the actual use counts. This is the simple greedy algorithm.
  3. Set the count of each global to also contain the counts of its children,
    so the count is the total that might be unlocked. This gives more weight
    to globals that can unlock more later, so it is less greedy.
  4. As 3, but weight children's counts lower in an exponential way, which
    makes sense as they may depend on other globals too.

In practice it is simple to generate cases where 1, 2, or 3 is optimal (see
new tests), but on real-world J2CL I see that 4 (with a particular exponential
coefficient) is best, so the pass computes all 4 and picks the best. As a
result it will never worsen the size and it has a good chance of
improving.

The differences between these are small, so in theory we could pick any
of them, but given they are all modifications of a single algorithm it is
very easy to compute them all with little code complexity.

Some data on J2CL:

Method Size (less is better)
Greedy 4338066
Original order 4336535
Old (before this PR) 4336518
Sum 4336503
Exponential (0.5) 4336373
Exponential (PR's value) 4336127

There is a slight runtime cost to this: J2CL goes from 0.9666 to 1.1351
seconds. As this is one of our faster passes the slight slowdown seems
worth it in return for the guarantee to never increase size, and the small
improvement.

@kripken kripken requested a review from tlively May 22, 2024 20:05
@gkdn
Copy link
Contributor

gkdn commented May 22, 2024

Interesting problem! (cc @rluble)

The number differences here are small; makes me wonder if magic-import that pushes higher indices benefit more.

Copy link
Member

@tlively tlively left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Was there a test for which the exponential sort was optimal?

// each one moves, which is logically a mapping between indices.
using IndexIndexMap = std::vector<Index>;

// We will also track counts of uses for each global.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might be worth briefly explaining why we use a double here.

src/passes/ReorderGlobals.cpp Show resolved Hide resolved
src/passes/ReorderGlobals.cpp Outdated Show resolved Hide resolved
src/passes/ReorderGlobals.cpp Outdated Show resolved Hide resolved
src/passes/ReorderGlobals.cpp Outdated Show resolved Hide resolved
src/passes/ReorderGlobals.cpp Show resolved Hide resolved
src/passes/ReorderGlobals.cpp Outdated Show resolved Hide resolved
src/passes/ReorderGlobals.cpp Outdated Show resolved Hide resolved
src/passes/ReorderGlobals.cpp Show resolved Hide resolved
src/passes/ReorderGlobals.cpp Outdated Show resolved Hide resolved
@kripken
Copy link
Member Author

kripken commented May 30, 2024

Was there a test for which the exponential sort was optimal?

Sadly no. It's very hard to make such a test, as it would need very many deep dependency chains (on which it pays to take them into account a little, but not too much).

(I would not include code for that in the pass, except that it ends up as just another constant value, and we do measure the sizes, so it seems safe.)

src/passes/ReorderGlobals.cpp Outdated Show resolved Hide resolved
@kripken kripken merged commit f8086ad into WebAssembly:main May 31, 2024
13 checks passed
@kripken kripken deleted the globses branch May 31, 2024 21:56
@gkdn gkdn mentioned this pull request Aug 31, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants