Add a pass to remove temporary regs #8732

Merged: 11 commits into hrydgard:ir-jit on May 13, 2016

Conversation

@unknownbrackets (Collaborator)

I didn't see much benefit from this performance-wise, unfortunately. It does trigger and reduce code size sometimes, so I think it's still worthwhile.

The three most common types of things I saw this catch were (a rough sketch of the detection idea follows the list):

  1. ins temporaries, e.g. when inserting zero, so all the temporaries fold out. This prevents TEMP0 from even being stored in this case.
  2. lw/sw/etc. temporaries, e.g. li a0, addr1; lw t0, 0(a0); li a0, addr2; lw t1, 0(a0), etc.
  3. Unused loads. Usually loads right next to each other, or sometimes a load right after a store, unused and clobbered later. Most might've been branching things; at least one definitely looked like a compiler bug / zero-optimizations thing.
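
For case 1 especially, the detection is a straightforward forward scan. Here's a minimal sketch of the idea (illustrative, not the pass as merged): a side-effect-free write to a temporary can be dropped when the register is clobbered again before any read and before any exit. The temp-register range and the purity whitelist below are placeholders, not PPSSPP's real values.

// Hypothetical temp-register range; substitute the IR's real temp indices.
constexpr int kFirstTemp = 192, kLastTemp = 195;

bool PurgeTempsSketch(const IRWriter &in, IRWriter &out) {
    for (u32 value : in.GetConstants())
        out.AddConstant(value);

    const auto &insts = in.GetInstructions();
    int n = (int)insts.size();
    for (int i = 0; i < n; i++) {
        const IRInst &inst = insts[i];
        int dest = IRDestGPR(inst);
        // Illustrative purity whitelist; a real pass would check IRMeta flags.
        bool pure = inst.op == IROp::Mov || inst.op == IROp::AndConst ||
                    inst.op == IROp::Shl;
        bool dead = false;
        if (pure && dest >= kFirstTemp && dest <= kLastTemp) {
            for (int j = i + 1; j < n; j++) {
                const IRInst &later = insts[j];
                if ((GetIRMeta(later.op)->flags & IRFLAG_EXIT) != 0)
                    break;  // could be observed after the exit, keep it
                if (IRReadsFromGPR(later, dest))
                    break;  // the value is used, keep it
                if (IRDestGPR(later) == dest) {
                    dead = true;  // clobbered before any read, drop it
                    break;
                }
            }
        }
        if (!dead)
            out.Write(inst);
    }
    return false;
}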

-[Unknown]

@unknownbrackets (Collaborator, Author)

This also works, although maybe reducing load width could make other optimizations (like load combining) harder. In some ways we might even want to expand load widths when the result is masked anyway.

bool ReduceLoads(const IRWriter &in, IRWriter &out) {
    for (u32 value : in.GetConstants()) {
        out.AddConstant(value);
    }

    // This tells us to skip an AND op that has been optimized out.
    // Maybe we could skip multiple, but that'd slow things down and is pretty uncommon.
    int nextSkip = -1;

    bool logBlocks = false;
    for (int i = 0, n = (int)in.GetInstructions().size(); i < n; i++) {
        IRInst inst = in.GetInstructions()[i];

        if (inst.op == IROp::Load32 || inst.op == IROp::Load16 || inst.op == IROp::Load16Ext) {
            int dest = IRDestGPR(inst);
            for (int j = i + 1; j < n; j++) {
                const IRInst &laterInst = in.GetInstructions()[j];
                const IRMeta *m = GetIRMeta(laterInst.op);

                if ((m->flags & IRFLAG_EXIT) != 0) {
                    // Exit, so we can't do the optimization.
                    break;
                }
                if (IRReadsFromGPR(laterInst, dest)) {
                    if (IRDestGPR(laterInst) == dest && laterInst.op == IROp::AndConst) {
                        const u32 mask = in.GetConstants()[laterInst.src2];
                        // Here we are, maybe we can reduce the load size based on the mask.
                        if ((mask & 0xffffff00) == 0) {
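                            // Any of these loads can shrink to Load8: the mask
                            // clears all bits above the low byte, including
                            // Load16Ext's sign extension.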
                            inst.op = IROp::Load8;
                            if (mask == 0xff) {
                                nextSkip = j;
                            }
                        } else if ((mask & 0xffff0000) == 0 && inst.op == IROp::Load32) {
                            inst.op = IROp::Load16;
                            if (mask == 0xffff) {
                                nextSkip = j;
                            }
                        }
                    }
                    // If it was read, we can't do the optimization.
                    break;
                }
                if (IRDestGPR(laterInst) == dest) {
                    // Someone else wrote, so we can't do the optimization.
                    break;
                }
            }
        }

        // Skip writing the AndConst that a shrunken load made redundant.
        if (i != nextSkip) {
            out.Write(inst);
        }
    }

    return logBlocks;
}

Totally skipped ANDs aren't that common (though it happens a good number of times in Crisis Core). Reducing the width was more common, for sure.

-[Unknown]

@hrydgard (Owner)

Even if there's not a big immediate performance gain, passes like this will help generate better code in real compiler passes. When we can do things like this across blocks, I'm sure we can eliminate a ton of constant stores to regular registers as well (if a register is later clobbered on both possible exits from a branch, it does not need to be written).

I think most passes where you need to do an additional inner loop looking for things can be solved in O(n) by looping backwards and keeping track of some state instead.

As for the load-shrinking pass, that makes sense as well. I don't think it's likely to hinder many optimizations.
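
To illustrate the backward-scan idea (a minimal sketch, not code from this PR): walk the block once from the end, tracking which registers are still read below the current point; a side-effect-free write to a register that is not live below it is dead. This assumes IRDestGPR() returns a negative value for ops that don't write a GPR, and uses a tiny purity whitelist; a real pass would consult IRMeta for which operands are registers rather than constant-pool indices.

#include <cstring>
#include <vector>

bool RemoveDeadWritesSketch(const IRWriter &in, IRWriter &out) {
    const auto &insts = in.GetInstructions();
    int n = (int)insts.size();
    std::vector<bool> keep(n, true);

    bool liveBelow[256];
    // Conservatively treat every register as observable at the block end.
    memset(liveBelow, 1, sizeof(liveBelow));

    for (int i = n - 1; i >= 0; i--) {
        const IRInst &inst = insts[i];
        if ((GetIRMeta(inst.op)->flags & IRFLAG_EXIT) != 0) {
            // Anything might be observed across an exit; reset to all-live.
            memset(liveBelow, 1, sizeof(liveBelow));
            continue;
        }
        int dest = IRDestGPR(inst);
        // Illustrative purity whitelist; a real pass would check IRMeta flags.
        bool pure = inst.op == IROp::Mov || inst.op == IROp::AddConst ||
                    inst.op == IROp::AndConst;
        if (dest >= 0 && pure && !liveBelow[dest]) {
            keep[i] = false;  // clobbered below before any read: dead write
            continue;
        }
        if (dest >= 0)
            liveBelow[dest] = false;  // the write kills liveness above it
        // Conservatively mark source fields live (wrong-but-safe for ops
        // where src2 is a constant-pool index rather than a register).
        liveBelow[inst.src1] = true;
        liveBelow[inst.src2] = true;
    }

    for (u32 value : in.GetConstants())
        out.AddConstant(value);
    for (int i = 0; i < n; i++) {
        if (keep[i])
            out.Write(insts[i]);
    }
    return false;
}

This visits each instruction once, so it's O(n): the state is just one liveness flag per register instead of an inner scan per instruction.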

@hrydgard (Owner)

Other fun passes we should do (some of which I did in the old IR branch):

  • Bubble loads and stores upwards in blocks as far as possible.
  • Sort sequences of loads that access similar offsets through the same registers (stack flushes, etc.) to make things easier for later codegen.
  • Bubble slt-family instructions down towards branches, looking to merge them in the future. A problem with that is the output register of the slt still needs to be written, but we may be able to elide those writes later when optimizing across blocks.

@unknownbrackets (Collaborator, Author)

Rebased this, but I didn't clean up the ReduceLoads thing.

-[Unknown]

Without this, Gods Eater Burst crashes before going in game.
@unknownbrackets (Collaborator, Author)

Added a few more bugfixes.

-[Unknown]

@hrydgard merged commit 49b1339 into hrydgard:ir-jit on May 13, 2016
@unknownbrackets (Collaborator, Author) commented May 17, 2016

So far, here's some bubbling to the top:

bool ReorderLoadStore(const IRWriter &in, IRWriter &out) {
    bool logBlocks = false;

    enum class RegState : u8 {
        UNUSED = 0,
        READ = 1,
        CHANGED = 2,
    };

    bool queuing = false;
    std::vector<IRInst> loadStoreQueue;
    std::vector<IRInst> otherQueue;
    RegState otherRegs[256] = {};

    auto flushQueue = [&]() {
        if (!queuing) {
            return;
        }

        // TODO: Time to reorder loadStoreQueue.

        queuing = false;
        for (IRInst queued : loadStoreQueue) {
            out.Write(queued);
        }
        for (IRInst queued : otherQueue) {
            out.Write(queued);
        }
        loadStoreQueue.clear();
        otherQueue.clear();
        memset(otherRegs, 0, sizeof(otherRegs));
    };

    for (int i = 0; i < (int)in.GetInstructions().size(); i++) {
        IRInst inst = in.GetInstructions()[i];
        switch (inst.op) {
        case IROp::Load32:
            // To move a load up, its dest can't be changed by things we move down.
            if (otherRegs[inst.dest] != RegState::UNUSED || otherRegs[inst.src1] == RegState::CHANGED) {
                flushQueue();
            }

            queuing = true;
            loadStoreQueue.push_back(inst);
            break;

        case IROp::Store32:
            // A store can move above even if it's read, as long as it's not changed by the other ops.
            if (otherRegs[inst.src3] == RegState::CHANGED || otherRegs[inst.src1] == RegState::CHANGED) {
                flushQueue();
            }

            queuing = true;
            loadStoreQueue.push_back(inst);
            break;

        case IROp::LoadFloat:
        case IROp::StoreFloat:
            // Floats can always move as long as their address is safe.
            if (otherRegs[inst.src1] == RegState::CHANGED) {
                flushQueue();
            }

            queuing = true;
            loadStoreQueue.push_back(inst);
            break;

        case IROp::Sub:
        case IROp::Slt:
        case IROp::SltU:
        case IROp::Add:
        case IROp::And:
        case IROp::Or:
        case IROp::Xor:
        case IROp::Shl:
        case IROp::Shr:
        case IROp::Ror:
        case IROp::Sar:
            // We'll try to move this downward.
            otherRegs[inst.dest] = RegState::CHANGED;
            if (inst.src1 && otherRegs[inst.src1] != RegState::CHANGED)
                otherRegs[inst.src1] = RegState::READ;
            if (inst.src2 && otherRegs[inst.src2] != RegState::CHANGED)
                otherRegs[inst.src2] = RegState::READ;
            otherQueue.push_back(inst);
            queuing = true;
            break;

        case IROp::Neg:
        case IROp::Not:
        case IROp::BSwap16:
        case IROp::BSwap32:
        case IROp::Ext8to32:
        case IROp::Ext16to32:
        case IROp::ReverseBits:
        case IROp::Clz:
        case IROp::AddConst:
        case IROp::SubConst:
        case IROp::AndConst:
        case IROp::OrConst:
        case IROp::XorConst:
        case IROp::SltConst:
        case IROp::SltUConst:
        case IROp::ShlImm:
        case IROp::ShrImm:
        case IROp::RorImm:
        case IROp::SarImm:
        case IROp::Mov:
            // We'll try to move this downward.
            otherRegs[inst.dest] = RegState::CHANGED;
            if (inst.src1 && otherRegs[inst.src1] != RegState::CHANGED)
                otherRegs[inst.src1] = RegState::READ;
            otherQueue.push_back(inst);
            queuing = true;
            break;

        case IROp::Downcount:
            if (queuing) {
                // This one's a freebie.  Sometimes helps with delay slots.
                otherQueue.push_back(inst);
            } else {
                out.Write(inst);
            }
            break;

        default:
            flushQueue();
            out.Write(inst);
            break;
        }
    }

    // Flush anything still queued. Blocks normally end with an exit op,
    // which flushes via the default case, but don't rely on that here.
    flushQueue();

    // Can reuse the old constants array - not touching constants in this pass.
    for (u32 value : in.GetConstants()) {
        out.AddConstant(value);
    }
    return logBlocks;
}

Sorting then just means matching the same src1 and only reordering loads/stores of matching types; a rough sketch is below.
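
Something like this for the sorting step at the TODO in flushQueue() (a rough sketch, not code from this PR): find maximal runs in the queue with identical op and src1, i.e. the same access type through the same base register, and stable-sort each run by offset. It assumes the offset is a constants-pool index in src2, the same way AndConst's operand is in ReduceLoads above, and that GetConstants() yields a std::vector<u32>; a real pass would also need to skip runs of stores whose byte ranges could overlap.

#include <algorithm>
#include <vector>

static void SortLoadStoreQueueSketch(std::vector<IRInst> &queue,
                                     const std::vector<u32> &constants) {
    size_t start = 0;
    while (start < queue.size()) {
        // Grow the run while the access type and base register match.
        size_t end = start + 1;
        while (end < queue.size() && queue[end].op == queue[start].op &&
               queue[end].src1 == queue[start].src1) {
            end++;
        }
        // Reorder only within the matching run; stable, so equal offsets
        // (e.g. a store repeated to the same address) keep their order.
        std::stable_sort(queue.begin() + start, queue.begin() + end,
                         [&](const IRInst &a, const IRInst &b) {
                             return constants[a.src2] < constants[b.src2];
                         });
        start = end;
    }
}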

-[Unknown]

@hrydgard (Owner)

That's a really nice way to do it!
