Add a pass to remove temporary regs #8732

Merged: 11 commits into hrydgard:ir-jit on May 13, 2016

Conversation

@unknownbrackets (Collaborator)

I didn't see much benefit from this performance-wise, unfortunately. It does trigger and reduce code size sometimes, so I think it's still worthwhile.

The three most common types of things I saw this catch were (a rough sketch of the detection idea follows the list):

  1. ins temporaries, e.g. when inserting zero, so all the temporaries fold out. This prevents TEMP0 from even being stored in this case.
  2. lw/sw/etc. temporaries, e.g. li a0, addr1; lw t0, 0(a0); li a0, addr2; lw t1, 0(a0), etc.
  3. Unused loads. Usually loads right next to each other, or sometimes a load right after a store, unused and clobbered later. Most might've been branching things; at least one definitely looked like a compiler bug / zero-optimizations thing.
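
For case 1 especially, the detection is a straightforward forward scan. Here's a minimal sketch of the idea (illustrative, not the pass as merged): a side-effect-free write to a temporary can be dropped when the register is clobbered again before any read and before any exit. The temp-register range and the purity whitelist below are placeholders, not PPSSPP's real values.

// Hypothetical temp-register range; substitute the IR's real temp indices.
constexpr int kFirstTemp = 192, kLastTemp = 195;

bool PurgeTempsSketch(const IRWriter &in, IRWriter &out) {
    for (u32 value : in.GetConstants())
        out.AddConstant(value);

    const auto &insts = in.GetInstructions();
    int n = (int)insts.size();
    for (int i = 0; i < n; i++) {
        const IRInst &inst = insts[i];
        int dest = IRDestGPR(inst);
        // Illustrative purity whitelist; a real pass would check IRMeta flags.
        bool pure = inst.op == IROp::Mov || inst.op == IROp::AndConst ||
                    inst.op == IROp::Shl;
        bool dead = false;
        if (pure && dest >= kFirstTemp && dest <= kLastTemp) {
            for (int j = i + 1; j < n; j++) {
                const IRInst &later = insts[j];
                if ((GetIRMeta(later.op)->flags & IRFLAG_EXIT) != 0)
                    break;  // could be observed after the exit, keep it
                if (IRReadsFromGPR(later, dest))
                    break;  // the value is used, keep it
                if (IRDestGPR(later) == dest) {
                    dead = true;  // clobbered before any read, drop it
                    break;
                }
            }
        }
        if (!dead)
            out.Write(inst);
    }
    return false;
}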

-[Unknown]

@unknownbrackets (Collaborator, Author)

This also works, although maybe reducing load width could make other optimizations (like load combining) harder. In some ways we might even want to expand load widths when the result is masked anyway.

bool ReduceLoads(const IRWriter &in, IRWriter &out) {
    for (u32 value : in.GetConstants()) {
        out.AddConstant(value);
    }

    // This tells us to skip an AND op that has been optimized out.
    // Maybe we could skip multiple, but that'd slow things down and is pretty uncommon.
    int nextSkip = -1;

    bool logBlocks = false;
    for (int i = 0, n = (int)in.GetInstructions().size(); i < n; i++) {
        IRInst inst = in.GetInstructions()[i];

        if (inst.op == IROp::Load32 || inst.op == IROp::Load16 || inst.op == IROp::Load16Ext) {
            int dest = IRDestGPR(inst);
            for (int j = i + 1; j < n; j++) {
                const IRInst &laterInst = in.GetInstructions()[j];
                const IRMeta *m = GetIRMeta(laterInst.op);

                if ((m->flags & IRFLAG_EXIT) != 0) {
                    // Exit, so we can't do the optimization.
                    break;
                }
                if (IRReadsFromGPR(laterInst, dest)) {
                    if (IRDestGPR(laterInst) == dest && laterInst.op == IROp::AndConst) {
                        const u32 mask = in.GetConstants()[laterInst.src2];
                        // Here we are, maybe we can reduce the load size based on the mask.
                        if ((mask & 0xffffff00) == 0) {
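                            // Any of these loads can shrink to Load8: the mask
                            // clears all bits above the low byte, including
                            // Load16Ext's sign extension.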
                            inst.op = IROp::Load8;
                            if (mask == 0xff) {
                                nextSkip = j;
                            }
                        } else if ((mask & 0xffff0000) == 0 && inst.op == IROp::Load32) {
                            inst.op = IROp::Load16;
                            if (mask == 0xffff) {
                                nextSkip = j;
                            }
                        }
                    }
                    // If it was read, we can't do the optimization.
                    break;
                }
                if (IRDestGPR(laterInst) == dest) {
                    // Someone else wrote, so we can't do the optimization.
                    break;
                }
            }
        }

        // Skip writing the AndConst that a shrunken load made redundant.
        if (i != nextSkip) {
            out.Write(inst);
        }
    }

    return logBlocks;
}

Totally skipped ANDs aren't that common (though it happens a good number of times in Crisis Core). Reducing the width was more common, for sure.

-[Unknown]

@hrydgard (Owner)

Even if there's not a big immediate performance gain, passes like this will help generate better code in real compiler passes. When we can do things like this across blocks, I'm sure we can eliminate a ton of constant stores to regular registers as well (if a register is later clobbered on both possible exits from a branch, it does not need to be written).

I think most passes where you need to do an additional inner loop looking for things can be solved in O(n) by looping backwards and keeping track of some state instead.

As for the load-shrinking pass, that makes sense as well. I don't think it's likely to hinder many optimizations.
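
To illustrate the backward-scan idea (a minimal sketch, not code from this PR): walk the block once from the end, tracking which registers are still read below the current point; a side-effect-free write to a register that is not live below it is dead. This assumes IRDestGPR() returns a negative value for ops that don't write a GPR, and uses a tiny purity whitelist; a real pass would consult IRMeta for which operands are registers rather than constant-pool indices.

#include <cstring>
#include <vector>

bool RemoveDeadWritesSketch(const IRWriter &in, IRWriter &out) {
    const auto &insts = in.GetInstructions();
    int n = (int)insts.size();
    std::vector<bool> keep(n, true);

    bool liveBelow[256];
    // Conservatively treat every register as observable at the block end.
    memset(liveBelow, 1, sizeof(liveBelow));

    for (int i = n - 1; i >= 0; i--) {
        const IRInst &inst = insts[i];
        if ((GetIRMeta(inst.op)->flags & IRFLAG_EXIT) != 0) {
            // Anything might be observed across an exit; reset to all-live.
            memset(liveBelow, 1, sizeof(liveBelow));
            continue;
        }
        int dest = IRDestGPR(inst);
        // Illustrative purity whitelist; a real pass would check IRMeta flags.
        bool pure = inst.op == IROp::Mov || inst.op == IROp::AddConst ||
                    inst.op == IROp::AndConst;
        if (dest >= 0 && pure && !liveBelow[dest]) {
            keep[i] = false;  // clobbered below before any read: dead write
            continue;
        }
        if (dest >= 0)
            liveBelow[dest] = false;  // the write kills liveness above it
        // Conservatively mark source fields live (wrong-but-safe for ops
        // where src2 is a constant-pool index rather than a register).
        liveBelow[inst.src1] = true;
        liveBelow[inst.src2] = true;
    }

    for (u32 value : in.GetConstants())
        out.AddConstant(value);
    for (int i = 0; i < n; i++) {
        if (keep[i])
            out.Write(insts[i]);
    }
    return false;
}

This visits each instruction once, so it's O(n): the state is just one liveness flag per register instead of an inner scan per instruction.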

@hrydgard (Owner)

Other fun passes we should do (some of which I did in the old IR branch):

  • Bubble loads and stores upwards in blocks as far as possible.
  • Sort sequences of loads that access similar offsets through the same registers (stack flushes, etc.) to make things easier for later codegen.
  • Bubble slt-family instructions down towards branches, looking to merge them in the future. A problem with that is the output register of the slt still needs to be written, but we may be able to elide those writes later when optimizing across blocks.

@unknownbrackets (Collaborator, Author)

Rebased this, but I didn't clean up the ReduceLoads thing.

-[Unknown]

Without this, Gods Eater Burst crashes before going in game.
@unknownbrackets (Collaborator, Author)

Added a few more bugfixes.

-[Unknown]

@hrydgard merged commit 49b1339 into hrydgard:ir-jit on May 13, 2016
@unknownbrackets (Collaborator, Author) commented May 17, 2016

So far, here's some bubbling to the top:

bool ReorderLoadStore(const IRWriter &in, IRWriter &out) {
    bool logBlocks = false;

    enum class RegState : u8 {
        UNUSED = 0,
        READ = 1,
        CHANGED = 2,
    };

    bool queuing = false;
    std::vector<IRInst> loadStoreQueue;
    std::vector<IRInst> otherQueue;
    RegState otherRegs[256] = {};

    auto flushQueue = [&]() {
        if (!queuing) {
            return;
        }

        // TODO: Time to reorder loadStoreQueue.

        queuing = false;
        for (IRInst queued : loadStoreQueue) {
            out.Write(queued);
        }
        for (IRInst queued : otherQueue) {
            out.Write(queued);
        }
        loadStoreQueue.clear();
        otherQueue.clear();
        memset(otherRegs, 0, sizeof(otherRegs));
    };

    for (int i = 0; i < (int)in.GetInstructions().size(); i++) {
        IRInst inst = in.GetInstructions()[i];
        switch (inst.op) {
        case IROp::Load32:
            // To move a load up, its dest can't be changed by things we move down.
            if (otherRegs[inst.dest] != RegState::UNUSED || otherRegs[inst.src1] == RegState::CHANGED) {
                flushQueue();
            }

            queuing = true;
            loadStoreQueue.push_back(inst);
            break;

        case IROp::Store32:
            // A store can move above even if it's read, as long as it's not changed by the other ops.
            if (otherRegs[inst.src3] == RegState::CHANGED || otherRegs[inst.src1] == RegState::CHANGED) {
                flushQueue();
            }

            queuing = true;
            loadStoreQueue.push_back(inst);
            break;

        case IROp::LoadFloat:
        case IROp::StoreFloat:
            // Floats can always move as long as their address is safe.
            if (otherRegs[inst.src1] == RegState::CHANGED) {
                flushQueue();
            }

            queuing = true;
            loadStoreQueue.push_back(inst);
            break;

        case IROp::Sub:
        case IROp::Slt:
        case IROp::SltU:
        case IROp::Add:
        case IROp::And:
        case IROp::Or:
        case IROp::Xor:
        case IROp::Shl:
        case IROp::Shr:
        case IROp::Ror:
        case IROp::Sar:
            // We'll try to move this downward.
            otherRegs[inst.dest] = RegState::CHANGED;
            if (inst.src1 && otherRegs[inst.src1] != RegState::CHANGED)
                otherRegs[inst.src1] = RegState::READ;
            if (inst.src2 && otherRegs[inst.src2] != RegState::CHANGED)
                otherRegs[inst.src2] = RegState::READ;
            otherQueue.push_back(inst);
            queuing = true;
            break;

        case IROp::Neg:
        case IROp::Not:
        case IROp::BSwap16:
        case IROp::BSwap32:
        case IROp::Ext8to32:
        case IROp::Ext16to32:
        case IROp::ReverseBits:
        case IROp::Clz:
        case IROp::AddConst:
        case IROp::SubConst:
        case IROp::AndConst:
        case IROp::OrConst:
        case IROp::XorConst:
        case IROp::SltConst:
        case IROp::SltUConst:
        case IROp::ShlImm:
        case IROp::ShrImm:
        case IROp::RorImm:
        case IROp::SarImm:
        case IROp::Mov:
            // We'll try to move this downward.
            otherRegs[inst.dest] = RegState::CHANGED;
            if (inst.src1 && otherRegs[inst.src1] != RegState::CHANGED)
                otherRegs[inst.src1] = RegState::READ;
            otherQueue.push_back(inst);
            queuing = true;
            break;

        case IROp::Downcount:
            if (queuing) {
                // This one's a freebie.  Sometimes helps with delay slots.
                otherQueue.push_back(inst);
            } else {
                out.Write(inst);
            }
            break;

        default:
            flushQueue();
            out.Write(inst);
            break;
        }
    }

    // Flush anything still queued. Blocks normally end with an exit op,
    // which flushes via the default case, but don't rely on that here.
    flushQueue();

    // Can reuse the old constants array - not touching constants in this pass.
    for (u32 value : in.GetConstants()) {
        out.AddConstant(value);
    }
    return logBlocks;
}

Sorting then just means matching the same src1 and only reordering loads/stores of matching types; a rough sketch is below.
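
Something like this for the sorting step at the TODO in flushQueue() (a rough sketch, not code from this PR): find maximal runs in the queue with identical op and src1, i.e. the same access type through the same base register, and stable-sort each run by offset. It assumes the offset is a constants-pool index in src2, the same way AndConst's operand is in ReduceLoads above, and that GetConstants() yields a std::vector<u32>; a real pass would also need to skip runs of stores whose byte ranges could overlap.

#include <algorithm>
#include <vector>

static void SortLoadStoreQueueSketch(std::vector<IRInst> &queue,
                                     const std::vector<u32> &constants) {
    size_t start = 0;
    while (start < queue.size()) {
        // Grow the run while the access type and base register match.
        size_t end = start + 1;
        while (end < queue.size() && queue[end].op == queue[start].op &&
               queue[end].src1 == queue[start].src1) {
            end++;
        }
        // Reorder only within the matching run; stable, so equal offsets
        // (e.g. a store repeated to the same address) keep their order.
        std::stable_sort(queue.begin() + start, queue.begin() + end,
                         [&](const IRInst &a, const IRInst &b) {
                             return constants[a.src2] < constants[b.src2];
                         });
        start = end;
    }
}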

-[Unknown]

@hrydgard (Owner)

That's a really nice way to do it!
