Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimization: Opcode fusion of branch and comparison instructions #712

Closed
Robbepop opened this issue Mar 30, 2023 · 2 comments · Fixed by #789
Closed

Optimization: Opcode fusion of branch and comparison instructions #712

Robbepop opened this issue Mar 30, 2023 · 2 comments · Fixed by #789
Assignees
Labels
optimization An performance optimization issue. register-machine A work item for the register-machine engine.

Comments

@Robbepop
Copy link
Member

Robbepop commented Mar 30, 2023

Blocked by: #729

In the current wasmi bytecode comparison instructions such as I32LtS, I64GtU etc. and conditional branch instructions such as BrIfNez and BrIfEqz could be fused since most often they are used together. Executing fewer instructions usually results in improved performance for interpreters like wasmi.

For example the following Wasm bytecode:

(br_if
    0 ;; some branch target
    (i32.lt_s
        (local.get 0)
        (local.get 1)
    )
)

Which represents something similar to:

if get_local(0) as i32 < get_local(1) as i32 {
    break; // goto branch target
}

Could be reduced to the following new wasmi instruction:

(br.i32.lt_s
    (local.get 0)
    (local.get 1)
)

Of course this means that we need to add plenty of new wasmi bytecode instructions which is a big drawback to this potential optimizations. Due to the increased number of instructions it is also unclear whether this optimization actually improves performance. We are not able to replace old unfused wasmi instructions by the new wasmi instructions since they might still occur in isolation.

Bytecode Changes

Compare + Branch

New wasmi bytecode branch instructions include:

  • br.{i32, i64, f32, f64}.{eq, ne}: 8 new instructions.
  • br.{i32, i64}.{lt, le, gt, ge}{_s, _u}: 16 new instructions.
  • br.{f32, f64}.{lt, le, gt, ge}: 8 new instructions.

In total: 32 new branch+compare instructions. Furthermore for immediate versions we are going to need yet another 32 branch+compare with immediate instructions. Totalling 64 instructions.

Note: We only need immediate versions for the second operand since we can rewrite bytecode where the first operand is an immediate value always as follows: br.i32.lt_s 42 reg0 -> br.i32.gt_s reg0 42.

Note: Wasms br_eqz can be represented by br.{i32, i64}.eq_imm reg 0. Where the _imm suffix represents a bytecode instruction that takes an immediate value as its second operator.

Compare + Return

Beyond branch instructions we could and probably should also introduce new return instruction variants for all the comparators since a branch to the outermost level is simply a return instruction and there already exist conditional return instructions in return.nez and return.eqz in wasmi which act as living examples.

  • return.{i32, i64, f32, f64}.{eq, ne}: 8 new instructions.
  • return.{i32, i64}.{lt, le, gt, ge}{_s, _u}: 16 new instructions.
  • return.{f32, f64}.{lt, le, gt, ge}: 8 new instructions.

Which is yet another 32 new instructions.

Blocked By

This optimization is blocked by the implementation of the register machine bytecode.

Conclusion

The biggest advantage of this potential optimization is that those fused instruction sequences usually occur within loops where they would have the biggest positive impact on the performance of the executed Wasm program.

@Robbepop Robbepop changed the title Optimization: Opcode fusion of brach and comparison instructions Optimization: Opcode fusion of branch and comparison instructions Mar 30, 2023
@Robbepop Robbepop added the optimization An performance optimization issue. label Mar 30, 2023
@Robbepop
Copy link
Member Author

Robbepop commented Jul 6, 2023

Today I thought a bit about this issue and since this is "just" an optimization based on the Instruction type of the wasmi register machine implementation I came up with the following idea:

What if we just provide the Instruction variants that are actually going to be very efficient at runtime?
This will leave us with the following variants:

enum Instruction {
    ...
    BrI32Eq { .. },
    BrI32EqImm { .. },
    BrI64Eq { .. },
    BrI64EqImm { .. },
    BrF32Eq { .. },
    BrF32EqImm { .. },
    BrF64Eq { .. },
    BrF64EqImm { .. },

    BrI32Ne { .. },
    BrI32NeImm { .. },
    BrI64Ne { .. },
    BrI64NeImm { .. },
    BrF32Ne { .. },
    BrF32NeImm { .. },
    BrF64Ne { .. },
    BrF64NeImm { .. },

    BrI32LtS { .. },
    BrI32LtSImm { .. },
    BrI32GtS { .. },
    BrI32GtSImm { .. },
    BrI32LeS { .. },
    BrI32LeSImm { .. },
    BrI32GeS { .. },
    BrI32GeSImm { .. },

    BrI64LtS { .. },
    BrI64LtSImm { .. },
    BrI64GtS { .. },
    BrI64GtSImm { .. },
    BrI64LeS { .. },
    BrI64LeSImm { .. },
    BrI64GeS { .. },
    BrI64GeSImm { .. },

    BrF32Lt { .. },
    BrF32Gt { .. },
    BrF32Le { .. },
    BrF32Ge { .. },

    BrF64Lt { .. },
    BrF64Gt { .. },
    BrF64Le { .. },
    BrF64Ge { .. },
    ...
}

With this internal structure as represented exemplary by BrI32Eq and BrI32EqImm:

BrI32Eq {
    lhs: Register,
    rhs: Register,
    offset: BranchOffset16,
},
BrI32EqImm {
    lhs: Register,
    rhs: Const16,
    offset: BranchOffset16,
},

This implies that we only apply this optimization if BranchOffset (which usually is encoded as 32-bit value) can be encoded as 16-bit value AND in case of instructions with an immutable rhs if the rhs value can also be encoded as 16-bit value.

For example this holds true for the common cases of Wasm's i32.eqz or i64.eqz which are both translated to

I{32,64}EqImm16 { result: Register, lhs: Register, imm: Const16::from(0) }

And if followed by a br_if could be optimized to

BrI{32,64}EqImm { lhs: Register, imm: Const16::from(0) }

For most common and practical use cases I assume that a BranchOffset16 will nearly always be encodable so the only real limitation are the 16-bit encoded immutable values.

Since return instructions are more complex than br instructions in the linked PR implementation I propose NOT to optimize conditional return instructions this way.

@Robbepop Robbepop added the register-machine A work item for the register-machine engine. label Sep 22, 2023
@Robbepop Robbepop moved this from Draft to Open in Parity Roadmap Sep 22, 2023
@Robbepop Robbepop self-assigned this Sep 22, 2023
@Robbepop
Copy link
Member Author

Robbepop commented Nov 18, 2023

The register-machine wasmi engine backend has been implemented. It is not yet finalized and likely still contains bugs but I was just running some coremark.wasm benchmarks while counting the executions per wasmi instruction and this is the result:
https://gist.github.com/Robbepop/16ac42197e1c745dd5ad3ab41524ae2d

We can clearly see that conditional branches (branch_nez and branch_eqz) are the most common instruction executed. This might indicate that this optimization might yield valuable results.

The wasmi branch that implements instruction execution counting can be found here:
https://github.com/paritytech/wasmi/tree/rf-count-instruction-executions

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
optimization An performance optimization issue. register-machine A work item for the register-machine engine.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

1 participant