-
Notifications
You must be signed in to change notification settings - Fork 3.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[RFC] Register Relay VM design #2915
Comments
I proposed an alternative VM design using registers instead of stack. We can discuss and compare which one works better. cc @jroesch @tqchen @wweic @zhiics @MarisaKirisame @junrushao1994 @abergeron @ajtulloch @antinucleon @Ravenwater |
In the long tradition of competitive engineering, let's develop both VMs concurrently. That way we will generate actual engineering metrics of the difference between a stack-based and a register-based VM. The engineering dependencies are too complex to productively discuss without actually going through the implementation. The arguments that @jroesch presented, i.e. the stack VM simplifies the instruction management, and the fact that you would need to unroll to discover how operators interact was considered acceptable given the fact that most concurrency is expected to be in the operators themselves. What the stack vs register organization does to the complexity of resulting compiler analysis code in the context of tensor compositions I think is an open question that can only be answered by doing both implementations. The beauty of that approach is that it will generate a level set of these questions and will offer many an opportunity to publish the results and go on to lucrative commercial engagements as one of the few people in the world that actually has the engineering know-how about the technical details. |
Thanks for the proposal. I think we can have more discussions on this topic. To make progress on discussions, let us try to divide our argument by points, so we can agree on some of them and see the major disagreements. For example, instead of saying: register vm allows data flow discovery, but dataflow discovery don't have to be done at the VM level. We break it into two points:
In our case, most people can agree on (1) but not necessarily (2). This way, we can find what we actually agree on more and move on from that point. I do think we should aim to have only one version of VM after we agree on things. Mainly because it is a better way to have the community pushing for a consistent technical direction. |
Some brain dumps from my side. Design points to consider
As mentioned above, one possible tradeoff is to simply use the stack as registers, so the register number indicates an offset to the begin ptr of the current function. This could unify quite a lot of things. Cost of Register VMThe main argument against a register vm is the implementation complexity given that stack vm is a "fine and concise impl", the question really boils down to do we need these additional features and cost that comes with them, so far the costs are
|
I explored @icemelon9 's register vm design in my branch: https://github.com/wweic/tvm/commits/relay-rts. Would like to share some data points. We need to add special registers in VM for function arguments and return value. Return register is necessary because when callee function returns, its register file has gone. So return value must be persisted outside of register file. Opcode generation in register VM is easier, compiler doesn't need to maintain stack offsets all along, just memorize variable to register mapping. Then we need register allocation, I tried liveness analysis + linear scan. It's pretty straightforward since opcode is relatively simple. I'm not sure I understand @tqchen 's concern about handing control flow in register allocation. Note that Haichen's register VM uses local register file per function. So we are doing register allocation per function, which shouldn't need to worry about interprocedural register allocations. As for branches inside a function, In register VM we need to allocate a register file( Overall register VM is slightly easier to work with, debugging is painless. But I'm not sure if this is convincing enough to favor register VM. Also register VM's opcode size is larger, it might be a concern for embedded devices. The major selling point of register VM is easy to discover dataflow. But it begs the questions that is register VM better at leveraging the dataflow information? I think it requires more thought. Fundamentally both VMs have call stack and will have similar problems when integrating with async execution engines. |
Summary@tqchen @icemelon9 @jroesch @zhiics @yongwww we discuss in person. Reached the following consensus:
Let me know if I miss anything or said something wrong. I'll take out liveness analysis on opcodes from my branch and polish remaining stuff(register VM + linear scan + interfaces). Since we don't have liveness analysis on Relay AST now, I'll simply generate live interval for each register with the full opcodes range, so register allocator can assign unique slot for each register. |
I want to highlight that due to public archive principle. The summary of in person discussion only serve as summary information and suggestions instead of the final design decision. The design decision should be make in this thread, allowing everyone to participate. So at this moment, the discussion is still open as RFC |
I propose two changes to instructions.
AllocTensor
Allocate a tensor object with shape stored in register $1 and save to register $2. Select
Select VMObject either in $2 or $3 based on condition register $1, and store in $4. |
Seems we have reached consensus about VM design, close for now |
Register VM Design
Current Relay VM RFC (#2810) proposes stack-based VM design which uses push and pop to maintain a unified stack. Though the design itself is simple to implement, it is cumbersome in tasks such as dataflow analysis and enforces certain orders in the execution.
I propose a register-based VM design as an alternative design. The main difference is that register-based VM uses registers to designate the operands and results instead of using the stack. Registers in the VM are virtual registers and each is a reference to a
VMObject
. We assume there are infinite registers so runtime doesn't need to worry about register spilling. Registers are in SSA form where its index is unique in each VMFunction.Calling convention in register VM is also simple. Each function has its own local register file. The first
k
registers in the register file of callee function are arguments passed in by caller function, and the VMObject in return register is allocated by callee function instead of caller function to avoid unnecessary memory copy. Caller function then assigns the reference to VMObject pointed by return register into its own register (see detail inInvoke
instruction).I summarize some pros and cons for register VM compared to stack VM.
Pros:
Phi
instruction to select the result from either branch.Cons:
Instructions
We modify the current stack VM instructions to use registers as operands and results.
Registers are designated by
$k
wherek
is the register index.*reg
indicates a list of registers.AllocTensor
Allocate a tensor object and stores to register
$1
.AllocDatatype
Allocate a datatype object and stores to register
$1
.*regs
is a list of registers containing fields in the datatype.AllocClosure
Allocate a closure object, where there are
num_freevars
in*regs
.LoadConst
Load the constant at
const_index
from the constant pool.Mov
Create a reference to VMObject in register
$1
and stores it in$2
.Note: No data copy happens in this instruction. The real data copy should be a PackedFunction.
Phi
Takes the
VMObject
either in$1
or$2
and stores in$3
.Note: This instruction requires VMObject in register
$1
and$2
having the same type, and only one of them should be valid during the runtime.Ret
Returns the register
$1
.GetField
Get the field at
field_index
in$1
.If
Check if
$1
is true or false. If true relative jump by true_branch, else relative jump by false_branch.Goto
Relative unconditional jump by
pc_offset
.InvokePacked
Invoke the packed function denoted by
packed_index
Note: Number of registers in
*regs
should bearity + output_size
where firstarity
registers are arguments and rest are output registers.Invoke
Invoke VM function at
func_index
Note: Number of registers in
*regs
should bearity
. Register$1
will be a reference to the VMObject in the return register.InvokeClosure
Invoke closure function in register
$2
and save the result into register$2
.Note: Register
$2
must be aVMClosureObject
.Stack and State
In order to convert to register-based VM, we also need to adjust the data structure in the VM. We assume there are infinite registers available, and each function has its own register file. Each time an
Invoke
instruction is executed, the runtime creates a newVMFrame
. We pre-allocate max number of registers used in the callee function (this number can be derived duringVMCompile
) and assigns the arguments to the firstargs
slots in the registers.The text was updated successfully, but these errors were encountered: