Extending pc integral trick to non-uniform op costs #185
Thankfully, static analysis is always possible as there are no arbitrary code jumps in eBPF - jumps are always to an inlined offset.
@jackcmay @Lichtso, since static analysis may be expensive in the malicious (or even non-malicious) scenario, I think the above approach may be a good way to perform an "intermediate JIT" in the non-uniform opcode cost case. Static analysis could then be applied only to programs that live for a long time in the executor cache, as a one-time cost, rather than eagerly JITing and statically analysing every program. So the compilation hierarchy:
Even if the static analysis is expensive, it would probably be possible to compile asynchronously in a background thread and silently replace the JITed binary by acquiring a write lock on the cache.
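For illustration, a minimal Python sketch of that compile-in-the-background-and-swap idea, covering the tiered-cache approach from the previous comment as well; ExecutorCache, jit_compile, and analyze_and_jit are hypothetical stand-ins rather than the actual rbpf API, and a plain mutex stands in for the read/write lock:

import threading

def jit_compile(bytecode):
    # Stand-in for a fast JIT pass with no static analysis.
    return ("jit", bytecode)

def analyze_and_jit(bytecode):
    # Stand-in for the expensive static-analysis + JIT pass.
    return ("analyzed", bytecode)

class ExecutorCache:
    def __init__(self):
        self._lock = threading.Lock()
        self._entries = {}

    def get(self, key, bytecode):
        with self._lock:
            entry = self._entries.get(key)
        if entry is None:
            entry = jit_compile(bytecode)  # cheap path, runs immediately
            with self._lock:
                self._entries[key] = entry
            # Upgrade in the background; the swap happens under the lock.
            threading.Thread(target=self._upgrade, args=(key, bytecode),
                             daemon=True).start()
        return entry

    def _upgrade(self, key, bytecode):
        upgraded = analyze_and_jit(bytecode)  # one-time cost, off the hot path
        with self._lock:                      # "write lock": silent replacement
            self._entries[key] = upgraded

cache = ExecutorCache()
print(cache.get("prog-hash", b"\x00"))

The cheap JIT result is usable immediately, and the statically analysed binary silently replaces it once the background thread finishes.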
Unfortunately, we have
Yes, I don't see why not. In fact it is a lot simpler than you think and works with the current approach out-of-the-box.
Here, I sketched a small Python prototype to illustrate how it works:

import random
from functools import reduce

def generate_randomized_trace(cost_limit, cost_of):
    # cumulative_cost[i] is the summed cost of instructions 0..i-1 (the "integral").
    cumulative_cost = reduce(lambda c, x: c + [c[-1] + x], cost_of, [0])
    program_length = len(cost_of)
    interpreter_instruction_meter = 0
    jit_instruction_meter = cost_limit
    cost_limit_reached = False
    ip_pc = 0
    while True:
        if ip_pc + 1 >= program_length:
            print("Program reached the end")
            return
        interpreter_instruction_meter += cost_of[ip_pc]
        print("ip/pc={} interpreter={}/{} jit={}/{}".format(ip_pc,
            interpreter_instruction_meter, cost_limit,
            cumulative_cost[ip_pc + 1], jit_instruction_meter,
        ))
        interpreter_limit_reached = cost_limit <= interpreter_instruction_meter
        jit_limit_reached = jit_instruction_meter <= cumulative_cost[ip_pc + 1]
        if (interpreter_limit_reached or jit_limit_reached) and not cost_limit_reached:
            cost_limit_reached = True
            print("Cost limit reached by interpreter={} jit={}".format(
                interpreter_limit_reached, jit_limit_reached))
        if random.randint(0, 100) < 20:
            target_ip_pc = random.randint(0, program_length - 1)
            # Rebase the JIT meter at the branch: subtract the integral up to
            # here and add the integral up to the jump target.
            jit_instruction_meter -= cumulative_cost[ip_pc + 1]
            jit_instruction_meter += cumulative_cost[target_ip_pc]
            print("Jump to ip/pc={}".format(target_ip_pc))
            if cost_limit_reached:
                print("JIT aborted branch")
                return
            else:
                ip_pc = target_ip_pc
        else:
            ip_pc += 1

generate_randomized_trace(64, [random.randint(1, 3) for _ in range(0, 100)])

Which, for example, generates the following trace:
So the only difference from how it currently works is that
Yep, that integral trick is very elegant. Still makes me worried about long straight-line programs.
We could emit phony branches if we detect long linear runs, though such runs are not particularly useful anyway. Edit: Just remembered that the last return / exit counts as a branch too, so that would check whether the program ran too long.
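For illustration, a small sketch of where such phony checks could be placed so that no straight-line run burns more than a fixed number of cost units unchecked; the cost list and is_branch map are assumed inputs, not the actual analysis pass:

def phony_check_sites(cost_of, is_branch, max_unchecked):
    sites = []
    unchecked = 0
    for pc, cost in enumerate(cost_of):
        unchecked += cost
        if is_branch[pc]:
            unchecked = 0      # real branches (and the final exit) already check the meter
        elif unchecked >= max_unchecked:
            sites.append(pc)   # emit a phony check / phony branch here
            unchecked = 0
    return sites

# Example: a 1000-instruction straight-line program, checked every 64 cost units.
print(phony_check_sites([1] * 1000, [False] * 1000, 64))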
@Lichtso my worry is not the detection per se, but rather catching such programs in the act so they don't run for, say, 100ms instead of a target of, say, 10ms. There's actually a separate issue involved: if the program throws an error or panics after it was supposed to have run out of budget, it may return a different return type. I guess this wouldn't be a problem if every error, or branch into an error, also first checks the compute meter (perhaps already subsumed under the conditions for returning).
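A sketch of that precedence rule, with hypothetical error types: whenever the program is about to surface an error, the meter is consulted first, so an exhausted budget is reported instead of whatever fault happened past the limit:

class ExceededBudget(Exception):
    pass

class ProgramFault(Exception):
    pass

def finalize_error(err, meter_consumed, cost_limit):
    # A blown budget takes precedence over any error the program hit
    # after it should already have been stopped.
    if meter_consumed >= cost_limit:
        return ExceededBudget()
    return err

print(repr(finalize_error(ProgramFault("div by zero"), 250_000, 200_000)))
# ExceededBudget()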
Interesting idea; it might not be covered so far, but we would have to write a test case for that. (See line 436 in 42ca251.)
There seems to be a likelihood that certain ops, especially memory ops (e.g. a register load, or a jmp that requires a random instruction fetch), might need to be charged a higher number of compute units, because they can be many times slower than an ALU op (a 100ns RAM seek is ~300x slower than an i32 add). At 200_000 compute units and 100ns per RAM operation, this would result in a program time of at least 20ms, which I think is a lot higher than desired.
This is a good explanation for Serum's bad CU/µs ratio, with a median of 7, which is roughly consistent with 100ns per opcode. That's because Serum's critbit tree data structure does a lot of random reads from a 1MB+ memory region, which may live in L3 cache or even be loaded cold from RAM - so execution is dominated by opcodes 30-300x slower than an ALU opcode.
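Spelling out the arithmetic behind both numbers:

cu_limit = 200_000
ram_seek_ns = 100
print(cu_limit * ram_seek_ns / 1e6, "ms")     # 20.0 ms worst case

median_cu_per_us = 7                          # Serum's observed median
print(1_000 / median_cu_per_us, "ns per CU")  # ~142.9 ns, same order as 100 ns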
In the model we consider, the majority of opcodes have a cost of 1, and they dominate.
For every expensive opcode, during emission, we increase the compute_meter by the opcode cost minus 1. Opcode extra costs are stored in an array, so one can index directly into it during instruction emission via the u8 insn.opc. The load-and-add instruction sequence is not emitted when the extra cost is 0 (obviously). A sketch of this emission rule follows below.
At worst, the cost of adding a constant value to the compute meter during execution is that of a load plus an ALU operation, which is small compared to the cost of the expensive operation being accounted for. For instance, a memory operation will probably always have to cost at least 30x the CUs of an ALU op, so loading the compute meter, which probably lives in L1 cache, and adding a constant to it will consume far fewer cycles by comparison.
Another option is to analyse the BPF and segment it into basic blocks, thus reducing the need to deal with arbitrary jumps. (Edit: I guess this is ongoing work with static analysis.)
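A rough sketch of such segmentation, assuming one-slot instructions and the eBPF convention that a branch targets pc + offset + 1; the (is_branch, offset) encoding is an illustrative simplification, not the real instruction format:

def basic_block_leaders(program):
    # program: list of (is_branch, offset) pairs; offset is None for non-branches.
    leaders = {0}                         # the entry is always a leader
    for pc, (is_branch, offset) in enumerate(program):
        if is_branch:
            leaders.add(pc + 1 + offset)  # the branch target
            leaders.add(pc + 1)           # the fall-through successor
    return sorted(l for l in leaders if l < len(program))

# Tiny example: instruction 1 branches back to instruction 0.
print(basic_block_leaders([(False, None), (True, -2), (False, None)]))
# [0, 2] -> blocks are [0..1] and [2..]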
For every ordinary opcode, we do nothing, as per the constant-1 integral.
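A minimal sketch of this emission rule; the table contents and the emit_meter_add callback are assumptions for illustration, not the actual JIT emitter:

EXTRA_COST = [0] * 256
EXTRA_COST[0x61] = 29   # assumed: a load opcode charged 30 CUs in total

def emit_instruction(opc, emit_meter_add):
    extra = EXTRA_COST[opc]
    if extra != 0:
        # compute_meter += extra; the baseline cost of 1 is already covered
        # by the pc integral, so only the excess is accounted for here.
        emit_meter_add(extra)
    # ... emit the translated machine code for the opcode itself ...

# The add is only emitted for expensive opcodes; ordinary opcodes add nothing.
emit_instruction(0x61, lambda n: print("emit: compute_meter +=", n))
emit_instruction(0x0f, lambda n: print("emit: compute_meter +=", n))  # emits nothing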