Faster stack clearing #5
Comments
Interrupts do make things a little trickier. Before, I was thinking that we would simply stall the instruction in the EX stage if it accesses a stack area not yet zeroed. With interrupts, we probably have to abort the instruction and restart it later. That's a bit different from the current interrupt semantics (wait for the current instruction to finish and then take the interrupt). Anyhow, it should still be doable, but I need to spend a little more time thinking it through carefully. Also, can we rely on firmware (the switcher) to remember the stop pointer? Doing it in hardware might be problematic, especially if there are nested interrupts (do we actually allow that?). If
As long as we can read the state, the switcher can preserve it across context switches. We currently do stack zeroing with interrupts disabled, so stalling the store until the state machine has caught up would be no worse than the current behaviour (though not ideal, it's fine for an initial version).
I think, on context switch, we'd need to save and restore two words of state: the top and the bottom of the range being zeroed. It's possible that CSP has been modified between the start and end and so it would be nice if we could capture this as a capability and a stop address. We currently have two CSRs for the stack high-water mark, the base (CSR_MSHWMB) and the current watermark (CSR_MSHWM). We modify the base only on context switch and we modify the top on call and return. For asynchronous stack zeroing, we additionally need the following state:
I would propose the following interface: Writing an address to a new CSR (protected with ASR permission) starts zeroing. This takes the Zcap from $csp, Zbase from the current CSR_MSHWM and Ztop from the new CSR_MSHWM value and updates CSR_MSHWM with the written value. If the value written here is not in the bounds of CSP, this traps. When an interrupt fires, the zeroing stops (can be after an in-flight store has retired if necessary) and must be restarted. For context switch, Zcap, Ztop, and Zbase are exposed as CSRs. We can define an order so that Ztop and Zbase must be written before Zcap and Zcap then triggers the zeroing to resume. This would let us store 16 bytes of extra state in the thread structure and have simple control flow in the switcher for all paths.
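To make the proposed interface concrete, here is a rough behavioural sketch in Python. It is not the hardware design: the class and method names (`ZeroingCSRs`, `write_start`, `save`, `restore`) are made up for illustration, the capability is reduced to a base/top pair with a tag, and treating the top address itself as in bounds is an assumption.

```python
from dataclasses import dataclass


class Trap(Exception):
    """Stands in for the trap raised on an out-of-bounds write."""


@dataclass
class Capability:
    base: int
    top: int
    tag: bool = True

    def in_bounds(self, addr):
        # Assumption: an address equal to the top counts as in bounds.
        return self.tag and self.base <= addr <= self.top


class ZeroingCSRs:
    """Hypothetical model of the start CSR plus Zcap/Ztop/Zbase."""

    def __init__(self, csp, mshwm):
        self.csp = csp          # current stack capability ($csp)
        self.mshwm = mshwm      # current high-water mark (CSR_MSHWM)
        self.zcap = None
        self.zbase = 0
        self.ztop = 0
        self.running = False

    def write_start(self, value):
        """Write to the new CSR: latch the state and start zeroing."""
        if not self.csp.in_bounds(value):
            raise Trap("address outside the bounds of CSP")
        self.zcap = self.csp
        self.zbase = self.mshwm   # Zbase from the current CSR_MSHWM
        self.ztop = value         # Ztop from the written value
        self.mshwm = value        # CSR_MSHWM takes the written value
        self.running = True

    def interrupt(self):
        """An interrupt stops zeroing; it must be restarted explicitly."""
        self.running = False

    def save(self):
        """Switcher reads the three CSRs on context switch."""
        return (self.zcap, self.ztop, self.zbase)

    def restore(self, zcap, ztop, zbase):
        """Ztop and Zbase are written first; writing Zcap resumes zeroing."""
        self.ztop = ztop
        self.zbase = zbase
        self.zcap = zcap
        self.running = zcap is not None and zcap.tag
```

In this model the switcher path is the same whether or not zeroing was in progress: it always calls `save()` on the way out and `restore()` on the way in, and an untagged or missing Zcap simply leaves the engine idle.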
The interface sounds good. Still thinking about the implications for load/store instructions. Wouldn't this mean we have to do two checks in parallel for each load/store, one against the capability referred to by the instruction, the other against the stack zeroing (stall if accessing a stack area not yet zeroed)? Or can we simply stall all loads/stores while zeroing is in progress?
We would need to do two comparisons, but only when the stack zeroing is in use. That will depend a bit on how many cross-compartment calls we use, but if we have a flip-flop that's set when the zeroing is finished then we can skip the additional checks once zeroing is done (and maybe power gate the comparators?). If we're concerned about area, we could make loads and stores take an additional cycle while zeroing is happening (it would still be faster than doing it synchronously) and reuse the comparators from the capability check. We don't need to stall loads, we just return zero. We need to stall stores.
The extra cycle idea sounds good: basically we can make every load/store take at least 2 cycles while zeroing the stack. We may still stall both loads and stores, since
1. I would like to make the first stalling decision as simple as possible for timing purposes, since it has to be done combinationally and the decision feeds into a lot of things. The subsequent decisions can be registered and are less critical.
2. It's true we don't have to stall loads when zeroing, but the extra cycle still buys more time for the address check logic to decide whether to issue the actual read. From a side-channel perspective we'd rather not issue the read at all (vs. reading and replacing the return data with zero).
The logic is kind of intricate and will take some time to implement, but I guess it is worth the effort since it would really overlap the thread start time with stack zeroing.
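The two-step decision described above can be sketched as a small Python function. This is only a behavioural illustration, not RTL: the function name, its parameters, and the half-open `[unscrubbed_lo, unscrubbed_hi)` range are assumptions made for the example.

```python
def access_decision(zeroing_active, addr, unscrubbed_lo, unscrubbed_hi):
    """Model the two-step decision for one load or store.

    Returns (extra_cycle, stall):
      extra_cycle - the access takes one additional cycle
      stall       - the access must keep waiting after that cycle
    """
    # Step 1 (combinational, kept trivial for timing): look only at the
    # single "zeroing active" flag, not at the address.
    if not zeroing_active:
        return (False, False)
    # Step 2 (registered, one cycle later): compare the address against
    # the range that has not been scrubbed yet. Loads and stores are
    # treated the same, so no read is issued to unscrubbed memory.
    in_unscrubbed = unscrubbed_lo <= addr < unscrubbed_hi
    return (True, in_unscrubbed)
```

Because the first step ignores the address, every access pays the extra cycle while zeroing is active, even when it targets memory outside the stack; only the second step decides whether it keeps waiting.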
From: David Chisnall ***@***.***>
Sent: Monday, July 31, 2023 11:10 PM
To: microsoft/cheriot-ibex ***@***.***>
Cc: Comment ***@***.***>; Manual ***@***.***>; Subscribed ***@***.***>
Subject: Re: [microsoft/cheriot-ibex] Faster stack clearing (Issue #5)
We would need to do two comparisons, but only when the stack zeroing is in use. That will depend a bit on how many cross-compartment calls we use, but if we have a flip-flop that's set when the zeroing is finished then we can skip the additional checks once zeroing is done (and maybe power gate the comparators?). If we're concerned about area, we could make loads and stores take an additional cycle while zeroing is happening (it would still be faster than doing it synchronously) and reuse the comparators from the capability check.
We don't need to stall loads, we just return zero. We need to stall stores.
@davidchisnall, in the case of an interrupt arriving while a load/store (targeting the unscrubbed area) is still stalling, we can either abort and throw a fault (mepcc updated with the address of the aborted instruction and mcause set to CHERI fault), or treat it more like a normal interrupt (mepcc points to the next instruction and firmware has to restart). Which way do you prefer? In both cases we can still make sure that the memory access doesn't really happen. Also, for now can we stall loads to the unscrubbed stack region as well? I know we can return zeroes, but stalling might be simpler for hardware, and I assume software normally won't read the unscrubbed stack, so it's not a performance concern?
Actually, a couple more questions
For an asynchronous interrupt, if the load / store hasn't happened then we want the MEPCC to point to the instruction that should be restarted.
I'm not sure how common this will be in software. Maybe stall for now but add a couple of performance counters to see for how long we're stalling for (on loads and stores).
They're unrelated code paths in software. The scrubber is more urgent though so it's fine to suspend the revoker while the scrubber is running and use a single load/store pipeline for both.
We want to stop when we context switch and resume when the suspended thread is resumed, so the switcher needs to be able to run it before and after.
My assumption was that both ztop and zbase are just addresses, whereas zcap is the capability that authorises them.
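The stall counters suggested above could be modelled roughly as follows. This is a hypothetical sketch: the class name, the two counter names, and the per-cycle `record` call are all assumptions, and real counters would be exposed through whatever performance-counter mechanism the core already has.

```python
class StallCounters:
    """Hypothetical pair of counters for cycles stalled on unscrubbed stack."""

    def __init__(self):
        self.load_stall_cycles = 0
        self.store_stall_cycles = 0

    def record(self, is_store, stalled):
        """Call once per cycle for the access currently in the pipeline."""
        if not stalled:
            return
        if is_store:
            self.store_stall_cycles += 1
        else:
            self.load_stall_cycles += 1
```

Keeping loads and stores separate would show whether software reads the unscrubbed region often enough to justify returning zero instead of stalling.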
Ok I added the feature in new commit (228c615). FPGA build looks okay.
@davidchisnall, @nwf-msr, I realized there are still a few things to be sorted out when we switched to ztop as an SCR.
On context switch, we want to stop zeroing as soon as we switch away from the thread, for two reasons:
We currently always swap the CSR with null on interrupt and write the new thread’s value when resuming. As long as writing null stops, this is fine.
That’s perfect, I can delete a conditional branch from the switcher code with this guarantee.
We don’t need this, since we will explicitly stop it a few instructions into the interrupt handler. It doesn’t hurt though.
The ideal behaviour here would be to not fault, but to rewind the
That should be fine. We treat the zeroization state as just another part of thread state. As long as we can stop it for a region and restart it for a region, we don’t care if one of those regions is empty. We just spill and reload the untagged value and the hardware ignores it. If we interrupt a thread during zeroing, we store its ztop and resume it later.
Rewinding mepcc is the fault behavior, though. And since we only support direct (non-vectored) exceptions, it doesn't seem much different to me. We can certainly use a different mcause to signal that this is a special case?
The thing that I don't like is needing to enter the interrupt handler twice to deliver the interrupt, once for the fault and once for the interrupt. Ideally, we'd just report the interrupt, but with an MEPCC value that meant that we could resume. I don't really want to get a fault here, because it isn't a fault. A fault will trigger the error-handling code paths, but there wasn't an error and so they will do the wrong thing. If we had a different error code, we could just fall into the interrupt code path, but we don't need the interrupt. Ideally, we'd have the interrupt cause, but the MEPCC set to the correct value for a fault.
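The behaviour asked for here (interrupt cause, but a restartable MEPCC) can be illustrated with a tiny Python model. The function and parameter names are made up for the example; the only fixed point is the standard RISC-V convention that the top bit of mcause marks an interrupt, shown here for a 32-bit core.

```python
INTERRUPT_BIT = 1 << 31  # top bit of mcause on RV32 marks an interrupt


def take_interrupt(pc, instr_len, stalled_on_zeroing, irq_cause):
    """Return the trap state for an interrupt taken at instruction `pc`.

    If the instruction was stalled waiting for zeroing, its memory access
    never happened, so MEPCC points at it and it is restarted on return.
    Otherwise the instruction completes first and MEPCC points past it.
    """
    mepcc = pc if stalled_on_zeroing else pc + instr_len
    # The cause is the interrupt in both cases, never a fault code, so the
    # handler is entered once and takes the normal interrupt path.
    return {"mepcc": mepcc, "mcause": INTERRUPT_BIT | irq_cause}
```

With this shape the switcher does not need to distinguish the two cases: returning from the handler re-executes the stalled load or store, which stalls again if the restored zeroing has not yet caught up.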
Writing up the discussion and adding a few more thoughts:
We zero a chunk of the stack on every call and return from cross-compartment calls. It would be nice to have a state machine that zeroes a range of memory, starting at the top and moving downwards. This will have a top and a bottom, where the top is moved downwards on each store. If the main pipeline loads between the top and the bottom, it should read zeroes. If the main pipeline stores between the top and the bottom, it should stall until the top has moved past the location of the store.
In the common case, this state machine would zero enough of the stack in the background that the next function to run would not block.
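As a rough illustration of the state machine described above, here is a toy Python model working at word granularity. It is a sketch under assumptions, not a hardware description: the class name, the list standing in for memory, and the rule that one background store happens per stalled cycle are all invented for the example.

```python
class StackZeroer:
    """Toy model of a background engine that zeroes [bottom, top) downwards."""

    def __init__(self, memory):
        self.memory = memory   # list of words standing in for stack memory
        self.bottom = 0        # inclusive lower bound of the pending range
        self.top = 0           # exclusive upper bound of the pending range

    def start(self, bottom, top):
        self.bottom, self.top = bottom, top

    def busy(self):
        return self.top > self.bottom

    def pending(self, addr):
        """True if `addr` has not been zeroed yet."""
        return self.bottom <= addr < self.top

    def step(self):
        """One background store: zero the word below top, move top down."""
        if self.busy():
            self.top -= 1
            self.memory[self.top] = 0

    def load(self, addr):
        # A load from the pending range reads zero without waiting.
        return 0 if self.pending(addr) else self.memory[addr]

    def store(self, addr, value):
        """Perform a store; return how many cycles it had to stall."""
        stalls = 0
        while self.pending(addr):
            self.step()        # zeroing keeps running while the store waits
            stalls += 1
        self.memory[addr] = value
        return stalls
```

For example, after `start(2, 6)` on an eight-word memory, a load from word 3 returns zero immediately even though the word has not been overwritten yet, while a store to word 4 waits until the top has moved down past word 4.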
If we context switch (take an interrupt) then we need to be able to pause this zeroing and resume it later.
Ideally, we'd integrate this with the stack high watermark control. In normal operation, we will:
Can we combine these operations so that moving the stack high-water mark is sufficient to start the revocation, using the current $csp as the authorising capability?