add extra information into exception reporting #373

tariqkurd-repo · 2024-09-19T12:57:55Z

We need to have the exception faulting address when a CHERI data memory access faults.
Not having so is causing problems, so this adds two fields:

mtval.CAP_IDX - which cap caused the fault
mtval2 - XLEN wide address which faulted (already a standard CSR)
stval2 - as mtval2

src/riscv-integration.adoc

arichardson

Missing the vstval2reg.edn file otherwise LGTM.

jrtc27 · 2024-09-19T21:22:42Z

Why do you need the address, and why do you need CAP_IDX? That was always meant to be a crutch that went away in the standard version.

src/riscv-integration.adoc

src/img/stval2reg.edn

tariqkurd-repo · 2024-09-20T08:16:51Z

Why do you need the address, and why do you need CAP_IDX? That was always meant to be a crutch that went away in the standard version.

So David asked me to post the case where I would have wanted the fault address of a CHERI fault to be available:
On a (normal) page fault the linux kernel has this check:
Well the linux kernel does this on a page fault in kernel mode:

    if (!user_mode(regs) && addr < TASK_SIZE && unlikely(!(regs->status & SR_SUM))) {
            if (fixup_exception(regs))
                    return;

            die_kernel_fault("access to user memory without uaccess routines", addr, regs);
    }

Ignoring the SR_SUM check as it is irrelevant to the discussion what this does on a (normal page fault is):

Check if the code that triggered the fault is executing in kernel mode (!user_mode(regs))
Check that the address that causes the fault is a user address (addr < TASK_SIZE)
Here addr is the fault address provided by hardware on a page fault.
The fixup_exception() will then check if the instruction pointer is in one of the user_copy routines and if this is true will abort the user copy operation with an error that will eventually be reported to the user as an EFAULT return value from the attempted syscall.
Now if we do a copy_to_user in CHERI, we want the user to provide a capability for the target of the copy and the kernel will use that capability as it would use an address.
If the user provided an invalid or unsuitable capability the result should be the same as for a page fault (at least that's what I think).
I.e. we don't want the user process to abort with a CHERI fault. Instead we want to terminate the syscall with an error and let user mode handle it.
The most prominent example where this showed up is a user mode program (seen in various syscall test suites) passing in a NULL pointer to read simply to test the EFAULT behaviour but of course the case of a real invalid capability must be handled as well.
Now, if we want to do the same check for a CHERI fault we would also need to know if we are executing in kernel mode and we would need to check the fault address against the user space address limit. The latter is what I want the fault address for.

jrtc27 · 2024-09-20T08:25:31Z

We do not need the faulting address in order to disambiguate the various cases in CheriBSD. I am perplexed that there is the belief Linux does.

jrtc27 · 2024-09-20T08:29:48Z

If you take a capability fault and PC is within copyin/out (copy_to/from_user) then you return an error from the routines. If you take a capability fault and PC is not within copyin/out then that's a kernel error (whether because a kernel access was junk or because it failed to use copyin/out) and deserves a panic.

tariqkurd-repo · 2024-09-20T10:10:43Z

another issue is whether we want to carry CAP_IDX down the pipe, our feeling is that just reporting the address in Xtval2 is enough for the reported case

djmsip · 2024-09-20T12:19:48Z

I haven't looked at this particular case in any detail, but my understanding is that having the address allows the code to basically use the same path as the MMU fault. Without the address we have to omit the check the code is doing on the MMU fault case in the CHERI case. As to why the check is done, that is a good question which we are looking into, but all the arches do it.

There are a few other points on this as well:

Most of the people who have been working on this have complained that the faulting address is not available easily
It does help with debugging in my opinion.
This will be the only memory access trap that does not report the faulting address as far as I know.
All the information is already available as I understand it, so it is not hard to implement
I'm concerned that there may be some other operating systems or use case that may find this useful - I just don't know. Can we say we will never need the faulting address?

Of course, we can always find out what the address is by looking at the instruction and operands (and potentially the boundaries of the instruction as well) , but given the machine already knows all this, why not make the information available? Better a crutch now than legless later. Feel free to shoot me down:-)

tariqkurd-repo · 2024-09-20T12:41:35Z

I haven't looked at this particular case in any detail, but my understanding is that having the address allows the code to basically use the same path as the MMU fault. Without the address we have to omit the check the code is doing on the MMU fault case in the CHERI case. As to why the check is done, that is a good question which we are looking into, but all the arches do it.

There are a few other points on this as well:

Most of the people who have been working on this have complained that the faulting address is not available easily

It does help with debugging in my opinion.

This will be the only memory access trap that does not report the faulting address as far as I know.

All the information is already available as I understand it, so it is not hard to implement

I'm concerned that there may be some other operating systems or use case that may find this useful - I just don't know. Can we say we will never need the faulting address?

Of course, we can always find out what the address is by looking at the instruction and operands (and potentially the boundaries of the instruction as well) , but given the machine already knows all this, why not make the information available? Better a crutch now than legless later. Feel free to shoot me down:-)

The address is already available, as if we take (e.g.) a page fault we'll report it. It's hard to justify not reporting it in this case, as many people report wanting to know it.

jrtc27 · 2024-09-20T12:51:06Z

The reason information is communicated to trap handlers for page faults is so that you can do all the usual CoW, swapping, etc tricks that rely on knowing what page is being accessed. Yes, we can provide the implementation, but we can provide lots of other things too. There is a concern in my head that, if you provide the information, it implies to people that they should be using it to recover from the trap and resume execution, and without the full capability what are you going to do with it beyond say "yes that's an address"? The only thing you can usefully do is change what errors you report, but you still need to report an error, and it's not clear to me that it's useful to differentiate cases, especially since you can't trust anything about the address other than that it was the address the program attempted to use, and you don't have the capability that was used as the authority.

tariqkurd-repo · 2024-09-20T13:24:57Z

The reason information is communicated to trap handlers for page faults is so that you can do all the usual CoW, swapping, etc tricks that rely on knowing what page is being accessed. Yes, we can provide the implementation, but we can provide lots of other things too. There is a concern in my head that, if you provide the information, it implies to people that they should be using it to recover from the trap and resume execution, and without the full capability what are you going to do with it beyond say "yes that's an address"? The only thing you can usefully do is change what errors you report, but you still need to report an error, and it's not clear to me that it's useful to differentiate cases, especially since you can't trust anything about the address other than that it was the address the program attempted to use, and you don't have the capability that was used as the authority.

The viable alternative is to make all the new CSRs cap width and include the metadata with the tag cleared as we don't want to add rep checks. If the rep check would have failed on the access (particularly for a vector indexed load store where the index is very large) is the metadata useful? I suspect that it is useful in almost all cases.

Then the information is more complete.

Again - the address is already reported in many cases, so is trivial to add. Adding the metadata is also ok.

but we can provide lots of other things too.

no-one has asked for anything else AFAIK, and this one has come up repeatedly in software development. It's important for this spec to be usable by developers and the feedback I'm getting is that it has this problem we need to address.

djmsip · 2024-09-20T14:02:34Z

Ok, that is a different argument, about intent, which I hadn't thought of I confess. Personally, I don't think we can stop people doing dumb things, after all they could do that by decoding the instruction. I think it is useful to have the address (metadata would useful as well I think) for ease of debugging if nothing else. In this case, IIUC it means the changes for CHERI are a bit less as we can common up more code, though I need to look at the code in depth. I don't like the fact that the program has tried to access something and we don't immediately know what, even though the machine does.

src/riscv-integration.adoc

src/hypervisor-integration.adoc

src/riscv-integration.adoc

andresag01

Happy with the changes. It would be good to remove the comments and addressing some of the minor typos/comments before merging though

arichardson

I think reporting the address so you can put it in the SEGV addr field without decoding the instruction makes a lot of sense.

Capidx is just nice to have information that I guess could be dropped and we'd force GDB to decode the instruction instead.

@bsdjhb what do you think?

Co-authored-by: Andres Amaya Garcia <[email protected]> Signed-off-by: Tariq Kurd <[email protected]>

src/riscv-integration.adoc

Signed-off-by: Tariq Kurd <[email protected]>

Co-authored-by: Andres Amaya Garcia <[email protected]> Signed-off-by: Tariq Kurd <[email protected]>

src/riscv-integration.adoc

Signed-off-by: Tariq Kurd <[email protected]>

…xceptions

bsdjhb · 2024-09-24T15:53:16Z

As a general comment, yes, I do think it would be useful to have the faulting address. I have need of it in a CheriBSD PR where I convert CHERI faults for hybrid processes into SIGSEGV instead of SIGPROT. The current PR works on Morello but not on CHERI RISC-V due to not having the faulting address to hand.

CAPIDX is not really needed IMO. Morello doesn't provide it, and any adaptation of CHERI to x86 can never provide it (not all memory operands in x86 are in registers). In general for a debugger, a user can just look at the faulting instruction disassembly to work out which register is involved. More typically a user will be looking at the source line that faulted and printing variables anyway. If you do not need a separate error code field in Xtval (e.g. if you are using separate Xcause codes for tag vs length faults like Morello), then I would suggest just storing the effective faulting VA in Xtval similar to MMU faults.

bsdjhb · 2024-09-24T15:54:08Z

For reference, the CheriBSD PR I mentioned: CTSRD-CHERI/cheribsd#2145

jrtc27 · 2024-09-24T16:01:38Z

Hm, the SIGSEGV points are fair. In that case, I would suggest putting the address in xtval and the error code in xtval2 to avoid special casing in both hardware and software.

tariqkurd-repo · 2024-09-24T16:04:18Z

It would be more regular to have the address in Xtval, certainly

djmsip · 2024-09-25T11:28:31Z

I think we may as well move it now, it does make it more regular and I suspect it would get shifted in review anyway.

Swap the roles of xtval and xtval2 such that xtval holds the address on a CHERI fault and xtval2 holds the extra CHERI cause and type information for the exception. The following also had to change along the way: * Update old references to xtval that must now point to xtval2 * Add `htval2` because otherwise we do not get any CHERI information in HS-mode * Adjust diagrams and descriptions for xtval and xtval2 This PR is a follow up to #373 --------- Signed-off-by: Tariq Kurd <[email protected]> Signed-off-by: Andres Amaya Garcia <[email protected]> Co-authored-by: Tariq Kurd <[email protected]>

We need to have the exception faulting address when a CHERI data memory access faults. Not having so is causing problems, so this adds two fields: mtval.CAP_IDX - which cap caused the fault mtval2 - XLEN wide address which faulted (already a standard CSR) stval2 - as mtval2 --------- Signed-off-by: Tariq Kurd <[email protected]> Co-authored-by: Andres Amaya Garcia <[email protected]>

Swap the roles of xtval and xtval2 such that xtval holds the address on a CHERI fault and xtval2 holds the extra CHERI cause and type information for the exception. The following also had to change along the way: * Update old references to xtval that must now point to xtval2 * Add `htval2` because otherwise we do not get any CHERI information in HS-mode * Adjust diagrams and descriptions for xtval and xtval2 This PR is a follow up to riscv#373 --------- Signed-off-by: Tariq Kurd <[email protected]> Signed-off-by: Andres Amaya Garcia <[email protected]> Co-authored-by: Tariq Kurd <[email protected]>

add extra information into exception reoprting

e1a458a

tariqkurd-repo changed the title ~~add extra information into exception reoprting~~ add extra information into exception reporting Sep 19, 2024

tariqkurd-repo requested review from arichardson and andresag01 September 19, 2024 13:01

andresag01 reviewed Sep 19, 2024

View reviewed changes

src/riscv-integration.adoc Show resolved Hide resolved

clarify that stval fields match mtval for cheri faults

99e7c82

andresag01 reviewed Sep 19, 2024

View reviewed changes

src/riscv-integration.adoc Outdated Show resolved Hide resolved

andresag01 reviewed Sep 19, 2024

View reviewed changes

src/riscv-integration.adoc Outdated Show resolved Hide resolved

clarify mtval2 address

e5c2864

arichardson reviewed Sep 19, 2024

View reviewed changes

src/riscv-integration.adoc Outdated Show resolved Hide resolved

add vstval2

1ff37d1

arichardson approved these changes Sep 19, 2024

View reviewed changes

carlrobshaw reviewed Sep 19, 2024

View reviewed changes

src/riscv-integration.adoc Outdated Show resolved Hide resolved

carlrobshaw reviewed Sep 19, 2024

View reviewed changes

src/riscv-integration.adoc Outdated Show resolved Hide resolved

carlrobshaw reviewed Sep 19, 2024

View reviewed changes

src/img/stval2reg.edn Show resolved Hide resolved

tariqkurd-repo added 3 commits September 20, 2024 09:47

add missing file

a3550a7

move vstval2 to hypervisor section

e43c090

clarify it's the effective address

acea223

Add vstval and remove cap_idx from Xtval2

9dc0424

add new file

31e1376

andresag01 reviewed Sep 20, 2024

View reviewed changes

src/riscv-integration.adoc Outdated Show resolved Hide resolved

andresag01 reviewed Sep 20, 2024

View reviewed changes

src/riscv-integration.adoc Outdated Show resolved Hide resolved

andresag01 reviewed Sep 20, 2024

View reviewed changes

src/hypervisor-integration.adoc Outdated Show resolved Hide resolved

andresag01 reviewed Sep 20, 2024

View reviewed changes

src/hypervisor-integration.adoc Outdated Show resolved Hide resolved

andresag01 reviewed Sep 20, 2024

View reviewed changes

src/hypervisor-integration.adoc Outdated Show resolved Hide resolved

andresag01 reviewed Sep 20, 2024

View reviewed changes

src/hypervisor-integration.adoc Outdated Show resolved Hide resolved

andresag01 reviewed Sep 20, 2024

View reviewed changes

src/riscv-integration.adoc Outdated Show resolved Hide resolved

andresag01 reviewed Sep 20, 2024

View reviewed changes

src/riscv-integration.adoc Outdated Show resolved Hide resolved

andresag01 approved these changes Sep 20, 2024

View reviewed changes

arichardson approved these changes Sep 20, 2024

View reviewed changes

tariqkurd-repo and others added 3 commits September 23, 2024 09:56

Update src/riscv-integration.adoc

332b2bc

Co-authored-by: Andres Amaya Garcia <[email protected]> Signed-off-by: Tariq Kurd <[email protected]>

Update src/hypervisor-integration.adoc

db7239f

Co-authored-by: Andres Amaya Garcia <[email protected]> Signed-off-by: Tariq Kurd <[email protected]>

Update src/hypervisor-integration.adoc

7e3f642

Co-authored-by: Andres Amaya Garcia <[email protected]> Signed-off-by: Tariq Kurd <[email protected]>

tariqkurd-repo commented Sep 23, 2024

View reviewed changes

src/riscv-integration.adoc Outdated Show resolved Hide resolved

tariqkurd-repo and others added 3 commits September 23, 2024 10:09

Update src/riscv-integration.adoc

50d46f5

Signed-off-by: Tariq Kurd <[email protected]>

remove comments

d816c67

Update src/hypervisor-integration.adoc

ee7b991

Co-authored-by: Andres Amaya Garcia <[email protected]> Signed-off-by: Tariq Kurd <[email protected]>

tariqkurd-repo commented Sep 23, 2024

View reviewed changes

src/riscv-integration.adoc Outdated Show resolved Hide resolved

tariqkurd-repo and others added 2 commits September 23, 2024 15:47

Update src/riscv-integration.adoc

e2fec8b

Signed-off-by: Tariq Kurd <[email protected]>

clarify that Xtval2 regs are all updated the same way for all cheri e…

ac69a96

…xceptions

tariqkurd-repo merged commit 21a53e6 into riscv:main Sep 23, 2024
3 checks passed

andresag01 mentioned this pull request Sep 25, 2024

Swap roles of xtval and xtval2 #379

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

add extra information into exception reporting #373

add extra information into exception reporting #373

tariqkurd-repo commented Sep 19, 2024

arichardson left a comment

jrtc27 commented Sep 19, 2024

tariqkurd-repo commented Sep 20, 2024

jrtc27 commented Sep 20, 2024

jrtc27 commented Sep 20, 2024

tariqkurd-repo commented Sep 20, 2024

djmsip commented Sep 20, 2024

tariqkurd-repo commented Sep 20, 2024

jrtc27 commented Sep 20, 2024

tariqkurd-repo commented Sep 20, 2024 •

edited

Loading

djmsip commented Sep 20, 2024

andresag01 left a comment

arichardson left a comment

bsdjhb commented Sep 24, 2024

bsdjhb commented Sep 24, 2024

jrtc27 commented Sep 24, 2024

tariqkurd-repo commented Sep 24, 2024

djmsip commented Sep 25, 2024

add extra information into exception reporting #373

add extra information into exception reporting #373

Conversation

tariqkurd-repo commented Sep 19, 2024

arichardson left a comment

Choose a reason for hiding this comment

jrtc27 commented Sep 19, 2024

tariqkurd-repo commented Sep 20, 2024

jrtc27 commented Sep 20, 2024

jrtc27 commented Sep 20, 2024

tariqkurd-repo commented Sep 20, 2024

djmsip commented Sep 20, 2024

tariqkurd-repo commented Sep 20, 2024

jrtc27 commented Sep 20, 2024

tariqkurd-repo commented Sep 20, 2024 • edited Loading

djmsip commented Sep 20, 2024

andresag01 left a comment

Choose a reason for hiding this comment

arichardson left a comment

Choose a reason for hiding this comment

bsdjhb commented Sep 24, 2024

bsdjhb commented Sep 24, 2024

jrtc27 commented Sep 24, 2024

tariqkurd-repo commented Sep 24, 2024

djmsip commented Sep 25, 2024

tariqkurd-repo commented Sep 20, 2024 •

edited

Loading