Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add extra information into exception reporting #373

Merged
merged 17 commits into from
Sep 23, 2024

Conversation

tariqkurd-repo
Copy link
Collaborator

We need to have the exception faulting address when a CHERI data memory access faults.
Not having so is causing problems, so this adds two fields:

mtval.CAP_IDX - which cap caused the fault
mtval2 - XLEN wide address which faulted (already a standard CSR)
stval2 - as mtval2

@tariqkurd-repo tariqkurd-repo changed the title add extra information into exception reoprting add extra information into exception reporting Sep 19, 2024
Copy link
Collaborator

@arichardson arichardson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing the vstval2reg.edn file otherwise LGTM.

@jrtc27
Copy link
Collaborator

jrtc27 commented Sep 19, 2024

Why do you need the address, and why do you need CAP_IDX? That was always meant to be a crutch that went away in the standard version.

@tariqkurd-repo
Copy link
Collaborator Author

Why do you need the address, and why do you need CAP_IDX? That was always meant to be a crutch that went away in the standard version.

So David asked me to post the case where I would have wanted the fault address of a CHERI fault to be available:
On a (normal) page fault the linux kernel has this check:
Well the linux kernel does this on a page fault in kernel mode:

    if (!user_mode(regs) && addr < TASK_SIZE && unlikely(!(regs->status & SR_SUM))) {
            if (fixup_exception(regs))
                    return;

            die_kernel_fault("access to user memory without uaccess routines", addr, regs);
    }

Ignoring the SR_SUM check as it is irrelevant to the discussion what this does on a (normal page fault is):

Check if the code that triggered the fault is executing in kernel mode (!user_mode(regs))
Check that the address that causes the fault is a user address (addr < TASK_SIZE)
Here addr is the fault address provided by hardware on a page fault.
The fixup_exception() will then check if the instruction pointer is in one of the user_copy routines and if this is true will abort the user copy operation with an error that will eventually be reported to the user as an EFAULT return value from the attempted syscall.
Now if we do a copy_to_user in CHERI, we want the user to provide a capability for the target of the copy and the kernel will use that capability as it would use an address.
If the user provided an invalid or unsuitable capability the result should be the same as for a page fault (at least that's what I think).
I.e. we don't want the user process to abort with a CHERI fault. Instead we want to terminate the syscall with an error and let user mode handle it.
The most prominent example where this showed up is a user mode program (seen in various syscall test suites) passing in a NULL pointer to read simply to test the EFAULT behaviour but of course the case of a real invalid capability must be handled as well.
Now, if we want to do the same check for a CHERI fault we would also need to know if we are executing in kernel mode and we would need to check the fault address against the user space address limit. The latter is what I want the fault address for.

@jrtc27
Copy link
Collaborator

jrtc27 commented Sep 20, 2024

We do not need the faulting address in order to disambiguate the various cases in CheriBSD. I am perplexed that there is the belief Linux does.

@jrtc27
Copy link
Collaborator

jrtc27 commented Sep 20, 2024

If you take a capability fault and PC is within copyin/out (copy_to/from_user) then you return an error from the routines. If you take a capability fault and PC is not within copyin/out then that's a kernel error (whether because a kernel access was junk or because it failed to use copyin/out) and deserves a panic.

@tariqkurd-repo
Copy link
Collaborator Author

another issue is whether we want to carry CAP_IDX down the pipe, our feeling is that just reporting the address in Xtval2 is enough for the reported case

@djmsip
Copy link

djmsip commented Sep 20, 2024

I haven't looked at this particular case in any detail, but my understanding is that having the address allows the code to basically use the same path as the MMU fault. Without the address we have to omit the check the code is doing on the MMU fault case in the CHERI case. As to why the check is done, that is a good question which we are looking into, but all the arches do it.

There are a few other points on this as well:

  • Most of the people who have been working on this have complained that the faulting address is not available easily
  • It does help with debugging in my opinion.
  • This will be the only memory access trap that does not report the faulting address as far as I know.
  • All the information is already available as I understand it, so it is not hard to implement
  • I'm concerned that there may be some other operating systems or use case that may find this useful - I just don't know. Can we say we will never need the faulting address?

Of course, we can always find out what the address is by looking at the instruction and operands (and potentially the boundaries of the instruction as well) , but given the machine already knows all this, why not make the information available? Better a crutch now than legless later. Feel free to shoot me down:-)

@tariqkurd-repo
Copy link
Collaborator Author

I haven't looked at this particular case in any detail, but my understanding is that having the address allows the code to basically use the same path as the MMU fault. Without the address we have to omit the check the code is doing on the MMU fault case in the CHERI case. As to why the check is done, that is a good question which we are looking into, but all the arches do it.

There are a few other points on this as well:

  • Most of the people who have been working on this have complained that the faulting address is not available easily
  • It does help with debugging in my opinion.
  • This will be the only memory access trap that does not report the faulting address as far as I know.
  • All the information is already available as I understand it, so it is not hard to implement
  • I'm concerned that there may be some other operating systems or use case that may find this useful - I just don't know. Can we say we will never need the faulting address?

Of course, we can always find out what the address is by looking at the instruction and operands (and potentially the boundaries of the instruction as well) , but given the machine already knows all this, why not make the information available? Better a crutch now than legless later. Feel free to shoot me down:-)

The address is already available, as if we take (e.g.) a page fault we'll report it. It's hard to justify not reporting it in this case, as many people report wanting to know it.

@jrtc27
Copy link
Collaborator

jrtc27 commented Sep 20, 2024

The reason information is communicated to trap handlers for page faults is so that you can do all the usual CoW, swapping, etc tricks that rely on knowing what page is being accessed. Yes, we can provide the implementation, but we can provide lots of other things too. There is a concern in my head that, if you provide the information, it implies to people that they should be using it to recover from the trap and resume execution, and without the full capability what are you going to do with it beyond say "yes that's an address"? The only thing you can usefully do is change what errors you report, but you still need to report an error, and it's not clear to me that it's useful to differentiate cases, especially since you can't trust anything about the address other than that it was the address the program attempted to use, and you don't have the capability that was used as the authority.

@tariqkurd-repo
Copy link
Collaborator Author

tariqkurd-repo commented Sep 20, 2024

The reason information is communicated to trap handlers for page faults is so that you can do all the usual CoW, swapping, etc tricks that rely on knowing what page is being accessed. Yes, we can provide the implementation, but we can provide lots of other things too. There is a concern in my head that, if you provide the information, it implies to people that they should be using it to recover from the trap and resume execution, and without the full capability what are you going to do with it beyond say "yes that's an address"? The only thing you can usefully do is change what errors you report, but you still need to report an error, and it's not clear to me that it's useful to differentiate cases, especially since you can't trust anything about the address other than that it was the address the program attempted to use, and you don't have the capability that was used as the authority.

The viable alternative is to make all the new CSRs cap width and include the metadata with the tag cleared as we don't want to add rep checks. If the rep check would have failed on the access (particularly for a vector indexed load store where the index is very large) is the metadata useful? I suspect that it is useful in almost all cases.

Then the information is more complete.

Again - the address is already reported in many cases, so is trivial to add. Adding the metadata is also ok.

but we can provide lots of other things too.

no-one has asked for anything else AFAIK, and this one has come up repeatedly in software development. It's important for this spec to be usable by developers and the feedback I'm getting is that it has this problem we need to address.

@djmsip
Copy link

djmsip commented Sep 20, 2024

Ok, that is a different argument, about intent, which I hadn't thought of I confess. Personally, I don't think we can stop people doing dumb things, after all they could do that by decoding the instruction. I think it is useful to have the address (metadata would useful as well I think) for ease of debugging if nothing else. In this case, IIUC it means the changes for CHERI are a bit less as we can common up more code, though I need to look at the code in depth. I don't like the fact that the program has tried to access something and we don't immediately know what, even though the machine does.

Copy link
Collaborator

@andresag01 andresag01 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Happy with the changes. It would be good to remove the comments and addressing some of the minor typos/comments before merging though

Copy link
Collaborator

@arichardson arichardson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think reporting the address so you can put it in the SEGV addr field without decoding the instruction makes a lot of sense.

Capidx is just nice to have information that I guess could be dropped and we'd force GDB to decode the instruction instead.

@bsdjhb what do you think?

tariqkurd-repo and others added 3 commits September 23, 2024 09:56
Co-authored-by: Andres Amaya Garcia <[email protected]>
Signed-off-by: Tariq Kurd <[email protected]>
Co-authored-by: Andres Amaya Garcia <[email protected]>
Signed-off-by: Tariq Kurd <[email protected]>
Co-authored-by: Andres Amaya Garcia <[email protected]>
Signed-off-by: Tariq Kurd <[email protected]>
@tariqkurd-repo tariqkurd-repo merged commit 21a53e6 into riscv:main Sep 23, 2024
3 checks passed
@bsdjhb
Copy link

bsdjhb commented Sep 24, 2024

As a general comment, yes, I do think it would be useful to have the faulting address. I have need of it in a CheriBSD PR where I convert CHERI faults for hybrid processes into SIGSEGV instead of SIGPROT. The current PR works on Morello but not on CHERI RISC-V due to not having the faulting address to hand.

CAPIDX is not really needed IMO. Morello doesn't provide it, and any adaptation of CHERI to x86 can never provide it (not all memory operands in x86 are in registers). In general for a debugger, a user can just look at the faulting instruction disassembly to work out which register is involved. More typically a user will be looking at the source line that faulted and printing variables anyway. If you do not need a separate error code field in Xtval (e.g. if you are using separate Xcause codes for tag vs length faults like Morello), then I would suggest just storing the effective faulting VA in Xtval similar to MMU faults.

@bsdjhb
Copy link

bsdjhb commented Sep 24, 2024

For reference, the CheriBSD PR I mentioned: CTSRD-CHERI/cheribsd#2145

@jrtc27
Copy link
Collaborator

jrtc27 commented Sep 24, 2024

Hm, the SIGSEGV points are fair. In that case, I would suggest putting the address in xtval and the error code in xtval2 to avoid special casing in both hardware and software.

@tariqkurd-repo
Copy link
Collaborator Author

It would be more regular to have the address in Xtval, certainly

@djmsip
Copy link

djmsip commented Sep 25, 2024

I think we may as well move it now, it does make it more regular and I suspect it would get shifted in review anyway.

andresag01 added a commit that referenced this pull request Sep 26, 2024
Swap the roles of xtval and xtval2 such that xtval holds the address on
a CHERI fault and xtval2 holds the extra CHERI cause and type
information for the exception. The following also had to change along
the way:

* Update old references to xtval that must now point to xtval2
* Add `htval2` because otherwise we do not get any CHERI information in
HS-mode
* Adjust diagrams and descriptions for xtval and xtval2

This PR is a follow up to #373

---------

Signed-off-by: Tariq Kurd <[email protected]>
Signed-off-by: Andres Amaya Garcia <[email protected]>
Co-authored-by: Tariq Kurd <[email protected]>
tariqkurd-repo added a commit to tariqkurd-repo/riscv-cheri that referenced this pull request Oct 9, 2024
We need to have the exception faulting address when a CHERI data memory
access faults.
Not having so is causing problems, so this adds two fields:

mtval.CAP_IDX - which cap caused the fault
mtval2 - XLEN wide address which faulted (already a standard CSR)
stval2 - as mtval2

---------

Signed-off-by: Tariq Kurd <[email protected]>
Co-authored-by: Andres Amaya Garcia <[email protected]>
tariqkurd-repo added a commit to tariqkurd-repo/riscv-cheri that referenced this pull request Oct 9, 2024
Swap the roles of xtval and xtval2 such that xtval holds the address on
a CHERI fault and xtval2 holds the extra CHERI cause and type
information for the exception. The following also had to change along
the way:

* Update old references to xtval that must now point to xtval2
* Add `htval2` because otherwise we do not get any CHERI information in
HS-mode
* Adjust diagrams and descriptions for xtval and xtval2

This PR is a follow up to riscv#373

---------

Signed-off-by: Tariq Kurd <[email protected]>
Signed-off-by: Andres Amaya Garcia <[email protected]>
Co-authored-by: Tariq Kurd <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants