Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[edn] Elevate CSRNG status errors to fatal alerts #15894

Closed
cindychip opened this issue Nov 1, 2022 · 18 comments · Fixed by #21280
Closed

[edn] Elevate CSRNG status errors to fatal alerts #15894

cindychip opened this issue Nov 1, 2022 · 18 comments · Fixed by #21280
Assignees
Labels
Component:RTL Earlgrey-PROD Candidate Temporary label to triage issues into Earlgrey-PROD Milestones Hotlist:Security Security Opinion Needed IP:edn Subsystem:Entropy entropy_src, csrng, or edn related issues

Comments

@cindychip
Copy link
Contributor

cindychip commented Nov 1, 2022

I would like to confirm the behavior when EDN encounters csrng errors (csrng_rsp_sts):

1). On the EDN side, upon receiving a csrng_rsp_sts error, it will report to cmd_sts register. But will still distribute the received data to all EDN requesters.

2). On CSRNG side, (only from reading the spec, I do not actually know the simulation results). It seems like CSRNG will fire an interrupt.

I would like to confirm if these behaviors are enough to treat the csrng_rsp_sts error?

@tjaychen @moidx @martin-lueker @mwbranstad

Thanks,
Cindy

estimate 16

@cindychip cindychip added the Type:Question Questions label Nov 1, 2022
@cindychip cindychip added this to the Project: M3 milestone Nov 1, 2022
@tjaychen
Copy link

tjaychen commented Nov 1, 2022

it seems like to me that if we encounter this situation, the only way to handle right now is to completely disable / re-enable the entropy complex. (or reboot).

It does feel like however this should be linked into recoverable alerts, but it doesn't seem like that right now.

@mwbranstad
Copy link
Contributor

the only case where the sts bit will be on is when a silicon error occurs, or some attacker is driving this bit on. In other words there is no functional reason that this bit will set. I would probably rate this bit being set as more severe, such as a fatal alert.

@tjaychen
Copy link

tjaychen commented Nov 2, 2022

@mwbranstad if it's alright with you, i can go in and make that change.
To not disturb things further, we can leave the interrupt and also add the fatal indication.
What do you think?

@mwbranstad
Copy link
Contributor

@tjaychen sounds good, just add function only at this point.

@tjaychen
Copy link

tjaychen commented Nov 2, 2022

sounds good.
If it is okay with everyone, let us keep this at M3 as Cindy has done.
I don't think this is something we absolutely have to get in now.

Let me know if there are any objections.

@tjaychen tjaychen removed the Type:Question Questions label Nov 2, 2022
@tjaychen tjaychen changed the title [edn/csrng] Response to CSRNG errors [edn/csrng] Elevate csrng status errors to fatal alerts Nov 2, 2022
@vogelpi
Copy link
Contributor

vogelpi commented Nov 25, 2022

The exact same question cam also up during the work for CSRNG V2/V2S (#16516). And I think there are two different questions to answer here:

  1. What should EDN do in case it gets a csrng_rsp_sts error signaled from CSRNG?
  2. When should CSRNG signal a csrng_rsp_sts error to SW/EDN?

The second question is probably better discussed in #16516 which gives some more context. I am thus removing the CSRNG label here.

As for what EDN should do I think the best would be to not distribute the data received upon getting an sts error from CSRNG. AS @mwbranstad right now the sts bit cannot be driven to 1 by CSRNG unless someone injects a fault. But from the spec there might be other reasons to (let's discuss in #16516). What I am getting at is that EDN doesn't need to know why sts got asserted. It's just an upstream error condition that got signaled. And to me it seems that just continuing regular operation upon getting such an error isn't right.

@vogelpi vogelpi removed the IP:csrng label Nov 25, 2022
@vogelpi vogelpi changed the title [edn/csrng] Elevate csrng status errors to fatal alerts [csrng] Elevate CSRNG status errors to fatal alerts Nov 25, 2022
@tjaychen
Copy link

but for random bits already generated and distributed to the EDN fifo's, wouldn't it be okay for them to continue distribution? I do agree though that edn could halt further requests / acks with csrng though.

I guess if we go down this route though, we need to do a check of the edn consumers to make sure that if the edn interface doesn't ack (because something has stopped due to errors upstream), it should actually halt the end point. For example I don't think keymgr handles this very well right now.

@cdgori
Copy link

cdgori commented Dec 29, 2022

but for random bits already generated and distributed to the EDN fifo's, wouldn't it be okay for them to continue distribution? I do agree though that edn could halt further requests / acks with csrng though.

Probably good to elevate CSRNG status errors to fatal, but we should review what the causes of those errors are in case any of them should not be elevated (not sure what that would be off the top of my head though). EDN <-> CSRNG req/ack should halt once that error is flagged/reported.

As far as entropy already down the pipe:

One can certainly argue that entropy from the CSRNG already distributed to the EDN prior to the status error is safe to continue to distribute until the EDN runs dry - those entropy samples passed all the relevant checks at the time of whitening/processing and no errors were present at that time.

One could also be ultra-paranoid and take an aggressive flushing posture as soon as any error is detected, but that could prevent tidying-up actions that might want to use entropy (e.g. to clear), which is problematic. I don't think the ultra-paranoia is justified here but happy to hear other opinions on this posture. (@mwbranstad / @martin-lueker / @johannheyszl / @moidx)

100% on your non-ack comment, that needs to be halted as it indicates some more serious fault that EDN consumers need to be aware of and handle properly.

@johannheyszl johannheyszl added the Hotlist:Security Security Opinion Needed label Jan 10, 2023
@andreaskurth andreaskurth changed the title [csrng] Elevate CSRNG status errors to fatal alerts [edn] Elevate CSRNG status errors to fatal alerts Mar 1, 2023
@andreaskurth
Copy link
Contributor

Triaged for edn. Assigning to M2.5 with Priority:P1 Priority: high because we need to take a decision on this, as any change will require time for implementation and verification.

@vogelpi
Copy link
Contributor

vogelpi commented Mar 3, 2023

P1 sounds right to me @andreaskurth .

I agree with @tjaychen and @cdgori that entropy already inside EDN when the csrng_rsp_sts error is signaled should be okay to distribute. Not sending new requests afterwards , not accepting new entropy coming in from CSRNG after that point would probably be the right thing to do.

@andreaskurth
Copy link
Contributor

SGTM. As we've discussed yesterday, when this case happens, SW must be able to detect and handle it. This may require adding some status bit or even interrupt to EDN.

@marnovandermaas
Copy link
Contributor

marnovandermaas commented Mar 23, 2023

M2.5 effort estimate, range 2-16 depending on whether we decide to make design changes:

estimate 16

@vogelpi
Copy link
Contributor

vogelpi commented Mar 30, 2023

We've discussed this in the Security WG meeting on Mar 30 2023 and concluded the following:

  • Without changing CSRNG to actually produce high csrng_rsp_sts bits (see [csrng] Define how CSRNG status response should operate #16516), it makes little sense to fix this for M2.5.
  • In case CSRNG detects an internal error condition (e.g. invalid command), its main FSM stops and signals a recoverable alert. It stops serving entropy to the EDN eventually. EDN thus doesn't receive entropy after an error condition has occurred upstream (which would have to be signaled using the status bit). This means from a security perspective things are okay.
  • Given the expected effort and our deadlines, we have agreed to leave this part of the design as is for M2.5.

However, it is also understood that this should be fixed for a future release. I am thus changing the labels accordingly.

@vogelpi vogelpi removed the Hotlist:Security Security Opinion Needed label Mar 30, 2023
@vogelpi vogelpi modified the milestones: Discrete: M2.5, Backlog Mar 30, 2023
@vogelpi vogelpi added Priority:P2 Priority: medium and removed Priority:P1 Priority: high labels Mar 30, 2023
@moidx moidx added Priority:P3 Priority: low and removed Priority:P2 Priority: medium Type:FutureRelease Not relevant to currently planned releases/milestones labels Apr 6, 2023
@msfschaffner msfschaffner added the Earlgrey-PROD Candidate Temporary label to triage issues into Earlgrey-PROD Milestones label Oct 7, 2023
@msfschaffner msfschaffner modified the milestones: Backlog, Earlgrey-PROD.M2 Nov 7, 2023
@msfschaffner msfschaffner added Hotlist:Security Security Opinion Needed and removed Triaged labels Nov 7, 2023
@msfschaffner msfschaffner added the Subsystem:Entropy entropy_src, csrng, or edn related issues label Dec 21, 2023
@msfschaffner msfschaffner added Priority:P2 Priority: medium and removed Priority:P3 Priority: low labels Dec 21, 2023
@johannheyszl johannheyszl removed the Priority:P2 Priority: medium label Jan 12, 2024
@vogelpi
Copy link
Contributor

vogelpi commented Jan 15, 2024

FYI @h-filali , this is very much related to the status register work you're currently doing. We should also fix this for production. We should fix this as described in my earlier comment.

@h-filali
Copy link

@vogelpi thanks for linking me. I suggest the EDN should enter the error state in the main SM upon receiving an status error from CSRNG. It should then signal a recoverable alert and not accept any further data from CSRNG until the whole entropy complex is disabled and re-enabled. This would be needed since continuing normal operation after receiving an error status signal from CSRNG could be catastrophic. A really bad example would be that the generate in boot mode could fail and thus boot entropy could not be delivered while the system keeps on waiting for boot entropy.
Both the error state and the recoverable alerts already exist.

Regarding what happens with the entropy, inside the EDN, which was received before the failure:
currently when the error state is reached the remaining entropy won't be sent out to the endpoints. This is probably solved best by not entering the error state in the ack_sm when certain errors like this one occur. This way, the endpoints can still request entropy while the genbits FIFO is not empty yet. We should make sure however, that the genbits FIFO won't accept any further data when the main SM enters the error state.

The above covers the HW modes (boot and auto mode). For SW mode the sw_cmd_sts signal should be fed through to the sw_cmd_sts register which can be read by firmware. From there it could either be the responsibility of the firmware to restart the entropy complex. In this case the EDN continues operation. Or we could proceed as above for the HW modes. What do you think @vogelpi.

@vogelpi
Copy link
Contributor

vogelpi commented Feb 5, 2024

I've already discussed this offline with @h-filali but for others previously involved in this discussion here: This sounds all good to me.

We've discussed to add a new recoverable error state into the main FSM out of which the FSM can get by a disable/enable cycle. The current error state is fatal which is not suitable for the behavior needed here. The Ack FSM we can leave untouched. End points will get the remaining entropy already received but further commands won't be sent out to CSRNG.

Also, some more status/ack bits have to be added for the hardware modes. Similar to the recently added bits to the SW_CMD_STS register. Hakim has come up with a plan for doing this.

@johngt
Copy link

johngt commented Feb 9, 2024

@vogelpi - I assume that based on the conversation there might be significant work here. When it is possible could you please update the total effort expected and propagate to the top down + tracker doc. TIA!

@vogelpi
Copy link
Contributor

vogelpi commented Feb 9, 2024

@johngt , this is the issue @h-filali is about to finish. No need to update any effort estimates or the tracker doc. The PR is expected to arrive later today.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Component:RTL Earlgrey-PROD Candidate Temporary label to triage issues into Earlgrey-PROD Milestones Hotlist:Security Security Opinion Needed IP:edn Subsystem:Entropy entropy_src, csrng, or edn related issues
Projects
None yet
Development

Successfully merging a pull request may close this issue.