Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[chip-test] alert_handler_reverse_ping_in_deep_sleep #14663

Merged
merged 1 commit into from
Sep 13, 2022

Conversation

moidx
Copy link
Contributor

@moidx moidx commented Aug 30, 2022

Check that escalation receivers located inside always-on blocks do not
auto-escalate due to the reverse ping feature while the system is in deep
sleep.

Signed-off-by: Miguel Osorio [email protected]

@moidx moidx requested review from a team and cfrantz as code owners August 30, 2022 23:01
@moidx moidx requested review from rasmus-madsen and engdoreis and removed request for a team August 30, 2022 23:01
@moidx moidx force-pushed the chip-test-alert-handler-reverse-ping branch from 294c14d to 8518316 Compare August 30, 2022 23:08
Copy link
Contributor

@engdoreis engdoreis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, I just left a few minor comments.

@moidx moidx force-pushed the chip-test-alert-handler-reverse-ping branch from 8518316 to 9639385 Compare August 31, 2022 22:22
@moidx moidx requested review from a team, ctopal, weicaiyang and msfschaffner and removed request for a team and ctopal September 1, 2022 00:13
@moidx moidx changed the title [test] alert_handler_reverse_ping_in_deep_sleep [chip-test] alert_handler_reverse_ping_in_deep_sleep Sep 1, 2022
@moidx moidx linked an issue Sep 1, 2022 that may be closed by this pull request
6 tasks
@moidx moidx force-pushed the chip-test-alert-handler-reverse-ping branch from 9639385 to 9ff9827 Compare September 1, 2022 01:09
@moidx moidx requested a review from cindychip September 6, 2022 15:21
hw/top_earlgrey/data/chip_testplan.hjson Outdated Show resolved Hide resolved
hw/top_earlgrey/data/chip_testplan.hjson Outdated Show resolved Hide resolved
hw/top_earlgrey/data/chip_testplan.hjson Outdated Show resolved Hide resolved
@moidx moidx force-pushed the chip-test-alert-handler-reverse-ping branch 2 times, most recently from 9ff9827 to 4a6d497 Compare September 7, 2022 23:18
@moidx moidx requested a review from cindychip September 7, 2022 23:21
@moidx
Copy link
Contributor Author

moidx commented Sep 7, 2022

Hi @cindychip, I updated the test to enable and check all local alerts. I also added a flag to the ISR to fail the test if there are any alert interrupts detected. PTAL. Thanks!

Copy link
Contributor

@cindychip cindychip left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Miguel. great test!
I am running in DV and trying to dump waves to see if it is what we expect.

@moidx moidx force-pushed the chip-test-alert-handler-reverse-ping branch from 4a6d497 to 3020974 Compare September 8, 2022 23:15
Copy link
Contributor

@cindychip cindychip left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great, thanks Miguel

Copy link
Contributor

@msfschaffner msfschaffner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM modulo @cindychip's comments.

- Ensure reset status is low power exit. A `kDifRstmgrResetInfoEscalation` signals that
there was a local escalation and should result in test failure.
- Disable AON timer.
- Check there are no flagged local alerts.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This LGTM!

@@ -778,15 +778,33 @@
uvm_test_seq: chip_sw_alert_handler_escalation_vseq
sw_images: ["//sw/device/tests/sim_dv:alert_handler_escalation_test:1"]
en_run_modes: ["sw_test_mode_test_rom"]
// Disable scoreboard to avoid incorrect alert prediction from the alert_monitor. Due to the
// cross-domain alert senders and receivers, the monitor from the chip level did not support
// processing alerts accurately from both ends.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this something that we track in an issue to fix later (if not, should it be?)?

@cindychip
Copy link
Contributor

@moidx here is the run result in dv:

Maybe we can set the run_timeout_min to 4 hours?

### Test Results
|  Stage  |                       Name                       | Tests                                            |  Max Job Runtime  |  Simulated Time  |  Passi
ng  |  Total  |  Pass Rate  |
|:-------:|:------------------------------------------------:|:-------------------------------------------------|:-----------------:|:----------------:|:------
---:|:-------:|:-----------:|
|   V2    | chip_sw_alert_handler_reverse_ping_in_deep_sleep | chip_sw_alert_handler_reverse_ping_in_deep_sleep |      3.443h       |    255.146ms     |     1 
    |    1    |  100.00 %   |
|   V2    |                                                  | **TOTAL**                                        |                   |                  |     1 
    |    1    |  100.00 %   |
|         |                                                  | **TOTAL**                                        |                   |                  |     1 
    |    1    |  100.00 %   |

@tjaychen
Copy link

hey all,
i've been staring at this test a bit, and now i'm not sure how we really should be testing this. Because it's IMO a combination of hardware / software / system behavior.

From what I can see, the most likely way a failure will trigger is a combination of two things:

  1. we screw up the clock / reset controls somehow such that pwrmgr escalation receiver gets out of sync from the alert_handler.
  2. after a low power event, the software for whatever reason takes an enormous amount of time before it enables pings again.

The first event will cause the pwrmgr escalation receiver to think that it should keep counting even though the alert handler has gone offline.
While the latter (if the limit is 175ms..it's not inconceivable software might run into this, since as far as I can tell rom doesn't enable the ping) would mean the alert handler doesn't ping the escalation for some time, causing it to auto-escalate.

The test case we have here I feel like is not very likely to ever flag an error (though it is still good to have) because most of the "time" in the test I think is used in pre-sleep, and it's not very likely things will go out of sync there.

The only way I can think of to test this is to hit the alert handler / pwrmgr combintation repeatedly with different kinds of sleep over and over with real software / or introduce artificial delays on the software, but all of this is going to take way too long in simulation. We could run it in FPGA, but this would be something that should be run for hours.

I think at a minimum I should probably add some assertions to ensure the escalation receiver and alert_handler cannot get out of sync, but beyond that I don't really know how to make this better.... any thoughts?

@tjaychen
Copy link

i think what i'm trying to say is that the purpose of these tests is probably to look for places where alert handler and pwrmgr get out of sync. But in this test, there's really only one moment where that could happen. And even if things DID get out of sync, it would require software to behave in an expected way to actually manifest an error. So I wonder if it makes more sense to codify what we think are the properties into cross module assertions, and then just try to stimulate it as much as possible.

@moidx
Copy link
Contributor Author

moidx commented Sep 10, 2022

Hi @tjaychen, for the long running test you are describing, we could simulate different pwrmgr transitions in a similar way to pwrmgr_deep_sleep_all_reset_reqs_test.c with additional random delays in software while enabling/checking alerts on each power cycle. I can probably write that as a separate test if it makes sense. wdyt?

@cindychip
Copy link
Contributor

cindychip commented Sep 12, 2022

Thanks Tim and Miguel for the suggestions.

I think Tim's concern is totally valid, because prim_esc_receiver does not have any status register to tell which state is it at, if reset happened or not, or if this counter is reset to 0.

1). To address pwrmgr and prim_esc_receiver not in sync:
As Tim summarized, we can try the following two things:
a). probing some internal signals, or write assertions.
b). have some long FPGA test to re-iterate this sequence, or wait for longer time after wake up, just to catch if any modules are out-of-sync.
To unblock this PR from reviewing/merging, I will create a separate issue for that.

We could have some long FPGA test to re-iterate this sequence, or wait for longer time after wake up, just to catch if any modules are out-of-sync.
2). To address pwrmgr and alert_handler not in sync:
@moidx's suggestion would work!

@tjaychen please feel free to let me know if I missed any points from our discussion.

@msfschaffner
Copy link
Contributor

@tjaychen just to clarify, the out of sync scenario you mention should not be possible if everything is implemented correctly, right? i.e., the vectors you describe would only help to bias the test so that the involved mechanisms are used in a way that a bug (if there was one) would manifest more readily?

@tjaychen
Copy link

@tjaychen just to clarify, the out of sync scenario you mention should not be possible if everything is implemented correctly, right? i.e., the vectors you describe would only help to bias the test so that the involved mechanisms are used in a way that a bug (if there was one) would manifest more readily?

yeah. But I worry I'm being too specific on just clk/rst, so if you can think of any other areas we should target..we should probably think about that too.

Check that escalation receivers located inside always-on blocks do not
auto-escalate due to the reverse ping feature while the system is in deep
sleep.

Signed-off-by: Miguel Osorio <[email protected]>
@moidx moidx force-pushed the chip-test-alert-handler-reverse-ping branch from 3020974 to ff7d6a0 Compare September 13, 2022 02:07
@moidx moidx merged commit db5c836 into lowRISC:master Sep 13, 2022
@moidx moidx deleted the chip-test-alert-handler-reverse-ping branch September 13, 2022 05:55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[chip-test] chip_sw_alert_handler_reverse_ping_in_deep_sleep
5 participants