Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[entropy_src/dv] Health tests sometimes fail in scoreboard but not in DUT #18797

Closed
andreaskurth opened this issue Jun 2, 2023 · 6 comments · Fixed by #24166
Closed

[entropy_src/dv] Health tests sometimes fail in scoreboard but not in DUT #18797

andreaskurth opened this issue Jun 2, 2023 · 6 comments · Fixed by #24166
Assignees
Labels
Component:DV DV issue: testbench, test case, etc. Earlgrey-PROD Candidate Temporary label to triage issues into Earlgrey-PROD Milestones IP:entropy_src Milestone:V3 Priority:P2 Priority: medium Subsystem:Entropy entropy_src, csrng, or edn related issues

Comments

@andreaskurth
Copy link
Contributor

In very rare cases (based on how rarely this occurs even in full regression runs), entropy_src's scoreboard predicts a recoverable alert due to a failed health tests but RTL does not produce this alert. This usually manifests itself as UVM_ERROR @ * ps: (cip_base_scoreboard.sv:*) uvm_test_top.env.scoreboard [uvm_test_top.env.scoreboard] alert recov_alert did not trigger max_delay:*.

This seems to affect bucket tests and Markov tests.

Given current evidence, I think this is a minor difference between scoreboard and DUT (e.g., of statistical nature) and does not pose a substantial problem. Nonetheless, we should resolve it for DV closure. I'm thus labeling it V3.

Here's a patch that can help with debugging:
diff --git a/hw/ip/entropy_src/dv/env/entropy_src_scoreboard.sv b/hw/ip/entropy_src/dv/env/entropy_src_scoreboard.sv
index 150130dca4..1419793a00 100644
--- a/hw/ip/entropy_src/dv/env/entropy_src_scoreboard.sv
+++ b/hw/ip/entropy_src/dv/env/entropy_src_scoreboard.sv
@@ -574,6 +574,7 @@ class entropy_src_scoreboard extends cip_base_scoreboard#(
                                               !total_scope, sigma_lo, fail_lo);
     end
 
+    `uvm_info(`gfn, $sformatf("evaluate_adaptp_test()=%b", fail_hi || fail_lo), UVM_LOW)
     return (fail_hi || fail_lo);
   endfunction
 
@@ -604,6 +605,7 @@ class entropy_src_scoreboard extends cip_base_scoreboard#(
                                               1'b0, sigma, fail);
     end
 
+    `uvm_info(`gfn, $sformatf("evaluate_bucket_test()=%b", fail), UVM_LOW)
     return fail;
   endfunction
 
@@ -636,9 +638,11 @@ class entropy_src_scoreboard extends cip_base_scoreboard#(
     update_watermark("markov_hi", fips_mode, total_scope ? value : maxval);
 
     fail_lo = check_threshold("markov_lo", fips_mode, total_scope ? value : minval);
+    `uvm_info(`gfn, $sformatf("fail_lo=%b", fail_lo), UVM_LOW)
     if (fail_lo) predict_failure_logs("markov_lo");
 
     fail_hi = check_threshold("markov_hi", fips_mode, total_scope ? value : maxval);
+    `uvm_info(`gfn, $sformatf("fail_hi=%b", fail_hi), UVM_LOW)
     if (fail_hi) predict_failure_logs("markov_hi");
 
     if (ht_is_active()) begin
@@ -650,6 +654,7 @@ class entropy_src_scoreboard extends cip_base_scoreboard#(
                                               !total_scope, sigma_hi, fail_lo);
     end
 
+    `uvm_info(`gfn, $sformatf("evaluate_markov_test()=%b", fail_hi || fail_lo), UVM_LOW)
     return (fail_hi || fail_lo);
   endfunction
 
@@ -730,6 +735,7 @@ class entropy_src_scoreboard extends cip_base_scoreboard#(
     windowed_fail_count = evaluate_adaptp_test(window, fips_mode) +
                           evaluate_bucket_test(window, fips_mode) +
                           evaluate_markov_test(window, fips_mode);
+    `uvm_info(`gfn, $sformatf("windowed_fail_count=%0d", windowed_fail_count), UVM_LOW)
 
     // Are there any failures this in the last sample (continuous or windowed)?
     // Note: If the last sample in the window has a continuous HT failure, then
@@ -738,7 +744,10 @@ class entropy_src_scoreboard extends cip_base_scoreboard#(
 
     // Add the number of continuous fails (excluding the last sample)
     // to any failure in this sample.
+    `uvm_info(`gfn, $sformatf("continuous_fail_count=%0d", continuous_fail_count), UVM_LOW)
+    `uvm_info(`gfn, $sformatf("extht_fail_count=%0d", extht_fail_count), UVM_LOW)
     total_fail_count = sample_fail_count + continuous_fail_count + extht_fail_count;
+    `uvm_info(`gfn, $sformatf("total_fail_count=%0d", total_fail_count), UVM_LOW)
 
     // Implementation artifact:
     // Account for the fact that simultaneous failures for windowed, continuous and ExtHT tests
@@ -746,6 +755,7 @@ class entropy_src_scoreboard extends cip_base_scoreboard#(
     // sample.
     overcount = sample_fail_count + cont_fail_in_last_sample + extht_fail_in_last_sample;
     overcount -= (overcount >= 1);
+    `uvm_info(`gfn, $sformatf("overcount=%0d", overcount), UVM_LOW)
     total_fail_count -= overcount;
 
     // To avoid double counting only mark a sample failure if there haven't
@@ -807,6 +817,7 @@ class entropy_src_scoreboard extends cip_base_scoreboard#(
           end
           threshold_alert_active = 1;
           `DV_CHECK_FATAL(ral.recov_alert_sts.es_main_sm_alert.predict(1'b1));
+          `uvm_info(`gfn, $sformatf("recov_alert due to main SM"), UVM_LOW)
           set_exp_alert(.alert_name("recov_alert"), .is_fatal(0), .max_delay(cfg.alert_max_delay));
           // The DUT should either set the alert, or crash the sim.
           // If we succeed, sample this alert_threshold as covered successfully.
@@ -2315,6 +2326,7 @@ class entropy_src_scoreboard extends cip_base_scoreboard#(
       if (seeds_out != 0 && get_csrng_seed(item.d_data) == prev_csrng_seed) begin
         `uvm_info(`gfn, "Repeated seed, expecting recov_alert", UVM_MEDIUM)
         `DV_CHECK_FATAL(ral.recov_alert_sts.es_bus_cmp_alert.predict(1'b1));
+        `uvm_info(`gfn, $sformatf("recov_alert due to repeated seed"), UVM_LOW)
         set_exp_alert(.alert_name("recov_alert"), .is_fatal(0), .max_delay(cfg.alert_max_delay));
       end

The problem can be reproduced with Xcelium 21.09-s006 as follows:

  • commit 02fd38a036eb88f0e1068c83667b928d11e1f648, test entropy_src_rng, seed 1862319441 (failing bucket test)
  • commit c6009c674b93ae28b06badb8a98bba5c577ad4ab, test entropy_src_rng_max_rate, seed 2988502003 (failing Markov test)

You may also be able to reproduce the problem on master by running many random seeds of the tests above.

@andreaskurth andreaskurth added Component:DV DV issue: testbench, test case, etc. IP:entropy_src Milestone:V3 labels Jun 2, 2023
@andreaskurth
Copy link
Contributor Author

Along the reasoning above, I think this is not critical for M2.5.2 and am thus assigning it to M2.5 Backlog.

CC @johngt @vogelpi

@andreaskurth andreaskurth added this to the Discrete: M2.5.Backlog milestone Jun 2, 2023
@msfschaffner msfschaffner added the Earlgrey-PROD Candidate Temporary label to triage issues into Earlgrey-PROD Milestones label Oct 7, 2023
@msfschaffner msfschaffner added Subsystem:Entropy entropy_src, csrng, or edn related issues Priority:P2 Priority: medium labels Dec 21, 2023
@vogelpi
Copy link
Contributor

vogelpi commented Jun 18, 2024

Moving to M7 as part of V3.

@h-filali
Copy link

h-filali commented Jul 29, 2024

I fixed most of these in #23787.

However, I think there is also an RTL bug here. When rst_alert_cntr_o in the main SM goes high while there is a failure in one of the continuous health tests like the repcnts test, the failure is ignored because the counter is incremented and cleared at the same time (clr supersedes the incr). I observed this for this seed:
util/dvsim/dvsim.py hw/ip/entropy_src/dv/entropy_src_sim_cfg.hjson -i entropy_src_rng_max_rate -t xcelium --fixed-seed 108069444358779596642755797006271320644578869651229429093932085286013706466631
I created #24159 to document this.

I'll check if I can find some more that fail due to different reasons and report it here.

@h-filali
Copy link

I have now checked the last 5 regressions.
I have found a total of two left over failures.
For the first one I have created a bug report #24159.
For the second one I have created a PR #24166.

There are also some failures that were probably already fixed by #23787.
I think after merging the above PR we can close this issue.

@andreaskurth
Copy link
Contributor Author

@h-filali: #23787 has now been merged; should we wait for #24166 to get merged before closing this issue?

@h-filali
Copy link

h-filali commented Aug 2, 2024

@andreaskurth yes I think we should wait for #24166 to close. The issue is already tagged in the PR such that it closes automatically.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Component:DV DV issue: testbench, test case, etc. Earlgrey-PROD Candidate Temporary label to triage issues into Earlgrey-PROD Milestones IP:entropy_src Milestone:V3 Priority:P2 Priority: medium Subsystem:Entropy entropy_src, csrng, or edn related issues
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants