[entropy_src/dv] Health tests sometimes fail in scoreboard but not in DUT #18797

andreaskurth · 2023-06-02T10:25:15Z

In very rare cases (based on how rarely this occurs even in full regression runs), entropy_src's scoreboard predicts a recoverable alert due to a failed health tests but RTL does not produce this alert. This usually manifests itself as UVM_ERROR @ * ps: (cip_base_scoreboard.sv:*) uvm_test_top.env.scoreboard [uvm_test_top.env.scoreboard] alert recov_alert did not trigger max_delay:*.

This seems to affect bucket tests and Markov tests.

Given current evidence, I think this is a minor difference between scoreboard and DUT (e.g., of statistical nature) and does not pose a substantial problem. Nonetheless, we should resolve it for DV closure. I'm thus labeling it V3.

Here's a patch that can help with debugging:

diff --git a/hw/ip/entropy_src/dv/env/entropy_src_scoreboard.sv b/hw/ip/entropy_src/dv/env/entropy_src_scoreboard.sv
index 150130dca4..1419793a00 100644
--- a/hw/ip/entropy_src/dv/env/entropy_src_scoreboard.sv
+++ b/hw/ip/entropy_src/dv/env/entropy_src_scoreboard.sv
@@ -574,6 +574,7 @@ class entropy_src_scoreboard extends cip_base_scoreboard#(
                                               !total_scope, sigma_lo, fail_lo);
     end
 
+    `uvm_info(`gfn, $sformatf("evaluate_adaptp_test()=%b", fail_hi || fail_lo), UVM_LOW)
     return (fail_hi || fail_lo);
   endfunction
 
@@ -604,6 +605,7 @@ class entropy_src_scoreboard extends cip_base_scoreboard#(
                                               1'b0, sigma, fail);
     end
 
+    `uvm_info(`gfn, $sformatf("evaluate_bucket_test()=%b", fail), UVM_LOW)
     return fail;
   endfunction
 
@@ -636,9 +638,11 @@ class entropy_src_scoreboard extends cip_base_scoreboard#(
     update_watermark("markov_hi", fips_mode, total_scope ? value : maxval);
 
     fail_lo = check_threshold("markov_lo", fips_mode, total_scope ? value : minval);
+    `uvm_info(`gfn, $sformatf("fail_lo=%b", fail_lo), UVM_LOW)
     if (fail_lo) predict_failure_logs("markov_lo");
 
     fail_hi = check_threshold("markov_hi", fips_mode, total_scope ? value : maxval);
+    `uvm_info(`gfn, $sformatf("fail_hi=%b", fail_hi), UVM_LOW)
     if (fail_hi) predict_failure_logs("markov_hi");
 
     if (ht_is_active()) begin
@@ -650,6 +654,7 @@ class entropy_src_scoreboard extends cip_base_scoreboard#(
                                               !total_scope, sigma_hi, fail_lo);
     end
 
+    `uvm_info(`gfn, $sformatf("evaluate_markov_test()=%b", fail_hi || fail_lo), UVM_LOW)
     return (fail_hi || fail_lo);
   endfunction
 
@@ -730,6 +735,7 @@ class entropy_src_scoreboard extends cip_base_scoreboard#(
     windowed_fail_count = evaluate_adaptp_test(window, fips_mode) +
                           evaluate_bucket_test(window, fips_mode) +
                           evaluate_markov_test(window, fips_mode);
+    `uvm_info(`gfn, $sformatf("windowed_fail_count=%0d", windowed_fail_count), UVM_LOW)
 
     // Are there any failures this in the last sample (continuous or windowed)?
     // Note: If the last sample in the window has a continuous HT failure, then
@@ -738,7 +744,10 @@ class entropy_src_scoreboard extends cip_base_scoreboard#(
 
     // Add the number of continuous fails (excluding the last sample)
     // to any failure in this sample.
+    `uvm_info(`gfn, $sformatf("continuous_fail_count=%0d", continuous_fail_count), UVM_LOW)
+    `uvm_info(`gfn, $sformatf("extht_fail_count=%0d", extht_fail_count), UVM_LOW)
     total_fail_count = sample_fail_count + continuous_fail_count + extht_fail_count;
+    `uvm_info(`gfn, $sformatf("total_fail_count=%0d", total_fail_count), UVM_LOW)
 
     // Implementation artifact:
     // Account for the fact that simultaneous failures for windowed, continuous and ExtHT tests
@@ -746,6 +755,7 @@ class entropy_src_scoreboard extends cip_base_scoreboard#(
     // sample.
     overcount = sample_fail_count + cont_fail_in_last_sample + extht_fail_in_last_sample;
     overcount -= (overcount >= 1);
+    `uvm_info(`gfn, $sformatf("overcount=%0d", overcount), UVM_LOW)
     total_fail_count -= overcount;
 
     // To avoid double counting only mark a sample failure if there haven't
@@ -807,6 +817,7 @@ class entropy_src_scoreboard extends cip_base_scoreboard#(
           end
           threshold_alert_active = 1;
           `DV_CHECK_FATAL(ral.recov_alert_sts.es_main_sm_alert.predict(1'b1));
+          `uvm_info(`gfn, $sformatf("recov_alert due to main SM"), UVM_LOW)
           set_exp_alert(.alert_name("recov_alert"), .is_fatal(0), .max_delay(cfg.alert_max_delay));
           // The DUT should either set the alert, or crash the sim.
           // If we succeed, sample this alert_threshold as covered successfully.
@@ -2315,6 +2326,7 @@ class entropy_src_scoreboard extends cip_base_scoreboard#(
       if (seeds_out != 0 && get_csrng_seed(item.d_data) == prev_csrng_seed) begin
         `uvm_info(`gfn, "Repeated seed, expecting recov_alert", UVM_MEDIUM)
         `DV_CHECK_FATAL(ral.recov_alert_sts.es_bus_cmp_alert.predict(1'b1));
+        `uvm_info(`gfn, $sformatf("recov_alert due to repeated seed"), UVM_LOW)
         set_exp_alert(.alert_name("recov_alert"), .is_fatal(0), .max_delay(cfg.alert_max_delay));
       end

The problem can be reproduced with Xcelium 21.09-s006 as follows:

commit 02fd38a036eb88f0e1068c83667b928d11e1f648, test entropy_src_rng, seed 1862319441 (failing bucket test)
commit c6009c674b93ae28b06badb8a98bba5c577ad4ab, test entropy_src_rng_max_rate, seed 2988502003 (failing Markov test)

You may also be able to reproduce the problem on master by running many random seeds of the tests above.

The text was updated successfully, but these errors were encountered:

andreaskurth · 2023-06-02T10:26:39Z

Along the reasoning above, I think this is not critical for M2.5.2 and am thus assigning it to M2.5 Backlog.

CC @johngt @vogelpi

vogelpi · 2024-06-18T21:21:00Z

Moving to M7 as part of V3.

h-filali · 2024-07-29T14:08:19Z

I fixed most of these in #23787.

However, I think there is also an RTL bug here. When rst_alert_cntr_o in the main SM goes high while there is a failure in one of the continuous health tests like the repcnts test, the failure is ignored because the counter is incremented and cleared at the same time (clr supersedes the incr). I observed this for this seed:
util/dvsim/dvsim.py hw/ip/entropy_src/dv/entropy_src_sim_cfg.hjson -i entropy_src_rng_max_rate -t xcelium --fixed-seed 108069444358779596642755797006271320644578869651229429093932085286013706466631
I created #24159 to document this.

I'll check if I can find some more that fail due to different reasons and report it here.

h-filali · 2024-07-30T07:36:46Z

I have now checked the last 5 regressions.
I have found a total of two left over failures.
For the first one I have created a bug report #24159.
For the second one I have created a PR #24166.

There are also some failures that were probably already fixed by #23787.
I think after merging the above PR we can close this issue.

andreaskurth · 2024-08-01T12:06:47Z

@h-filali: #23787 has now been merged; should we wait for #24166 to get merged before closing this issue?

h-filali · 2024-08-02T09:26:18Z

@andreaskurth yes I think we should wait for #24166 to close. The issue is already tagged in the PR such that it closes automatically.

andreaskurth added Component:DV DV issue: testbench, test case, etc. IP:entropy_src Milestone:V3 labels Jun 2, 2023

andreaskurth added this to the Discrete: M2.5.Backlog milestone Jun 2, 2023

msfschaffner added the Earlgrey-PROD Candidate Temporary label to triage issues into Earlgrey-PROD Milestones label Oct 7, 2023

msfschaffner modified the milestones: Earlgrey ES M2.5.Backlog, Earlgrey-PROD.M5 Nov 8, 2023

msfschaffner added Subsystem:Entropy entropy_src, csrng, or edn related issues Priority:P2 Priority: medium labels Dec 21, 2023

msfschaffner assigned vogelpi and h-filali Dec 21, 2023

This was referenced Mar 29, 2024

[entropy_src] D2S Signoff #20973

Closed

[entropy_src] V1 Signoff #21008

Closed

[csrng] D2S Signoff #20976

Closed

vogelpi modified the milestones: Earlgrey-PROD.M5, Earlgrey-PROD.M7 Jun 18, 2024

h-filali mentioned this issue Jun 24, 2024

[entropy_src/dv] Improve the group coverage #23787

Merged

vogelpi mentioned this issue Jun 25, 2024

[csrng] V2S Signoff #22467

Closed

vogelpi mentioned this issue Jul 4, 2024

[entropy_src] V2S Signoff #22469

Closed

h-filali mentioned this issue Jul 30, 2024

[entropy_src/dv] Ignore outstanding alerts after disablement #24166

Merged

vogelpi closed this as completed in #24166 Aug 6, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[entropy_src/dv] Health tests sometimes fail in scoreboard but not in DUT #18797

[entropy_src/dv] Health tests sometimes fail in scoreboard but not in DUT #18797

andreaskurth commented Jun 2, 2023

andreaskurth commented Jun 2, 2023

vogelpi commented Jun 18, 2024

h-filali commented Jul 29, 2024 •

edited

Loading

h-filali commented Jul 30, 2024

andreaskurth commented Aug 1, 2024

h-filali commented Aug 2, 2024 •

edited

Loading

[entropy_src/dv] Health tests sometimes fail in scoreboard but not in DUT #18797

[entropy_src/dv] Health tests sometimes fail in scoreboard but not in DUT #18797

Comments

andreaskurth commented Jun 2, 2023

andreaskurth commented Jun 2, 2023

vogelpi commented Jun 18, 2024

h-filali commented Jul 29, 2024 • edited Loading

h-filali commented Jul 30, 2024

andreaskurth commented Aug 1, 2024

h-filali commented Aug 2, 2024 • edited Loading

h-filali commented Jul 29, 2024 •

edited

Loading

h-filali commented Aug 2, 2024 •

edited

Loading