[opentitantool] Run "nice"r sims #16913

drewmacrae · 2022-12-21T15:41:07Z

Fix some timing issues by running sims with nice -5 to ensure test harnesses don't get starved of threads. This should solve some batch testing problems with a large number of threads serving verilator simulations.

I think it might help for: #16831 (Though that may just be a port collision which could require additional machinery)

It protects both the verilator and the cw310 test harnesses from heavy sim workloads which will be important to running cw310 tests that aren't exclusive.

Signed-off-by: Drew Macrae [email protected]

Fix some timing issues by running sims with nice -5 to ensure test harnesses don't get starved of threads. This should solve some batch testing problems with a large number of threads serving verilator simulations. Signed-off-by: Drew Macrae <[email protected]>

drewmacrae · 2022-12-21T15:42:25Z

sw/host/opentitanlib/src/transport/verilator/subprocess.rs

@@ -35,9 +35,10 @@ pub struct Subprocess {
 impl Subprocess {
    /// Starts a verilator [`Subprocess`] based on [`Options`].
    pub fn from_options(options: Options) -> Result<Self> {
-        let mut command = Command::new(&options.executable);
+        let mut command = Command::new(String::from("nice"));


I don't particularly like changing the command from the executable to 'nice', but it's the best way I know to deprioritize this.

Does this need the String::from? The docs for Command::new don't use it, but I didn't check the version.

I think "nice" is a "slice" and new needs a String, This is something I did to make OTT build happily.

I gave it a shot without wrapping in String::from and didn't have any trouble compiling. I admit I don't entirely understand the AsRef<OsStr> type.

yeah, I guess there was some other issue that I've since fixed. Thanks!

Are we able to use setpriority() or some other POSIX function instead of calling a separate program? I guess nice will probably be available anywhere those are, though, so maybe the function is not any better. 🤔

dmcardle · 2022-12-21T15:58:04Z

Fix some timing issues by running sims with nice -5 to ensure test harnesses don't get starved of threads. This should solve some batch testing problems with a large number of threads serving verilator simulations.

Out of curiosity, how did you discover the thread starvation issue? Does it happen on CI? And how will we know that adjusting the nice value fixes the thread starvation?

dmcardle

LGTM % comments and questions. Thanks, @drewmacrae!

dmcardle · 2022-12-21T15:51:28Z

sw/host/opentitanlib/src/transport/verilator/subprocess.rs

@@ -35,9 +35,10 @@ pub struct Subprocess {
 impl Subprocess {
    /// Starts a verilator [`Subprocess`] based on [`Options`].
    pub fn from_options(options: Options) -> Result<Self> {
-        let mut command = Command::new(&options.executable);
+        let mut command = Command::new(String::from("nice"));


Does this need the String::from? The docs for Command::new don't use it, but I didn't check the version.

dmcardle · 2022-12-21T15:53:17Z

sw/host/opentitanlib/src/transport/verilator/subprocess.rs

        let mut args = Vec::new();
-
+        args.push(String::from("-5"));


Peeking at man nice, doesn't -5 need to be preceded by -n or --adjustment=?

That makes it clearer we're increasing niceness but less concise.

Actually, now I'm worried that nice might be reading plain -5 as a parameter and silently ignoring it.

% nice -5 bash -c 'echo hello' hello % nice -n -5 bash -c 'echo hello' nice: cannot set niceness: Permission denied hello

It works on my machine
nice -5 yes > /dev/null & htop

-5 should be equivalent to -n 5 IIRC

I'll make it more verbose and readable, I just want to check that it works before I push the change.

positive nicing requires lesser permissions

Good point. Positive nicing is less favorable to the process, so I think -5 is the wrong direction.

It's misleading so I'm changing it
nice -5 nice:5
nice -n 5 nice:5

it's a different sort of hyphen

Oh! I thought it was silently ignoring -5, but it actually interpreted it as -n 5. Got it.

I think my persistent confusion about this is evidence it should be a verbose -n 5 :P

dmcardle · 2022-12-21T15:54:30Z

sw/host/opentitanlib/src/transport/verilator/subprocess.rs

@@ -35,9 +35,10 @@ pub struct Subprocess {
 impl Subprocess {
    /// Starts a verilator [`Subprocess`] based on [`Options`].
    pub fn from_options(options: Options) -> Result<Self> {
-        let mut command = Command::new(&options.executable);
+        let mut command = Command::new(String::from("nice"));
        let mut args = Vec::new();


Out-of-scope nit: Looks like Command::arg() and Command::args() exist. Given that command is already mutable, I think we could probably eliminate the args vector. Not super important and adds more noise to the diff, though.

drewmacrae · 2022-12-21T16:10:45Z

Out of curiosity, how did you discover the thread starvation issue? Does it happen on CI? And how will we know that adjusting the nice value fixes the thread starvation?

Good question. I came across it with the custom resource allocation work I've been doing. I've tested it in this branch #16436 where I'd previously been allocating 8 cpus to the FPGA tests to prevent failures. This change lets us expect the FPGA tests to run on a single cpu instead.

a-will · 2022-12-21T16:51:06Z

What is the situation where we are getting thread starvation for the test harness? And why are we getting ourselves into that situation? Are we sure there isn't something structurally wrong with the architecture of specific tests or of the dispatch strategy?

This PR may merely be missing important motivating information here (especially observations / data), but I am skeptical that this approach is best. Messing with thread priority is playing with fire, and there very likely are situations where decreasing the thread priority of the simulator's threads is deleterious to the performance of tests. At the very least, I would oppose hard-coded constructions that are merely made convenient for the current CI environment. Perhaps consider adding a backend configuration option here?

drewmacrae · 2022-12-21T16:59:36Z

What is the situation where we are getting thread starvation for the test harness?

On my workstation I have 72 cores which means I can run 17 4:cpu verilator tests, and have 4 cores available to run fpga tests which in practice require 1 core.

If these are simultaneously dispatched the fpga tests fail reasonably often, but of I run only 16 verilator tests it's healthy.

The fpga tests don't require or use 8 cores, so while it's sufficient to rescue the tests I dislike reserving them for it.

We also see some test-harness interactions causing problems in CI which may be resulting from this kind of resource contention but also might not. (We don't have a means to reproduce that issue)

drewmacrae · 2022-12-21T17:01:13Z

Are we sure there isn't something structurally wrong with the architecture of specific tests or of the dispatch strategy?

Nope, this is just an easier more elegant fix than the alternative method of accounting for resources.

drewmacrae · 2022-12-21T17:08:12Z

Perhaps consider adding a backend configuration option here?

@a-will, Are you suggesting plumbing it through from rules/opentitan_test.bzl through an argument like --verilator-niceness?

a-will · 2022-12-21T17:15:15Z

What is the situation where we are getting thread starvation for the test harness?

On my workstation I have 72 cores which means I can run 17 4:cpu verilator tests, and have 4 cores available to run fpga tests which in practice require 1 core.

If these are simultaneously dispatched the fpga tests fail reasonably often, but of I run only 16 verilator tests it's healthy.

The fpga tests don't require or use 8 cores, so while it's sufficient to rescue the tests I dislike reserving them for it.

We also see some test-harness interactions causing problems in CI which may be resulting from this kind of resource contention but also might not. (We don't have a means to reproduce that issue)

It feels like that means you can't run 17 verilator tests ;) (i.e. they actually require more than 4 cores). For tests that are time-sensitive, should we be allocating an entire core for the testbench / test harness? The original 4 cores only represent the DUT, and if we want to banish as much flakiness as possible, we may want to avoid loading a system enough to cause significant contention.

That's the most conservative approach and leads to a less efficient use of resources, so it's just a question. There's some balance to be had between false failure probability and average throughput. So I wouldn't go with hard-coding this for everyone, but the backend configuration option might give you an opportunity to play with it.

a-will · 2022-12-21T17:15:47Z

Perhaps consider adding a backend configuration option here?

@a-will, Are you suggesting plumbing it through from rules/opentitan_test.bzl through an argument like --verilator-niceness?

Yup, exactly! :)

drewmacrae · 2022-12-21T17:36:41Z

Messing with thread priority is playing with fire

For me, this statement isn't as much of a deterrent as you might hope.

drewmacrae · 2022-12-21T17:46:19Z

It feels like that means you can't run 17 verilator tests ;) (i.e. they actually require more than 4 cores).

Huh, interesting, this finding would argue for a value of cpu:4.5 (which is an upper-bound sufficient to prevent this condition.) It's more elegant and honest than saying FPGA tests take cpu:8 and should have the same effect on resource based allocation.

drewmacrae · 2022-12-21T17:54:55Z

Huh, interesting, this finding would argue for a value of cpu:4.5 (which is an upper-bound sufficient to prevent this condition.) It's more elegant and honest than saying FPGA tests take cpu:8 and should have the same effect on resource based allocation.

There's a higher overhead to protect the verilator harnesses in CI and thus resolve #16831 this way.

…d overhead An alternative to lowRISC#16913 to prevent harnesses from getting pre-empted by a simulation load. Signed-off-by: Drew Macrae <[email protected]>

drewmacrae · 2022-12-21T18:01:33Z

Well #16915 is certainly simpler and does almost everything this PR does.

…overhead An alternative to lowRISC#16913 to prevent harnesses from getting pre-empted by a simulation load. Signed-off-by: Drew Macrae <[email protected]>

…overhead An alternative to lowRISC#16913 to prevent harnesses from getting pre-empted by a simulation load. We also stop overriding resource based allocation by removing --local_test_jobs arg and update documentation Signed-off-by: Drew Macrae <[email protected]>

drewmacrae · 2022-12-22T15:13:33Z

I think one of the bins of failure was a harness timing out waiting for a sim, so we really need less sims rather than nicer sims.

…overhead An alternative to lowRISC#16913 to prevent harnesses from getting pre-empted by a simulation load. We also stop overriding resource based allocation by removing --local_test_jobs arg and update documentation Signed-off-by: Drew Macrae <[email protected]>

…overhead An alternative to #16913 to prevent harnesses from getting pre-empted by a simulation load. We also stop overriding resource based allocation by removing --local_test_jobs arg and update documentation Signed-off-by: Drew Macrae <[email protected]>

drewmacrae requested a review from a team as a code owner December 21, 2022 15:41

drewmacrae requested review from alphan and removed request for a team December 21, 2022 15:41

drewmacrae commented Dec 21, 2022

View reviewed changes

drewmacrae requested review from cfrantz, a-will, dmcardle and jon-flatley December 21, 2022 15:43

dmcardle approved these changes Dec 21, 2022

View reviewed changes

drewmacrae marked this pull request as draft December 21, 2022 17:12

drewmacrae mentioned this pull request Dec 21, 2022

[bazel] allocate additional core per verilator test to allow for harness and overhead #16915

Merged

drewmacrae mentioned this pull request Dec 21, 2022

[ci/test] sw_strap_value_sim_verilator fails periodically in CI #16831

Closed

drewmacrae closed this Dec 22, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[opentitantool] Run "nice"r sims #16913

[opentitantool] Run "nice"r sims #16913

drewmacrae commented Dec 21, 2022

drewmacrae Dec 21, 2022

dmcardle Dec 21, 2022

drewmacrae Dec 21, 2022

dmcardle Dec 21, 2022

drewmacrae Dec 21, 2022

a-will Dec 21, 2022

dmcardle commented Dec 21, 2022

dmcardle left a comment

dmcardle Dec 21, 2022

dmcardle Dec 21, 2022

drewmacrae Dec 21, 2022

dmcardle Dec 21, 2022

drewmacrae Dec 21, 2022

drewmacrae Dec 21, 2022

drewmacrae Dec 21, 2022

dmcardle Dec 21, 2022

drewmacrae Dec 21, 2022

drewmacrae Dec 21, 2022

dmcardle Dec 21, 2022

dmcardle Dec 21, 2022

drewmacrae commented Dec 21, 2022

a-will commented Dec 21, 2022

drewmacrae commented Dec 21, 2022

drewmacrae commented Dec 21, 2022

drewmacrae commented Dec 21, 2022 •

edited

Loading

a-will commented Dec 21, 2022

a-will commented Dec 21, 2022

drewmacrae commented Dec 21, 2022

drewmacrae commented Dec 21, 2022

drewmacrae commented Dec 21, 2022

drewmacrae commented Dec 21, 2022

drewmacrae commented Dec 22, 2022

[opentitantool] Run "nice"r sims #16913

[opentitantool] Run "nice"r sims #16913

Conversation

drewmacrae commented Dec 21, 2022

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

dmcardle commented Dec 21, 2022

dmcardle left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

drewmacrae commented Dec 21, 2022

a-will commented Dec 21, 2022

drewmacrae commented Dec 21, 2022

drewmacrae commented Dec 21, 2022

drewmacrae commented Dec 21, 2022 • edited Loading

a-will commented Dec 21, 2022

a-will commented Dec 21, 2022

drewmacrae commented Dec 21, 2022

drewmacrae commented Dec 21, 2022

drewmacrae commented Dec 21, 2022

drewmacrae commented Dec 21, 2022

drewmacrae commented Dec 22, 2022

drewmacrae commented Dec 21, 2022 •

edited

Loading