Account for varying CPU frequency more robustly #138
Ah, one other simple thing we could do: add "instruction count" to our metrics, along with cycles and wallclock time. While there isn't necessarily a simple relationship between instruction count and performance (IPC can vary widely), it is at least a deterministic (or very nearly deterministic) measure for single-threaded benchmarks.
Yeah, the hope was that we could largely avoid CPU scaling bias by measuring cycles instead of wall time. This is more of a mitigation than a solution, though, and apparently it isn't enough.
I think we should remove all of the shootout benchmarks. They were useful for verifying that we had similar numbers to the old sightglass, which also used these benchmarks. But they are tiny microbenchmarks that don't reflect real-world programs, nor are they snippets of code where we found Wasmtime/Cranelift lacking. I don't think they are useful anymore. I would focus on just the markdown and bz2 benchmarks for the moment.
Yes, we should do this. It will require some refactoring of our parent-child subprocess interactions and a protocol between them instead of just "spawn and wait for completion".
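For illustration, here is a rough sketch of what "a protocol between parent and child" could look like in Rust, using only the standard library: the parent keeps one child process alive and asks it for iterations over stdin/stdout instead of respawning it for every sample. The child binary path and the "run"/"exit" commands are hypothetical placeholders, not sightglass's actual design.

```rust
use std::io::{BufRead, BufReader, Write};
use std::process::{Command, Stdio};

fn main() -> std::io::Result<()> {
    // Spawn the (hypothetical) benchmark child once, with piped stdio.
    let mut child = Command::new("./benchmark-child")
        .stdin(Stdio::piped())
        .stdout(Stdio::piped())
        .spawn()?;
    let mut to_child = child.stdin.take().expect("piped stdin");
    let mut from_child = BufReader::new(child.stdout.take().expect("piped stdout"));

    // Ask for several iterations without paying the spawn cost each time.
    for _ in 0..10 {
        writeln!(to_child, "run")?;
        let mut line = String::new();
        from_child.read_line(&mut line)?;
        // The child is assumed to reply with one line per iteration,
        // e.g. "done <nanoseconds>".
        println!("child reported: {}", line.trim());
    }

    writeln!(to_child, "exit")?;
    child.wait()?;
    Ok(())
}
```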
None of our benchmarks are single-threaded: we want to measure how well parallel compilation and such are helping (or not). I don't think we should invest time here. Similar for measuring instruction count instead of cycles or wall time. I'd prefer to identify and mitigate sources of bias, instead, so we can still measure the "actual" thing we care about (wall time) rather than something that is loosely correlated with the thing we care about.
This would be great to support.
Filed a dedicated issue for this: #139
Filed a dedicated issue for this: #140
Thanks for filing the issues! And sorry for dumping so many ideas at once :-)
The execution phase is single-threaded, no? Part of the issue is that the sampling isn't uniformly getting a mix of cores: depending on what else is going on in the background on the system, one or another core may be busy for a period of time, and the scheduler's affinity heuristics will make core assignments somewhat "sticky" for the main thread that invokes the Wasm in a given process. So one set of runs may land on a core running at, say, 3.6GHz and another set of runs will land on a core at 4.2GHz; this introduces bias that is hard to get rid of, even if we bump the iteration count. (Or at least, that's part of what seems to be going on in my case.)

One way around this is to go completely in the other direction, and ensure we sample on all cores. Two particular changes might help: (i) for a given compilation, do a bunch of instantiations and executions, so we can take more samples relatively cheaply; (ii) explicitly bounce between cores. Think of this like another dimension of randomization (akin to ASLR)... Thoughts?
(For the core-bouncing, there seem to be a few crates that manipulate a thread's CPU affinity; we could e.g. spawn a thread for each known core, or just migrate a single thread.)
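To make the core-bouncing idea concrete, here is a minimal sketch assuming the `core_affinity` crate (one of the affinity crates alluded to above); the round-robin schedule and `run_one_iteration` are placeholders, not sightglass code.

```rust
fn main() {
    // Enumerate the cores the OS will let us run on.
    let cores = core_affinity::get_core_ids().expect("could not enumerate cores");

    let n_samples: usize = 100;
    for i in 0..n_samples {
        // Before each sample, migrate the current thread to the next core in
        // round-robin order, so samples are spread across all cores rather
        // than sticking to whichever core the scheduler favored.
        let core = cores[i % cores.len()];
        core_affinity::set_for_current(core);
        run_one_iteration();
    }
}

fn run_one_iteration() {
    // ... instantiate and execute the Wasm benchmark here ...
}
```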
I think (ii) would also be great to have. Are CPU governors generally scaling individual CPUs separately? I know big.LITTLE exists, but my understanding was that it was effectively discontinued, so AFAIK the only way we would generally see cores at different clock speeds would be CPU governors scaling individual CPUs separately.
Ah, and one more thought: have we considered any statistical analysis that would look for multi-modal distributions (and warn, at least)? If we see that e.g. half of all runs of a benchmark run in 0.3s and half in 0.5s, and the distribution looks like the sum of two Gaussians, it may be better to warn the user "please check settings X, Y, Z; you seem to be alternating between two different configurations randomly" than to just present a mean of 0.4s with some wide variance, while the latter makes more sense if we just have a single Gaussian with truly random noise.
We have #91 and talk a little bit about this in the RFC too. The idea is that we should be able to test whether samples are independent of process or iteration. We don't have anything on file to explicitly check for multi-modal distributions / non-normal distributions, but that would be good to do as well. |
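As one possible starting point for such a warning, here is a sketch of Sarle's bimodality coefficient, b = (skewness² + 1) / (excess kurtosis + 3(n−1)²/((n−2)(n−3))), in plain Rust; values well above 5/9 (the value for a uniform distribution) suggest the samples may be multi-modal. The threshold and the example data are illustrative assumptions, not anything sightglass currently does.

```rust
fn bimodality_coefficient(samples: &[f64]) -> Option<f64> {
    let n = samples.len();
    if n < 4 {
        return None;
    }
    let nf = n as f64;
    let mean = samples.iter().sum::<f64>() / nf;
    // Central moments m2, m3, m4.
    let moment = |k: i32| samples.iter().map(|x| (x - mean).powi(k)).sum::<f64>() / nf;
    let (m2, m3, m4) = (moment(2), moment(3), moment(4));
    if m2 == 0.0 {
        return None; // all samples identical; nothing to flag
    }
    let skewness = m3 / m2.powf(1.5);
    let excess_kurtosis = m4 / (m2 * m2) - 3.0;
    let correction = 3.0 * (nf - 1.0).powi(2) / ((nf - 2.0) * (nf - 3.0));
    Some((skewness * skewness + 1.0) / (excess_kurtosis + correction))
}

fn main() {
    // Idealized samples alternating between two configurations (~0.3s and ~0.5s).
    let samples: Vec<f64> = (0..40).map(|i| if i % 2 == 0 { 0.3 } else { 0.5 }).collect();
    if let Some(b) = bimodality_coefficient(&samples) {
        if b > 5.0 / 9.0 {
            eprintln!(
                "warning: samples look multi-modal (b = {:.2}); check CPU scaling settings",
                b
            );
        }
    }
}
```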
Yup! Here's from my desktop just now:
That's with
Note that most are around 3.5GHz, but if you land on one of the last six, you're in for a (slightly faster) wild ride.
cc @abrown: we talked about removing these before, after we had more benchmark programs, and after we verified that our results are roughly in the same range as old sightglass. AFAIK, those things are pretty much done (we can always add more benchmark programs, but we have a couple solid C and Rust programs). Do you feel okay about removing the shootout benchmarks at this point?
(Sorry, coming in a bit late to the conversation.) @cfallin, I always sort of took the

@fitzgen, re: removing the shootout benchmarks, I'm conflicted. Your comment that they don't reflect real-world programs is accurate, but I'm pretty sure @jlb6740 is still using them occasionally, and in terms of analyzability it is a lot easier to figure out what is impacting performance in small programs. And they're already all set up. What about adding a "manifest" feature to the CLI, where, when no benchmark is specified, only the benchmarks in the manifest are run?
Re: perf counters (thanks @abrown), and also instruction counts as discussed here:
It may merit a separate issue to discuss and work out any implications, but I think I should offer my perspective as a benchmark user trying to tune things: instruction counts are extraordinarily useful too, and I don't think we should shy away from them so strongly, or at least, I'd like to offer their merits for (re)consideration :-)

Specifically, the two useful properties are determinism and monotonicity. Determinism completely sidesteps all of these measurement-bias issues: in principle, it should be possible to measure an exact instruction count for a single-threaded benchmark, and get the same instruction count every time. This saves a ton of headache. Monotonicity, i.e. that a decrease in instruction count should yield some decrease in runtime, is what allows it to be a fine-grained feedback signal while tweaking heuristics and the like. In other words, the slope (IPC) is variable, but gradient descent on instruction count should reduce runtime too. This was fantastically useful especially while bringing up the new backends last year.

FWIW this has always been my experience in any performance-related research too: clean metrics from deterministic models allow for more visibility and much more effective iteration, and "end-user metric" measurements are useful mostly to report the final results. The swings I'm trying to measure with regalloc2 are so far coarse-grained enough that I've been able to use wallclock time, but I guess I just want to speak up for this (instruction-counting) usage pattern and make sure it remains supported too!
This is undoubtedly true, but my response would be... does it matter? As we already discussed, they don't reflect real programs that we actually care about.
Definitely we shouldn't remove support for instruction counting! I still think our default should be cycles. |
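For anyone wanting a feel for the instruction-count workflow, here is a minimal Linux-only sketch assuming the `perf-event` crate's Builder/Counter API; the `workload` function is a placeholder, and this is not how sightglass wires up its counters.

```rust
use perf_event::{events::Hardware, Builder};

fn main() -> std::io::Result<()> {
    // Count retired instructions for the current thread.
    let mut counter = Builder::new().kind(Hardware::INSTRUCTIONS).build()?;

    counter.enable()?;
    workload(); // the phase being measured, e.g. Wasm execution
    counter.disable()?;

    // For a deterministic single-threaded workload this number should be
    // (very nearly) identical from run to run, unlike cycles or wall time.
    println!("{} instructions retired", counter.read()?);
    Ok(())
}

fn workload() {
    // ... run the benchmark phase here ...
    let _ = (0..1_000_000u64).sum::<u64>();
}
```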
Filed #142 for this. |
@cfallin Hey guys, just checking out this issue and starting to read through. When you set the scaling governor to ondemand, are you also setting scaling_min_freq and scaling_max_freq? Typically, when doing analysis where I need a baseline run, I set the governor to "userspace" if available, but more importantly set the scaling frequencies to something like 1GHz. Also pin to a core, like you say, and do runs first with hyperthreading turned off, then turned on.
Most modern CPUs scale their clock frequency according to demand, and this CPU frequency scaling is always a headache when running benchmarks. There are two main dimensions in which this variance could cause trouble:
I've been seeing some puzzling results lately and I suspect at least part of the trouble has to do with the above. I've set my CPU cores to the Linux kernel's `performance` governor, but even then, on my 12-core Ryzen CPU, I see clock speeds between 3.6GHz and 4.2GHz, likely due to best-effort frequency boost (which is regulated by thermal bounds and so unpredictable).

Note that measuring only cycles does not completely remove the effects of clock speed, because parts of performance are pinned to other clocks -- e.g., memory latency depends on the DDR clock, not the core clock, and L3 cache latency depends on the uncore clock.
The best ways I know to avoid noise from varying CPU performance are:

- check the CPU frequency scaling governor (the `/sys/devices/system/cpu/cpu*/cpufreq/scaling_governor` text file will usually be `ondemand`; we want `performance`) and warn if scaling is turned on

Thoughts? Other ideas?
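As a concrete sketch of the "check the governor and warn" idea, the following reads each core's `scaling_governor` from sysfs using only the Rust standard library and warns when any core is not set to `performance`. It is an illustration of the check, not the runner's actual implementation.

```rust
use std::fs;

fn main() -> std::io::Result<()> {
    let mut warned = false;
    for entry in fs::read_dir("/sys/devices/system/cpu")? {
        let path = entry?.path();
        let name = path.file_name().and_then(|n| n.to_str()).unwrap_or("");
        // Only look at cpu0, cpu1, ... (skip cpufreq, cpuidle, etc.).
        if !name.starts_with("cpu") || !name[3..].chars().all(|c| c.is_ascii_digit()) {
            continue;
        }
        let governor_path = path.join("cpufreq/scaling_governor");
        if let Ok(governor) = fs::read_to_string(&governor_path) {
            let governor = governor.trim();
            if governor != "performance" {
                eprintln!(
                    "warning: {} is using the `{}` governor; benchmark results may be noisy",
                    name, governor
                );
                warned = true;
            }
        }
    }
    if !warned {
        eprintln!("all cores are using the `performance` governor");
    }
    Ok(())
}
```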