[clflush] Enable x86 cpu cache flush #5914
I like the overall util for cache flushing; however, it would be great to discuss the interface of cache eviction. Here are a few alternative API choices for configuring the cache flushing behavior.

A0: Fold the cache flushing factor into `time_evaluator`:

```python
mod = load_module()
# flush the cpu cache of args, starting from arg 1
f = mod.time_evaluator("myfunc", repeat=10, cache_flush_cpu_args_begin=1)
```

A1: Decoupled, composite style:

```python
mod = load_module()
# cache_flush_packed is a packed func that performs the cpu cache flush
cache_flush_packed = remote.get_function("cpu_cache_flush")(args_begin=1)
# fprepare is a callback invoked before the evaluation; it takes the args as arguments
f = mod.time_evaluator("myfunc", repeat=10, fprepare=cache_flush_packed)
```
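For reference, a minimal sketch of how an A1-style `fprepare` hook could plug into a timing loop. This is plain Python with illustrative names, not TVM's actual implementation:

```python
import time

def make_time_evaluator(func, repeat=10, fprepare=None):
    """Return a callable that times `func` over `repeat` runs.

    If given, `fprepare` is invoked with the same args before every
    timed run (e.g. to flush the CPU cache), mirroring option A1.
    """
    def evaluator(*args):
        costs = []
        for _ in range(repeat):
            if fprepare is not None:
                fprepare(*args)  # e.g. evict the args from the cache
            start = time.perf_counter()
            func(*args)
            costs.append(time.perf_counter() - start)
        return sum(costs) / len(costs)  # mean cost per run
    return evaluator
```

Because the prepare step runs outside the timed region, each measurement starts from a known cache state.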
cc @jwfromm @junrushao1994 @merrymercy @yidawang @jcf94 for quick discussion
I am not in favor of setting environment variables. This is like an even worse global variable and is largely unsafe. If we really need a flag, it should be a global atomic flag or a thread-local flag. Do not use environment variables if we can avoid it. Thank you!
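The alternatives mentioned here could be sketched as follows. This is illustrative Python (a lock-guarded global standing in for an atomic flag, plus a thread-local override), not TVM code:

```python
import threading

# process-wide default, guarded by a lock (stands in for a global atomic)
_flush_enabled = False
_flush_lock = threading.Lock()

def set_cache_flush(enabled):
    """Set the process-wide default for cache flushing."""
    global _flush_enabled
    with _flush_lock:
        _flush_enabled = enabled

# thread-local variant: each worker thread can configure itself
_tls = threading.local()

def set_cache_flush_tls(enabled):
    """Override the flag for the current thread only."""
    _tls.flush = enabled

def cache_flush_enabled():
    # a thread-local setting wins over the process-wide default
    return getattr(_tls, "flush", _flush_enabled)
```

Unlike an environment variable, neither flag leaks across processes, and the thread-local variant lets concurrent measurement workers make independent choices.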
I agree that the cache flush mechanism is useful for getting more precise measurements. It would be great if @FrozenGene can provide some experimental data to further support this. I vote for folding the cache flushing factor into `time_evaluator` for succinctness, and making it more configurable and generic sounds good to me.
@FrozenGene please follow up
I was on vacation several days ago, sorry for replying a little late. I also want a quick discussion on one point. For single-op or Ansor single-subgraph benchmarks, we don't want to turn on cache flush, so that we can unleash their maximum performance; for network benchmarks, however, we want to turn it on for the reason mentioned. So maybe A1 makes this easier to accomplish, if everyone agrees we turn off cache flush when benchmarking a single op? I agree we should provide one API for this: if we want to support remote devices (like Arm), environment variables on our local machine cannot control them.
@yidawang The previous experimental data is mostly based on Ansor, especially x86 winograd. For the 1x7x7x512x512 winograd case, single-op tuning time could reach 0.11 ms (on one Skylake AVX-512 machine), but when executed end to end this op could even cost several milliseconds (sorry, I lost that number and only recorded the 0.11 ms). The issue is the const matrix and the weight (for example, a 3x3x512x512 weight becomes 6x6x512x512 if the tile size is 4). Another benefit of adding clflush is that we no longer need `min_repeat_ms` (like 1000), because we can measure very precisely; in this PR we even set repeat to only 10, so we can reduce our tuning time. I am collecting AutoTVM resnet18 data on one Skylake machine and will share it as soon as it completes.
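For context, the `min_repeat_ms` style of measurement keeps growing the batch size until one batch runs long enough to be timed reliably, which is what makes warm-cache tuning slow; with cache flushing, a small fixed `repeat` can replace it. A rough sketch of that loop (illustrative, not TVM's implementation):

```python
import time

def measure_min_repeat_ms(func, number=1, min_repeat_ms=1000):
    """Grow `number` until one timed batch lasts at least
    `min_repeat_ms` milliseconds; return the per-call time.
    This is the warm-cache style that a fixed, flush-per-run
    repeat count can replace."""
    while True:
        start = time.perf_counter()
        for _ in range(number):
            func()
        elapsed = time.perf_counter() - start
        if elapsed * 1000.0 >= min_repeat_ms:
            return elapsed / number
        number *= 2  # batch too short to time reliably; double it
```

With `min_repeat_ms=1000`, a microsecond-scale kernel ends up being run hundreds of thousands of times per measurement, which is why dropping this requirement shortens tuning substantially.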
I agree that A1 is more modular. @yidawang, do you have a strong preference for A0?
I want to share the test results from my Skylake machine. The CPU is an Intel(R) Xeon(R) Silver 4114 CPU @ 2.20GHz, and I use one core to tune. The tuning method follows the tutorial.

Without cache flush:

With cache flush:

The execution time:

As you can see, with clflush almost every single layer's tuning GFLOPS is lower than without clflush, but when we run end to end we get a better result. And with clflush we need much less tuning time, as we only need to repeat 10 times (or even fewer). As said before, once we have winograd for CPU this becomes even more important.
How about we go with A1 for now? @FrozenGene, can you update this PR to A1?
OK. I will update this PR to A1 next.
@tqchen If we use

```python
# cache_flush_packed is a packed func that performs the cpu cache flush
cache_flush_packed = remote.get_function("cpu_cache_flush")(args_begin=1)
# fprepare is a callback invoked before the evaluation; it takes the args as arguments
f = mod.time_evaluator("myfunc", repeat=10, fprepare=cache_flush_packed)
```

we will hit a problem in the related code (see `TVM_REGISTER_GLOBAL("runtime.RPCTimeEvaluator")`):

```cpp
TVM_REGISTER_GLOBAL("runtime.RPCTimeEvaluator")
    .set_body_typed([](Optional<Module> opt_mod, std::string name, int device_type, int device_id,
                       int number, int repeat, int min_repeat_ms, PackedFunc f_prepare) {
      TVMContext ctx;
      ctx.device_type = static_cast<DLDeviceType>(device_type);
      ctx.device_id = device_id;
      if (opt_mod.defined()) {
        Module m = opt_mod.value();
        std::string tkey = m->type_key();
        if (tkey == "rpc") {
          // Wrong: f_prepare cannot be forwarded across the RPC boundary here
          return static_cast<RPCModuleNode*>(m.operator->())
              ->GetTimeEvaluator(name, ctx, number, repeat, min_repeat_ms, f_prepare);
          // Pass: the local wrapping below works
          ctx.device_type = static_cast<DLDeviceType>(ctx.device_type % kRPCSessMask);
          return WrapTimeEvaluator(m.GetFunction(name, false), ctx, number, repeat, min_repeat_ms,
                                   f_prepare);
        } else {
          return WrapTimeEvaluator(m.GetFunction(name, false), ctx, number, repeat, min_repeat_ms,
                                   f_prepare);
        }
      }
```
Do you have any idea about it? @tqchen |
Ah, you are right. Right now it is hard to pass a function as an argument to the remote, because it is wrapped under `std::function`; we could lift that restriction later once we fold PackedFunc into an object. For now we might need to pass `fprepare` as a string naming a registered function that skips the first argument by default:

```python
f = mod.time_evaluator("myfunc", repeat=10, fprepare="cache_flush_cpu_non_first_arg")
```

This is still a bit more flexible than A0, as it enables multiple cache flush options, but less flexible than A1. Migration to A1 is still possible later.
@tqchen Sorry to drop the ball as I was out the entire last week. The current implementation looks good to me. |
Another point I want to discuss quickly: when we tune a single op / single-layer network, we don't want to flush the cache, so that we can get maximum performance. The previous PR used an environment variable, which controlled this easily (set it in `tune_network` but not when tuning a single-layer network). If we fold it into `time_evaluator`, how do we distinguish these two conditions? One method I can come up with is to add one extra tuning option (like
@comaniac I have updated the doc and changed the name from
I didn't see the update on my side. I will check again later.
I can see it now from my side. It should work on your side too?
LGTM. Thanks.
@merrymercy I have updated the doc. For |
Thanks @FrozenGene @merrymercy @comaniac @yidawang |
When we tune with TVM, the measured kernel occupies the cache fully and we do not flush it between iterations. This causes problems in e2e testing, since arrays that we assume reside in cache (i.e. weights) are evicted during e2e runs, which leads to lower performance. This has been demonstrated in Ansor.
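The effect can be approximated even from Python: streaming through a buffer larger than the last-level cache between runs evicts the working set, so "cold" timings typically come out slower than back-to-back "warm" ones. This is a portable approximation for illustration, not the `_mm_clflush` path this PR adds:

```python
import time

def evict_cache(nbytes=64 * 1024 * 1024):
    """Approximate a cache flush by touching a buffer larger than the
    last-level cache, evicting previously cached lines. Returns the
    checksum so the sweep is not optimized away."""
    sweep = bytearray(nbytes)
    total = 0
    for i in range(0, nbytes, 4096):  # one touch per page suffices here
        total += sweep[i]
    return total

def time_once(func, *args):
    start = time.perf_counter()
    func(*args)
    return time.perf_counter() - start

data = bytearray(1 << 20)               # 1 MiB working set
workload = lambda buf: sum(buf[::512])  # strided reads over the buffer

warm = min(time_once(workload, data) for _ in range(5))  # cache-resident reruns
cold_times = []
for _ in range(5):
    evict_cache()                       # evict `data` before each timing
    cold_times.append(time_once(workload, data))
cold = min(cold_times)
```

Comparing `warm` and `cold` shows why un-flushed tuning measurements overestimate e2e performance for cache-sensitive ops.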
@merrymercy @tqchen @jcf94 @minminsun