
[clflush] Enable x86 cpu cache flush #5914

Merged: 8 commits merged into apache:master on Jul 15, 2020

Conversation

FrozenGene
Member

During tuning, TVM lets the benchmarked kernel occupy the cache fully and never flushes it between iterations. This causes problems in e2e testing, since arrays that tuning assumed were resident in cache (e.g. weights) are actually evicted during e2e runs,
which leads to lower performance. This has been demonstrated in Ansor.
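
For context, here is a minimal sketch of the measurement gap, written against the string-named flush hook this PR eventually adds (the hook name appears later in this thread; mod, ctx, a, and w are placeholders):

# Tuning-style timing: repeats run back to back, so the weights stay hot in cache.
f_warm = mod.time_evaluator("myfunc", ctx, number=100, repeat=3)

# e2e-style timing: flush every argument except the first (the activation)
# before each run, so the weights must be refetched from DRAM.
f_cold = mod.time_evaluator("myfunc", ctx, number=1, repeat=10,
                            f_preproc="cache_flush_cpu_non_first_arg")

print(f_warm(a, w).mean)  # optimistic, cache-resident number
print(f_cold(a, w).mean)  # closer to the op's cost inside a full network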

@merrymercy @tqchen @jcf94 @minminsun

@tqchen
Member

tqchen commented Jun 24, 2020

I like the overall util for cache flushing. However, it would be great to discuss the interface for cache eviction. In terms of API choices:

  • While it is understandable that we would like to keep the first argument (the activation) resident and flush the rest of the arguments, that is still a very specific setup (ideally it should be configurable).
  • Right now things are configured through an environment variable; is that the best way to configure the API?
  • The current logic does not check for contexts other than CPU and will result in undefined behavior on OpenCL or CUDA (because the opaque data pointer does not correspond to a CPU address); it might also cause problems when an argument is not a DLTensor.

Here are a few alternative API choices for configuring the cache flushing behavior.

A0: Fold cache flushing factor into time_evaluator

mod = load_module()
# flush the cpu cache of all arguments from index 1 onward
f = mod.time_evaluator("myfunc", repeat=10, cache_flush_cpu_args_begin=1)

A1: Decoupled Composite style

mod = load_module()
# cache_flush_packed is a packed func that performs the cpu cache flush;
# here it is configured to flush all arguments from index 1 onward
cache_flush_packed = remote.get_function("cpu_cache_flush")(begin=1)
# fprepare is a callback invoked before the evaluation; it receives the args as arguments
f = mod.time_evaluator("myfunc", repeat=10, fprepare=cache_flush_packed)
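
To make the composite style concrete, the evaluator loop would invoke fprepare before each measurement, roughly like this (a pseudocode sketch, not the actual WrapTimeEvaluator implementation):

import time

def evaluate(func, args, repeat, fprepare=None):
    costs = []
    for _ in range(repeat):
        if fprepare is not None:
            fprepare(*args)          # e.g. flush the cpu cache of args[1:]
        t0 = time.perf_counter()
        func(*args)
        costs.append(time.perf_counter() - t0)
    return costs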

@tqchen added the status: need update label Jun 24, 2020
@tqchen
Member

tqchen commented Jun 24, 2020

cc @jwfromm @junrushao1994 @merrymercy @yidawang @jcf94 for quick discussion

@tqchen self-assigned this Jun 24, 2020
@junrushao
Member

I am not in favor of setting environment variables. An environment variable is effectively an even worse kind of global variable and is largely unsafe. If we really need global state, a global atomic flag or a thread-local flag would be better. Let's not use environment variables if we can avoid them. Thank you!

@yidawang
Contributor

I agree that the cache flush mechanism is useful for getting more precise measurements. It would be great if @FrozenGene could provide some experimental data to further support this.

I vote for folding the cache flushing factor into time_evaluator for succinctness. Making it more configurable and generic sounds good to me.

@tqchen
Member

tqchen commented Jun 26, 2020

@FrozenGene please follow up

@FrozenGene
Member Author

FrozenGene commented Jun 28, 2020

I was on vacation until a few days ago; sorry for the late reply. I also want to have a quick discussion on one point. For a single-op / single-subgraph benchmark (as in Ansor), we don't want to turn on cache flushing, so that we can unleash maximum performance; for a network benchmark, however, we do want cache flushing on, for the reason mentioned. So maybe A1 makes this easier, if everyone agrees we should turn cache flushing off for single-op benchmarks?

I agree we should provide an API for this: if we want to support remote devices (like ARM), environment variables set on our local machine cannot control the remote.

@FrozenGene
Member Author

FrozenGene commented Jun 28, 2020

> I agree that the cache flush mechanism is useful for getting more precise measurements. It would be great if @FrozenGene could provide some experimental data to further support this.
>
> I vote for folding the cache flushing factor into time_evaluator for succinctness. Making it more configurable and generic sounds good to me.

@yidawang The previous experimental data is mostly from Ansor, especially x86 winograd. For a winograd conv like 1x7x7x512x512, the tuned single-op time could reach 0.11 ms (on one Skylake AVX-512 machine), but when executed e2e this op could cost several ms (sorry, I lost that number and only recorded the 0.11 ms). The issue is the constant matrix and the weight: for example, a 3x3x512x512 weight becomes 6x6x512x512 when the tile size is 4, so it no longer stays in cache the way tuning assumed.

Another benefit of adding clflush is that we no longer need a large min_repeat_ms (like 1000), because each run can be measured precisely. In this PR we set repeat to only 10, so tuning time is reduced.
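
In time_evaluator terms the change looks roughly like this (a sketch; the hook spelling follows the later discussion in this thread):

# Without flushing: pad each measurement to >= 1000 ms so cache effects average out.
f = mod.time_evaluator("myfunc", ctx, number=4, repeat=3, min_repeat_ms=1000)

# With flushing: every repeat starts from a cold cache, so 10 short repeats
# already give a stable number and tuning time drops accordingly.
f = mod.time_evaluator("myfunc", ctx, number=1, repeat=10,
                       f_preproc="cache_flush_cpu_non_first_arg")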

I am collecting AutoTVM ResNet-18 data on a Skylake machine and will share it as soon as it completes.

@tqchen
Member

tqchen commented Jun 28, 2020

I agree that A1 is more modular. @yidawang, do you have a strong preference for A0?

@FrozenGene
Member Author

FrozenGene commented Jun 29, 2020

I want to share my test results on a Skylake machine. The CPU is an Intel(R) Xeon(R) Silver 4114 CPU @ 2.20GHz, and I use one core to tune. The tuning setup is the tune_relay_x86.py tutorial (without clflush) versus this PR's modified tune_relay_x86.py (with clflush).

Without cache flush:

[Task  1/12]  Current/Best:   43.31/  44.04 GFLOPS | Progress: (800/800) | 1325.94 s Done.
[Task  2/12]  Current/Best:    4.33/  44.02 GFLOPS | Progress: (720/720) | 1152.81 s Done.
[Task  3/12]  Current/Best:   44.19/  65.08 GFLOPS | Progress: (972/972) | 1492.48 s Done.
[Task  4/12]  Current/Best:   39.14/  57.74 GFLOPS | Progress: (864/864) | 1292.03 s Done.
[Task  5/12]  Current/Best:   53.04/  66.45 GFLOPS | Progress: (1024/1024) | 1584.45 s Done.
[Task  6/12]  Current/Best:   42.93/  54.29 GFLOPS | Progress: (896/896) | 1382.46 s Done.
[Task  7/12]  Current/Best:    8.67/  63.83 GFLOPS | Progress: (980/980) | 1501.93 s Done.
[Task  8/12]  Current/Best:   32.04/  57.19 GFLOPS | Progress: (308/308) | 484.76 s Done.
[Task  9/12]  Current/Best:   12.25/  46.87 GFLOPS | Progress: (980/980) | 2161.56 s Done.
[Task 10/12]  Current/Best:   41.17/  49.36 GFLOPS | Progress: (896/896) | 2174.74 s Done.
[Task 11/12]  Current/Best:   17.24/  49.36 GFLOPS | Progress: (864/864) | 2075.64 s Done.
[Task 12/12]  Current/Best:   23.43/  51.69 GFLOPS | Progress: (720/720) | 1708.31 s Done.

With cache flush:

[Task  1/12]  Current/Best:   41.26/  42.29 GFLOPS | Progress: (800/800) | 543.79 s Done.
[Task  2/12]  Current/Best:    4.30/  41.93 GFLOPS | Progress: (720/720) | 338.14 s Done.
[Task  3/12]  Current/Best:   43.09/  64.36 GFLOPS | Progress: (972/972) | 503.03 s Done.
[Task  4/12]  Current/Best:   41.95/  56.23 GFLOPS | Progress: (864/864) | 350.40 s Done.
[Task  5/12]  Current/Best:   52.39/  66.52 GFLOPS | Progress: (1024/1024) | 505.65 s Done.
[Task  6/12]  Current/Best:   42.34/  53.17 GFLOPS | Progress: (896/896) | 353.18 s Done.
[Task  7/12]  Current/Best:    8.38/  62.88 GFLOPS | Progress: (980/980) | 492.13 s Done.
[Task  8/12]  Current/Best:   31.29/  57.12 GFLOPS | Progress: (308/308) | 166.95 s Done.
[Task  9/12]  Current/Best:   12.36/  40.97 GFLOPS | Progress: (980/980) | 302.91 s Done.
[Task 10/12]  Current/Best:   36.14/  41.85 GFLOPS | Progress: (896/896) | 264.56 s Done.
[Task 11/12]  Current/Best:   16.24/  48.53 GFLOPS | Progress: (864/864) | 257.15 s Done.
[Task 12/12]  Current/Best:   19.53/  47.11 GFLOPS | Progress: (720/720) | 212.18 s Done.

The execution time:
88.36 ms (w/ clflush) vs. 87.26 ms (w/o clflush).

As you can see, with clflush almost every single layer's tuning GFLOPS is lower than without clflush, but when we run end-to-end we get a better result. And with clflush we need much less tuning time, since we only need repeat=10 (or even fewer).

As said before, once we have winograd for CPU this becomes even more important.

@tqchen
Member

tqchen commented Jun 29, 2020

How about we go with A1 for now? @FrozenGene, can you update this PR to A1?

@FrozenGene
Member Author

> How about we go with A1 for now? @FrozenGene, can you update this PR to A1?

OK, I will update this PR to A1 next.

@FrozenGene
Member Author

FrozenGene commented Jun 30, 2020

> How about we go with A1 for now? @FrozenGene, can you update this PR to A1?

@tqchen If we use

# cache_flush_packed is a packed func that performs the cpu cache flush;
# here it is configured to flush all arguments from index 1 onward
cache_flush_packed = remote.get_function("cpu_cache_flush")(begin=1)
# fprepare is a callback invoked before the evaluation; it receives the args as arguments
f = mod.time_evaluator("myfunc", repeat=10, fprepare=cache_flush_packed)

we hit a "Cannot pass type FunctionHandle as an argument to the remote" error. Do you have any good suggestions?

Related code (see the Wrong and Pass paths):

TVM_REGISTER_GLOBAL("runtime.RPCTimeEvaluator")
    .set_body_typed([](Optional<Module> opt_mod, std::string name, int device_type, int device_id,
                       int number, int repeat, int min_repeat_ms, PackedFunc f_prepare) {
      TVMContext ctx;
      ctx.device_type = static_cast<DLDeviceType>(device_type);
      ctx.device_id = device_id;
      if (opt_mod.defined()) {
        Module m = opt_mod.value();
        std::string tkey = m->type_key();
        if (tkey == "rpc") {
          // Wrong: forwarding f_prepare through the RPC module fails, because a
          // PackedFunc cannot be passed as an argument to the remote.
          return static_cast<RPCModuleNode*>(m.operator->())
              ->GetTimeEvaluator(name, ctx, number, repeat, min_repeat_ms, f_prepare);
          // Pass: unwrap the session mask and run the evaluator locally instead.
          ctx.device_type = static_cast<DLDeviceType>(ctx.device_type % kRPCSessMask);
          return WrapTimeEvaluator(m.GetFunction(name, false), ctx, number, repeat, min_repeat_ms,
                                   f_prepare);
        } else {
          return WrapTimeEvaluator(m.GetFunction(name, false), ctx, number, repeat, min_repeat_ms,
                                   f_prepare);
        }
      }

@FrozenGene
Member Author

Do you have any idea about it? @tqchen

@tqchen
Member

tqchen commented Jul 7, 2020

Ah, you are right. Right now it is hard to pass a function as an argument to the remote because it is wrapped in a std::function; we can lift that restriction later once we make PackedFunc a first-class object.

For now we might need to pass fprepare as a string naming a registered function that skips the first argument by default.

f = mod.time_evaluator("myfunc", repeat=10, fprepare="cache_flush_cpu_non_first_arg")

This is still a bit more flexible than A0, as it enables multiple cache flush options, though less flexible than A1. Migration to A1 is possible later.

@tqchen
Member

tqchen commented Jul 7, 2020

cc @junrushao1994

@yidawang
Contributor

yidawang commented Jul 7, 2020

@tqchen Sorry to drop the ball; I was out the entire last week. The current implementation looks good to me.

@FrozenGene
Member Author

> Ah, you are right. Right now it is hard to pass a function as an argument to the remote because it is wrapped in a std::function; we can lift that restriction later once we make PackedFunc a first-class object.
>
> For now we might need to pass fprepare as a string naming a registered function that skips the first argument by default.
>
> f = mod.time_evaluator("myfunc", repeat=10, fprepare="cache_flush_cpu_non_first_arg")
>
> This is still a bit more flexible than A0, as it enables multiple cache flush options, though less flexible than A1. Migration to A1 is possible later.

There is another point I'd like to discuss quickly. When we tune a single op / single-layer network, we don't want cache flushing, so that we can measure maximum performance. The previous version of this PR could control this easily with the environment variable (set it in tune_network but not when tuning a single layer). If we fold it into time_evaluator, how do we distinguish these two cases? One method I can come up with is adding an extra tuning option for measurement (like enable_cache_flush, with a default of False). Do you have better suggestions?

@FrozenGene
Member Author

@yidawang @tqchen The code is updated; please help review it again. Now, when we tune a network, we can set the enable_cpu_cache_flush option to true to turn on cache flushing; the default is false. I'd like to hear your opinions.
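
For illustration, the tuning-side usage would look roughly like this (a sketch assuming the standard autotvm measure-option plumbing; only enable_cpu_cache_flush is specific to this PR):

from tvm import autotvm

measure_option = autotvm.measure_option(
    builder=autotvm.LocalBuilder(),
    # Network tuning: turn flushing on. For single-op benchmarks, keep the
    # default (False) to measure peak cache-resident performance.
    runner=autotvm.LocalRunner(number=1, repeat=10, enable_cpu_cache_flush=True),
)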

@FrozenGene
Member Author

@comaniac I have updated the doc and changed the name from f_prepare to f_preproc. Could you help review it again?

@comaniac
Contributor

> @comaniac I have updated the doc and changed the name from f_prepare to f_preproc. Could you help review it again?

I didn't see the update on my side. I will check again later.

@FrozenGene
Member Author

> > @comaniac I have updated the doc and changed the name from f_prepare to f_preproc. Could you help review it again?
>
> I didn't see the update on my side. I will check again later.

I can see it now from my side. It should work from your side too?

@comaniac
Contributor

comaniac left a review comment

LGTM. Thanks.

@FrozenGene
Member Author

@merrymercy I have updated the doc. For enable_cpu_cache_flush, I combined the suggestions and added a comment noting that it only has an effect on CPU tasks. Could you spend some time reviewing it again? Thanks.

@tqchen merged commit ae4480a into apache:master Jul 15, 2020
@tqchen added the status: accepted label and removed the status: need update label Jul 15, 2020