[clflush] Enable x86 cpu cache flush #5914
I like the overall util for cache flushing; however, it would be great to discuss the interface of cache eviction. Here are a few alternative API choices for configuring the cache flushing behavior.

A0: Fold the cache flushing factor into `time_evaluator`:

```python
mod = load_module()
# flush the cpu cache of args, starting from arg 1
f = mod.time_evaluator("myfunc", repeat=10, cache_flush_cpu_args_begin=1)
```

A1: Decoupled, composite style:

```python
mod = load_module()
# cache_flush_packed is a packed func that performs the cpu cache flush
cache_flush_packed = remote.get_function("cpu_cache_flush")(args_begin=1)
# fprepare is a callback invoked before the evaluation; it takes the args as arguments
f = mod.time_evaluator("myfunc", repeat=10, fprepare=cache_flush_packed)
```
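For reference, a minimal sketch of how an A1-style `fprepare` hook could plug into a timing loop. This is plain Python with illustrative names, not TVM's actual implementation:

```python
import time

def make_time_evaluator(func, repeat=10, fprepare=None):
    """Return a callable that times `func` over `repeat` runs.

    If given, `fprepare` is invoked with the same args before every
    timed run (e.g. to flush the CPU cache), mirroring option A1.
    """
    def evaluator(*args):
        costs = []
        for _ in range(repeat):
            if fprepare is not None:
                fprepare(*args)  # e.g. evict the args from the cache
            start = time.perf_counter()
            func(*args)
            costs.append(time.perf_counter() - start)
        return sum(costs) / len(costs)  # mean cost per run
    return evaluator
```

Because the prepare step runs outside the timed region, each measurement starts from a known cache state.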
cc @jwfromm @junrushao1994 @merrymercy @yidawang @jcf94 for quick discussion
I am not in favor of setting environment variables. This is like an even worse global variable and is largely unsafe. If we really need a flag, it should be a global atomic flag or a thread-local flag. Do not use environment variables if we can avoid it. Thank you!
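The alternatives mentioned here could be sketched as follows. This is illustrative Python (a lock-guarded global standing in for an atomic flag, plus a thread-local override), not TVM code:

```python
import threading

# process-wide default, guarded by a lock (stands in for a global atomic)
_flush_enabled = False
_flush_lock = threading.Lock()

def set_cache_flush(enabled):
    """Set the process-wide default for cache flushing."""
    global _flush_enabled
    with _flush_lock:
        _flush_enabled = enabled

# thread-local variant: each worker thread can configure itself
_tls = threading.local()

def set_cache_flush_tls(enabled):
    """Override the flag for the current thread only."""
    _tls.flush = enabled

def cache_flush_enabled():
    # a thread-local setting wins over the process-wide default
    return getattr(_tls, "flush", _flush_enabled)
```

Unlike an environment variable, neither flag leaks across processes, and the thread-local variant lets concurrent measurement workers make independent choices.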
I agree that the cache flush mechanism is useful for getting more precise measurements. It would be great if @FrozenGene can provide some experimental data to further support this. I vote for folding the cache flushing factor into `time_evaluator` for succinctness, and making it more configurable and generic sounds good to me.
@FrozenGene please follow up
I was on vacation several days ago, sorry for replying a little late. I also want a quick discussion on one point. For single-op or Ansor single-subgraph benchmarks, we don't want to turn on cache flush, so that we can unleash their maximum performance; for network benchmarks, however, we want to turn it on for the reason mentioned. So maybe A1 makes this easier to accomplish, if everyone agrees we turn off cache flush when benchmarking a single op? I agree we should provide one API for this: if we want to support remote devices (like Arm), environment variables on our local machine cannot control them.
@yidawang The previous experimental data is mostly based on Ansor, especially x86 winograd. For the 1x7x7x512x512 winograd case, single-op tuning time could reach 0.11 ms (on one Skylake AVX-512 machine), but when executed end to end this op could even cost several milliseconds (sorry, I lost that number and only recorded the 0.11 ms). The issue is the const matrix and the weight (for example, a 3x3x512x512 weight becomes 6x6x512x512 if the tile size is 4). Another benefit of adding clflush is that we no longer need `min_repeat_ms` (like 1000), because we can measure very precisely; in this PR we even set repeat to only 10, so we can reduce our tuning time. I am collecting AutoTVM resnet18 data on one Skylake machine and will share it as soon as it completes.
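For context, the `min_repeat_ms` style of measurement keeps growing the batch size until one batch runs long enough to be timed reliably, which is what makes warm-cache tuning slow; with cache flushing, a small fixed `repeat` can replace it. A rough sketch of that loop (illustrative, not TVM's implementation):

```python
import time

def measure_min_repeat_ms(func, number=1, min_repeat_ms=1000):
    """Grow `number` until one timed batch lasts at least
    `min_repeat_ms` milliseconds; return the per-call time.
    This is the warm-cache style that a fixed, flush-per-run
    repeat count can replace."""
    while True:
        start = time.perf_counter()
        for _ in range(number):
            func()
        elapsed = time.perf_counter() - start
        if elapsed * 1000.0 >= min_repeat_ms:
            return elapsed / number
        number *= 2  # batch too short to time reliably; double it
```

With `min_repeat_ms=1000`, a microsecond-scale kernel ends up being run hundreds of thousands of times per measurement, which is why dropping this requirement shortens tuning substantially.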
I agree that A1 is more modular. @yidawang, do you have a strong preference for A0?
I want to share the test results from my Skylake machine. The CPU is an Intel(R) Xeon(R) Silver 4114 CPU @ 2.20GHz, and I use one core to tune. The tuning method follows the tutorial.

Without cache flush:

With cache flush:

The execution time:

As you can see, with clflush almost every single layer's tuning GFLOPS is lower than without clflush, but when we run end to end we get a better result. And with clflush we need much less tuning time, as we only need to repeat 10 times (or even fewer). As said before, once we have winograd for CPU this becomes even more important.
How about we go with A1 for now? @FrozenGene, can you update this PR to A1?
OK. I will update this PR to A1 next.
@tqchen If we use

```python
# cache_flush_packed is a packed func that performs the cpu cache flush
cache_flush_packed = remote.get_function("cpu_cache_flush")(args_begin=1)
# fprepare is a callback invoked before the evaluation; it takes the args as arguments
f = mod.time_evaluator("myfunc", repeat=10, fprepare=cache_flush_packed)
```

we will hit a problem in the related code (see `TVM_REGISTER_GLOBAL("runtime.RPCTimeEvaluator")`):

```cpp
TVM_REGISTER_GLOBAL("runtime.RPCTimeEvaluator")
    .set_body_typed([](Optional<Module> opt_mod, std::string name, int device_type, int device_id,
                       int number, int repeat, int min_repeat_ms, PackedFunc f_prepare) {
      TVMContext ctx;
      ctx.device_type = static_cast<DLDeviceType>(device_type);
      ctx.device_id = device_id;
      if (opt_mod.defined()) {
        Module m = opt_mod.value();
        std::string tkey = m->type_key();
        if (tkey == "rpc") {
          // Wrong: f_prepare cannot be forwarded across the RPC boundary here
          return static_cast<RPCModuleNode*>(m.operator->())
              ->GetTimeEvaluator(name, ctx, number, repeat, min_repeat_ms, f_prepare);
          // Pass: the local wrapping below works
          ctx.device_type = static_cast<DLDeviceType>(ctx.device_type % kRPCSessMask);
          return WrapTimeEvaluator(m.GetFunction(name, false), ctx, number, repeat, min_repeat_ms,
                                   f_prepare);
        } else {
          return WrapTimeEvaluator(m.GetFunction(name, false), ctx, number, repeat, min_repeat_ms,
                                   f_prepare);
        }
      }
```
Do you have any idea about it? @tqchen |
Ah, you are right. Right now it is hard to pass a function as an argument to the remote, because it is wrapped under `std::function`; we could lift that restriction later once we fold PackedFunc into an object. For now we might need to pass `fprepare` as a string naming a registered function that skips the first argument by default:

```python
f = mod.time_evaluator("myfunc", repeat=10, fprepare="cache_flush_cpu_non_first_arg")
```

This is still a bit more flexible than A0, as it enables multiple cache flush options, but less flexible than A1. Migration to A1 is still possible later.
@tqchen Sorry to drop the ball as I was out the entire last week. The current implementation looks good to me. |
Another point I want to discuss quickly: when we tune a single op / single-layer network, we don't want to flush the cache, so that we can get maximum performance. The previous PR used an environment variable, which controlled this easily (set it in `tune_network` but not when tuning a single-layer network). If we fold it into `time_evaluator`, how do we distinguish these two conditions? One method I can come up with is to add one extra tuning option (like
@comaniac I have updated the doc and changed the name from
I didn't see the update on my side. I will check again later.
I can see it now from my side. It should work on your side too?
LGTM. Thanks.
@merrymercy I have updated the doc. For |
Thanks @FrozenGene @merrymercy @comaniac @yidawang |
When we tune with TVM, the measured kernel occupies the cache fully and we do not flush it between iterations. This causes problems in e2e testing, since arrays that we assume reside in cache (i.e. weights) are evicted during e2e runs, which leads to lower performance. This has been demonstrated in Ansor.
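The effect can be approximated even from Python: streaming through a buffer larger than the last-level cache between runs evicts the working set, so "cold" timings typically come out slower than back-to-back "warm" ones. This is a portable approximation for illustration, not the `_mm_clflush` path this PR adds:

```python
import time

def evict_cache(nbytes=64 * 1024 * 1024):
    """Approximate a cache flush by touching a buffer larger than the
    last-level cache, evicting previously cached lines. Returns the
    checksum so the sweep is not optimized away."""
    sweep = bytearray(nbytes)
    total = 0
    for i in range(0, nbytes, 4096):  # one touch per page suffices here
        total += sweep[i]
    return total

def time_once(func, *args):
    start = time.perf_counter()
    func(*args)
    return time.perf_counter() - start

data = bytearray(1 << 20)               # 1 MiB working set
workload = lambda buf: sum(buf[::512])  # strided reads over the buffer

warm = min(time_once(workload, data) for _ in range(5))  # cache-resident reruns
cold_times = []
for _ in range(5):
    evict_cache()                       # evict `data` before each timing
    cold_times.append(time_once(workload, data))
cold = min(cold_times)
```

Comparing `warm` and `cold` shows why un-flushed tuning measurements overestimate e2e performance for cache-sensitive ops.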
@merrymercy @tqchen @jcf94 @minminsun