
[TIR][TOPI][x86][CI] Support skylake avx512 #13621

Merged: 84 commits into apache:main, Jan 17, 2023

Conversation

@vvchernov (Contributor) commented Dec 15, 2022:

It looks like, although some AVX512 intrinsics are supported (see topi/x86 and tir), they are not used during plain compilation or when tuning a model with the meta-scheduler on the skylake-avx512 target.
The aim is end-to-end support of the Skylake-X architecture on the TVM side for dense, batch_matmul and conv ops.

Details

  1. CI tests were added.
  2. The code was extended to compile dense, batch_matmul and conv2d with AVX512 instructions without VNNI (see the sketch below); some fixes were made for the latter.
  3. The code was extended to tune dense, batch_matmul and conv2d with the meta-scheduler using AVX512 instructions without VNNI.
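
For context, a minimal sketch of the compilation flow this PR targets; `mod`, `params` and the `my_quantized_model` helper are placeholders standing in for any quantized Relay model, not part of this PR:

    import tvm
    from tvm import relay

    # Placeholder inputs: any Relay module with int8 dense/batch_matmul/conv2d ops.
    mod, params = my_quantized_model()  # hypothetical helper

    # Target Skylake-X; -mcpu=skylake-avx512 enables AVX512 without VNNI.
    target = tvm.target.Target("llvm -mcpu=skylake-avx512")
    with tvm.transform.PassContext(opt_level=3):
        lib = relay.build(mod, target=target, params=params)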

@tvm-bot (Collaborator) commented Dec 15, 2022:

Thanks for contributing to TVM! Please refer to the contributing guidelines https://tvm.apache.org/docs/contribute/ for useful information and tips. Please request code reviews from Reviewers by @-ing them in a comment.

Generated by tvm-bot

@vvchernov vvchernov changed the title WIP: [TIR][TOPI][CI] Support skylake avx512 WIP: [TIR][TOPI][x86][CI] Support skylake avx512 Dec 15, 2022
@vvchernov vvchernov force-pushed the vc/support_skylake_avx512 branch 2 times, most recently from 5601b8d to 3a8fbf3 on December 23, 2022 08:59
@cbalint13 (Contributor) left a comment:

Thank you @vvchernov for this excellent addition on x86 coverage!

  • I've suggested a few nits in the code, mostly cosmetic.
  • We could add (in a next PR?) the 4x4 (ssse3/m128) and 8x4 (avx2/m256) counterparts of the skylake one (a compact form follows after this block):

    if int32_lanes == 4:
        int_lx32 = "int16x8"
        int_8xl = "int8x16"
        int_32xl = "int32x4"
        pmaddubs = "llvm.x86.ssse3.pmadd.ub.sw.128"
        pmaddw = "llvm.x86.sse2.pmadd.wd"
    elif int32_lanes == 8:
        int_lx32 = "int16x16"
        int_8xl = "int8x32"
        int_32xl = "int32x8"
        pmaddubs = "llvm.x86.avx2.pmadd.ub.sw"
        pmaddw = "llvm.x86.avx2.pmadd.wd"
    elif int32_lanes == 16:
        int_lx32 = "int16x32"
        int_8xl = "int8x64"
        int_32xl = "int32x16"
        pmaddubs = "llvm.x86.avx512.pmaddubs.w.512"
        pmaddw = "llvm.x86.avx512.pmaddw.d.512"

python/tvm/tir/tensor_intrin/x86.py (outdated, resolved, x2)
Comment on lines 630 to 638
and target_has_vnni(mcpu)
and target_has_avx512(mcpu)
and inputs[0].dtype == "uint8"
and inputs[1].dtype == "int8"
and inputs[1].shape[-2] % 16 == 0
and inputs[1].shape[-1] % 4 == 0
):
strategy.add_implementation(
wrap_compute_batch_matmul(topi.x86.batch_matmul_vnni_compute, need_out_dtype=True),
wrap_topi_schedule(topi.x86.schedule_batch_matmul_vnni),
name="batch_matmul_vnni.x86",
@cbalint13 (Contributor) commented Dec 24, 2022:

  • Don't remove the vnni one completely.
  • Instead, maybe a descending strategy would be better (a sketch follows below):
    - if has_vnni       // llvm -mcpu=cascadelake
    - elif has_avx512   // llvm -mcpu=skylake-avx512
    - elif has_avx2 (in a future PR)   // llvm -mcpu=haswell
    - elif has_ssse3 (in a future PR)  // llvm -mcpu=sandybridge
  • See strategy/x86.py, and the descending plevel, in the upcoming PR#13642
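
A sketch of what that descending dispatch could look like for dense; the plevel values are illustrative, and dense_int8/schedule_dense_int8 follow this PR's renaming (assumptions, not final API):

    from tvm import topi
    from tvm.relay.op.strategy.generic import wrap_compute_dense, wrap_topi_schedule
    from tvm.topi.x86.utils import target_has_avx512, target_has_vnni

    def add_int8_dense_implementations(strategy, mcpu):
        # Prefer VNNI when available, fall back to plain AVX512 otherwise;
        # a higher plevel wins when several implementations are registered.
        if target_has_vnni(mcpu):  # llvm -mcpu=cascadelake
            strategy.add_implementation(
                wrap_compute_dense(topi.x86.dense_int8),
                wrap_topi_schedule(topi.x86.schedule_dense_int8),
                name="dense_vnni.x86",
                plevel=13,
            )
        elif target_has_avx512(mcpu):  # llvm -mcpu=skylake-avx512
            strategy.add_implementation(
                wrap_compute_dense(topi.x86.dense_int8),
                wrap_topi_schedule(topi.x86.schedule_dense_int8),
                name="dense_int8.x86",
                plevel=12,
            )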

@vvchernov (Contributor, Author) commented:

Hello @cbalint13! Thank you for your nits and remarks! In this case VNNI was not removed but extended; as you know, VNNI is part of the AVX512 family. The fork is here:
https://github.com/apache/tvm/blob/main/python/tvm/topi/x86/tensor_intrin.py#:~:text=def%20dot_16x1x16_uint8_int8_int32()%3A,return%20dot_16x1x16_uint8_int8_int32_skylake()
As you correctly remarked, avx2 and ssse3 are also handled there, but they are not accessible due to the high-level target_has_avx512 check. Possibly your suggestion is a good way to resolve this further. For now I have only extended the existing approach to avx512.

@cbalint13 (Contributor) commented Dec 26, 2022:

> …know VNNI is part of the AVX512 family. The fork is here:
> https://github.com/apache/tvm/blob/main/python/tvm/topi/x86/tensor_intrin.py#:~:text=def%20dot_16x1x16_uint8_int8_int32()%3A,return%20dot_16x1x16_uint8_int8_int32_skylake()
> As you correctly remarked, avx2 and ssse3 are also handled there, but they are not accessible due to the high-level target_has_avx512 check.

I can't see the original llvm.x86.vnni intrinsic in the above.
This switch, right in the provided tensor_intrin.py fork:

    if target_has_vnni(mcpu):
        # VNNI capable platform
        return dot_16x1x16_uint8_int8_int32_cascadelake()
    # vpmaddubsw/vpmaddwd fallback
    return dot_16x1x16_uint8_int8_int32_skylake()

As more +SIMD keeps coming, wouldn't it be better to stay with the upcoming strategy/x86.py if/elif + preference plevel?

  • tensor_intrin.py would then remain only enums of SIMD (no decisions); the triage would stay in strategy, etc.
  • The user may control the fall into a strategy by narrowing "llvm -mattr=+avx512bw,+avxvnni,...", as LLVM flags (examples below).
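
For illustration, the ways a user could steer this via target strings; the exact feature list in the last one is an assumption about a particular CPU, not a requirement:

    import tvm

    tgt_vnni   = tvm.target.Target("llvm -mcpu=cascadelake")     # has VNNI
    tgt_avx512 = tvm.target.Target("llvm -mcpu=skylake-avx512")  # AVX512, no VNNI
    tgt_custom = tvm.target.Target("llvm -mattr=+avx512f,+avx512bw,+avx512vnni")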

@vvchernov (Contributor, Author) commented:

Hello @cbalint13! Your view looks reasonable, and there is no problem reimplementing it from my side. But I did not implement the method dot_16x1x16_uint8_int8_int32 with conditions on the tensor_intrin.py side, and I thought of it as a brick for building some concept. @elvin-n and @jwfromm, what do you think about Balint's view?

@cbalint13 (Contributor) commented Dec 26, 2022:

> Hello @cbalint13! Your view looks reasonable, and there is no problem reimplementing it from my side. But I did not implement the method dot_16x1x16_uint8_int8_int32 with conditions on the tensor_intrin.py side, and I thought of it as a brick for building some concept. @elvin-n and @jwfromm, what do you think about Balint's view?

@vvchernov,

Thanks for the clarifications, I see your point; it is a perfectly fine way too.
I think that by making all CI tests pass in green, more reviewers will come.

Let me try to sum up this very pinned PR change on strategy/x86.py, visible at the top of this thread:

-from tvm.topi.x86.utils import target_has_vnni
+from tvm.topi.x86.utils import target_has_avx512

-        and target_has_vnni(mcpu)
+        and target_has_avx512(mcpu)

-        wrap_compute_dense(topi.x86.dense_vnni),
-        wrap_topi_schedule(topi.x86.schedule_dense_vnni),
+        wrap_compute_dense(topi.x86.dense_int8),
+        wrap_topi_schedule(topi.x86.schedule_dense_int8),
  • This merges vnni into avx512 (under the new dense_int8 umbrella), arguing that VNNI is a subset of the AVX512 group.
  • VNNI is a subset of the AVX512 group; however, there are CPUs that have AVX512 but no VNNI [1].

[1] https://en.wikipedia.org/wiki/AVX-512#VNNI

My view was to leave avx512 & vnni separate (as it was) in strategy/x86.py (not to merge vnni -> avx512).
My argument was to triage any SIMD right in strategy/x86.py, as the upcoming AMX does here, plus plevel control.
I saw VNNI and AVX512 (+AVX2, SSSE3) as potentially independent things, moreover choosable via "llvm -mattr=...".

Contributor commented:

The instruction set for SSE/AVX2/AVX512 for int8 is essentially the same; the only difference is the number of lanes. Additionally, the pattern in which these int8 instructions (VPMADDUBSW/VPMADDWD/VPADDD) are used is the same as for the single VNNI instruction (VPDPBUSD). I.e., it is reasonable to have only one TVM intrinsic, to remove VNNI from the function name, and to extend this intrinsic to SSE and AVX2, which is not yet done in this PR.

Contributor commented:

> The instruction set for SSE/AVX2/AVX512 for int8 is essentially the same; the only difference is the number of lanes.

Yes, the same class doing integer dot products on registers, but note:

  • different clocking, timing & implementation on ASIC
  • (auto)tensorization opportunities differ, as inner loops match differently

> Additionally, the pattern in which these int8 instructions (VPMADDUBSW/VPMADDWD/VPADDD) are used is the same as for the single VNNI instruction (VPDPBUSD).

Right.

  • The VNNI instruction accumulates into int32 lanes in a single step: vpdpbusd
  • AVX512 (incl. the AVX2 and SSSE3 variants) does the same in two steps, e.g.: pmaddubs + pmaddw

> I.e., it is reasonable to have only one TVM intrinsic, to remove VNNI from the function name, and to extend this intrinsic to SSE and AVX2, which is not yet done in this PR.

  • Indeed, the proposed intrinsic merger is perfectly fine.
  • It was still possible to question it with reasonable arguments.
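
For intuition, a rough NumPy model of the two patterns (a sketch only; inputs are flat arrays whose length is a multiple of 4, grouped 4 bytes per int32 lane):

    import numpy as np

    def vnni_one_step(acc_i32, data_u8, kern_i8):
        # vpdpbusd: each int32 lane accumulates the dot of 4 u8*i8 pairs in one step.
        prod = data_u8.astype(np.int32) * kern_i8.astype(np.int32)
        return acc_i32 + prod.reshape(-1, 4).sum(axis=1)

    def avx512_two_step(acc_i32, data_u8, kern_i8):
        # vpmaddubsw: u8*i8 products, adjacent pairs summed with signed int16 saturation.
        prod = data_u8.astype(np.int32) * kern_i8.astype(np.int32)
        pairs = np.clip(prod.reshape(-1, 2).sum(axis=1), -32768, 32767)
        # vpmaddwd (against a vector of ones): widen and sum adjacent int16 pairs to int32.
        lanes = pairs.reshape(-1, 2).sum(axis=1)
        # vpaddd: accumulate into the int32 lanes.
        return acc_i32 + lanes

Apart from the saturation in the two-step path, both yield the same int32 lanes, which is why a single TVM intrinsic interface can cover both.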

Contributor commented:

> different clocking, timing & implementation on ASIC

What kind of ASIC do you mean?

> (auto)tensorization opportunities differ, as inner loops match differently

By "tensorization opportunities differ", do you mean a different number of lanes for different instruction sets, which can be reflected in potentially different blocking sizes? Or something else?

@cbalint13 (Contributor) commented Dec 27, 2022:

> different clocking, timing & implementation on ASIC
> What kind of ASIC do you mean?

  • CPUs, the x86 family: different generations with varying extended ISA layouts: amx, avx512, vnni, avx2, ssse3, sse2

> (auto)tensorization opportunities differ, as inner loops match differently
> By "tensorization opportunities differ", do you mean a different number of lanes for different instruction sets, which can be reflected in potentially different blocking sizes?

  • Yes, both input widths and output lanes yield different outcomes, with varying performance.
  • E.g. the autotensorizer will opportunistically search to permute & match inner loops to these varying sizes.

> Or something else?

  • TVM is a compiler, after all; to my knowledge the only one capable of auto-tensorization with an arbitrary intrinsic.

@elvin-n (Contributor) commented Dec 28, 2022:

I would consider amx vs. vnni/avx512/avx2/ssse3 (btw, there is no sse2 for int8; the required instructions appeared, if I am not mistaken, in ssse3), because the first is matrix multiplication while the others are vector instructions. For now I propose to go from local to generic, and when we see a need to differentiate the vector sets, we will do so. For now the pattern looks similar for all vector instructions; the blocking aspect should be added separately if it is not done yet. The lanes aspect of the TVM intrinsic should be covered in this PR.

> match inner loops to these varying sizes.

The inner loop is the same for all these instructions. It will be

    for (int k = 0; k < 4; k++) {
        output[i] += data[k] * kernel[i][k];
    }

> TVM is a compiler, after all; to my knowledge the only one capable of auto-tensorization with an arbitrary intrinsic.

I agree; at the same time I propose to move from local to generic patterns. We do not limit anything for now.

@vvchernov (Contributor, Author) commented:
> Thank you @vvchernov for this excellent addition on x86 coverage!
> • I've suggested a few nits in the code, mostly cosmetic.
> • We could add (in a next PR?) the 4x4 (ssse3/m128) and 8x4 (avx2/m256) counterparts (see the int32_lanes table above).
Thanks! I agree, and I have also thought about how it can be done in a reasonable way. And yes, I think it should be done in a separate PR.

@cbalint13 (Contributor) left a comment:

@vvchernov,
It is important to make all the tests for this PR pass in green.
I'm unsure about the actual failures; maybe the CI lacks the related avx512 ISA?

tests/python/integration/test_auto_tensorize.py (outdated, resolved)
tests/python/unittest/test_meta_schedule_trace_apply.py (outdated, resolved, x2)
@vvchernov vvchernov force-pushed the vc/support_skylake_avx512 branch 8 times, most recently from 0000b49 to 1aa2093 on January 11, 2023 14:20
@vvchernov vvchernov force-pushed the vc/support_skylake_avx512 branch from cccd755 to a289d4b on January 12, 2023 06:59
@vvchernov vvchernov changed the title WIP: [TIR][TOPI][x86][CI] Support skylake avx512 [TIR][TOPI][x86][CI] Support skylake avx512 Jan 12, 2023
@vvchernov (Contributor, Author) commented:
Hello @areusch, @driazati, @junrushao! Could you take a look at this PR?

@junrushao (Member) commented:
Happy to take a look tomorrow :-)

@vvchernov (Contributor, Author) commented:
Hello @masahi! Could you take a look at this PR?

@masahi (Member) left a comment:

  • Please remove all diffs from unittests where the VNNI vs AVX512 difference doesn't matter. The diff in this PR is unnecessarily large.
  • Please verify that the VNNI tests are still functional after this change.

src/meta_schedule/schedule_rule/schedule_rule.cc (outdated, resolved)
src/meta_schedule/space_generator/space_generator.cc (outdated, resolved, x3)
@@ -28,6 +28,7 @@
from tvm.tir.schedule import BlockRV, Schedule
Member commented:

Let's rename it to test_meta_schedule_cpu_dot_product.py

@vvchernov (Contributor, Author) commented:

Done. But I'm not sure it is the clearest name, since CPU covers more than just Intel architectures. Nevertheless, there is no other similar test for it to be confused with.

@@ -41,6 +41,71 @@
from tvm.te import create_prim_func
Member commented:

Also remove the changes from this file.

@vvchernov (Contributor, Author) commented:

These are not big changes. I tried to unify tests using the same classes, but my attempt failed and I reverted it (in fact the code had been moved within the file); I've rolled back the transferred code. What remains now is a pylint fix and renaming for the sake of clarity (not only VNNI is checked).

@vvchernov (Contributor, Author) commented:

Hello @masahi! Some words about checking VNNI functionality after the change:

  1. Unfortunately I do not have a machine with VNNI to check it locally, therefore I rely on the CI tests for VNNI.
  2. This PR is devoted to avx512 support, which was implemented in parallel with the VNNI functionality; the changes touching VNNI are related to renaming or unifying common test code.
  3. I plan to open a new PR with fixes for some small VNNI issues observed during this work, if the fixes turn out to be correct.

@masahi masahi merged commit 3281226 into apache:main Jan 17, 2023
@vvchernov vvchernov deleted the vc/support_skylake_avx512 branch January 18, 2023 04:55
fzi-peccia pushed a commit to fzi-peccia/tvm that referenced this pull request Mar 27, 2023
* add skylake-avx512 tests

* extend tests by skylake-avx512

* lint fixes

* fix misprinting

* misprinting fix

* TODOs for further development

* add temporarily commented tests for skylake-avx512 due to not-yet-implemented schedules and postprocs for it. add TODOs for further check and development

* update int8-acc32 test for vnni and avx512 w/o it

* pylint fix

* once more pylint fix

* fix Feature init for skylake

* fix test

* fix intrin names for assert for skylake

* small fix

* return back fast int8 intrinsic tests

* test connect of dense and batch_matmul to avx512 tensorization

* extend dense_alter_layout on avx512 (currently) instead of VNNI. some renaming vnni to int8 for the sake of clarity

* more renaming vnni to int8 for dense schedule, compute, strategy for the sake of clarity

* update for batch_matmul with avx512

* extend space generator init for avx512. Add Default AVX512 schedule rules

* avx512 dot 16x4 intrin was implemented for MS default schedule rule

* small fix

* update

* pylint fixes

* test workaround for const alloc in tir

* test fix (broadcasting)

* remove excess instructions from dot_product_16x4_u8i8i32_avx512

* pylint fix

* skip asm check for askew weight shapes

* fix pylint

* revert test fix

* set number of args

* test fix

* fix const allocation in tir for avx512 dot 16x4

* fix signature of dot_product_16x4_u8i8i32_avx512

* use script instead of tvm.tir for const allocation

* extend auto tensorize test by skylake-avx512 target

* clean code

* update test_op_level1, resolve TODO

* small update test_op_level2

* update test_op_level10, resolve TODO

* update qnn legalize pass test, resolve TODOs

* pylint fixes

* update ms test for avx512

* update more ms test for avx512

* try to fix i386 CI tests

* fix intrin name for check

* skip test due to model downloading issue

* fix test failure

* use ORT for conv2d check

* lint fix after rebasing

* comment ORT part of test

* extend tests tir schedule analysis and transform for avx512. unify test classes

* extend test tir schedule tensorize for avx512

* extend test meta schedule vnni integration for avx512

* rename test file

* pylint fix

* tag fix

* update test meta schedule trace apply with avx512

* rollback test class unifying in utils

* pylint fixes

* separate TIRs for scheduled conv2d for vnni and avx512

* fix registering issue in test

* update conv+bias onnx model for intermediate test

* fix int16 overflow

* fix int16 overflow for dense test

* update input data for test of dense

* small rollback

* fix misprinting

* fix

* restart CI

* DefaultVNNI was renamed to DefaultLLVM for mutator

* rename test file for the sake of clarity

* DefaultVNNI was renamed to DefaultCPUTensorization for postproc

* remove resolved TODO

* DefaultVNNI and AVX512 for ScheduleRule were unified

* replace code to upstream with initial version

* fix arg type

* lint fix

* small fix

* lint fix

* fix misprinting

* rollback trace apply test for avx512 (reviewer remark)

* fix pylint

Co-authored-by: Valery Chernov <[email protected]>