
[TIR][TOPI][x86][CI] Support skylake avx512 #13621

Merged: 84 commits into apache:main, Jan 17, 2023

Conversation

@vvchernov (Contributor) commented Dec 15, 2022:

It looks like, although some AVX512 intrinsics are supported (see topi/x86 and tir), they are not used during plain compilation or when tuning a model with the meta-scheduler on the skylake-avx512 target.
The aim is end-to-end support of the Skylake-X architecture on the TVM side for dense, batch_matmul and conv ops.

Details

  1. CI tests were added.
  2. The code was extended to compile dense, batch_matmul and conv2d with AVX512 instructions without VNNI (see the sketch below); some fixes were made for the latter.
  3. The code was extended to tune dense, batch_matmul and conv2d with the meta-scheduler using AVX512 instructions without VNNI.
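
For context, a minimal sketch of the compilation flow this PR targets; `mod`, `params` and the `my_quantized_model` helper are placeholders standing in for any quantized Relay model, not part of this PR:

    import tvm
    from tvm import relay

    # Placeholder inputs: any Relay module with int8 dense/batch_matmul/conv2d ops.
    mod, params = my_quantized_model()  # hypothetical helper

    # Target Skylake-X; -mcpu=skylake-avx512 enables AVX512 without VNNI.
    target = tvm.target.Target("llvm -mcpu=skylake-avx512")
    with tvm.transform.PassContext(opt_level=3):
        lib = relay.build(mod, target=target, params=params)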

@tvm-bot (Collaborator) commented Dec 15, 2022:

Thanks for contributing to TVM! Please refer to the contributing guidelines https://tvm.apache.org/docs/contribute/ for useful information and tips. Please request code reviews from Reviewers by @-ing them in a comment.

Generated by tvm-bot

@vvchernov vvchernov changed the title WIP: [TIR][TOPI][CI] Support skylake avx512 WIP: [TIR][TOPI][x86][CI] Support skylake avx512 Dec 15, 2022
@vvchernov vvchernov force-pushed the vc/support_skylake_avx512 branch 2 times, most recently from 5601b8d to 3a8fbf3 on December 23, 2022 08:59
@cbalint13 (Contributor) left a comment:

Thank you @vvchernov for this excellent addition on x86 coverage!

  • I've suggested a few nits in the code, mostly cosmetic.
  • We could add (in a next PR?) the 4x4 (ssse3/m128) and 8x4 (avx2/m256) counterparts of the skylake one (a compact form follows after this block):

    if int32_lanes == 4:
        int_lx32 = "int16x8"
        int_8xl = "int8x16"
        int_32xl = "int32x4"
        pmaddubs = "llvm.x86.ssse3.pmadd.ub.sw.128"
        pmaddw = "llvm.x86.sse2.pmadd.wd"
    elif int32_lanes == 8:
        int_lx32 = "int16x16"
        int_8xl = "int8x32"
        int_32xl = "int32x8"
        pmaddubs = "llvm.x86.avx2.pmadd.ub.sw"
        pmaddw = "llvm.x86.avx2.pmadd.wd"
    elif int32_lanes == 16:
        int_lx32 = "int16x32"
        int_8xl = "int8x64"
        int_32xl = "int32x16"
        pmaddubs = "llvm.x86.avx512.pmaddubs.w.512"
        pmaddw = "llvm.x86.avx512.pmaddw.d.512"

python/tvm/tir/tensor_intrin/x86.py (outdated, resolved, x2)
Comment on lines 630 to 638
and target_has_vnni(mcpu)
and target_has_avx512(mcpu)
and inputs[0].dtype == "uint8"
and inputs[1].dtype == "int8"
and inputs[1].shape[-2] % 16 == 0
and inputs[1].shape[-1] % 4 == 0
):
strategy.add_implementation(
wrap_compute_batch_matmul(topi.x86.batch_matmul_vnni_compute, need_out_dtype=True),
wrap_topi_schedule(topi.x86.schedule_batch_matmul_vnni),
name="batch_matmul_vnni.x86",
@cbalint13 (Contributor) commented Dec 24, 2022:

  • Don't remove the vnni one completely.
  • Instead, maybe a descending strategy would be better (a sketch follows below):
    - if has_vnni       // llvm -mcpu=cascadelake
    - elif has_avx512   // llvm -mcpu=skylake-avx512
    - elif has_avx2 (in a future PR)   // llvm -mcpu=haswell
    - elif has_ssse3 (in a future PR)  // llvm -mcpu=sandybridge
  • See strategy/x86.py, and the descending plevel, in the upcoming PR#13642
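
A sketch of what that descending dispatch could look like for dense; the plevel values are illustrative, and dense_int8/schedule_dense_int8 follow this PR's renaming (assumptions, not final API):

    from tvm import topi
    from tvm.relay.op.strategy.generic import wrap_compute_dense, wrap_topi_schedule
    from tvm.topi.x86.utils import target_has_avx512, target_has_vnni

    def add_int8_dense_implementations(strategy, mcpu):
        # Prefer VNNI when available, fall back to plain AVX512 otherwise;
        # a higher plevel wins when several implementations are registered.
        if target_has_vnni(mcpu):  # llvm -mcpu=cascadelake
            strategy.add_implementation(
                wrap_compute_dense(topi.x86.dense_int8),
                wrap_topi_schedule(topi.x86.schedule_dense_int8),
                name="dense_vnni.x86",
                plevel=13,
            )
        elif target_has_avx512(mcpu):  # llvm -mcpu=skylake-avx512
            strategy.add_implementation(
                wrap_compute_dense(topi.x86.dense_int8),
                wrap_topi_schedule(topi.x86.schedule_dense_int8),
                name="dense_int8.x86",
                plevel=12,
            )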

@vvchernov (Contributor, Author) commented:

Hello @cbalint13! Thank you for your nits and remarks! In this case VNNI was not removed but extended; as you know, VNNI is part of the AVX512 family. The fork is here:
https://github.com/apache/tvm/blob/main/python/tvm/topi/x86/tensor_intrin.py#:~:text=def%20dot_16x1x16_uint8_int8_int32()%3A,return%20dot_16x1x16_uint8_int8_int32_skylake()
As you correctly remarked, avx2 and ssse3 are also handled there, but they are not accessible due to the high-level target_has_avx512 check. Possibly your suggestion is a good way to resolve this further. For now I have only extended the existing approach to avx512.

@cbalint13 (Contributor) commented Dec 26, 2022:

> …know VNNI is part of the AVX512 family. The fork is here:
> https://github.com/apache/tvm/blob/main/python/tvm/topi/x86/tensor_intrin.py#:~:text=def%20dot_16x1x16_uint8_int8_int32()%3A,return%20dot_16x1x16_uint8_int8_int32_skylake()
> As you correctly remarked, avx2 and ssse3 are also handled there, but they are not accessible due to the high-level target_has_avx512 check.

I can't see the original llvm.x86.vnni intrinsic in the above.
This switch, right in the provided tensor_intrin.py fork:

    if target_has_vnni(mcpu):
        # VNNI capable platform
        return dot_16x1x16_uint8_int8_int32_cascadelake()
    # vpmaddubsw/vpmaddwd fallback
    return dot_16x1x16_uint8_int8_int32_skylake()

As more +SIMD keeps coming, wouldn't it be better to stay with the upcoming strategy/x86.py if/elif + preference plevel?

  • tensor_intrin.py would then remain only enums of SIMD (no decisions); the triage would stay in strategy, etc.
  • The user may control the fall into a strategy by narrowing "llvm -mattr=+avx512bw,+avxvnni,...", as LLVM flags (examples below).
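
For illustration, the ways a user could steer this via target strings; the exact feature list in the last one is an assumption about a particular CPU, not a requirement:

    import tvm

    tgt_vnni   = tvm.target.Target("llvm -mcpu=cascadelake")     # has VNNI
    tgt_avx512 = tvm.target.Target("llvm -mcpu=skylake-avx512")  # AVX512, no VNNI
    tgt_custom = tvm.target.Target("llvm -mattr=+avx512f,+avx512bw,+avx512vnni")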

@vvchernov (Contributor, Author) commented:

Hello @cbalint13! Your view looks reasonable, and there is no problem reimplementing it from my side. But I did not implement the method dot_16x1x16_uint8_int8_int32 with conditions on the tensor_intrin.py side, and I thought of it as a brick for building some concept. @elvin-n and @jwfromm, what do you think about Balint's view?

@cbalint13 (Contributor) commented Dec 26, 2022:

> Hello @cbalint13! Your view looks reasonable, and there is no problem reimplementing it from my side. But I did not implement the method dot_16x1x16_uint8_int8_int32 with conditions on the tensor_intrin.py side, and I thought of it as a brick for building some concept. @elvin-n and @jwfromm, what do you think about Balint's view?

@vvchernov,

Thanks for the clarifications, I see your point; it is a perfectly fine way too.
I think that by making all CI tests pass in green, more reviewers will come.

Let me try to sum up this very pinned PR change on strategy/x86.py, visible at the top of this thread:

-from tvm.topi.x86.utils import target_has_vnni
+from tvm.topi.x86.utils import target_has_avx512

-        and target_has_vnni(mcpu)
+        and target_has_avx512(mcpu)

-        wrap_compute_dense(topi.x86.dense_vnni),
-        wrap_topi_schedule(topi.x86.schedule_dense_vnni),
+        wrap_compute_dense(topi.x86.dense_int8),
+        wrap_topi_schedule(topi.x86.schedule_dense_int8),
  • This merges vnni into avx512 (under the new dense_int8 umbrella), arguing that VNNI is a subset of the AVX512 group.
  • VNNI is a subset of the AVX512 group; however, there are CPUs that have AVX512 but no VNNI [1].

[1] https://en.wikipedia.org/wiki/AVX-512#VNNI

My view was to leave avx512 & vnni separate (as it was) in strategy/x86.py (not to merge vnni -> avx512).
My argument was to triage any SIMD right in strategy/x86.py, as the upcoming AMX does here, plus plevel control.
I saw VNNI and AVX512 (+AVX2, SSSE3) as potentially independent things, moreover choosable via "llvm -mattr=...".

Contributor commented:

The instruction set for SSE/AVX2/AVX512 for int8 is essentially the same; the only difference is the number of lanes. Additionally, the pattern in which these int8 instructions (VPMADDUBSW/VPMADDWD/VPADDD) are used is the same as for the single VNNI instruction (VPDPBUSD). I.e., it is reasonable to have only one TVM intrinsic, to remove VNNI from the function name, and to extend this intrinsic to SSE and AVX2, which is not yet done in this PR.

Contributor commented:

> The instruction set for SSE/AVX2/AVX512 for int8 is essentially the same; the only difference is the number of lanes.

Yes, the same class doing integer dot products on registers, but note:

  • different clocking, timing & implementation on ASIC
  • (auto)tensorization opportunities differ, as inner loops match differently

> Additionally, the pattern in which these int8 instructions (VPMADDUBSW/VPMADDWD/VPADDD) are used is the same as for the single VNNI instruction (VPDPBUSD).

Right.

  • The VNNI instruction accumulates into int32 lanes in a single step: vpdpbusd
  • AVX512 (incl. the AVX2 and SSSE3 variants) does the same in two steps, e.g.: pmaddubs + pmaddw

> I.e., it is reasonable to have only one TVM intrinsic, to remove VNNI from the function name, and to extend this intrinsic to SSE and AVX2, which is not yet done in this PR.

  • Indeed, the proposed intrinsic merger is perfectly fine.
  • It was still possible to question it with reasonable arguments.
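
For intuition, a rough NumPy model of the two patterns (a sketch only; inputs are flat arrays whose length is a multiple of 4, grouped 4 bytes per int32 lane):

    import numpy as np

    def vnni_one_step(acc_i32, data_u8, kern_i8):
        # vpdpbusd: each int32 lane accumulates the dot of 4 u8*i8 pairs in one step.
        prod = data_u8.astype(np.int32) * kern_i8.astype(np.int32)
        return acc_i32 + prod.reshape(-1, 4).sum(axis=1)

    def avx512_two_step(acc_i32, data_u8, kern_i8):
        # vpmaddubsw: u8*i8 products, adjacent pairs summed with signed int16 saturation.
        prod = data_u8.astype(np.int32) * kern_i8.astype(np.int32)
        pairs = np.clip(prod.reshape(-1, 2).sum(axis=1), -32768, 32767)
        # vpmaddwd (against a vector of ones): widen and sum adjacent int16 pairs to int32.
        lanes = pairs.reshape(-1, 2).sum(axis=1)
        # vpaddd: accumulate into the int32 lanes.
        return acc_i32 + lanes

Apart from the saturation in the two-step path, both yield the same int32 lanes, which is why a single TVM intrinsic interface can cover both.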

Contributor commented:

> different clocking, timing & implementation on ASIC

What kind of ASIC do you mean?

> (auto)tensorization opportunities differ, as inner loops match differently

By "tensorization opportunities differ", do you mean a different number of lanes for different instruction sets, which can be reflected in potentially different blocking sizes? Or something else?

@cbalint13 (Contributor) commented Dec 27, 2022:

> different clocking, timing & implementation on ASIC
> What kind of ASIC do you mean?

  • CPUs, the x86 family: different generations with varying extended ISA layouts: amx, avx512, vnni, avx2, ssse3, sse2

> (auto)tensorization opportunities differ, as inner loops match differently
> By "tensorization opportunities differ", do you mean a different number of lanes for different instruction sets, which can be reflected in potentially different blocking sizes?

  • Yes, both input widths and output lanes yield different outcomes, with varying performance.
  • E.g. the autotensorizer will opportunistically search to permute & match inner loops to these varying sizes.

> Or something else?

  • TVM is a compiler, after all; to my knowledge the only one capable of auto-tensorization with an arbitrary intrinsic.

@elvin-n (Contributor) commented Dec 28, 2022:

I would consider amx vs. vnni/avx512/avx2/ssse3 (btw, there is no sse2 for int8; the required instructions appeared, if I am not mistaken, in ssse3), because the first is matrix multiplication while the others are vector instructions. For now I propose to go from local to generic, and when we see a need to differentiate the vector sets, we will do so. For now the pattern looks similar for all vector instructions; the blocking aspect should be added separately if it is not done yet. The lanes aspect of the TVM intrinsic should be covered in this PR.

> match inner loops to these varying sizes.

The inner loop is the same for all these instructions. It will be

    for (int k = 0; k < 4; k++) {
        output[i] += data[k] * kernel[i][k];
    }

> TVM is a compiler, after all; to my knowledge the only one capable of auto-tensorization with an arbitrary intrinsic.

I agree; at the same time I propose to move from local to generic patterns. We do not limit anything for now.

@vvchernov (Contributor, Author) commented:
> Thank you @vvchernov for this excellent addition on x86 coverage!
> • I've suggested a few nits in the code, mostly cosmetic.
> • We could add (in a next PR?) the 4x4 (ssse3/m128) and 8x4 (avx2/m256) counterparts (see the int32_lanes table above).
Thanks! I agree, and I have also thought about how it can be done in a reasonable way. And yes, I think it should be done in a separate PR.

@cbalint13 (Contributor) left a comment:

@vvchernov,
It is important to make all the tests for this PR pass in green.
I'm unsure about the actual failures; maybe the CI lacks the related avx512 ISA?

tests/python/integration/test_auto_tensorize.py (outdated, resolved)
tests/python/unittest/test_meta_schedule_trace_apply.py (outdated, resolved, x2)
@vvchernov vvchernov force-pushed the vc/support_skylake_avx512 branch 8 times, most recently from 0000b49 to 1aa2093 on January 11, 2023 14:20
@vvchernov vvchernov force-pushed the vc/support_skylake_avx512 branch from cccd755 to a289d4b on January 12, 2023 06:59
@vvchernov vvchernov changed the title WIP: [TIR][TOPI][x86][CI] Support skylake avx512 [TIR][TOPI][x86][CI] Support skylake avx512 Jan 12, 2023
@vvchernov (Contributor, Author) commented:
Hello @areusch, @driazati, @junrushao! Could you take a look at this PR?

@junrushao (Member) commented:
Happy to take a look tomorrow :-)

@vvchernov (Contributor, Author) commented:
Hello @masahi! Could you take a look at this PR?

@masahi (Member) left a comment:

  • Please remove all diffs from unittests where the VNNI vs AVX512 difference doesn't matter. The diff in this PR is unnecessarily large.
  • Please verify that the VNNI tests are still functional after this change.

src/meta_schedule/schedule_rule/schedule_rule.cc (outdated, resolved)
src/meta_schedule/space_generator/space_generator.cc (outdated, resolved, x3)
@@ -28,6 +28,7 @@
from tvm.tir.schedule import BlockRV, Schedule
Member commented:

Let's rename it to test_meta_schedule_cpu_dot_product.py

@vvchernov (Contributor, Author) commented:

Done. But I'm not sure it is the clearest name, since CPU covers more than just Intel architectures. Nevertheless, there is no other similar test for it to be confused with.

@@ -41,6 +41,71 @@
from tvm.te import create_prim_func
Member commented:

Also remove the changes from this file.

@vvchernov (Contributor, Author) commented:

These are not big changes. I tried to unify tests using the same classes, but my attempt failed and I reverted it (in fact the code had been moved within the file); I've rolled back the transferred code. What remains now is a pylint fix and renaming for the sake of clarity (not only VNNI is checked).

@vvchernov (Contributor, Author) commented:

Hello @masahi! Some words about checking VNNI functionality after the change:

  1. Unfortunately I do not have a machine with VNNI to check it locally, therefore I rely on the CI tests for VNNI.
  2. This PR is devoted to avx512 support, which was implemented in parallel with the VNNI functionality; the changes touching VNNI are related to renaming or unifying common test code.
  3. I plan to open a new PR with fixes for some small VNNI issues observed during this work, if the fixes turn out to be correct.

@masahi masahi merged commit 3281226 into apache:main Jan 17, 2023
@vvchernov vvchernov deleted the vc/support_skylake_avx512 branch January 18, 2023 04:55
fzi-peccia pushed a commit to fzi-peccia/tvm that referenced this pull request Mar 27, 2023
* add skylake-avx512 tests

* extend tests by skylake-avx512

* lint fixes

* fix misprinting

* misprinting fix

* TODOs for further development

* add temporarily commented tests for skylake-avx512 due to not-yet-implemented schedules and postprocs for it. add TODOs for further check and development

* update int8-acc32 test for vnni and avx512 w/o it

* pylint fix

* once more pylint fix

* fix Feature init for skylake

* fix test

* fix intrin names for assert for skylake

* small fix

* return back fast int8 intrinsic tests

* test connect of dense and batch_matmul to avx512 tensorization

* extend dense_alter_layout on avx512 (currently) instead of VNNI. some renaming vnni to int8 for the sake of clarity

* more renaming vnni to int8 for dense schedule, compute, strategy for the sake of clarity

* update for batch_matmul with avx512

* extend space generator init for avx512. Add Default AVX512 schedule rules

* avx512 dot 16x4 intrin was implemented for MS default schedule rule

* small fix

* update

* pylint fixes

* test workaround for const alloc in tir

* test fix (broadcasting)

* remove excess instructions from dot_product_16x4_u8i8i32_avx512

* pylint fix

* skip asm check for askew weight shapes

* fix pylint

* revert test fix

* set number of args

* test fix

* fix const allocation in tir for avx512 dot 16x4

* fix signature of dot_product_16x4_u8i8i32_avx512

* use script instead of tvm.tir for const allocation

* extend auto tensorize test by skylake-avx512 target

* clean code

* update test_op_level1, resolve TODO

* small update test_op_level2

* update test_op_level10, resolve TODO

* update qnn legalize pass test, resolve TODOs

* pylint fixes

* update ms test for avx512

* update more ms test for avx512

* try to fix i386 CI tests

* fix intrin name for check

* skip test due to model downloading issue

* fix test failure

* use ORT for conv2d check

* lint fix after rebasing

* comment ORT part of test

* extend tests tir schedule analysis and transform for avx512. unify test classes

* extend test tir schedule tensorize for avx512

* extend test meta schedule vnni integration for avx512

* rename test file

* pylint fix

* tag fix

* update test meta schedule trace apply with avx512

* rollback test class unifying in utils

* pylint fixes

* separate TIRs for scheduled conv2d for vnni and avx512

* fix registering issue in test

* update conv+bias onnx model for intermediate test

* fix int16 overflow

* fix int16 overflow for dense test

* update input data for test of dense

* small rollback

* fix misprinting

* fix

* restart CI

* DefaultVNNI was renamed to DefaultLLVM for mutator

* rename test file for the sake of clarity

* DefaultVNNI was renamed to DefaultCPUTensorization for postproc

* remove resolved TODO

* DefaultVNNI and AVX512 for ScheduleRule were unified

* replace code to upstream with initial version

* fix arg type

* lint fix

* small fix

* lint fix

* fix misprinting

* rollback trace apply test for avx512 (reviewer remark)

* fix pylint

Co-authored-by: Valery Chernov <[email protected]>