
[SYCL][CUDA][Matrix] Initial Tensorcore matrix ext impl #4696

Merged
7 commits merged into intel:sycl on Nov 8, 2021

Conversation

JackAKirk
Contributor

@JackAKirk JackAKirk commented Oct 4, 2021

Initial implementation of the new matrix extension supporting Nvidia Tensor Cores (#4695), adapted from the AMX matrix extension.
Only double data type matrix elements are initially supported.

Signed-off-by: jack.kirk [email protected]

Only double data type matrix elements are initially supported.
Adaptation of the AMX matrix extension to also support Nvidia tensorcore hardware.

Signed-off-by: jack.kirk <[email protected]>
@pvchupin pvchupin requested a review from dkhaldi October 4, 2021 18:55
Contributor

@dkhaldi dkhaldi left a comment


Thank you for doing this work.
This is nice work that shows how the joint_matrix interface can apply to other TPUs like the Nvidia one. I posted some comments, mostly about the interface.

sycl/include/sycl/ext/oneapi/matrix/matrix-tensorcore.hpp (outdated review thread, resolved)
namespace intel {
namespace experimental::matrix {

enum class matrix_type { a, b, accumulator };
Contributor


While we don't have this right now, it will be needed for future support, so this is a good addition to the interface. I would suggest changing the name to matrix_use rather than matrix_type, though.
Also, please take a look at the query interface in static-query.hpp. It provides a nice way to bypass all these extra arguments, including the sizes (see the example in https://github.com/intel/llvm/blob/sycl/sycl/test/matrix/query.cpp).
You just need to say:
using myparams = tpu_params<tpu::nvidia, int8_t, int8_t, int>;
Then the matrices can be created as follows:
myparams::joint_matrix_a<sub_group> sub_a(sg);
myparams::joint_matrix_b<sub_group> sub_b(sg);
myparams::joint_matrix_c<sub_group> sub_c(sg);

As you can see, the sizes are constructed underneath. The matrix_use is specified in the type alias.

Contributor Author

@JackAKirk JackAKirk Oct 5, 2021


Thanks. I think the query that omits matrix size parameters will be useful for the Tensor Core case in two ways. First, as a user query it can report which matrix sizes are available for a given matrix_type (using the definition of matrix_type in static-query.hpp, which basically corresponds to the matrix::precision described in the Tensor Core matrix proposal). Second, it can potentially reduce the number of parameters needed in the group functions in cases where a single matrix_type corresponds to a single matrix size, although there are only a few such cases for CUDA: in the majority of cases all template parameters are needed to uniquely specify the correct joint_matrix (see https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#warp-level-matrix-shape). Since the Tensor Core case does not support a continuous range of integers for the matrix sizes, variables such as max_msize, max_nsize, and max_ksize won't be appropriate for CUDA. We could, however, make an alternative implementation for CUDA that reports to the user the set of available matrix sizes (most commonly two or three per matrix_type) for each matrix_type.

Contributor


> Since the tensorcore case does not support a continuous range of integers for the matrix sizes, the variables such as max_msize, max_nsize, max_ksize won't be appropriate for the cuda case

max_msize/nsize/ksize are only appropriate for AMX.
The DPAS GPU implementation also supports a discrete set of values; those are the msize/nsize/ksize members of the 'combination' type. Please refer to https://github.com/intel/llvm/blob/sycl/sycl/include/sycl/ext/oneapi/matrix/static-query.hpp#L297 for how we filled out the combinations for DPAS.

Contributor Author


I see, thanks.

@@ -25,3 +25,6 @@
#include <sycl/ext/oneapi/matrix/matrix-jit.hpp>
#include <sycl/ext/oneapi/matrix/static-query.hpp>
#endif
#if (SYCL_EXT_ONEAPI_MATRIX == 3)
Contributor


This implementation can also benefit from the static query we have. Besides giving the user information about what the implementation supports, the query can also construct the matrices and make the sizes optional for the user.

We should probably add this to matrix-jit.hpp and fork to the AOT Tensor Core implementation based on some option (AOT for Tensor Cores).
I am asking this because we should have one place that holds the interface, to make maintaining the code easy. Also, since this interface is experimental, we expect it to change (like the use argument you introduce), so keeping the interface in one place means we only have to modify it in one place.

Contributor Author


Do you think there should be a single header for all of the definitions of joint_matrix, joint_matrix_load, joint_matrix_store, and joint_matrix_mad, with the backend-dependent specializations of these functions in separate files?

Contributor


Yes, if you can reuse the things that are the same as in matrix/matrix-jit.hpp, like matrix_layout, rather than redefining them, that would be better.
For the things that differ because of the "use" argument, like the definition of the joint_matrix type and joint_matrix_load/store/mad, can you add the use-based definitions in matrix-jit.hpp (under the new test macro = 3)?

As you know, we are planning to add the new "use" argument for AMX and DPAS as well. Once we do that, there will be one definition of the joint_matrix type and of joint_matrix_load/store/mad.

If you make this change now, there will later be one place for us to change (removing the old joint_matrix/load/store/mad that do not have the "use" argument), and we won't need to touch the Tensor Core specific specializations, which will be in a different file.

Also, once this convergence happens, there will be no need for the feature test macro. Since this is an experimental interface, we don't need to keep track of "old" versions of it. We will remove AOT AMX (SYCL_EXT_ONEAPI_MATRIX=1) and keep only matrix-jit.hpp, which enables DPAS, AMX, and Tensor Cores.

Contributor Author

@JackAKirk JackAKirk Oct 19, 2021


matrix_layout has an identical definition to the one in matrix-jit.hpp.

> For the things that are different like the definition of joint_matrix type, joint_matrix_load/store/mad because of "use" argument, can you add the use-definitions in matrix-jit.hpp (under the new test macro = 3)
>
> As you know, we are planning on adding the new "use" argument for AMX and DPAS as well. Once we do that, there will be one definition of joint_matrix type/joint_matrix_load/store/mad.

I'm not sure what you are asking me to do here: if I add the definitions of joint_matrix_* used in matrix-tensorcore.hpp into matrix-jit.hpp, they will be a redeclaration of the Intel-specific functions already defined in matrix-jit.hpp that do not use the matrix_use template parameter.

Contributor Author


Hi @dkhaldi, we would like to get this merged. Could you clarify what you would like me to change? Thanks.

Contributor


Sorry for the late reply. I was thinking you could define these under the new test macro = 3 in the same file so they don't get redefined.
However, I think it will be best if we merge these as separate files. Once we add the use argument, we can reiterate on this and merge both files. What do you think?

Contributor Author


OK sure, I think that keeping them separate is a good idea for now.

sycl/test/matrix/matrix-cuda.cpp (outdated review thread, resolved)
sycl/include/sycl/ext/oneapi/matrix/matrix-tensorcore.hpp (outdated review thread, resolved)
Switched namespace intel with oneapi.
Moved Group template parameter to end of parameter list in joint_matrix.
Removed unnecessary template parameter Group from impl functions.

Signed-off-by: jack.kirk <[email protected]>
@bader bader requested a review from romanovvlad October 11, 2021 10:59
romanovvlad
romanovvlad previously approved these changes Oct 12, 2021
@@ -0,0 +1,99 @@
// RUN: %clangxx -fsycl-device-only -fsycl-targets=nvptx64-nvidia-cuda -Xsycl-target-backend --cuda-gpu-arch=sm_80 -DSYCL_EXT_ONEAPI_MATRIX=3 -S -Xclang -emit-llvm %s -o -| FileCheck %s

#include <CL/sycl.hpp>
Contributor


Rename this file to matrix-nvptx-double-test.cpp to match the naming of the other matrix tests.

Contributor Author


OK, I've renamed the test.

Contributor


Why is the test under test/check_device_code rather than test/matrix?

Contributor Author


This test is a device-code-only test. I thought we were meant to put runtime tests, like those in https://github.com/intel/llvm/tree/sycl/sycl/test/matrix, in https://github.com/intel/llvm-test-suite/tree/intel/SYCL/Matrix from now on?

Contributor


@JackAKirk I see now, thanks.
@romanovvlad, is check_device_code/ the right folder for matrix-nvptx-double-test.cpp or should it stay under matrix/?

Contributor


I think the test should be somewhere in check_device_code. We may introduce a matrix subdir if it makes sense so the final location is sycl/test/check_device_code/matrix/matrix-builtins-nvptx.cpp.

Contributor Author


Thanks. I've moved the test to sycl/test/check_device_code/matrix.

matrix_layout::packed -> packed_a, packed_b

Signed-off-by: jack.kirk <[email protected]>
dkhaldi
dkhaldi previously approved these changes Nov 2, 2021
Contributor

@dkhaldi dkhaldi left a comment


LGTM

@JackAKirk
Contributor Author

JackAKirk commented Nov 3, 2021

The test fails due to:

cannot find libdevice for sm_80

I ran this test on sm_61 with the 11.4 CUDA driver and it passed. Does your CI have an up to date CUDA driver that supports sm_80?

@JackAKirk
Contributor Author

> The test fails due to:
>
> cannot find libdevice for sm_80
>
> I ran this test on sm_61 with the 11.4 CUDA driver and it passed. Does your CI have an up to date CUDA driver that supports sm_80?

Sorry, I think this happened because I had missed // REQUIRES: gpu, cuda. It should be fine now.

@JackAKirk
Contributor Author

> > The test fails due to:
> > cannot find libdevice for sm_80
> > I ran this test on sm_61 with the 11.4 CUDA driver and it passed. Does your CI have an up to date CUDA driver that supports sm_80?
>
> Sorry I think this happened because I had missed // REQUIRES: gpu, cuda. Should be fine now.

@romanovvlad, can we run the tests again? Thanks.

@romanovvlad romanovvlad merged commit 711ba58 into intel:sycl Nov 8, 2021