Rebase xsmm to main #177

Draft: wants to merge 18 commits into main

Conversation

@Devjiu Devjiu commented Nov 13, 2024

Reference to take a look at the changes prepared by the XSMM team.

rolfmorel and others added 16 commits November 13, 2024 17:20
Adds a `USE_BLOCK_POINTER` flag to the matmul_kernel so we can get IR for pointers-to-tensors instead of tensors-of-pointers.
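
As a minimal sketch (not the PR's exact code), such a constexpr switch in a Triton matmul kernel could look roughly like this, assuming the usual tutorial parameter names (a_ptr, stride_am, stride_ak):

```python
import triton
import triton.language as tl

@triton.jit
def matmul_kernel_a_tile(a_ptr, M, K, stride_am, stride_ak,
                         BLOCK_SIZE_M: tl.constexpr, BLOCK_SIZE_K: tl.constexpr,
                         USE_BLOCK_POINTER: tl.constexpr):
    pid_m = tl.program_id(axis=0)
    if USE_BLOCK_POINTER:
        # One pointer-to-tensor; loads on it keep the block structure in the IR.
        a_block_ptr = tl.make_block_ptr(base=a_ptr, shape=(M, K),
                                        strides=(stride_am, stride_ak),
                                        offsets=(pid_m * BLOCK_SIZE_M, 0),
                                        block_shape=(BLOCK_SIZE_M, BLOCK_SIZE_K),
                                        order=(1, 0))
        a = tl.load(a_block_ptr, boundary_check=(0, 1))
    else:
        # Classic tensor-of-pointers built from explicit index arithmetic.
        offs_m = pid_m * BLOCK_SIZE_M + tl.arange(0, BLOCK_SIZE_M)
        offs_k = tl.arange(0, BLOCK_SIZE_K)
        a_ptrs = a_ptr + offs_m[:, None] * stride_am + offs_k[None, :] * stride_ak
        a = tl.load(a_ptrs, mask=offs_m[:, None] < M, other=0.0)
    # ... rest of the tutorial kernel (B tile, accumulation, store) omitted.
```
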
Implements a lowering pass from vector ops to XSMM microkernels.
libxsmm is added as an external dependency together with general MLIR infrastructure for handling XSMM code generation and runtime execution.
The XSMM lowering is optional and can be enabled at the JIT step by the environment variable TRITON_CPU_XSMM=1.

libxsmm is built as a shared library and linked with XSMM-related libraries. These are also added to the Python infrastructure.
Additionally, general MLIR utilities are imported to allow analysis, code generation and microkernel execution.
Initially, a simple pattern mapping vector contraction to an XSMM kernel is added.
…riton-lang#5)

Contraction lowering now moves the accumulation buffer outside of the reduction loop when possible.

This reduces the data movement between memory and registers needed to accommodate the mixed memref and vector abstractions.
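
The hoisting itself happens at the MLIR level; as a rough Python/NumPy analogy only (not the actual pass), the effect on the reduction loop can be pictured like this:

```python
import numpy as np

def contraction_before(a_blocks, b_blocks, c_buf):
    for a, b in zip(a_blocks, b_blocks):
        acc = c_buf.copy()      # accumulator reloaded from memory every iteration
        acc += a @ b
        c_buf[...] = acc        # and stored back every iteration

def contraction_after(a_blocks, b_blocks, c_buf):
    acc = c_buf.copy()          # hoisted: loaded once before the loop
    for a, b in zip(a_blocks, b_blocks):
        acc += a @ b            # accumulate without touching memory
    c_buf[...] = acc            # stored once after the loop
```
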
Adds a lowering pass from Triton ops to XSMM microkernels.
XSMM utility APIs are generalized to work on opaque operations
representing contractions.

A simple pattern mapping tt.dot to an XSMM kernel is added.
The runtime lowering to XSMM is now controlled by two separate flags:
- TRITON_CPU_VECTOR_XSMM=1 to lower from vector ops as before
- TRITON_CPU_TRITON_XSMM=1 to lower from Triton ops
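
A minimal usage sketch, assuming the flags are read from the environment when the kernel is JIT-compiled as described above:

```python
import os

# Lower from vector ops (previous TRITON_CPU_XSMM behavior)...
os.environ["TRITON_CPU_VECTOR_XSMM"] = "1"

# ...or lower directly from Triton ops instead:
# os.environ["TRITON_CPU_TRITON_XSMM"] = "1"

# Define and launch the kernel afterwards so the JIT step sees the flag.
```
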
…on (triton-lang#7)

* Lift the -triton-raise-block-pointer pass from intel-xpu-backend-for-triton

The code was in turn taken from triton-shared (though it does not use the tts
dialect).
Ports the accumulation-buffer hoisting from the Vector-to-XSMM pass to the Triton lowering.
Dot lowering now moves the accumulation buffer outside of the reduction loop
when possible.
Updates the libxsmm version.
Brings support for the VNNI SW pipeline.
Extends XSMM code generation to allow mixed-precision computations, matching Triton's requirement for the <bf16 x bf16 -> f32> contraction. Data type selection is added as a global variable to the matmul tutorial.

BF16 can suffer from some inaccuracies compared to the PyTorch baseline. However, the difference appears to be the same between native triton-cpu and the XSMM lowering - no mismatch on SPR.
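
A minimal sketch of the data-type selection idea; the global's name (DTYPE) and the shapes are assumptions, not necessarily what the tutorial uses:

```python
import torch

DTYPE = torch.bfloat16  # switch to torch.float32 for the f32 path

a = torch.randn((512, 512), dtype=torch.float32).to(DTYPE)
b = torch.randn((512, 512), dtype=torch.float32).to(DTYPE)

# The <bf16 x bf16 -> f32> contraction keeps an fp32 accumulator, so the
# kernel output and the PyTorch baseline are compared in fp32.
c_ref = torch.matmul(a.to(torch.float32), b.to(torch.float32))
```
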
The matmul tutorial is aligned more closely with the main branch.
V2 backend benchmarking is disabled due to its instability.
Default tile sizes are increased to improve general performance.
Adds two new optional flags to the matmul tutorial:
- K dim padding - pads input matrices to a multiple
  of the chosen BLOCK_SIZE_K
- dynamic K blocking - overrides the set BLOCK_SIZE_K
  and adjusts it based on the input K dimension;
  the input is padded if needed

The main motivation is to allow testing with larger reduction-dimension
blocks without the kernel losing support for various sizes.
Padding is required to meet Triton's requirement for power-of-2 sizes.
Dynamic blocking can be used to shrink the reduction-dimension range or
eliminate the reduction loop entirely.

Allowing the kernel to work on larger K blocks is also helpful for the
future rewriting of GEMM into BRGEMM, as it ensures a larger batch dimension.
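
A minimal sketch of what the two flags do on the host side, assuming torch inputs; the helper and flag names here are placeholders rather than the tutorial's actual ones:

```python
import torch
import triton

def prepare_k_dim(a, b, block_size_k, pad_k=True, dynamic_k_block=False):
    K = a.shape[1]
    if dynamic_k_block:
        # Override BLOCK_SIZE_K based on the actual K dimension
        # (kept power-of-two, as Triton requires).
        block_size_k = triton.next_power_of_2(K)
    if pad_k:
        # Pad K up to a multiple of the chosen block size.
        K_pad = triton.cdiv(K, block_size_k) * block_size_k
        a = torch.nn.functional.pad(a, (0, K_pad - K))        # pad last dim of A (M, K)
        b = torch.nn.functional.pad(b, (0, 0, 0, K_pad - K))  # pad first dim of B (K, N)
    return a, b, block_size_k
```
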
Adds extra optional padding that can be used to ensure that the input
matrices' strides are not powers of two, which improves cache behavior.

Currently, it is most useful with DYNAMIC_K_BLOCK enabled.
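
A rough sketch of the idea, assuming row-major torch tensors: the row length (and hence the leading-dimension stride) is padded so it is not a power of two, which helps avoid cache-set aliasing between rows. The helper name is a placeholder:

```python
import torch

def pad_leading_stride(x, extra=16):
    rows, cols = x.shape
    padded_cols = cols + extra
    if (padded_cols & (padded_cols - 1)) == 0:  # still a power of two? pad a bit more
        padded_cols += extra
    out = torch.zeros((rows, padded_cols), dtype=x.dtype)
    out[:, :cols] = x
    # Use out[:, :cols] as the operand: same values, but its row stride
    # (padded_cols) is no longer a power of two.
    return out[:, :cols]
```
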
Extends contraction lowering to XSMM by rewriting a plain GEMM into
a BRGEMM kernel when possible.

The rewrite improves performance for larger K block sizes thanks to the
extra reduction-dim tiling. Use of the BRGEMM kernel also enables online
VNNI packing for BF16.
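
Numerically, the rewrite computes the same result as the plain GEMM; the sketch below (placeholder names, plain torch) shows the batch-reduce structure a BRGEMM kernel exploits, with the K dimension split into a batch of smaller blocks:

```python
import torch

def brgemm_reference(a, b, k_block):
    M, K = a.shape
    _, N = b.shape
    acc = torch.zeros((M, N), dtype=torch.float32)
    # Each (A_i, B_i) pair is one "batch" element; BRGEMM sums all partial
    # products into a single accumulator.
    for k0 in range(0, K, k_block):
        a_i = a[:, k0:k0 + k_block].to(torch.float32)
        b_i = b[k0:k0 + k_block, :].to(torch.float32)
        acc += a_i @ b_i
    return acc  # equal (up to rounding) to a.float() @ b.float()
```
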
Adds an optional flag to move matmul input preprocessing
outside of the benchmarked kernel.
This option allows excluding preprocessing overhead from
performance measurements.
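
A minimal sketch of the idea, assuming triton.testing.do_bench is used for timing; matmul and pad_inputs below are placeholders for the tutorial's kernel launch and preprocessing step:

```python
from triton.testing import do_bench

def bench_matmul(a, b, preprocess_outside=True):
    if preprocess_outside:
        a_p, b_p = pad_inputs(a, b)                 # padding cost is NOT measured
        return do_bench(lambda: matmul(a_p, b_p))
    # Otherwise the preprocessing runs inside every timed iteration.
    return do_bench(lambda: matmul(*pad_inputs(a, b)))
```
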
Adds a python wrapper for a parallelized in-place copy function using libxsmm and OpenMP.
It is intended to be used for efficient tensor padding implementation.

The libxsmm paths have to be specified through env variables:
  - XSMM_ROOT_DIR - path to libxsmm root dir with headers
  - XSMM_LIB_DIR - path to libxsmm.so location

The libxsmm shared library also has to be available at runtime, e.g., exposed through LD_LIBRARY_PATH.
The XSMM python module can be built and installed using the command:
  pip install -e ./third_party/cpu/python/
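
A hypothetical usage sketch; the module and function names below are placeholders, not the PR's actual API:

```python
# Assumes the module was built with XSMM_ROOT_DIR / XSMM_LIB_DIR set and
# installed via `pip install -e ./third_party/cpu/python/`, and that
# libxsmm.so is visible through LD_LIBRARY_PATH at runtime.
import torch
import xsmm_utils                                     # hypothetical module name

src = torch.randn(1024, 1024, dtype=torch.bfloat16)
dst = torch.zeros(1024, 2048, dtype=torch.bfloat16)  # padded destination buffer
xsmm_utils.parallel_copy(dst[:, :1024], src)          # hypothetical function name
```
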
Adds an experimental rewrite collapsing a reduction loop over GEMM into a BRGEMM ukernel.

The pattern matches the hand-written kernel using block pointers and is not compatible with IR generated by the Triton pointer raising. Direct lowering to XSMM allows bypassing Triton's load restriction when the K dimension is not a power of two.
The pattern is quite brittle but functional for the matmul tutorial example.

The rewriting is disabled by default and can be enabled with the environment variable:
  TRITON_CPU_LOOP_BRGEMM_XSMM=1
Adds an option to apply padding only to matrix B.

This allows exploring potential speedups by limiting padding to
weights, which is a reasonably common strategy in, e.g., ML inference.
Full padding still has to occur when the K dimension is padded, to avoid
dimension mismatches and/or meet the power-of-two size requirement.