Example Conv2D in Triton #591

sebastienwood · 2022-07-19T22:30:38Z


Hi!
I'm interested in implementing a Conv-Nd-like operation in Triton. I'm pretty new to GPU programming and to Triton. The associated paper proposed a Conv2d implementation in C-Triton. Is there a Python equivalent available?
My main technical questions concern the computation of the offsets and the general optimization strategy. Should there be two or more program ids? (I'm not sure what tl.program_id is for, and the documentation is not very clear on it.)
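
On the program-id question, here is a minimal sketch of the usual pattern, written as a pure-Python emulation so it runs without a GPU. In a real Triton kernel, `tl.program_id(axis)` returns the index of the current program instance along one axis of the launch grid, and each instance derives the offsets of its own tile from that index. The tile sizes, the output size, and the helper names `cdiv` and `tile_offsets` are hypothetical and chosen only for illustration.

```python
# Pure-Python emulation of a 2D Triton launch grid (no GPU needed).
# In a real kernel, pid_m and pid_n would come from tl.program_id(0) and
# tl.program_id(1), and the index ranges below from tl.arange.

def cdiv(a, b):
    """Ceiling division, typically used to size a launch grid."""
    return (a + b - 1) // b

def tile_offsets(pid_m, pid_n, block_m, block_n, height, width):
    """Row-major flat offsets of the tile owned by one program, plus a bounds mask."""
    rows = [pid_m * block_m + i for i in range(block_m)]
    cols = [pid_n * block_n + j for j in range(block_n)]
    offsets = [[r * width + c for c in cols] for r in rows]
    # The mask plays the role of the `mask=` argument of tl.load / tl.store:
    # tiles on the border may extend past the tensor.
    mask = [[r < height and c < width for c in cols] for r in rows]
    return offsets, mask

HEIGHT, WIDTH = 5, 6        # hypothetical output size
BLOCK_M, BLOCK_N = 2, 4     # hypothetical tile size
grid = (cdiv(HEIGHT, BLOCK_M), cdiv(WIDTH, BLOCK_N))

covered = []
for pid_m in range(grid[0]):
    for pid_n in range(grid[1]):
        offs, mask = tile_offsets(pid_m, pid_n, BLOCK_M, BLOCK_N, HEIGHT, WIDTH)
        covered += [o for row_o, row_m in zip(offs, mask)
                    for o, m in zip(row_o, row_m) if m]

print(grid)                                            # (3, 2)
print(sorted(covered) == list(range(HEIGHT * WIDTH)))  # True
```

Whether to use one, two, or more program-id axes is a design choice: a single axis with manual index decomposition and several axes can describe the same tiling, so the sketch above is only one possible layout.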

My end goal is not amenable to unfolding. To be precise, I only need to compute the following equation (+ batched):
$Z_{a, b, c, a, d, e} = \sum_g \sum_i \sum_j \gamma_{g, b+i, c+j, g, d+i, e+j}$
Let me know if I can provide further information!
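
As a point of comparison for any kernel, a slow NumPy reference of the sum above can be sketched as follows. Several things here are assumptions rather than part of the post: $i$ and $j$ are taken to range over a square `k` x `k` window with no padding, $\gamma$ is taken to have shape `(G, H, W, G, H, W)`, and, since the right-hand side as written does not depend on $a$, only the `(b, c, d, e)` block is computed. The function name `windowed_diagonal_sum` is hypothetical.

```python
import numpy as np

def windowed_diagonal_sum(gamma, k):
    """out[b, c, d, e] = sum_{g, i, j} gamma[g, b+i, c+j, g, d+i, e+j].

    Assumes gamma has shape (G, H, W, G, H, W) and that i, j range over
    a k x k window without padding (both are assumptions of this sketch).
    """
    G, H, W, G2, H2, W2 = gamma.shape
    if (G, H, W) != (G2, H2, W2):
        raise ValueError("expected gamma of shape (G, H, W, G, H, W)")
    P, Q = H - k + 1, W - k + 1  # number of valid window positions
    out = np.zeros((P, Q, P, Q), dtype=gamma.dtype)
    for g in range(G):
        for i in range(k):
            for j in range(k):
                # Integer indices are basic indexing, so this slice has
                # shape (P, Q, P, Q) and lines up with `out`.
                out += gamma[g, i:i + P, j:j + Q, g, i:i + P, j:j + Q]
    return out

gamma = np.arange(2 * 3 * 3 * 2 * 3 * 3, dtype=np.float64).reshape(2, 3, 3, 2, 3, 3)
z = windowed_diagonal_sum(gamma, k=2)
print(z.shape)  # (2, 2, 2, 2)
```

A batched version would add a leading batch axis to both `gamma` and `out`; the loop structure stays the same.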

yangjunpro · 2022-09-24T02:45:35Z

Collaborator

I think you can try to start from here
We have tried it locally, and functionally it works with ResNet-50 inference problem sizes.
However, the performance is not good (we have seen around a 2x slowdown compared with the PyTorch native implementation, which calls cuDNN). Of course, for a popular workload like RN50 we should not expect a significant perf gain from Triton, but the 2x slowdown still deserves some investigation, which is our current ongoing work.

BTW, as to conv performance, do you have any previous empirical data to share? @ptillet

4 replies

ptillet Sep 26, 2022
Maintainer

Yeah, I remember investigating this code, and finding out that the issue came from lack of vectorization, which snowballed and made async copies inapplicable. Will be worth re-investigating once the new MLIR backend lands.

sebastienwood Oct 3, 2022
Author

Thanks for the detailed answers! Out of curiosity:

Do you have any ETA for the new MLIR backend?
I understand MLIR is a new layer in the LLVM ecosystem and is tailor-made for ML/DL. How does it impact Triton? Does it make some computation graphs easier to optimize? Am I missing some other "cool stuff" behind the scenes?
Are there any readings you would recommend for understanding and contributing to Triton? (LLVM's tutorial seems like the first step to my naive eyes.)

ptillet Oct 4, 2022
Maintainer

I don't want to get ahead of myself, but it's making good progress and we're hoping it'll land in ~2 months.
I wouldn't say MLIR is tailor-made for DL or even Linear Algebra. It's really just nice infrastructure for representing and transforming programs at different levels of abstraction. We don't use it for anything fancy; only to avoid reinventing the wheel on traditional compiler infra components.
I think I would just recommend reading the Triton-MLIR codebase (as well as the MLIR docs) and following the pull requests that are being made there :)

jsonlee0x02 Apr 22, 2024

It seems some effort has gone into this in the current version of Triton.
But I still cannot find any convolution example in the latest Triton codebase.
BTW, are there any updates on the ResNet50 performance?

flishwang · 2024-08-09T01:01:27Z


Two years have passed; are there any updates?


l1351868270 · 2024-09-03T01:34:45Z


@sebastienwood @ptillet You wrote "The associated paper proposed a Conv2d implementation in C-Triton." I implemented one in Python, but its performance is poor and I do not know how to improve it. Could you tell me which paper you mean? I would like to refer to it to improve the performance. https://github.com/l1351868270/implicit_gemm.triton/blob/main/triton_implicit_gemm.py
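
For readers unfamiliar with the implicit-GEMM formulation named in the repository above, here is a minimal pure-Python sketch of the index mapping it relies on. It assumes NCHW input, KCRS weights, a GEMM of shape (N·P·Q) x (C·R·S) x K, and a reduction index ordered as (c, r, s); the linked code may use a different layout, and the helper names `gemm_to_conv_coords` and `conv2d_implicit_gemm` are hypothetical.

```python
def gemm_to_conv_coords(m, k, P, Q, R, S, stride=1, pad=0, dil=1):
    """Map GEMM row m and reduction index k to convolution coordinates."""
    n, pq = divmod(m, P * Q)   # batch index and flattened output pixel
    p, q = divmod(pq, Q)       # output row and column
    c, rs = divmod(k, R * S)   # input channel and flattened filter tap
    r, s = divmod(rs, S)       # filter row and column
    h = p * stride - pad + r * dil
    w = q * stride - pad + s * dil
    return n, c, r, s, h, w

def conv2d_implicit_gemm(x, wgt, stride=1, pad=0, dil=1):
    """Reference conv2d on nested lists; im2col is never materialised."""
    N, C, H, W = len(x), len(x[0]), len(x[0][0]), len(x[0][0][0])
    K, R, S = len(wgt), len(wgt[0][0]), len(wgt[0][0][0])
    P = (H + 2 * pad - dil * (R - 1) - 1) // stride + 1
    Q = (W + 2 * pad - dil * (S - 1) - 1) // stride + 1
    out = [[0.0] * K for _ in range(N * P * Q)]  # GEMM output, (N*P*Q) x K
    for m in range(N * P * Q):
        for k in range(C * R * S):
            n, c, r, s, h, w = gemm_to_conv_coords(m, k, P, Q, R, S, stride, pad, dil)
            # Out-of-bounds taps act as zero padding; a GPU kernel would
            # express this as a load mask.
            if 0 <= h < H and 0 <= w < W:
                for ko in range(K):
                    out[m][ko] += x[n][c][h][w] * wgt[ko][c][r][s]
    return out, (N, P, Q, K)

x = [[[[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]]]  # 1 x 1 x 3 x 3
wgt = [[[[1.0, 1.0], [1.0, 1.0]]]]                           # 1 x 1 x 2 x 2
out, shape = conv2d_implicit_gemm(x, wgt)
print(shape)                    # (1, 2, 2, 1)
print([row[0] for row in out])  # [12.0, 16.0, 24.0, 28.0]
```

The `divmod` chains are what a kernel would recompute per tile from its program id, which is one reason the resulting addresses are less regular than in a plain GEMM.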


l1351868270 · 2024-09-03T06:37:12Z


I found why the performance is poor: when data is loaded from global memory to shared memory, the PTX code does not use the cp.async feature. Could I force tl.load to compile to cp.async?

PS:
I wrote a test with U=1, V=1, pad_h=0, pad_w=0, dila_h=1, dila_w=1, in which case the conv2d is the same as a GEMM; only the shapes differ. When I use x_ptrs and w_ptrs the performance is poor, and when I use a_ptrs and b_ptrs the performance is OK.
The source code is here: https://github.com/l1351868270/implicit_gemm.triton/blob/main/triton_implicit_gemm_1x1_0x0_1x1.py

Commands:
python triton_implicit_gemm_1x1_0x0_1x1.py
python triton_bench_1x1_0x0_1x1.py
ncu -f --set full --call-stack -o bench_conv2d_report python triton_implicit_gemm_1x1_0x0_1x1.py
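
The claim that this configuration reduces to a GEMM can be checked numerically with a small NumPy sketch. It assumes that U and V denote the strides (the post does not say so), that the input is NCHW, and that the 1x1 filter is stored as (K, C, 1, 1); the sizes are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
N, C, H, W, K = 2, 3, 4, 5, 6            # hypothetical small sizes
x = rng.standard_normal((N, C, H, W))    # NCHW input
w = rng.standard_normal((K, C, 1, 1))    # 1x1 filters

# Direct 1x1 convolution (stride 1, no padding, dilation 1):
# every output pixel is a dot product over the input channels.
conv = np.zeros((N, K, H, W))
for n in range(N):
    for k in range(K):
        for h in range(H):
            for wi in range(W):
                conv[n, k, h, wi] = np.dot(x[n, :, h, wi], w[k, :, 0, 0])

# The same computation as one GEMM: (N*H*W, C) @ (C, K).
a = x.transpose(0, 2, 3, 1).reshape(N * H * W, C)
b = w.reshape(K, C).T
gemm = (a @ b).reshape(N, H, W, K).transpose(0, 3, 1, 2)

print(np.allclose(conv, gemm))  # True
```

Since the arithmetic is identical, a performance gap between the two pointer schemes in this configuration would point at how the addresses are computed and loaded rather than at the amount of work.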
