
[PT2.6][Torch.compile] Performance analysis and optimization #1004

Open
riverliuintel opened this issue Oct 22, 2024 · 3 comments
@riverliuintel (Contributor)

🚀 The feature, motivation and pitch

Analyze Triton kernel data and report findings to Triton XPU.

  1. Re-collect reasonable, competitive GPU performance data.
  2. Use the TorchInductor built-in benchmark tool to detect slower XPU Triton kernels (a sketch follows this list).
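
A minimal sketch of how the per-kernel benchmarking could be enabled. The `benchmark_kernel` and `profile_bandwidth` config names, the toy function, and the device check are assumptions (not taken from this issue) and may differ across PyTorch versions:

```python
# Sketch only: surface per-kernel timings via TorchInductor's built-in
# benchmarking hooks; config names are assumptions and may vary by version.
import torch
import torch._inductor.config as inductor_config

# Emit a standalone benchmark harness into each generated Triton kernel file,
# so individual XPU kernels can be re-run and timed in isolation.
inductor_config.benchmark_kernel = True
# Optionally also report achieved memory bandwidth per kernel.
inductor_config.profile_bandwidth = True

def toy(x, y):
    # Arbitrary compute to make Inductor generate a few kernels.
    return torch.nn.functional.gelu(x @ y) + x.sum(dim=-1, keepdim=True)

compiled = torch.compile(toy)
# "xpu" assumes the Intel XPU backend is available; fall back to CPU otherwise.
device = "xpu" if hasattr(torch, "xpu") and torch.xpu.is_available() else "cpu"
a = torch.randn(1024, 1024, device=device)
b = torch.randn(1024, 1024, device=device)
compiled(a, b)
```

With `benchmark_kernel` set, the generated kernel files should carry their own timing entry points, which makes it easier to single out and report the slower XPU kernels.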

Alternatives

No response

Additional context

No response

@jianyizh

scatter op issue: intel/intel-xpu-backend-for-triton#2665

@jianyizh commented Nov 21, 2024

layout issue: intel/intel-xpu-backend-for-triton#2229

  1. When the number of convolutions is small, Inductor turns layout optimization off. We have to force it on with TORCHINDUCTOR_FORCE_LAYOUT_OPT, otherwise we may hit an inefficient kernel such as the cat_layernorm one in this issue (see the sketch after this list).
  2. For both XPU and CUDA, when there are more nodes between the convolutions, unnecessary transposes appear. For example: conv (channels-last) + fused bias and leaky_relu (back to channels-first) + avg_pool (back to channels-last) + conv. It seems Inductor does not propagate the layout. Fusing the bias and activation into the conv mitigates this.
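
A minimal sketch of forcing layout optimization on. The TORCHINDUCTOR_FORCE_LAYOUT_OPT variable comes from the comment above; the toy conv model, shapes, and device check are illustrative assumptions:

```python
# Sketch only: force Inductor layout optimization even when the graph has few convs.
import os
os.environ["TORCHINDUCTOR_FORCE_LAYOUT_OPT"] = "1"  # set before Inductor is imported

import torch
import torch.nn as nn

class SmallConvNet(nn.Module):
    # Hypothetical module with the conv -> bias/activation -> avg_pool -> conv
    # shape discussed above; layer sizes are illustrative only.
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(16, 32, 3, padding=1)
        self.conv2 = nn.Conv2d(32, 32, 3, padding=1)

    def forward(self, x):
        x = nn.functional.leaky_relu(self.conv1(x))  # bias + activation after conv
        x = nn.functional.avg_pool2d(x, 2)
        return self.conv2(x)

device = "xpu" if hasattr(torch, "xpu") and torch.xpu.is_available() else "cpu"
model = torch.compile(SmallConvNet().to(device))
out = model(torch.randn(8, 16, 64, 64, device=device))
```

Comparing the Inductor output code with and without the variable should show whether the inefficient cat_layernorm-style kernel goes away.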

@jianyizh

RNN-related ops: #1109
We should fuse these with oneDNN instead of using torch.compile for these small ops (one possible way to keep the RNN out of the compiled region is sketched below).
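
One possible reading of this, sketched below: keep the RNN itself on the eager path (where fused oneDNN/aten kernels can be used) while still compiling the rest of the model. The wrapper module, layer sizes, and the use of torch.compiler.disable are assumptions, not something prescribed by this issue:

```python
# Sketch only: exclude a small LSTM from the compiled region so it is not
# lowered into many small Inductor ops; eager execution can then use the
# fused RNN kernels (e.g. oneDNN on Intel hardware).
import torch
import torch.nn as nn

class Wrapper(nn.Module):
    # Hypothetical model: a compiled backbone with a small LSTM left to eager.
    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(64, 64)
        self.rnn = nn.LSTM(64, 64, batch_first=True)

    @torch.compiler.disable
    def run_rnn(self, x):
        out, _ = self.rnn(x)  # graph break here; runs eagerly
        return out

    def forward(self, x):
        return self.run_rnn(torch.relu(self.proj(x)))

model = torch.compile(Wrapper())
y = model(torch.randn(4, 16, 64))  # (batch, seq, features)
```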
