
[PT2.6][Torch.compile] Performance analysis and optimization #1004

Open
riverliuintel opened this issue Oct 22, 2024 · 3 comments
@riverliuintel (Contributor)

🚀 The feature, motivation and pitch

Analyze Triton kernel data and report findings to Triton XPU.

  1. Re-collect reasonable, competitive GPU performance data.
  2. Use the TorchInductor built-in benchmark tool to detect slower XPU Triton kernels (a sketch follows this list).
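
A minimal sketch of how the per-kernel benchmarking could be enabled. The `benchmark_kernel` and `profile_bandwidth` config names, the toy function, and the device check are assumptions (not taken from this issue) and may differ across PyTorch versions:

```python
# Sketch only: surface per-kernel timings via TorchInductor's built-in
# benchmarking hooks; config names are assumptions and may vary by version.
import torch
import torch._inductor.config as inductor_config

# Emit a standalone benchmark harness into each generated Triton kernel file,
# so individual XPU kernels can be re-run and timed in isolation.
inductor_config.benchmark_kernel = True
# Optionally also report achieved memory bandwidth per kernel.
inductor_config.profile_bandwidth = True

def toy(x, y):
    # Arbitrary compute to make Inductor generate a few kernels.
    return torch.nn.functional.gelu(x @ y) + x.sum(dim=-1, keepdim=True)

compiled = torch.compile(toy)
# "xpu" assumes the Intel XPU backend is available; fall back to CPU otherwise.
device = "xpu" if hasattr(torch, "xpu") and torch.xpu.is_available() else "cpu"
a = torch.randn(1024, 1024, device=device)
b = torch.randn(1024, 1024, device=device)
compiled(a, b)
```

With `benchmark_kernel` set, the generated kernel files should carry their own timing entry points, which makes it easier to single out and report the slower XPU kernels.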

Alternatives

No response

Additional context

No response

@jianyizh

scatter op issue: intel/intel-xpu-backend-for-triton#2665

@jianyizh commented Nov 21, 2024

layout issue: intel/intel-xpu-backend-for-triton#2229

  1. When the number of convolutions is small, Inductor turns layout optimization off. We have to force it on with TORCHINDUCTOR_FORCE_LAYOUT_OPT, otherwise we may hit an inefficient kernel such as the cat_layernorm one in this issue (see the sketch after this list).
  2. For both XPU and CUDA, when there are more nodes between the convolutions, unnecessary transposes appear. For example: conv (channels-last) + fused bias and leaky_relu (back to channels-first) + avg_pool (back to channels-last) + conv. It seems Inductor does not propagate the layout. Fusing the bias and activation into the conv mitigates this.
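
A minimal sketch of forcing layout optimization on. The TORCHINDUCTOR_FORCE_LAYOUT_OPT variable comes from the comment above; the toy conv model, shapes, and device check are illustrative assumptions:

```python
# Sketch only: force Inductor layout optimization even when the graph has few convs.
import os
os.environ["TORCHINDUCTOR_FORCE_LAYOUT_OPT"] = "1"  # set before Inductor is imported

import torch
import torch.nn as nn

class SmallConvNet(nn.Module):
    # Hypothetical module with the conv -> bias/activation -> avg_pool -> conv
    # shape discussed above; layer sizes are illustrative only.
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(16, 32, 3, padding=1)
        self.conv2 = nn.Conv2d(32, 32, 3, padding=1)

    def forward(self, x):
        x = nn.functional.leaky_relu(self.conv1(x))  # bias + activation after conv
        x = nn.functional.avg_pool2d(x, 2)
        return self.conv2(x)

device = "xpu" if hasattr(torch, "xpu") and torch.xpu.is_available() else "cpu"
model = torch.compile(SmallConvNet().to(device))
out = model(torch.randn(8, 16, 64, 64, device=device))
```

Comparing the Inductor output code with and without the variable should show whether the inefficient cat_layernorm-style kernel goes away.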

@jianyizh

RNN-related ops: #1109
We should fuse these with oneDNN instead of using torch.compile for these small ops (one possible way to keep the RNN out of the compiled region is sketched below).
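
One possible reading of this, sketched below: keep the RNN itself on the eager path (where fused oneDNN/aten kernels can be used) while still compiling the rest of the model. The wrapper module, layer sizes, and the use of torch.compiler.disable are assumptions, not something prescribed by this issue:

```python
# Sketch only: exclude a small LSTM from the compiled region so it is not
# lowered into many small Inductor ops; eager execution can then use the
# fused RNN kernels (e.g. oneDNN on Intel hardware).
import torch
import torch.nn as nn

class Wrapper(nn.Module):
    # Hypothetical model: a compiled backbone with a small LSTM left to eager.
    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(64, 64)
        self.rnn = nn.LSTM(64, 64, batch_first=True)

    @torch.compiler.disable
    def run_rnn(self, x):
        out, _ = self.rnn(x)  # graph break here; runs eagerly
        return out

    def forward(self, x):
        return self.run_rnn(torch.relu(self.proj(x)))

model = torch.compile(Wrapper())
y = model(torch.randn(4, 16, 64))  # (batch, seq, features)
```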
