
Some Triton kernels generated by Inductor have low efficiency on PVC 1550 compared to A100 #2229

Closed
jianyizh opened this issue Sep 13, 2024 · 4 comments

@jianyizh
Contributor

Hi, I found that some Triton kernels generated by Inductor for the TorchBench vit-base model are slower than on A100. The bandwidth seems pretty low. I'm using the public PyTorch master branch with the XPU build.
PVC:
python cat_layernorm.py
0.403ms 0.039GB 96.59GB/s
python gelu.py
0.362ms 0.155GB 428.07GB/s
python layernorm.py
0.296ms 0.058GB 196.21GB/s
python safe_softmax.py
0.495ms 0.238GB 481.51GB/s
A100:
python cat_layernorm_nv.py
0.089ms 0.039GB 437.12GB/s
python gelu_nv.py
0.144ms 0.155GB 1073.02GB/s
python layernorm_nv.py
0.062ms 0.058GB 930.15GB/s
python safe_softmax_nv.py
0.261ms 0.238GB 913.15GB/s
reproducer.zip
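
Roughly, each script follows this pattern (a minimal sketch; the shapes, dtype, and the read-plus-write traffic estimate here are illustrative, not copied from the zip):

```python
import torch
from triton.testing import do_bench

# Hypothetical stand-in for layernorm.py; shapes/dtype are assumed, not from the zip.
device = "xpu" if hasattr(torch, "xpu") and torch.xpu.is_available() else "cuda"
x = torch.randn(64 * 197, 768, device=device, dtype=torch.float16)
ln = torch.nn.LayerNorm(768, device=device, dtype=torch.float16)
fn = torch.compile(ln)  # Inductor generates the Triton kernel under test

ms = do_bench(lambda: fn(x))                 # time in ms
gb = 2 * x.numel() * x.element_size() / 1e9  # traffic estimate: read input + write output
print(f"{ms:.3f}ms {gb:.3f}GB {gb / (ms / 1e3):.2f}GB/s")
```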

@vlad-penkin
Contributor

vlad-penkin commented Sep 13, 2024

@jianyizh I've got higher numbers for your reproducer:

| Kernel | Original PVC Time (ms) | Original PVC Bandwidth (GB) | Original PVC Speed (GB/s) | PVC 1100 Time (ms) | PVC 1100 Bandwidth (GB) | PVC 1100 Speed (GB/s) | A100 Time (ms) | A100 Bandwidth (GB) | A100 Speed (GB/s) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| cat_layernorm | 0.403 | 0.039 | 96.59 | 0.290 | 0.039 | 134.47 | 0.089 | 0.039 | 437.12 |
| gelu | 0.362 | 0.155 | 428.07 | 0.201 | 0.155 | 770.94 | 0.144 | 0.155 | 1073.02 |
| layernorm | 0.296 | 0.058 | 196.21 | 0.131 | 0.058 | 443.38 | 0.062 | 0.058 | 930.15 |
| safe_softmax | 0.495 | 0.238 | 481.51 | 0.437 | 0.238 | 545.68 | 0.261 | 0.238 | 913.15 |

Could you please provide more details on your env by running these two commands:

- ./scripts/capture-hw-details.sh
- pip list | grep -iE "torch|triton"

@jianyizh
Contributor Author

@vlad-penkin Thank you for the reply. I think the performance of the two layernorm kernels (~50% of A100) and softmax (~60%) is still not good compared with A100.

./scripts/capture-hw-details.sh
LIBIGC1_VERSION=1.0.16900.24-914
LEVEL_ZERO_VERSION=1.3.29735.27-914
AGAMA_VERSION=914
GPU_DEVICE=Intel(R) Data Center GPU Max 1550

pip list | grep -iE "torch|triton"
bert_pytorch 0.0.1a4 /home/sdp/jianyi/oob/benchmark/torchbenchmark/models/BERT_pytorch
functorch 1.14.0a0+b71aa0b
pytorch-labs-segment-anything-fast 0.2
torch 2.6.0a0+gite6b6835 /home/sdp/jianyi/pytorch
torch_geometric 2.4.0
torchao 0.5.0
torchaudio 2.5.0a0+97ed7b3 /home/sdp/jianyi/audio/src
torchvision 0.20.0a0+838ad6c /home/sdp/jianyi/vision
triton 3.0.0
triton-xpu 3.0.0b2

vlad-penkin self-assigned this Sep 13, 2024
@jianyizh
Contributor Author

jianyizh commented Oct 9, 2024

@vlad-penkin Hi, I tested on a PVC 1100 but still cannot get results as good as yours:

| Kernel | Time (ms) | Bandwidth (GB) | Speed (GB/s) |
| --- | --- | --- | --- |
| cat_layernorm | 0.592 | 0.039 | 65.78 |
| gelu | 0.539 | 0.155 | 287.53 |
| layernorm | 0.423 | 0.058 | 137.32 |
| safe_softmax | 0.720 | 0.238 | 331.07 |

LIBIGC1_VERSION=1.0.17193.16-950.16
LEVEL_ZERO_VERSION=1.3.30049.10-950.16
AGAMA_VERSION=950.16
GPU_DEVICE=Intel(R) Data Center GPU Max 1100

intel-xpu-backend-for-triton$ git status
HEAD detached at 91b14bf

pytorch commit: 9529d018e937b8b2a53cccc72f38be283828715a

I compiled it from source using oneAPI 2025.0.0.533.

@jianyizh
Contributor Author

This cat_layernorm is followed by a channels-first conv; if the layout optimization is forced to use channels-last, the bandwidth increases to ~300 GB/s. A sketch of forcing that is below.
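
A minimal sketch of forcing that, assuming the `torch._inductor.config.force_layout_optimization` flag available in recent PyTorch builds (verify it exists in yours); the module and shapes are made up for illustration, not the vit-base graph:

```python
import torch
import torch._inductor.config as inductor_config

# Assumption: this knob exists in recent PyTorch and forces Inductor's
# channels-last layout optimization for convolutions; verify on your build.
inductor_config.force_layout_optimization = True

# Toy stand-in for the cat + layernorm -> channels-first conv pattern.
class Block(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.ln = torch.nn.LayerNorm(768)
        self.conv = torch.nn.Conv2d(768, 768, kernel_size=3, padding=1)

    def forward(self, a, b):
        x = self.ln(torch.cat([a, b], dim=1))        # the fused cat_layernorm
        n, hw, c = x.shape
        x = x.transpose(1, 2).reshape(n, c, 14, 14)  # NCHW for the conv
        return self.conv(x)

m = torch.compile(Block().to("xpu"))
a = torch.randn(64, 98, 768, device="xpu")
b = torch.randn(64, 98, 768, device="xpu")
out = m(a, b)

# Manual alternative: convert the module and its inputs to channels-last
# yourself with .to(memory_format=torch.channels_last).
```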
