
09-small-models-road-to-the-top-part-2 runs very slow on AMD Radeon 7900XTX using ROCm PyTorch #96

Open
briansp2020 opened this issue Sep 18, 2023 · 2 comments

@briansp2020

Hi,
I'm not sure where to start, so I'm posting here hoping that someone with more knowledge can help me out. I'm trying to run these notebooks on my system with a 7900XTX, and they run very slowly. The code that uses resnet26d seems OK, but the code that uses convnext_small_in22k is very slow. I also tried convnext_small, so that it uses the model from torchvision, but that runs just as slowly.
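For context, this is roughly the kind of learner setup the notebook uses (a minimal sketch; the dataset path and transform sizes are my illustrative assumptions, not the exact notebook code):

```python
# Minimal sketch of the fastai + timm setup (the path and transform
# sizes are illustrative assumptions, not the exact notebook code).
from fastai.vision.all import *

path = Path('train_images')  # hypothetical dataset location
dls = ImageDataLoaders.from_folder(
    path, valid_pct=0.2, seed=42,
    item_tfms=Resize(480), batch_tfms=aug_transforms(size=224))

# resnet26d trains at a reasonable speed; this timm model is the slow one:
learn = vision_learner(dls, 'convnext_small_in22k', metrics=error_rate)
learn.fine_tune(1)
```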

I first thought that ROCm PyTorch had not optimized the model yet, but I found that the PyTorch micro-benchmark (https://github.com/ROCmSoftwarePlatform/pytorch-micro-benchmarking) actually shows the 7900XTX running faster when using the torchvision model.

(pt) root@rocm:~/pytorch-micro-benchmarking# python3 micro_benchmarking_pytorch.py --network convnext_small
INFO: running forward and backward for warmup.
INFO: running the benchmark..
OK: finished running benchmark..
--------------------SUMMARY--------------------------
Microbenchmark for network : convnext_small
Num devices: 1
Dtype: FP32
Mini batch size [img] : 64
Time per mini-batch : 2.088879442214966
Throughput [img/sec] : 30.638436429886497

Running the same test on my 3080 Ti gives:

(pt) bsp2020@Ryzen5950X:~/pytorch-micro-benchmarking$ python3 micro_benchmarking_pytorch.py --network convnext_small
INFO: running forward and backward for warmup.
INFO: running the benchmark..
OK: finished running benchmark..
--------------------SUMMARY--------------------------
Microbenchmark for network : convnext_small
Num devices: 1
Dtype: FP32
Mini batch size [img] : 64
Time per mini-batch : 18.948059797286987
Throughput [img/sec] : 3.3776545295241056
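(The throughput figure is just the mini-batch size divided by the time per mini-batch, so the two summaries are internally consistent:)

```python
# Throughput [img/sec] = mini-batch size / time per mini-batch
print(64 / 2.088879442214966)   # ~30.64  (7900XTX)
print(64 / 18.948059797286987)  # ~3.38   (3080 Ti)
```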

Could anyone please help me figure out what is going on? Any help would be appreciated.

@briansp2020 (Author)

I just tried the PyTorch nightly build with ROCm 5.7.1 (not released yet) and it now runs much faster. The PyTorch micro-benchmark has improved a lot as well.
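In case anyone wants to reproduce this, I installed the nightly roughly like this (the index URL is my assumption based on PyTorch's usual nightly wheel naming; check pytorch.org for the current one):

```python
# Run in a shell (not Python) -- assumed nightly ROCm wheel index:
#   pip3 install --pre torch torchvision --index-url https://download.pytorch.org/whl/nightly/rocm5.7
# Then verify what you got:
import torch
print(torch.__version__)          # nightly version string
print(torch.version.hip)          # HIP/ROCm runtime the wheel was built against
print(torch.cuda.is_available())  # ROCm devices show up through the CUDA API
```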

(pt) root@rocm:~/pytorch-micro-benchmarking# python micro_benchmarking_pytorch.py --network convnext_small
INFO: running forward and backward for warmup.
INFO: running the benchmark..
OK: finished running benchmark..
--------------------SUMMARY--------------------------
Microbenchmark for network : convnext_small
Num devices: 1
Dtype: FP32
Mini batch size [img] : 64
Time per mini-batch : 0.43914319276809693
Throughput [img/sec] : 145.73834014500406

FP16 performance is still very poor compared to the 3080 Ti:

(pt) root@rocm:~/pytorch-micro-benchmarking# python micro_benchmarking_pytorch.py --network convnext_small --fp16 1
INFO: running forward and backward for warmup.
INFO: running the benchmark..
OK: finished running benchmark..
--------------------SUMMARY--------------------------
Microbenchmark for network : convnext_small
Num devices: 1
Dtype: FP16
Mini batch size [img] : 64
Time per mini-batch : 0.2943673968315125
Throughput [img/sec] : 217.41538189649373

The FP16 micro-benchmark is still far slower than the 3080 Ti (558 img/sec), but the fastai example now runs slightly faster than on the 3080 Ti.
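For reference, I believe the fastai run enables fp16 via fastai's to_fp16() wrapper; a minimal sketch (the learner setup is the same illustrative one as in my first snippet, with dls defined there):

```python
from fastai.vision.all import *

# to_fp16() is fastai's real mixed-precision API; 'dls' is from the
# earlier illustrative sketch, not the exact notebook code.
learn = vision_learner(dls, 'convnext_small_in22k', metrics=error_rate).to_fp16()
learn.fine_tune(1)
```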

Good job AMD!
I'm looking forward to official 7900XTX support!

Not closing the issue yet, since I believe the performance can be improved further.

@briansp2020 (Author)

It seems the performance improvement came with the recent update to PyTorch and not with the libraries that ship with ROCm 5.7.1: I reran the same code in the 5.7 environment and the performance improvement is still there.
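To rule out the benchmark script itself, here is a rough standalone timing of convnext_small forward+backward (a sketch using timm directly; event-based timing works the same on ROCm builds, so the numbers won't match the benchmark script exactly):

```python
# Rough standalone timing of convnext_small fwd+bwd, batch 64, FP32.
import torch, timm

model = timm.create_model('convnext_small', pretrained=False).cuda()
x = torch.randn(64, 3, 224, 224, device='cuda')

for _ in range(3):  # warmup
    model(x).sum().backward()
torch.cuda.synchronize()

start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)
start.record()
for _ in range(10):
    model(x).sum().backward()
end.record()
torch.cuda.synchronize()
print('ms per mini-batch:', start.elapsed_time(end) / 10)
```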

The code I'm using is here. It contains two of the fastai quickstart examples. The first section uses the convnext_small model that just saw a huge improvement. The second section is a text-processing example that has not seen any speedup. Could whoever just shipped the huge speed improvement also take a look at the text processing?
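For reference, the text section is essentially the IMDB sentiment example from the fastai text quickstart (reproduced here from memory, so treat it as approximate). It trains an AWD_LSTM, which exercises RNN kernels rather than the convolutions that just got faster:

```python
# fastai text quickstart (IMDB sentiment) -- approximately what the
# slow second section runs. AWD_LSTM stresses RNN kernels, not convs.
from fastai.text.all import *

dls = TextDataLoaders.from_folder(untar_data(URLs.IMDB), valid='test')
learn = text_classifier_learner(dls, AWD_LSTM, drop_mult=0.5, metrics=accuracy)
learn.fine_tune(2, 1e-2)
```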

I'm not sure if anyone is reading this, or whether my report was the one that brought the 7900XTX performance issue to the developers' attention. But I'm so excited I just had to post this :)
