Benchmarking Results #530
-
Are you using beam search or greedy search? It has an impact on performance. I have an old 1070 Ti GPU, and a 2-hour podcast takes around 60 minutes to transcribe (medium.en model). I am considering an upgrade, so I did some research on GPU performance. As far as I know, FP16 throughput is what determines transcription speed. @jongwook, please correct me if I am wrong. These are my theoretical values; the 4090 looks like a beast for machine learning. I would be happy if someone could share benchmarking values for other GPUs as well.
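For anyone who wants to check or switch between the two, the decoding strategy is just a parameter of `transcribe()` in the openai-whisper Python API. A minimal comparison sketch (the audio file name is a placeholder; `beam_size=5` is simply a common choice, not necessarily what you are running):

```python
import whisper

model = whisper.load_model("medium.en")
audio = "podcast.mp3"  # placeholder file name

# Greedy decoding: used when beam_size is not set; the fastest option.
greedy = model.transcribe(audio, fp16=True)

# Beam search with 5 hypotheses per segment: usually a bit more accurate,
# but noticeably slower because several candidates are decoded per segment.
beam = model.transcribe(audio, beam_size=5, fp16=True)

print(greedy["text"][:200])
print(beam["text"][:200])
```

`fp16=True` is the default on GPU, which is why a card's FP16 throughput matters so much here.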
-
I have an RTX 3060 12 GB and it can transcribe with the large model. You can watch my video and read the comments here: How Good is RTX 3060 for ML AI Deep Learning Tasks and Comparison With GTX 1050 Ti and i7 10700F CPU: https://www.youtube.com/watch?v=q8Q8CCDdSKo
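As a side note, if anyone wants to verify what their own card holds, PyTorch's peak-memory counter gives a quick answer (the file name below is a placeholder; the repository's README lists the large model at roughly 10 GB of VRAM):

```python
import torch
import whisper

model = whisper.load_model("large", device="cuda")
model.transcribe("sample.mp3")  # placeholder audio file
print(f"Peak VRAM used: {torch.cuda.max_memory_allocated() / 2**30:.1f} GiB")
```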
-
I am wondering approximately how many parallel audio streams an Nvidia A100, H100, or GeForce 4090 could transcribe in real time (tiny and small models).
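A rough way to estimate this yourself: measure the real-time factor (transcription time divided by audio duration) for a single stream, then take its inverse as an upper bound on the number of real-time streams. A sketch with the openai-whisper API (file name and model choice are placeholders, and the estimate ignores contention between concurrent jobs):

```python
import time
import whisper

model = whisper.load_model("small")                 # or "tiny"
audio = whisper.load_audio("test_podcast.mp3")      # placeholder input file
duration = len(audio) / whisper.audio.SAMPLE_RATE   # seconds of audio (16 kHz mono)

start = time.perf_counter()
model.transcribe(audio)
elapsed = time.perf_counter() - start

rtf = elapsed / duration
print(f"Real-time factor {rtf:.3f} -> roughly {1 / rtf:.1f} real-time streams, "
      "before accounting for overhead when running them concurrently")
```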
-
A100 (PCIe version, 250 W cap)

Inputs (format: podcasts):

- uab-warzone_mixdown.mp3 --language en --model … # 1:22:5h
- swr2wissen-20221230-tee-in-der-weltgeschichte-22-teekriege-und-die-macht-der-tee-nationen.m.mp3 --language de --model … # 0:30:58h
-
@commonism So an A100 could potentially transcribe 5 or 6 audio sources in parallel, in real time (medium model)?
-
Using the A100 80 GB cards, very likely.
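One way to confirm instead of extrapolating from single-stream speed: launch several CLI transcriptions at once and check that each still finishes faster than real time. A sketch (file names are placeholders; each process loads its own copy of the model, roughly 5 GB for medium, so six copies should fit comfortably in 80 GB):

```python
import subprocess
import time

files = [f"stream{i}.mp3" for i in range(1, 7)]  # placeholder input files

start = time.perf_counter()
procs = [
    subprocess.Popen(["whisper", f, "--model", "medium", "--language", "en"])
    for f in files
]
for p in procs:
    p.wait()
print(f"6 parallel transcriptions finished in {time.perf_counter() - start:.0f} s")
```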
-
@MetaAnomie What are the units on the chart for transcription time?
-
Any tests for the AMD Radeon PRO W7000 Series (https://www.amd.com/en/graphics/workstations) and ROCm 7?
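I haven't tried those cards, but Whisper is plain PyTorch, so in principle it runs wherever a ROCm build of PyTorch works; PyTorch's ROCm backend is exposed through the usual torch.cuda API. A quick sanity check, assuming a ROCm build of PyTorch is installed (the audio file name is a placeholder):

```python
import torch
import whisper

# On ROCm builds, AMD GPUs appear through the regular CUDA API,
# and torch.version.hip is set instead of torch.version.cuda.
print("GPU available:", torch.cuda.is_available())
print("HIP version:", torch.version.hip)

device = "cuda" if torch.cuda.is_available() else "cpu"
model = whisper.load_model("small", device=device)
print(model.transcribe("sample.mp3")["text"][:200])  # placeholder audio file
```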
-
I've been experimenting myself with a 3060 Ti (8 GB VRAM, "medium" model) for inference and an i7-13700K for preprocessing, as follows. I needed it to do better, and there must be a way to make it better. I have no idea why inference took longer after ffmpeg preprocessing, even though the preprocessing step itself was 5x faster. Edit: without preprocessing, the inference time is 627.51 s for that 1:30:00 file.
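The original preprocessing command isn't shown above, so this is only a guess at the usual step: resampling to 16 kHz mono PCM, which is the format Whisper decodes to internally anyway (its audio loader invokes ffmpeg on whatever file it receives). A timing sketch for comparing the two paths (file names and the medium model are assumptions):

```python
import subprocess
import time
import whisper

src = "podcast_90min.mp3"        # placeholder for the 1:30:00 file
pre = "podcast_90min_16k.wav"

# Typical preprocessing: resample to 16 kHz mono PCM.
t0 = time.perf_counter()
subprocess.run(["ffmpeg", "-y", "-i", src, "-ar", "16000", "-ac", "1", pre], check=True)
print(f"ffmpeg: {time.perf_counter() - t0:.1f} s")

model = whisper.load_model("medium")
for path in (src, pre):
    t0 = time.perf_counter()
    model.transcribe(path)
    print(f"{path}: {time.perf_counter() - t0:.1f} s inference")
```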
-
I've been doing some performance benchmarking recently, so I'm attaching the results here in case they are a useful reference for anyone. I'll include more in the future as I go.