
Evaluate Whisper transcription algorithm #1335

Open
lfcnassif opened this issue Sep 23, 2022 · 77 comments

@lfcnassif
Member

lfcnassif commented Sep 23, 2022

Recently made public:
https://openai.com/blog/whisper/
https://github.com/openai/whisper

Interesting: they have multilingual models that can be used for multiple languages without fine-tuning for each language. They claim their models generalize better than models that need fine-tuning, like wav2vec. Some numbers on the FLEURS dataset (e.g. 4.8% WER on the Portuguese subset):
https://github.com/openai/whisper#available-models-and-languages

@lfcnassif lfcnassif added the task label Sep 23, 2022
@lfcnassif
Member Author

A preliminary run of the largest Whisper model on the TEDx pt-BR dataset resulted in 20.6% WER. Numbers for other models are here: https://github.com/sepinf-inc/IPED/wiki/User-Manual#wav2vec2

The largest Whisper model is more than an order of magnitude slower than wav2vec2 with 1B parameters on an RTX 3090, so it is not usable in practice. Maybe one of the smaller Whisper models could offer reasonable accuracy and speed.

@lfcnassif lfcnassif changed the title Evaluate new transcription algorithm Evaluate Whisper transcription algorithm Sep 27, 2022
@lfcnassif
Member Author

I tried to transcribe ~10h of audio using the largest Whisper model on an RTX 3090; the estimated time to finish was 4 days, so I aborted the test. It is not feasible to use in practice. The current wav2vec2 algorithm with 1B parameters took about 22min to transcribe ~29h of audio using 3 RTX 3090 GPUs (in 2 nodes), so the largest Whisper model is more than 2 orders of magnitude slower than what we have today.

I'll try their smallest model (36x faster) to see how its accuracy holds up on the test datasets.

@rafael844

Hi, is there a way we can test Whisper with IPED? Is there a snapshot with it that we could use?

@lfcnassif
Member Author

I think I didn't push the POC implementation; the 250x time cost compared to wav2vec2 made me very skeptical about using Whisper in production. I haven't tested their smaller models yet, but the accuracy may drop a lot.

If you really would like to try it, it is easy to adapt the script below using the Whisper example code from their GitHub main page:
https://github.com/sepinf-inc/IPED/blob/master/iped-app/resources/scripts/tasks/Wav2Vec2Process.py
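
For reference, a minimal sketch (untested) of what that change could look like, using the example API from the Whisper README; the model size, the forced language and the function name are illustrative choices, and all of the IPED process protocol handling is left out:

```python
# Minimal sketch, assuming the openai-whisper package is installed (pip install openai-whisper).
import whisper

model = whisper.load_model("large")  # smaller options: "medium", "small", "base", "tiny"

def transcribe(audio_path: str) -> str:
    # Forcing the language skips Whisper's language auto-detection pass.
    result = model.transcribe(audio_path, language="pt")
    return result["text"]
```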

@lfcnassif
Member Author

Their smaller model should still be 7x slower than wav2vec2 according to my tests and their published relative model costs.

@rafael844

rafael844 commented Dec 1, 2022

Thanks @lfcnassif. I don't know how to program very well, but I'll see if a colleague can help me. This was a request from my superiors.

@lfcnassif
Member Author

lfcnassif commented Feb 25, 2023

Hi @rafael844,

I just found the multilingual (impressive to me!) Whisper models on Hugging Face: https://huggingface.co/openai/whisper-large-v2

So maybe you just need to set the huggingFaceModel parameter in conf/AudioTranscriptConfig.txt to openai/whisper-large-v2 in IPED 4.1 (I didn't test it).

Jonatas Grosman also fine-tuned that multilingual model for Portuguese (https://huggingface.co/jonatasgrosman/whisper-large-pt-cv11), although that is not strictly needed, so you can also try jonatasgrosman/whisper-large-pt-cv11 if the above doesn't work.

But I warn you: my past tests showed a 250x slowdown compared to wav2vec2. The large Whisper model's accuracy seems better, and it also produces punctuation and capitalization, but I don't think the 250x cost is worth paying at scale.

You may try smaller Whisper models, but accuracy should drop: https://huggingface.co/openai

@lfcnassif
Member Author

I just tested it: it doesn't work out of the box and needs code changes.

@rafael844

rafael844 commented Feb 25, 2023

Thank you @lfcnassif.

I'll take a look. But with my lack of programming skills and with those results, we will stick with wav2vec2 and the existing models. It would be nice to have punctuation and caps, but as you said, 250x is not worth it.

Wav2vec2 does a good job; even with our cheap, weak GPUs we can spread the job across multiple machines, which is great, and the results have been good so far.

@lfcnassif
Member Author

You are welcome!

@lfcnassif
Member Author

The price is 1/3 compared to Microsoft/Google.

@MariasStory

Try whisper.cpp.

@lfcnassif
Member Author

Try whisper.cpp.

Thanks for pointing it out. Unfortunately they don't support GPU, and transcribing a 4s audio on a 48-thread CPU took 32s using the medium-size model in a first run/test (the large model should be ~2x slower). Strangely, the second run took 73s and a third run took 132s...

@MariasStory

Strange. On my Ubuntu Linux, in a Docker container, the compiled whisper.cpp ./main runs the large model (~2.9 GB) on 4 CPU cores at about 4x the recorded duration.
The small model runs at about real time.

Create and use an image for running with Docker:
docker run --name whisper -it -v $(pwd)/:/host python /bin/bash -c "git clone https://github.com/ggerganov/whisper.cpp.git && cd whisper.cpp && make large && mv models/ggml-large.bin /host/"
docker commit whisper whisper:latest
docker rm whisper
ffmpeg -y -i test1.mp3 -ar 16000 -ac 1 -c:a pcm_s16le temp.wav && sudo docker run -it --rm --name whisper -v $(pwd)/:/host --network none whisper /whisper.cpp/main -m /host/ggml-large.bin -f /host/temp.wav -l de -oj -of /host/test1

@lfcnassif
Member Author

Another optimized implementation to be tested; they say it is 4x faster than the original OpenAI implementation on GPU:
https://github.com/guillaumekln/faster-whisper

@lfcnassif
Member Author

https://github.com/guillaumekln/faster-whisper

The project claims to transcribe a 13min audio in ~1min using a Tesla V100S (an older GPU than ours); that's just ~3x slower than the 1B-parameter wav2vec2 model we use on the RTX 3090. Given the 4.5x speed-up they report, that is incompatible with my past tests, which showed a 250x slowdown when switching from the 1B wav2vec2 to the Whisper large model, so I'll try to run the performance tests again...

@lfcnassif
Member Author

lfcnassif commented Jun 3, 2023

Another promising one:
https://github.com/sanchit-gandhi/whisper-jax

By processing audio in batches and using TPUs, it can give up to a 70x speed-up.

@DHoelz
Contributor

DHoelz commented Jul 18, 2023

Hi @lfcnassif,

We’ve been trying to use wav2vec2 to transcribe our audios, but the results we were getting were a bit disappointing, as the transcriptions were often barely readable, especially compared to Azure (which we don’t have a contract with).

For that reason, we looked for other options and found OpenAi’s Whisper project.

Although slower than wav2vec2, the results were A LOT better, comparable to Azure’s transcription.

For our tests I tried the Whisper and Faster-Whisper implementations (and will probably try Whisper-JAX later, although we don’t have a TPU).
The model used was pierreguillou/whisper-medium-portuguese, which according to the author has a WER of only 6.5987.

The tests were done on an HP Z4 with a Xeon W-2175, 64 GB of RAM and a Quadro P4000.
The audio sample is only 42 seconds long.

Wav2Vec2: 3.7 s
eu sei peu tenho ciência disso e...eu sei peu tenho ciência disso e você sabe dasse toda todo então eu stôu correndo é pra passar dinheiro pra você eu não tenho conversa pra você tedo é passar dinheiro eu tenho que passando dinheiro passando dinheiro passando dinheiro isso aí é porque go de falei ter uma coisa pra pra ser resolvido até acho que até o final do mês resolve não é nada grande não mas aí o cara já vai me adiantar e prca meno de mil e quinhentos e dois mil e depois eu passo pra ele mas eu sto te passando mil quinhentos e dois mil e pretendo passar mais aí logo logo tá stô falando é é logo logo mesmo não vai parar por aí não só me falei qual que é o bancoe

Whisper: 23.02 s
Eu sei, eu tenho ciência disso e você sabe da história toda. Eu tô correndo é para passar dinheiro para você, eu não tenho conversa para você, entendeu? é passar dinheiro, eu tenho que ir passando dinheiro, passando dinheiro, passando dinheiro. Isso aí é porque, como eu te falei, tem uma coisa para ser resolvida até o final do mês resolve? não é nada grande não, mas aí o cara já vai me adiantar a pelo menos mil e quinhentos dois mil e depois eu passo para ele. Mas eu tô te passando mil e quinhentos dois mil e pretendo passar mais aí logo logo, tô falando é logo logo mesmo, não vai parar por aí não, só me fale qual é o banquinho.

Faster-Whisper: 8.59 s
Eu sei, eu tenho ciência disso. E você sabe da história toda, então eu tô correndo é para passar dinheiro para você, eu não tenho conversa para você, entendeu? é para passar dinheiro, eu tenho que ir passando dinheiro, passando dinheiro, passando dinheiro, isso aí é porque, como eu te falei, tem uma coisa para ser resolvida até, acho que até o final do mês resolve, não é nada grande não, mas aí o cara já vai me adiantar, pelo menos meia quinhentos, dois mil e depois eu vou por ele, mas eu tô te passando meia quinhentos, dois mil e pretendo passar mais aí logo, logo, tô falando é logo, logo mesmo, não vai parar por aí não, só me fale qual é o banquê.

Azure:
Eu sei, pô, tenho ciência disso....Eu sei, pô, tenho ciência disso. E você sabe da história toda, então. Eu tô correndo é pra passar dinheiro para você. Eu não tenho conversa pra você, entendeu? É passar dinheiro, eu tenho que passar no dinheiro, passando dinheiro, passando dinheiro, isso aí é porque? Gosto de falei, tem uma coisa para para ser resolvida. Até acho que até o final do mês resolve. Não é nada grande, não, mas aí o cara já vai me adiantar aí pelo −1500 2000 e depois eu passo para ele. Mas eu estou te passando 1502 1000 e pretendo passar mais aí logo logo, tá, tô falando, é. É logo logo mesmo, não vai parar por aí não. Só me fala aí, qual que é o banco aí?

It would be nice to have Whisper as an option to use with IPED as it's free, runs locally (no need to send data to the cloud), has punctuation (which makes reading considerably better) and the results are comparable to Azure’s service.

@lfcnassif
Member Author

We’ve been trying to use wav2vec2 to transcribe our audios but the results we were getting was a bit disappointing

What model have you used? Have you used jonatasgrosman/wav2vec2-xls-r-1b-portuguese ?

Although slower than wav2vec the results were A LOT better, comparable to Azure’s transcription.

Have you measured WER on your data set? How many audios do you have, and what is their total duration? If you can help compare Whisper models properly against the wav2vec2 models, I can send you the public data sets used in this study:
https://user-images.githubusercontent.com/7276994/183307766-cec85345-bd28-44a8-91ec-20451ff50f19.png

The model used was pierreguillou/whisper-medium-portuguese, that according to the author has a WER of only 6.5987.

On what data set?

QUADRO P4000.

So you have a GPU without TPUs, right?

The audio sample has only 42 seconds.

Well, I think it is not enough to represent the variability we can find in seized data sets... Anyway, have you computed WER on this 42-second audio, so we can also have an objective measure instead of just subjective impressions (which are also important)?

has punctuation (which makes reading considerably better)

I understand this is an advantage not captured by traditional WER...
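
For completeness, a minimal sketch of how WER could be computed for such a comparison, using the jiwer library as one option (the normalization step is an assumption, intended to keep Whisper's casing and punctuation from inflating the score; the reference/hypothesis strings are placeholders):

```python
import re
import jiwer  # pip install jiwer

def normalize(text: str) -> str:
    # Lowercase and strip punctuation so casing/punctuation differences
    # between models do not count as word errors.
    text = re.sub(r"[^\w\s]", " ", text.lower())
    return re.sub(r"\s+", " ", text).strip()

reference = "transcrição de referência validada manualmente"   # gold standard
hypothesis = "transcrição de referência validada manualmente"  # model output
print("WER:", jiwer.wer(normalize(reference), normalize(hypothesis)))
```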

@DHoelz
Contributor

DHoelz commented Jul 18, 2023

What model have you used? Have you used jonatasgrosman/wav2vec2-xls-r-1b-portuguese ?

We tried both large and small models (from jonatasgrosman and edresson). They had similar results, but the large one took a lot longer to transcribe.

Have you measured WER on your data set? How many audios do you have, what is the total duration? If you can help to compare whisper models properly to wav2vec2 models, I can send you the public data sets used in this study:

We didn't measure the WER. That value was reported by the model's author.
https://huggingface.co/pierreguillou/whisper-medium-portuguese

Our tests used 6,000 audios from one examination, adding up to roughly 1,000 minutes (16.9 hours).
All audios were transcribed using wav2vec2 (large model), Azure, Whisper and Faster-Whisper.
Whisper took around 11 hours and Faster-Whisper around 5 hours. The 42-second audio transcription I sent was only an example of how much better Whisper is compared to wav2vec2, and how close it can be to Azure’s service.

Based on every test we made with wav2vec2 (on several examinations), we concluded that it was simply better not to deliver the transcriptions, as much of the time they were unreadable.

And as a note, I probably didn’t implement Whisper and Faster-Whisper in the best way possible, meaning there is probably room for improvement in speed.

About the dataset, I could try testing it. I don't know how these datasets are built or whether they include the "kind" of audio we normally need to transcribe. Let’s just say our audios cover a multitude of forms of Portuguese.

On what data set?

According to the author: common_voice_11_0 dataset

So you have a GPU without TPUs, right?

Yes, no TPU here.

Well, I think it is not enough to represent the variability we can find in seized data sets... Anyway, have you computed WER on this 42 seconds audio so we can also have an objective measure instead of just feelings (which is also important).

As I said before, unfortunately just feelings. But the general feeling here is that it's way better :-)

@leosol
Contributor

leosol commented Jul 18, 2023

This feature would be very interesting for those who do not have a contract for audio transcription with third parties (which I believe is the majority of Brazilian states).
Even though it might be a time-consuming process, the transcription often ends up being included in the body of the report, and something more accurate would be very welcome, especially since in the most frequent cases the transcription is done on a small set of chats.

@lfcnassif
Member Author

Hi @DHoelz and @leosol, thanks for your contributions to the project.

We tried both large and small models (from jonatasgrosman and edresson). They had similiar results

Well, looking at the numbers of the tests I referenced, I think 25% fewer errors is a reasonable difference. Of course this can change depending on the data set...

The 42 seconds audio transcription I sent was only an example of how better whisper is compared to wav2vec, and how close it can be to Azure’s Service.

It looks better for this audio, but without a gold standard I can't come to any scientific conclusion about which model is better. The same applies among Whisper, Faster-Whisper and Whisper-JAX: which is best? Please also note there is an open enhancement for wav2vec2 (#1312) to avoid wrong (out-of-vocabulary) words.

Based on every test we made with wav2vec (on several exams), we concluded that it was simply better not to send the transcriptions, as many of the times they were unreadable.

Well, our users are quite satisfied. Of course, if we can provide better results in an acceptable response time, that's good; that's the goal of this ticket. How have you tested wav2vec2: using IPED or externally in a standalone application?

According to the author: common_voice_11_0 dataset

Common Voice cuts are usually easy data sets; CORAA is a much more difficult Portuguese one, and it would be interesting to evaluate the author's model on CORAA.

This feature would be very interesting for those who do not have a contract for audio transcription with third parties

We also don't have a commercial contract here; that's why I integrated Vosk and, later, Wav2Vec2.

In summary, this ticket is about evaluating Whisper models using an objective metric on the same data sets on which we evaluated the other models. We can also use a more difficult, real-world data set and run all models again, if you are able to share the audios and their correct transcriptions validated by humans. If we reach a well-founded conclusion that it is better on different data sets without a huge performance penalty (maybe 2x-3x would be acceptable), I'll add the implementation when I have available time...

Of course contributions to implement it into IPED are very welcome, please send a PR, I'll be happy to test and review.

@lfcnassif
Member Author

Thanks, I was aware of the first reference, but not the second. I haven't finished yet; I will try to normalize numbers and run wav2vec2 with a language model.

@gfd2020
Collaborator

gfd2020 commented Nov 27, 2023

Hi @lfcnassif ,

I did some tests with faster-whisper, using that test script of yours and replacing the contents of the 'Wav2Vec2Process.py' file.
I also used the model published by @DHoelz (dwhoelz/whisper-medium-pt-ct2).

I got better transcriptions than with wav2vec2, but the performance is worse: about 2x slower.

The strange thing is that when setting the OMP_NUM_THREADS parameter to half of the total logical cores, I got better performance, both locally and on the IPED transcription server.

I also managed to compute the 'finalScore' with faster-whisper; please check whether it is correct.

Below is the result of a small test I did running the IPED server (CPU mode only).

Machine: 2 sockets, 24 logical cores (2 Python transcription processes)

OMP_NUM_THREADS = number of threads

10 audios: 530 seconds with 12 threads (total CPU usage 100%)
10 audios: 509 seconds with 6 threads (total CPU usage 60%)

Perhaps the best configuration of OMP_NUM_THREADS would be (one way to detect the socket count is sketched below):

```python
import os
import psutil

logical_cores = psutil.cpu_count(logical=True)
cpu_sockets = 2  # TODO: find a way to get this value in Python
threads = int(logical_cores / cpu_sockets / 2)
os.environ["OMP_NUM_THREADS"] = str(threads)
```

Now a question: would it be possible to make the wav2vec2 remote server more generic so it also accepts faster-whisper (through a configuration parameter)?

I was also able to get faster-whisper to work offline.

Modified script to compute the finalScore:

```python
import sys
import numpy

stdout = sys.stdout
sys.stdout = sys.stderr

# Protocol messages exchanged with the IPED transcription task
terminate = 'terminate_process'
model_loaded = 'wav2vec2_model_loaded'
huggingsound_loaded = 'huggingsound_loaded'
finished = 'transcription_finished'
ping = 'ping'

def main():

    modelName = 'dwhoelz/whisper-medium-pt-ct2'  # or e.g. 'medium' for the generic model
    #modelName = sys.argv[1]

    deviceNum = sys.argv[2]

    import os
    os.environ["OMP_NUM_THREADS"] = "6"

    from faster_whisper import WhisperModel

    print(huggingsound_loaded, file=stdout, flush=True)

    #import torch
    #cudaCount = torch.cuda.device_count()

    # Run just on CPU for now
    cudaCount = 0

    print(str(cudaCount), file=stdout, flush=True)

    if cudaCount > 0:
        deviceId = 'cuda:' + deviceNum
    else:
        deviceId = 'cpu'

    try:
        model = WhisperModel(modelName, device=deviceId, compute_type="int8")

    except Exception as e:
        if deviceId != 'cpu':
            # loading on GPU failed (OOM?), try on CPU
            deviceId = 'cpu'
            model = WhisperModel(model_size_or_path=modelName, device=deviceId, compute_type="int8")
        else:
            raise e

    print(model_loaded, file=stdout, flush=True)
    print(deviceId, file=stdout, flush=True)

    while True:

        line = input()

        if line == terminate:
            break
        if line == ping:
            print(ping, file=stdout, flush=True)
            continue

        transcription = ''
        probs = []
        try:
            # word_timestamps=True exposes word-level probabilities,
            # which are averaged below into the finalScore
            segments, info = model.transcribe(audio=line, language='pt', beam_size=5, word_timestamps=True)
            for segment in segments:
                transcription += segment.text
                words = segment.words
                if words is not None:
                    probs += [word.probability for word in words]
        except Exception as e:
            msg = repr(e).replace('\n', ' ').replace('\r', ' ')
            print(msg, file=stdout, flush=True)
            continue

        text = transcription.replace('\n', ' ').replace('\r', ' ')

        probs = probs if len(probs) != 0 else [0]
        finalScore = numpy.average(probs)

        print(finished, file=stdout, flush=True)
        print(str(finalScore), file=stdout, flush=True)
        print(text, file=stdout, flush=True)

    return

if __name__ == "__main__":
    main()
```

@lfcnassif
Member Author

Thank you @gfd2020!

I got better transcriptions than wav2vec, but the performance is worse, 2x slower.

Did you measure WER or use another evaluation metric?

I also managed to get the 'finalscore' on fast-whisper. Then check if it is correct.

Thank you very much, that is very important!

Now a question. Would it be possible to make the wav2vec2 remote server more generic to also accept fast-whisper (Through configuration parameter)?

Sure. That is the goal; the final integration will use a configuration approach.

@gfd2020
Collaborator

gfd2020 commented Nov 27, 2023

Did you measure WER or used other evaluation metric?

Unfortunately I did not measure WER; it was just a manual check of the texts obtained.
The Whisper model also produces punctuation and caps and takes less memory, in my case 1-1.5 GB per Python process.

@lfcnassif
Member Author

Try whisper.cpp.

It seems whisper.cpp has improved a lot since the last time I tested it. Now they have NVIDIA GPU support:

https://github.com/ggerganov/whisper.cpp#nvidia-gpu-support

It may be worth another try; what do you think, @fsicoli?

@lfcnassif
Member Author

lfcnassif commented Nov 29, 2023

It may be worth another try

Tested the speed some minutes ago: for a 434s audio, the medium model took 35s and the large-v3 model took 65s to transcribe using 1 RTX 3090. Seems a bit faster than faster-whisper on that GPU.

@rafael844

rafael844 commented Nov 29, 2023

Tested the speed some minutes ago: for a 434s audio, the medium model took 35s and the large-v3 model took 65s to transcribe using 1 RTX 3090. Seems a bit faster than faster-whisper on that GPU.

Is there a snapshot for testing? Or a script we could put into IPED like the one above?

@lfcnassif
Member Author

Is there some snapshot for testing? Or script we could put in iped as the above.

No, I just did a preliminary test of whisper.cpp directly on a single audio from command line without IPED.

@gfd2020
Collaborator

gfd2020 commented Nov 29, 2023

I changed the parameter from beam_size=5 to beam_size=1: the performance improved by 35% and the quality stayed more or less the same.
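
For reference, a minimal sketch of that change in faster-whisper terms (model name, device and audio path are illustrative): beam_size=1 switches from beam search to greedy decoding:

```python
from faster_whisper import WhisperModel

model = WhisperModel("dwhoelz/whisper-medium-pt-ct2", device="cpu", compute_type="int8")
segments, info = model.transcribe(audio="sample.wav", language="pt",
                                   beam_size=1,  # greedy decoding instead of beam_size=5
                                   word_timestamps=True)
print("".join(segment.text for segment in segments))
```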

@gfd2020
Collaborator

gfd2020 commented Nov 30, 2023

Is there some snapshot for testing? Or script we could put in iped as the above.

No, I just did a preliminary test of whisper.cpp directly on a single audio from command line without IPED.

If it were integrated into IPED, would it be via Java JNA and the DLL?

@lfcnassif
Member Author

If it is integrated into iped, would it be via java JNA and the DLL?

You mean this? https://github.com/ggerganov/whisper.cpp/blob/master/bindings/java/README.md

Possibly. Since directly linked native code may cause application crashes (as I experienced with faster-whisper), there are other options too, like the whisper.cpp server:
https://github.com/ggerganov/whisper.cpp/tree/master/examples/server

Or a custom server process without the HTTP overhead.

@hilderonny

I also fiddled around with several Whisper solutions and ended up with a simple client-server solution.

On one hand, there is an IPED Python task which pushes all audio and video files to a network share for further processing. On the other hand, there is a separate background process which watches those shares, transcribes and translates the media files, and writes back a JSON file with the results. These JSON files are finally parsed by the IPED task and merged into the metadata of the files (a rough sketch of such a watcher follows the list below).

This gives you three advantages:

  1. You serialize the processing of the files even when you have many workers, so you can transcribe even on a machine with low computation power and a smaller GPU.
  2. The results are indexed by IPED and can be searched via keywords.
  3. You can start as many background processes on as many different network machines as you want to speed up the processing. This helped me with a case with thousands of Arabic voice messages to process.
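
A rough sketch of what such a background watcher could look like (the share path, polling interval, file naming and JSON fields are assumptions for illustration, not the protocol of the actual repositories):

```python
import json
import time
from pathlib import Path

from faster_whisper import WhisperModel

SHARE = Path("/mnt/transcription-share")   # network share watched for new audio files
model = WhisperModel("medium", device="cpu", compute_type="int8")

while True:
    for audio in SHARE.glob("*.wav"):
        result_file = audio.with_suffix(".json")
        if result_file.exists():
            continue  # already processed by this or another worker
        segments, info = model.transcribe(str(audio))
        text = "".join(s.text for s in segments)
        result_file.write_text(json.dumps({"file": audio.name, "text": text},
                                          ensure_ascii=False))
    time.sleep(10)  # poll the share periodically
```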

Here are the repositories for the task and the background process:

Maybe you find the solution useful.

Greetings, Ronny

@lfcnassif
Member Author

Thanks @hilderonny for sharing your solution!

Which Whisper implementation are you using? Standard whisper, faster-whisper, whisper.cpp, whisper-jax?

@hilderonny

I am using faster-whisper because this implementation is also able to separate speakers by splitting up the transcription into parts and is a lot faster in processing the media files.

@lfcnassif
Member Author

I'm evaluating 3 other Whisper implementations: Whisper.cpp, Insanely-Fast-Whisper and WhisperX. The last 2 are much faster for long audios, since they break them into 30s pieces and run batched inference on many audio segments at the same time, at the cost of higher GPU VRAM usage. Quoting #2165 (comment):

Speed numbers over a single 442s audio using 1 RTX 3090, medium model, float16 precision (except whisper.cpp since it can't be set):

  • Faster-Whisper took ~36s
  • Whisper.cpp took ~31s
  • Insanely-Fast-Whisper took ~7s
  • WhisperX took ~5s

Running over the data set of 151 small real-world audios, with a total duration of 2758s:

  • Faster-Whisper took 220s
  • Whisper.cpp 185s
  • Insanely-Fast-Whisper 358s
  • WhisperX took 171s

PS: Whisper.cpp seems to parallelize better than others using multiple processes, so its last number could be improved.
PS2: For inference on CPU, Whisper.cpp is faster than Faster-Whisper by ~35%, not sure if I will time all of them on CPU...
PS3: Using the large-v3 model with Whisper.cpp produced hallucinations (repeated text and a bit of non-existent text); this was also observed with Faster-Whisper, to a lesser degree.
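
For context, a minimal sketch of the batched WhisperX usage being evaluated, based on the WhisperX README (model size, batch size and language are illustrative):

```python
import whisperx

device = "cuda"
model = whisperx.load_model("medium", device, compute_type="float16", language="pt")

audio = whisperx.load_audio("sample.wav")
# batch_size controls how many 30s chunks are decoded at once;
# larger values are faster but use more GPU VRAM.
result = model.transcribe(audio, batch_size=16)
print(" ".join(segment["text"] for segment in result["segments"]))
```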

@lfcnassif
Member Author

lfcnassif commented Apr 16, 2024

Updating WER stats with WhisperX + largeV2 model:
[WER results table attached as image]

The average WER difference to Faster-Whisper + largeV2 is just +0.0018. WhisperX was the best on the TEDx data set so far, a spontaneous-speech data set, also better than Faster-Whisper + Jonatas Grosman's Portuguese fine-tuned largeV2 model.

PS: I'll review the WER of Faster-Whisper + largeV2 on VoxForge; it seems to be an outlier.

PS2: WhisperX + largeV2 took 3h30 to transcribe all data sets (29h duration) while Faster-Whisper + largeV2 took 5h on RTX 3090.

@lfcnassif
Member Author

lfcnassif commented Apr 16, 2024

Updating results with WhisperX + medium model:
[WER results table attached as image]

It took 2h30 to transcribe the whole 29h data set, while Faster-Whisper + medium took 3h30 (both using float16 and beam_size=5) on the RTX 3090.

I also fixed the Faster-Whisper + largeV2 result on VoxForge; it was missing a zero...

@lfcnassif
Member Author

Updating results with the WhisperX + LargeV3 model and WhisperX + JonatasGrosman's LargeV2 model fine-tuned for Portuguese:
[WER results table attached as image]

WhisperX + LargeV3 model took 3h30m
WhisperX + JonatasGrosman's LargeV2 model took 3h45m

I'll try to prototype a Whisper.cpp integration to make its WER evaluation easier on those data sets, since it is faster on CPU and could be an option for some users.

PS: I think the default Whisper + LargeV2 model WER numbers are quite strange; I'll review them too.

@lfcnassif
Member Author

lfcnassif commented Apr 18, 2024

Revised numbers for the Whisper reference implementation + LargeV2 model; they didn't change much:
[updated WER table attached as image]

I'm running the WER evaluation with the Whisper.cpp implementation and will post the numbers soon.

@lfcnassif
Member Author

Updating stats with Whisper.cpp (medium, largeV2 & largeV3 models), Faster-Whisper + LargeV3, the running times of the Whisper models, and the number of empty transcriptions out of all 22,246 audios:
[WER and running-time table attached as image]

Comments:

  • The Whisper.cpp implementation generally increased WER on the tested datasets. It also returned many more empty transcriptions than the other implementations. So, I won't consider integrating it for now;
  • WhisperX WER was competitive with Faster-Whisper WER, it generally returned fewer empty transcriptions, and it was noticeably faster. I'll consider replacing Faster-Whisper with WhisperX if there are no objections;
  • WhisperX can be even faster, because I forgot to disable breaking audios at 59s boundaries, and it is much faster than the others with long audios;

To finish this evaluation, 2 tasks are needed:

  • Normalize number transcriptions: Whisper transcribes numbers using Arabic numerals (except Jonatas Grosman's fine-tuned model), while our reference transcriptions write numbers out as words. That is unfair and inflates all Whisper WER results, especially on the SID data set, which has many numbers (a possible normalization approach is sketched after this list);
  • Evaluate all models on a real-world, non-public audio data set, so we can be sure results are not biased by testing on data sets used for training (at least those with a yellow background). I already have 100-200 manually transcribed real-world audios sent by colleagues (thank you @wladimirleite!) to use for this. @marcus6n, I will need your help next week to validate those manual transcriptions so we can evaluate the models on them;
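
One possible way to normalize numbers before scoring, sketched with the num2words package (an assumption about how the normalization could be done, not the script actually used; the pt_BR output should be double-checked against the reference conventions):

```python
import re
from num2words import num2words  # pip install num2words

def spell_out_numbers(text: str, lang: str = "pt_BR") -> str:
    # Replace each digit sequence with its written-out form, e.g. "1500" -> "mil e quinhentos",
    # so Whisper's Arabic numerals can match references that write numbers as words.
    return re.sub(r"\d+", lambda m: num2words(int(m.group()), lang=lang), text)

print(spell_out_numbers("ele vai adiantar 1500 ou 2000"))
```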

@lfcnassif
Member Author

Speed numbers over a single 442s audio using 1 RTX 3090, medium model, float16 precision (except whisper.cpp since it can't be set):

  • Faster-Whisper took ~36s
  • Whisper.cpp took ~31s
  • Insanely-Fast-Whisper took ~7s
  • WhisperX took ~5s

Transcribing the same 442s audio, medium model, int8 precision (except whisper.cpp, since it can't be set), but on a 24-thread CPU:

  • Faster-Whisper took ~432s
  • Whisper.cpp took ~310s
  • Insanely-Fast-Whisper took ~150s
  • WhisperX took ~170s

@lfcnassif
Member Author

lfcnassif commented Apr 29, 2024

Just found this study from March 2024: https://amgadhasan.substack.com/p/sota-asr-tooling-long-form-transcription

It also supports that our current choice of WhisperX is a good one (I'm just not very happy with the size of WhisperX's dependencies...)

@rafael844

Just found this study from March 2024: https://amgadhasan.substack.com/p/sota-asr-tooling-long-form-transcription

It also supports that our current choice of WhisperX is a good one (I'm just not very happy with the size of WhisperX's dependencies...)

Can't it be offered as an option? Then whoever decides to use WhisperX instead of Faster-Whisper would have to download the dependency pack, such as PyTorch and others.

@lfcnassif
Member Author

lfcnassif commented Apr 29, 2024

Can't it be offered as an option? Then whoever decides to use WhisperX instead of Faster-Whisper would have to download the dependency pack, such as PyTorch and others.

It could, but I don't plan to: package size is the least important aspect to us in my opinion (otherwise we should use Whisper.cpp, which is very small). WhisperX has similar accuracy, is generally faster, and is much faster with long audios; that's more important from my point of view. And keeping 2 different implementations increases the maintenance effort.

@lfcnassif
Member Author

Hi @marcus6n. How is the curation of the real-world audio transcription data set going? It's just 1h of audio; do you think you can finish it today or on Thursday?

@lfcnassif
Member Author

I started the evaluation on the 1h real-world, non-public audio data set yesterday. Thanks @marcus6n for double-checking the transcriptions and @wladimirleite for sending half of them! Preliminary results below (averages still not updated):
[WER results table attached as image]

@lfcnassif
Member Author

lfcnassif commented Oct 2, 2024

For those interested, OpenAI recently published a new Whisper turbo model, with accuracy similar to large-v2 but up to 8x faster, depending on your hardware:
openai/whisper#2363

PS: This converted model can be used with faster-whisper: https://huggingface.co/deepdml/faster-whisper-large-v3-turbo-ct2
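
For example, loading that checkpoint with faster-whisper should just be a matter of pointing WhisperModel at it (untested sketch; device and compute type are illustrative):

```python
from faster_whisper import WhisperModel

model = WhisperModel("deepdml/faster-whisper-large-v3-turbo-ct2",
                     device="cuda", compute_type="float16")
segments, info = model.transcribe("sample.wav", language="pt", beam_size=5)
print("".join(s.text for s in segments))
```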
