Bengali translation uses Devanagari instead of "Bangla / Bengali" fonts #203
Replies: 2 comments
-
It could be due to the lack of sufficient training data. The exact same thing happens with Marathi. Hindi and Marathi use the same script, so most of the time the transcription of Marathi audio ends up with a bunch of Hindi words in it. Or, as you said, longer words are sometimes written in a way that is "phonetically" correct, but that's not how they are actually spelt.
-
Unfortunately, Whisper won't really work on Bengali, even with the largest model. You can see the rough performance breakdown in Table 13 of the paper, where Bengali stays above 100% WER. That is worse than other supported languages of the Indian subcontinent such as Hindi, Urdu, Marathi, Nepali, and Tamil, which showed decreasing WERs as the model gets larger. The table also shows above-100% WERs on Gujarati, Punjabi, and Sindhi. This is primarily because of the lack of sufficient training data, as @athu16 pointed out. You may want to try fine-tuning the model, since the model is probably already encoding the phonetics of the Indo-Aryan languages, and what it needs is a good language model to decode them.
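For anyone wondering how a WER can go above 100%: the error count includes insertions, so a hypothesis with more wrong words than the reference has words pushes the ratio past 1. Below is a minimal sketch, assuming the usual word-level edit-distance definition and whitespace tokenization with no text normalization (the paper's evaluation normalizes text first, so its numbers will differ). The function name and the sample strings are illustrative, not taken from the paper.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count.

    Assumption: words are split on whitespace and compared exactly,
    with no normalization of punctuation or spelling variants.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # Word-level Levenshtein distance, keeping only one row at a time.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


# Illustrative only: a Bengali reference against a Devanagari hypothesis.
# No word matches exactly, and the extra word counts as an insertion.
reference = "আমি ভাত খাই"
hypothesis = "आमि भात खाइ खाइ"
print(f"WER: {word_error_rate(reference, hypothesis):.0%}")
```

Under exact string matching, output in the wrong script scores every word as an error even when it is phonetically close, which is one way a language can end up stuck at or above 100%.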
-
Hi, I was experimenting with Whisper's Bengali STT and found that it outputs Devanagari script (the same script used for Hindi), even though Bengali has its own letters and characters. The output is phonetically correct, in the sense that when it is read as Devanagari it isn't too far off from the original Bengali sound (Devanagari and Bengali are closely related). It is like writing Spanish or German in plain English, substituting Ü and Ö (umlauts) with the most similar-sounding English phonemes.
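A quick way to check whether a given transcript came out in the wrong script is to count characters by Unicode block. This is a rough sketch, assuming only the two block ranges (Devanagari U+0900 to U+097F, Bengali U+0980 to U+09FF); the function names are made up for illustration and are not part of Whisper.

```python
DEVANAGARI = range(0x0900, 0x0980)  # Unicode Devanagari block
BENGALI = range(0x0980, 0x0A00)     # Unicode Bengali block
# The danda and double danda live in the Devanagari block but are also
# used as punctuation in Bengali text, so they are not counted as evidence.
SHARED_PUNCTUATION = {0x0964, 0x0965}


def script_counts(text: str) -> dict:
    """Count non-space characters that fall in each script's Unicode block."""
    counts = {"bengali": 0, "devanagari": 0, "other": 0}
    for ch in text:
        code = ord(ch)
        if ch.isspace() or code in SHARED_PUNCTUATION:
            continue
        if code in BENGALI:
            counts["bengali"] += 1
        elif code in DEVANAGARI:
            counts["devanagari"] += 1
        else:
            counts["other"] += 1
    return counts


def dominant_script(text: str) -> str:
    """Return 'bengali', 'devanagari', or 'other' if neither script appears."""
    counts = script_counts(text)
    if counts["bengali"] == 0 and counts["devanagari"] == 0:
        return "other"
    if counts["bengali"] >= counts["devanagari"]:
        return "bengali"
    return "devanagari"


# Illustrative strings: the same phrase written in each script.
print(dominant_script("আমি ভাত খাই"))
print(dominant_script("आमि भात खाइ"))
```

Running this over the transcript text returned for Bengali audio would flag the segments that came back in Devanagari.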