Replies: 2 comments
-
it's the hallucination problem #679; the cause is, more or less as you said, the training data #928. For now, the workaround is to extract the vocals with a source separation model like Spleeter or Demucs before transcribing (see the sketch below). If you want more accurate sound classification, it requires fine-tuning.
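A minimal sketch of that workflow, assuming the Demucs CLI and openai-whisper are installed (`pip install demucs openai-whisper`); the file names, model choice, and the `separated/htdemucs/<track>/` output path are illustrative and depend on your Demucs version:

```python
# Separate vocals with Demucs, then transcribe only the vocal stem with Whisper.
import subprocess
from pathlib import Path

import whisper

track = Path("input.mp3")  # hypothetical input file

# Two-stem separation keeps only "vocals" and "no_vocals".
subprocess.run(["demucs", "--two-stems", "vocals", str(track)], check=True)

# Default Demucs v4 model writes to separated/htdemucs/<track stem>/vocals.wav;
# adjust the path if you use a different model or output directory.
vocals = Path("separated") / "htdemucs" / track.stem / "vocals.wav"

model = whisper.load_model("base")
result = model.transcribe(str(vocals))
print(result["text"])
```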
-
Update: you can try https://github.com/YuanGongND/whisper-at (sketch below).
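Roughly following the usage shown in the Whisper-AT README (`pip install whisper-at`); exact function names and arguments may differ between versions, so treat this as a sketch to verify against the repo:

```python
# Whisper-AT returns normal ASR output plus AudioSet-style audio tags per segment.
import whisper_at as whisper

model = whisper.load_model("large-v1")
# at_time_res controls the audio-tagging window in seconds.
result = model.transcribe("audio.wav", at_time_res=10)

print(result["text"])  # regular transcription

# Audio event labels per segment, e.g. "Music", "Siren", "Speech".
tags = whisper.parse_at_label(result, language="follow_asr", top_k=5, p_threshold=-1)
print(tags)
```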
-
Just a quick signpost after running this across a few thousand hours of mixed material: whenever orchestral instrumental music is present, Whisper has a strong bias toward inferring incorrect titles for a handful of works, especially Edward Elgar's "Pomp and Circumstance." I assume there are some graduation ceremonies mixed into the training set, since the hallucinated text comes in a fair number of variations, including different movements, keys, and named attribution to Elgar.
These have been my biggest offenders, but there are hundreds more; YMMV. They are typically (but not always) identifiable in a post-process regex search for text surrounded by brackets, along the lines of the sketch below.
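A minimal sketch of that bracket post-process; the pattern, the title check, and the placeholder label are my own assumptions, not an official Whisper feature:

```python
# Flag (and optionally normalize) bracketed annotations Whisper emits,
# such as "[Pomp and Circumstance]" or "(music playing)".
import re

ANNOTATION = re.compile(r"[\[\(]([^\]\)]{1,80})[\]\)]")

def normalize_annotations(text: str, placeholder: str = "(music playing)") -> str:
    """Replace bracketed annotations that look like music/title hallucinations."""
    def repl(match: re.Match) -> str:
        inner = match.group(1).lower()
        if "music" in inner or "pomp and circumstance" in inner:
            return placeholder
        return match.group(0)  # leave other annotations untouched
    return ANNOTATION.sub(repl, text)

print(normalize_annotations("Welcome. [Pomp and Circumstance by Edward Elgar] Please be seated."))
# -> "Welcome. (music playing) Please be seated."
```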
A generic (music playing) atmospheric-sound normalizer would be a super helpful addition, and/or (a boy can dream) a genre recognizer: (jazz music playing), (orchestral music playing), (vocal music playing). In the same general space, inclusion of non-verbal atmospherics, for which I assume there must be data floating around, such as (phone ringing), (siren), (car horn), would be a welcome option as well.