whisperX with diarization - KeyError: 'speaker' #117
Comments
@foolishgrunt, the command works without any issue on my end.
Wiped the old venv, reinstalled, and receive the same error. I've double and triple checked that my token is correct, and that I've accepted the terms for At this point I have to believe there's some other required |
In that case, let us debug it together.
Thanks for taking a look at this. The audio file in question is pretty long - it's a recording of a city council meeting, and it's 2.5 hours long. Just now it occurred to me that the length might be part of the issue, so I got a second, much shorter file and ran it with the same parameters as before:
The process finishes, and I get my .srt file. So... might the diarization model have a length limit that the first file violates? However, I don't see any special formatting in the second file: nothing like what I would expect after having set "speaker_labels" to "True." In fact, the formatting seems to be exactly the same as in the .srt file I got from the first audio file when I ran with no speaker_labels:
Am I missing something? EDIT: I answered the second question on my own: I re-exported as .ass instead of .srt, and lo and behold, the file has speaker labels. |
Yes, you need The shorter file worked without any issues on my end as well. My suggestion now is to try with the Otherwise, I would suggest cutting your long file into smaller pieces (though you'll need to find the maximum length that can be processed without issues) and running the command on the folder containing all the files to batch process them. Another approach is to use something similar to what I did in this example
You can take a look at the colab notebook to see how it works. I hope this helps! |
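For anyone who wants to try the splitting route, here is a minimal sketch of one way to cut a long recording into fixed-length pieces. It assumes ffmpeg is installed and on the PATH; the function names, the `chunks` folder, and the 3000-second default are placeholders for illustration, not part of subsai or whisperX.

```python
import subprocess
from pathlib import Path


def build_split_command(src, out_dir, chunk_seconds=3000):
    """Build an ffmpeg command that cuts `src` into fixed-length pieces.

    Uses ffmpeg's segment muxer with stream copy, so the audio is not
    re-encoded. The chunk length is an assumption to tune by experiment.
    """
    src = Path(src)
    # Output files are numbered: <stem>_000<ext>, <stem>_001<ext>, ...
    pattern = Path(out_dir) / f"{src.stem}_%03d{src.suffix}"
    return [
        "ffmpeg",
        "-i", str(src),
        "-f", "segment",
        "-segment_time", str(chunk_seconds),
        "-reset_timestamps", "1",  # each piece starts at time zero
        "-c", "copy",
        str(pattern),
    ]


def split_audio(src, out_dir, chunk_seconds=3000):
    """Create `out_dir` if needed and run the split command."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    subprocess.run(build_split_command(src, out_dir, chunk_seconds), check=True)
```

Calling `split_audio("audio.m4a", "chunks")` should leave numbered pieces in `chunks`, which can then be processed as a folder. Because stream copy cuts at packet boundaries, the pieces may differ slightly from the requested length.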
Thanks for the tips. As a quick test, I ran the longer file in one of the demo instances linked to from the whisperX main page, and it also failed with speaker labels selected. So since it now seems pretty clear that this is not a But on a final note, since I learned that the speaker labels option requires selecting |
Thanks @foolishgrunt for the quick test. At least we are sure now it's not a bug in the project. I hope the Regarding the speaker labels, the problem lies in the
Or
I think this way you'll see the speaker labels in the srt file as well. What do you think? Which format do you think is better?
Excellent idea! Your first proposal is exactly the format I was planning to do by hand.
Perfect. Please update the package to the latest version and give it a try.
Done, tested with the same file - looks good here! As for my long file, I've decided to split it into progressively smaller pieces until I find something short enough that it doesn't error out. Hopefully I find that the limit is still long enough that splitting it into batches doesn't become unworkable. EDIT: The process just completed for a 53 minute segment, so it looks like ~1 hour is the limit for diarization. |
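For planning the batches, a small sketch like the one below works out evenly sized pieces that all stay under a chosen cap. The 53-minute cap comes only from the single successful run mentioned above, so treat it as an observation rather than a documented limit; `plan_chunks` is a hypothetical helper name.

```python
import math


def plan_chunks(total_seconds, max_chunk_seconds):
    """Return (start, end) pairs, in seconds, for equal-length chunks.

    Every chunk is at most `max_chunk_seconds` long, and together the
    chunks cover the whole recording.
    """
    if total_seconds <= 0 or max_chunk_seconds <= 0:
        raise ValueError("durations must be positive")
    # Fewest chunks that keeps each one under the cap.
    count = math.ceil(total_seconds / max_chunk_seconds)
    length = total_seconds / count
    return [(round(i * length), round((i + 1) * length)) for i in range(count)]


# A 2.5-hour recording with a 53-minute cap.
for start, end in plan_chunks(150 * 60, 53 * 60):
    print(start, end)
```

With those numbers, the 2.5-hour meeting comes out as three 50-minute pieces.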
Sounds good! Anyways, let me know if you find any other issues. |
I'm not 100% confident this is a bug rather than a user error, but I've dug through all the relevant documentation I can find and can't find any clues. I've accepted the user terms at https://huggingface.co/pyannote/segmentation-3.0 and https://huggingface.co/pyannote/speaker-diarization-3.1, and I'm passing my access token, so I don't know why it returns this error.
subsai audio.m4a --model m-bain/whisperX --model-configs '{"model_type": "base.en", "speaker_labels": "True", "HF_TOKEN": "[token]"}' --format srt
If I run the same command without the "speaker_labels": "True" argument, then I get a nicely formatted .srt file. But whenever I get greedy and try to label the speakers, this is the output I get: