add XTTSv2 #4673
Conversation
Merge dev branch
It seems that this implementation fails with a ZeroDivisionError when the generation contains unpronounceable sequences.
Do you know what the text was?
Nice job! I've noticed XTTSv2 also supports streaming. Do you think it's possible to use it in conjunction with token streaming, or to have audio generated as soon as each sentence is finished? Since the TTS model stays loaded in VRAM, running it alongside text generation should be possible.
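One way the sentence-by-sentence idea could work is sketched below. This is a hypothetical helper, not code from this PR: it buffers streamed tokens and emits each sentence as soon as it is complete, so a TTS call could start while the text generator keeps producing tokens.

```python
import re

# Matches a sentence-ending punctuation mark followed by whitespace.
SENTENCE_END = re.compile(r'([.!?])\s')

def stream_sentences(token_stream):
    """Yield complete sentences as tokens arrive from the text generator.

    Each yielded sentence could be handed to the TTS engine immediately,
    instead of waiting for the whole generation to finish.
    """
    buffer = ""
    for token in token_stream:
        buffer += token
        while True:
            match = SENTENCE_END.search(buffer)
            if not match:
                break
            end = match.end(1)
            yield buffer[:end].strip()
            buffer = buffer[end:]
    # Flush whatever remains once the stream ends.
    if buffer.strip():
        yield buffer.strip()
```

For example, feeding it the token stream `["Hello", " world", ". How", " are you?", " Done."]` yields `"Hello world."`, then `"How are you?"`, then `"Done."`, one sentence at a time.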
It was a stop token '</s>' after the asterisk '*' that caused the problem. It works normally when the stop token is not prefixed by an asterisk, though.
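A simple guard against this class of failure would be to strip asterisk-wrapped "action" text and bare asterisks before handing the string to the TTS engine, and to skip synthesis entirely when nothing speakable remains. The helper below is a hypothetical preprocessing step, not the extension's actual code:

```python
import re

def sanitize_for_tts(text):
    """Remove asterisk-wrapped actions and stray asterisks that the TTS
    engine cannot pronounce. Returns None if nothing speakable remains,
    so the caller can skip synthesis instead of dividing by zero on an
    empty/unpronounceable input.
    """
    # Drop *actions* like "*smiles*", then any leftover lone asterisks.
    text = re.sub(r'\*[^*]*\*', ' ', text)
    text = text.replace('*', ' ')
    # Collapse runs of whitespace.
    text = re.sub(r'\s+', ' ', text).strip()
    return text or None
```

A caller would then only invoke the model when `sanitize_for_tts(...)` returns a non-None string.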
I made the structure more similar to silero_tts and made various fixes. I think this looks pretty good now and it's working reliably. @kanttouchthis I ended up removing the narrator feature for simplicity and will accept your PR to text-generation-webui-extensions for people who want to try it. The only remaining issue is that the TTS library apparently re-downloads the model every time instead of using the existing cache. I'll merge this PR and try to find a solution to that in a future one.
The model cache issue was fixed in TTS 0.20.6.
I'm seeing some oddity with the asterisk issue mentioned above. It causes the TTS to generate 2-4 seconds of strange sounds, or sometimes to cut out some of the speech before restarting a sentence or two later. [Screenshots: what you see in the web interface vs. what you see in the command prompt/text generation.] I've listened to quite a few generations now and watched a lot of the command prompt/terminal output, and as best I can tell, it happens when that asterisk gets split/broken out. I'm not sure if it's specific to some models or a general issue. I also suspect it's badly impacting generation time: generations that seem to suffer this issue take a bit longer to process, even though the actual audio output isn't any longer. I'm on the current build of the coqui_tts extension (at time of writing).
Coqui also supports using different voices for the narrator, etc. Can this feature be added? It already exists in the extension located here: https://github.com/kanttouchthis/text_generation_webui_xtts
I'd also love to see that, but I think there is more to it than just calling the TTS engine's streaming mode. I opened a feature request on this topic that also describes the difference between text-generation streaming and TTS streaming, which needs to be reconciled: #4706. Maybe the text-generation team can comment on it, but I think it makes sense to have a dedicated issue for this topic.
@oobabooga @kanttouchthis Could you please have a look at this: #4712. I have found a solution to speeding up speech generation for people in a low-VRAM situation. I've written some code (badly) that works, but, not actually being a coder, someone would need to integrate it properly into script.py (and tidy up the code). Thanks.
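One common pattern for low-VRAM setups is to keep the TTS model parked on the CPU and move it to the GPU only for the duration of each synthesis call. The sketch below is an assumption about how such a fix might be structured, not the code from #4712; `model` is anything with a torch-style `.to(device)` method:

```python
from contextlib import contextmanager

@contextmanager
def on_device(model, device="cuda", idle_device="cpu"):
    """Move `model` to `device` for the duration of the block, then park
    it back on `idle_device` to free VRAM between generations.

    `model` only needs a torch-style .to(device) method; with a real
    Coqui TTS model this would shuttle the weights between GPU and CPU.
    """
    model.to(device)
    try:
        yield model
    finally:
        # Always move back, even if synthesis raises.
        model.to(idle_device)
```

Usage would look like `with on_device(tts_model): tts_model.tts_to_file(...)`. The trade-off is the transfer latency on each call, which may still be faster overall than running the model on a GPU that is starved for memory.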
@oobabooga I'm also curious about the source of the voice files. |
Checklist:
Description
Adds XTTSv2 for multilingual TTS with voice cloning.
Installation needs further testing but seems to work on Windows. Dependencies may cause conflicts.
Edit: example