
add XTTSv2 #4673

Merged (19 commits into dev, Nov 21, 2023)

Conversation

kanttouchthis
Contributor

@kanttouchthis kanttouchthis commented Nov 20, 2023

Description

Adds XTTSv2 for multilingual TTS with voice cloning.
Installation needs further testing but seems to work on Windows. Dependencies may cause conflicts.
Edit: example
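
For reference, a minimal sketch of the underlying Coqui TTS call this extension builds on; the reference clip and output path below are placeholders, not part of this PR:

```python
# Minimal XTTSv2 usage via the Coqui TTS API (a sketch; file paths
# are placeholders).
from TTS.api import TTS

tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2", gpu=True)
tts.tts_to_file(
    text="Hello, this is a test of XTTS version two.",
    speaker_wav="voice.wav",  # short reference clip to clone the voice from
    language="en",
    file_path="out.wav",
)
```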

@TeuMasaki

It seems that this implementation fails with a ZeroDivisionError when the generation contains unpronounceable sequences:

```
['She pauses, watching you make your way over to the chair and collapse into it with relief.']
 > Processing time: 1.938103199005127
 > Real-time factor: 0.2928350478159128
 > Text splitted to sentences.
 > Processing time: 0.0
Traceback (most recent call last):
  File "D:\oobabooga\text-generation-webui\installer_files\env\Lib\site-packages\gradio\queueing.py", line 407, in call_prediction
    output = await route_utils.call_process_api(
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\oobabooga\text-generation-webui\installer_files\env\Lib\site-packages\gradio\route_utils.py", line 226, in call_process_api
    output = await app.get_blocks().process_api(
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\oobabooga\text-generation-webui\installer_files\env\Lib\site-packages\gradio\blocks.py", line 1550, in process_api
    result = await self.call_function(
             ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\oobabooga\text-generation-webui\installer_files\env\Lib\site-packages\gradio\blocks.py", line 1199, in call_function
    prediction = await utils.async_iteration(iterator)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\oobabooga\text-generation-webui\installer_files\env\Lib\site-packages\gradio\utils.py", line 519, in async_iteration
    return await iterator.__anext__()
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\oobabooga\text-generation-webui\installer_files\env\Lib\site-packages\gradio\utils.py", line 512, in __anext__
    return await anyio.to_thread.run_sync(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\oobabooga\text-generation-webui\installer_files\env\Lib\site-packages\anyio\to_thread.py", line 33, in run_sync
    return await get_asynclib().run_sync_in_worker_thread(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\oobabooga\text-generation-webui\installer_files\env\Lib\site-packages\anyio\_backends\_asyncio.py", line 877, in run_sync_in_worker_thread
    return await future
           ^^^^^^^^^^^^
  File "D:\oobabooga\text-generation-webui\installer_files\env\Lib\site-packages\anyio\_backends\_asyncio.py", line 807, in run
    result = context.run(func, *args)
             ^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\oobabooga\text-generation-webui\installer_files\env\Lib\site-packages\gradio\utils.py", line 495, in run_sync_iterator_async
    return next(iterator)
           ^^^^^^^^^^^^^^
  File "D:\oobabooga\text-generation-webui\installer_files\env\Lib\site-packages\gradio\utils.py", line 649, in gen_wrapper
    yield from f(*args, **kwargs)
  File "D:\oobabooga\text-generation-webui\modules\chat.py", line 342, in generate_chat_reply_wrapper
    for i, history in enumerate(generate_chat_reply(text, state, regenerate, _continue, loading_message=True)):
  File "D:\oobabooga\text-generation-webui\modules\chat.py", line 310, in generate_chat_reply
    for history in chatbot_wrapper(text, state, regenerate=regenerate, _continue=_continue, loading_message=loading_message):
  File "D:\oobabooga\text-generation-webui\modules\chat.py", line 278, in chatbot_wrapper
    output['visible'][-1][1] = apply_extensions('output', output['visible'][-1][1], state, is_chat=True)
                               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\oobabooga\text-generation-webui\modules\extensions.py", line 224, in apply_extensions
    return EXTENSION_MAP[typ](*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\oobabooga\text-generation-webui\modules\extensions.py", line 82, in _apply_string_extensions
    text = func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "D:\oobabooga\text-generation-webui\extensions\XTTSv2\script.py", line 153, in output_modifier
    return tts_narrator(string)
           ^^^^^^^^^^^^^^^^^^^^
  File "D:\oobabooga\text-generation-webui\extensions\XTTSv2\script.py", line 135, in tts_narrator
    tts.tts_to_file(text=turn,
  File "D:\oobabooga\text-generation-webui\installer_files\env\Lib\site-packages\TTS\api.py", line 403, in tts_to_file
    wav = self.tts(text=text, speaker=speaker, language=language, speaker_wav=speaker_wav, **kwargs)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\oobabooga\text-generation-webui\installer_files\env\Lib\site-packages\TTS\api.py", line 341, in tts
    wav = self.synthesizer.tts(
          ^^^^^^^^^^^^^^^^^^^^^
  File "D:\oobabooga\text-generation-webui\installer_files\env\Lib\site-packages\TTS\utils\synthesizer.py", line 492, in tts
    print(f" > Real-time factor: {process_time / audio_time}")
                                  ~~~~~~~~~~~~~^~~~~~~~~~~~
ZeroDivisionError: float division by zero
```
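
The division that fails is `process_time / audio_time` with `audio_time == 0`, i.e. the sentence produced no audio at all. Until the library guards against that, the extension could skip fragments with nothing pronounceable in them. A minimal sketch, assuming a simple letters-or-digits heuristic:

```python
import re

def speakable(text: str) -> bool:
    # Heuristic: strip '</s>'-style stop tokens, then require at least
    # one letter or digit. Fragments like '*' or '...' are rejected, so
    # they never reach the TTS call that divides by audio_time.
    text = re.sub(r"</?\w+>", "", text)
    return bool(re.search(r"[A-Za-z0-9]", text))

print([s for s in ["She pauses.", "*", "</s>"] if speakable(s)])
# -> ['She pauses.']
```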

@kanttouchthis
Contributor Author

Do you know what the text was?

@Dampfinchen

Dampfinchen commented Nov 20, 2023

Nice job! I've noticed XTTSv2 also supports streaming. Do you think it's possible to use it in conjunction with token streaming, or to have audio generated immediately after each sentence is finished? Since the TTS model stays loaded in VRAM, using it simultaneously with text generation should be possible.
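
A rough sketch of the sentence-buffered idea, with a stand-in `synthesize` in place of the real `tts_to_file` call and a fake token stream; this is an illustration, not the extension's code:

```python
import queue
import re
import threading

SENTENCE_END = re.compile(r"(?<=[.!?])\s+")

def synthesize(sentence: str) -> None:
    # Stand-in for the real call, e.g. tts.tts_to_file(text=sentence, ...).
    print("TTS:", sentence)

def stream_tts(token_stream):
    """Buffer streamed tokens and hand each completed sentence to a
    background TTS worker, so synthesis overlaps with text generation."""
    q: queue.Queue = queue.Queue()

    def worker():
        while True:
            s = q.get()
            if s is None:
                break
            synthesize(s)

    t = threading.Thread(target=worker, daemon=True)
    t.start()

    buf = ""
    for token in token_stream:
        buf += token
        parts = SENTENCE_END.split(buf)
        for sentence in parts[:-1]:   # complete sentences
            q.put(sentence)
        buf = parts[-1]               # keep the trailing fragment
    if buf.strip():
        q.put(buf)
    q.put(None)                       # signal the worker to stop
    t.join()

# Usage with a fake token stream:
stream_tts(iter(["Hello ", "there. ", "How ", "are ", "you?"]))
```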

@TeuMasaki

TeuMasaki commented Nov 20, 2023

> Do you know what the text was?

It was the stop token '</s>' after the asterisk '*' that caused the problem. It works normally when the stop token is not preceded by an asterisk.

`*Mishka explains her understanding of the Chinese city based on your description.*</s>`

```
 > Text splitted to sentences.
['Mishka explains her understanding of the Chinese city based on your description.']
 > Processing time: 1.7906074523925781
 > Real-time factor: 0.3192755719146748
 > Text splitted to sentences.
 > Processing time: 0.0
Traceback (most recent call last):
...
```
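
One possible guard in the extension's `output_modifier` would be to strip stop tokens and bare narration markers before the text reaches the sentence splitter. A sketch, assuming '</s>' is the only stop token that needs handling:

```python
import re

def clean_for_tts(text: str) -> str:
    text = text.replace("</s>", "")  # drop the stop token seen above
    text = text.replace("*", "")     # drop narration markers so a bare
                                     # '*' never becomes its own sentence
    return re.sub(r"\s+", " ", text).strip()

print(clean_for_tts('*Mishka explains her understanding of the Chinese city based on your description.*</s>'))
# -> 'Mishka explains her understanding of the Chinese city based on your description.'
```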

@oobabooga
Owner

I made the structure more similar to silero_tts and made various fixes. I think this looks pretty good now, and it's working reliably.

@kanttouchthis I ended up removing the narrator feature for simplicity and will accept your PR to text-generation-webui-extensions for people who want to try it.


The only remaining issue is that the TTS library apparently re-downloads the model every time instead of using the existing cache. I'll merge this PR and try to find a solution to that in a future one.
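
As a stopgap, the TTS API can also load the model from a local copy instead of by name, which avoids the download path entirely. A sketch; the paths are placeholders for wherever the model files actually live:

```python
from TTS.api import TTS

# Load XTTSv2 from an already-downloaded copy so the library never
# hits the network. Both paths below are placeholders.
tts = TTS(
    model_path="models/xtts_v2",
    config_path="models/xtts_v2/config.json",
).to("cuda")
```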

@oobabooga oobabooga merged commit 8dc9ec3 into oobabooga:dev Nov 21, 2023
@kanttouchthis
Contributor Author

The model cache issue was fixed in TTS 0.20.6.

@erew123
Contributor

erew123 commented Nov 22, 2023

I'm seeing some oddity related to the asterisk issue mentioned above. It causes the TTS to generate 2-4 seconds of strange sounds, or sometimes to cut out some of the speech before restarting a sentence or two later.

What you see in the web interface:
`*This is a narrative description.* "This is the character speaking."`

What you see in the command prompt/terminal:
`"*This is a narrative description.", '*', '"This is the character speaking."'`

I've listened to quite a few generations now and looked at a lot of the command prompt/terminal output, and as best I can tell, it happens when that asterisk gets split out on its own. I'm not sure if it's specific to some models or a general issue.

I also suspect it's hurting generation time: generations that suffer from this issue seem to take a bit longer to process, even though the actual audio output isn't any longer. A sketch of a possible fix follows.
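
If the cause is the sentence splitter emitting the bare '*' as its own item, filtering those fragments out after the split might avoid both the noise and the wasted processing time. A sketch, with `sentences` standing in for the splitter's actual output:

```python
def drop_bare_markers(sentences):
    # Remove fragments that are only asterisks/quotes/whitespace, like
    # the lone '*' in the terminal output above; those produce a few
    # seconds of garbled audio instead of speech.
    return [s for s in sentences if s.strip('*"\' \t')]

sentences = ['*This is a narrative description.', '*',
             '"This is the character speaking."']
print(drop_bare_markers(sentences))
# -> ['*This is a narrative description.', '"This is the character speaking."']
```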

I'm on the current build of the coqui_tts extension (at time of writing).

@ElhamAhmedian

Which loader should be used in the extension?

[screenshot]

Thanks

@allenhs

allenhs commented Nov 22, 2023

Coqui also supports using different voices for the narrator, etc. Can this feature be added? It already exists in the extension located here: https://github.com/kanttouchthis/text_generation_webui_xtts
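
A rough sketch of how a two-voice dispatch could look; the regex, role names, and reference clips are illustrative assumptions, not the linked extension's actual code:

```python
import re

# '*...*' chunks are narration, '"..."' chunks are dialogue; route each
# to its own (hypothetical) reference clip.
VOICES = {"narrator": "narrator.wav", "character": "character.wav"}
CHUNK = re.compile(r'\*([^*]+)\*|"([^"]+)"')

def split_roles(reply: str):
    for m in CHUNK.finditer(reply):
        if m.group(1):
            yield "narrator", m.group(1).strip()
        else:
            yield "character", m.group(2).strip()

for role, text in split_roles('*She waves.* "Hello there!"'):
    print(role, "->", VOICES[role], ":", text)
    # e.g. tts.tts_to_file(text=text, speaker_wav=VOICES[role], ...)
```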

@aios-ai

aios-ai commented Nov 22, 2023

> Nice job! I've noticed XTTSv2 also supports streaming. Do you think it's possible to use it in conjunction with token streaming, or to have audio generated immediately after each sentence is finished? Since the TTS model stays loaded in VRAM, using it simultaneously with text generation should be possible.

I'd also love to see that, but I think there is more to it than just calling the TTS engine's streaming mode. I've opened a feature request on this topic that also describes the differences between text-generation streaming and TTS streaming that would need to be reconciled: #4706

Maybe the text-generation-webui team can comment on it, but I think it makes sense to have a dedicated issue for this topic.

@erew123
Contributor

erew123 commented Nov 24, 2023

@oobabooga @kanttouchthis Could you please have a look at #4712? I've found a way to speed up speech generation for people in low-VRAM situations. I've written some code (badly) that works, but since I'm not actually a coder, someone would need to integrate it properly into script.py (and tidy up the code). A sketch of the general idea is below.
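
The general technique is to keep the TTS model in system RAM while the LLM is generating and borrow VRAM only for the synthesis call. A sketch of that idea; see #4712 for the actual proposal:

```python
import torch

def synthesize_low_vram(tts_model, text, **kwargs):
    # Move the TTS model to the GPU only for the synthesis call, then
    # hand the VRAM back to the LLM. A sketch of the general low-VRAM
    # technique, not necessarily what #4712 implements.
    tts_model.to("cuda")
    try:
        return tts_model.tts_to_file(text=text, **kwargs)
    finally:
        tts_model.to("cpu")
        torch.cuda.empty_cache()  # release cached GPU allocations
```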

Thanks

@morozig

morozig commented Mar 23, 2024

Hi guys! Can you please tell me where those voices came from? Are they Creative Commons licensed in any way? I'm wondering if I can use them in a video game.
[screenshot]

@101100

101100 commented Apr 30, 2024

@oobabooga I'm also curious about the source of the voice files.
