
Support Text to Speech #209

Open
zolrath opened this issue May 9, 2023 · 13 comments
Labels
kind:feature New feature or request

Comments

@zolrath commented May 9, 2023

Hello!
Now that speech-to-text models such as Whisper are supported, having access to some of the impressive AI text-to-speech models would be a nice way to close the loop!

My current suggestion for a model to support would be Bark.
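
For context, the speech-to-text half of the loop can already be served along these lines (a rough sketch based on Bumblebee's Whisper support; function names and the result shape have shifted a bit across Bumblebee versions):

```elixir
# Existing Bumblebee speech-to-text serving with Whisper
# (requires ffmpeg to decode the input file into a waveform).
{:ok, model_info} = Bumblebee.load_model({:hf, "openai/whisper-tiny"})
{:ok, featurizer} = Bumblebee.load_featurizer({:hf, "openai/whisper-tiny"})
{:ok, tokenizer} = Bumblebee.load_tokenizer({:hf, "openai/whisper-tiny"})
{:ok, generation_config} = Bumblebee.load_generation_config({:hf, "openai/whisper-tiny"})

serving =
  Bumblebee.Audio.speech_to_text_whisper(model_info, featurizer, tokenizer, generation_config)

Nx.Serving.run(serving, {:file, "sample.wav"})
#=> %{chunks: [%{text: "..."}]} (exact result shape varies by version)
```

A text-to-speech serving would be the mirror image: text in, waveform out.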

@fredwu commented May 11, 2023

+1

Would also love to see support for Coqui TTS.

@jonatanklosko added the kind:feature (New feature or request) label on Dec 13, 2023
@bartekupartek commented Jan 24, 2024

It would be great to run Bark in Elixir. This TTS model has also attracted a lot of attention recently: https://github.com/collabora/WhisperSpeech

@Jdyn commented Mar 18, 2024

I hate to reiterate what's already been said, but TTS in Bumblebee using Bark would be super valuable. Any chance of supporting it?

Hugging Face: https://huggingface.co/suno/bark

@josevalim (Contributor) commented

Pull requests are always welcome. One of the models in Hugging Face Transformers is probably the easiest place to start: https://huggingface.co/docs/transformers/en/tasks/text-to-speech
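
To make the target concrete, a ported TTS model would presumably get a serving API following Bumblebee's existing conventions. This is purely a hypothetical sketch: `Bumblebee.Audio.text_to_speech/3` does not exist, and Bark's architecture is not currently supported by `Bumblebee.load_model/1`.

```elixir
# Hypothetical only: what a Bark port's serving might look like, by analogy
# with Bumblebee.Audio.speech_to_text_whisper/5. Neither text_to_speech/3
# nor Bark support exists in Bumblebee today.
{:ok, model_info} = Bumblebee.load_model({:hf, "suno/bark-small"})
{:ok, tokenizer} = Bumblebee.load_tokenizer({:hf, "suno/bark-small"})
{:ok, generation_config} = Bumblebee.load_generation_config({:hf, "suno/bark-small"})

serving = Bumblebee.Audio.text_to_speech(model_info, tokenizer, generation_config)

# Hypothetical result: a waveform tensor plus its sampling rate.
%{audio: waveform, sampling_rate: 24_000} = Nx.Serving.run(serving, "Hello from Elixir!")
```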

@nickkaltner commented

Just adding this as another interesting model to support: https://huggingface.co/coqui/XTTS-v2

@bartekupartek commented Apr 5, 2024

I tried to port Bark and later WhisperSpeech. They use multiple models to convert text to semantic tokens, semantic tokens to audio, and then encode the result. In any case, there are more promising models released recently: https://huggingface.co/parler-tts/parler_tts_mini_v0.1, https://github.com/jasonppy/VoiceCraft, or https://github.com/myshell-ai/OpenVoice. After reviewing their architectures, they might be easier to integrate.

@michelson commented

@bartekupartek, do you have your implementation open? I'm trying to do the same; I've read the docs but I'm not sure where to start.

@bartekupartek commented Apr 10, 2024

@michelson Not yet, but I'm working on it. These models don't use standard layers, or where they do, the checkpoints are in pickle format. I needed to step back and understand simpler models with Axon first.

@bartekupartek commented Apr 23, 2024

I'm currently playing around with Tacotron 2, and since it's the simplest TTS model I've found, I'm trying to reproduce it in Elixir. I used nx_signal to process audio files and generate mel spectrograms, but during my research I noticed there is no vocoder support in the Elixir ecosystem for converting spectrograms back to audio. Or am I missing something?
Vocoders are typically separate models, so I think they could be integrated into Bumblebee. I found that all TTS models use vocoders to synthesize audio from their outputs, but they are yet another layer of complexity.
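
For the spectrogram-to-audio step, the classic baseline is Griffin-Lim phase reconstruction: keep the magnitudes fixed and iteratively refine the phases by round-tripping between the time and frequency domains. A minimal sketch in Nx, assuming hypothetical `stft/1` and `istft/1` helpers (they would have to be built on NxSignal or a custom FFT pipeline; nothing here is an existing library call):

```elixir
defmodule GriffinLim do
  # Sketch of Griffin-Lim: recover phase for a fixed magnitude spectrogram.
  # stft/1 and istft/1 are placeholder helpers, not existing NxSignal calls.
  def reconstruct(magnitude, iterations \\ 32) do
    i = Nx.Constants.i()

    # Start from an all-zero phase estimate.
    phase = Nx.broadcast(0.0, Nx.shape(magnitude))

    phase =
      Enum.reduce(1..iterations, phase, fn _iter, phase ->
        # Combine the fixed magnitude with the current phase estimate,
        # round-trip through the time domain, and keep only the new phase.
        magnitude
        |> Nx.multiply(Nx.exp(Nx.multiply(i, phase)))
        |> istft()
        |> stft()
        |> Nx.phase()
      end)

    # Final synthesis with the recovered phase.
    istft(Nx.multiply(magnitude, Nx.exp(Nx.multiply(i, phase))))
  end

  # Placeholders; in practice these would wrap an FFT-based STFT/ISTFT,
  # e.g. built with Nx.fft/2, windowing, and overlap-add.
  defp stft(_waveform), do: raise("not implemented")
  defp istft(_spectrogram), do: raise("not implemented")
end
```

Neural vocoders like WaveRNN or WaveGlow give much better quality, but Griffin-Lim needs no trained weights, which makes it a useful first target for hearing a ported Tacotron 2 end to end.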

@josevalim (Contributor) commented

Correct. We would need to implement them in Elixir. Maybe @polvalente knows of an implementation that could be ported; otherwise we need to check whether there are any JAX implementations. If not, maybe it needs to be a separate library we invoke.

@polvalente (Contributor) commented

There are many kinds of vocoders. I think the best way to approach this would be to choose a specific model we want to support and work towards porting the one it uses.

@bartekupartek commented Apr 24, 2024

I was thinking it might be one of the torchaudio vocoders, such as Griffin-Lim (its output sounds robotic), WaveRNN (most likely this one), or NVIDIA WaveGlow, to turn mel spectrograms into audio. But then I read through the VALL-E paper that Bark is based on:

"We propose VALL-E, the first TTS framework with strong in-context learning capabilities as GPT-3, which treats TTS as a language model task with audio codec codes as an intermediate representation to replace the traditional mel spectrogram."

It would be fun to have Tacotron 2 working end to end, or to hear how mel spectrograms sound, but it looks like that doesn't make sense for any of the recent models mentioned above, which use facebook/encodec to turn their outputs into audio codes directly 🙇‍♂️
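
In other words, for these codec-based models the flow skips the mel/vocoder stage entirely. Illustrative only (every module name below is made up; none of this exists in Bumblebee):

```elixir
# Bark/VALL-E-style pipeline shape. The EnCodec decoder replaces the
# traditional mel-spectrogram vocoder; all module names are hypothetical.
semantic_tokens = SemanticModel.generate(text)          # text -> semantic tokens
audio_codes = AcousticModel.generate(semantic_tokens)   # semantic tokens -> EnCodec codes
waveform = EncodecDecoder.decode(audio_codes)           # codes -> waveform directly
```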

@username14415 commented

+1
