-
Thanks for your detailed notes! I have included them in the README. But does your RAM usage include the SLM adversarial training run?
-
I have a question about the following point:
I tried this quickly on Colab by installing the phonemizer using this how-to and used it on the following 3 lines of text:
However, the output of all 3 lines was the same:
Did I miss something? I used epitran to transliterate before and the output from it was a little bit different. Also, epitran actually left those quotes intact during its own phonemization:
I'm also wondering how much of a difference there is between epitran's and espeak's output. English is not my primary language and I have very little knowledge of phonemization, so I can't tell what the difference between these 2 outputs really is. I'd be glad if anyone could shed a bit of light on this for me, too :)
-
Thank you for the detailed write-up! It would be nice to get the default max_len of 400 with everything turned on to train on a 24GB card. Any idea if batch_size=1 would work with gradient accumulation?
-
Thanks so much for the super useful notes @Kreevoz - much appreciated! We are currently dealing with artifacts at the end of generated audio.
We did a single speaker fine-tune with max_len=100 and got artifacts at the end: https://voca.ro/1fxqUN6Tj2US (~4s)
We were able to get rid of those with max_len=400: https://voca.ro/193XDB1sOpaU (~4s)
However, for longer audio at ~10s we still get them: https://voca.ro/1odqL5a7BC89 (~10s)
Wondering if this would be fixed if we fine-tune on max_len=800, but then wondering if the issue will pop up again for even longer audio like ~15s. Would that be something to expect based on how the model learns to produce the audio within the max_len window, or is there a sweet spot to get rid of most of the artifacts no matter the length of the audio? Thx 🙏
-
I used 4 V100 16GB GPUs with a batch size of 8 and a max_len of 200 with my dataset. It went OOM at the second stage of training.
-
Is it possible to grow and add new styles to the multi-speaker model by repeatedly fine-tuning with a small dataset? For example, adding a variety of British speakers with different accents in separate fine-tuning sessions.
-
Is it okay if the batch_size and max_len values of 'first_stage' and 'second_stage' are different?
-
Can VRAM usage be lowered by lowering batch_percentage, and is there a compromise in quality?
-
With a 4090 card we are limited to a max_length of about 280. Does that mean that only the first 3.5 seconds of audio are utilized? And then it wouldn't make any sense to include longer training sentences? 3.5 seconds is a really small context window for text styling. Or perhaps longer samples are split up?
-
How much data is required when fine-tuning an already fully trained (pretrained) model?
-
This system is very robust at generating a large amount of text. There are a few issues with the pops at the end. My speculation is that when the system tries to generate a voice sample and there is no data, it will generate noise or static.
-
I was wondering whether it is worth trying FP8 for training. I struggled to build the wheel on WSL and wonder if it's worth the effort anyway. Are there any benefits?
-
Thanks for the write-up, I'll post my experiences here. I tried fine-tuning on a single A100 80GB with a much smaller dataset (~30 mins), settings of
Went OOM within a couple epochs. With
I'm unsure if you can finetune on multiple GPUs, so I would be cautious with your settings if you haven't prepared your data yet. If I make another dataset I'll limit my clips to 6-8 seconds each. The VRAM requirements for training are immense, but I have seen people use much longer clips so there may be a variable I missed.
The inference quality is very good. I didn't focus on emotion transfer but it still came out great, with excellent vocal quality and prosody. There are some unnatural pauses, but some of that may be related to the punctuation in my training set: I included a few too many dashes and the model seemed to associate those with pauses. The only issue is occasional artifacts at the end of generated clips, as others have noted. I assume this is learned from the Libri dataset because I included ample leading and trailing silence and kept all my clips under the
-
So if I do the same and tag my data with different styles/emotions, does that mean I'd have to select a specific speaker from the model at inference time, and that whole inference will only output that style? If I tag everything under the same speaker, would the model just try to produce an average of all the styles?
-
Something I noticed while using inference with the base LibriTTS model. If I denoise the reference audio with https://github.com/resemble-ai/resemble-enhance Reference audio
-
Does anyone know why SLM adversarial training doesn't start when finetuning, as shown in this issue? #227 I think it may have to do with the batch size and max_length, as the OP of this issue and I both had a batch size of 2 with max_length 400, but I haven't done any further testing on this yet. Here's my full config:
-
Tracing back further, just a quick look:
-
I don't understand how the max_len number converts to seconds. For example, how much is a max_len of 190 in seconds and milliseconds?
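Going by the figures in the original notes in this thread (a max_len of 100 = 1.25 seconds, 800 = 10 seconds), one max_len unit works out to 12.5 ms, i.e. 80 frames per second. A minimal sketch of the conversion, assuming that ratio holds for your config. `FRAMES_PER_SECOND` and both function names are made up for this example, and the ratio would presumably change with a different sample rate or hop size:

```python
import math

# Assumption: 80 frames per second, derived from the figures quoted in this
# thread (max_len 100 = 1.25 s, max_len 800 = 10 s).
FRAMES_PER_SECOND = 80


def max_len_to_seconds(max_len: int) -> float:
    """Convert a max_len value (in frames) to a duration in seconds."""
    return max_len / FRAMES_PER_SECOND


def seconds_to_max_len(seconds: float) -> int:
    """Smallest max_len (in frames) that covers the given duration."""
    return math.ceil(seconds * FRAMES_PER_SECOND)


if __name__ == "__main__":
    for frames in (100, 190, 280, 800):
        secs = max_len_to_seconds(frames)
        print(f"max_len {frames} = {secs:.3f} s ({secs * 1000:.1f} ms)")
```

Under this assumption, a max_len of 190 comes out at 2.375 seconds (2375 ms).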
-
Hello, we created our own character model by fine-tuning the English model. What should I do for emotional TTS? (alpha = 0.1, beta = 0.5, diffusion=10), same as the demo.
-
@Kreevoz could you please guide me on how to train the StyleTTS2 model in Python?
-
@Kreevoz Thanks for the detailed notes. I have a question regarding the following comment:
Since the model in the paper was trained using LibriTTS as-is with excellent results, does this mean that this issue might not significantly affect the audio quality? I'd appreciate your insights on this.
-
Thanks also for all the info! We are actually looking for someone to dockerize this whole finetuning process for a project of ours, so if anyone who has had some successful finetunes is interested in working with us on this for a small compensation, please email me at [email protected] :)
-
I've made a few notes during finetuning runs and figure we could maybe pool our insights into one discussion to help everyone iterate efficiently. I don't claim these to be anything more than my own observations/compilation of useful notes. Take them with a grain of salt, especially since there is such rapid development happening as of writing this. I am not affiliated with the authors of this lovely TTS model.
Also take a look through the closed issues if you're running into trouble. There is some useful information in them.
Teaching the model new features
Text dataset quality
Robustness
Artifacts
- A too-low max_len (in addition to poor audio cleanup and a low quality transcription of course)
- At a max_len of 100 (=1.25 seconds), finetuning is possible, but the start and end of generated audio may accumulate distortion and pops.
- At a max_len of 800 (=10 seconds), quality is excellent even after one epoch and improves on subsequent iterations. This length covers the majority of audio datasets (as you know, the free standard datasets adhere to the duration limitations that autoregressive models like Tacotron 1/2 established years ago, due to their attention mechanism imploding after 10-12 seconds).
- A max_len of 400 and 600 also works well.
Finetune training Stages
Base
Style Diffusion
- diff_epoch counts epochs from 0. For example, to start diffusion training on epoch 5, set this parameter to (5-1) 4.
- To skip style diffusion training, set diff_epoch to a value that is larger than your total number of epochs.
SLM Adversarial Training
- joint_epoch counts epochs from 0. For example, to start SLM adversarial training on epoch 10, set this parameter to (10-1) 9.
- joint_epoch must be set to a higher number than diff_epoch, or you will encounter an error. You cannot run SLM Adversarial Training before you begin running Style Diffusion training.
- To skip SLM adversarial training, set joint_epoch to a value that is larger than your total number of epochs.
- batch_percentage defaults to 0.5. Adds (batch_size * 0.5) number of batches with SLM Adversarial samples. (For example, if the previous batch size was 6 without SLM adv. training, the new batch size is 4, since (4 * 0.5 =) 2 batches will get added, totaling 6 again.)
- batch_size, for a few extra epochs.
- min_len and max_len under the slmadv_params section in the config file within reason.
Errors, Crashes
RuntimeError: Calculated padded input size per channel: (5 x 4). Kernel size: (5 x 5). Kernel size can't be greater than actual input size
One or both of the following conditions are present:
- max_len of less than 100
UnicodeDecodeError: 'charmap' codec can't decode byte ( .. )
Check the Operating Systems section below.
RuntimeError: The expanded size of the tensor (SOME NUMBER HERE) must match the existing size (512) at non-singleton dimension 1.
The input text is too long. If this is happening during training, check your dataset and split up extremely long sentences into more manageable ones. Make sure that if you use a custom OOD text, you split sentences on punctuation and ensure they don't become entire paragraphs. Anything that would take you longer than 10 seconds to speak is probably a candidate for splitting in half.
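One way to catch such lines before training is to scan the list files for overly long text. The sketch below assumes a pipe-separated list format (`path|text|speaker`) and treats the 512 from the error message as a limit on the number of characters in the text field. Both are assumptions, and `find_long_entries` is a hypothetical helper, not part of the repository:

```python
from typing import Iterable, List, Tuple

# Assumption: the 512 in the error message is the maximum number of text
# symbols the model accepts per utterance.
MAX_TEXT_SYMBOLS = 512


def find_long_entries(
    lines: Iterable[str], limit: int = MAX_TEXT_SYMBOLS
) -> List[Tuple[int, int]]:
    """Return (line_number, text_length) for entries whose text exceeds limit.

    Each line is assumed to look like: path/to/audio.wav|text|speaker_id
    """
    too_long = []
    for number, line in enumerate(lines, start=1):
        fields = line.rstrip("\n").split("|")
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        text = fields[1]
        if len(text) > limit:
            too_long.append((number, len(text)))
    return too_long


if __name__ == "__main__":
    sample = [
        "a.wav|short text|0",
        "b.wav|" + "x" * 600 + "|0",
    ]
    for number, length in find_long_entries(sample):
        print(f"line {number}: {length} symbols, consider splitting")
```

To scan a real list file, pass an open file handle, for example `find_long_entries(open(path, encoding="utf-8"))`. Since the real limit may apply after tokenization rather than to raw characters, it is probably wise to leave some headroom below 512.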
IndexError: Dimension out of range (expected to be in range of [-1, 0], but got 1)
batch_size must be 2 or greater, or you will run into this error.
UnboundLocalError: local variable 'ref' referenced before assignment
If this appears when your finetuning is trying to begin the SLM Adversarial Training, then your diff_epoch is set to a later epoch than joint_epoch. For example: diff_epoch = 5, joint_epoch = 4 is not valid. You would want joint_epoch to be the bigger number.
RuntimeError: Given groups=1, weight of size [1, 1, 3], expected input[1, 221, 1] to have 1 channels, but got 221 channels instead
If you are running finetuning across multiple GPUs, your chosen batch_size may be too small and result in each GPU only getting a batch of 1. Increase the batch_size.
Mixed-precision Training
If you don't want to run training in full precision, you can now run finetuning at mixed-precision.
accelerate launch --mixed_precision=fp16 --num_processes=1 train_finetune_accelerate.py --config_path ./Path/To/Your/config_ft.yml
- If you use the saved VRAM to raise max_len a little bit, you essentially get a small quality improvement for free.
VRAM Usage
- batch_size: 4, max_len: 100 = ~22GB VRAM. Fits onto a 4090 without problems. Training took ~3 hours.
- batch_size: 6, max_len: 800 = ~74GB VRAM. Fits onto an A100 without trouble. Training took ~2 hours.
- batch_size: 4, max_len: 100 = ~23.1GB VRAM. Fits onto a 4090. Training took <4 hours.
- batch_size: 4, max_len: 100, using accelerate mixed_precision=fp16 = ~21GB VRAM. 4-5% speed boost.
- batch_size: 4, max_len: 100 = ~28GB VRAM. Impossible on a 24GB card at this batch size.
- batch_size: 4, max_len: 100, using accelerate mixed_precision=fp16 = ~26.6GB VRAM. Still not feasible.
- batch_size: 2, max_len: 175, using accelerate mixed_precision=fp16 = <19GB VRAM. Fits onto a 4090. Only ran this for the Joint Training epochs.
- batch_size: 4, max_len: 800 = ~76.5GB VRAM. Fits onto an A100 without trouble. Training took <3 hours.
VRAM Usage Strategies
- max_len and can benefit from a relatively speedy initial training run, and suffer through fewer epochs with reduced batch_size to finish training.
- Halve the batch_size and double the max_len to stay roughly within the same amount of utilized VRAM, if you keep all other parameters the same. VRAM usage grows a little on subsequent epochs, so keep some spare capacity for long runs.
- Halving the batch_size will roughly double the time it takes to run an epoch, but won't negatively impact quality.
- Reducing max_len will negatively impact quality. If you can, reduce the batch_size instead.
- Don't go below max_len: 100 or batch_size: 2, ever.
- On multiple GPUs, use a batch_size that provides each GPU with at least a batch of size 2. (For example, if you have 4 GPUs, then 8 is the minimal possible batch_size.)
Checkpoints
- Checkpoint numbering starts at 0. epoch_2nd_00000.pth is your completed first epoch, and not an empty checkpoint.
Resuming Finetuning
- pretrained_model: "Models/YourModelName/epoch_2nd_00123.pth"
- load_only_params: false
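Because checkpoint numbering starts at 0, it is easy to be off by one when picking the file to resume from. A small sketch of the mapping follows. `checkpoint_name` and `resume_settings` are hypothetical helpers for illustration, and the five-digit zero padding is inferred from the file names quoted above:

```python
def checkpoint_name(completed_epochs: int) -> str:
    """File name of the checkpoint written after N completed epochs.

    Numbering starts at 0, so the first completed epoch is epoch_2nd_00000.pth.
    """
    if completed_epochs < 1:
        raise ValueError("at least one epoch must have completed")
    return f"epoch_2nd_{completed_epochs - 1:05d}.pth"


def resume_settings(model_dir: str, completed_epochs: int) -> dict:
    """Config values to change when resuming, mirroring the two keys above."""
    return {
        "pretrained_model": f"{model_dir}/{checkpoint_name(completed_epochs)}",
        "load_only_params": False,
    }


if __name__ == "__main__":
    # 124 completed epochs correspond to epoch_2nd_00123.pth
    print(resume_settings("Models/YourModelName", 124))
```

The returned dictionary only mirrors the two config keys; writing them back into your config file is left out on purpose.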
Logging
- Set log_dir: to point to a folder of your choice, for example "Models/MyCoolTTSModel".
Operating system support
- Set PYTHONUTF8=1 either system-wide, or in the terminal session you're using, before invoking the finetuning script.
  In cmd: set PYTHONUTF8=1
  In PowerShell: $Env:PYTHONUTF8 = 1
- To check whether UTF-8 mode is active, run the following in Python. It prints 1 if it is enabled.
  import sys
  print(sys.flags.utf8_mode)
- Use / forward slashes, rather than the backward \ slash notation common for Windows, just to be on the safe side.
Hardware Requirements
- max_len and batch_size.
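The constraints on max_len, batch_size and the epoch settings collected in these notes can be checked before committing GPU hours to a run. The sketch below is only one reading of those rules: all three function names are hypothetical, and the batch_percentage arithmetic follows the worked example above (6 without SLM adversarial training becomes 4 with it) rather than anything confirmed from the training code.

```python
import math
from typing import List


def config_epoch(start_epoch: int) -> int:
    """Epochs are counted from 0, so 'start on epoch 5' is configured as 4."""
    return start_epoch - 1


def slm_batch_size(previous_batch_size: int, batch_percentage: float = 0.5) -> int:
    """Largest batch_size b with b + b * batch_percentage <= previous size."""
    # The small epsilon guards against floating point results like 3.9999999.
    return math.floor(previous_batch_size / (1 + batch_percentage) + 1e-9)


def check_settings(
    batch_size: int,
    max_len: int,
    diff_epoch: int,
    joint_epoch: int,
    num_gpus: int = 1,
) -> List[str]:
    """Return a list of problems found; an empty list means no rule is violated."""
    problems = []
    if batch_size // num_gpus < 2:
        problems.append("each GPU needs a batch of at least 2")
    if max_len < 100:
        problems.append("max_len should not be below 100")
    if joint_epoch <= diff_epoch:
        problems.append("joint_epoch must be higher than diff_epoch")
    return problems


if __name__ == "__main__":
    diff_epoch = config_epoch(5)    # start style diffusion on epoch 5
    joint_epoch = config_epoch(10)  # start SLM adversarial training on epoch 10
    print(check_settings(4, 100, diff_epoch, joint_epoch))
    print(slm_batch_size(6))
```

An empty list only means that none of the rules listed in these notes is violated; it says nothing about whether the run will fit into VRAM.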
Quality comparisons
Dataset: Custom dataset for Garrus from Mass Effect. 30 emotions/styles tagged as speaker IDs, total duration about 5h50min. Custom text preprocessing. Audio includes flanger, this is not a model error but quite desired for a Turian voice. Using a custom OutOfDomain text dataset for SLM AT.
Epoch: 5, for all examples. There is still room for improvement with more epochs.
Sampling: alpha=0.3, beta=0.7, diffusion_steps=10, embedding_scale=1
Text:
You are reading a discussion page on Github, imagine that! I think the human saying is: "Git good!" Wonder why they didn't choose "that" name.
(The "quoted" words are used for extra emphasis in my dataset)
Phoneme version:
juː ɑːɹ ɹˈiːdɪŋ ɐ dɪskˈʌʃən pˈeɪdʒ ˌɔn ɡˈɪthʌb , ɪmˈædʒɪn ðˈæt ! ˈaɪ θˈɪŋk ðə hjˈuːmən sˈeɪɪŋ ɪz : `` ɡˈɪt ɡˈʊd '' ! wˈʌndɚ wˌaɪ ðeɪ dˈɪdnt tʃˈuːz `` ðˈæt '' nˈeɪm .
- batch_size: 4, max_len: 100, without style diffusion finetuning, without slm adversarial finetuning: https://voca.ro/11DDidEhJac5
- batch_size: 6, max_len: 800, without style diffusion finetuning, without slm adversarial finetuning: https://voca.ro/18PaQ8F248Hu
- batch_size: 4, max_len: 100, with style diffusion finetuning, without slm adversarial finetuning: https://voca.ro/1bulPARTI2mn
- batch_size: 2, max_len: 175, with style diffusion finetuning, with slm adversarial finetuning: https://voca.ro/1aqmfuqHS51N
  Since running in batchsize 2 takes forever, I only trained for 2 final epochs with slm adversarial finetuning, and prior to that, up to epoch 4 with batchsize of 4 with 100 max_len.
- batch_size: 4, max_len: 800, with style diffusion finetuning, with slm adversarial finetuning: https://voca.ro/11QGDDMhWsNU
These quality examples don't reflect the maximum quality possible and are just for illustration purposes. :>
Hope this is useful.