
[Bug] XTTSv1 and XTTSv2: ❗ len(DataLoader) returns 0 error with non-English data #3229

Closed · lpscr opened this issue Nov 15, 2023 · 25 comments · Labels: bug (Something isn't working)

@lpscr commented Nov 15, 2023

Describe the bug

Hi, first I want to thank you for all your amazing work!

If you have a little time, please check this; I have also made a notebook that makes it easy to debug the problem.

I followed the steps in the XTTSv1 and XTTSv2 recipes:

https://github.com/coqui-ai/TTS/tree/dev/recipes/ljspeech/xtts_v1
https://github.com/coqui-ai/TTS/tree/dev/recipes/ljspeech/xtts_v2

AssertionError: ❗ len(DataLoader) returns 0. Make sure your dataset is not empty or len(dataset) > 0.

If I test with VITS or Glow, it works just fine without any problem and I can complete the training:
https://github.com/coqui-ai/TTS/tree/dev/recipes/ljspeech/vits_tts

To Reproduce

I have created a simple notebook that makes it easy to see the problem.
I have also fixed the dataset I use from Kaggle:

testTTSV1_2.zip

Expected behavior

No response

Logs

> Training Environment:
 | > Backend: Torch
 | > Mixed precision: False
 | > Precision: float32
 | > Num. of CPUs: 2
 | > Num. of Torch Threads: 1
 | > Torch seed: 1
 | > Torch CUDNN: True
 | > Torch CUDNN deterministic: False
 | > Torch CUDNN benchmark: False
 | > Torch TF32 MatMul: False
 > Start Tensorboard: tensorboard --logdir=/content/GPT_XTTS_LJSpeech_FT_new-November-15-2023_03+10PM-0000000

 > Model has 518128803 parameters

>> DVAE weights restored from: /content/XTTS_v1.1_original_model_files/dvae.pth
 | > Found 1844 files in /content/el


 > EPOCH: 0/1000
 --> /content/GPT_XTTS_LJSpeech_FT_new-November-15-2023_03+10PM-0000000
 ! Run is removed from /content/GPT_XTTS_LJSpeech_FT_new-November-15-2023_03+10PM-0000000

 > Filtering invalid eval samples!!
 > Total eval samples after filtering: 0
Internal Python error in the inspect module.
Below is the traceback from this internal error.

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/trainer/trainer.py", line 1808, in fit
    self._fit()
  File "/usr/local/lib/python3.10/dist-packages/trainer/trainer.py", line 1762, in _fit
    self.eval_epoch()
  File "/usr/local/lib/python3.10/dist-packages/trainer/trainer.py", line 1610, in eval_epoch
    self.get_eval_dataloader(
  File "/usr/local/lib/python3.10/dist-packages/trainer/trainer.py", line 976, in get_eval_dataloader
    return self._get_loader(
  File "/usr/local/lib/python3.10/dist-packages/trainer/trainer.py", line 900, in _get_loader
    len(loader) > 0
AssertionError:  ❗ len(DataLoader) returns 0. Make sure your dataset is not empty or len(dataset) > 0.

Environment

{
    "CUDA": {
        "GPU": [],
        "available": false,
        "version": "11.8"
    },
    "Packages": {
        "PyTorch_debug": false,
        "PyTorch_version": "2.1.0+cu118",
        "TTS": "0.20.5",
        "numpy": "1.22.0"
    },
    "System": {
        "OS": "Linux",
        "architecture": [
            "64bit",
            "ELF"
        ],
        "processor": "x86_64",
        "python": "3.10.12",
        "version": "#1 SMP Wed Aug 30 11:19:59 UTC 2023"
    }
}

Additional context

No response

@lpscr added the bug (Something isn't working) label on Nov 15, 2023
@erogol (Member) commented Nov 16, 2023

How many samples do you have?

@stlohrey

I would guess the problem is that the tokenizer does not support Greek, which causes an error when loading your data, resulting in an empty data loader.

@lpscr (Author) commented Nov 16, 2023

How many samples do you have?

Thanks for your quick reply.

First test (as used in the notebook):

WAV format:
Channels: 1
Sample Width: 4
Frame Rate: 22050

Files:
Minimum Duration: 1.237 seconds
Maximum Duration: 26.4 seconds
Total Duration: 04:03:40

Total Files: 1820

Second test:

Audio format:
WAV format
Channels: 1
Sample Width: 2
Frame Rate: 22050

Files:
Minimum Duration: 1.0 seconds
Maximum Duration: 18.443 seconds
Total Duration: 09:20:49

Total Files: 8526

I use this data with the VITS model and it works fine without any problem.

The problem is when I use it with XTTSv1 or XTTSv2; I tested both, same problem.

I also tried running phonemizer or unidecode on metadata.csv, as @stlohrey suggested, so that there are no Greek characters; same problem, it didn't help :(

from this

Paramythi_horis_onoma_0065|- Τι είναι πάλι οι φωνές;|- Τι είναι πάλι οι φωνές;
to
Paramythi_horis_onoma_0065|tˈi ˈeːnaɪ pˈali ˈoɪ fɔːnˈes?|tˈi ˈeːnaɪ pˈali ˈoɪ fɔːnˈes?

I also tried unidecode from Greek to English, so that I have only English characters, and I get the same problem again:

from this
Paramythi_horis_onoma_0065|- Τι είναι πάλι οι φωνές;|- Τι είναι πάλι οι φωνές;
to
Paramythi_horis_onoma_0065|- Ti einai pali oi phones;|- Ti einai pali oi phones;
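
For reference, this is roughly how I do the unidecode conversion (a minimal sketch, assuming an LJSpeech-style metadata.csv with id|text|normalized_text columns; metadata_fix.csv is just the name I use):

from unidecode import unidecode

with open("metadata.csv", encoding="utf-8") as fin, \
        open("metadata_fix.csv", "w", encoding="utf-8") as fout:
    for line in fin:
        parts = line.rstrip("\n").split("|")
        # keep the file id, transliterate the text columns to ASCII
        fout.write("|".join([parts[0]] + [unidecode(p) for p in parts[1:]]) + "\n")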

Thank you for your time.

Let me know if you want me to test something else.

@AIFSH commented Nov 16, 2023

#3206 (comment)

Same as my issue.

@stlohrey commented Nov 16, 2023

I just reproduced your error by changing the language code on my working dataset to "grc" (as set in your notebook). So try changing the language code along with using English characters; it should work.

@lpscr (Author) commented Nov 16, 2023

@stlohrey thank you! It looks like the problem is with the language code and the Greek characters, as you say. This works, but the problem is that I need the phonemizer; I don't think the training will be good if I only convert with unidecode.

@erogol can you please check whether this problem can be fixed? If I use the grc or el language code, as @stlohrey says, it doesn't work, but in VITS it works fine. Can you check this please?

The problem is here, as you say:

config_dataset = BaseDatasetConfig(
    formatter="ljspeech",
    dataset_name="ljspeech",
    path="/home/lpc/appPython/ttsNew/lora",
    meta_file_train="metadata_fix.csv",
    language="en", # <<<  if i use grc code or el dont working problem
) 

Is it possible to somehow use the phonemizer for better training, like I do with VITS?

If I use it like this, it doesn't work either; same problem:

        text_cleaner="phoneme_cleaners",
        use_phonemes=True,
        phoneme_language="grc",
        phoneme_cache_path="phoneme_cache",    

The method with unidecode works:

Paramythi_horis_onoma_0065|- Τι είναι πάλι οι φωνές;|- Τι είναι πάλι οι φωνές;
to
Paramythi_horis_onoma_0065|- Ti einai pali oi phones;|- Ti einai pali oi phones;

The method with the phonemizer gives the len error. I also use this with VITS; is it possible to use it with XTTSv1/XTTSv2 too?

Paramythi_horis_onoma_0065|- Τι είναι πάλι οι φωνές;|- Τι είναι πάλι οι φωνές;
to
Paramythi_horis_onoma_0065|tˈi ˈeːnaɪ pˈali ˈoɪ fɔːnˈes?|tˈi ˈeːnaɪ pˈali ˈoɪ fɔːnˈes?

Thank you.

@stlohrey

I think for adding a new language, you would have to work on the tokenizer script to introduce text cleaners for the new language, but also on the vocab, because the language code is tokenized together with the text, which means you would need to train the GPT model on the new tokens. I also don't think IPA phonemes are represented in the vocab.json file provided.

@lpscr (Author) commented Nov 16, 2023

How can I train the GPT model with the new tokens? Can you please give me more info about this?

@stlohrey commented Nov 16, 2023

You would need to change the tokenizer vocab and the model config, maybe add a text cleaner for your language, and then run the trainer. I don't know if transfer learning from the pretrained checkpoint with new or modified input tokens works or makes sense. I also don't know if you would have to fine-tune the quantizer and the HiFi decoder with the new-language audio. Maybe @erogol can give some hints on that topic.

@lpscr (Author) commented Nov 16, 2023

@stlohrey thank you very much for all your help :)
I understand I need to wait for some help.

@erogol I know you're working hard, and I really appreciate it. Any hints would be great on how to add a new language or fine-tune Greek with the power of the new XTTS model. I know Greek is already supported in VITS; is there any chance it could be supported in XTTS in the next release? Thanks again for your amazing work.

@AIFSH commented Nov 16, 2023

config_dataset = BaseDatasetConfig(
    formatter="ljspeech",
    dataset_name="ljspeech",
    path="/home/lpc/appPython/ttsNew/lora",
    meta_file_train="metadata_fix.csv",
    language="en", # <<< if I use the grc or el code it doesn't work
)

I assigned meta_file_val and it runs with my Chinese dataset; maybe you can try that.
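
Roughly what I mean, as a sketch only (the paths, file names, and language code are placeholders from my setup):

from TTS.tts.configs.shared_configs import BaseDatasetConfig

config_dataset = BaseDatasetConfig(
    formatter="ljspeech",
    dataset_name="ljspeech",
    path="/path/to/dataset",
    meta_file_train="metadata_train.csv",
    meta_file_val="metadata_val.csv",  # explicit eval split instead of the automatic one
    language="zh-cn",
)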

@lpscr (Author) commented Nov 16, 2023

@AIFSH thank you, I tested it as you say and it does not work; same problem again.

@brambox commented Nov 17, 2023

You can train over whichever existing language is closest to yours; over time the accent mostly clears.
I was kind of able to add Bulgarian [bg] by editing TTS/tts/configs/xtts_config.py, then TTS/tts/layers/xtts/tokenizer.py, and adding it in vocab.json.
The problem is it seems to still use some of the existing language and keeps a slight accent.
We need a language ID with as clean an accent as possible. Maybe there is one, if the developers can tell us?
I'm not sure if we will ever get a train-from-zero model.

@lpscr (Author) commented Nov 17, 2023

@brambox hi, thank you for the info and your help.

1. I went to TTS/tts/configs/xtts_config.py and added the language to the dictionary there.
2. I went to TTS/tts/layers/xtts/tokenizer.py; I changed a lot of stuff here and everything looks OK.
3. I went to the vocab.json in the downloaded v2 model folder; when I try to change something there I get a len error because I changed the dictionary, so I can't change anything. I'm stuck here.

Here is what I found and tried instead, without changing any file of the original script.

I just use the symbols already registered inside vocab.json:

"a": 14,
"b": 15,
"c": 16,
etc...

and I made a simple dictionary map to replace the characters I need in my text (see the sketch after the map):

"α" : "a"
"β" : "v" 
"γ" : "g"
etc...
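
Applied like this (just a sketch; only a few mappings are shown, my real map covers the whole Greek alphabet):

GREEK_TO_LATIN = {
    "α": "a", "β": "v", "γ": "g",
    # ...rest of the alphabet
}

def replace_chars(text: str) -> str:
    # map each Greek character to a symbol already registered in vocab.json
    return "".join(GREEK_TO_LATIN.get(ch, ch) for ch in text)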

I wonder how you changed vocab.json, because when I try to add or change anything I get an error; let me know.

And yes, it would be great if one of the developers could give more help.

@brambox commented Nov 17, 2023

Try just adding the language without anything more, put in some random ID, and start training; see the result that way.

@lpscr (Author) commented Nov 17, 2023

@brambox you mean going inside vocab.json and adding the language? Can you write out exactly what you added and where (the line index), so I can understand better? If I change or add anything I get an error. I use XTTSv2:

size mismatch for gpt.text_embedding.weight: copying a param with shape torch.Size([6153, 1024]) from checkpoint, the shape in current model is torch.Size([6154, 1024]).
size mismatch for gpt.text_head.weight: copying a param with shape torch.Size([6153, 1024]) from checkpoint, the shape in current model is torch.Size([6154, 1024]).
size mismatch for gpt.text_head.bias: copying a param with shape torch.Size([6153]) from checkpoint, the shape in current model is torch.Size([6154]).

@brambox commented Nov 18, 2023

It seems we are kind of doing a workaround to use a new language, and it will still need to train over an existing one.

So if, for example, you change

{
    "id": 6152,
    "special": true,
    "content": "[ko]",
    "single_word": false,
    "lstrip": false,
    "rstrip": false,
    "normalized": false
}

and then go to "[ko]": 6152 and change both to "[gr]", it will start training.

Now you can use any of the other language IDs the same way, so just use the one closest to your language's accent. Maybe you can also add custom symbols, but I haven't tested that.

Also, if you remove "[ko]": 6152, then you can use any id of 6152 or higher here ({ "id": 6152, "special": true, ... }), but it will still use the language you removed even if the id is different.

We need the developers to actually support adding a completely new language, or at least give us some accent-free IDs we can train over.
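
The rename can also be scripted. A rough sketch, untested beyond my own runs, and it assumes vocab.json keeps the tokenizers layout shown above (an added_tokens list plus a model.vocab mapping):

import json

with open("vocab.json", encoding="utf-8") as f:
    vocab = json.load(f)

# rename the special token entry, keeping the same id so the
# checkpoint embedding sizes still match
for tok in vocab["added_tokens"]:
    if tok["content"] == "[ko]":
        tok["content"] = "[gr]"

# rename the matching key in the main vocabulary
vocab["model"]["vocab"]["[gr]"] = vocab["model"]["vocab"].pop("[ko]")

with open("vocab.json", "w", encoding="utf-8") as f:
    json.dump(vocab, f, ensure_ascii=False, indent=2)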

@lpscr (Author) commented Nov 18, 2023

OK, this works: you can replace the language like you say and the training starts. But you also need to replace the characters in metadata.csv with ones already registered in vocab.json, otherwise you get the len(DataLoader) returns 0 error, because there are no Greek characters in vocab.json. And if you try to add or replace existing symbols, like changing "a": 14 to "α": 14, you get an error. So I guess there is no need to change anything in the script; in my case it doesn't work, so the only thing I can do is replace the text in metadata.csv with English characters only, and then training starts with lang "en", as in my previous messages about unidecode.

I wonder about this method you tried: how are the results in your case? Is the language accent clear, and how many hours of data did you use? Can you compare with VITS, to see if this is working?

I guess we need to wait for someone to help with this, like you say.

@brambox commented Nov 19, 2023

Overall it's pretty decent. The accent mostly clears over time; just a little bit remains.
The biggest problem I find for Bulgarian is that word stress is sometimes applied wrong, even if the word is in the dataset. Sometimes it says it right, sometimes wrong, and sometimes it's always wrong, especially if the word is not in the dataset.
I wish there were some way to hard-force the stress position. Also, we have words that are written the same but mean different things depending on stress, so some kind of easier control for that would be a big help.
It is definitely much more natural than VITS.
I tested different dataset sizes, but I settled on around 2.5 to 3 hours, and noticed that longer datasets do not always result in a better model.

@lpscr (Author) commented Nov 19, 2023

Yes, same here: more hours make things worse. I tried with 4 hours and then with 12 hours. I also often get the same thing repeated; sometimes it speaks with an English accent, sometimes very slow or very fast, with noise. So as far as I can see this isn't working for my language, and from what you say it looks the same for yours. And yes, when it works it sounds more natural than the VITS model. I don't know if it's possible to somehow make this work in this version of XTTS or if we need to wait for the next release. I hope someone can help more with this.

I have some questions, to check that I'm doing things correctly:

1. How many hours are needed to fine-tune? (I tried 4-12 hours.)
2. What minimum and maximum audio durations can be used? (I use min 1 s, max 11 s, 22050 Hz mono, with clear speech and no noise.)
3. Roughly how many steps are needed? (About 300k, about 1 day of training in my case.)
4. Does using multiple speakers help the training, or is a single speaker better? (I tried both single and multi-speaker.)
5. Can I replace the symbols with English characters, like unidecode does? (I use the unidecode method and also replace some characters to work correctly in my language.)
6. Which model is best to use, XTTS v1 or v2? (I use only the latest model, v2.)
7. Can I train from scratch without using any checkpoints, like in VITS?

Thank you for your time.

@78Alpha commented Nov 23, 2023

I actually ran into this issue with an English dataset. It always filters down to 0.

@arbianqx

I ran into the same problem as well. I tried the manual fixes above, with no luck.

@Edresson self-assigned this on Nov 27, 2023
@Edresson (Contributor)

Hi @lpscr, hi @arbianqx,

The message "> Total eval samples after filtering: 0" indicates that you don't have any eval samples that meet the training requirements. It can have three causes:

  1. The eval CSV that you provided is empty;
  2. The samples in the eval CSV that you provided are longer than the max_wav_len and max_text_len defined in the recipe (https://github.com/coqui-ai/TTS/blob/dev/recipes/ljspeech/xtts_v2/train_gpt_xtts.py#L86C1-L87C29). Note that we do not recommend changing these values for fine-tuning;
  3. You did not provide an eval CSV, and all the automatically selected samples are longer than max_wav_length and max_text_length.

In all these scenarios, you need to change (or create) your eval CSV to meet the requirements for training.

In your case it looks like the issue is 3: you didn't provide an eval CSV and all the automatically selected samples are too long, so all the evaluation samples get filtered out. I would recommend you create an eval CSV file using part of your data (around 15%), making sure that the samples respect max_wav_length (11 seconds) and max_text_length (200 tokens).
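
Something along these lines could build such a CSV. This is a minimal sketch only; the column layout, the paths, and the character-count stand-in for the 200-token limit are assumptions for an LJSpeech-style dataset:

import os
import torchaudio

DATASET = "/path/to/dataset"
MAX_WAV_SECONDS = 11.0
MAX_TEXT_LEN = 200  # rough character-count stand-in for the token limit

kept = []
with open(os.path.join(DATASET, "metadata.csv"), encoding="utf-8") as f:
    for line in f:
        file_id, text = line.rstrip("\n").split("|")[:2]
        info = torchaudio.info(os.path.join(DATASET, "wavs", file_id + ".wav"))
        if info.num_frames / info.sample_rate <= MAX_WAV_SECONDS and len(text) <= MAX_TEXT_LEN:
            kept.append(line)

# hold out roughly 15% of the valid rows for evaluation
with open(os.path.join(DATASET, "metadata_eval.csv"), "w", encoding="utf-8") as f:
    f.writelines(kept[: max(1, len(kept) * 15 // 100)])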

Alternatively, PR #3296 implements a Gradio demo for data processing plus training and inference for the XTTS model. The PR also includes a Google Colab, and soon we will make a video showing how to use the demo.

@erogol (Member) commented Nov 28, 2023

Feel free to reopen if the comment above doesn't help.

@barghavanii

@erogol @Edresson Hi, I intend to fine-tune xtts_v2 for the Persian language. According to the comment above, 2-4 hours of data should be OK to fine-tune the model, but you didn't answer whether that is fine or not. Can you advise me ASAP? This is very urgent for me. #3229 (comment)
