
Commit
Readme, code structure, refactoring.
Tomiinek committed May 26, 2020
1 parent 861c6e9 commit ca00959
Showing 5 changed files with 101 additions and 724 deletions.
62 changes: 62 additions & 0 deletions CODE.md
@@ -0,0 +1,62 @@
## Code Structure

```
Root
╠═ train.py <- main file used for training of models
╠═ synthesize.py <- file for synthesis of spectrograms using checkpoints
╠═ gta.py <- script for generating ground-truth-aligned spectrograms
╠═ utils
║ ├── __init__.py <- various useful routines: build model, move to GPU, mask by lengths
║ ├── audio.py <- functions for audio processing, e.g., loading, spectrograms, mfcc, ...
║ ├── logging.py <- Tensorboard logger, logs spectrograms, alignments, audios, texts, ...
║ ├── samplers.py <- batch samplers producing balanced batches w.r.t. languages and speakers
║ └── text.py <- text routines: conversion to IDs, punctuation stripping, phonemicization
╠═ params
║ ├── params <- definition of default hyperparameters and their description
║ ├── singles <- multiple files with parameter settings for training of monolingual models
║ └── ... <- multiple files with parameter settings for training of multilingual models
╠═ notebooks
║ ├── analyze.ipynb <- basic dataset analysis for inspection of its properties and data distributions, with plots
║ ├── audio_test.ipynb <- experiments with audio processing and synthesis
║ ├── encoder_analyze.ipynb <- analysis of encoder outputs, speaker and language embeddings with plots
║ ├── code_switching_demo.ipynb <- code-switching synthesis demo
║ └── multi_training_demo.ipynb <- multilingual training demo
╠═ modules
║ ├── attention.py <- attention modules: location-sensitive att., forward att., and base class
║ ├── cbhg.py <- CBHG module known from Tacotron 1 with a simple highway layer (not convolutional)
║ ├── classifier.py <- adversarial classifier with gradient reversal layer, cosine similarity classifier
║ ├── encoder.py <- multiple encoder architectures: convolutional, recurrent, generated, separate, shared
║ ├── generated.py <- meta-generated layers: 1d convolution, batch normalization
║ ├── layers.py <- regularized LSTMs (dropout, zoneout), convolutional block and highway convolutional blocks
║ └── tacotron2.py <- implementation of Tacotron 2 with all its modules and loss functions
╠═ evaluation
║ ├── code-switched <- code-switching evaluation sentences
║ ├── in-domain <- in-domain (i.e., from CSS10) monolingual evaluation sentences
║ ├── out-domain <- out-of-domain (i.e., from Wikipedia) monolingual evaluation sentences in ten languages
║ ├── asr_request.py <- script for obtaining transcriptions of given audios from Google Cloud ASR
║ ├── cer_computer.py <- script for calculating the character error rate between pairs of transcripts
║ └── mcd_request.py <- script for computing the mel cepstral distortion between two spectrograms (includes DTW)
╠═ dataset_prepare
║ ├── mecab_convertor.py <- romanization of Japanese script
║ ├── pinyin_convertor.py <- romanization of Chinese script
║ ├── normalize_comvoi.sh <- basic shell script for downloading, extracting, and cleaning some Common Voice data
║ ├── normalize_css10.sh <- set of regular expressions for cleaning CSS10 dataset transcripts
║ └── normalize_mailabs.sh <- likely incomplete set of regular expressions for cleaning M-AILABS dataset transcripts
╠═ dataset
║ ├── dataset.py <- TTS dataset, contains mel and linear spec., texts, phonemes, speaker and language IDs and a
║ │ function for generating proper meta-files and spectrograms for some datasets (see loaders.py)
║ └── loaders.py <- methods for loading popular TTS datasets into a standardized Python list (see dataset.py above)
╚═ data
├── comvoi_clean
│ ├── all.txt <- prepared meta-file for cleaned Common Voice dataset
│ └── silence.sh <- script for removal of leading or trailing "silence" of Common Voice audios
├── css10
│ ├── train.txt <- prepared meta-file for training set of cleaned CSS10 dataset
│ └── val.txt <- prepared meta-file for validation set of cleaned CSS10 dataset
├── css_comvoi
│ ├── train.txt <- prepared meta-file for the training set of the mixed CSS10 + Common Voice dataset
│ └── val.txt <- prepared meta-file for the validation set of the mixed CSS10 + Common Voice dataset
└── prepare_css_spectrograms.py <- ad-hoc script for generating linear and mel spectrograms, see README.md
```
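The dataset meta-files listed above (e.g. `data/css10/train.txt`) are read line by line. As a rough illustration only, here is a minimal sketch of a reader for the pipe-separated `id|text|speaker|language` layout that appears in the README's synthesis example; the real files and `dataset/loaders.py` may carry additional columns, so the field names and the `Utterance`/`read_meta_file` helpers are assumptions, not the repository's actual loader.

```python
# Minimal sketch of a meta-file reader (NOT the repository's actual loader).
# Assumes the pipe-separated id|text|speaker|language layout seen in the
# README's synthesis example; real meta-files may contain more columns.

from dataclasses import dataclass
from typing import List

@dataclass
class Utterance:
    utterance_id: str
    text: str
    speaker: str
    language: str

def read_meta_file(path: str) -> List[Utterance]:
    """Parse pipe-separated meta-file lines into Utterance records."""
    utterances = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            fields = line.split("|")
            utterances.append(Utterance(*fields[:4]))
    return utterances

# Example line: "01|Dies ist ein Beispieltext.|00-fr|de"
```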
13 changes: 9 additions & 4 deletions README.md
@@ -2,7 +2,7 @@

<p align="center">
<a href="https://colab.research.google.com/github/Tomiinek/Multilingual_Text_to_Speech/blob/master/notebooks/code_switching_demo.ipynb"><b>Interactive synthesis demo</b></a><br>
<a href="http://tts.neqindi.cz"><b>Website with samples</b></a>
<a href="https://tomiinek.github.io/multilingual_speech_samples/"><b>Website with samples</b></a>
</p>

<p>&nbsp;</p>
@@ -27,9 +27,9 @@ We provide synthesized samples, training and evaluation data, source code, and pre-trained models.

**Interactive demos** introducing code-switching abilities and joint multilingual training of the generated model (trained on an enhanced CSS10 dataset) are available [here](https://colab.research.google.com/github/Tomiinek/Multilingual_Text_to_Speech/blob/master/notebooks/code_switching_demo.ipynb) and [here](https://github.com/Tomiinek/Multilingual_Text_to_Speech/blob/master/notebooks/multi_training_demo.ipynb), respectively.

-Many **samples synthesized using the three compared models** are at [this website](http://tts.neqindi.cz). It contains also a few samples synthesized by a monolingual vanilla Tacotron trained on LJ Speech with the Griffin-Lim vocoder (a sanity check of our implementation).
+Many **samples synthesized using the three compared models** are at [this website](https://tomiinek.github.io/multilingual_speech_samples/). It also contains a few samples synthesized by a monolingual vanilla Tacotron trained on LJ Speech with the Griffin-Lim vocoder (a sanity check of our implementation).

-Our best model supporting code-switching or voice-cloning can be downloaded [here](https://www.dropbox.com/s/hjrlg5d11er0u0c/generated_switching.pyt) and the best model trained on the whole CSS10 dataset without the ambition to do voice-cloning is available [here](https://www.dropbox.com/s/0vlz1fu2c6k1zfy/generated_training.pyt).
+Our best model supporting code-switching or voice cloning can be downloaded [here](https://github.com/Tomiinek/Multilingual_Text_to_Speech/releases/download/v1.0/generated_switching.pyt), and the best model trained on the whole CSS10 dataset, not aimed at voice cloning, is available [here](https://github.com/Tomiinek/Multilingual_Text_to_Speech/releases/download/v1.0/generated_training.pyt).
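After downloading a checkpoint, something like the following sketch should get you started. It assumes that `synthesize.py` reads pipe-separated meta-lines on stdin and accepts a `--checkpoint` argument, as the usage example later in this README suggests; the `synthesize` helper and the exact flags are assumptions, so adjust them to your checkout.

```python
# Sketch of driving synthesize.py from Python. Assumptions: stdin input of
# "id|text|speaker|language" lines and a --checkpoint flag, mirroring the
# shell usage example later in this README; this is not an official API.

import subprocess

def synthesize(meta_lines, checkpoint_path):
    """Pipe meta-lines into synthesize.py, as in the README's shell example."""
    payload = "\n".join(meta_lines) + "\n"
    subprocess.run(
        ["python3", "synthesize.py", "--checkpoint", checkpoint_path],
        input=payload.encode("utf-8"),
        check=True,
    )

# German text spoken by a French speaker, using the code-switching checkpoint:
synthesize(["01|Dies ist ein Beispieltext.|00-fr|de"], "generated_switching.pyt")
```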

<p>&nbsp;</p>

@@ -142,5 +142,10 @@ echo "01|Dies ist ein Beispieltext.|00-fr|de" | python3 synthesize.py --checkpoi

## Vocoding

-We used the WaveRNN model for vocoding. You can download [WaveRNN weights](https://www.dropbox.com/s/ydep8fdzbplaamu/wavernn_weight.pyt) pre-trained on the whole CSS10 dataset.
+We used the WaveRNN model for vocoding. You can download [WaveRNN weights](https://github.com/Tomiinek/Multilingual_Text_to_Speech/releases/download/v1.0/wavernn_weight.pyt) pre-trained on the whole CSS10 dataset.
For examples of usage, visit our interactive demos ([here](https://colab.research.google.com/github/Tomiinek/Multilingual_Text_to_Speech/blob/master/notebooks/code_switching_demo.ipynb) and [here](https://github.com/Tomiinek/Multilingual_Text_to_Speech/blob/master/notebooks/multi_training_demo.ipynb)) or [this repository](https://github.com/Tomiinek/WaveRNN).


## Code Structure

Please see [this file](https://github.com/Tomiinek/Multilingual_Text_to_Speech/blob/master/CODE.md) for more details about the source code and its structure.
