PyTorch implementation of the method described in the paper Voice Synthesis for in-the-Wild Speakers via a Phonological Loop.
VoiceLoop is a neural text-to-speech (TTS) system that transforms text to speech in voices that are sampled in the wild. Some demo samples can be found here.
Follow the instructions in Setup and then simply execute:
python generate.py --npz data/vctk/numpy_features_valid/p318_212.npz --spkr 13 --checkpoint models/vctk/bestmodel.pth
Results will be placed in models/vctk/results. The command generates two samples:
- The generated sample, saved with the gen_10.wav suffix.
- Its ground-truth (test) sample, also generated and saved with the orig.wav suffix.
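To sanity-check the run, here is a minimal sketch (not part of the repository) that lists the wav files written to the results directory and prints their durations; it assumes the output directory and file suffixes described above:

# Sketch: sanity-check the generated audio.
# Assumes outputs land in models/vctk/results as described above.
import glob
import wave

for path in sorted(glob.glob('models/vctk/results/*.wav')):
    w = wave.open(path, 'rb')
    duration = w.getnframes() / float(w.getframerate())
    print('%s: %.2f s, %d Hz' % (path, duration, w.getframerate()))
    w.close()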
You can also generate the same text but with a different speaker, specifically:
python generate.py --npz data/vctk/numpy_features_valid/p318_212.npz --spkr 18 --checkpoint models/vctk/bestmodel.pth
This will generate the following sample.
Here is the corresponding attention plot:
Legend: the X-axis is output time (acoustic samples); the Y-axis is the input (text/phonemes). The left figure is speaker 10, the right is speaker 14.
Finally, free text is also supported:
python generate.py --text "hello world" --spkr 1 --checkpoint models/vctk/bestmodel.pth
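To render the same text in several voices, a small wrapper can loop over speaker IDs and shell out to generate.py with the flags shown above. A minimal sketch, where the speaker IDs are illustrative and valid IDs depend on the trained model:

# Sketch: generate the same sentence with several speaker embeddings.
# Speaker IDs below are illustrative; valid IDs depend on the checkpoint.
import subprocess

text = "hello world"
checkpoint = "models/vctk/bestmodel.pth"

for spkr in [1, 13, 18]:
    subprocess.check_call([
        "python", "generate.py",
        "--text", text,
        "--spkr", str(spkr),
        "--checkpoint", checkpoint,
    ])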
Requirements: Linux/OSX, Python 2.7 and PyTorch 0.1.12. The current version of the code requires CUDA support for training. Generation can be done on the CPU.
git clone https://github.com/facebookresearch/loop.git
cd loop
pip install -r scripts/requirements.txt
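A quick sanity check of the environment against the requirements above (a sketch; it only prints versions and CUDA availability):

# Sketch: verify the environment matches the requirements listed above.
import sys
import torch

print(sys.version)                # expected: Python 2.7.x
print(torch.__version__)          # expected: 0.1.12
print(torch.cuda.is_available())  # CUDA is needed for training; generation can run on CPU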
The data used to train the models in the paper can be downloaded via:
bash scripts/download_data.sh
The script downloads and preprocesses a subset of VCTK. This subset contains speakers with an American accent.
The dataset was preprocessed using Merlin: from each audio clip we extracted vocoder features using the WORLD vocoder. After downloading, the dataset will be located under the data subfolder as follows:
loop
├── data
│   └── vctk
│       ├── norm_info
│       │   └── norm.dat
│       ├── numpy_features
│       │   ├── p294_001.npz
│       │   ├── p294_002.npz
│       │   └── ...
│       └── numpy_features_valid
The preprocessing pipeline can be executed using the following script by Kyle Kastner: https://gist.github.com/kastnerkyle/cc0ac48d34860c5bb3f9112f4d9a0300.
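To get a feel for the extracted features, one of the downloaded .npz files can be opened with NumPy. The sketch below only lists the arrays each file contains; the exact key names and shapes depend on the Merlin/WORLD preprocessing, so none are assumed here:

# Sketch: inspect one preprocessed utterance (path taken from the generation example above).
import numpy as np

features = np.load('data/vctk/numpy_features_valid/p318_212.npz')
for key in features.files:
    arr = features[key]
    print('%s: shape %s, dtype %s' % (key, arr.shape, arr.dtype))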
Pretrained models can be downloaded via:
bash scripts/download_models.sh
After downloading, the models will be located under the models subfolder as follows:
loop
├── data
└── models
    ├── vctk
    │   ├── args.pth
    │   └── bestmodel.pth
    └── vctk_alt
Finally, speech generation requires SPTK 3.9 and the WORLD vocoder, as in Merlin. To download the executables:
bash scripts/download_tools.sh
This results in the following subdirectories:
loop
├── data
├── models
└── tools
    ├── SPTK-3.9
    └── WORLD
To train a new model on VCTK, first train the model using a noise level of 4 and an input sequence length of 100:
python train.py --expName vctk --data data/vctk --noise 4 --seq-len 100 --epochs 90
Then, continue training the model using a noise level of 2 on full sequences:
python train.py --expName vctk_noise_2 --data data/vctk --checkpoint checkpoints/vctk/bestmodel.pth --noise 2 --seq-len 1000 --epochs 90
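The two stages can be chained in a small driver script. The sketch below simply replays the two commands above in order; all paths and flags are copied from this section:

# Sketch: run the two training stages above back to back.
import subprocess

# Stage 1: noise level 4, truncated sequences.
subprocess.check_call([
    "python", "train.py",
    "--expName", "vctk",
    "--data", "data/vctk",
    "--noise", "4",
    "--seq-len", "100",
    "--epochs", "90",
])

# Stage 2: warm-start from the best stage-1 checkpoint, noise level 2, full sequences.
subprocess.check_call([
    "python", "train.py",
    "--expName", "vctk_noise_2",
    "--data", "data/vctk",
    "--checkpoint", "checkpoints/vctk/bestmodel.pth",
    "--noise", "2",
    "--seq-len", "1000",
    "--epochs", "90",
])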
If you find this code useful in your research then please cite:
@article{taigman2017voice,
  title         = {Voice Synthesis for in-the-Wild Speakers via a Phonological Loop},
  author        = {Taigman, Yaniv and Wolf, Lior and Polyak, Adam and Nachmani, Eliya},
  journal       = {ArXiv e-prints},
  archivePrefix = {arXiv},
  eprinttype    = {arxiv},
  eprint        = {1707.06588},
  primaryClass  = {cs.CL},
  year          = {2017},
  month         = {July},
}
Loop has a CC-BY-NC license.