This is a composite speech synthesis model that reconstructs both a mel-spectrogram and a waveform from text. The model generates a waveform from a sequence of symbols separated by spaces. It is built on modified ForwardTacotron and modified MelGAN frameworks.
Metric | Value |
---|---|
Source framework | PyTorch* |
The text-to-speech-en-0001-duration-prediction model is a ForwardTacotron-based duration predictor for symbols.
Metric | Value |
---|---|
GFlops | 15.84 |
MParams | 13.569 |
Sequence, name: input_seq, shape: 1, 512, format: B, C, where:
- B - batch size
- C - number of symbols in sequence
Mask for input sequence, name: input_mask, shape: 1, 1, 512, format: B, D, C, where:
- B - batch size
- D - extra dimension for multiplication
- C - number of symbols in sequence
Mask for relative position representation in attention, name: pos_mask, shape: 1, 1, 512, 512, format: B, D, C, C, where:
- B - batch size
- D - extra dimension for multiplication
- C - number of symbols in sequence
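As a minimal sketch of how the first two inputs could be prepared, assume the text has already been converted to integer symbol IDs. The padding ID `0`, the dtypes, and the convention that the mask is 1 for real symbols and 0 for padding are assumptions, not taken from the model description. `pos_mask` is left out because its construction is not described here.

```python
import numpy as np

MAX_SYMBOLS = 512  # fixed sequence length C of the duration-prediction model
PAD_ID = 0         # assumption: ID used to pad short sequences


def prepare_inputs(symbol_ids, max_len=MAX_SYMBOLS, pad_id=PAD_ID):
    """Pad symbol IDs to max_len and build a mask marking the real symbols."""
    n = len(symbol_ids)
    if n > max_len:
        raise ValueError(f"sequence has {n} symbols, model accepts at most {max_len}")

    # input_seq, format B, C -> shape (1, 512)
    input_seq = np.full((1, max_len), pad_id, dtype=np.int64)
    input_seq[0, :n] = symbol_ids

    # input_mask, format B, D, C -> shape (1, 1, 512)
    # assumption: 1.0 marks a real symbol, 0.0 marks padding
    input_mask = np.zeros((1, 1, max_len), dtype=np.float32)
    input_mask[0, 0, :n] = 1.0
    return input_seq, input_mask


# Hypothetical symbol IDs, used only to show the resulting shapes
input_seq, input_mask = prepare_inputs([5, 17, 3, 42])
print(input_seq.shape, input_mask.shape)  # (1, 512) (1, 1, 512)
```

The extra dimension D in the mask has size 1 so that the mask can be broadcast when it is multiplied with intermediate tensors.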
- Duration for input symbols, name: duration, shape: 1, 512, 1, format: B, C, H. Contains the predicted duration for each symbol in the sequence.
  - B - batch size
  - C - number of symbols in sequence
  - H - empty dimension
- Processed embeddings, name: embeddings, shape: 1, 512, 256, format: BxCxH. Contains processed embeddings for each symbol in the sequence.
  - B - batch size
  - C - number of symbols in sequence
  - H - height of the intermediate feature map
The text-to-speech-en-0001-regression model accepts processed embeddings aligned by duration and produces a mel-spectrogram. Aligning means repeating each embedding as many times as its duration: for example, if the duration is [2, 3] and the processed embeddings are [[1, 2], [3, 4]], the aligned embeddings are [[1, 2], [1, 2], [3, 4], [3, 4], [3, 4]].
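The alignment step can be sketched with NumPy as follows. With a duration of [2, 3], the first embedding row is repeated twice and the second three times. The helper name `align_by_duration` is hypothetical, and the sketch assumes the durations are already non-negative integers; how predicted durations are rounded is not specified here.

```python
import numpy as np


def align_by_duration(embeddings, durations):
    """Repeat each embedding row as many times as its duration says."""
    embeddings = np.asarray(embeddings)
    durations = np.asarray(durations, dtype=np.int64)
    # np.repeat with per-row counts repeats row i exactly durations[i] times
    return np.repeat(embeddings, durations, axis=0)


aligned = align_by_duration([[1, 2], [3, 4]], [2, 3])
print(aligned.tolist())
# [[1, 2], [1, 2], [3, 4], [3, 4], [3, 4]]
```

The number of aligned rows equals the sum of the durations, which becomes the time dimension of the regression model input.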
Metric | Value |
---|---|
GFlops | 7.65 |
MParams | 4.96 |
Processed embeddings aligned by durations, name: data, shape: 1x512x256, format: BxTxC, where:
- B - batch size
- T - time in mel-spectrogram
- C - processed embedding dimension
Mask for 'data' by time dimension, name: data_mask, shape: 1x1x512, format: BxDxT, where:
- B - batch size
- D - extra dimension for multiplication
- T - time in mel-spectrogram
Mask for relative position representation in attention, name: pos_mask, shape: 1x1x512x512, format: BxDxCxC, where:
- B - batch size
- D - extra dimension for multiplication
- C - number of symbols in sequence
Mel-spectrogram, name: mel, shape: 80x512, format: CxT, where:
- T - time in mel-spectrogram
- C - number of rows in mel-spectrogram
The text-to-speech-en-0001-generation model is a MelGAN-based audio generator.
Metric | Value |
---|---|
GFlops | 48.38 |
MParams | 12.77 |
Mel-spectrogram, name: mel, shape: 1x80x128, format: BxCxT, where:
- B - batch size
- C - number of rows in mel-spectrogram
- T - time in mel-spectrogram
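The generation model takes 128 time steps per call, while the regression model produces a mel-spectrogram with 512 time steps, so a longer spectrogram has to be fed in pieces. The following is only a sketch under stated assumptions: it uses non-overlapping windows and zero-pads the last one, and the helper name `split_mel` is hypothetical. An actual pipeline may split the spectrogram differently.

```python
import numpy as np

WINDOW = 128  # time length T accepted by the generation model


def split_mel(mel, window=WINDOW):
    """Split a (C, T) mel-spectrogram into (1, C, window) chunks."""
    n_frames = mel.shape[1]
    pad = (-n_frames) % window  # frames needed to reach a multiple of window
    # assumption: zero-padding the tail of the last chunk is acceptable
    mel = np.pad(mel, ((0, 0), (0, pad)))
    # add the batch dimension B so each chunk matches the BxCxT format
    return [mel[np.newaxis, :, i:i + window] for i in range(0, mel.shape[1], window)]


mel = np.zeros((80, 512), dtype=np.float32)  # same shape as the regression output
chunks = split_mel(mel)
print(len(chunks), chunks[0].shape)  # 4 (1, 80, 128)
```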
Audio, name: audio, shape: 32768, format: T, where:
- T - time in audio at a sampling rate of 22050 Hz (~1.5 sec).
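The stated duration, and the number of audio samples produced per mel-spectrogram time step, follow directly from the shapes above. A short check of the arithmetic, using only the figures given in this description:

```python
SAMPLE_RATE = 22050    # Hz, from the audio output description
AUDIO_SAMPLES = 32768  # length T of the audio output
MEL_FRAMES = 128       # length T of the mel-spectrogram input

duration_sec = AUDIO_SAMPLES / SAMPLE_RATE
samples_per_frame = AUDIO_SAMPLES // MEL_FRAMES

print(f"{duration_sec:.3f} s of audio per call")            # 1.486 s of audio per call
print(f"{samples_per_frame} samples per mel time step")     # 256 samples per mel time step
```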
[*] Other names and brands may be claimed as the property of others.