# A TensorFlow implementation of DeepMind's WaveNet paper
This is a TensorFlow implementation of the [WaveNet generative neural
network architecture](https://deepmind.com/blog/wavenet-generative-model-raw-audio/) for <b>text</b> generation.
## Previous work
<table style="border-collapse: collapse">
<tr>
<td>
<p>
Originally, the WaveNet neural network architecture directly generates a raw audio waveform,
showing excellent results in text-to-speech and general audio generation (see the
DeepMind blog post and paper for details).
</p>
<p>
The network models the conditional probability of the next
sample in the audio waveform, given all previous samples and possibly
additional parameters.
</p>
<p>
After an audio preprocessing step, the input waveform is quantized to a fixed integer range.
The integer amplitudes are then one-hot encoded to produce a tensor of shape <code>(num_samples, num_channels)</code>.
</p>
<p>
A convolutional layer that only accesses the current and previous inputs then reduces the channel dimension.
</p>
<p>
The core of the network is constructed as a stack of <em>causal dilated layers</em>, each of which is a
dilated convolution (convolution with holes), which only accesses the current and past audio samples.
</p>
<p>
The outputs of all layers are combined and extended back to the original number
of channels by a series of dense postprocessing layers, followed by a softmax
function to transform the outputs into a categorical distribution.
</p>
<p>
The loss function is the cross-entropy between the output for each timestep and the input at the next timestep.
</p>
<p>
In this repository, the network implementation can be found in <a href="./wavenet/wavenet.py">wavenet.py</a>.
</p>
</td>
<td width="300">
<img src="images/network.png" width="300"></img>
</td>
</tr>
</table>
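
The causal dilated convolution described above can be illustrated with a minimal pure-Python sketch. This is not the code in `wavenet.py`: the function names, the single-channel signal, and the two-tap filters are assumptions chosen to keep the example small.

```python
def causal_dilated_conv(signal, weights, dilation):
    """Apply a 1-D causal dilated convolution to a list of floats.

    The output at time t depends only on signal[t], signal[t - dilation],
    signal[t - 2 * dilation], and so on. Positions before the start of the
    signal are treated as zero, so no future sample is ever accessed.
    """
    output = []
    for t in range(len(signal)):
        total = 0.0
        for k, w in enumerate(weights):
            index = t - k * dilation  # look back k * dilation steps
            if index >= 0:
                total += w * signal[index]
        output.append(total)
    return output


def receptive_field(filter_width, dilations):
    """Number of input samples that influence one output of the stack."""
    return (filter_width - 1) * sum(dilations) + 1


signal = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
layer1 = causal_dilated_conv(signal, [1.0, 1.0], dilation=1)
layer2 = causal_dilated_conv(layer1, [1.0, 1.0], dilation=2)
```

With dilations 1 and 2 and a filter width of 2, each value of `layer2` sums the current sample and the three before it, which matches `receptive_field(2, [1, 2])`. Doubling the dilation at each layer is what lets a deep stack cover a long history with few layers.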
## New approach
This work is based on the implementation of the original WaveNet model ([Wavenet](https://github.com/ibab/tensorflow-wavenet)), with some modifications.
The modifications are needed because we use the WaveNet model as a <b style="color: red;">text generator</b>. We use raw text data (characters) instead of raw audio files, and once the network is trained, we use the learned conditional probability to generate samples (characters) one at a time in an autoregressive process.
Only single-byte characters (decimal codes 0 to 255, which includes printable ASCII) are supported right now.
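
One way to picture the switch from audio to text is to treat each character code as the quantized sample. The sketch below assumes 256 channels (one per code from 0 to 255) and uses hypothetical helper names; the reader code in this repository may work differently.

```python
NUM_CHANNELS = 256  # assumption: one channel per character code 0..255


def encode_text(text):
    """Map each character to its integer code, rejecting codes above 255."""
    codes = [ord(ch) for ch in text]
    for code in codes:
        if code >= NUM_CHANNELS:
            raise ValueError("unsupported character code: %d" % code)
    return codes


def one_hot(codes, num_channels=NUM_CHANNELS):
    """Return a (num_samples, num_channels) nested list of 0/1 values."""
    rows = []
    for code in codes:
        row = [0] * num_channels
        row[code] = 1
        rows.append(row)
    return rows


def decode_codes(codes):
    """Inverse of encode_text: turn integer codes back into a string."""
    return "".join(chr(code) for code in codes)


codes = encode_text("the buy-out")
encoded = one_hot(codes)
```

Here `encoded` plays the role of the one-hot tensor of shape `(num_samples, num_channels)` from the audio pipeline, and `decode_codes` turns generated codes back into text.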
## Results
The results are pretty interesting! Given enough text and training, the model learns the probability distribution of character sequences in a language and can later generate very similar text.
For example, using the <a href="./data/texts/ptb.train.txt">Penn Tree Bank</a> (PTB) dataset, after only 15000 training steps (with a small parameter configuration), this was the generated output (the final loss was around 1.1):
"Prediction is: put yearan that the buy-out foods secrection the neversation including common spending in <unk> roship losses facilities of billion which about <unk> to reston was N computers we world own grace a febels the declining with though newly by of genety your with the index and review tore trading of helped one of listration in younger 's turn genering to still at an up a bo was effected capital back of the strategist institution of net posts and average wrading of the our yet a letter to acquester lowers for his prebalisting months and investment shrough the additiones markete was budget whites of mostly <unk> into dealing tripled as the assession produced on exchange was grade and the buy-out invested thy proveage dould <unk> there in the hardle of a prevoling <unk> by told in <unk> to that upward benefited the ratch before six became they day who evergn as <unk> even far over the chief effective lower includes but they were <unk> took which bos N N a years to taker <unk> history and <unk> say other mortality him its been as $ N million by need we 're as he revised third-quarter third-quarter the <unk> the gaining wall spends that creations and costs requests ' growtiner by mortality was conful a conversully down to heath over tobal specifie treasuph the like he <unk> they its N N creation death of the u.s. should <unk> as cashight have intented frehen 's <unk> and substastalional producer corp. and solicod househ block lehman such as include him they would followed"
This is really wonderful! We can see that the original WaveNet model has a great capacity to learn and store long encoded text information in its weights (and not only audio or image information). This "text generator" WaveNet learned how to write English words just by predicting characters one by one, and sometimes even learned which word to use based on context.
This output is far from perfect, but the model was trained on a CPU-only machine (no GPU) using a small parameter configuration in just two hours! I hope somebody with a better computer can explore the potential of this implementation.
To check these results, run the following in a terminal (it uses the trained model checkpoint I uploaded to the repository):
```bash
python generate.py --text_out_path=output.txt --samples 2000 ./logdir/train/2016-10-02T10-45-10/model.ckpt-14999
```
## Requirements
TensorFlow needs to be installed before running the training script.
TensorFlow 0.10 and the current `master` version are supported.
## Training the network
You can use any text (`.txt`) file.
To train the network, run
```bash
python train.py --data_dir=data
```
where `data` is a directory containing `.txt` files.
The script will recursively collect all `.txt` files in the directory.
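
Recursive collection of this kind can be sketched with the standard library. The function name `find_text_files` is hypothetical, and the training script may implement the search differently.

```python
import fnmatch
import os


def find_text_files(directory, pattern="*.txt"):
    """Recursively collect paths of files matching pattern under directory."""
    matches = []
    for root, _dirnames, filenames in os.walk(directory):
        for filename in fnmatch.filter(filenames, pattern):
            matches.append(os.path.join(root, filename))
    return sorted(matches)
```

Calling `find_text_files("data")` returns every `.txt` path below `data`, including files in nested subdirectories, and an empty list if nothing matches.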
You can see documentation on each of the training settings by running
```bash
python train.py --help
```
You can find the configuration of the model parameters in [`wavenet_params.json`](./wavenet_params.json).
These need to stay the same between training and generation.
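
A plausible reason is that a checkpoint can only be restored into a network built with the same layer shapes. A small guard like the following could catch accidental changes before generation; the helper names and the example keys (`filter_width`, `dilations`) are illustrative assumptions, not necessarily the contents of `wavenet_params.json`.

```python
import json


def load_params(path):
    """Read model settings from a JSON file into a dict."""
    with open(path) as f:
        return json.load(f)


def find_mismatches(train_params, generate_params):
    """Return the sorted names of settings whose values differ."""
    keys = set(train_params) | set(generate_params)
    return sorted(k for k in keys if train_params.get(k) != generate_params.get(k))


# Hypothetical settings, for illustration only.
train_params = {"filter_width": 2, "dilations": [1, 2, 4, 8]}
generate_params = {"filter_width": 2, "dilations": [1, 2, 4]}
mismatches = find_mismatches(train_params, generate_params)
```

In this example `mismatches` names the `dilations` setting, which signals that the generation run would build a different network from the one that was trained.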
## Generating text
You can use the `generate.py` script to generate text using a previously trained model.
Run
```bash
python generate.py --samples 16000 model.ckpt-1000
```
where `model.ckpt-1000` needs to be a previously saved model.
You can find these in the `logdir`.
The `--samples` parameter specifies how many characters you would like to generate.
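
Conceptually, generation is an autoregressive loop: ask the model for a distribution over the next character, sample from it, append the result, and repeat. The sketch below is not the logic of `generate.py`; `toy_model` is a deterministic stand-in for the trained network, and all names are assumptions.

```python
import random


def sample_from(probabilities, rng):
    """Draw an index according to a categorical distribution."""
    threshold = rng.random()
    cumulative = 0.0
    for index, p in enumerate(probabilities):
        cumulative += p
        if threshold < cumulative:
            return index
    return len(probabilities) - 1  # guard against rounding error


def generate(predict_proba, seed_codes, num_samples, rng):
    """Autoregressively extend seed_codes by num_samples character codes."""
    codes = list(seed_codes)
    for _ in range(num_samples):
        probabilities = predict_proba(codes)  # distribution over the next code
        codes.append(sample_from(probabilities, rng))
    return codes


def toy_model(codes):
    """Stand-in for the trained network: always alternates 'a' and 'b'."""
    probabilities = [0.0] * 256
    next_code = ord("b") if codes[-1] == ord("a") else ord("a")
    probabilities[next_code] = 1.0
    return probabilities


rng = random.Random(0)
generated = generate(toy_model, [ord("a")], num_samples=5, rng=rng)
text = "".join(chr(c) for c in generated)
```

Because the toy model puts all probability on a single character, `text` is `"ababab"`; a real network would return a softer distribution, so repeated runs would produce different text.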
The generated text can be stored as a
`.txt` file by using the `--text_out_path` parameter:
```bash
python generate.py --text_out_path=mytext.txt --samples 1500 model.ckpt-1000
```
Passing `--save_every` in addition to `--text_out_path` will save the in-progress text file every n samples.
```bash
python generate.py --text_out_path=mytext.txt --save_every 500 --samples 1500 model.ckpt-1000
```
Fast generation is enabled by default.
It uses the implementation from the [Fast Wavenet](https://github.com/tomlepaine/fast-wavenet) repository.
You can follow the link for an explanation of how it works.
This reduces the time needed to generate samples to a few minutes.
To disable fast generation:
```bash
python generate.py --samples 1500 model.ckpt-1000 --fast_generation=false
```
## Missing features
Currently, there is no conditioning on extra information.