![# Sonosco](./docs/imgs/sonosco_3.jpg)
<br>
<br>
<br>
<br>

Sonosco (from Lat. sonus - sound and nōscō - I know, recognize)
is a library for training and deploying deep speech recognition models.

The goal of this project is to enable fast, repeatable and structured training of deep
automatic speech recognition (ASR) models as well as providing a transcription server (REST API & frontend) to
try out the trained models for transcription. <br>
Additionally, we provide interfaces to ROS in order to use it with
the anthropomimetic robot [Roboy](https://roboy.org/).
<br>
<br>
<br>

___
### Installation

#### Via pip
The easiest way to use Sonosco's functionality is via pip:
```
pip install sonosco
```
**Note**: Sonosco requires Python 3.7 or higher.

For reliability, we recommend using an environment virtualization tool, like virtualenv or conda.
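For example, using Python's built-in `venv` module (the environment name `sonosco-env` below is just a placeholder):
```
python3.7 -m venv sonosco-env
source sonosco-env/bin/activate
pip install sonosco
```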

<br>
<br>
#### For developers or trying out the transcription server

Clone the repository and install dependencies:
```
# Create a virtual python environment to not pollute the global setup
conda create -n 'sonosco' python=3.7
# activate the virtual environment
conda activate sonosco
# Clone the repo
git clone https://github.com/Roboy/sonosco.git
# Install normal requirements
pip install -r requirements.txt
# Link your local sonosco clone into your virtual environment
pip install .
```
Now you can check out some of the [Getting Started]() tutorials to train a model or use
the transcription server.
<br>
<br>
<br>
____________
### High Level Design


![# High-Level-Design](./docs/imgs/high-level-design.svg)

The project is split into four parts that build on each other:

For data (pre-)processing, scripts are provided to download and preprocess
some publicly available datasets for speech recognition. Additionally,
we provide scripts and functions to create manifest files
(i.e. catalog files) for your own data and to merge existing manifest files
into one.

This data, or rather the manifest files, can then be used to easily train and
evaluate an ASR model. We provide several ASR model architectures, such as LAS,
TDS and DeepSpeech2, but custom PyTorch models can be designed and trained as well.

The trained model can then be used in a transcription server, which consists
of a REST API as well as a simple Vue.js frontend to transcribe voice recorded
by a microphone and to compare the transcription results to those of other models (which can
be downloaded from our [Github](https://github.com/Roboy/sonosco) repository).

Further, we provide example code showing how to use different ASR models with ROS,
and in particular with the Roboy ROS interfaces (i.e. topics & messages).

<br>
<br>


______
### Data (-processing)

##### Downloading publicly available datasets
We provide scripts to download and process the following publicly available datasets:
* [An4](http://www.speech.cs.cmu.edu/databases/an4/) - Alphanumeric database
* [Librispeech](http://www.openslr.org/12) - readings of English books
* [TED-LIUM 3](https://lium.univ-lemans.fr/en/ted-lium3/) (ted3) - TED talks
* [Voxforge](http://www.voxforge.org/home/downloads)
* Common Voice (old version)

Simply run the respective script in `sonosco > datasets > download_datasets` with the
output path flag, and it will download and process the dataset. Further, it will create
a manifest file for the dataset.

For example:

```
python an4.py --target-dir temp/data/an4
```
<br>
<br>

##### Creating a manifest from your own data

If you want to create a manifest from your own data, order your files as follows:
```
data_directory
└───txt
│   │   transcription01.txt
│   │   transcription02.txt
└───wav
    │   audio01.wav
    │   audio02.wav
```
To create a manifest, run the `create_manifest.py` script with the data directory and an output file
to automatically create a manifest file for your data.

For example:
```
python create_manifest.py --data_path path/to/data_directory --output-file temp/data/manifest.csv
```
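
The resulting manifest is a plain CSV catalog; its exact columns may differ, but conceptually each row pairs an audio file with its transcription, roughly like this (paths are illustrative):
```
/path/to/wav/audio01.wav,/path/to/txt/transcription01.txt
/path/to/wav/audio02.wav,/path/to/txt/transcription02.txt
...
```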

<br>
<br>

##### Merging manifest files

In order to merge multiple manifests into one, just specify a folder that contains all the manifest
files to be merged and run the `merge_manifest.py` script.
This will look for all .csv files and merge their content into the specified output file.

For example:
```
python merge_manifest.py --merge-dir path/to/manifests_dir --output-path temp/manifests/merged_manifest.csv
```

<br>
<br>

___
### Model Training

One goal of this framework is to keep training as easy as possible and to enable
keeping track of already conducted experiments.

#### Training, Testing and Inference

Fundamentally, you can run the training, testing and inference scripts the same way:
```
python3 train.py --config /path/to/config/file.yaml
python3 test.py --config /path/to/config/file.yaml
python3 infer.py --config /path/to/config/file.yaml
```
The scripts are initialised via configuration files.

#### Configuration

The configuration file contains the arguments for the ModelWrapper initialisation as well as extra parameters, for example:
```
train:
  ...
  log-dir: 'logs'                   # Location for log files
  def-dir: 'examples/checkpoints/'  # Default location to save/load models
  model-name: 'asr_final.pth'       # File name to save the best model
  sample-rate: 16000                # Sample rate
  window: 'hamming'                 # Window type for spectrogram generation
  batch-size: 32                    # Batch size for training
  checkpoint: True                  # Enables checkpoint saving of model
  ...
```
You can find more configuration examples with descriptions in the config directory.
<br>
<br>

#### Analysis Object Model

For model training, there are multiple objects that interact with each other.

![# Analysis Object Model](./docs/imgs/aom.svg)

For model training, one can define different metrics that get evaluated during the training
process. These metrics are evaluated at specified steps during an epoch and during
validation.<br>
Sonosco already provides different metrics, such as [Word Error Rate (WER)]() or
[Character Error Rate (CER)](), but additional metrics can be created following the same scheme.
See [Metrics]().
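
For illustration, a character error rate is simply an edit distance normalised by the reference length. The snippet below is a standalone sketch of that idea, not Sonosco's metric interface:
```
# Standalone sketch: character error rate as normalised Levenshtein distance.
# Sonosco's built-in metrics follow their own interface; this only shows the idea.
def character_error_rate(reference: str, hypothesis: str) -> float:
    prev = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        curr = [i] + [0] * len(hypothesis)
        for j, hyp_char in enumerate(hypothesis, start=1):
            cost = 0 if ref_char == hyp_char else 1
            curr[j] = min(prev[j] + 1,         # deletion
                          curr[j - 1] + 1,     # insertion
                          prev[j - 1] + cost)  # substitution
        prev = curr
    return prev[-1] / max(len(reference), 1)

print(character_error_rate("speech recognition", "speach recogniton"))  # ~0.11
```
WER follows the same formula, computed over word sequences instead of characters.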

Additionally, callbacks can be defined. A callback is arbitrary code that can be executed during
training. Sonosco provides several callbacks out of the box, such as [Learning Rate Reduction](),
[ModelSerializationCallback](), [TensorboardCallback](), ... <br>
Custom callbacks can be defined following the examples. See [Callbacks]().
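
As a purely illustrative example (this is not Sonosco's actual callback interface), a callback is conceptually just an object that the training loop invokes at defined points, e.g. after every epoch:
```
# Illustrative only: a standalone early-stopping object a training loop could
# call after each epoch. Sonosco's real callbacks follow the interface
# documented in the Callbacks module.
class EarlyStopping:
    def __init__(self, patience: int = 3):
        self.patience = patience
        self.best_loss = float("inf")
        self.bad_epochs = 0

    def __call__(self, epoch: int, validation_loss: float) -> bool:
        """Return True if training should be stopped."""
        if validation_loss < self.best_loss:
            self.best_loss = validation_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience
```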

Most importantly, a model needs to be defined. The model is basically any PyTorch module. For
(de-)serialization, this model needs to conform to the [Serialization Guide]().<br>
Sonosco already provides model architectures that can simply be imported, such as
[Listen Attend Spell](), [Time-depth Separable Convolutions]() and [DeepSpeech2]().

We created a specific AudioDataset class that is based on PyTorch's Dataset class.
This AudioDataset requires an AudioDataProcessor in order to process the specified manifest file.
Further, we created a special AudioDataLoader based on PyTorch's DataLoader class, which
takes the AudioDataset and provides the data in batches to the model training.

Metrics, callbacks, the model and the AudioDataLoader are then provided to the ModelTrainer.
This ModelTrainer takes care of the training process. See [Getting Started]().

The ModelTrainer can then be registered to the Experiment, which takes care of provenance.
I.e. when starting the training, all your code is time-stamped and saved in a separate directory,
so you can always repeat the same experiment. Additionally, the serialized model and ModelTrainer,
logs and tensorboard logs are saved in this folder.

Further, a Serializer needs to be provided to the Experiment. This object can serialize any
arbitrary class with its parameters, which can then be deserialized using the Deserializer.<br>
When the `Experiment.stop()` method is called, the model and the ModelTrainer get serialized,
so that you can simply continue the training with all current parameters (such as epoch steps, ...)
by deserializing the ModelTrainer and resuming training.
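
Putting these pieces together, the intended workflow looks roughly like the pseudocode below. The class names come from the object model above, but the constructor arguments and method names are illustrative assumptions, not the exact API; see the [Getting Started]() tutorials and the examples for working code.
```
# Illustrative pseudocode of the workflow described above. Argument and method
# names are assumptions -- consult the Getting Started tutorials for real code.

# Data: a processor parses the manifest, the dataset/loader feed batches.
processor = AudioDataProcessor(sample_rate=16000, window='hamming')
dataset = AudioDataset(processor, manifest_filepath='temp/data/manifest.csv')
loader = AudioDataLoader(dataset, batch_size=32)

# Model: any PyTorch module that conforms to the serialization guide.
model = DeepSpeech2(labels=dataset.labels)

# Trainer: bundles the model, data, metrics and callbacks.
trainer = ModelTrainer(model, train_data_loader=loader,
                       metrics=[word_error_rate, character_error_rate],
                       callbacks=[TensorboardCallback(), ModelSerializationCallback()])

# Experiment: provenance -- time-stamps your code and collects logs,
# tensorboard files and serialized objects in one directory.
experiment = Experiment('asr_experiment', serializer=Serializer())
experiment.register_model_trainer(trainer)  # illustrative name for registering the trainer

trainer.start_training()
experiment.stop()  # serializes the model and ModelTrainer so training can be resumed
```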
<br>
<br>

___
### Acknowledgements

This project is partially based on SeanNaren's [deepspeech.pytorch](https://github.com/SeanNaren/deepspeech.pytorch) repository.