What would it mean to generate images from the sound of speech, rather than from the semantics of the words themselves? I built this with a generative model called a Variational Autoencoder (VAE), for Fundamentals of Speech Recognition (E6998), a graduate-level speech recognition class at Columbia.
The project was inspired by this incredible paper by Zach Lieberman: https://github.com/alex-calderwood/phonaesthesia/blob/master/papers/leiberman_paper.pdf
I never really got it to generate anything interesting, but I think that was a finding in itself, and I want to return to it one day.
Requires:
- spacy
- tensorflow
- opencv (for image processing)
- requests (for scraping ImageNet data)
-
To obtain images, run:
python image_processing.py
Run it until you have enough images. You will probably want to run this script in parallel, or on multiple instances.
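For reference, here is a minimal sketch of what this step amounts to, assuming a plain-text file of ImageNet image URLs (urls.txt, one per line) and an output directory of data/imagenet/raw/. Both names are assumptions, and image_processing.py itself may work differently.

    # Sketch of the scraping step. urls.txt and OUT_DIR are assumptions;
    # the real image_processing.py may organize this differently.
    import os
    import requests

    OUT_DIR = "data/imagenet/raw"
    os.makedirs(OUT_DIR, exist_ok=True)

    with open("urls.txt") as f:
        for i, line in enumerate(f):
            url = line.strip()
            if not url:
                continue
            try:
                resp = requests.get(url, timeout=10)
                resp.raise_for_status()
            except requests.RequestException:
                continue  # dead links are common; skip and move on
            with open(os.path.join(OUT_DIR, f"{i:06d}.jpg"), "wb") as out:
                out.write(resp.content)

Each request is network-bound, so the easy way to parallelize is to run several copies with urls.txt split across them, which is why the note above suggests multiple instances.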
-
To transform your data into 32 x 32 pixel images, run:
python resize.py
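The core of this step is a plain OpenCV resize. The sketch below assumes raw downloads in data/imagenet/raw/ and writes to the data/imagenet/scaled/ directory mentioned below; resize.py may lay things out differently.

    # Sketch of the resize step using OpenCV. The source directory is
    # an assumption; the destination is the one read by vae_gan.py.
    import glob
    import os
    import cv2

    SRC = "data/imagenet/raw"
    DST = "data/imagenet/scaled"
    os.makedirs(DST, exist_ok=True)

    for path in glob.glob(os.path.join(SRC, "*.jpg")):
        img = cv2.imread(path)
        if img is None:
            continue  # unreadable or truncated download
        small = cv2.resize(img, (32, 32), interpolation=cv2.INTER_AREA)
        cv2.imwrite(os.path.join(DST, os.path.basename(path)), small)

INTER_AREA is the usual interpolation choice when shrinking, since it averages over source pixels instead of aliasing.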
-
To generate new images based on the phonemes of the filenames in data/imagenet/scaled/, run:
python vae_gan.py
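For orientation, here is a minimal sketch of the VAE half of the model in TensorFlow: an encoder producing a mean and log-variance, the reparameterization trick, and a decoder reconstructing 32 x 32 images. The layer sizes, names, and Keras style are illustrative assumptions, not the contents of vae_gan.py, which presumably also conditions on a phoneme representation of the filenames and adds a GAN discriminator.

    # Minimal VAE sketch for 32 x 32 RGB images. Sizes and names are
    # illustrative; vae_gan.py adds phoneme conditioning and a GAN loss.
    import tensorflow as tf

    LATENT = 64  # hypothetical latent dimensionality

    encoder = tf.keras.Sequential([
        tf.keras.layers.Flatten(input_shape=(32, 32, 3)),
        tf.keras.layers.Dense(512, activation="relu"),
        tf.keras.layers.Dense(2 * LATENT),  # mean and log-variance, concatenated
    ])

    decoder = tf.keras.Sequential([
        tf.keras.layers.Dense(512, activation="relu", input_shape=(LATENT,)),
        tf.keras.layers.Dense(32 * 32 * 3, activation="sigmoid"),
        tf.keras.layers.Reshape((32, 32, 3)),
    ])

    def reconstruct(images):
        # Encode, sample z with the reparameterization trick, decode.
        mean, logvar = tf.split(encoder(images), 2, axis=-1)
        eps = tf.random.normal(tf.shape(mean))
        z = mean + tf.exp(0.5 * logvar) * eps
        # A phoneme embedding of the filename could be concatenated to z
        # here to condition generation on the sound of the word.
        return decoder(z), mean, logvar

Training would minimize a reconstruction loss plus the KL divergence between N(mean, exp(logvar)) and a standard normal; in the VAE-GAN formulation, a discriminator augments or replaces the pixel-wise reconstruction term.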
Because the rest of the KALDI code is not a dependency for this project, I've only included the relevant directories from tedlium, lang/ and local/, to reduce the size of the upload. The idea is that this can integrate with KALDI/TEDLIUM by putting this code into the tedlium directory and running from there.