Audio data requires different preprocessing; the course CS224S can be a good starting point.
- Fundamentals of Speech Recognition [$] by L. Rabiner and B-H Juang 1st Edition
- Statistical Methods for Speech Recognition - Language, Speech, and Communication [$] by F. Jelinek Fourth Printing
- Automatic Speech Recognition: A Deep Learning Approach - Signals and Communication Technology [$] by D. Yu and L. Deng 2015 Edition
- Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups, G. Hinton et al., 2012
- Context-dependent pre-trained deep neural networks for large-vocabulary speech recognition, G. Dahl et al., 2012
- Acoustic modeling using deep belief networks, A. Mohamed et al., 2012
- Deep Speech 2: End-to-end speech recognition in English and Mandarin, D. Amodei et al., 2015
- End-to-end attention-based large vocabulary speech recognition, D. Bahdanau et al., 2016
- Speech recognition with deep recurrent neural networks, A. Graves et al., 2013