Try out the model on the LangSonic Website.
LangSonic is a reliable Convolutional Neural Network (CNN) model designed for rapid spoken-language classification. It processes audio files into log-mel spectrograms to accurately identify the language within.
Read the paper!
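As a rough illustration of what "log-mel" means in the description above, the following sketch shows the two ingredients of a log-mel spectrogram in plain Python: band edges spaced evenly on the mel scale, and log (decibel) compression of power values. It is not LangSonic's actual preprocessing code. The HTK-style mel formula, the band count, and the frequency range are all assumptions chosen for illustration.

```python
import math


def hz_to_mel(freq_hz: float) -> float:
    """Convert a frequency in Hz to mels (HTK-style formula, assumed here)."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)


def mel_to_hz(mel: float) -> float:
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_band_edges(n_mels: int, f_min: float, f_max: float) -> list[float]:
    """Return n_mels + 2 band-edge frequencies (Hz), evenly spaced in mels."""
    m_min, m_max = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (m_max - m_min) / (n_mels + 1)
    return [mel_to_hz(m_min + i * step) for i in range(n_mels + 2)]


def power_to_db(power: float, floor: float = 1e-10) -> float:
    """Log-compress a power value to decibels, clamping to avoid log(0)."""
    return 10.0 * math.log10(max(power, floor))


if __name__ == "__main__":
    # Illustrative values only: 8 bands between 0 Hz and 8 kHz.
    edges = mel_band_edges(n_mels=8, f_min=0.0, f_max=8000.0)
    print([round(f) for f in edges])
    print(power_to_db(1.0), power_to_db(0.001))
```

The band edges are dense at low frequencies and sparse at high ones, which is why mel spectrograms give more resolution to the range where most speech energy lies.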
LangSonic supports the following languages:
- English 🇺🇸
- German 🇩🇪
- French 🇫🇷
- Spanish 🇪🇸
- Italian 🇮🇹
- Efficient Processing: LangSonic employs a high-speed CNN architecture for swift and efficient analysis of audio spectrograms.
- Precision: The model is finely tuned to ensure accurate language classification, prioritizing reliability in diverse applications.
- Training Time: The chosen CNN architecture is simple and quick to train. If you want to modify it or add a language, just add and process the data. On an M1 Pro chip, it takes approximately 20 minutes to train.
LangSonic achieves an accuracy of 76% when classifying between English, German, French, Spanish, and Italian—comparable to a similar CNN model evaluated by Sergey Vilov. The confusion matrix highlights common misclassifications, while accuracy metrics for each language provide insights into the model's performance.
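To make the relationship between a confusion matrix and the per-language figures concrete, here is a minimal sketch that derives overall accuracy, per-language accuracy (recall), and the most frequent misclassification from a matrix. The three-language matrix and its counts are made up for illustration and are not LangSonic's results.

```python
# Toy example: labels and counts are invented, NOT LangSonic's measured results.
LANGS = ["en", "de", "fr"]

# Rows are the true language, columns are the predicted language.
confusion = [
    [8, 1, 1],
    [2, 7, 1],
    [1, 3, 6],
]


def overall_accuracy(matrix):
    """Fraction of all clips that land on the diagonal (correct predictions)."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


def per_language_accuracy(matrix, labels):
    """For each true language, the fraction of its clips classified correctly."""
    return {label: matrix[i][i] / sum(matrix[i]) for i, label in enumerate(labels)}


def most_confused_pair(matrix, labels):
    """Return (true, predicted, count) for the largest off-diagonal cell."""
    count, true_label, pred_label = max(
        (matrix[i][j], labels[i], labels[j])
        for i in range(len(labels))
        for j in range(len(labels))
        if i != j
    )
    return true_label, pred_label, count


if __name__ == "__main__":
    print(overall_accuracy(confusion))
    print(per_language_accuracy(confusion, LANGS))
    print(most_confused_pair(confusion, LANGS))
```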
Training on a dataset of 450,000 spectrograms (5 languages at ~90,000 clips per language) for 10 epochs took approximately 20 minutes on an Apple M1 Pro chip.
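The figures above imply a rough training throughput, worked out below. Because both the clip count and the training time are approximate, the results are order-of-magnitude estimates rather than measurements.

```python
# All inputs are the approximate figures quoted in the text above.
n_languages = 5
clips_per_language = 90_000
epochs = 10
train_minutes = 20

total_spectrograms = n_languages * clips_per_language  # dataset size
samples_seen = total_spectrograms * epochs             # spectrograms processed overall
train_seconds = train_minutes * 60

seconds_per_epoch = train_seconds / epochs
spectrograms_per_second = samples_seen / train_seconds

if __name__ == "__main__":
    print(f"dataset size:     {total_spectrograms:,} spectrograms")
    print(f"time per epoch:   {seconds_per_epoch:.0f} s")
    print(f"throughput:       {spectrograms_per_second:,.0f} spectrograms/s")
```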
To run the model locally, follow these steps:
- Clone the repository:
  git clone https://github.com/thabnir/LangSonic
- Navigate to the flask/ directory inside the repo:
  cd LangSonic/flask
- Install the required packages:
  pip install -r requirements.txt
- Run the app:
  python app.py
  This will serve the site on localhost:5678/ for you to try out locally.
To train the model yourself, follow these steps:
- Download the mp3 data for each language from Hugging Face. Store it in the project in the format /data/<langname>/<filename>.mp3 (e.g., data/mp3/en/common_voice_en_73382.mp3).
- Run data_processing.ipynb to process the audio files into spectrograms, which will be cached in /data/spectrogram/<langname>_train_new/<filename>.png.
- Run training.ipynb to train the model.
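Before running the notebooks, it can help to confirm that the downloaded clips are where the data-processing step expects them. The sketch below counts .mp3 files per language folder. It assumes the layout from the example path (data/mp3/<langname>/<filename>.mp3). The function name count_clips is hypothetical and not part of the repository.

```python
from collections import Counter
from pathlib import Path


def count_clips(mp3_root: Path) -> Counter:
    """Count .mp3 files in each immediate subfolder (one folder per language)."""
    counts = Counter()
    for mp3_path in mp3_root.glob("*/*.mp3"):
        counts[mp3_path.parent.name] += 1
    return counts


if __name__ == "__main__":
    # Assumed location, taken from the example path in the steps above.
    root = Path("data/mp3")
    for lang, n_clips in sorted(count_clips(root).items()):
        print(f"{lang}: {n_clips} clips")
```

A language folder with far fewer clips than the others is an early sign of an incomplete download, which would otherwise only show up later as a class imbalance during training.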
This repository contains both the scripts used to train the model and the code required to build the web app. The dataset used to train the model is a subset of Mozilla Common Voice, comprising the first three .zip files in the dataset for each language, each containing 40,000 audio samples. It can be obtained from Mozilla Common Voice Dataset.
Reports for the associated MAIS 202 project can be found in reports/.
Code for the website is located in flask/.
Data processing code for the model's training is in data_processing.ipynb.
Model training code is in training.ipynb.
Our technical report on the model can be found at paper.pdf.
This project is licensed under the MIT License - see LICENSE.txt for details.