This project classifies IMDB movie reviews as positive or negative using several sentiment analysis techniques. Working from a dataset of 50,000 reviews, it covers data preprocessing, feature extraction with TF-IDF vectorization, and experiments with multiple machine learning models. The best result came from a CNN-LSTM hybrid model, which reached an accuracy of 90%.
- Data preprocessing including cleaning, tokenization, and vectorization.
- Exploration of 11 different machine learning models.
- Detailed analysis and comparison of model performances.
- Implementation of a CNN-LSTM hybrid model showcasing the effectiveness of deep learning in NLP.
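The preprocessing and TF-IDF steps listed above can be sketched roughly as follows. This is a minimal illustration using scikit-learn, not the code from the notebooks: the `clean_review` helper, the four toy reviews, and the choice of logistic regression as the classifier are all assumptions made for the example.

```python
import re

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline


def clean_review(text: str) -> str:
    """Hypothetical cleaning step: strip HTML tags, lowercase, keep letters only."""
    text = re.sub(r"<[^>]+>", " ", text)            # raw IMDB reviews contain <br /> tags
    text = re.sub(r"[^a-z\s]", " ", text.lower())   # drop digits and punctuation
    return re.sub(r"\s+", " ", text).strip()        # collapse repeated whitespace


# Toy stand-in data (1 = positive, 0 = negative); the real dataset has 50,000 reviews.
reviews = [
    "A wonderful, moving film.<br />Loved every minute!",
    "Terrible plot and awful acting.",
    "Great cast and a wonderful story.",
    "Boring, awful, and far too long.",
]
labels = [1, 0, 1, 0]

# TF-IDF features feeding a simple linear classifier (an assumed baseline model).
model = make_pipeline(
    TfidfVectorizer(preprocessor=clean_review, ngram_range=(1, 2)),
    LogisticRegression(max_iter=1000),
)
model.fit(reviews, labels)

predictions = model.predict(["What a wonderful movie"])
```

Wrapping the vectorizer and classifier in a single pipeline keeps the TF-IDF vocabulary fitted on training data only, which matters once a train/test split is introduced.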
To replicate the analysis or apply the models to new data, follow the notebooks provided in the repository.
The dataset comprises 50,000 IMDB movie reviews, evenly split between positive and negative sentiments. It's publicly available and was prepared by Stanford University's AI Lab.
The CNN-LSTM hybrid model achieved the highest accuracy of all models tested, at 90%. Detailed performance metrics and analysis are provided for each model.
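For readers unfamiliar with the architecture, a CNN-LSTM hybrid for binary sentiment classification might look like the sketch below, written with Keras. The layer sizes, vocabulary size, sequence length, and the `build_cnn_lstm` function name are illustrative assumptions; the configuration actually used in this project is in the notebooks.

```python
from tensorflow import keras
from tensorflow.keras import layers


def build_cnn_lstm(vocab_size: int = 20000, seq_len: int = 200, embed_dim: int = 128):
    """Hypothetical CNN-LSTM: convolution extracts local n-gram features,
    the LSTM then models their order across the review."""
    model = keras.Sequential([
        keras.Input(shape=(seq_len,), dtype="int32"),   # padded sequences of token ids
        layers.Embedding(vocab_size, embed_dim),        # token ids -> dense vectors
        layers.Conv1D(64, 5, activation="relu"),        # local patterns over 5-token windows
        layers.MaxPooling1D(pool_size=4),               # shorten the sequence for the LSTM
        layers.LSTM(64),                                # sequence-level summary
        layers.Dense(1, activation="sigmoid"),          # probability of a positive review
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    return model


model = build_cnn_lstm()
```

The model expects integer-encoded reviews padded or truncated to `seq_len` tokens, and training would then be a call to `model.fit` on those sequences with 0/1 labels.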
Further improvements could explore transformer-based models like BERT and GPT, ensemble methods, and application to other domains or languages.
If you find this project useful, please consider citing:
- The original dataset from Stanford University's AI Lab.
- Relevant publications and resources listed in the References section of the project report.
For any inquiries or contributions, please contact [[email protected]].