Amazon-review-sentiment-analysis

Data: https://www.kaggle.com/bittlingmayer/amazonreviews (400K text reviews)

Preprocessing: TF-IDF vectorizer
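
A minimal preprocessing sketch with scikit-learn's TfidfVectorizer; the file name, label parsing, and vectorizer parameters here are assumptions, not necessarily the project's exact settings:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

def load_reviews(path):
    """Parse fastText-style lines such as '__label__2 great product ...'."""
    texts, labels = [], []
    with open(path, encoding="utf-8") as f:
        for line in f:
            label, _, text = line.partition(" ")
            labels.append(1 if label == "__label__2" else 0)  # __label__2 = positive
            texts.append(text.strip())
    return texts, labels

# Hypothetical path: the Kaggle files ship as train.ft.txt.bz2 / test.ft.txt.bz2.
texts, labels = load_reviews("train.ft.txt")
X_train_txt, X_test_txt, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, random_state=42)

vectorizer = TfidfVectorizer(max_features=50_000, stop_words="english")
X_train = vectorizer.fit_transform(X_train_txt)  # fit the vocabulary on training text only
X_test = vectorizer.transform(X_test_txt)        # reuse the same vocabulary for the test split
```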

Selected models: Decision Tree, Logistic Regression, Neural Network (10 × 10 hidden layers), Random Forest, Ensemble (hard voting: LR + NN + RF)
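
The models could be set up along these lines; apart from the 10 × 10 hidden layers and the hard-voting combination named above, the hyperparameters are assumptions:

```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from sklearn.ensemble import RandomForestClassifier, VotingClassifier

dt = DecisionTreeClassifier()
lr = LogisticRegression(max_iter=1000)
nn = MLPClassifier(hidden_layer_sizes=(10, 10))
rf = RandomForestClassifier(n_estimators=100)

# Hard voting: LR, NN, and RF each cast one vote; the majority class wins.
ensemble = VotingClassifier(
    estimators=[("lr", lr), ("nn", nn), ("rf", rf)], voting="hard")

models = [("Decision Tree", dt), ("Logistic Regression", lr),
          ("Neural Network", nn), ("Random Forest", rf),
          ("Ensemble", ensemble)]
for name, model in models:
    model.fit(X_train, y_train)
```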

Evaluation: training/testing accuracy, precision, recall, F1 score, plus training time and ROC curves
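
The metrics reported below might be computed roughly like this:

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

def evaluate(model, X_train, y_train, X_test, y_test):
    """Return the five metrics reported in the Performance section."""
    train_pred = model.predict(X_train)
    test_pred = model.predict(X_test)
    return {
        "training accuracy": accuracy_score(y_train, train_pred),
        "testing accuracy": accuracy_score(y_test, test_pred),
        "precision": precision_score(y_test, test_pred),
        "recall": recall_score(y_test, test_pred),
        "f1": f1_score(y_test, test_pred),
    }

print(evaluate(lr, X_train, y_train, X_test, y_test))
```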

Performance:

| Model               | Training accuracy | Testing accuracy | Precision | Recall | F1    |
|---------------------|-------------------|------------------|-----------|--------|-------|
| Decision Tree       | 93.4%             | 76.2%            | 75.6%     | 77.0%  | 76.3% |
| Logistic Regression | 93.2%             | 90.1%            | 91.8%     | 88.1%  | 89.9% |
| Neural Network      | 92.0%             | 90.4%            | 90.3%     | 90.5%  | 90.4% |
| Random Forest       | 99.4%             | 78.5%            | 83.1%     | 71.4%  | 76.8% |
| Ensemble            | 95.8%             | 90.4%            | 91.3%     | 89.3%  | 90.3% |

Training time:
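
A sketch of how per-model training time could be measured (reusing the model objects from the sketches above; not necessarily the original measurement code):

```python
import time

for name, model in models:
    start = time.perf_counter()
    model.fit(X_train, y_train)
    print(f"{name}: {time.perf_counter() - start:.1f} s")
```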

ROC:
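
ROC curves could be drawn from each model's predicted probabilities; a sketch (the hard-voting ensemble is omitted because it does not expose probabilities):

```python
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, auc

for name, model in [("Logistic Regression", lr), ("Neural Network", nn),
                    ("Random Forest", rf)]:
    scores = model.predict_proba(X_test)[:, 1]   # probability of the positive class
    fpr, tpr, _ = roc_curve(y_test, scores)
    plt.plot(fpr, tpr, label=f"{name} (AUC = {auc(fpr, tpr):.3f})")

plt.plot([0, 1], [0, 1], linestyle="--", color="gray")  # chance line
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.legend()
plt.show()
```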

Word counting vs. TF-IDF
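
A sketch of this comparison: the same classifier trained on raw word counts (CountVectorizer) versus TF-IDF features; the vocabulary size is an assumption:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

for name, vec in [("word count", CountVectorizer(max_features=50_000)),
                  ("TF-IDF", TfidfVectorizer(max_features=50_000))]:
    Xtr = vec.fit_transform(X_train_txt)
    Xte = vec.transform(X_test_txt)
    clf = LogisticRegression(max_iter=1000).fit(Xtr, y_train)
    print(f"{name}: test accuracy = {accuracy_score(y_test, clf.predict(Xte)):.3f}")
```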

Summary

  1. High-frequency words are not necessarily important.

  2. Tree-based models (Decision Tree, Random Forest) tend to overfit: near-perfect training accuracy but much lower testing accuracy.

  3. For sentiment analysis, logistic regression is a good model to try first: it is fast and achieves high accuracy.

  4. A good feature transformation gives better predictions (TF-IDF > word count).

  5. Further improvement: try a sequence model (a rough sketch follows below).
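
A rough sketch of the sequence-model idea from point 5, using a small Keras LSTM over the raw review text; the architecture, vocabulary size, and training settings are all assumptions:

```python
import numpy as np
import tensorflow as tf

vocab_size, max_len = 20_000, 200
vectorize = tf.keras.layers.TextVectorization(
    max_tokens=vocab_size, output_sequence_length=max_len)
vectorize.adapt(np.array(X_train_txt))  # raw review strings from the preprocessing sketch

model = tf.keras.Sequential([
    vectorize,                                    # string -> integer token sequence
    tf.keras.layers.Embedding(vocab_size, 64),    # token ids -> dense vectors
    tf.keras.layers.LSTM(64),                     # read the review as a sequence
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(np.array(X_train_txt), np.array(y_train), validation_split=0.1, epochs=2)
```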