Background of this project: Apply different text transformation methods (BOW, TF-IDF, Doc2Vec) and several algorithms to classify and cluster five books: chesterton-brown, austen-emma, edgeworth-parents, milton-paradise, bible-kjv.
Data preprocessing: convert all letters to lowercase; remove punctuation; tokenize the documents and remove stopwords (nltk library); lemmatize the tokens; transform the text into vectors.
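The preprocessing steps above can be sketched roughly as follows. This is a minimal, standard-library-only illustration and not the project's actual code: the tiny STOPWORDS set is a stand-in for the nltk stopword list, lemmatization is left out because it needs nltk's WordNet data, the function names preprocess and bag_of_words are hypothetical, and the two sample sentences are invented for the example.

```python
import string
from collections import Counter

# Assumption: a tiny stand-in for nltk's English stopword list.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "was", "it"}


def preprocess(text):
    """Lowercase, strip punctuation, tokenize on whitespace, drop stopwords."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    tokens = text.split()
    # Lemmatization would go here in the real pipeline (omitted in this sketch).
    return [t for t in tokens if t not in STOPWORDS]


def bag_of_words(docs):
    """Return (vocabulary, count vectors) for a list of token lists."""
    vocab = sorted({t for doc in docs for t in doc})
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append([counts[t] for t in vocab])
    return vocab, vectors


# Invented sample sentences, not taken from the five books.
docs = [
    preprocess("The serpent was in the garden."),
    preprocess("Emma walked to the garden, and the garden was quiet."),
]
vocab, vectors = bag_of_words(docs)
print(vocab)
print(vectors)
```

Each row of vectors is the BOW count vector of one document over the shared vocabulary; TF-IDF and Doc2Vec would replace only this last step.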
Classification:
Support Vector Machines (SVM), K-Nearest Neighbors (KNN), Decision Tree, Random Forest, Logistic Regression
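As a rough illustration of how the five classifiers listed above could be compared on the same vectors, here is a small sketch that assumes scikit-learn is used. The sentences and the labels book_a / book_b are invented placeholders rather than passages from the five books, and the hyperparameters (linear kernel, n_neighbors=3, 50 trees) are arbitrary choices for the example, not the project's settings.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Invented toy corpus; labels are placeholders for book names.
train_texts = [
    "the garden was quiet and the house was calm",
    "she walked through the garden to the house",
    "the house stood near the quiet garden",
    "the angels fell from heaven into darkness",
    "heaven and darkness were divided by light",
    "light came down from heaven upon the angels",
]
train_labels = ["book_a", "book_a", "book_a", "book_b", "book_b", "book_b"]
test_texts = ["a calm walk in the garden", "darkness covered heaven"]
test_labels = ["book_a", "book_b"]

# Fit TF-IDF on the training texts only, then reuse it for the test texts.
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_texts)
X_test = vectorizer.transform(test_texts)

models = {
    "SVM": SVC(kernel="linear"),
    "KNN": KNeighborsClassifier(n_neighbors=3),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(n_estimators=50, random_state=0),
    "Logistic Regression": LogisticRegression(max_iter=1000),
}

# Train each model and record its accuracy on the held-out texts.
scores = {}
for name, model in models.items():
    model.fit(X_train, train_labels)
    scores[name] = model.score(X_test, test_labels)
    print(f"{name}: accuracy = {scores[name]:.2f}")
```

On real data the held-out accuracy would normally be estimated with a proper train/test split or cross-validation over many passages per book; the two test sentences here only show the mechanics.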
Clustering: K-means, Hierarchical, Expectation-Maximization (EM)
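The three clustering methods could be tried on the same vectors along these lines. This sketch again assumes scikit-learn: hierarchical clustering is represented by AgglomerativeClustering with its default settings, and EM by GaussianMixture (a Gaussian mixture model fitted with the EM algorithm), which may differ from what the project actually used. The sentences are invented, and n_clusters=2 matches the toy data rather than the five books.

```python
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.mixture import GaussianMixture

# Invented toy corpus with two rough topics.
texts = [
    "the garden was quiet and the house was calm",
    "she walked through the garden to the house",
    "the house stood near the quiet garden",
    "the angels fell from heaven into darkness",
    "heaven and darkness were divided by light",
    "light came down from heaven upon the angels",
]

# AgglomerativeClustering and GaussianMixture need a dense array.
X = TfidfVectorizer().fit_transform(texts).toarray()

n_clusters = 2  # would be 5 for the five books

kmeans_labels = KMeans(
    n_clusters=n_clusters, n_init=10, random_state=0
).fit_predict(X)

hier_labels = AgglomerativeClustering(n_clusters=n_clusters).fit_predict(X)

# Diagonal covariances keep the mixture tractable with few samples.
em_labels = GaussianMixture(
    n_components=n_clusters,
    covariance_type="diag",
    reg_covar=1e-3,
    random_state=0,
).fit_predict(X)

print("K-means:     ", kmeans_labels)
print("Hierarchical:", hier_labels)
print("EM (GMM):    ", em_labels)
```

Cluster ids are arbitrary, so comparing the methods with each other or with the true book labels calls for a permutation-invariant measure such as the adjusted Rand index.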